Zhao et al. (in preparation) references the following material:
GSearch is an ultra-fast microbial genomic search program that scales to millions or even billions of genomes. It combines probabilistic data structures with graph-based nearest neighbor search (e.g., Hierarchical Navigable Small World graphs, HNSW). MinHash-like sketches such as SuperMinHash and ProbMinHash are implemented for the speed/accuracy trade-off, and HyperLogLog-like sketches such as HyperLogLog and SetSketch for the space/accuracy trade-off. Because search time grows as O(log(N)) with database size N, GSearch's speed advantage over other tools increases as the database grows. Details on how to install and use GSearch can be found here: GSearch.
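To illustrate the idea behind MinHash-like sketches, here is a minimal Python sketch of the classic k-hash MinHash scheme, which estimates the Jaccard similarity between the k-mer sets of two sequences. It is not GSearch's implementation: GSearch uses SuperMinHash, ProbMinHash, HyperLogLog, or SetSketch, and the function names, the k-mer size, and the number of hash functions below are assumptions chosen only for illustration.

```python
import hashlib


def kmers(seq, k=16):
    """Return the set of k-mers in a sequence (k=16 is an illustrative choice)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer, salted with a seed."""
    digest = hashlib.blake2b(
        kmer.encode("ascii"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(seq, k=16, num_hashes=128):
    """Classic MinHash: keep the minimum hash value for each seeded hash function."""
    kmer_set = kmers(seq, k)
    if not kmer_set:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in kmer_set) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of matching sketch positions, an estimate of Jaccard similarity."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


# Toy sequences (hypothetical data, not real genomes).
genome_a = "ACGTTGCAAGCTTAGGCTAACGTTAGCCGATAGCTAGGATCCGATTAGC"
genome_b = "ACGTTGCAAGCTTAGGCTAACGTTAGCCGATAGCTAGGATCCGATTAGG"
similarity = estimate_jaccard(minhash_sketch(genome_a), minhash_sketch(genome_b))
print(f"Estimated Jaccard similarity: {similarity:.3f}")
```

Each sketch has a fixed size regardless of genome length, which is why sketch-based distances are cheap enough to drive a nearest neighbor graph search over a very large database.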
The download links below point to pre-built databases for different collections of microbial genomes. You can skip the database-building steps and use these pre-built databases directly.
Name | Description | Files |
---|---|---|
GTDBv207_gsearch | Graph database for GTDBv207, v2023 | GTDBv207_v2023.tar.gz (7.9Gb) |
IMG_VR4_gsearch | Graph database for IMG VR4, v2023 | IMG_VR4_v2023.tar.gz (27.5Gb) |
Mycocosm_gsearch | Graph database for Mycocosm, v2023 | Mycocosm_v2023.tar.gz (4.0Gb) |
NCBI_RefSeq_gsearch | Graph database for NCBI_RefSeq, v2023 | NCBI_RefSeq_v2023.tar.gz (35.5Gb) |
GSearch_optdens.tar.gz | Graph database for RefSeq_gsearch_optdens | GSearch_optdens.tar.gz (34Gb) |
Test_genome_Tara8466 | Test genome for Tara Ocean Project | Test_genome_Tara8466.tar.gz (5.7Gb) |
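After downloading one of the archives above, it has to be unpacked before use. The following is a minimal sketch using only the Python standard library. The function name `extract_database` and the destination directory are hypothetical, and the layout of the files inside each archive is not described here, so the function simply returns whatever member names the archive contains.

```python
import tarfile
from pathlib import Path


def extract_database(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return its member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        # Use the safer "data" extraction filter when this Python version has it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest, **kwargs)
    return sorted(names)


# Example call (assumes the archive was already downloaded to the working directory):
# members = extract_database("GTDBv207_v2023.tar.gz", "gsearch_db")
# print(members[:10])
```

The archives are large, so make sure the destination has enough free disk space before extracting.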