Public data

from Kostas lab

Zhao et al. (2024) includes references to the following material:

GSearch is an ultra-fast, scalable microbial genome search program designed for databases of millions to billions of genomes. It combines probabilistic data structures with graph-based nearest neighbor search, such as Hierarchical Navigable Small World (HNSW) graphs. Two families of sketches are implemented: MinHash-like sketches (SuperMinHash, ProbMinHash), which trade speed against accuracy, and HyperLogLog-like sketches (HyperLogLog, SetSketch), which trade space against accuracy. Because search time grows as O(log N) in the number of database genomes N, GSearch's speed advantage over other tools increases as the database gets larger. Details on how to install and use GSearch can be found here: GSearch.
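To make the sketching idea concrete, here is a minimal Python illustration of classic MinHash over k-mers. It is not GSearch's code: GSearch implements the more refined SuperMinHash, ProbMinHash, HyperLogLog and SetSketch variants. The function names, the k-mer length and the number of hash functions below are all assumptions chosen for illustration.

```python
import hashlib


def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all overlapping k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer: str, seed: int) -> int:
    """64-bit hash of a k-mer; the seed selects one of several hash functions."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(seq: str, k: int = 16, num_hashes: int = 128) -> list[int]:
    """Classic MinHash: keep the minimum hash value for each hash function.

    k and num_hashes are illustrative defaults, not GSearch parameters.
    """
    items = kmers(seq, k)
    if not items:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in items) for seed in range(num_hashes)]


def estimate_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of matching sketch positions estimates the Jaccard similarity."""
    if len(a) != len(b):
        raise ValueError("sketches must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Each genome is reduced to a fixed-size sketch, so comparing two genomes costs the same regardless of genome length. A graph index such as HNSW can then be built over these sketches so that a query does not have to be compared against every genome in the database.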

The download links below point to pre-built databases for different collections of microbial genomes. You can skip the database-building steps and use these pre-built databases directly.


Name                     Description                                  Files
GTDBv207_gsearch         Graph database for GTDBv207, v2023           GTDBv207_v2023.tar.gz (7.9Gb)
IMG_VR4_gsearch          Graph database for IMG VR4, v2023            IMG_VR4_v2023.tar.gz (27.5Gb)
Mycocosm_gsearch         Graph database for Mycocosm, v2023           Mycocosm_v2023.tar.gz (4.0Gb)
NCBI_RefSeq_gsearch      Graph database for NCBI_RefSeq, v2023        NCBI_RefSeq_v2023.tar.gz (35.5Gb)
GSearch_optdens.tar.gz   Graph database for RefSeq_gsearch_optdens    GSearch_optdens.tar.gz (34Gb)
Test_genome_Tara8466     Test genome for Tara Ocean Project           Test_genome_Tara8466.tar.gz (5.7Gb)
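
Once one of these archives has been downloaded, it has to be unpacked before use. The following is a minimal Python sketch of a cautious extraction, using only the standard library. The helper name `unpack_database` and the local paths in the usage comment are hypothetical, and the internal layout of the archives is not documented here, so inspect the returned member list rather than assuming particular file names.

```python
import tarfile
from pathlib import Path


def unpack_database(archive: Path, dest: Path) -> list[str]:
    """Extract a .tar.gz archive into dest and return the member names.

    Members whose paths would land outside dest are rejected.
    """
    dest.mkdir(parents=True, exist_ok=True)
    root = dest.resolve()
    # Use the stricter "data" extraction filter when this Python provides it.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    names = []
    with tarfile.open(archive, mode="r:gz") as tar:
        for member in tar:
            target = (root / member.name).resolve()
            if target != root and root not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
            tar.extract(member, path=root, **extra)
            names.append(member.name)
    return sorted(names)


# Hypothetical usage; both paths are assumptions about the local setup:
# files = unpack_database(Path("GTDBv207_v2023.tar.gz"), Path("gsearch_db"))
# print(files[:10])
```

The archives are large (several Gb to tens of Gb), so make sure the destination has enough free disk space for both the archive and its unpacked contents.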