Kostas lab | Data

Zhao et al. (2024) includes references to the following material:

GSearch is an ultra-fast and scalable microbial genomic search program, designed for databases of millions to billions of genomes. It is based on probabilistic data structures and graph-based nearest neighbor search (e.g., Hierarchical Navigable Small World graphs, HNSW). It implements MinHash-like sketches such as SuperMinHash and ProbMinHash, which balance speed against accuracy, and HyperLogLog-like sketches such as HyperLogLog and SetSketch, which balance space against accuracy. Because search time scales as O(log(N)) with the number of database genomes N, GSearch's speed advantage over other tools grows as the database gets larger. Details on how to install and use GSearch can be found here: GSearch.
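To make the idea of a MinHash-like sketch concrete, here is a minimal Python sketch of classic MinHash over k-mers. It is an illustration only: it is not GSearch's implementation, it does not use SuperMinHash, ProbMinHash, or HNSW, and the function names, the toy sequences, and the parameters (k=4, 128 hash functions) are all assumptions chosen for the example.

```python
import hashlib


def kmers(seq, k=4):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(items, num_hashes=128):
    """Classic MinHash: for each seeded hash function, keep the minimum hash value."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Toy sequences (hypothetical, not real genomes): genome_b shares a prefix with genome_a.
genome_a = "ACGTTGCAAGCTTAGGCTAACGTTAGCCGATAGCTAGGATCCA"
genome_b = genome_a[:30] + "TTTTGGGGCCCCA"

set_a, set_b = kmers(genome_a), kmers(genome_b)
exact = len(set_a & set_b) / len(set_a | set_b)
estimated = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))

print(f"exact Jaccard:     {exact:.3f}")
print(f"estimated Jaccard: {estimated:.3f}")
```

The signature has a fixed length regardless of genome size, which is what makes sketches cheap to store and compare. A tool like GSearch then indexes such sketches in a graph so that a query does not have to be compared against every database entry.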

The download links below point to pre-built databases for different collections of microbial genomes. You can skip the database-building steps and use these pre-built databases directly.

Name	Description	Files
GTDBv207_gsearch	Graph database for GTDBv207, v2023	GTDBv207_v2023.tar.gz (7.9Gb)
IMG_VR4_gsearch	Graph database for IMG VR4, v2023	IMG_VR4_v2023.tar.gz (27.5Gb)
Mycocosm_gsearch	Graph database for Mycocosm, v2023	Mycocosm_v2023.tar.gz (4.0Gb)
NCBI_RefSeq_gsearch	Graph database for NCBI_RefSeq, v2023	NCBI_RefSeq_v2023.tar.gz (35.5Gb)
GSearch_optdens.tar.gz	Graph database for RefSeq_gsearch_optdens	GSearch_optdens.tar.gz (34Gb)
Test_genome_Tara8466	Test genome for Tara Ocean Project	Test_genome_Tara8466.tar.gz (5.7Gb)
