Public data

from Kostas lab

Chirag et al. (2018) includes references to the following material:

Name	Description	Files
D1	Dataset 1 (NCBI RefSeq)	Sequences (1.6 Gb) IDs* (36 Kb)
D2	Dataset 2 (Bacillus cereus)	Sequences (911 Mb) IDs* (113 Kb)
D3	Dataset 3 (Escherichia coli)	Sequences (6.2 Gb) IDs* (2.2 Mb)
D4	Dataset 4 (Bacillus anthracis)	Sequences (670 Mb) IDs* (3.1 Kb)
D5	Dataset 5 (Parks et al MAGs)	Sequences (5.8 Gb) IDs* (2.4 Mb)
NCBI_Prok	NCBI Genome - Prokaryotic section	Sequences (95 Gb) FastANI matrix (6.2 Gb) IDs* (30 Mb)

* The ID files are gzipped, tab-delimited plain-text files with the following columns:

1. Name of the dataset as used in the manuscript.
2. IDs in the NCBI nuccore database, separated by commas. The exception is D2, in which some datasets contain identifiers from the Centers for Disease Control and Prevention, Division of High-Consequence Pathogens and Pathology (prefixed with CDC:DHCPP:).
3. When available, links to the publicly available dataset in MiGA.
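
As an illustration of the ID file layout described above, the following is a minimal Python sketch for reading one of these files. It assumes the files have no header row, that the columns appear in the order listed, and that the MiGA link column may be missing or empty. The names `DatasetRecord`, `read_id_file`, and `is_cdc_id` are hypothetical helpers introduced here; they are not part of any distributed tool.

```python
import csv
import gzip
from typing import List, NamedTuple, Optional


class DatasetRecord(NamedTuple):
    name: str                 # dataset name as used in the manuscript
    ids: List[str]            # comma-separated IDs from the second column
    miga_link: Optional[str]  # link to the dataset in MiGA, if present


def read_id_file(path: str) -> List[DatasetRecord]:
    """Parse a gzipped, tab-delimited ID file.

    Assumption: no header row, columns in the order name, IDs, MiGA link.
    """
    records = []
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if not row or not row[0].strip():
                continue  # skip blank lines
            raw_ids = row[1] if len(row) > 1 else ""
            ids = [i.strip() for i in raw_ids.split(",") if i.strip()]
            link = row[2].strip() if len(row) > 2 and row[2].strip() else None
            records.append(DatasetRecord(row[0].strip(), ids, link))
    return records


def is_cdc_id(identifier: str) -> bool:
    """True for identifiers carrying the CDC:DHCPP: prefix (seen in D2)."""
    return identifier.startswith("CDC:DHCPP:")
```

For example, `read_id_file("D2_ids.tsv.gz")` (a placeholder file name, not the actual download name) would return one record per dataset, and `is_cdc_id` could then be used to separate the CDC identifiers from the NCBI nuccore ones.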