All available genomes from the Bacillus cereus sensu lato clade, including the species:
This project combines metagenome-assembled genomes (MAGs) from different collections:
This project does not currently include Parks8k.
Exploration of all available genomes belonging to genus Marinobacter.
All available genomes from the Mycobacterium tuberculosis complex, including the species:
This database contains non-redundant (99.5% ANI) prokaryotic genomes with complete and chromosome status in the NCBI Genome database.
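As an illustration of what a 99.5% ANI redundancy filter can look like, here is a minimal greedy sketch. It is not MiGA's actual implementation: the function name `dereplicate`, the input layout, and the rule of always retaining type material are assumptions made for this example only.

```python
def dereplicate(genomes, ani, threshold=99.5):
    """Greedy redundancy filter (hypothetical sketch, not MiGA's code).

    genomes: list of (name, is_type) tuples.
    ani: dict mapping frozenset({name_a, name_b}) -> ANI in percent.
    A non-type genome is dropped when its ANI to an already kept genome
    is at or above `threshold`. Type material is always kept (assumption).
    """
    # Visit type material first so it is retained over non-type relatives.
    ordered = sorted(genomes, key=lambda g: not g[1])
    kept = []
    for name, is_type in ordered:
        redundant = any(
            ani.get(frozenset((name, other)), 0.0) >= threshold
            for other in kept
        )
        if is_type or not redundant:
            kept.append(name)
    return kept


# Toy data: A is a close relative of the type genome B; C is distinct.
genomes = [("A", False), ("B", True), ("C", False)]
ani = {
    frozenset(("A", "B")): 99.8,
    frozenset(("A", "C")): 95.0,
    frozenset(("B", "C")): 94.0,
}
print(dereplicate(genomes, ani))  # ['B', 'C']
```

Pairs missing from `ani` are treated as unrelated (ANI 0), which keeps the sketch usable when only close pairs have been computed.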
Apr-25-2019: Removed legacy redundancy in the database. Non-type datasets with close relatives in the database (ANI ≥ 99.5%) removed. Removed a total of 5,317 datasets.
Apr-20-2019: Database update
Mar-15-2019: Indexing the last update. 148 inactivated datasets will be included in the next update.
Feb-12-2019: Database update
Ralstonia_mannitolilytica_NZ_CM009147, which contains
only chromosome II.
Jan-14-2019: 610 datasets were unlinked to rebuild the index. These datasets are mainly composed of Escherichia coli (n=185), Klebsiella pneumoniae (n=119), and Salmonella enterica (n=203). These and all new datasets will be subjected to a redundancy pre-filter starting in the next update to avoid including new datasets with high identity to existing entries.
Dec-10-2018: Database update
Dec-07-2018: 59 datasets with estimated completeness below 40% removed.
Nov-27-2018: Blacklisted dataset
Escherichia_coli_CP028576 with only two
Nov-23-2018: Blacklisted dataset
"chromosome: I" but clearly a plasmid.
293 datasets temporarily unlinked to rebuild index.
Oct-05-2018: Database update.
Oct-03-2018: 305 datasets soft-unlinked to update index.
Jul-17-2018: Database update.
Silvanigrella_aquatica_CP017838taxid:1912593 to taxid:1915309.
Staphylococcus_epidermidis_CP018841taxid:1929941 to taxid:1282.
Staphylococcus_aureus_subsp__aureus_CP025490, single plasmids, as well as
Cupriavidus_sp__NH9_NZ_CP017758, missing chromosome I.
Mar-06-2018: Blacklisted all Ca Tremblaya princeps and Ca Tremblaya phenacola datasets, with 50-100 predicted proteins and 100-200 kbp, which "cannot be considered a living organism".
Feb-14-2018: Initiated indexing by groups (for new genomes) to reduce impact on query datasets.
Feb-12-2018: Database update. Total reference genomes: 11,232. This update
is the first using the
miga ncbi_get method.
Jan-19-2018: Soft-unlinked the following datasets for indexing (will be included in the next update):
Jan-15-2018: Blacklisted 1 dataset composed exclusively of plasmid sequences
Halobacterium_salinarum_NC_002121. Total reference genomes: 10,637.
Dec-11-2017: Database update. 381 datasets eliminated and 884 datasets added. Total reference genomes: 10,638.
Nov-07-2017: The following datasets were soft-unlinked for indexing, but will be included in the next update:
Oct-12-2017: Database update. 129 datasets eliminated and 549 added. Total reference genomes: 10,224 post-update.
Aug-30-2017: Manually modified taxonomic rank
dataset in dataset
which was confusing MiGA into thinking it had a registered kingdom.
Aug-30-2017: A filesystem error caused an interruption of the following datasets, which will be unlinked for this update and re-downloaded in the next update:
Aug-15-2017: Database update. 75 datasets eliminated and 573 added. Total reference genomes: 9,559 (post-update).
Jul-07-2017: The following datasets were temporarily unlinked to complete
Natrialbaceae_archaeon_JW_NM_HA_15_NZ_CP019893. These datasets will be
included in the next update.
Jul-02-2017: The dataset
Stenotrophomonas_maltophilia_NC_001383 is only
composed of plasmid sequences and was manually removed.
Jun-23-2017: Database update. 91 datasets eliminated and 430 added. Total reference genomes: 8,724 (pre-update) - 9,063 (post-update).
May-12-2017: Database update. 222 datasets eliminated and 324 added. Total reference genomes: 8,622 (pre-update) - 8,724 (post-update).
Apr-25-2017: Database update. 7 datasets eliminated and 73 added. Total reference genomes: 8,557 (pre-update) - 8,622 (post-update). Database not indexed for this update.
Apr-21-2017: The following datasets were composed only of plasmids and were eliminated:
Candidatus_Tremblaya_princeps_LN998829: This dataset has a sequence named chromosome I, but it only contains 51 genes (140 kbp), so it's likely a plasmid.
Burkholderia_pseudomallei_NZ_CM007659 dataset only contains
the second chromosome of B. pseudomallei, resulting in a completeness of
2.7% (3 essential genes); it was therefore removed. The current database has
8,562 reference datasets.
Apr-17-2017: Database update. 87 datasets were eliminated and 218 datasets added. The following datasets were eliminated based on the previous update or completeness report (<1% and no 16S):
Total reference genomes: 8,437 (pre-update) - 8,563 (post-update).
Apr-16-2017: Manually modified domain in the taxonomy of
Apr-15-2017: Note for next update: Check out
seems to be composed only of plasmids. Evaluate completeness to clean the
Mar-06-2017: Database update. 239 datasets were eliminated and 663
datasets added. The dataset
Mycobacterium_tuberculosis_NC_025025 is only a
plasmid with 6,898 bp and no chromosome sequence, and was manually removed.
Total reference genomes: 8,015 (pre-update) -> 8,438 (post-update). The
Legionella_fallonii_LLAP_10_NZ_LN614827 was manually removed because
of a corrupt database file (it'll be incorporated in the next update),
resulting in 8,437 datasets.
complete project so it can be used in the website, but I'll keep running the
distances of this dataset in the meantime.
This project hosts a set of 7,901 metagenome-assembled genomes (MAGs) from the diverse metagenomes in NCBI SRA, as described by Parks et al, 2017, Nat Microb.
8 datasets were excluded from the index due to unreliable AAI values using Diamond:
The taxonomy of the datasets was obtained from NCBI.
Exploration of genomes classified as "Candidatus Pelagibacter ubique".
This database contains all reference prokaryotic genomes in the NCBI RefSeq database.
Acetobacter_aceti, which only had from family-down (perhaps a network issue while downloading?).
The RefSoil collection is a manually-curated set of genomes derived from NCBI's RefSeq database containing only organisms previously shown to be associated with soils, as described in Choi et al, 2017, ISME J.
This project hosts a set of 957 metagenome-assembled genomes (MAGs) from the TARA Oceans metagenomes, as described by Delmont et al, 2018, Nat Microb. All public data in the study can be found at Recovering HBDs from TARA Oceans Metagenomes.
The taxonomy of the datasets was inferred by MiGA using NCBI Prok as a reference with p-value < 0.05.
Exploration of available genomic data from the phylum Thaumarchaeota.
This project hosts metagenome-assembled genomes (MAGs) from multiple collections compiled by Tsementzi et al (in preparation). The set included here is the "high-quality set", with genome quality > 50%, based on CheckM estimates as:
quality = completeness - 5 × contamination.
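The quality score above can be sketched in a few lines. This assumes completeness and contamination are given in percent, as CheckM reports them; the function names and the `threshold` parameter are hypothetical and used only for illustration.

```python
def genome_quality(completeness: float, contamination: float) -> float:
    """Quality score: completeness - 5 x contamination (both in percent)."""
    return completeness - 5.0 * contamination


def is_high_quality(completeness: float, contamination: float,
                    threshold: float = 50.0) -> bool:
    """True when the quality score is strictly above the threshold (50% here)."""
    return genome_quality(completeness, contamination) > threshold


print(genome_quality(92.0, 3.0))   # 77.0
print(is_high_quality(60.0, 2.5))  # False (score is 47.5)
```

Because contamination is weighted five times, a MAG with 60% completeness and 2.5% contamination falls just short of the cutoff.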
The taxonomy of the datasets was inferred by MiGA using NCBI Prok as a reference at p-value < 0.05.
This database contains assemblies from type material in Archaea and Bacteria as flagged by NCBI, including both complete and draft genomes.
High-quality metagenome-assembled genomes (MAGs) from 5 lakes and 2 estuarine locations along the Chattahoochee River, Southeast USA.
MAGs obtained from a collection of 100 metagenomes using Subtractive Iterative Binning.
All available genomes from the genus Xanthomonas.
This project excludes the dataset
which includes a
large scaffold (119 Kbp)
almost entirely covered by a single homopolymer (poly-T).