Expected processing time: about 3 hours (last updated about 1 year ago).
This database contains non-redundant (99.5% ANI) prokaryotic genomes with complete and chromosome status in the NCBI Genome database.
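As an illustration of what "non-redundant (99.5% ANI)" can mean in practice, the following is a minimal sketch of a greedy redundancy filter. It is not MiGA's implementation: the function name, the dataset names, and the precomputed pairwise ANI table are all hypothetical, and the rule that type datasets are always retained is an assumption based on the Apr-25-2019 entry below.

```python
from typing import Dict, FrozenSet, List, Set

ANI_THRESHOLD = 99.5  # percent ANI, the cutoff stated for this database


def filter_redundant(
    datasets: List[str],
    ani: Dict[FrozenSet[str], float],
    type_datasets: Set[str],
    threshold: float = ANI_THRESHOLD,
) -> List[str]:
    """Greedy filter (hypothetical helper, not part of MiGA).

    A non-type dataset is dropped when its ANI to an already kept
    dataset is at or above the threshold. Type datasets are always kept.
    Pairs missing from the ANI table are treated as unrelated (ANI 0).
    """
    # Visit type datasets first so that they act as representatives.
    ordered = sorted(datasets, key=lambda name: (name not in type_datasets, name))
    kept: List[str] = []
    for name in ordered:
        is_redundant = name not in type_datasets and any(
            ani.get(frozenset((name, other)), 0.0) >= threshold for other in kept
        )
        if not is_redundant:
            kept.append(name)
    return kept


# Made-up example data: names and ANI values are illustrative only.
example_datasets = ["A_type", "B", "C", "D"]
example_ani = {
    frozenset(("A_type", "B")): 99.8,
    frozenset(("A_type", "C")): 99.1,
    frozenset(("B", "C")): 99.6,
    frozenset(("C", "D")): 97.0,
}
example_kept = filter_redundant(example_datasets, example_ani, {"A_type"})
print(example_kept)
```

In this made-up example, B is dropped because it is within 99.5% ANI of the type dataset, while C is retained because its only close relative (B) was itself removed. The outcome of a greedy filter like this depends on the visiting order, which is one reason to treat it as a sketch only.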
Apr-25-2019: Removed legacy redundancy in the database. Non-type datasets with close relatives in the database (ANI ≥ 99.5%) were removed, 5,317 datasets in total.
Apr-20-2019: Database update
Mar-15-2019: Indexing the last update. The 148 inactivated datasets will be included in the next update.
Feb-12-2019: Database update
Ralstonia_mannitolilytica_NZ_CM009147, which contains only chromosome II.
Jan-14-2019: 610 datasets were unlinked to rebuild the index. These datasets are mainly composed of Escherichia coli (n=185), Klebsiella pneumoniae (n=119), and Salmonella enterica (n=203). Starting in the next update, these and all new datasets will be subjected to a redundancy pre-filter to avoid including new datasets with high identity to existing entries.
Dec-10-2018: Database update
Dec-07-2018: 59 datasets with estimated completeness below 40% removed.
Nov-27-2018: Blacklisted dataset
Escherichia_coli_CP028576 with only two
Nov-23-2018: Blacklisted dataset
"chromosome: I" but clearly a plasmid.
293 datasets temporarily unlinked to rebuild index.
Oct-05-2018: Database update.
Oct-03-2018: 305 datasets soft-unlinked to update index.
Jul-17-2018: Database update.
Silvanigrella_aquatica_CP017838: taxid:1912593 to taxid:1915309.
Staphylococcus_epidermidis_CP018841: taxid:1929941 to taxid:1282.
Staphylococcus_aureus_subsp__aureus_CP025490, single plasmids, as well as
Cupriavidus_sp__NH9_NZ_CP017758, missing chromosome I.
Mar-06-2018: Blacklisted all Ca. Tremblaya princeps and Ca. Tremblaya phenacola datasets, which have 50-100 predicted proteins and 100-200 kbp and "cannot be considered a living organism".
Feb-14-2018: Initiated indexing by groups (for new genomes) to reduce impact on query datasets.
Feb-12-2018: Database update. Total reference genomes: 11,232. This update is the first using the miga ncbi_get method.
Jan-19-2018: Soft-unlinked the following datasets for indexing (will be included in the next update):
Jan-15-2018: Blacklisted 1 dataset composed exclusively of plasmid sequences: Halobacterium_salinarum_NC_002121. Total reference genomes: 10,637.
Dec-11-2017: Database update. 381 datasets eliminated and 884 datasets added. Total reference genomes: 10,638.
Nov-07-2017: The following datasets were soft-unlinked for indexing, but will be included in the next update:
Oct-12-2017: Database update. 129 datasets eliminated and 549 added. Total reference genomes: 10,224 post-update.
Aug-30-2017: Manually modified taxonomic rank
dataset in dataset
which was confusing MiGA into thinking it had a registered kingdom.
Aug-30-2017: A filesystem error caused an interruption of the following datasets, which will be unlinked for this update and re-downloaded in the next update:
Aug-15-2017: Database update. 75 datasets eliminated and 573 added. Total reference genomes: 9,559 (post-update).
Jul-07-2017: The following datasets were temporarily unlinked to complete
Natrialbaceae_archaeon_JW_NM_HA_15_NZ_CP019893. These datasets will be
included in the next update.
Jul-02-2017: The dataset Stenotrophomonas_maltophilia_NC_001383 is composed only of plasmid sequences and was manually removed.
Jun-23-2017: Database update. 91 datasets eliminated and 430 added. Total reference genomes: 8,724 (pre-update) - 9,063 (post-update).
May-12-2017: Database update. 222 datasets eliminated and 324 added. Total reference genomes: 8,622 (pre-update) - 8,724 (post-update).
Apr-25-2017: Database update. 7 datasets eliminated and 73 added. Total reference genomes: 8,557 (pre-update) - 8,622 (post-update). Database not indexed for this update.
Apr-21-2017: The following datasets were composed only of plasmids and were eliminated:
Candidatus_Tremblaya_princeps_LN998829: This dataset has a sequence named chromosome I, but it contains only 51 genes (140 kbp), so it's likely a plasmid.
The Burkholderia_pseudomallei_NZ_CM007659 dataset contains only
the second chromosome of B. pseudomallei, resulting in a completeness of
2.7% (3 essential genes); it was therefore removed. The current database has
8,562 reference datasets.
Apr-17-2017: Database update. 87 datasets were eliminated and 218 datasets added. The following datasets were eliminated based on the previous update or completeness report (<1% and no 16S):
Total reference genomes: 8,437 (pre-update) - 8,563 (post-update).
Apr-16-2017: Manually modified domain in the taxonomy of
Apr-15-2017: Note for next update: Check out
seems to be composed only of plasmids. Evaluate completeness to clean the
Mar-06-2017: Database update. 239 datasets were eliminated and 663
datasets added. The dataset
Mycobacterium_tuberculosis_NC_025025 is only a
plasmid with 6,898 bp and no chromosome sequence, and was manually removed.
Total reference genomes: 8,015 (pre-update) -> 8,438 (post-update). The
Legionella_fallonii_LLAP_10_NZ_LN614827 was manually removed because
of a corrupt database file (it'll be incorporated in the next update),
resulting in 8,437 datasets.
complete project so it can be used in the website, but I'll keep running the
distances of this dataset in the meantime.
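The entries above that report both pre-update and post-update totals can be cross-checked with simple arithmetic. The sketch below assumes that the post-update total equals the pre-update total minus the eliminated datasets plus the added ones. The function name is hypothetical, and the numbers are copied from the May-12-2017 and Jun-23-2017 entries.

```python
def expected_post_update(pre_update: int, eliminated: int, added: int) -> int:
    """Post-update total implied by a changelog entry (hypothetical helper)."""
    return pre_update - eliminated + added


# (date, pre-update total, eliminated, added, reported post-update total)
entries = [
    ("May-12-2017", 8622, 222, 324, 8724),
    ("Jun-23-2017", 8724, 91, 430, 9063),
]

for date, pre, eliminated, added, reported in entries:
    computed = expected_post_update(pre, eliminated, added)
    status = "consistent" if computed == reported else "check entry"
    print(f"{date}: computed {computed}, reported {reported} ({status})")
```

Entries that also mention manual removals or blacklisted datasets need those subtracted as well before the totals can be compared.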