Enveomics collection

A toolbox for microbial genomics and metagenomics

The collection

§ Sequence similarity search

Statistics

BedGraph.tad.rb
Estimates the truncated average sequencing depth (TAD) from a BedGraph file.
BedGraph.window.rb
Estimates the sequencing depth per window from a BedGraph file.
BlastPairwise.AAsubs.pl
Counts the different AA substitutions in the best-hit BLAST alignments, from a BLASTP pairwise format output (-outfmt 0 in BLAST+, -m 0 in legacy BLAST).
BlastTab.advance.bash
Calculates the percentage completed of a partial BLAST result. The value produced slightly underestimates the actual advance, due to un-flushed output and trailing queries that could be processed but generate no results.
BlastTab.recplot2.R
Produces recruitment plot objects, provided that BlastTab.catsbj.pl has been previously executed.
BlastTab.seqdepth.pl
Estimates the sequencing depth of subject sequences.
BlastTab.seqdepth_nomedian.pl
Estimates the sequencing depth of subject sequences. The values reported by this script may differ from those of BlastTab.seqdepth.pl, because this script uses the aligned length of the read while BlastTab.seqdepth.pl uses the aligned length of the subject sequence.
BlastTab.seqdepth_ZIP.pl
Estimates the average sequencing depth of subject sequences (genes or contigs) assuming a Zero-Inflated Poisson distribution (ZIP) to correct for non-covered positions. It uses the corrected method of moments estimators (CMMEs) as described by Beckett et al. [1]. Note that eq. (2.4) in [1] contains a mistake and should read: pi-hat-MM = 1 - (X-bar / lambda-hat-MM). Also note that a more elaborate mixture distribution can arise from coverage histograms (e.g., see [2] for an additional correction called 'tail distribution' and mixtures involving negative binomial), so take these results cum grano salis. [1] http://anisette.ucs.louisiana.edu/Academic/Sciences/MATH/stage/stat2012.pdf [2] Lindner et al., Bioinformatics, 2013.
BlastTab.sumPerHit.pl
Sums the weights of all the queries hitting each subject. Often (but not necessarily) the BLAST files contain only best matches. The weights can be any number, but a common use of this script is to add up counts (weights are integers). For example, in a BLAST of predicted genes vs some annotation source, the weights could be the number of reads recruited by each gene.
FastQ.test-error.rb
Compares the estimated error of sequencing reads (Q-score) with observed mismatches (identity against a known reference sequence).
RecPlot2.compareIdentities.R
Calculates the difference between identity distributions of two recruitment plots.
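
The moment estimators mentioned for BlastTab.seqdepth_ZIP.pl can be illustrated with a small sketch. It uses the standard ZIP method-of-moments estimate lambda-hat = X-bar + S^2/X-bar - 1 together with the corrected expression for pi-hat quoted above. The function name, the use of the population variance, and the fallback to a plain Poisson when the data are not over-dispersed are assumptions made for illustration; they are not taken from the script itself.

```python
from statistics import fmean, pvariance


def zip_moment_estimates(depths):
    """Return (lambda_hat, pi_hat) for a list of per-position depths.

    Hypothetical helper: illustrates the method of moments for a
    Zero-Inflated Poisson, not the script's actual implementation.
    """
    mean = fmean(depths)
    if mean == 0:
        raise ValueError("all positions have zero depth; estimates undefined")
    var = pvariance(depths, mu=mean)  # assumption: population variance
    lam = mean + var / mean - 1.0
    if lam <= mean:
        # Assumed fallback: no evidence of zero inflation, use plain Poisson.
        return mean, 0.0
    pi = 1.0 - mean / lam  # pi-hat-MM = 1 - (X-bar / lambda-hat-MM)
    return lam, pi


if __name__ == "__main__":
    lam, pi = zip_moment_estimates([0, 0, 0, 0, 4, 4, 4, 4])
    print(f"lambda_hat={lam:.3f} pi_hat={pi:.3f}")
```

Here lambda_hat plays the role of the average depth of covered positions and pi_hat the estimated fraction of excess zero-depth positions.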

Manipulation

BlastTab.addlen.rb
Appends an extra column to a tabular BLAST with the length of the query or the subject sequence.
BlastTab.best_hit_sorted.pl
Filters a tabular BLAST to retain only the best matches.
BlastTab.catsbj.pl
Generates a list of hits from a BLAST result concatenating the subject sequences. This can be used, e.g., to analyze BLAST results against draft genomes. This script creates two files using <map.bls> as prefix with extensions .rec (for the recruitment plot) and .lim (for the limits of the different sequences in <seq.fa>).
BlastTab.cogCat.rb
Replaces the COG gene IDs in a BLAST with the COG category.
BlastTab.filter.pl
Extracts a subset of hits (queries or subjects) from a tabular BLAST.
BlastTab.kegg_pep2path_rest.pl
Takes a BLAST against KEGG_PEP (or KO) and retrieves the pathways in which the subject peptides are involved.
BlastTab.pairedHits.rb
Identifies the best hits of paired reads.
BlastTab.subsample.pl
Filters a BLAST output including only the hits produced by any of the given sequences as query.
BlastTab.taxid2taxrank.pl
Takes a BLAST with NCBI Taxonomy IDs as subjects and replaces them by names at a given taxonomic rank.
BlastTab.topHits_sorted.rb
Reports the top-N best hits of a BLAST.
sam.filter.rb
Filters a SAM or BAM file by target sequences and/or identity.

Execution

aai.rb
Calculates the Average Amino acid Identity between two genomes.
ani.rb
Calculates the Average Nucleotide Identity between two genomes.
anir.rb
Estimates ANIr: the Average Nucleotide Identity of reads against a genome.
HMM.haai.rb
Estimates Average Amino Acid Identity (AAI) from the essential genes extracted and aligned by HMM.essential.rb (see Alignments).
rbm.rb
Finds the reciprocal best matches between two sets of sequences.
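
The idea behind reciprocal best matches, as reported by rbm.rb, can be sketched as follows. The input format (tuples of query, subject, bit score) and both function names are hypothetical; the script itself runs the searches and applies its own thresholds, none of which is modeled here.

```python
def best_hits(rows):
    """Map each query to its highest-scoring subject.

    rows: iterable of (query, subject, bitscore) tuples (assumed format).
    """
    best = {}
    for query, subject, score in rows:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_matches(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    a_vs_b = [("a1", "b1", 200), ("a1", "b2", 150), ("a2", "b2", 90)]
    b_vs_a = [("b1", "a1", 198), ("b2", "a1", 140), ("b2", "a2", 80)]
    print(reciprocal_best_matches(a_vs_b, b_vs_a))
```

In this toy input, a2 hits b2 best, but b2 prefers a1, so only the a1/b1 pair is reciprocal.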

§ Sequence analyses

Statistics

FastA.gc.pl
Estimates the G+C content of sequences.
FastA.length.pl
Returns the length of sequences in (multi-)FastA.
FastA.N50.pl
Calculates the N50 value of a set of sequences. Alternatively, it can calculate other N** values. It also calculates the total number of sequences, the total added length, and the longest sequence length.
FastA.qlen.pl
Calculates the quartiles of the length in a set of sequences. The Q2 is also known as the median. Q0 is the minimum length, and Q4 is the maximum length. It also calculates TOTAL, the added length of the sequences in the file, and AVG, the average length.
FastQ.test-error.rb
Compares the estimated error of sequencing reads (Q-score) with observed mismatches (identity against a known reference sequence).
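
As an illustration of the statistic reported by FastA.N50.pl, the following minimal sketch computes N50 (or another N** value) from a list of sequence lengths. The function name and the `fraction` parameter are hypothetical and do not mirror the script's options.

```python
def nxx(lengths, fraction=0.5):
    """Return the length L such that sequences of length >= L
    add up to at least `fraction` of the total length (N50 by default)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total * fraction:
            return length
    raise ValueError("no sequences provided")


if __name__ == "__main__":
    lengths = [2, 3, 4, 5, 6]
    print("N50:", nxx(lengths))
    print("N75:", nxx(lengths, fraction=0.75))
```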

Manipulation

FastA.extract.rb
Extracts a list of sequences and/or coordinates from multi-FastA files.
FastA.filter.pl
Extracts a subset of sequences from a FastA file.
FastA.filterLen.pl
Filters a multi-FastA file by length.
FastA.filterN.pl
Filters sequences by N-content and presence of long homopolymers.
FastA.fragment.rb
Simulates incomplete (fragmented) drafts from complete genomes.
FastA.interpose.pl
Interposes sequences in FastA format from two files into one output file. If more than two files are provided, the script will interpose all the input files.
FastA.mask.rb
Masks sequence region(s) in a FastA file.
FastA.per_file.pl
Extracts all the sequences in a multi-FastA into multiple single-FastA files.
FastA.rename.pl
Renames a set of sequences in FastA format.
FastA.revcom.pl
Reverse-complements sequences in FastA format.
FastA.sample.rb
Samples a random set of sequences from a multi-FastA file.
FastA.slider.pl
Slices sequences in fixed- or variable-length windows.
FastA.split.pl
Splits a FastA file into two or more files.
FastA.split.rb
Evenly splits a multi-FastA file into multiple multi-FastA files.
FastA.subsample.pl
Subsamples a set of sequences.
FastA.tag.rb
Generates easy-to-parse tagged reads from FastA files.
FastA.toFastQ.rb
Creates a FastQ-compliant file from a FastA file.
FastA.wrap.rb
Wraps sequences in a FastA to a given line length.
FastQ.filter.pl
Extracts a subset of sequences from a FastQ file.
FastQ.interpose.pl
Interposes sequences in FastQ format from two files into one output file. If more than two files are provided, the script will interpose all the input files.
FastQ.maskQual.rb
Masks low-quality bases in a FastQ file.
FastQ.offset.pl
There are several FastQ formats. This script takes a FastQ in any of them, identifies the type of FastQ (that is, the offset), and generates a FastQ with the given offset.
FastQ.split.pl
Splits a FastQ file into several FastQ files. This script can be used to separate interposed sister reads using any even number of output files.
FastQ.tag.rb
Generates easy-to-parse tagged reads from FastQ files.
FastQ.toFastA.awk
Translates FastQ files into FastA.
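
The offset detection and conversion described for FastQ.offset.pl can be approximated with a simple heuristic, sketched below. It assumes only two candidate offsets (33 and 64) and treats any quality character below ASCII 59 as proof of offset 33; the actual rules used by the script are not documented here, and both function names are hypothetical.

```python
def guess_offset(quality_lines):
    """Guess the quality offset (33 or 64) from the lowest quality character.

    Heuristic assumption: characters below ASCII 59 cannot occur in
    64-based encodings, so their presence implies offset 33.
    """
    lowest = min(ord(char) for line in quality_lines for char in line)
    return 33 if lowest < 59 else 64


def shift_quality(quality, old_offset, new_offset):
    """Re-encode a quality string from one offset to another."""
    return "".join(chr(ord(char) - old_offset + new_offset) for char in quality)


if __name__ == "__main__":
    qualities = ["hhhh", "hhgf"]
    offset = guess_offset(qualities)
    print(offset, [shift_quality(q, offset, 33) for q in qualities])
```

Note that a file containing only high-quality bases encoded with offset 33 would be misclassified by this heuristic, so a real tool needs more context than a single minimum value.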

§ Diversity

Community

AlphaDiversity.pl
Takes a table of OTU abundance in one or more samples and calculates the Rao (Q_alpha), Rao-Jost (Q_alpha_eqv), Shannon (Hprime), and inverse Simpson (1_lambda) indices of alpha diversity for each sample.
Chao1.pl
Takes a table of OTU abundance in one or more samples and calculates the Chao1 index (with 95% confidence interval) for each sample. To use it with QIIME OTU tables, run it ignoring 1 left column and with header.
Table.barplot.R
Creates nice barplots from tab-delimited tables.
Table.prefScore.R
Estimates the preference score of species based on occupancy in biased sample sets.
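
Some of the indices reported by AlphaDiversity.pl and Chao1.pl can be sketched for a single sample as shown below. The sketch assumes natural logarithms for Shannon and the bias-corrected form of Chao1, and it omits the Rao indices and the confidence interval; these choices and the function names are assumptions and may differ from what the scripts do.

```python
import math


def shannon(counts):
    """Shannon index H' using natural logarithms (assumed base)."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def inverse_simpson(counts):
    """Inverse Simpson index: 1 / sum(p_i^2)."""
    total = sum(counts)
    return 1.0 / sum((c / total) ** 2 for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


if __name__ == "__main__":
    sample = [5, 1, 1, 2, 0]  # OTU counts for one sample
    print(shannon(sample), inverse_simpson(sample), chao1(sample))
```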

Population

VCF.SNPs.rb
Counts the number of Single-Nucleotide Polymorphisms (SNPs) in a VCF file.
VCF.KaKs.rb
Estimates the Ka/Ks ratio from the SNPs in a VCF file. Ka and Ks are corrected using pseudo-counts, but no corrections for multiple substitutions are applied.
Table.prefScore.R
Estimates the preference score of species based on occupancy in biased sample sets.

§ Annotation

Database mapping

BlastTab.kegg_pep2path_rest.pl
Takes a BLAST against KEGG_PEP (or KO) and retrieves the pathways in which the subject peptides are involved.
BlastTab.taxid2taxrank.pl
Takes a BLAST with NCBI Taxonomy IDs as subjects and replaces them by names at a given taxonomic rank.
EBIseq2tax.rb
Maps a list of EBI-supported IDs to their corresponding NCBI taxonomy using EBI RESTful API.
NCBIacc2tax.rb
Maps a list of NCBI accessions to their corresponding taxonomy using the NCBI EUtilities.
gi2tax.rb
Maps a list of NCBI GIs to their corresponding taxonomy using the NCBI EUtilities.
M5nr.getSequences.rb
Downloads a set of sequences from M5nr with a given functional annotation.
RefSeq.download.bash
Downloads a collection of sequences and/or annotations from NCBI's RefSeq.
SRA.download.bash
Downloads the set of runs from a project, sample, or experiment in SRA. If the expected file already exists, skips the file if the MD5 hash matches.

Tables

Table.barplot.R
Creates nice barplots from tab-delimited tables.
GenBank.add_fields.rb
Adds annotations to GenBank files.
MyTaxa.fragsByTax.pl
Identifies fragments annotated as a taxon in MyTaxa.
Table.df2dist.R
Transforms a tab-delimited list of distances into a square matrix.
Table.filter.pl
Extracts (and re-orders) a subset of rows from a raw table.
Table.merge.pl
Merges multiple (two-column) lists into one table.
Table.replace.rb
Replaces a field in a table using a mapping file.
Table.round.rb
Rounds numbers in a table.
Table.split.pl
Splits a file with multiple columns into multiple two-column lists.
HMM.essential.rb
Finds and extracts a collection of essential proteins suitable for genome completeness evaluation and phylogenetic analyses in Archaea and Bacteria.
HMM.haai.rb
Estimates Average Amino Acid Identity (AAI) from the essential genes extracted and aligned by HMM.essential.rb (see Alignments).
HMMsearch.extractIds.rb
Extracts the sequence IDs and query model from a (multiple) HMMsearch report (for HMMER 3.0).
ogs.annotate.rb
Annotates Orthology Groups (OGs) using one or more reference genomes.
ogs.core-pan.rb
Subsamples the genomes in a set of Orthology Groups (OGs) and estimates the trend of core genome and pangenome sizes.
ogs.extract.rb
Extracts sequences of Orthology Groups (OGs) from genomes (proteomes).
ogs.mcl.rb
Identifies Orthology Groups (OGs) in Reciprocal Best Matches (RBM) between all pairs in a collection of genomes, using the Markov Cluster Algorithm.
ogs.stats.rb
Estimates some descriptive statistics on a set of Orthology Groups (OGs).
ogs.rb
Identifies Orthology Groups (OGs) in Reciprocal Best Matches (RBM) between all pairs in a collection of genomes.

§ Other data

Phylogenetic and other distances

CharTable.classify.rb
Uses a dichotomous key to classify objects parsing a character table.
JPlace.distances.rb
Extracts the distance (estimated branch length) of each placed read to a given node in a JPlace file.
JPlace.to_iToL.rb
Generates iToL-compatible files from a .jplace file (produced by RAxML's EPA or pplacer), that can be used to draw pie-charts in the nodes of the reference tree.
Newick.autoprune.R
Automatically prunes a tree, to keep representatives of each clade.
TRIBS.test.R
Estimates the empirical difference between all the distances in a set of objects and a subset, together with its statistical significance.
TRIBS.plot-test.R
Plots an `enve.TRIBStest` object.
Table.df2dist.R
Transforms a tab-delimited list of distances into a square matrix.

Taxonomic

CharTable.classify.rb
Uses a dichotomous key to classify objects parsing a character table.
EBIseq2tax.rb
Maps a list of EBI-supported IDs to their corresponding NCBI taxonomy using EBI RESTful API.
NCBIacc2tax.rb
Maps a list of NCBI accessions to their corresponding taxonomy using the NCBI EUtilities.
Table.barplot.R
Creates nice barplots from tab-delimited tables.
gi2tax.rb
Maps a list of NCBI GIs to their corresponding taxonomy using the NCBI EUtilities.
MyTaxa.fragsByTax.pl
Identifies fragments annotated as a taxon in MyTaxa.
MyTaxa.seq-taxrank.rb
Generates a simple tabular file with the classification of each sequence at a given taxonomic rank from a MyTaxa output.
Taxonomy.silva2ncbi.rb
Re-formats Silva taxonomy into NCBI-like taxonomy dump files.

Alignments

AAsubs.log2ratio.rb
Estimates the log2-ratio of different amino acids in homologous sites using an AAsubs file (see BlastPairwise.AAsubs.pl). It provides the point estimation (.obs file), the bootstrap of the estimation (.boot file) and the null model based on label-permutation (.null file).
Aln.cat.rb
Concatenates several multiple alignments in FastA format into a single multiple alignment. The IDs of the sequences (or the ID prefixes, if using --ignore-after) must coincide across files.
Aln.convert.pl
Translates between different alignment formats.
BlastPairwise.AAsubs.pl
Counts the different AA substitutions in the best-hit BLAST alignments, from a BLASTP pairwise format output (-outfmt 0 in BLAST+, -m 0 in legacy BLAST).

Clustering

ogs.mcl.rb
Identifies Orthology Groups (OGs) in Reciprocal Best Matches (RBM) between all pairs in a collection of genomes, using the Markov Cluster Algorithm.
clust.rand.rb
Calculates the Rand Index and the Adjusted Rand Index between two clusterings. The clustering format is a raw text file with one cluster per line, each defined as comma-delimited members, and a header line (ignored). Note that this is equivalent to the OGs format for 1 genome.
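
The clustering format and the (unadjusted) Rand Index described for clust.rand.rb can be sketched as follows. The sketch assumes that both clusterings contain exactly the same members and does not compute the Adjusted Rand Index; the function names are hypothetical.

```python
from itertools import combinations


def parse_clusters(text):
    """Parse one cluster per line, comma-delimited, ignoring the header line."""
    lines = text.strip().splitlines()[1:]
    return [line.split(",") for line in lines if line]


def rand_index(clusters_a, clusters_b):
    """Fraction of member pairs on which the two clusterings agree."""
    label_a = {m: i for i, cluster in enumerate(clusters_a) for m in cluster}
    label_b = {m: i for i, cluster in enumerate(clusters_b) for m in cluster}
    agree = total = 0
    for x, y in combinations(sorted(label_a), 2):
        same_a = label_a[x] == label_a[y]
        same_b = label_b[x] == label_b[y]
        agree += same_a == same_b  # both together or both apart
        total += 1
    return agree / total


if __name__ == "__main__":
    first = parse_clusters("header\na,b\nc,d\n")
    second = parse_clusters("header\na,b,c\nd\n")
    print(rand_index(first, second))
```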

Read recruitments

anir.rb
Estimates ANIr: the Average Nucleotide Identity of reads against a genome.
BedGraph.tad.rb
Estimates the truncated average sequencing depth (TAD) from a BedGraph file.
BedGraph.window.rb
Estimates the sequencing depth per window from a BedGraph file.
BlastTab.catsbj.pl
Generates a list of hits from a BLAST result concatenating the subject sequences. This can be used, e.g., to analyze BLAST results against draft genomes. This script creates two files using <map.bls> as prefix with extensions .rec (for the recruitment plot) and .lim (for the limits of the different sequences in <seq.fa>).
BlastTab.pairedHits.rb
Identifies the best hits of paired reads.
BlastTab.recplot2.R
Produces recruitment plot objects, provided that BlastTab.catsbj.pl has been previously executed.
FastQ.test-error.rb
Compares the estimated error of sequencing reads (Q-score) with observed mismatches (identity against a known reference sequence).
GFF.catsbj.pl
Generates a list of coordinates from a GFF table concatenating the subject sequences.
RecPlot2.compareIdentities.R
Calculates the difference between identity distributions of two recruitment plots.
sam.filter.rb
Filters a SAM or BAM file by target sequences and/or identity.
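
The truncated average sequencing depth (TAD) estimated by BedGraph.tad.rb can be illustrated with a minimal sketch: expand BedGraph intervals into per-position depths, discard an equal share of the lowest and highest values, and average the rest. The central fraction kept (`keep=0.8`), the function names, and the handling of positions missing from the file (they are simply not counted) are assumptions for illustration, not the script's documented defaults.

```python
def depths_from_bedgraph(lines):
    """Expand BedGraph lines (chrom, start, end, value) into per-position depths."""
    depths = []
    for line in lines:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue  # skip blank and header lines
        _chrom, start, end, value = line.split()[:4]
        depths.extend([float(value)] * (int(end) - int(start)))
    return depths


def truncated_average_depth(depths, keep=0.8):
    """Average of the central `keep` fraction of the sorted depths."""
    if not depths:
        raise ValueError("no depth values provided")
    ordered = sorted(depths)
    trim = round(len(ordered) * (1 - keep) / 2)  # positions dropped per tail
    kept = ordered[trim:len(ordered) - trim]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    bedgraph = ["contig1\t0\t2\t3", "contig1\t2\t4\t5"]
    print(truncated_average_depth(depths_from_bedgraph(bedgraph)))
```

Truncating both tails makes the estimate less sensitive to uncovered regions and to highly covered regions such as conserved or repeated segments.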