Nonpareil

Estimating coverage in metagenomes

Frequently Asked Questions

§ Online submission

What type of data can I use?
Short reads (Nonpareil is optimized for Illumina). Ideally, the sequences should be trimmed and quality-checked. Nonpareil assumes that reads are independent events; please see below if you have paired-end reads. The sequences can be in either FastA or FastQ format (gzipped or not), but we strongly suggest using gzipped FastA for smaller file sizes (i.e., faster uploads).
How should I trim/quality-check my reads?
Nonpareil expects that sequencing error is always below 5%, so we suggest using an expected error cutoff of 1% (i.e., Q>20, or 1 error in 100 nucleotides). We recommend performing this task with SolexaQA.
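The relation between a quality score and the expected error rate can be checked with a short calculation. The sketch below assumes the standard Phred definition, p = 10^(-Q/10); the function names are illustrative and are not part of Nonpareil or SolexaQA.

```python
import math


def phred_to_error(q: float) -> float:
    """Expected per-base error probability for a Phred quality score."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred quality score for an expected per-base error probability."""
    return -10 * math.log10(p)


# Q20 corresponds to the suggested 1% cutoff (1 error in 100 nucleotides).
print(phred_to_error(20))
# Quality score that corresponds to the 5% ceiling mentioned above.
print(error_to_phred(0.05))
```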
Can I use paired-end or mate-pair reads?
A strong assumption of Nonpareil is that reads represent independent events. So the short answer is no. However, sister reads can still be used, one sister read at a time. If you have paired reads, chances are that you have two files (one for sister 1, another for sister 2). Upload only one of these files at a time. In fact, you may upload both files in independent jobs, and consider them as replicates.
What does "Match options" mean?
Pair-wise read comparisons are at the core of Nonpareil. To define a "good match" between reads, one needs (at least) two cutoffs: minimum identity, and minimum alignment length. The identity is set at 95% by default, to reflect the sequence-discrete populations frequently seen in metagenomics (Caro-Quintero & Konstantinidis 2012), and other values are untested (change it at your own risk). Minimum alignment length, on the other hand, allows for different ranges of resolution. In samples with very high diversity and very low coverage (e.g., small metagenomes from soil samples), Nonpareil would require better low-coverage resolution, and small read overlap should be preferred (25%). On the other hand, in samples with very low diversity and very high coverage (e.g., several Gbp from environments with few species), large read overlap should be preferred (75%) to ensure high-coverage resolution. In some extreme cases, 100% could be used (e.g., over 10 Gbp for a small genome), but in those cases coverage is probably better assessed using assembly mapping than Nonpareil. However, 100% overlap is the fastest option, so it can be used for quick data exploration. In most cases, 50% read overlap should be preferred.
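The two cutoffs can be pictured as a simple predicate over an alignment between two reads. This is only a sketch of the idea, not Nonpareil's comparison code: the function name and arguments are hypothetical, and it assumes that overlap is measured as the aligned length divided by the read length.

```python
def is_good_match(identical: int, aligned: int, read_length: int,
                  min_identity: float = 0.95,
                  min_overlap: float = 0.50) -> bool:
    """Return True if an alignment passes both cutoffs.

    identical   -- number of identical positions in the alignment
    aligned     -- alignment length, in positions
    read_length -- length of the read (assumed denominator for overlap)
    """
    if aligned <= 0 or read_length <= 0:
        return False
    identity = identical / aligned
    overlap = aligned / read_length
    return identity >= min_identity and overlap >= min_overlap


# A 100 bp read aligned over 60 bp with 58 identical positions
# passes the default 50% overlap, but not the stricter 75% overlap.
print(is_good_match(58, 60, 100))
print(is_good_match(58, 60, 100, min_overlap=0.75))
```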
What does "Sequence type" mean?
This lets you specify how sequences are treated. Single-stranded tells Nonpareil to search for matches only in the sequenced strand (i.e., to skip the reverse-complement strand). This can be useful if you have strand-specific data (like RNA data). However, note that most meta-transcriptomes come from cDNA (not RNA directly), so they are not strand-specific. N as mismatch tells Nonpareil to treat every N in the dataset as a mismatch. This typically has no impact, because an N implies a very large sequencing error (>25%; see more on sequencing error), and should never occur in data trimmed for Nonpareil analysis.
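As a rough illustration of what these two switches control, the sketch below compares two equal-length sequences on one or both strands. All names are hypothetical, and it is an assumption (not taken from Nonpareil) that an N is simply ignored when the "N as mismatch" option is off.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA sequence (N stays N)."""
    return seq.translate(_COMPLEMENT)[::-1]


def mismatches(a: str, b: str, n_as_mismatch: bool = True) -> int:
    """Count mismatches between two equal-length sequences."""
    count = 0
    for x, y in zip(a.upper(), b.upper()):
        if "N" in (x, y):
            # Assumption: with the option off, positions with N are ignored.
            if n_as_mismatch:
                count += 1
        elif x != y:
            count += 1
    return count


def best_mismatches(query: str, target: str,
                    single_stranded: bool = False,
                    n_as_mismatch: bool = True) -> int:
    """Fewest mismatches over the strands that are searched."""
    candidates = [query]
    if not single_stranded:
        # Double-stranded mode also tries the reverse complement.
        candidates.append(reverse_complement(query))
    return min(mismatches(c, target, n_as_mismatch) for c in candidates)
```

With these definitions, a query that matches only as a reverse complement is found in the default mode and missed in single-stranded mode.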
Are the "Miscellaneous" options of any use?
Rarely. Random seed helps us debug. Nothing else. Query set size specifies the number of reads to be sampled as queries to build the Nonpareil curve. Nonpareil curves are actually very expensive to compute, so we approximate using a representative set. A larger set will produce more accurate Nonpareil curves, but we have observed that accuracy doesn't improve above 1,000 reads. A smaller set linearly reduces the running time (i.e., 500 will run in half the time), but is not guaranteed to be accurate (or even reproducible).

§ Output

Redundancy summary: .npo file
Tab-delimited file with six columns. The first column indicates the sequencing effort (in number of reads), and the remaining columns indicate the summary of the distribution of redundancy (from the replicates, 1,024 by default) at the given sequencing effort. These five columns are: average redundancy, standard deviation, quartile 1, median (quartile 2), and quartile 3.
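A minimal sketch for loading such a file in Python is shown below. The field names are our own labels for the six columns described above, skipping lines that start with "#" is an assumption about possible header lines, and the example values are made up for illustration only.

```python
import csv
import io
from typing import NamedTuple


class NpoRow(NamedTuple):
    effort: float   # sequencing effort (number of reads)
    mean: float     # average redundancy
    sd: float       # standard deviation
    q1: float       # quartile 1
    median: float   # quartile 2
    q3: float       # quartile 3


def read_npo(handle) -> list:
    """Parse the six tab-delimited columns of a .npo file."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        # Assumption: blank lines and '#' lines carry no data.
        if not fields or fields[0].startswith("#"):
            continue
        rows.append(NpoRow(*(float(x) for x in fields[:6])))
    return rows


# Made-up values, for illustration only.
example = io.StringIO(
    "# hypothetical header line\n"
    "1000\t0.10\t0.02\t0.08\t0.10\t0.12\n"
    "2000\t0.18\t0.03\t0.16\t0.18\t0.20\n"
)
rows = read_npo(example)
print(rows[-1].effort, rows[-1].median)
```

For a real file, pass an open file handle instead of the in-memory example.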
Redundancy values: .npa file
Tab-delimited file with three columns. Similar to the .npo files, it contains information about the redundancy at each sequencing effort, but it provides ALL the results from the replicates, not only the summary at each point. The first column indicates the sequencing effort (as a fraction of the dataset), the second column indicates the ID of the replicate (a number used only to introduce some controlled noise in plots), and the third column indicates the estimated redundancy value.
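Because the .npa file keeps every replicate, summaries like those in the .npo file can be recomputed from it. The sketch below groups redundancy values by sequencing effort and takes the median of each group; the function names are hypothetical, and skipping "#" lines is an assumption.

```python
from collections import defaultdict
from statistics import median


def read_npa(lines) -> dict:
    """Group redundancy values by sequencing effort (first column)."""
    by_effort = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and '#' lines carry no data.
        if not line or line.startswith("#"):
            continue
        effort, _replicate, redundancy = line.split("\t")[:3]
        by_effort[float(effort)].append(float(redundancy))
    return dict(by_effort)


def median_redundancy(by_effort: dict) -> dict:
    """Median redundancy at each sequencing effort, in increasing effort."""
    return {e: median(v) for e, v in sorted(by_effort.items())}
```

Any iterable of lines works as input, for example an open file handle.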
Mates distribution: .npc file
Raw list with the number of reads in the dataset matching a query read. A set of query reads is randomly drawn by Nonpareil (1,000 by default), and compared against all reads in the dataset. Each line in this file corresponds to a query read (the order is not important). We have seen some correspondence between these numbers and the distribution of abundances in the community (compared, for example, as rank-abundance plots), but this file is provided only for quality-control purposes and comparisons with other tools.
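For a quick look at these numbers as a rank-ordered list, similar in spirit to a rank-abundance plot, something like the following sketch could be used. It assumes one integer per line; the function names are hypothetical.

```python
def read_npc(lines) -> list:
    """Read one match count per line, ignoring blank lines."""
    stripped = (line.strip() for line in lines)
    return [int(s) for s in stripped if s]


def rank_order(counts: list) -> list:
    """Counts sorted from most-matched to least-matched query read."""
    return sorted(counts, reverse=True)
```

Since the order of the lines is not important, sorting does not discard any information.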
Log: .npl file
A verbose log of internal Nonpareil processing. The number to the left (inside square brackets) indicates the CPU time (in minutes). This file also provides a quality assessment of the Nonpareil run (automated consistency evaluation). Ideally, the last line should read "Everything seems correct". Otherwise, it suggests alternative parameters that may improve the estimation.
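The final check described above is easy to automate. The sketch below assumes that each log line looks like "[ 1.5] message", which is our reading of the description rather than a documented format, and the function names are hypothetical.

```python
import re

# Assumed line layout: "[<CPU minutes>] <message>".
LOG_LINE = re.compile(r"^\s*\[\s*([0-9]*\.?[0-9]+)\s*\]\s*(.*)$")


def parse_npl(lines) -> list:
    """Return (cpu_minutes, message) pairs for lines that fit the layout."""
    entries = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match:
            entries.append((float(match.group(1)), match.group(2)))
    return entries


def run_looks_ok(lines) -> bool:
    """True if the last non-empty line reports a consistent run."""
    non_empty = [line.strip() for line in lines if line.strip()]
    return bool(non_empty) and "Everything seems correct" in non_empty[-1]
```

If `run_looks_ok` returns False, read the end of the log by hand, since that is where the suggested alternative parameters appear.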

§ Nonpareil diversity

What is the Nonpareil diversity index?
We haven't yet published a formal description of the "diversity" reported by Nonpareil, which we internally refer to as Nonpareil diversity. The Nonpareil diversity is a measurement of how complex a community is in "sequence space". Graphically, the index is a measurement of "how much to the right the Nonpareil curve is".
How does Nonpareil diversity compare to other metrics?
Nonpareil diversity may correlate with classic bin-based diversity indices (like Shannon) if certain assumptions are met. The key assumptions are:
  1. the genome sizes are identically distributed between samples,
  2. the gene content correlates well with the genetic distance, and
  3. the duplicated genomic regions are negligible compared to the unique regions.
In our experience, these assumptions are met reasonably well in bacterial communities, but they are typically violated in communities with a significant (e.g. >5%) eukaryotic fraction.
What's the expected range of the Nonpareil diversity index?
We've found that the Nonpareil diversity index ranks communities very reliably in terms of diversity, and can even detect small seasonal variations in bacterial communities. For reference: an environment largely dominated by a single bacterial species like human posterior fornix or the Acid Mine Drainage gets values around 15 to 16, human stool samples around 17 to 19, freshwater and sandy soils around 20 to 22, and marine and other soils around 21 to 25. This index has a logarithmic scale, to allow meaningful comparisons between extreme cases, like single species vs soil.
Can I use Nonpareil with PacBio reads?
Unfortunately, we're not planning on extending support of Nonpareil to PacBio, mainly for three reasons:
  1. PacBio reads are very long. These reads are typically on the order of several kbp, for which ungapped alignments are practically useless. This causes diversity overestimation and coverage underestimation. Possible solutions would be marker extraction, short-read simulation, or even replacing the comparison module. However, the first two violate the statistical assumptions of Nonpareil, and the third is beyond the scope of the current release.
  2. PacBio reads have very high error rates. The single-pass reads have error rates with a median upwards of 10%, which is way beyond Nonpareil's sensitivity. This causes diversity overestimation and coverage underestimation. SMRT runs can achieve very low error rates, but this is typically associated with lower yields and probably not ideal for metagenomes, which relates to:
  3. PacBio yields very few reads. PacBio runs are usually on the order of hundreds of thousands of reads. This is more than enough for scaffolding, but not for profiling of high- or intermediate-complexity communities. In low-complexity communities, PacBio-based assemblies may result in nearly complete genomes, so the approximation offered by Nonpareil is rendered moot.