Short reads (Nonpareil is optimized for Illumina). Ideally, the sequences
should be trimmed and quality-checked. Nonpareil
assumes that reads are independent events; please
see below if you have paired-end reads. The
sequences can be in either FastA or FastQ format (gzipped or not), but we
advise using gzipped FastA for the alignment kernel or gzipped FastQ for the k-mer kernel.
How should I trim/quality-check my reads?
Nonpareil expects that sequencing error is always below 5%, so we suggest using an expected error cutoff
of 1% (i.e., Q>20, or 1 error in 100 nucleotides). We recommend to perform this task
Can I use paired-end or mate-end reads?
A strong assumption of Nonpareil is that reads represent independent events. So the short
answer is no. However, sister reads can still be used, one sister read at a time. If you have paired
reads, chances are that you have two files (one for sister 1, another for sister 2). Upload only
one of these files at a time. In fact, you may upload both files in independent jobs, and consider
them as replicates.
What is the alignment kernel?
Pair-wise read comparisons are at the core of Nonpareil. The most basic kernel consists of aligning query
reads to target reads allowing mismatches. To define a "good match" between reads,
one needs (at least) two cutoffs: minimum identity, and minimum alignment length. The identity is
set at 95% by default, to reflect the sequence-discrete population frequently seen in metagenomics
(Caro-Quintero & Konstantinidis 2012),
and other values are untested (change it at your own risk). Minimum alignment
length, on the other hand, allows for different ranges of resolution. In samples with very high
diversity and very low coverage (e.g., small metagenomes from soil samples), Nonpareil would
require better low-coverage resolution, and small read overlap should be preferred (25%).
On the other hand, in samples with very low diversity and very high coverage (e.g., several
Gbp on environments with few species) large read overlap should be preferred (75%) to ensure
high-coverage resolution. In some extreme cases, 100% could be used
(e.g., over 10 Gb for a small genome), but in those cases coverage is probably better assessed using
assembly mapping than Nonpareil. However, 100% overlap is the fastest option, so it can be used for quick data
exploration. In most cases, 50% read overlap should be preferred.
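The two cutoffs can be illustrated with a minimal Python sketch of an ungapped read comparison. This is not Nonpareil's actual implementation: the function name `ungapped_match` is hypothetical, and measuring the overlap relative to the longer of the two reads is an assumption made for illustration only.

```python
import math


def ungapped_match(query, target, min_identity=0.95, min_overlap=0.50):
    """Return True if any ungapped overlap of the two reads passes both cutoffs.

    Assumption (illustrative only): the minimum alignment length is a
    fraction of the longer read, and identity is computed on the
    overlapping region alone.
    """
    min_len = max(1, math.ceil(min_overlap * max(len(query), len(target))))
    # offset = position of query[0] relative to target[0]; no gaps allowed.
    for offset in range(-(len(query) - min_len), len(target) - min_len + 1):
        q_start = max(0, -offset)
        t_start = max(0, offset)
        length = min(len(query) - q_start, len(target) - t_start)
        if length < min_len:
            continue  # overlap too short to count as a match
        matches = sum(
            1 for i in range(length) if query[q_start + i] == target[t_start + i]
        )
        if matches / length >= min_identity:
            return True
    return False
```

With two 10 bp reads that share only a 5 bp end, `ungapped_match("AAAAACCCCC", "CCCCCGGGGG", min_overlap=0.50)` finds a match, while the same call with `min_overlap=0.75` does not. Note also that a 100% overlap leaves a single offset to test when both reads have the same length, which is consistent with it being the fastest setting.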
What is the k-mer kernel?
Nonpareil includes a second kernel that uses the fact that highly similar
sequences should share at least one perfect match of a given (minimum) length.
This length is the k-mer length parameter, which stabilizes around 24 (default).
This kernel is much faster than the alignment kernel, and it produces comparable
results with default parameters. However, there is no direct correspondence with
the identity parameter, which may be useful in some analyses. If you require this
parameter to be explicitly set, use the alignment kernel instead.
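The idea behind the k-mer kernel can be sketched in a few lines of Python. This is only an illustration of the shared-exact-match criterion, not Nonpareil's implementation; the helper names `kmers` and `shares_kmer` are hypothetical, and strand handling is ignored.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shares_kmer(query, target, k=24):
    """Return True if the two reads share at least one exact match of length k."""
    return not kmers(query, k).isdisjoint(kmers(target, k))
```

Because the test is a set lookup rather than an alignment, there is no identity value involved: two reads either share a perfect k-mer or they do not.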
What does "Sequence type" mean?
This option specifies how sequences are treated. Single-stranded tells Nonpareil to
search matches only in the sequenced strand (i.e., skip reverse-complement strand). This
can be useful if you have strand-specific data (like RNA data). However, note that most meta-transcriptomes
come from cDNA (not RNA directly), so they are not strand-specific. N as mismatch
tells Nonpareil to treat every N in the dataset as a mismatch. This typically has no
impact, because N implies a very large sequencing error (>25%; see more on sequencing error),
and should never occur in data trimmed for Nonpareil analysis.
Are the "Miscellaneous" options of any use?
Rarely. Random seed helps us debug. Nothing else. Query set size specifies the number
of reads to be sampled as query to build the Nonpareil curve. Nonpareil curves are actually very
expensive to compute, so we approximate using a representative set. A larger set will produce more
accurate Nonpareil curves, but we have observed that accuracy doesn't improve above 1,000 reads for
alignment kernel or 10,000 for k-mer kernel (default values). A smaller set linearly reduces the running time
in alignment kernel (i.e., 500 will run in half the time), but is not guaranteed to be accurate
(or even reproducible).
Redundancy summary: .npo file
Tab-delimited file with six columns. The first column indicates the sequencing effort (in number of reads),
and the remaining columns indicate the summary of the distribution of redundancy (from the replicates, 1,024
by default) at the given sequencing effort. These five columns are: average redundancy, standard deviation,
quartile 1, median (quartile 2), and quartile 3.
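As a minimal sketch, the six columns described above could be read in Python as follows. The names `NpoRow` and `parse_npo` are hypothetical, and skipping lines that start with `#` is an assumption (in case the file carries header or metadata lines), not something stated here.

```python
from typing import NamedTuple


class NpoRow(NamedTuple):
    effort: float   # sequencing effort
    mean: float     # average redundancy
    sd: float       # standard deviation
    q1: float       # quartile 1
    median: float   # quartile 2
    q3: float       # quartile 3


def parse_npo(lines):
    """Parse tab-delimited .npo lines into NpoRow records."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # assumption: blank and '#' lines carry no data
        fields = line.split("\t")
        if len(fields) != 6:
            raise ValueError(f"expected 6 columns, found {len(fields)}: {line!r}")
        rows.append(NpoRow(*(float(f) for f in fields)))
    return rows
```

A typical call would be `with open("sample.npo") as fh: rows = parse_npo(fh)`, where `sample.npo` is a placeholder file name.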
Redundancy values: .npa file
Tab-delimited file with three columns. Similar to the .npo files, it contains information
about the redundancy at each sequencing effort, but it provides ALL the results from the replicates, not only
the summary at each point. The first column indicates the sequencing effort (as a fraction of the dataset),
the second column indicates the ID of the replicate (a number used only to introduce some controlled noise in
plots), and the third column indicates the estimated redundancy value.
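To show how the replicate values relate to a per-effort summary, here is a hedged Python sketch that groups .npa lines by sequencing effort and reports the mean and median redundancy. The function name `summarize_npa` is hypothetical, and ignoring lines that start with `#` is an assumption.

```python
import statistics
from collections import defaultdict


def summarize_npa(lines):
    """Group .npa rows by sequencing effort and return {effort: (mean, median)}."""
    by_effort = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank and '#' lines carry no data
        effort, _replicate, redundancy = line.split("\t")
        by_effort[float(effort)].append(float(redundancy))
    return {
        effort: (statistics.mean(values), statistics.median(values))
        for effort, values in sorted(by_effort.items())
    }
```

The replicate ID is read but not used, since it only serves to add controlled noise in plots.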
Mates distribution: .npc file
Raw list with the number of reads in the dataset matching a query read. A set of query reads is randomly
drawn by Nonpareil (1,000 by default), and compared against all reads in the dataset. Each line on this
file corresponds to a query read (the order is not important). We have seen some correspondence between
these numbers and the distribution of abundances in the community (compared, for example, as rank-abundance
plots), but this file is provided only for quality-control purposes and comparisons with other tools.
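For a quick look at these numbers in rank-abundance form, a small Python sketch follows. It assumes one integer count per line; the function name `rank_abundance` is hypothetical.

```python
def rank_abundance(lines):
    """Return (rank, count) pairs from .npc lines, sorted from most to fewest matches.

    Assumption: each non-empty line holds a single integer count.
    """
    counts = sorted((int(line) for line in lines if line.strip()), reverse=True)
    return list(enumerate(counts, start=1))
```

Since the order of the lines is not important, sorting loses no information.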
Log: .npl file
A verbose log of internal Nonpareil processing. The number to the left (inside square brackets) indicates the
CPU time (in minutes). This file also provides a quality assessment of the Nonpareil run (automated consistency
evaluation). Ideally, the last line should read "Everything seems correct". Otherwise, it suggests
alternative parameters that may improve the estimation.
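A hedged Python sketch for checking the last log line automatically is shown below. The exact line layout is an assumption (a number in square brackets followed by the message, e.g. `[  1.25] Everything seems correct`), and the names `last_log_message` and `run_looks_consistent` are hypothetical.

```python
import re

# Assumed layout: "[<CPU minutes>] <message>"; lines that do not fit are ignored.
LOG_LINE = re.compile(r"^\s*\[\s*([0-9]+(?:\.[0-9]+)?)\s*\]\s*(.*)$")


def last_log_message(lines):
    """Return (cpu_minutes, message) for the last parseable log line, or None."""
    last = None
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match:
            last = (float(match.group(1)), match.group(2))
    return last


def run_looks_consistent(lines):
    """Return True if the last log line reports that everything seems correct."""
    last = last_log_message(lines)
    return last is not None and "Everything seems correct" in last[1]
```

If `run_looks_consistent` returns False, the log itself should be read, since it suggests alternative parameters.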
§ Nonpareil diversity
What is the Nonpareil diversity index?
We haven't yet published a formal description of the "diversity" reported by Nonpareil, which we internally refer
to as Nonpareil diversity. The Nonpareil diversity is a measurement of how complex a community is in "sequence
space". Graphically, the index is a measurement of "how much to the right the Nonpareil curve is".
How does Nonpareil diversity compare to other metrics?
Nonpareil diversity may correlate with classic bin-based diversity indices (like Shannon) if certain assumptions
are met. The key assumptions are:
the genome sizes are identically distributed between samples,
the gene content correlates well with the genetic distance, and
the duplicated genomic regions are negligible compared to the unique regions.
In our experience, these assumptions are met reasonably well in bacterial communities, but they are typically
violated in communities with a significant (e.g. >5%) eukaryotic fraction.
What's the expected range of the Nonpareil diversity index?
We've found that the Nonpareil diversity index ranks communities very reliably in terms of diversity, and can even
detect small seasonal variations in bacterial communities. For reference: an environment largely dominated by a
single bacterial species like human posterior fornix or the Acid Mine Drainage gets values around 15 to 16, human
stool samples around 17 to 19, freshwater and sandy soils around 20 to 22, and marine and other soils around 21 to
25. This index has a logarithmic scale, to allow meaningful comparisons between extreme cases, like single species
Can I use Nonpareil with PacBio reads?
Unfortunately, we're not planning on extending support of Nonpareil to PacBio, mainly for three reasons:
PacBio reads are very long. These reads are typically in the multi-kbp order, for which un-gapped
alignments are practically useless. This causes diversity overestimation and coverage underestimation.
Possible solutions would be marker extraction, short-read simulation, or even replacing the comparison
module. However, the first two violate the statistical assumptions of Nonpareil, and the third is beyond
the scope of the current release.
PacBio reads have very high error rates. The single-pass reads have error rates with median upwards
of 10%, which is way beyond Nonpareil's sensitivity. This causes diversity overestimation and coverage
underestimation. SMRT runs can achieve very low error rates, but this is typically associated with
lower yields and probably not ideal for metagenomes, which relates to:
PacBio yields very few reads. PacBio runs are usually in the order of hundreds of thousands of reads.
This is more than enough for scaffolding, but not for profiling of high- or intermediate-complexity communities.
In low-complexity communities, PacBio-based assemblies may result in nearly-complete genomes, so the approximation offered
by Nonpareil is rendered moot.