
dada2's People

Contributors

benjjneb, brendanf, claraqin, derele, dogrinev, dtenenba, fanli-gcb, fconstancias, gblanchard4, hpages, joey711, jwokaty, kayla-morrell, link-ny, mikemc, nturaga, vmikk, vobencha, wbazant, ycl6


dada2's Issues

Hand-off to phyloseq

After the class definition for the dada output is defined, we can define a hand-off method for phyloseq. Will need to figure out what part needs to be defined in dadac, and what part needs to be defined in phyloseq.

On first thought, the conversion to an "OTU_table"-like count matrix with a corresponding vector of reference sequences might already be sufficient.
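For concreteness, a minimal sketch of the receiving end, assuming the dada output has already been collapsed into a sample-by-sequence count matrix seqtab (the matrix itself is an assumption here, not existing dada2 output; the phyloseq calls are real API):

library(phyloseq)
# The exact sequence variants double as taxa names and reference sequences
ps <- phyloseq(otu_table(seqtab, taxa_are_rows = FALSE))
refs <- colnames(seqtab)  # corresponding vector of reference sequences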

Taxonomy assignment

It would be nice for people to be able to do their taxonomy assignment within R, and since it's not currently available, perhaps it should be added to the dada2 package.

Plots!

I love graphics. This package needs at least a few plot methods to help users visualize and understand the awesome results coming out of dada.

Additional trials on custom training data

I can paste-in here the experimental design, but that would be overkill. This refers to the training experiment that Ben and I forged and I conveyed to some lab team members at Second Genome. They were able to put together the experiment, and eventually run it on our Illumina MiSeq DNA sequencer. We have this data, and it needs to be evaluated.

Some of the data can even be included in the package as example test data, if appropriate.

I will attempt a first-pass trial of latest dada on this data, and then organize a data-share with @benjjneb for any further eval.

Class definitions for uniques and dada-output

Please comment your preferences on the class names O:-)... I was thinking "uniques" and "dada" would work.

This solves at least two problems:

  • Makes sure the user understands what their uniques list or dada list is, and controls some aspects of what can be done with it
  • When these objects get printed to screen, the console currently "explodes" because there's so much good stuff in there. That good stuff needs to be summarized in a useful way (see the sketch below)
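A purely hypothetical S4 sketch of the summarizing print behavior; the class name and slot are placeholders, not actual definitions:

setClass("dada", representation(clustering = "data.frame"))
setMethod("show", "dada", function(object) {
  # One-line summary instead of dumping the whole object to the console
  cat("dada-class object:", nrow(object@clustering),
      "sequence variants inferred\n")
})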

Generally fill-in missing documentation pages

I can do most of this.

There are often missing sections in documentation pages, including missing examples or output definitions. Most importantly, in some cases there are undefined or mismatched parameter definitions between the doc and the function.

Good idea to clean this up in general, and obviously necessary for Bioconductor.

BLAST, either interface, or detailed vignette using external dependency

For low numbers of denoising results, the following might work (against the NR database):

annotate::blastSequences 
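For a handful of sequences, a call might look something like this (a sketch; denoisedSeqs is a placeholder character vector, argument names per the annotate help page, and NCBI's web-service rate limits apply):

library(annotate)
# Query NR over the web; only sensible for a few sequences
hits <- blastSequences(denoisedSeqs[1], database = "nr",
                       program = "blastn", as = "data.frame")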

For larger datasets with hundreds or thousands of species-genotypes, this might not be feasible, and might even be dangerous, as 16S classification is usually done with more specialized approaches -- like Naïve Bayes on a custom reference database.

Unpaired forward and reverse reads after demultiplexing

"Strangely my problem here is taking the R1, R2, and index and
demultiplexing but keeping the R1 and R2 completely matched. I'm finding
that most tools seem to occasionally miscall a barcode if done separately
on the R1 and R2 files. SO by the time I demultiplex, I end up with fastq
files for R1 and R2 that are not complete matched. Close but not quite.
This is despite NO filtering prior to bringing the files in to R for dada2.

I scripted something in python as a "check" for synchrony but am finding it
is too slow considering I will often have 12+ seq runs with ~samples.

Suggestions? Again, trying to get from multiplexed R1, R2, and indexed into
1:1 fastq R1 and R2 to each sample.
Thanks
Pat"

Length variation in the dada2 output

First off, dada2 is great... I love software that just works with easy installation, and with a clear pipeline as outlined in the tutorial.

Second, sorry if this is the wrong place for this, since it isn't necessarily a software issue, but I have a question about results interpretation.

In the tutorial, there was a very tight spread in the lengths of sequences returned, and it mentions...

large length differences in the 16S region, especially after merging, are typically caused by alternate priming from non-specific primers. If this is not controlled for, this can produce false variation in the output.

Below is my output, and there is very large variation in length, although most of the sequences are between 230 and 235 nt.

table(nchar(colnames(seqtab)))
## 
## 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 
## 329   9  10 386  12  12   7   5   9   7   4   3  10   6  11   8  16   7 
## 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 
##   9   2   8   4  10  11   9   4   6   9  11   7   8   6   4   9   6   7 
## 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 
##  14   6   7  10   5   7   7   9   7  16   9   7   8   8  12   8   7   5 
## 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 
##   9  13   7   9  10   6  12  10  11   7   7  11  12   8   7  16  12  11 
## 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 
##  13   8  11   9   8  14  10  12  10  11  11  17  13  14  12  12  11  11 
## 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 
##   7  21  21  13  20  28  17  16  15  18  12  12  18  13  18  12  14  11 
## 338 339 340 341 342 343 344 345 346 347 348 349 
##  14  16  18  15  13  22  10  12  16  15  19  19

I am wondering how you recommend controlling or correcting for this potential variation? Should I just ignore all sequences longer than ~235 nt? My quality assessments were similar to those in the tutorial, and I trimmed to positions 10-240 for the forward and 10-150 for the reverse sequences.

Thanks,

Jeff
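For reference, one simple way to enact the "ignore everything longer than ~235 nt" option, assuming seqtab as above (a sketch, not an endorsed recommendation):

# Keep only columns whose sequences fall in the dominant 230-235 nt range
seqtab2 <- seqtab[, nchar(colnames(seqtab)) %in% 230:235]
table(nchar(colnames(seqtab2)))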

verbose = FALSE option in `dada`... and maybe option to save to a log file?

Printing them unconditionally to stdout is kind of annoying, and will clutter the vignette I was going to add, because I'll need to do all this cat- or sink-type wrapping.

FYI, best practice in R is to emit these types of messages via the message() function inside an if(verbose){} protected expression.

If they are C-derived messages making their way to standard out... maybe we need to optionally wrap the C call with something like sink?
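For the R-level case, a minimal sketch of the suggested pattern (illustrative function only; note that sink() captures R-level output, not printf output from C code, which would need handling on the C side):

runStep <- function(x, verbose = FALSE) {
  # message() writes to stderr and can be silenced with suppressMessages()
  if (verbose) message("Processing ", length(x), " items")
  x
}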

Retaining samples with one sequence in dada object

Currently dada stops with an error when a sample contains only one sequence (even when derep$Quals has been coerced to a matrix – see issue #56). As mentioned in my pull request (#56), an example where this happens is when NGS is used to detect different strains of pathogens using species (pathogen)-specific primer sets. If some samples (individuals) don't have a mixed infection, often only one sequence (of the targeted length) is returned when I run fastqFilter. Clearly I can add a couple of lines to my script to pass to dada only the samples that have more than one sequence.

I understand it makes no sense to try to estimate an error rate from a single sequence, but there may be reasons to retain (or later merge) a sample with only one sequence in a dada object. An obvious one: downstream analyses will start from the dada object, but the 'offending' sample will have been left out. In my example this is not desirable because the sequence was (possibly) not an error, but was caused by an individual not having a mixed infection. Also, if the offending sample is retained in the dada object, it is possible to pass the dada object to makeSequenceTable and easily verify whether the strain in the offending sample is present in other samples and/or conduct other analyses.
Ideally, one would like dada to process the samples, issuing a warning that samples x, y…z had only one sequence and no inference was possible, but that these were retained in the dada object.

Assuming that the sequence obtained in the samples with n=1 is actually a real sequence, I can see a practical workaround: remove the n=1 samples from the derep object, run dada, and then re-run dada on the full derep object (i.e. including the n=1 samples), passing the previously estimated error matrix to err and setting selfConsist=FALSE.
Alternatively, I could generate two dada objects, one from the samples with n>1 and a properly estimated error rate, and a second from the samples with n=1 using a zero error matrix, and then find a way to merge the two dada objects (which I have not yet properly investigated).

I thought I'd check in to see whether you see scope to include the change I'm suggesting in dada, and to get your opinion on a temporary workaround in the meantime.

Thanks!
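A sketch of the first workaround described above (untested; the err_out element name is a guess at where the converged error matrix would live):

# Estimate errors only on samples with more than one unique sequence
ok <- sapply(derep, function(d) length(d$uniques) > 1)
dd <- dada(derep[ok], err = inflateErr(tperr1, 3), selfConsist = TRUE)
errLearned <- dd[[1]]$err_out  # guessed slot holding the converged error matrix
# Re-run on all samples, including n=1, with the error matrix held fixed
ddAll <- dada(derep, err = errLearned, selfConsist = FALSE)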

Variable length amplicons

Currently, DADA2 requires equal length amplicons for input. Is there any chance that variable length sequences will be possible in future updates of DADA? I often work with fungal ITS data and would like to be able to apply a denoising approach such as this.

dada: "Error in strsplit(s1, "") : non-character argument"

Not sure where this is coming from, but dada isn't completing on several test fastq files that I have read in using derepFastqWithQual.

Error in strsplit(s1, "") : non-character argument

This error occurs whether or not I provide the qual vector. It's only been a few files so far, and I've gotten to as high as 6 clusters in any given run, but they're all failing with this same error.

Paired overlapping reads?

Just curious @benjjneb if you had a plan for this. I think I remember you saying that you did, but wanted to check. The training data does have paired reads available.

New consistency measure based on paired reads grouping

It is common with Illumina data to have paired-end reads. Sometimes overlapping, sometimes not, sometimes meant to overlap but sequence quality bad, reads shorter than planned, etc.

For the time being the approach is to execute DADA separately for each read, and then perhaps assemble denoised "genotypes" for which the contributing read-pairs vote strongly in favor.

In principle this could also be used as a consistency parameter during denoising that is independent of both the counts and the sequence identity. Since this would complicate things, I'll just throw it out there.

What would not complicate things is to evaluate this consistency measure after DADA2 has run on both read directions. Since this would be outside the denoising algorithm, it wouldn't slow anything down to add. I can take a stab at it, but I wanted to check whether you already have a prototype, or something in the package I have failed to notice.

In my preliminary testing so far, the reverse reads have fewer denoised sequences relative to the forward, understandably because the length is shorter and overall quality is worse. However, it should still be possible to evaluate a "cluster" consistency measure for the reads that contribute to a sequence that survived denoising in both directions. Denoised sequences that have inconsistent pairing in their contributing reads could be flagged for inspection, or at least reported in case a user later discovers that they care about that apparent genotype.

Thoughts?

Doubletons

My original post:
"Doubletons. Because of the normalization on the abundance p-value, doubletons get off easy. Their p-value is just the chance of getting two reads given 1, so their p-value ends up being 1 - exp(-lambda), the chance not to have none. In real-life contexts (e.g. my Cyano data), doubletons false negatives do seem to be an issue."

Joey then asked whether the error model deviates from real data. The answer is no: the error model itself isn't the issue; it's how decisions are made, based on that error model, about whether sequences may be due to errors. Given the way the abundance p-value works, doubletons are assigned a p-value by asking "given that they got one read, what is the chance of them having had two", which is the same as the chance of at least one additional read. This may not be so unlikely, even when the unconditional chance of two reads is itself very small.

Another way to look at this is that doubletons may have smallish singleton and abundance p-values which, in combination, make them unlikely to be errors; but they are not sufficiently unlikely under either one alone to be rejected, and a sequence can only be rejected as an error if it is significant under one or the other. Thus, a somewhat higher false-negative rate may be expected for them than for other kinds of errors. I don't know that there's an easy fix, but I think it would be nice to make this clear, and to have a script to look at their observed and expected distributions (or at least observed and expected numbers) post-clustering.
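Written out as code, the quantity being discussed (the lambda value is illustrative only):

lambda <- 0.1                     # expected error reads for this sequence (illustrative)
p_doubleton <- 1 - exp(-lambda)   # P(at least one more read) = P(X >= 1), X ~ Poisson(lambda)
# equivalently:
ppois(0, lambda, lower.tail = FALSE)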

loading DADA2

Error in unloadNamespace(package) :
namespace ‘S4Vectors’ is imported by ‘GenomeInfoDb’, ‘Biostrings’, ‘AnnotationDbi’, ‘IRanges’, ‘DESeq2’, ‘XVector’, ‘GenomicRanges’ so cannot be unloaded

DADA2 failing to compile on Windows 7/Ubuntu 14.04.3

Hello,

I've been trying to install DADA2 but am running into the same issue on both my own 64-bit desktop running Windows 7 and an Ubuntu machine we have here. That issue being:

g++ -m32 -I"C:/PROGRA~1/R/R-32~1.2/include" -DNDEBUG -I"C:/Program Files/R/R-3.2.2/library/Rcpp/include" -I"d:/RCompile/r-compiling/local/local320/include" -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize -O2 -Wall -mtune=core2 -c RcppExports.cpp -o RcppExports.o
g++.exe: error: unrecognized option '-R'
g++.exe: error: unrecognized option '-R'
make: *** [RcppExports.o] Error 1

I haven't seen this same issue in either the open or closed ones here.

Thanks

Feature Request: Concatenate Paired reads

Hi Team Dada,

I like the approach of dada2 and have followed through your tutorial. It looks very promising, even if I don't quite have my head wrapped around the error/partition decision (it's less intuitive than clustering).

I have a feature request that concerns paired-end reads that cannot be joined - say, from 700bp amplicons on a 2x300 MiSeq run. This might not be a great primer-design decision, but it's the data I have, and I'd like to treat the concatenated pairs as a single sequence. It seems that I should be able to run the error model and then simply concatenate the reads instead of merging them. Are there any plans to include this feature? I would be happy to take a stab at it if you like.

thanks,
zach cp
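A rough sketch of the idea (the clustering$sequence fields are guesses at the dada output layout, and matching which forward variant pairs with which reverse variant would still require the per-read map):

# Denoise each direction separately, then join with an N spacer
# instead of aligning an overlap
nSpacer <- paste(rep("N", 10), collapse = "")
concatSeqs <- paste0(dadaF$clustering$sequence, nSpacer,
                     dadaR$clustering$sequence)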

Best way to make dada2 parallel?

Hi Team Dada,

Is there a best practice for running dada2 in parallel? I know that it's one of the theoretical selling points, but I am not sure of the best way to make use of it. In the tutorial, for example, the derep and dada steps are held as lists, so it might be possible to parallelize there. However, dada accepts the list as a whole. If given a dataset too big to fit in memory, might it be possible to chunk the data and recombine only the output? I'd love to hear your opinions on this.

Thanks,
zach cp
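One possibility today, assuming samples can be processed independently: parallelize over the derep list with the parallel package and combine the per-sample outputs afterwards (a sketch, not an official recommendation; derepList and the err choice are placeholders):

library(parallel)
ddList <- mclapply(derepList,
                   function(d) dada(d, err = inflateErr(tperr1, 3)),
                   mc.cores = 4)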

dada2 fails to compile on 64 bit windows

I have tried several times to install dada2 into standalone R v3.1 or in RStudio, each on a Windows 10 machine (although it also failed on a Windows 8 machine). I appear to have all the dependencies installed but get a failure during compilation.

Here I am showing only the last part of the compilation, where things appear to fail:

C:/Users/patcseed/Documents/R/win-library/3.1/Rcpp/include/Rcpp/internal/Exporter.h:31:31: error: invalid conversion from 'SEXP' to 'long long unsigned int' [-fpermissive]
make: *** [RcppExports.o] Error 1
Warning: running command 'make -f "C:/PROGRA~1/R/R-3.1.3/etc/x64/Makeconf" -f "C:/PROGRA~1/R/R-3.1.3/share/make/winshlib.mk" SHLIB_LDFLAGS='$(SHLIB_CXXLDFLAGS)' SHLIB_LD='$(SHLIB_CXXLD)' SHLIB="dada2.dll" WIN=64 TCLBIN=64 OBJECTS="RcppExports.o Rmain.o cluster.o error.o evaluate.o misc.o nwalign_endsfree.o pval.o strmap.o"' had status 2
ERROR: compilation failed for package 'dada2'

  • removing 'C:/Users/patcseed/Documents/R/win-library/3.1/dada2'
    Error: Command failed (1)

I do not see this issue posted elsewhere.

Thank you.

installation failed -non-zero exit status

I am trying to install DADA2 in RStudio v0.99.467. I got the following message:
"installation of package ‘C:/Users/yugandhar.reddybs/Documents/R/dada2/dada2-master’ had non-zero exit status". I had downloaded the zip file and unzipped it into a folder, then specified the path to dada2-master.

Can someone kindly point out how to solve this?

Thanks.

Error: `fastqFilter` on a single forward-read-only file

Trying to run filtering on my forward-read file using fastqFilter:

 fastqFilter(fn = dilutReadFile,
             fout = filtFile,
             truncQ="#",
             truncLen = 0,
             trimLeft = 10,
             compress=TRUE, verbose=TRUE)
Error in .Call2("solve_user_SEW", refwidths, start, end, width, translate.negative.coord,  : 
  solving row 1: 'allow.nonnarrowing' is FALSE and the supplied end (42) is > refwidth

R check: "Compiled code should not call entry points which might terminate R..."

This serious-sounding formal warning arises from a run of R CMD check on dadac. We will need to fix these before submission to Bioconductor. Side note: these are occasionally annoying to fix, but more often fixing them is helpful for long-term maintenance.

Here is the warning that was thrown:

* checking compiled code ... WARNING
File ‘dadac/libs/dadac.so’:
  Found ‘exit’, possibly from ‘exit’ (C)
    Objects: ‘cluster.o’, ‘misc.o’, ‘uniques.o’
  Found ‘putchar’, possibly from ‘putchar’ (C)
    Objects: ‘Rmain.o’, ‘misc.o’
  Found ‘puts’, possibly from ‘printf’ (C), ‘puts’ (C)
    Objects: ‘cluster.o’, ‘error.o’, ‘misc.o’, ‘nwalign_endsfree.o’,
      ‘pval.o’, ‘uniques.o’

Compiled code should not call entry points which might terminate R nor
write to stdout/stderr instead of to the console, nor the system RNG.

See ‘Writing portable packages’ in the ‘Writing R Extensions’ manual.

Allow Q scores > 40

Right now the dada(...) function can only handle Q scores from 0 to 40. It would be preferable to make the range of Q scores user-definable, as Q scores > 40 come up in various protocols.
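Until then, the stop-gap used elsewhere in this thread is to cap observed scores before calling dada (assuming a derep object with a $quals matrix, as in the examples here):

# Clamp Q scores above 40 down to 40 (loses a little information)
derep$quals[derep$quals > 40] <- 40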

Memory Allocation Fail

This is on an Amazon EC2 c3.8xlarge instance (60 GB RAM).

strain-dilut-forward-filt210.fastq.gz
Dereplicating sequence entries in Fastq file: strain-dilut-forward-filt210.fastq.gz
..Encountered 759100 unique sequences from 2773361 total sequences read.
Running DADA...
Sample 1 - 2773361 reads in 759100 unique sequences.
Error: Memory allocation failed!

This is from the dilution experiment I'm about to share. It's a large but far from unrealistic dataset size. Any thoughts? Do I need to tweak some R parameter to change the memory ceiling? Something else?

Cheers. I'll see if I can send an S3 link soon for you to download the properly-labeled but otherwise unfiltered/untrimmed data. By properly-labeled, I mean every sequence has the true t__NNNNNN strain ID. I will also share an RData file that holds a table with lots of related information about each strain, including their true full-length 16S rRNA sequences.

Hope this helps! The mem-alloc error is definitely where I have to pause and ask you for help, so a good time to make the data handoff.

The reason I also labeled this as an enhancement is that anything dadac::dada can do to hand-hold the user regarding their system's memory would be helpful here. How much memory (very roughly) should I expect to need, given some properties of the uniques object? If the code execution is pointless given the available memory, dada should stop and report an informative error message before attempting the computation.
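An illustrative pre-flight check of the kind being requested; the cost model below is a made-up placeholder, not dada's actual memory profile:

estimateDadaMem <- function(nUnique, bytesPerPair = 16) {
  # Placeholder cost model: assume something pairwise over the uniques
  est <- as.numeric(nUnique)^2 * bytesPerPair
  message(sprintf("Rough estimate: %.1f GiB needed", est / 2^30))
  invisible(est)
}
estimateDadaMem(759100)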

Include several empirical error matrices.

Might it make sense to have a few different options for the initialization matrix (maybe even a user-input option), and to use the latest MiSeq matrix from the recent Quince-group NAR article as the default?

Consensus Sequences

Consensus sequences are not updated dynamically (during clustering) as of the most recent version of the MATLAB code -- I don't know if Ben has changed this in the C. In practice it only matters for very small clusters with singletons and doubletons. Otherwise, the ad hoc procedure of using the most abundant sequence should essentially always be identical to the result of any true consensus-sequence update.

Although Susan is right that in general EM converges, the reason convergence is not guaranteed here has to do with the fact that the alignments, which produce the new consensus sequences, penalize indels, while the error model, which is responsible for the reshuffling of sequences between clusters, depends only on substitutions. So they're sort of operating under different likelihoods.

Specifically, what may happen is: some singleton joins a cluster. In so doing, it pushes the consensus sequence towards itself (in terms of alignment score), which results in many fewer indels but, occasionally, MORE substitutions. These substitutions are all that matters to the error model, so the singleton now decides to move back to its original cluster, after which the consensus sequence returns to its original value. I have painfully detailed documentation of one such event from a couple of years ago, if anyone is interested.

Large Size (7MB) of `inst/extdata`

This balloons the package size; the directory is 7MB or so. Do we still need all those uniques files? Fine, if so.

I gzipped them and updated system path definitions in examples/vignettes.

Real data?

I don't seem to find real data examples, and the data should also be explicitly named, not just example1, example2...
I was able to load "test-nonunique.fastq.gz", but the data that appear in the paper should also be included in the package.

Expected memory usage

Hi,

I am attempting to run dada2 on a single sample fastq file (1043802 seqs, length ~300). I'm running this on a single node machine with 32 cores and 256GB of ram.

At this point the job has been running for over 14 hours and is using 42GB of RAM. Having run this on other, smaller datasets, this is surprising: those usually finished within a few minutes, even datasets with over 100,000 sequences.

A link to the code can be found here:

Is this runtime/memory usage typical?

error with USE_QUALS = FALSE

I noticed a weird pattern in the error estimates for some of my samples and decided to compare the substitution rates estimated with and without the Phred scores (the reasoning being that, if the vast majority of substitutions happen during PCR rather than sequencing, the Phred score is irrelevant). However, I get the following error:

dadas <- dada(derep, err=inflateErr(tperr1,3), USE_QUALS=FALSE)

Sample 1 - 40323 reads in 21169 unique sequences.

*** caught segfault ***
address (nil), cause 'memory not mapped'

Traceback:
1: .Call("dada2_dada_uniques", PACKAGE = "dada2", seqs, abundances, err, quals, score, gap, use_kmers, kdist_cutoff, band_size, omegaA, use_singletons, omegaS, max_clust, min_fold, min_hamming, use_quals, qmin, qmax, final_consensus, verbose)
2: dada_uniques(names(derep[[i]]$uniques), unname(derep[[i]]$uniques), err, qi, opts[["SCORE_MATRIX"]], opts[["GAP_PENALTY"]], opts[["USE_KMERS"]], opts[["KDIST_CUTOFF"]], opts[["BAND_SIZE"]], opts[["OMEGA_A"]], opts[["USE_SINGLETONS"]], opts[["OMEGA_S"]], opts[["MAX_CLUST"]], opts[["MIN_FOLD"]], opts[["MIN_HAMMING"]], opts[["USE_QUALS"]], opts[["QMIN"]], opts[["QMAX"]], opts[["FINAL_CONSENSUS"]], opts[["VERBOSE"]])
3: dada(derep, err = inflateErr(tperr1, 3), USE_QUALS = FALSE)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

Parameters not defined in `derepFastqWithQual`

Parameters are not currently defined in function doc.

derepFastqWithQual(fl, n = 1e+06, verbose = FALSE)

I can see that n is the chunk size for file streaming, but no one else would know that easily...

Just noting so that we don't forget to fully document this and all other user-facing functions.
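A sketch of the missing entries in roxygen2 style (wording is a guess based on the behavior described above):

#' @param fl Path(s) to the fastq file(s) to dereplicate.
#' @param n The number of records to read per chunk while streaming the
#'   file; affects peak memory use, not the result. Default 1e6.
#' @param verbose If TRUE, print status messages to the console.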

Extract fasta file containing denoised, merged reads for all samples in tutorial

I just read the Dada2 paper and worked through the tutorial. Very impressive work and software! I look forward to including Dada2 into the workflow for several of our current projects.

I apologize if this is the wrong place to ask this, as it isn't a software issue, but is there a direct way to extract all the denoised, merged sequences for all samples in the tutorial into a single fasta file? As they are exact sequences, I suppose I could create such a file by multiplying seq x abundance for each sample from the final matrix (seqtab), but is there a more direct way to access this, perhaps from the mergers.nochim list? My interest is to run this fasta file through a PICRUSt workflow, if that seems valid.

I was also curious on what you think might be the best approach to assign taxonomy to the Dada2 sequences given they are exact sequences and not fuzzy OTUs?

Thanks in advance for any thoughts you might have on this and again for developing a fantastic approach and software.
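One direct route, assuming seqtab from the tutorial (where the column names are the exact sequences); this records abundance in the fasta header rather than repeating each sequence:

totals <- colSums(seqtab)  # summed abundance of each sequence variant
fasta <- paste0(">sq", seq_along(totals), ";size=", totals, "\n",
                names(totals))
writeLines(fasta, "denoised_merged.fasta")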

Errors when following DADA2 tutorial after sample inference

Hello!

When following the dada2 pipeline tutorial, after sample inference, inspecting dadaFs and dadaRs does not give the expected info on inferred sample sequences etc., but rather a list of unique sequences and their respective counts. When I then try to plot errors, I get the error message

Error in dq$trans : $ operator is invalid for atomic vectors

and no plots. When I then try to identify chimeras, I get a further error:

Identified 3 bimeras out of 130 input sequences.
Identified 3 bimeras out of 130 input sequences.
Error in getUniques(unqs) :
Unrecognized format: Requires named integer vector, dada-class, derep-class, sequence matrix, or a data.frame with $sequence and $abundance columns.

and bimFs and bimRs are not created. The only prior error I get is when creating the initial error plots:

Error: Input/Output
no input files found
dirPath: /users/Desktop/test/NA
pattern: character(0)

but this does not seem to hamper plot drawing or any other downstream operations up until the point mentioned above.
I'm using R 3.2.2, RStudio 0.99.473, dada2 0.10.1, ShortRead 1.26.0, and ggplot2 2.0.0.

Do you have any idea what might be going wrong?
Thanks
Paul

Diagnostic Scripts

My original post was: "There should be a set of diagnostic scripts that compare error distributions computed from DADA's error model (and inferred parameters) with the inferred error distributions from the clustering itself. These could include for a start one and two-away abundances."

I actually think this could be one of the coolest features of an R package, given the nice plotting and the ability to interface with other packages. To spell it out a bit more clearly, DADA produces two things: (1) an error model, and (2) a clustering of the data. Because this is a fully generative error model, we can go forward and ask whether the clustered data really looks, in various ways, like what the error model should produce. A simple example is to ask, under the error model, how many sequences with one error and r reads we expect. We then plot this against a histogram of the actual numbers of errors called by the clustering with one error and r reads. We then do the same with two-aways. We can then switch and ask, for errors with two reads, how many we expect for various ranges of \lambda under the error model, and plot this against the observed number in the clustered data. Looking at a bunch of plots of this kind lets users see whether the error model is working, and where deviations may be occurring; it's a powerful diagnostic, which I think isn't even possible in any other denoising tool out there. I'm happy to help brainstorm other useful plots we could include at some point.
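A toy, self-contained illustration of the first diagnostic (the Poisson rate is fixed here; a real version would plug in DADA's inferred error rates):

set.seed(1)
lambda <- 0.5                            # illustrative error-production rate
nVariants <- 1000
obs <- rpois(nVariants, lambda)          # simulated reads per one-away variant
expected <- nVariants * dpois(0:5, lambda)
cbind(reads    = 0:5,
      observed = as.vector(table(factor(obs, levels = 0:5))),
      expected = round(expected, 1))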

ShortRead imports more than necessary

The ShortRead package is importing more than necessary, which adds some time and excess messages during package loading.

Is it possible to import just part of a package? Joey?
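Yes: selective imports in the NAMESPACE file (or @importFrom tags if using roxygen2). For example, in NAMESPACE (the specific functions listed here are placeholders for whatever dada2 actually uses):

importFrom(ShortRead, readFastq, writeFastq, FastqStreamer)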

454/Pyrosequencing functionality

Currently 454 data is not properly supported because the homopolymer gapping option is still unimplemented and the singleton p-value is broken.

BioC Check Errors, Warnings. These block BioC acceptance

When I run package check on my machine, I get the following summary:

Status: 1 ERROR, 3 WARNINGs, 4 NOTEs

notes okay

The NOTEs can be ignored. CRAN actually doesn't allow NOTEs, but this is one area BioC has been lenient on. It may depend on the type of note, so if it's obviously fixable, do it. Otherwise, I would procrastinate on it.

warnings, errors bad

Error

* checking examples ... ERROR
Running examples in ‘dada2-Ex.R’ failed
The error most likely occurred in:

> base::assign(".ptime", proc.time(), pos = "CheckExEnv")
> ### Name: plot_substitutions
> ### Title: Plot Substitution Pairs from DADA Result
> ### Aliases: plot_substitutions
> 
> ### ** Examples
> 
> # Examples here.
> testFile = system.file("extdata", "test-nonunique.fastq.gz", package="dada2")
> test1 = derepFastq(testFile, verbose = TRUE)
Dereplicating sequence entries in Fastq file: /Users/joeymcmurdie/github/dada2.Rcheck/dada2/extdata/test-nonunique.fastq.gz
Encountered 126 unique sequences from 209 total sequences read.
> test1$quals[test1$quals > 40] <- 40
> res1 <- dada(uniques = test1$uniques, quals = test1$quals,
+              err = dada2:::inflateErr(tperr1, 2),
+              OMEGA_A = 1e-40,
+              USE_QUALS = TRUE,
+              err_function = loessErrfun,
+              self_consist = TRUE)
Error in dada2:::inflateErr(tperr1, 2) : 
  object 'inflateNonSubs' not found
Calls: dada -> <Anonymous>
Execution halted

warnings

Warning 1

* checking Rd \usage sections ... WARNING
Undocumented arguments in documentation object 'C_eval_pair'
  ‘s1’ ‘s2’

Undocumented arguments in documentation object 'C_get_overlaps'
  ‘s1’ ‘s2’ ‘allow’ ‘max_shift’

Undocumented arguments in documentation object 'C_pair_consensus'
  ‘s1’ ‘s2’

Undocumented arguments in documentation object 'dada_uniques'
  ‘seqs’ ‘abundances’ ‘err’ ‘quals’ ‘score’ ‘gap’ ‘use_kmers’
  ‘kdist_cutoff’ ‘band_size’ ‘omegaA’ ‘use_singletons’ ‘omegaS’
  ‘max_clust’ ‘min_fold’ ‘min_hamming’ ‘use_quals’ ‘qmin’ ‘qmax’
  ‘final_consensus’ ‘verbose’

Undocumented arguments in documentation object 'derepFastq'
  ‘qeff’

Undocumented arguments in documentation object 'evaluate_kmers'
  ‘kmer_size’ ‘band’

Undocumented arguments in documentation object 'isBimeraDenovo'
  ‘verbose’

Undocumented arguments in documentation object 'isShiftDenovo'
  ‘verbose’

Undocumented arguments in documentation object 'mergePairs'
  ‘keep’ ‘align’

Undocumented arguments in documentation object 'showErrors'
  ‘...’

Functions with \usage entries need to have the appropriate \alias
entries, and all their arguments documented.
The \usage entries must correspond to syntactically valid R code.
See chapter ‘Writing R documentation files’ in the ‘Writing R
Extensions’ manual.

Warning 2

* checking for missing documentation entries ... WARNING
Undocumented code objects:
  ‘as.uniques’ ‘checkConvergence’ ‘filterNs’ ‘getIll’ ‘hamming’
  ‘isHit100’ ‘isOneOff’ ‘mergeUniques’ ‘nwalign’ ‘nweval’ ‘nwhamming’
  ‘rc’ ‘showSubPos’ ‘strdiff’ ‘subseqUniques’ ‘tperr1’
Undocumented data sets:
  ‘tperr1’
All user-level objects in a package should have documentation entries.
See chapter ‘Writing R documentation files’ in the ‘Writing R
Extensions’ manual.

Warning 3

* checking whether package ‘dada2’ can be installed ... [41s/42s] WARNING
Found the following significant warnings:
  Warning: replacing previous import by ‘data.table::tables’ when loading ‘dada2’
  Warning: replacing previous import by ‘Biostrings::BStringSet’ when loading ‘dada2’
  Warning: replacing previous import by ‘Biostrings::DNAStringSet’ when loading ‘dada2’
See ‘/Users/joeymcmurdie/github/dada2.Rcheck/00install.out’ for details.

I'm working on this one. It is related to Issue 28 (the importing-too-much issue), which I have begun to fix but not yet finished.

@benjjneb I would suggest you tackle all the object documentation warnings and the error, and I'll see if I can resolve the import issue after I've dialed down the dependency imports. Sound good?

Error matrix initialization options

Is dadac able to share estimated error rates across samples? For instance, the collection of inputs could be defined as a list of uniques vectors that all have some justification for expecting similar error distributions (e.g. same prep and biological replicate). I believe when we last talked about it, the idea was a final "check" or final update of the error-rate matrix after convergence for each individual sample. There is probably a more sophisticated way to include this?

Similarly, I noticed a relatively generic-sounding initialization error matrix. Might it make sense to have a few different options for the initialization matrix (maybe even a user-input option), and to use the latest MiSeq matrix from the recent Quince-group NAR article as the default?

Documentation error: `derepFastq`

This one I know how to fix. Just documenting it here.

* checking examples ... ERROR
Running examples in ‘dadac-Ex.R’ failed
The error most likely occurred in:

> ### Name: derepFastq
> ### Title: Read and Dereplicate a Fastq file.
> ### Aliases: derepFastq
> 
> ### ** Examples
> 
# Test that chunk-size, `n`, does not affect the result.
testFile = system.file("extdata", "test-nonunique.fastq.gz", package="dadac")
test1 = dereplicateFastqReads(testFile, verbose = TRUE)
Error: could not find function "dereplicateFastqReads"

My issues with DADA

A few things that have always nagged me:

  1. Cluster consensus sequences. As written in MATLAB, consensus sequences weren't updated. Instead, the most abundant sequence was taken as the consensus. This was to avoid strange looping behaviors in which alleles would move between clusters, altering the consensus sequences and destabilizing the current clustering. So to go beyond fixed consensus sequences (which primarily matters for small groups of singletons), something needs to be done to ensure looping doesn't happen. This could involve having an overall likelihood function that rejects any move (a consensus update, a reshuffle, etc.) that decreases it.
  2. Doubletons. Because of the normalization on the abundance p-value, doubletons get off easy. Their p-value is just the chance of getting two reads given 1, so their p-value ends up being 1 - exp(-lambda), the chance not to have none. In real-life contexts (e.g. my Cyano data), doubletons false negatives do seem to be an issue.
  3. Correct statistics under an early-stage PCR model.
  4. There should be a set of diagnostic scripts that compare error distributions computed from DADA's error model (and inferred parameters) with the inferred error distributions from the clustering itself. These could include for a start one and two-away abundances.
