nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2

Home Page: https://nf-co.re/ampliseq

License: MIT License

HTML 0.49% R 3.94% Python 5.13% Nextflow 88.75% Shell 1.22% Groovy 0.27% CSS 0.20%
16s 18s amplicon-sequencing edna illumina iontorrent its metabarcoding metagenomics microbiome nextflow nf-core pacbio pipeline qiime2 rrna taxonomic-classification taxonomic-profiling workflow

ampliseq's People

Contributors

a4000, apeltzer, asafpr, colindaven, d4straub, danclaytondev, danilodileo, dariader, diegobrambilla, drpatelh, emnilsson, erikrikarddaniel, ewels, ggabernet, johnne, jtangrot, kevinmenden, lokeshbio, matthewjm96, maxulysse, nf-core-bot, philpalmer, sateeshperi, sminot, tillenglert, vaulot, vsmalladi

ampliseq's Issues

Add demultiplexing step

Hi,
A very helpful feature to add would be demultiplexing of the reads as an optional step. This functionality already exists in QIIME2, so it should be possible to add it to the rrna-ampliseq pipeline.
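For illustration, a minimal sketch of what such an optional step could run, assuming EMP-format multiplexed paired-end input and a metadata column holding the barcodes (file and column names below are placeholders):

  # demultiplex with q2-demux, writing one per-sample artifact
  qiime demux emp-paired \
    --i-seqs multiplexed-seqs.qza \
    --m-barcodes-file sample-metadata.tsv \
    --m-barcodes-column BarcodeSequence \
    --o-per-sample-sequences demux.qza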

Add PICRUSt2 analysis

PICRUSt2 pipeline to get EC, KO, and MetaCyc pathway predictions based on 16S data

Requirements:
- the QIIME2 version needs an upgrade to >2018.8
- the container has to include PICRUSt2 for QIIME2

Edit: Use PICRUSt2 outside of QIIME2: now that the pipeline uses DSL2 it relies on Biocontainers, and the QIIME2 Biocontainer does not include PICRUSt2 by default. Running PICRUSt2 independently of QIIME2 also makes it possible to skip QIIME2 and use the DADA2 output directly.
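A sketch of the standalone invocation, assuming the DADA2 ASV sequences and feature table have first been exported to FASTA and BIOM (file names are placeholders):

  # run the full PICRUSt2 pipeline: place ASVs in the reference tree, then
  # predict EC, KO and MetaCyc pathway abundances
  picrust2_pipeline.py \
    -s rep-seqs.fna \
    -i feature-table.biom \
    -o picrust2_output \
    -p 4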

Provide list with citations in docs/

Make a document that lists all citations that should be acknowledged when running the pipeline, such as FastQC, MultiQC, DADA2, QIIME2, ANCOM, whichever q2 modules were involved, ...

Enhancement: add quality control options

Hi there,

Here's another new feature request: adding quality control options.

As part of quality control, negative (blank extraction/library) and positive (mock) controls are often processed together with the samples. Allowing input of the mock composition would therefore help to evaluate the reliability of the whole pipeline (wet and dry lab), which can be implemented with the q2-quality-control plugin. For the use of negative controls, the feature table with taxonomy can be generated and exported; users would then supply a tsv file containing the feature IDs of contaminant sequences for filtering in the Nextflow pipeline. Alternatively, users could provide a "Sample_DNA_concentration.tsv" file, which can be used to identify contaminant sequences with tools like the decontam package. The sample DNA is ideally measured via qPCR using universal primers targeting the 16S rRNA gene, which provides accurate quantification of bacterial DNA from host-associated samples.
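As a sketch of the mock-evaluation part, assuming the expected mock composition and the observed mock feature table are available as relative-frequency artifacts (file names are placeholders):

  # compare the observed mock community against its known composition
  qiime quality-control evaluate-composition \
    --i-expected-features mock-expected.qza \
    --i-observed-features mock-observed.qza \
    --o-visualization mock-comparison.qzv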

Cheers,
Yanxian

unzip: cannot find zipfile directory in one of Silva_132_release.zip

Hi, I ran into the issue shown below. I am running the standard Docker profile. Any suggestions for solving this?

[2d/5ce92b] Submitted process > make_SILVA_132_16S_classifier (1)
[ac/9130c2] Submitted process > metadata_category_all (1)
ERROR ~ Error executing process > 'make_SILVA_132_16S_classifier (1)'

Caused by:
Process make_SILVA_132_16S_classifier (1) terminated with an error exit status (9)

Command executed:

unzip -qq Silva_132_release.zip

fasta="SILVA_132_QIIME_release/rep_set/rep_set_16S_only/99/silva_132_99_16S.fna"
taxonomy="SILVA_132_QIIME_release/taxonomy/16S_only/99/consensus_taxonomy_7_levels.txt"

if [ "false" = "true" ]; then
    sed 's/#//g' $taxonomy >taxonomy-99_removeHash.txt
    taxonomy="taxonomy-99_removeHash.txt"
    echo "######## WARNING! The taxonomy file was altered by removing all hash signs!"
fi

### Import
qiime tools import --type 'FeatureData[Sequence]' \
    --input-path $fasta \
    --output-path ref-seq-99.qza
qiime tools import --type 'FeatureData[Taxonomy]' \
    --source-format HeaderlessTSVTaxonomyFormat \
    --input-path $taxonomy \
    --output-path ref-taxonomy-99.qza

#Extract sequences based on primers
qiime feature-classifier extract-reads \
    --i-sequences ref-seq-99.qza \
    --p-f-primer ACTCCTACGGGAGGCAGCA \
    --p-r-primer GGACTACHVGGGTWTCTAAT \
    --o-reads ACTCCTACGGGAGGCAGCA-GGACTACHVGGGTWTCTAAT-99-ref-seq.qza \
    --quiet

#Train classifier
qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads ACTCCTACGGGAGGCAGCA-GGACTACHVGGGTWTCTAAT-99-ref-seq.qza \
    --i-reference-taxonomy ref-taxonomy-99.qza \
    --o-classifier ACTCCTACGGGAGGCAGCA-GGACTACHVGGGTWTCTAAT-99-classifier.qza \
    --quiet

Command exit status:
9

Command output:
(empty)

Command error:
[Silva_132_release.zip]
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of Silva_132_release.zip or
Silva_132_release.zip.zip, and cannot find Silva_132_release.zip.ZIP, period.
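This "End-of-central-directory" error usually indicates a truncated or interrupted download rather than a broken archive upstream. A quick check before re-running (sketch):

  # test the archive's integrity; if it fails, resume the download
  unzip -t Silva_132_release.zip || \
    wget -c https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip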

Report stats after taxa filtering (commit ee4425b)

Run count_table_filter_stats.py in main.nf after the process filter_taxa to report how many counts were filtered. The current output is a table printed to stdout, which might need improvement for larger experiments.

change params.metadata to an optional parameter because it isn't always required

Make params.metadata not strictly required, because:

  • params.metadata is "only" required for processes:
    barplot
    alpha_rarefaction
    diversity_core
    alpha_diversity
    metadata_category_all
    metadata_category_pairwise
    beta_diversity
    beta_diversity_ordination
    ancom

  • so, in turn, it isn't required with --untilQ2import or --onlyDenoising, or if none of the above processes are executed.

Read Input Parameter as Folder

params.reads should specify only the folder; from it, only files matching *_L001_R{1,2}_001.fastq.gz are chosen (QIIME2 PE input requires the *_L001_R{1,2}_001.fastq.gz format, so this naming is required anyway).

Bonus tasks (later make individual issues out of this ticket...)

  • Bonus: multiple sequencing runs (DOES NOT WORK yet!)
  • Analysis of multiple sequencing runs:
    • input are multiple folders, one per sequencing run (a data folder with multiple subfolders (a, b, c), each containing sequencing read files and a metadata file with column new_name)
    • params.metadata is still required and contains the merged metadata for all samples, with an ID column containing the new_name values of all metadata files in the subfolders
    • trimming on each folder
    • qiime_import on each trimming folder
    • dada_multi on each qiime_import folder
    • mergeDADA to combine all sequencing runs, then continue as usual.
  • Dream: final report
    • make some sort of final report, like an extended MultiQC report?
    • interesting output (everything in params.outdir) and the MultiQC/Cutadapt report (% pairs retained) could get a link/representation in the final report?
    • .html reports (e.g. from alpha_diversity, beta_diversity, beta_diversity_ordination, ancom) could get a link/representation in the final report?

diversity related processes only run for one channel element

Diversity-related processes produce results only for the first of several files in the input channel "qiime_diversity_core_for_*", because there is only one element in the second input channel "ch_metadata_for_*". (A Nextflow queue channel with a single element is consumed after the first task; turning it into a value channel, e.g. with .first(), would let it be reused for every element of the other channel.)
This is true for the following processes:
alpha_diversity
beta_diversity
beta_diversity_ordination

Enhancement: more quality control options

Hi Alex,

Thank you for developing the Nextflow pipeline for fully reproducible analysis of 16S rRNA amplicon data.

I have some new feature requests:

  1. Enable filtering singletons and low-prevalence features (those present in only one sample) from the feature table. If samples were denoised on a per-sample basis, filtering singletons is probably a good option, as explained by the DADA2 developers here. Filtering low-prevalence features can also further reduce false sequence variants, as suggested in the QIIME2 forum.

  2. Add quality control options. As part of quality control, negative (blank extraction/library) and positive (mock) controls are often processed together with the samples. Allowing input of the mock composition would therefore help to evaluate the reliability of the whole pipeline, which can be implemented with the q2-quality-control plugin. For the negative control, the feature table with taxonomy can be generated and exported; users would then supply a tsv file containing the feature IDs of contaminant sequences for filtering in the Nextflow pipeline. Alternatively, users could provide a "Sample_DNA_concentration.tsv" file, which can be used to identify contaminant sequences with tools like the decontam package.

Regards,
Yanxian

Keep certain key .qza files

Always publish the following files in the results folder:
table.qza
rep-seqs.qza
taxonomy.qza
rooted-tree.qza

Set up NF-Core Syncing

As this was just created using nf-core create, we shouldn't have any bigger issues setting things up properly here!

Write Documentation

We should document all parts of the pipeline, and some of the reasoning behind it, in Markdown accompanying the usage docs.
A detailed explanation of what each report means in general should follow too.

Read "Manifest" file format?

Hi,
There is another way to import single- and paired-end demultiplexed sequences into QIIME2 besides the Casava format, one that is not picky about names. The "Manifest" file format (https://docs.qiime2.org/2018.11/tutorials/importing/?highlight=manifest#fastq-manifest-formats) is a text file that imposes no limitations on file naming. It is up to users to prepare it, but I think that with some coding the preparation of manifest files can be automated, which would enhance the pipeline's flexibility in accepting input files. Another option would be for the pipeline to accept both Casava file formats and manifest files manually created by users. This way, downstream analyses like taxonomy or alpha diversity would benefit from having clearer ID names.
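A sketch of what an auto-generated manifest and its import could look like, using the CSV layout from the linked 2018-era docs (paths are placeholders; newer QIIME2 releases renamed --source-format to --input-format):

  # manifest.csv
  sample-id,absolute-filepath,direction
  sample-1,/data/run1/s1_R1.fastq.gz,forward
  sample-1,/data/run1/s1_R2.fastq.gz,reverse

  qiime tools import \
    --type 'SampleData[PairedEndSequencesWithQuality]' \
    --input-path manifest.csv \
    --source-format PairedEndFastqManifestPhred33 \
    --output-path demux.qza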

Pipeline error

Hi,

I tried to run the nextflow pipeline for a small dataset but encountered an error.

The commands I used:

nextflow run nf-core/rrna-ampliseq \
  -profile standard,docker \
  -name "test1" \
  -r 1.0.0 \
  --reads '/home/nutrition_group/desktop/data/Yanxian/misc/beta-conglycinin/16s/casava-18-paired-end-demultiplexed' \
  --untilQ2import  \
  --Q2imported  \
  --FW_primer GTGCCAGCMGCCGCGGTAA \
  --RV_primer GGACTACHVGGGTWTCTAAT \
  --trunclenf 239 \
  --trunclenr 230 \
  --retain_untrimmed \
  --metadata "/home/nutrition_group/desktop/data/Yanxian/misc/beta-conglycinin/16s/metadata.tsv"\
  --metadata_category "Diet,Compartment" \
  --exclude_taxa "mitochondria,chloroplast" \
  --outdir "/home/nutrition_group/desktop/data/Yanxian/misc/beta-conglycinin/16s/nextflow/" \
  --email "[email protected]" \
  --max_memory '16.GB' \
  --max_cpus 12 

The error message:

ERROR ~ No signature of method: static nextflow.Channel.fromFile() is applicable for argument types: (org.codehaus.groovy.runtime.GStringImpl) values: [true]
Possible solutions: from([Ljava.lang.Object;), from(java.util.Collection), fromPath(java.lang.Object)

 -- Check script 'main.nf' at line: 129 or see '.nextflow.log' file for more details

My java version:

java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)

Regards,
Yanxian

Sample name issues for parsing reads in QIIME

There seems to be a read-name parsing problem in my situation.
[screenshot omitted]
My command is:

nextflow run ampliseq \
 --reads "/data3/zqf/16S_Anaysis/rawdata/28892494"  \
 --FW_primer GTGYCAGCMGCCGCGGTAA --RV_primer GGACTACNVGGGTWTCTAAT \
 --metadata "/data3/zqf/16S_Anaysis/rawdata/Metadata.tsv"\
 --outdir /data3/zqf/16S_Anaysis/results/28892494/ \
 -profile docker --genomes greengenes -resume

My read files are named like 2095566_L001_R1_001.fastq.gz.
Does the sample name need a _ inside?
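The Casava 1.8 convention the importer expects is SampleName_SXX_L001_R1_001.fastq.gz, so the file above appears to be missing the sample-number segment. A hypothetical rename to work around it (assumes every file in the folder needs the same fix):

  # insert a placeholder "_S1" segment so the parser can split the name
  for f in *_L001_R*_001.fastq.gz; do
    mv "$f" "${f%%_L001_*}_S1_L001_${f#*_L001_}"
  done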

Process "trimming" doesnt provide required input for process "qiime_import"

Process "qiime_import" expects as "--input-path $trimmed" a path/to/folder containing all files with trimmed reads. The files with trimmed reads need to follow the naming scheme "*_L001_R{1,2}_001.fastq.gz".

Problems - process "trimming":

  • it outputs files with the naming scheme "*_L001_R{1,2}_001.fastq.gz.trimmed" instead of "*_L001_R{1,2}_001.fastq.gz"
  • the output files are not collected in a folder that is accessible via the "qiime_import" parameter "--input-path $trimmed"
  • publishing all trimmed files would create significant data overhead

Problems - process "qiime_import":

  • "--input-path $trimmed" is a channel containing all files from process "trimming" instead of a folder containing all these files

Support for analysing multiple sequencing runs

Currently, only data from one sequencing run can be analyzed comfortably.

Starting point:

  • input are multiple folders, one per sequencing run, specified as --reads "folder1,folder2,..,folderN"
  • each input folder contains sequencing read files of samples named e.g. "sample1, ... , sampleN"
  • params.metadata points to metadata for all samples of all sequencing runs with sample IDs such as "folder1-sample1"

Idea:

  • trimming and QC on all files
  • qiime_import on each input folder
  • dada_multi on each qiime_import folder
  • sample IDs have to be renamed to e.g. "folder1-sample1" to avoid overlap
  • mergeDADA to combine all sequencing runs, then continue as usual.

This is part of #13.

Problem running offline

I'm running the pipeline on an offline cluster. I downloaded the repository and the Singularity image.
I initialize the script with: export NXF_OFFLINE='TRUE'

Running as
nextflow run path/to/ampliseq -with-singularity "path/to/img"

I got this warning:

WARN: Unable to stage foreign file: https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip (try 1) -- Cause: Connection timed out (Connection timed out)

Can I predownload the archive and place it somewhere?
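A sketch of that approach; the parameter name below is an assumption, so check nextflow.config for the option that actually holds this URL:

  # on a machine with internet access:
  wget https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip
  # then point the pipeline at the local copy instead of the URL:
  nextflow run path/to/ampliseq -with-singularity "path/to/img" \
    --reference_database "$PWD/Silva_132_release.zip"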

Temp File Handling

All files in params.outdir are valuable output; all files in params.temp_dir are temporary files that can be hidden and are only needed when resuming, e.g. with --Q2imported.

Enhancement: allow more feature table filtering options

Hi there,

I have a new feature request: enable filtering singletons and low-prevalence features (those present in only one sample) from the feature table.

According to the DADA2 developers, if samples were denoised on a per-sample basis (as in QIIME2 2018.11), filtering singletons is probably a good option, as these singletons are more likely artifacts than real biological sequence variants. However, if samples were denoised using the pooling or pseudo-pooling mode in future QIIME2 releases, filtering singletons would be invalid. Relevant discussions can be found here.

Filtering low-prevalence features can further reduce false sequence variants, as suggested in the QIIME2 forum.
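Both filters are available in q2-feature-table; a sketch (thresholds are examples only):

  # drop singletons (total frequency < 2) and features seen in only one sample
  qiime feature-table filter-features \
    --i-table table.qza \
    --p-min-frequency 2 \
    --p-min-samples 2 \
    --o-filtered-table table-filtered.qza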

Cheers,
Yanxian

Dependency hell...

Had to polish the environment.yaml quite a few times, unfortunately:

  • ncurses requires 5.9 but doesn't specify where to pull it from; prefixing it as conda-forge::ncurses=5.9 solved that issue

Similar things then happen when running qiime_import:

ampliseq-1.0dev/lib/python3.5/site-packages/matplotlib/backends/qt_editor/figureoptions.py", line 20, in <module>
      import matplotlib.backends.qt_editor.formlayout as formlayout
    File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/matplotlib/backends/qt_editor/formlayout.py", line 54, in <module>
      from matplotlib.backends.qt_compat import QtGui, QtWidgets, QtCore
    File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/matplotlib/backends/qt_compat.py", line 140, in <module>
      from PyQt5 import QtCore, QtGui, QtWidgets
  ImportError: libGL.so.1: cannot open shared object file: No such file or directory
  
  An unexpected error has occurred:
  
    libGL.so.1: cannot open shared object file: No such file or directory
  
  See above for debug info.

Work dir:
  /home/alex/IDEA/nf-core/rrna-ampliseq/work/be/06192acf88d8197b617ef9e3f8d064

Apparently the installed Qt library does not ship the required libGL.so.1 shared object.
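A possible workaround (an assumption, not verified here): force a headless matplotlib backend so the Qt/libGL import path is never touched inside the container:

  # set before invoking any qiime command in the container
  export MPLBACKEND=Agg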

No more space error on classifier

Current path   : /home/alex/IDEA/nf-core/rrna-ampliseq
Script dir     : /home/alex/IDEA/nf-core/rrna-ampliseq
Config Profile : test,docker
=========================================
[warm up] executor > local
[8f/1bdbbf] Cached process > get_software_versions
[23/9a0ff7] Cached process > output_documentation
[08/bc4aca] Cached process > metadata_category_all (1)
[f1/6e41af] Cached process > metadata_category_pairwise (1)
[e0/1388c4] Cached process > fastqc (1a_S103)
[14/1981e3] Cached process > trimming (1a_S103)
[3e/3d4ff6] Cached process > trimming (2a_S115)
[f2/c6524d] Cached process > fastqc (2a_S115)
[40/c66b2f] Cached process > trimming (1_S103)
[7f/99415d] Cached process > fastqc (1_S103)
[46/12c4bd] Cached process > fastqc (2_S115)
[1a/047962] Cached process > trimming (2_S115)
[a0/8ebb9c] Cached process > qiime_import
[75/0fc607] Cached process > qiime_demux_visualize
[12/3e642e] Cached process > multiqc
[0a/954abc] Cached process > dada_trunc_parameter
[ec/712c2c] Cached process > dada_single
[3f/19e7c9] Submitted process > classifier (1)
ERROR ~ Error executing process > 'classifier (1)'

Caused by:
  Process `classifier (1)` terminated with an error exit status (1)

Command executed:

  qiime feature-classifier classify-sklearn \
      --i-classifier GTGYCAGCMGCCGCGGTAA-GGACTACNVGGGTWTCTAAT-classifier.qza \
      --p-n-jobs "-1" \
      --i-reads rep-seqs.qza \
      --o-classification taxonomy.qza \
      --verbose

  qiime metadata tabulate \
      --m-input-file taxonomy.qza \
      --o-visualization taxonomy.qzv \
      --verbose

  #produce "taxonomy/taxonomy.tsv"
  qiime tools export taxonomy.qza \
      --output-dir taxonomy

  qiime tools export taxonomy.qzv \
      --output-dir taxonomy

Command exit status:
  1

Command output:
  (empty)

Command error:
  WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
  Traceback (most recent call last):
    File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/q2cli/commands.py", line 274, in __call__
      results = action(**arguments)
    File "<decorator-gen-292>", line 2, in classify_sklearn
    File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/qiime2/sdk/action.py", line 232, in bound_callable
      output_types, provenance)
    File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/qiime2/sdk/action.py", line 367, in _callable_executor_
      output_views = self._callable(**view_args)
    File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/q2_feature_classifier/classifier.py", line 215, in classify_sklearn
      confidence=confidence)
    File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/q2_feature_classifier/_skl.py", line 45, in predict
      for chunk in _chunks(reads, chunk_size)) for m in c)
    File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py", line 789, in __call__
      self.retrieve()
    File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py", line 699, in retrieve
      self._output.extend(job.get(timeout=self.timeout))
    File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/multiprocessing/pool.py", line 644, in get
      raise self._value
    File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/multiprocessing/pool.py", line 424, in _handle_tasks
      put(task)
    File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/sklearn/externals/joblib/pool.py", line 371, in send
      CustomizablePickler(buffer, self._reducers).dump(obj)
    File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/sklearn/externals/joblib/pool.py", line 240, in __call__
      for dumped_filename in dump(a, filename):
    File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 484, in dump
      NumpyPickler(f, protocol=protocol).dump(value)
    File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/pickle.py", line 408, in dump
      self.save(obj)
    File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 278, in save
      wrapper.write_array(obj, self)
    File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 93, in write_array
      pickler.file_handle.write(chunk.tostring('C'))
  OSError: [Errno 28] No space left on device
  
  Plugin error from feature-classifier:
  
    [Errno 28] No space left on device
  
  See above for debug info.

Work dir:
  /home/alex/IDEA/nf-core/rrna-ampliseq/work/3f/19e7c9e31a840d983cf984979ce1bc

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details
Execution cancelled -- Finishing pending tasks before exit
[nf-core/rrna-ampliseq] Pipeline Complete
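Since classify-sklearn pickles large arrays to the default temp directory via joblib, one likely fix is to point temporary files at a disk with more room before re-running (a sketch; the path is a placeholder):

  export TMPDIR=/path/with/more/space
  export JOBLIB_TEMP_FOLDER="$TMPDIR"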

Remove SE mode for now

Also, none of the --singleEnd options would work at the moment: qiime import expects paired-end data, and we would need to implement single-end processing in several steps.

Container for entire pipeline

One container for the whole pipeline (remove lines such as "singularity exec ${params.qiimeimage} ", change "~/PROGRAMS/FastQC/fastqc", adjust params.qiimeimage = "$baseDir/qiime2_2018.6.simg").

Export dada2 report

The DADA2 report might be valuable for troubleshooting. It includes information such as whether convergence was reached when calculating the run-specific sequencing error model.

publish demux.qza when --untilQ2import

change

process qiime_import {
    publishDir "${params.outdir}/demux", mode: 'copy',
        saveAs: {params.keepIntermediates ? filename : null}

to

process qiime_import {
    publishDir "${params.outdir}/demux", mode: 'copy',
        saveAs: {params.untilQ2import ? filename : null}

related to #55

Taxonomy classification raises a ValueError for some datasets. This is a known but unsolved bug in QIIME2.

There is a ValueError when specific sequences in a sample hit specific taxonomies of the classifier:
ValueError: CategoricalMetadataColumn does not support strings with leading or trailing whitespace characters:

As long as this issue isn't solved in QIIME2, a way to analyze these datasets anyway will be implemented.

The idea is to modify the file with taxonomy strings; most likely, hash signs in this file are causing the issue.
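A hypothetical pre-processing step along those lines, applied to the taxonomy file before import:

  # strip hash signs and any trailing whitespace from the taxonomy strings
  sed -e 's/#//g' -e 's/[[:space:]]*$//' consensus_taxonomy_7_levels.txt \
    > taxonomy_clean.txt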

Replace Bash Loops

Bash for-loops seem inefficient in: RelativeAbundanceReducedTaxa, alpha_diversity, beta_diversity, beta_diversity_ordination, ancom.

Missing "unzip" in the silva classifier step

Process make_SILVA_132_16S_classifier slipped through our test runs.
A test command for fast processing, e.g.:
nextflow run rrna-ampliseq -profile test,singularity --classifier false --dereplication 90

from #33
