nf-core/ampliseq
Amplicon sequencing analysis workflow using DADA2 and QIIME2
Home Page: https://nf-co.re/ampliseq
License: MIT License
Current path : /home/alex/IDEA/nf-core/rrna-ampliseq
Script dir : /home/alex/IDEA/nf-core/rrna-ampliseq
Config Profile : test,docker
=========================================
[warm up] executor > local
[8f/1bdbbf] Cached process > get_software_versions
[23/9a0ff7] Cached process > output_documentation
[08/bc4aca] Cached process > metadata_category_all (1)
[f1/6e41af] Cached process > metadata_category_pairwise (1)
[e0/1388c4] Cached process > fastqc (1a_S103)
[14/1981e3] Cached process > trimming (1a_S103)
[3e/3d4ff6] Cached process > trimming (2a_S115)
[f2/c6524d] Cached process > fastqc (2a_S115)
[40/c66b2f] Cached process > trimming (1_S103)
[7f/99415d] Cached process > fastqc (1_S103)
[46/12c4bd] Cached process > fastqc (2_S115)
[1a/047962] Cached process > trimming (2_S115)
[a0/8ebb9c] Cached process > qiime_import
[75/0fc607] Cached process > qiime_demux_visualize
[12/3e642e] Cached process > multiqc
[0a/954abc] Cached process > dada_trunc_parameter
[ec/712c2c] Cached process > dada_single
[3f/19e7c9] Submitted process > classifier (1)
ERROR ~ Error executing process > 'classifier (1)'
Caused by:
Process `classifier (1)` terminated with an error exit status (1)
Command executed:
qiime feature-classifier classify-sklearn --i-classifier GTGYCAGCMGCCGCGGTAA-GGACTACNVGGGTWTCTAAT-classifier.qza --p-n-jobs "-1" --i-reads rep-seqs.qza --o-classification taxonomy.qza --verbose
qiime metadata tabulate --m-input-file taxonomy.qza --o-visualization taxonomy.qzv --verbose
#produce "taxonomy/taxonomy.tsv"
qiime tools export taxonomy.qza --output-dir taxonomy
qiime tools export taxonomy.qzv --output-dir taxonomy
Command exit status:
1
Command output:
(empty)
Command error:
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
Traceback (most recent call last):
File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/q2cli/commands.py", line 274, in __call__
results = action(**arguments)
File "<decorator-gen-292>", line 2, in classify_sklearn
File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/qiime2/sdk/action.py", line 232, in bound_callable
output_types, provenance)
File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/qiime2/sdk/action.py", line 367, in _callable_executor_
output_views = self._callable(**view_args)
File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/q2_feature_classifier/classifier.py", line 215, in classify_sklearn
confidence=confidence)
File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/q2_feature_classifier/_skl.py", line 45, in predict
for chunk in _chunks(reads, chunk_size)) for m in c)
File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py", line 789, in __call__
self.retrieve()
File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py", line 699, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/multiprocessing/pool.py", line 644, in get
raise self._value
File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/multiprocessing/pool.py", line 424, in _handle_tasks
put(task)
File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/sklearn/externals/joblib/pool.py", line 371, in send
CustomizablePickler(buffer, self._reducers).dump(obj)
File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/sklearn/externals/joblib/pool.py", line 240, in __call__
for dumped_filename in dump(a, filename):
File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 484, in dump
NumpyPickler(f, protocol=protocol).dump(value)
File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/pickle.py", line 408, in dump
self.save(obj)
File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 278, in save
wrapper.write_array(obj, self)
File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 93, in write_array
pickler.file_handle.write(chunk.tostring('C'))
OSError: [Errno 28] No space left on device
Plugin error from feature-classifier:
[Errno 28] No space left on device
See above for debug info.
Work dir:
/home/alex/IDEA/nf-core/rrna-ampliseq/work/3f/19e7c9e31a840d983cf984979ce1bc
Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
-- Check '.nextflow.log' file for details
Execution cancelled -- Finishing pending tasks before exit
[nf-core/rrna-ampliseq] Pipeline Complete
use bin/count_table_minmax_reads.py to replace bash code in processes alpha_rarefaction and diversity_core
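The replacement helper can stay small; here is a minimal sketch of what bin/count_table_minmax_reads.py could do, assuming a TSV export of the feature table with features as rows and samples as columns (the exact export layout is an assumption):

```python
import csv
import io

def count_table_minmax_reads(tsv_text):
    """Return (min, max) of per-sample read totals from a feature table.

    Assumes a TSV with features as rows and samples as columns, where the
    first column holds feature IDs (hypothetical layout for illustration).
    """
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header, data = rows[0], rows[1:]
    # Sum counts per sample column, skipping the feature-ID column.
    totals = [sum(float(r[i]) for r in data) for i in range(1, len(header))]
    return min(totals), max(totals)

table = "#OTU ID\tsampleA\tsampleB\nasv1\t10\t5\nasv2\t20\t30\n"
print(count_table_minmax_reads(table))  # (30.0, 35.0)
```

alpha_rarefaction and diversity_core would then call this once instead of repeating the bash arithmetic.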
Currently, only data from one sequencing run can be analyzed comfortably.
Starting point:
Idea:
This is part of #13.
Update to reflect all changes.
As this was just created using nf-core create, we shouldn't have bigger issues in setting things up properly here!
Singularity transparently mounts required directories into the container using features such as OverlayFS.
This is nice when it comes down to temporary files, but problematic because we need to configure the matplotlib backend to "Agg" so that QIIME2 does not use QT5 (see #25 for details why); this should be set a bit more appropriately.
https://matplotlib.org/users/customizing.html#the-matplotlibrc-file
All files in params.outdir are valuable output; all files in params.temp_dir are temporary files that can be hidden and are only needed when resuming, e.g. with --Q2imported.
params.reads should only specify the folder, but only files like *_L001_R{1,2}_001.fastq.gz are chosen (QIIME2 PE input requires *_L001_R{1,2}_001.fastq.gz format, so this is required anyway)
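The selection step could be sketched as follows (the folder-scanning logic is hypothetical; only the *_L001_R{1,2}_001.fastq.gz pattern comes from the QIIME2 requirement above):

```python
import fnmatch

def select_casava_reads(folder_files):
    """From a list of file names, keep only those matching the Casava
    naming *_L001_R{1,2}_001.fastq.gz that QIIME2 PE import expects."""
    patterns = ("*_L001_R1_001.fastq.gz", "*_L001_R2_001.fastq.gz")
    return sorted(f for f in folder_files
                  if any(fnmatch.fnmatch(f, p) for p in patterns))

files = ["1a_S103_L001_R1_001.fastq.gz", "1a_S103_L001_R2_001.fastq.gz",
         "notes.txt", "1a_S103.trimmed.fastq.gz"]
print(select_casava_reads(files))
# ['1a_S103_L001_R1_001.fastq.gz', '1a_S103_L001_R2_001.fastq.gz']
```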
When choosing too high or too low dada2 truncation values (manually or automatically), all reads of a sample can get lost. This results in a cryptic error message, better report that properly and exit.
fastqc doesn't work, see attachment
fastqc_command.log
Hi there,
I have a new feature request: enable filtering singletons and low prevalent features (only present in one sample) from the feature table.
According to the DADA2 developers, if samples were denoised on a per-sample basis (as in QIIME2-2018.11), filtering singletons is probably a good option, as these singletons are likely artifacts rather than real biological sequence variants. However, if the samples were denoised using the pooling or pseudo-pooling mode in future QIIME2 releases, filtering singletons would be invalid. Relevant discussions can be found here.
Filtering low-prevalent features can further reduce false sequence variants as suggested in the QIIME2 forum.
Cheers,
Yanxian
always publish the following files in the result folder:
table.qza
rep-seqs.qza
taxonomy.qza
rooted-tree.qza
Hi,
I tried to run the nextflow pipeline for a small dataset but encountered an error.
The commands I used:
nextflow run nf-core/rrna-ampliseq \
-profile standard,docker \
-name "test1" \
-r 1.0.0 \
--reads '/home/nutrition_group/desktop/data/Yanxian/misc/beta-conglycinin/16s/casava-18-paired-end-demultiplexed' \
--untilQ2import \
--Q2imported \
--FW_primer GTGCCAGCMGCCGCGGTAA \
--RV_primer GGACTACHVGGGTWTCTAAT \
--trunclenf 239 \
--trunclenr 230 \
--retain_untrimmed \
--metadata "/home/nutrition_group/desktop/data/Yanxian/misc/beta-conglycinin/16s/metadata.tsv" \
--metadata_category "Diet,Compartment" \
--exclude_taxa "mitochondria,chloroplast" \
--outdir "/home/nutrition_group/desktop/data/Yanxian/misc/beta-conglycinin/16s/nextflow/" \
--email "[email protected]" \
--max_memory '16.GB' \
--max_cpus 12
The error message:
ERROR ~ No signature of method: static nextflow.Channel.fromFile() is applicable for argument types: (org.codehaus.groovy.runtime.GStringImpl) values: [true]
Possible solutions: from([Ljava.lang.Object;), from(java.util.Collection), fromPath(java.lang.Object)
-- Check script 'main.nf' at line: 129 or see '.nextflow.log' file for more details
My java version:
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
Regards,
Yanxian
better solution for small helper processes: qiime_existing_demux, use_existing_classifier, dada_trunc_parameter (second script), skip_filter_taxa, metadata_category_all (second script)
We should probably get rid of the rrna- prefix and just do ampliseq.
A data folder with multiple subfolders (a, b, c), each containing sequencing read files and a metadata file with column new_name.
params.metadata is still required and contains the merged metadata for all samples, with an ID column containing the new_name values of all metadata files in the subfolders.
Run qiime_import on each trimming folder, dada_multi on each qiime_import folder, then mergeDADA to combine all sequencing runs, and continue as usual.
params.outdir and the multiqc/cutadapt report (% pairs retained) could get a link/representation in the final report?
.html reports (e.g. from alpha_diversity, beta_diversity, beta_diversity_ordination, ancom) could get a link/representation in the final report?
Process "qiime_import" expects as "--input-path $trimmed" a path/to/folder containing all files with trimmed reads. The files with trimmed reads need to follow the naming scheme "*_L001_R{1,2}_001.fastq.gz".
Problems - process "trimming":
Problems - process "qiime_import":
Run count_table_filter_stats.py in main.nf after process filter_taxa to report how many counts were filtered. Current output is a table to stdout, that might need improvement for larger experiments.
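As a rough illustration of the improved report (per-sample totals as plain dicts are a stand-in for the two exported count tables; the real script would parse the QIIME2 exports):

```python
def filter_stats(before, after):
    """Summarize how many counts the filter_taxa process removed per sample.

    `before` and `after` map sample IDs to total read counts; both are
    hypothetical in-memory stand-ins for the exported count tables."""
    lines = ["sample\tbefore\tafter\tlost"]
    for sample in sorted(before):
        kept = after.get(sample, 0)
        lines.append(f"{sample}\t{before[sample]}\t{kept}\t{before[sample] - kept}")
    return "\n".join(lines)

print(filter_stats({"1a_S103": 1000, "2a_S115": 800},
                   {"1a_S103": 950, "2a_S115": 800}))
```

A tab-separated summary like this scales better to larger experiments than raw stdout tables.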
bash for-loops seem inefficient: RelativeAbundanceReducedTaxa, alpha_diversity, beta_diversity, beta_diversity_ordination, ancom
Hi there,
Here's another new feature request: adding quality control options.
As part of the quality control, negative (blank extraction/library) and positive (mock) controls are often processed together with the samples. Therefore, allowing input of the mock composition will help to evaluate the reliability of the whole pipeline (wet and dry lab), which can be implemented via the q2-quality-control plugin. For the use of negative controls, the feature table with taxonomy can be generated and exported. Users will then need to supply a TSV file containing the feature IDs of contaminant sequences for filtering in the Nextflow pipeline. Alternatively, users can provide a "Sample_DNA_concentration.tsv" file, which can be used to identify contaminant sequences with tools like the decontam package. The sample DNA is ideally measured via qPCR using universal primers targeting the 16S rRNA gene, which provides accurate quantification of bacterial DNA from host-associated samples.
Cheers,
Yanxian
Make a document that lists all citations that should be acknowledged when running the pipeline.
Such as FastQC, MultiQC, DADA2, QIIME2, ANCOM, q2-modules that were involved, ...
make nice looking log with stuff written currently to stdout ("echo" etc.)
one container for whole pipeline (remove lines such as "singularity exec ${params.qiimeimage} ", change "~/PROGRAMS/FastQC/fastqc", adjust params.qiimeimage = "$baseDir/qiime2_2018.6.simg")
Hi Alex,
Thank you for developing the Nextflow pipeline to do fully reproducible analysis of 16S rRNA amplicon data.
I have some new feature requests:
Enable filtering singletons and low prevalent features (only present in one sample) from the feature table. If samples were denoised on a per sample basis, filtering singletons is probably a good option as explained by the DADA2 developers here. Filtering low-prevalent features can also further reduce false sequence variants as suggested in the QIIME2 forum.
Add quality control options. As part of the quality control, negative (blank extraction/library) and positive (mock) controls are often processed together with the samples. Therefore, allowing input of the mock composition will help to evaluate the reliability of the whole pipeline, which can be implemented via the q2-quality-control plugin. For the negative control, the feature table with taxonomy can be generated and exported. Users will then need to supply a TSV file containing the feature IDs of contaminant sequences for filtering in the Nextflow pipeline. Alternatively, users can provide a "Sample_DNA_concentration.tsv" file, which can be used to identify contaminant sequences with tools like the decontam package.
Regards,
Yanxian
I'm running the pipeline in an offline cluster. I downloaded the repository and the singularity image.
I initialize the script with: export NXF_OFFLINE='TRUE'
Running as
nextflow run path/to/ampliseq -with-singularity "path/to/img"
I got this warning:
WARN: Unable to stage foreign file: https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip (try 1) -- Cause: Connection timed out (Connection timed out)
Can I pre-download the archive and place it somewhere?
Had to polish the environment.yaml quite a bunch of times, unfortunately:
conda-forge::ncurses=5.9
solved that issue. Similar things happen when running qiime_import:
ampliseq-1.0dev/lib/python3.5/site-packages/matplotlib/backends/qt_editor/figureoptions.py", line 20, in <module>
import matplotlib.backends.qt_editor.formlayout as formlayout
File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/matplotlib/backends/qt_editor/formlayout.py", line 54, in <module>
from matplotlib.backends.qt_compat import QtGui, QtWidgets, QtCore
File "/opt/conda/envs/nf-core-rrna-ampliseq-1.0dev/lib/python3.5/site-packages/matplotlib/backends/qt_compat.py", line 140, in <module>
from PyQt5 import QtCore, QtGui, QtWidgets
ImportError: libGL.so.1: cannot open shared object file: No such file or directory
An unexpected error has occurred:
libGL.so.1: cannot open shared object file: No such file or directory
See above for debug info.
Work dir:
/home/alex/IDEA/nf-core/rrna-ampliseq/work/be/06192acf88d8197b617ef9e3f8d064
Apparently the installed qt library does not provide the required libGL.so.1 shared object.
change params.metadata as not strictly required, because:
params.metadata is "only" required for processes:
barplot
alpha_rarefaction
diversity_core
alpha_diversity
metadata_category_all
metadata_category_pairwise
beta_diversity
beta_diversity_ordination
ancom
so in turn it isn't required if --untilQ2import or --onlyDenoising or if all of the above are not executed.
Indicate '.+_.+_L[0-9][0-9][0-9]_R[12]_001.fastq.gz' as required format.
To prevent errors as in #56
Preferably, check that regex before even starting the analysis.
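A minimal pre-flight check along these lines (a sketch; the actual integration point in main.nf is not decided here):

```python
import re

# Required naming scheme from above; anchored at the end of the file name.
READ_NAME = re.compile(r'.+_.+_L[0-9][0-9][0-9]_R[12]_001\.fastq\.gz$')

def check_read_names(filenames):
    """Return files that do NOT match the required naming scheme, so the
    pipeline can fail early with a clear message instead of a cryptic
    downstream error."""
    return [f for f in filenames if not READ_NAME.match(f)]

bad = check_read_names(["1a_S103_L001_R1_001.fastq.gz",
                        "2095566_L001_R1_001.fastq.gz"])
print(bad)  # ['2095566_L001_R1_001.fastq.gz'] - name lacks a second '_' field
```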
test with real data on binac
Process make_SILVA_132_16S_classifier slipped our test runs.
Test command for fast processing e.g.:
nextflow run rrna-ampliseq -profile test,singularity --classifier false --dereplication 90
from #33
--metadata seems to require an absolute path instead of a relative path
Should have a look at this one here:
https://github.com/bbarad/matplotlibrc
Super easy to adapt that for the pipeline - simply take the matplotlibrc there, adapt it to our needs and ship it with ampliseq ;-)
Also, all --singleEnd options would not work at the moment: qiime import expects paired-end data, so we would need to implement single-end processing in several steps.
expand sanity check for input values, also --metadata_category could be verified: subset of metadata_category_all
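The subset check itself is simple; a sketch assuming the metadata columns have already been parsed into a list:

```python
def check_metadata_category(metadata_category, metadata_columns):
    """Verify that every requested --metadata_category is a column of the
    metadata file (i.e. a subset of what metadata_category_all would find).
    Returns the unknown categories; an empty list means the input is sane."""
    requested = [c.strip() for c in metadata_category.split(",") if c.strip()]
    return [c for c in requested if c not in metadata_columns]

unknown = check_metadata_category("Diet,Compartment,Tank",
                                  ["ID", "Diet", "Compartment"])
print(unknown)  # ['Tank']
```

The pipeline could exit with the list of unknown categories before submitting any process.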
Get Testdata from daniel :-)
Upload Dataset to https://github.com/nf-core/test-datasets
Use them in the CI testing!
Polish CI tests to work for sample data
change
process qiime_import {
publishDir "${params.outdir}/demux", mode: 'copy',
saveAs: {params.keepIntermediates ? filename : null}
to
process qiime_import {
publishDir "${params.outdir}/demux", mode: 'copy',
saveAs: {params.untilQ2import ? filename : null}
related to #55
Hi, I went into an issue as shown below. I am running the docker standard profile. Any suggestions on solving this issue?
[2d/5ce92b] Submitted process > make_SILVA_132_16S_classifier (1)
[ac/9130c2] Submitted process > metadata_category_all (1)
ERROR ~ Error executing process > 'make_SILVA_132_16S_classifier (1)'
Caused by:
Process `make_SILVA_132_16S_classifier (1)` terminated with an error exit status (9)
Command executed:
unzip -qq Silva_132_release.zip
fasta="SILVA_132_QIIME_release/rep_set/rep_set_16S_only/99/silva_132_99_16S.fna"
taxonomy="SILVA_132_QIIME_release/taxonomy/16S_only/99/consensus_taxonomy_7_levels.txt"
if [ "false" = "true" ]; then
sed 's/#//g' $taxonomy >taxonomy-99_removeHash.txt
taxonomy="taxonomy-99_removeHash.txt"
echo "
######## WARNING! The taxonomy file was altered by removing all hash signs!"
fi
### Import
qiime tools import --type 'FeatureData[Sequence]' --input-path $fasta --output-path ref-seq-99.qza
qiime tools import --type 'FeatureData[Taxonomy]' --source-format HeaderlessTSVTaxonomyFormat --input-path $taxonomy --output-path ref-taxonomy-99.qza
#Extract sequences based on primers
qiime feature-classifier extract-reads --i-sequences ref-seq-99.qza --p-f-primer ACTCCTACGGGAGGCAGCA --p-r-primer GGACTACHVGGGTWTCTAAT --o-reads ACTCCTACGGGAGGCAGCA-GGACTACHVGGGTWTCTAAT-99-ref-seq.qza --quiet
#Train classifier
qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads ACTCCTACGGGAGGCAGCA-GGACTACHVGGGTWTCTAAT-99-ref-seq.qza --i-reference-taxonomy ref-taxonomy-99.qza --o-classifier ACTCCTACGGGAGGCAGCA-GGACTACHVGGGTWTCTAAT-99-classifier.qza --quiet
Command exit status:
9
Command output:
(empty)
Command error:
[Silva_132_release.zip]
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of Silva_132_release.zip or
Silva_132_release.zip.zip, and cannot find Silva_132_release.zip.ZIP, period.
Could you have a look at that script and fix it in bin?
define compute resources for each process, also see params.tree_cores and params.diversity_cores
We should document all parts of the pipeline and some of the reasoning behind it in markdown, accompanying the usage.
Also a detailed explanation what which reports mean in general should follow too.
Add support for multiple reference databases.
Infos for path to zipped files (db_zip), path to fasta and taxonomy file in zip file are in "conf/ref_databases.config"
PICRUSt2 pipeline to get EC, KO, and MetaCyc pathway predictions based on 16S data
Requirements:
- qiime2 version needs upgrade to >2018.8
- container has to include PICRUSt2 for QIIME2
Edit: Use PICRUSt2 outside of QIIME2, because now with DSL2 the pipeline uses biocontainers and the QIIME2 biocontainer does not contain PICRUSt by default. Also, using PICRUSt2 independently of QIIME2 allows skipping QIIME2 and using DADA2 output directly.
There is a ValueError when specific sequences in a sample relate to specific taxonomies of the classifier:
ValueError: CategoricalMetadataColumn does not support strings with leading or trailing whitespace characters:
As long as this issue isn't solved in QIIME2, a way to analyze these data sets anyway will be implemented.
The idea is to modify the file with taxonomy strings; most likely hash signs in this file are causing the issue.
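A sketch of such a taxonomy-file cleanup (the two-column TSV layout and the exact characters to strip are assumptions):

```python
import csv
import io

def clean_taxonomy(tsv_text):
    """Strip leading/trailing whitespace and hash signs from taxonomy
    strings so QIIME2's CategoricalMetadataColumn accepts them. This is a
    sketch of the proposed workaround, not the pipeline's actual code."""
    out = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        out.append([field.strip().replace("#", "") for field in row])
    return "\n".join("\t".join(r) for r in out)

dirty = "asv1\t D_0__Bacteria; D_1__#Firmicutes \nasv2\tD_0__Bacteria\n"
print(clean_taxonomy(dirty))
```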
Hi,
A very helpful feature to add would be demultiplexing of the reads as an optional step. This function is already available in QIIME2 and, as such, it should be possible to add it to the rrna-ampliseq pipeline.
Diversity related processes produce results only for the first of several files in the input channel "qiime_diversity_core_for_*", because there is only one element in the second input channel "ch_metadata_for_*".
This is true for the following processes:
alpha_diversity
beta_diversity
beta_diversity_ordination
The DADA2 report might be valuable for troubleshooting. This report includes information such as whether convergence was reached when calculating the run-specific sequencing error model.
There seems to be a read-name parsing problem in my situation.
My command is
nextflow run ampliseq \
--reads "/data3/zqf/16S_Anaysis/rawdata/28892494" \
--FW_primer GTGYCAGCMGCCGCGGTAA --RV_primer GGACTACNVGGGTWTCTAAT \
--metadata "/data3/zqf/16S_Anaysis/rawdata/Metadata.tsv" \
--outdir /data3/zqf/16S_Anaysis/results/28892494/ \
-profile docker --genomes greengenes -resume
My read files are named like 2095566_L001_R1_001.fastq.gz.
Does the sample name need a _ inside?
Hi,
There is another way to import single- and paired-end demultiplexed sequences in QIIME2 aside from the Casava format that is not picky with names. The "Manifest" file format (https://docs.qiime2.org/2018.11/tutorials/importing/?highlight=manifest#fastq-manifest-formats) is a text file that does not pose limitations to file naming. It is up to users to prepare it but I think that with some coding the preparation of manifest files can be automatized and thus can enhance the flexibility of the pipeline in accepting input files. Another way could also be that the pipeline would accept both casava file formats and manifest files manually created by users. This way, downstream analyses like taxonomy or alpha diversity would benefit from having clearer ID names.
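Automating the manifest preparation could look roughly like this (the _R1/_R2 pairing rule is an assumption; the CSV header follows the QIIME2 2018.11 fastq manifest format):

```python
def make_manifest(read_files):
    """Build a QIIME2 paired-end fastq manifest (CSV columns: sample-id,
    absolute-filepath, direction) from arbitrary file names, pairing on a
    hypothetical _R1/_R2 marker so input is not limited to Casava naming."""
    lines = ["sample-id,absolute-filepath,direction"]
    for path in sorted(read_files):
        name = path.rsplit("/", 1)[-1]
        if "_R1" in name:
            direction, sample = "forward", name.split("_R1")[0]
        elif "_R2" in name:
            direction, sample = "reverse", name.split("_R2")[0]
        else:
            continue  # skip files that are not paired reads
        lines.append(f"{sample},{path},{direction}")
    return "\n".join(lines)

print(make_manifest(["/data/s1_R1.fastq.gz", "/data/s1_R2.fastq.gz"]))
```

The resulting file could then be fed to qiime tools import with the PairedEndFastqManifestPhred33 format as an alternative to the Casava layout.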