bokulich-lab / nf-ducken
Workflow to process amplicon meta-analysis data, from NCBI accession IDs to taxonomic diversity metrics.
Allow users to access the methods classify-sklearn, classify-consensus-blast, and classify-consensus-vsearch. As the classify-hybrid-vsearch-sklearn pipeline is still in alpha, it will not be implemented yet.
Document testing data, either in Wiki form or in a separate tests/ directory.
More thorough tests can be conducted via our test suite. Pivoting this issue to incorporate testing data as part of our tutorial.
Currently required for start_process = "fastq_import" runs, which skip the initial FASTQ download steps in favor of using locally stored FASTQ files. The FASTQ manifest format should follow QIIME 2 import standards.
Two calls; both processes require q2-fondue environments.
On command line:
qiime tools import \
--input-path ${inp_file} \
--output-path ${inp_base}.qza \
--type 'NCBIAccessionIDs'
qiime fondue get-all \
--m-accession-ids-file ${inp_base}.qza \
--p-email ${email} \
--p-n-jobs 4 \
--output-dir ${out_dir}_fq \
--verbose
Initial command; may need to modify to account for subsequent VSEARCH.
qiime dada2 denoise-paired \
--i-demultiplexed-seqs $TBD \
--p-trunc-len-f 180 \
--p-trunc-len-r 180 \
--p-trunc-q 2 \
--output-dir ${out_dir}_dada2 \
--verbose
See SILVA database processing tutorial with RESCRIPt here. Reference files can be found on the QIIME 2 website.
To run on a compute node (and make use of its lovely resources), the workflow must run on SLURM.
A band-aid fix currently exists via errorStrategy 'ignore'.
On LM, in directory /cluster/work/saga/alloHCT/pipeline/test_slurm/split
see work hashes:
67/7c5bb7
12/de4ea7
af/238b26
12/4b1506
09/1a4c83
77/8f151e
Also: should check whether previous runs actually imported these FASTQs or ignored them when part of a larger manifest.
Currently, the workflow processes all samples as a single massive QIIME artifact. This should probably be broken down into more manageable batches as sample size increases to the thousands and beyond.
For closed-reference OTU clustering, currently the only method implemented in the workflow.
This issue is so old that another QIIME release has happened in the meantime.
As of 2022-07-04, reference-based chimera filtering is available. Still to be implemented (if deemed important enough):
Quick fix to #28.
# characters anywhere in input manifest files raise errors during FASTQ import.
Example input manifest:
sample-id forward-absolute-filepath reverse-absolute-filepath
o#1 /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#1_R1.fastq.gz /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#1_R2.fastq.gz
o#2 /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#2_R1.fastq.gz /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#2_R2.fastq.gz
o#3 /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#3_R1.fastq.gz /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#3_R2.fastq.gz
o#4 /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#4_R1.fastq.gz /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#4_R2.fastq.gz
o#5 /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#5_R1.fastq.gz /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#5_R2.fastq.gz
o#6 /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#6_R1.fastq.gz /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#6_R2.fastq.gz
00.BALB.c /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/00.BALB.c_R1.fastq.gz /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/00.BALB.c_R2.fastq.gz
01.BS.17 /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/01.BS.17_R1.fastq.gz /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/01.BS.17_R2.fastq.gz
Traceback (most recent call last):
File "/opt/conda/envs/qiime2-2022.2/lib/python3.8/site-packages/q2cli/builtin/tools.py", line 157, in import_data
artifact = qiime2.sdk.Artifact.import_data(type, input_path,
File "/opt/conda/envs/qiime2-2022.2/lib/python3.8/site-packages/qiime2/sdk/result.py", line 277, in import_data
return cls._from_view(type_, view, view_type, provenance_capture,
File "/opt/conda/envs/qiime2-2022.2/lib/python3.8/site-packages/qiime2/sdk/result.py", line 305, in _from_view
result = transformation(view, validate_level)
File "/opt/conda/envs/qiime2-2022.2/lib/python3.8/site-packages/qiime2/core/transform.py", line 70, in transformation
new_view = transformer(view)
File "/opt/conda/envs/qiime2-2022.2/lib/python3.8/site-packages/q2_types/per_sample_sequences/_transformer.py", line 236, in _25
return _fastq_manifest_helper_partial(old_fmt, _copy_with_compression,
File "/opt/conda/envs/qiime2-2022.2/lib/python3.8/site-packages/q2_types/per_sample_sequences/_util.py", line 233, in _fastq_manifest_helper
input_manifest = _parse_and_validate_manifest(
File "/opt/conda/envs/qiime2-2022.2/lib/python3.8/site-packages/q2_types/per_sample_sequences/_util.py", line 99, in _parse_and_validate_manifest
raise ValueError('Empty cells are not supported in '
ValueError: Empty cells are not supported in manifest files. Found one or more empty cells in this record: o,nan,nan
An unexpected error has occurred:
Empty cells are not supported in manifest files. Found one or more empty cells in this record: o,nan,nan
See above for debug info.
Clearly the # is interpreted as a comment start regardless of whether it appears at the beginning of a line; everything following the character is ignored. Even if there are valid entries in the manifest, the entire process is treated as a failure.
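A pre-check could flag offending rows before the import is attempted. Below is a minimal sketch of such a check; the helper name find_hash_rows is hypothetical and not part of the actual workflow:

```python
import csv

def find_hash_rows(manifest_path):
    """Flag manifest rows containing '#', which QIIME 2's manifest
    parsing treats as a comment start, truncating the rest of the line.
    Returns (line number, sample-id) tuples for offending rows.
    (Hypothetical pre-check helper, not part of the actual workflow.)"""
    flagged = []
    with open(manifest_path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        next(reader)  # skip header row
        for lineno, row in enumerate(reader, start=2):
            if any("#" in cell for cell in row):
                flagged.append((lineno, row[0]))
    return flagged
```

Running this before qiime tools import would let the workflow fail fast with an explicit list of bad sample IDs instead of the opaque "Empty cells" error above.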
Start with conda environments initially for primary workflow.
Precursor to #21. Currently the workflow processes all data as a single giant artifact; this would allow an arbitrary split of the FASTQ artifact prior to running DADA2.
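The batching itself is simple once the manifest is parsed into rows. A minimal sketch (the helper name and batch_size parameter are assumptions, not the workflow's actual interface):

```python
def split_manifest_rows(rows, batch_size):
    """Split manifest rows into batches of at most batch_size samples,
    so each batch can be imported and denoised as its own artifact.
    (Illustrative sketch; not the workflow's actual interface.)"""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
```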
The 100% identity threshold is too stringent. Investigate a lower identity threshold using the standalone (non-QIIME 2) VSEARCH tool for more detailed results and an appropriate cutoff.
Cross-reference low VSEARCH hits from ECAM and MSK datasets?
Download from QIIME 2 documentation page; may need to validate file existence. Probably SILVA over Greengenes?
Currently many parameters live in the sample params.config file with explicit null assignments. Time to move these into nextflow.config with assigned default values.
It will be important to keep track of significant changes that occur during the workflow. In particular, samples removed from analysis during split_manifest.py and other processing steps will need to be logged and reported at the end of analysis.
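A lightweight way to capture this is to diff sample IDs before and after each step. A sketch, assuming a hypothetical helper (the workflow's actual logging interface may differ):

```python
import logging

def report_dropped_samples(before_ids, after_ids, step):
    """Log sample IDs removed during a processing step so they can be
    collected and reported at the end of the analysis.
    (Illustrative sketch; not the workflow's actual interface.)"""
    dropped = sorted(set(before_ids) - set(after_ids))
    for sample_id in dropped:
        logging.warning("Sample %s removed during %s", sample_id, step)
    return dropped
```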
See params.config for associated input parameters. Use command Workflow16S.initialise(params, log).
Depends on call to #3.
Note from Nick:
DADA2 should always be run on a single sequencing run; do not pool runs prior to denoising.
It should always be multiple samples from the same run (or rather, use all the reads from the same run that you plan to analyze; do not split into individual samples, though in theory N=1 would work).
Launch two commands for closed-reference OTU clustering.
qiime vsearch dereplicate-sequences \
--i-sequences ${out_dir}_fq/seqs.qza \
--o-dereplicated-table ${out_base}-table.qza \
--o-dereplicated-sequences ${out_base}-rep-seqs.qza
qiime vsearch cluster-features-closed-reference \
--i-table ${out_base}-table.qza \
--i-sequences ${out_base}-rep-seqs.qza \
--i-reference-sequences ${ref_otus} \
--p-perc-identity 0.99 \
--output-dir ${out_base}-otus
As output of the q2-fondue call.
Currently, Nextflow processes with high computational needs are run with the general label process_local, which requests resources similar to those of my laptop. As our needs scale up, it will be helpful to determine efficient parameters for resource handling.
Note: As Leonhard Med doesn't allow for downloads from quay.io, we will need a pre-downloaded version of the image already present on Saga.
Sequence-based QC is not yet available in q2-turducken. The plan is to incorporate cutadapt and publish intermediate outputs/statistics for retrospective user assessment. There is currently no plan to force an automated stop (or any other next step) in response to poor QC metrics.
In response to poor QC metrics, automatically relaunch paired-end read analysis as single-end. It is possible that samples or FASTQ files were mislabeled; these should be handled without an additional manual step (e.g. assessment).
If using local FASTQ files: by default, the user must input a QIIME 2-formatted FASTQ manifest to start the workflow. This is usually generated (by the user) with make_manifest.py, which assumes the FASTQ file name suffix _R[1-2].fastq.gz by default. Alternate FASTQ file name suffixes are accepted as user-input parameters.
Automatically detect appropriate file name suffixes to generate the associated FASTQ manifest and input artifact. This simplifies the process on the user's end, requiring them to submit only a directory name containing relevant local FASTQs.
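The detection could work by matching a suffix pattern and grouping files on the remaining prefix. A sketch assuming only the default _R[1-2].fastq.gz convention (the helper name and suffix_re parameter are illustrative):

```python
import re

def build_manifest_rows(filenames, suffix_re=r"_R([12])\.fastq\.gz$"):
    """Group FASTQ file names into (sample-id, forward, reverse) rows
    by stripping a detected suffix. (Sketch: only the default
    _R[1-2].fastq.gz convention is assumed; suffix_re would be
    user-overridable for alternate suffixes.)"""
    pattern = re.compile(suffix_re)
    pairs = {}
    for name in sorted(filenames):
        match = pattern.search(name)
        if match is None:
            continue  # not a recognizable FASTQ name; skip
        sample = name[:match.start()]
        key = "forward" if match.group(1) == "1" else "reverse"
        pairs.setdefault(sample, {})[key] = name
    return [(s, d.get("forward"), d.get("reverse"))
            for s, d in sorted(pairs.items())]
```

Rows with a missing forward or reverse entry would then signal either a single-end dataset or an incomplete pair worth flagging to the user.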
Summarize process statistics and QC metrics with a single MultiQC report. Includes outputs from:
Note that Cutadapt and FastQC integrations are already supported for MultiQC.
According to the QIIME 2 Forum, it is most computationally efficient to merge the input sequences and run q2-feature-classifier on a single artifact. This combination step should be added to the workflow.
Command:
qiime feature-classifier classify-sklearn \
--i-classifier ${idtaxa_classifier} \
--i-reads ${out_dir}_dada2/rep_set.qza \
--p-n-jobs -1 \
--output-dir ${out_dir}_classify \
--verbose
Should include details on:
See QIIME 2 resource page here.
Downloading naive Bayes classifier trained on Silva 138 99% OTU full-length sequences.
Currently, the workflow only runs on Singularity containers that have been hard-coded. Allow locations of local or remote containers to be specified by the user, in the input config file or on the command line?
The workflow requires the user to specify whether input FASTQs have single-end or paired-end reads. Based on this user input, the pipeline runs a single- or paired-end branch of analysis. This is regardless of whether FASTQs are local or retrieved using q2-fondue.
Automatically recognize whether input files are single- or paired-end.
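One heuristic: treat the dataset as paired-end only when every _R1 file has an _R2 mate. A minimal sketch of that check (assumes the default _R1/_R2 naming convention; the function name is illustrative):

```python
def detect_read_layout(filenames):
    """Infer single- vs paired-end layout by checking whether every
    _R1 file has a matching _R2 mate. (Sketch of the proposed
    auto-detection; assumes the default _R1/_R2 naming convention.)"""
    r1 = {f.replace("_R1", "", 1) for f in filenames if "_R1" in f}
    r2 = {f.replace("_R2", "", 1) for f in filenames if "_R2" in f}
    return "paired" if r1 and r1 == r2 else "single"
```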
Original issue split primer trimming into #62. Cutadapt can't be used for binning because the QIIME 2 plugin requires an input artifact of type MultiplexedSingleEndBarcodeInSequence rather than the SampleData[PairedEndSequencesWithQuality] used for sequences downloaded with q2-fondue and typical of those downloaded from the SRA.
Instead, use VSEARCH to bin primers, as suggested on the QIIME 2 Forum. However, the alignment-only method of VSEARCH isn't wrapped in the QIIME 2 ecosystem!
To incorporate in two steps before denoising: