bokulich-lab / nf-ducken
Workflow to process amplicon meta-analysis data, from NCBI accession IDs to taxonomic diversity metrics.
Allow users to access the methods classify-sklearn, classify-consensus-blast, and classify-consensus-vsearch. As the classify-hybrid-vsearch-sklearn pipeline is still in alpha, it will not be implemented yet.
Document testing data, either in Wiki form or in a separate tests/ directory.
More thorough tests can be conducted via our test suite. Pivoting this issue to incorporate testing data as part of our tutorial.
Currently required for start_process = "fastq_import" runs, which skip the initial FASTQ download steps in favor of using locally stored FASTQ files. The FASTQ manifest format should follow QIIME 2 import standards.
Two calls; both processes require q2-fondue environments.
On command line:
qiime tools import \
--input-path ${inp_file} \
--output-path ${inp_base}.qza \
--type 'NCBIAccessionIDs'
qiime fondue get-all \
--m-accession-ids-file ${inp_base}.qza \
--p-email ${email} \
--p-n-jobs 4 \
--output-dir ${out_dir}_fq \
--verbose
Initial command; may need to modify to account for subsequent VSEARCH.
qiime dada2 denoise-paired \
--i-demultiplexed-seqs $TBD \
--p-trunc-len-f 180 \
--p-trunc-len-r 180 \
--p-trunc-q 2 \
--output-dir ${out_dir}_dada2 \
--verbose
See SILVA database processing tutorial with RESCRIPt here. Reference files can be found on the QIIME 2 website.
To run on a compute node (and make use of its lovely resources), the workflow must run on SLURM.
A band-aid fix currently exists via errorStrategy 'ignore'.
On LM, in directory /cluster/work/saga/alloHCT/pipeline/test_slurm/split
see work hashes:
67/7c5bb7
12/de4ea7
af/238b26
12/4b1506
09/1a4c83
77/8f151e
Also: should check whether previous runs actually imported these FASTQs or ignored them when part of a larger manifest.
Currently, the workflow processes all samples as a single massive QIIME artifact. This should probably be broken down into more manageable batches as sample size increases to the thousands and beyond.
For closed-reference OTU clustering, currently the only method implemented in the workflow.
This issue is so old that another QIIME release has happened in the meantime.
As of 2022-07-04, reference-based chimera filtering is available. Still to be implemented (if deemed important enough):
Quick fix to #28.
# characters anywhere in input manifest files raise errors during FASTQ import.
Example input manifest:
sample-id forward-absolute-filepath reverse-absolute-filepath
o#1 /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#1_R1.fastq.gz /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#1_R2.fastq.gz
o#2 /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#2_R1.fastq.gz /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#2_R2.fastq.gz
o#3 /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#3_R1.fastq.gz /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#3_R2.fastq.gz
o#4 /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#4_R1.fastq.gz /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#4_R2.fastq.gz
o#5 /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#5_R1.fastq.gz /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#5_R2.fastq.gz
o#6 /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#6_R1.fastq.gz /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#6_R2.fastq.gz
00.BALB.c /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/00.BALB.c_R1.fastq.gz /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/00.BALB.c_R2.fastq.gz
01.BS.17 /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/01.BS.17_R1.fastq.gz /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/01.BS.17_R2.fastq.gz
Traceback (most recent call last):
File "/opt/conda/envs/qiime2-2022.2/lib/python3.8/site-packages/q2cli/builtin/tools.py", line 157, in import_data
artifact = qiime2.sdk.Artifact.import_data(type, input_path,
File "/opt/conda/envs/qiime2-2022.2/lib/python3.8/site-packages/qiime2/sdk/result.py", line 277, in import_data
return cls._from_view(type_, view, view_type, provenance_capture,
File "/opt/conda/envs/qiime2-2022.2/lib/python3.8/site-packages/qiime2/sdk/result.py", line 305, in _from_view
result = transformation(view, validate_level)
File "/opt/conda/envs/qiime2-2022.2/lib/python3.8/site-packages/qiime2/core/transform.py", line 70, in transformation
new_view = transformer(view)
File "/opt/conda/envs/qiime2-2022.2/lib/python3.8/site-packages/q2_types/per_sample_sequences/_transformer.py", line 236, in _25
return _fastq_manifest_helper_partial(old_fmt, _copy_with_compression,
File "/opt/conda/envs/qiime2-2022.2/lib/python3.8/site-packages/q2_types/per_sample_sequences/_util.py", line 233, in _fastq_manifest_helper
input_manifest = _parse_and_validate_manifest(
File "/opt/conda/envs/qiime2-2022.2/lib/python3.8/site-packages/q2_types/per_sample_sequences/_util.py", line 99, in _parse_and_validate_manifest
raise ValueError('Empty cells are not supported in '
ValueError: Empty cells are not supported in manifest files. Found one or more empty cells in this record: o,nan,nan
An unexpected error has occurred:
Empty cells are not supported in manifest files. Found one or more empty cells in this record: o,nan,nan
See above for debug info.
Clearly the # is interpreted as a comment start regardless of whether it appears at the beginning of a line; everything following the character is ignored. Even if there are valid entries in the manifest, the entire process is treated as a failure.
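A pre-check could flag offending rows before the import is attempted. Below is a minimal sketch of such a check; the helper name find_hash_rows is hypothetical and not part of the actual workflow:

```python
import csv

def find_hash_rows(manifest_path):
    """Flag manifest rows containing '#', which QIIME 2's manifest
    parsing treats as a comment start, truncating the rest of the line.
    Returns (line number, sample-id) tuples for offending rows.
    (Hypothetical pre-check helper, not part of the actual workflow.)"""
    flagged = []
    with open(manifest_path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        next(reader)  # skip header row
        for lineno, row in enumerate(reader, start=2):
            if any("#" in cell for cell in row):
                flagged.append((lineno, row[0]))
    return flagged
```

Running this before qiime tools import would let the workflow fail fast with an explicit list of bad sample IDs instead of the opaque "Empty cells" error above.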
Start with conda environments initially for primary workflow.
Precursor to #21. Currently the workflow processes all data as a single giant artifact; this would allow an arbitrary split of the FASTQ artifact prior to running DADA2.
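The batching itself is simple once the manifest is parsed into rows. A minimal sketch (the helper name and batch_size parameter are assumptions, not the workflow's actual interface):

```python
def split_manifest_rows(rows, batch_size):
    """Split manifest rows into batches of at most batch_size samples,
    so each batch can be imported and denoised as its own artifact.
    (Illustrative sketch; not the workflow's actual interface.)"""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
```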
The 100% identity threshold is too stringent. Investigate a lower identity threshold using the standalone (non-QIIME 2) VSEARCH tool for more detailed results and an appropriate cutoff.
Cross-reference low VSEARCH hits from ECAM and MSK datasets?
Download from QIIME 2 documentation page; may need to validate file existence. Probably SILVA over Greengenes?
Currently many parameters live in the sample params.config file with explicit null assignments. Time to move these into nextflow.config with assigned default values.
It will be important to keep track of significant changes that occur during the workflow. In particular, samples removed from analysis during split_manifest.py and other processing steps will need to be logged and reported at the end of analysis.
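A lightweight way to capture this is to diff sample IDs before and after each step. A sketch, assuming a hypothetical helper (the workflow's actual logging interface may differ):

```python
import logging

def report_dropped_samples(before_ids, after_ids, step):
    """Log sample IDs removed during a processing step so they can be
    collected and reported at the end of the analysis.
    (Illustrative sketch; not the workflow's actual interface.)"""
    dropped = sorted(set(before_ids) - set(after_ids))
    for sample_id in dropped:
        logging.warning("Sample %s removed during %s", sample_id, step)
    return dropped
```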
See params.config for associated input parameters. Use command Workflow16S.initialise(params, log).
Depends on call to #3.
Note from Nick:
DADA2 should always be run on a single sequencing run; do not pool runs prior to denoising.
It should always be multiple samples from the same run (or rather, use all the reads from the same run that you plan to analyze; do not split into individual samples, though in theory N=1 would work).
Launch two commands for closed-reference OTU clustering.
qiime vsearch dereplicate-sequences \
--i-sequences ${out_dir}_fq/seqs.qza \
--o-dereplicated-table ${out_base}-table.qza \
--o-dereplicated-sequences ${out_base}-rep-seqs.qza
qiime vsearch cluster-features-closed-reference \
--i-table ${out_base}-table.qza \
--i-sequences ${out_base}-rep-seqs.qza \
--i-reference-sequences ${ref_otus} \
--p-perc-identity 0.99 \
--output-dir ${out_base}-otus
As output of the q2-fondue call.
Currently, Nextflow processes with high computational needs are run with the general label process_local, which requests resources similar to those of my laptop. As our needs scale up, it will be helpful to determine efficient parameters for resource handling.
Note: As Leonhard Med doesn't allow for downloads from quay.io, we will need a pre-downloaded version of the image already present on Saga.
Sequence-based QC is not yet available in q2-turducken. The plan is to incorporate cutadapt and publish intermediate outputs/statistics for retrospective user assessment. There is currently no plan to force an automated stop (or any other next step) in response to poor QC metrics.
In response to poor QC metrics, automatically relaunch paired-end read analysis as single-end. It is possible that samples or FASTQ files were mislabeled; these should be handled without an additional manual step (e.g. assessment).
If using local FASTQ files: by default, the user must input a QIIME 2-formatted FASTQ manifest to start the workflow. This is usually generated (by the user) with make_manifest.py, which assumes the FASTQ file name suffix _R[1-2].fastq.gz by default. Alternate FASTQ file name suffixes are accepted as user-input parameters.
Automatically detect appropriate file name suffixes to generate the associated FASTQ manifest and input artifact. This simplifies the process on the user's end, requiring them to submit only a directory name containing relevant local FASTQs.
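The detection could work by matching a suffix pattern and grouping files on the remaining prefix. A sketch assuming only the default _R[1-2].fastq.gz convention (the helper name and suffix_re parameter are illustrative):

```python
import re

def build_manifest_rows(filenames, suffix_re=r"_R([12])\.fastq\.gz$"):
    """Group FASTQ file names into (sample-id, forward, reverse) rows
    by stripping a detected suffix. (Sketch: only the default
    _R[1-2].fastq.gz convention is assumed; suffix_re would be
    user-overridable for alternate suffixes.)"""
    pattern = re.compile(suffix_re)
    pairs = {}
    for name in sorted(filenames):
        match = pattern.search(name)
        if match is None:
            continue  # not a recognizable FASTQ name; skip
        sample = name[:match.start()]
        key = "forward" if match.group(1) == "1" else "reverse"
        pairs.setdefault(sample, {})[key] = name
    return [(s, d.get("forward"), d.get("reverse"))
            for s, d in sorted(pairs.items())]
```

Rows with a missing forward or reverse entry would then signal either a single-end dataset or an incomplete pair worth flagging to the user.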
Summarize process statistics and QC metrics with a single MultiQC report. Includes outputs from:
Note that Cutadapt and FastQC integrations are already supported for MultiQC.
According to the QIIME 2 Forum, it is most computationally efficient to merge the input sequences and run q2-feature-classifier on a single artifact. This combination step should be added to the workflow.
Command:
qiime feature-classifier classify-sklearn \
--i-classifier ${idtaxa_classifier} \
--i-reads ${out_dir}_dada2/rep_set.qza \
--p-n-jobs -1 \
--output-dir ${out_dir}_classify \
--verbose
Should include details on:
See QIIME 2 resource page here.
Downloading naive Bayes classifier trained on Silva 138 99% OTU full-length sequences.
Currently, the workflow only runs on Singularity containers that have been hard-coded. Allow locations of local or remote containers to be specified by the user, in the input config file or on the command line?
The workflow requires the user to specify whether input FASTQs have single-end or paired-end reads. Based on this user input, the pipeline runs a single- or paired-end branch of analysis. This is regardless of whether FASTQs are local or retrieved using q2-fondue.
Automatically recognize whether input files are single- or paired-end.
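One heuristic: treat the dataset as paired-end only when every _R1 file has an _R2 mate. A minimal sketch of that check (assumes the default _R1/_R2 naming convention; the function name is illustrative):

```python
def detect_read_layout(filenames):
    """Infer single- vs paired-end layout by checking whether every
    _R1 file has a matching _R2 mate. (Sketch of the proposed
    auto-detection; assumes the default _R1/_R2 naming convention.)"""
    r1 = {f.replace("_R1", "", 1) for f in filenames if "_R1" in f}
    r2 = {f.replace("_R2", "", 1) for f in filenames if "_R2" in f}
    return "paired" if r1 and r1 == r2 else "single"
```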
Original issue split primer trimming into #62. Cutadapt can't be used for binning because the QIIME 2 plugin requires an input artifact of type MultiplexedSingleEndBarcodeInSequence rather than the SampleData[PairedEndSequencesWithQuality] used for sequences downloaded with q2-fondue and typical of those downloaded from the SRA.
Instead, use VSEARCH to bin primers, as suggested on the QIIME 2 Forum. However, the alignment-only method of VSEARCH isn't wrapped in the QIIME 2 ecosystem!
To incorporate in two steps before denoising: