
nf-ducken's People

Contributors

christosmatzoros, lina-kim


nf-ducken's Issues

Write tutorial and incorporate testing data

Document testing data, either in Wiki form or in a separate tests/ directory.

More thorough tests can be conducted via our test suite, so this issue is pivoting to incorporating the testing data into our tutorial.

Wrap metadata import and q2-fondue call

Two calls; both processes require a q2-fondue environment.

On the command line:

qiime tools import \
        --input-path ${inp_file} \
        --output-path ${inp_base}.qza \
        --type 'NCBIAccessionIDs'

qiime fondue get-all \
        --m-accession-ids-file ${inp_base}.qza \
        --p-email ${email} \
        --p-n-jobs 4 \
        --output-dir ${out_dir}_fq \
        --verbose
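
A minimal DSL2 sketch of how these two calls might be wrapped in a single Nextflow process is below; the process name, output layout, and params.email are assumptions, not the pipeline's actual code.

process FONDUE_GET_ALL {
    label 'process_local'

    input:
    path inp_file  // TSV of NCBI accession IDs

    output:
    path "${inp_file.baseName}_fq", emit: fastq_dir

    script:
    """
    qiime tools import \\
        --input-path ${inp_file} \\
        --output-path ${inp_file.baseName}.qza \\
        --type 'NCBIAccessionIDs'

    qiime fondue get-all \\
        --m-accession-ids-file ${inp_file.baseName}.qza \\
        --p-email ${params.email} \\
        --p-n-jobs 4 \\
        --output-dir ${inp_file.baseName}_fq \\
        --verbose
    """
}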

Wrap DADA2 call

Initial command; it may need modification to account for the subsequent VSEARCH step.

qiime dada2 denoise-paired \
        --i-demultiplexed-seqs $TBD \
        --p-trunc-len-f 180 \
        --p-trunc-len-r 180 \
        --p-trunc-q 2 \
        --output-dir ${out_dir}_dada2 \
        --verbose

Optimize artifact splitting for DADA2

Currently, the workflow processes all samples as a single massive QIIME artifact. This should probably be broken down into more manageable batches as sample counts increase into the thousands and beyond.
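
A hypothetical batching sketch, assuming a TSV manifest and a params.batch_size knob: split the manifest rows into fixed-size groups, then import and denoise each group as its own artifact.

Channel
    .fromPath(params.fastq_manifest)
    .splitCsv(header: true, sep: '\t')
    .buffer(size: params.batch_size ?: 500, remainder: true)  // groups of at most 500 samples
    .set { manifest_batches }

// Each batch would then be written back out as a per-batch manifest
// (e.g. via collectFile) and fed to the import/DADA2 processes independently.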

Completely ignore references to samples starting with special characters (#)

Quick fix to #28.

A # character anywhere in an input manifest file raises an error during FASTQ import.

Example input manifest:

sample-id       forward-absolute-filepath       reverse-absolute-filepath
o#1     /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#1_R1.fastq.gz     /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#1_R2.fastq.gz
o#2     /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#2_R1.fastq.gz     /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#2_R2.fastq.gz
o#3     /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#3_R1.fastq.gz     /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#3_R2.fastq.gz
o#4     /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#4_R1.fastq.gz     /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#4_R2.fastq.gz
o#5     /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#5_R1.fastq.gz     /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#5_R2.fastq.gz
o#6     /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#6_R1.fastq.gz     /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/#6_R2.fastq.gz
00.BALB.c       /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/00.BALB.c_R1.fastq.gz     /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/00.BALB.c_R2.fastq.gz
01.BS.17        /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/01.BS.17_R1.fastq.gz      /cluster/work/saga/alloHCT/data_uz/MSKCC_microbiome_samples_for_Greg/01.BS.17_R2.fastq.gz
Traceback (most recent call last):
  File "/opt/conda/envs/qiime2-2022.2/lib/python3.8/site-packages/q2cli/builtin/tools.py", line 157, in import_data
    artifact = qiime2.sdk.Artifact.import_data(type, input_path,
  File "/opt/conda/envs/qiime2-2022.2/lib/python3.8/site-packages/qiime2/sdk/result.py", line 277, in import_data
    return cls._from_view(type_, view, view_type, provenance_capture,
  File "/opt/conda/envs/qiime2-2022.2/lib/python3.8/site-packages/qiime2/sdk/result.py", line 305, in _from_view
    result = transformation(view, validate_level)
  File "/opt/conda/envs/qiime2-2022.2/lib/python3.8/site-packages/qiime2/core/transform.py", line 70, in transformation
    new_view = transformer(view)
  File "/opt/conda/envs/qiime2-2022.2/lib/python3.8/site-packages/q2_types/per_sample_sequences/_transformer.py", line 236, in _25
    return _fastq_manifest_helper_partial(old_fmt, _copy_with_compression,
  File "/opt/conda/envs/qiime2-2022.2/lib/python3.8/site-packages/q2_types/per_sample_sequences/_util.py", line 233, in _fastq_manifest_helper
    input_manifest = _parse_and_validate_manifest(
  File "/opt/conda/envs/qiime2-2022.2/lib/python3.8/site-packages/q2_types/per_sample_sequences/_util.py", line 99, in _parse_and_validate_manifest
    raise ValueError('Empty cells are not supported in '
ValueError: Empty cells are not supported in manifest files. Found one or more empty cells in this record: o,nan,nan

An unexpected error has occurred:

  Empty cells are not supported in manifest files. Found one or more empty cells in this record: o,nan,nan

See above for debug info.

Clearly, the # is interpreted as the start of a comment regardless of whether it appears at the beginning of a line; everything following the character is ignored. Even if there are valid entries in the manifest, the entire import is treated as a failure.
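
Until the root cause is handled upstream, a hypothetical shell quick fix is to drop every data row containing a # before handing the manifest to qiime tools import (file names are placeholders):

head -n 1 manifest.tsv > manifest.filtered.tsv                    # keep the header
tail -n +2 manifest.tsv | grep -v '#' >> manifest.filtered.tsv    # drop rows with '#' anywhere

Note this silently discards the affected samples, which matches the issue title: references to #-containing samples are completely ignored rather than imported.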

Analyze optimal cutoff for VSEARCH

The 100% identity threshold is too stringent. Investigate lower identity thresholds using the standalone (non-QIIME 2) VSEARCH tool to obtain more detailed results and determine an appropriate cutoff.
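
A hypothetical parameter sweep with the standalone tool, assuming the representative sequences and reference OTUs have already been exported from their .qza files to FASTA:

for id in 0.90 0.94 0.97 0.99 1.00; do
    vsearch --usearch_global rep-seqs.fasta \
            --db ref-otus.fasta \
            --id ${id} \
            --uc hits_${id}.uc \
            --threads 8
done

The per-query .uc output makes it possible to plot the fraction of reads matched at each threshold and pick a defensible cutoff.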

Optimize VSEARCH parameters

Cross-reference low VSEARCH hits from ECAM and MSK datasets?

  • Host 16S? Contamination? mt16S in diet? Index hopping?
  • Nick: “85% [identity] is quite low for a gut cohort”
  • Use consensus or confidence threshold? Weight by confidence in a given type?

Validate input parameters

See params.config for the associated input parameters. Use the command Workflow16S.initialise(params, log).
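
A minimal Groovy sketch of what such an initialise helper could look like, loosely following the nf-core WorkflowMain pattern; the parameter names checked here (fastq_manifest, inp_id_file, read_type) are assumptions:

class Workflow16S {
    public static void initialise(params, log) {
        // At least one input source must be provided
        if (!params.fastq_manifest && !params.inp_id_file) {
            log.error "Provide either a FASTQ manifest or an accession ID file."
            System.exit(1)
        }
        // Read type must be a supported value
        def valid_types = ['single', 'paired']
        if (!(params.read_type in valid_types)) {
            log.error "params.read_type must be one of: ${valid_types.join(', ')}"
            System.exit(1)
        }
    }
}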

Wrap VSEARCH call

Launch two commands for closed-reference OTU clustering.

qiime vsearch dereplicate-sequences \
        --i-sequences ${out_dir}_fq/seqs.qza \
        --o-dereplicated-table ${out_base}-table.qza \
        --o-dereplicated-sequences ${out_base}-rep-seqs.qza

qiime vsearch cluster-features-closed-reference \
        --i-table ${out_base}-table.qza \
        --i-sequences ${out_base}-rep-seqs.qza \
        --i-reference-sequences ${ref_otus} \
        --p-perc-identity 0.99 \
        --output-dir ${out_base}-otus

Optimize memory/CPU usage per process

Currently, Nextflow processes with high computational needs run under the generic label process_local, which requests resources similar to those of my laptop. As our needs scale up, it will be helpful to determine efficient parameters for resource handling.
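
One direction, sketched below under assumed label names, is to replace the catch-all label with nf-core-style tiers and let failed tasks retry with scaled resources:

process {
    withLabel: process_low {
        cpus          = 2
        memory        = { 4.GB * task.attempt }
        errorStrategy = 'retry'
        maxRetries    = 2
    }
    withLabel: process_high {
        cpus          = 16
        memory        = { 64.GB * task.attempt }
        errorStrategy = 'retry'
        maxRetries    = 2
    }
}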

Automate rerun of paired-end reads as single-end based on QC

Current workflow

Sequence-based QC is not yet available in q2-turducken. The plan is to incorporate cutadapt and publish intermediate outputs/statistics for retrospective user assessment. There is currently no plan to force an automated stop (or any other next step) in response to poor QC metrics.

Request

In response to poor QC metrics, automatically relaunch the paired-end read analysis as single-end. Samples or FASTQ files may have been mislabeled, and such cases should be handled without an additional manual step (e.g. a manual assessment).
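
A hypothetical routing sketch using Nextflow's branch operator, assuming an upstream QC process emits (sample_id, reads, merge_rate) tuples and a params.min_merge_rate threshold:

qc_results
    .branch { id, reads, merge_rate ->
        paired: merge_rate >= (params.min_merge_rate ?: 0.8)
        single: true  // everything else falls through to the single-end branch
    }
    .set { routed }

// routed.paired continues through denoise-paired;
// routed.single is relaunched through the single-end analysis branch.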

Automate filename format recognition

Current workflow

If using local FASTQ files: by default, the workflow requires the user to input a QIIME 2-formatted FASTQ manifest to start. This manifest is usually generated (by the user) with make_manifest.py, which assumes the FASTQ file name suffix _R[1-2].fastq.gz by default. Alternate FASTQ file name suffixes are accepted as user-input parameters.

Request

Automatically detect the appropriate file name suffixes to generate the associated FASTQ manifest and input artifact. This simplifies the process on the user's end, requiring them to submit only the name of a directory containing the relevant local FASTQs.
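
A hypothetical detection sketch: probe a few common suffix conventions in the user-supplied directory and build the pair channel from the first glob that matches anything (the patterns and params.fastq_dir are assumptions):

def candidates = ['*_R{1,2}.fastq.gz', '*_R{1,2}_001.fastq.gz', '*_{1,2}.fastq.gz']
def pattern = candidates.find { glob ->
    !file("${params.fastq_dir}/${glob}").isEmpty()  // file() with a glob returns a list of matches
}
Channel
    .fromFilePairs("${params.fastq_dir}/${pattern}", checkIfExists: true)
    .set { read_pairs }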

Incorporate optional feature classifier training

  • Train and incorporate a feature classifier for our particular samples, to pass into taxonomy classification after denoising; see the sketch after this list
    • Use RESCRIPt to pull down amplicon-specific reference taxonomy databases, e.g. SILVA or UNITE-INSD
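
A hypothetical training sketch with the QIIME 2 CLI, assuming the reference sequences and taxonomy (e.g. from RESCRIPt's get-silva-data) have already been saved as silva-seqs.qza and silva-tax.qza; primer variables and output names are placeholders:

qiime feature-classifier extract-reads \
        --i-sequences silva-seqs.qza \
        --p-f-primer ${f_primer} \
        --p-r-primer ${r_primer} \
        --o-reads amplicon-ref-seqs.qza

qiime feature-classifier fit-classifier-naive-bayes \
        --i-reference-reads amplicon-ref-seqs.qza \
        --i-reference-taxonomy silva-tax.qza \
        --o-classifier amplicon-classifier.qza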

Wrap feature-classifier call

Command:

qiime feature-classifier classify-sklearn \
        --i-classifier ${idtaxa_classifier} \
        --i-reads ${out_dir}_dada2/rep_set.qza \
        --p-n-jobs -1 \
        --output-dir ${out_dir}_classify \
        --verbose

Incorporate user-input containers

Currently, the workflow runs only on Singularity containers that have been hard-coded. Allow the user to specify locations of local or remote containers, in the input config file or on the command line?
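
One possible shape for this, sketched with assumed parameter and process names: expose the container URIs as params with the current hard-coded values as defaults, so users can override them from the config file or with --qiime_container on the command line:

params.qiime_container = 'quay.io/qiime2/core:2022.2'  // assumed default, user-overridable

process {
    withName: 'IMPORT_FASTQ|DENOISE_DADA2|CLASSIFY_TAXONOMY' {
        container = params.qiime_container
    }
}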

Automate recognition of input read types: single-end or paired-end

Current workflow

The workflow requires the user to specify whether input FASTQs contain single-end or paired-end reads. Based on this input, the pipeline runs the single- or paired-end branch of analysis, regardless of whether the FASTQs are local or retrieved using q2-fondue.

Request

Automatically recognize whether input files are single- or paired-end.
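
A hypothetical sketch for local FASTQs: let fromFilePairs accept any group size, then tag each sample by how many files share its prefix (the glob and channel names are assumptions):

Channel
    .fromFilePairs("${params.fastq_dir}/*_R{1,2}*.fastq.gz", size: -1)
    .map { id, reads ->
        tuple(id, reads, reads.size() == 2 ? 'paired' : 'single')
    }
    .set { typed_reads }  // downstream logic branches on the third element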

Incorporate primer-based binning prior to primer removal

The original issue split primer trimming out into #62. Cutadapt can't be used for binning: the QIIME 2 plugin requires an input artifact of type MultiplexedSingleEndBarcodeInSequence, rather than the SampleData[PairedEndSequencesWithQuality] used for sequences downloaded with q2-fondue and typical of those downloaded from the SRA.

Instead, use VSEARCH to bin primers, as suggested on the QIIME 2 Forum. However, the alignment-only method of VSEARCH isn't wrapped in the QIIME 2 ecosystem!

To incorporate in two steps before denoising (see the sketch after this list):

  • Initially to demultiplex amplicons, binning reads into each library-amplicon combination
    • e.g. V1V2, V2V3, V3V4, V4V5, V5V7, V7V9, ITS
  • Then primer/adapter trimming and QC (this one with MultiQC output)
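
As a point of comparison, standalone cutadapt (outside the QIIME 2 plugin, so the artifact-type restriction doesn't apply) can demultiplex paired reads by 5' primer into per-amplicon bins; a hypothetical sketch, assuming a primers.fasta of named amplicon primers:

cutadapt \
        -g file:primers.fasta \
        -o "{name}_R1.fastq.gz" \
        -p "{name}_R2.fastq.gz" \
        input_R1.fastq.gz input_R2.fastq.gz

Reads whose primer isn't found land in the unknown bin, which could then feed the VSEARCH-based alignment step.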
