fmalmeida / ngs-preprocess

A pipeline for preprocessing NGS data from Illumina, Nanopore and PacBio technologies

Home Page: https://ngs-preprocess.readthedocs.io/

License: GNU General Public License v3.0

Languages: Nextflow 97.60%, Dockerfile 2.40%

Topics: ngs-preprocess, illumina, pacbio, nextflow, pipeline, trimgalore, nanopack, bax2bam, porechop, reproducible-research

ngs-preprocess's Introduction

Hello 😁 👋

Hello there, my name is Felipe Almeida. I am a Brazilian scientist, bioinformatician, pipeline developer and problem solver. My main interests are bioinformatics, genomic surveillance, precision medicine, and microbial genomics. You can also find me on Twitter (@fmarquesalmeida), Stack Overflow and LinkedIn.

Academic info

I'm a PhD student at the University of Brasilia, in the CompGen (Computational Genomics) laboratory, under the academic guidance of Prof. Georgios J. Pappas Jr., PhD.

ngs-preprocess's People

Forkers

vikash84 minghao2016 ravinpoudel lorepaga1996

ngs-preprocess's Issues

include the automatic generation of a samplesheet for MpGAP

Add the automatic generation of a samplesheet that can be directly used as input for the https://github.com/fmalmeida/MpGAP pipeline.

SRA fetch and preprocess of illumina with FASTP hangs

This command

nextflow run fmalmeida/ngs-preprocess   -r dev -latest -profile docker --sra_ids "./input/sra_ids.txt"   --output illumina_single  --shortreads_type "single"   --fastp_additional_parameters " --trim_front1 5 --trim_tail1 5 "

hangs during processing, stuck at 0 of 1 for the FASTP process:

[54/9c682c] process > SRA_FETCH:GET_FASTQ (SRR28776895)    [100%] 1 of 1 ✔
[32/48d60d] process > SRA_FETCH:GET_METADATA (SRR28776895) [100%] 1 of 1 ✔
[-        ] process > NANOPORE:PORECHOP                    -
[-        ] process > NANOPORE:FILTER                      -
[-        ] process > NANOPORE:NANOPACK                    -
[-        ] process > PACBIO:BAM2FASTQ                     -
[-        ] process > PACBIO:NANOPACK                      -
[-        ] process > PACBIO:FILTER                        -
[17/78fe41] process > ILLUMINA:FASTP (SRR28776895)         [  0%] 0 of 1

However, 3 FASTQ files are produced, and following the previous command with this one completes the preprocessing:

nextflow run fmalmeida/ngs-preprocess -r dev -latest -profile docker --shortreads "illumina_single/SRA_FETCH/FASTQ/SRR28776895_data/*.fastq.gz" \
   --output illumina_single --shortreads_type "single" --fastp_additional_parameters " --trim_front1 5 --trim_tail1 5 "

executor >  local (3)
[-        ] process > SRA_FETCH:GET_FASTQ            -
[-        ] process > SRA_FETCH:GET_METADATA         -
[-        ] process > NANOPORE:PORECHOP              -
[-        ] process > NANOPORE:FILTER                -
[-        ] process > NANOPORE:NANOPACK              -
[-        ] process > PACBIO:BAM2FASTQ               -
[-        ] process > PACBIO:NANOPACK                -
[-        ] process > PACBIO:FILTER                  -
[64/63cded] process > ILLUMINA:FASTP (SRR28776895_2) [100%] 3 of 3 ✔

My guess is that Nextflow does not automatically point to the downloaded SRA files. Perhaps there's a flag I missed.

change to bioconda images

Instead of creating a custom docker image with all tools, reconfigure the pipeline to use the bioconda channels and images, which will enable users to run the tool with conda, docker or singularity.

SRA NCBI fetch and preprocessing script only works automatically for Illumina sequences

Using fmalmeida/ngs-preprocess:v2.6.

Nextflow should identify the sequencing platform and route preprocessing to nanopore/pacbio/illumina.


ERROR ~ Error executing process > 'SRA_FETCH:GET_FASTQ (SRR9641620)'

Caused by:
  Process `SRA_FETCH:GET_FASTQ (SRR9641620)` terminated with an error exit status (3)

Command executed:

  fasterq-dump \
    --include-technical \
    --split-files \
    --threads 2 \
    --outdir ./SRR9641620_data \
    --progress \
    SRR9641620

Command exit status:
  3

Command output:
  (empty)

Command error:
  2024-05-09T13:37:02 fasterq-dump.3.0.3 err: accession 'SRR9641620' is PACBIO, please use fastq-dump instead
  fasterq-dump quit with error code 3
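
The log above suggests one possible fix: pick the download tool from the sequencing platform before fetching. A minimal sketch, assuming the platform string is already available from the run metadata (for example ILLUMINA, PACBIO_SMRT or OXFORD_NANOPORE); the function name `pick_dump_tool` is hypothetical and not part of the pipeline:

```shell
# Hypothetical helper (not part of ngs-preprocess): choose the SRA Toolkit
# download command from a platform string taken from the run metadata.
pick_dump_tool() {
  local platform
  # Upper-case the input so that matching is case-insensitive.
  platform=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  case "$platform" in
    # fasterq-dump 3.0.3 rejects PacBio accessions (see the error above).
    PACBIO*) echo "fastq-dump" ;;
    *)       echo "fasterq-dump" ;;
  esac
}
```

For example, `pick_dump_tool PACBIO_SMRT` prints `fastq-dump`, while `pick_dump_tool ILLUMINA` prints `fasterq-dump`. Whether other platforms need special handling is left open here.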

change structure of output directory

The structure of the output directory is not standardized and needs some changes to enable easy access to the final (preprocessed) reads.

It would be nice to have:

A final directory, probably called final_output, that will contain all final (trimmed and filtered) FASTQ files, in fq.gz format to standardize filenames.
This directory will hold all results and separate the reads into subdirectories (for longreads or shortreads).
Then, the other files (quality, merging steps, correction steps, etc.) would be saved in other directories, one for each step, software or strain; this part still needs more thought.
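
To make the proposal concrete, here is a small sketch of how a final read path could be derived under this layout. It is only an illustration of the idea: the function name `final_read_path` and the subdirectory names `shortreads` and `longreads` are assumptions, since the structure has not been decided yet.

```shell
# Hypothetical sketch of the proposed layout (nothing here is decided yet):
#   final_output/shortreads/<sample>.fq.gz
#   final_output/longreads/<sample>.fq.gz
# It only computes the target path; it does not move or compress any file.
final_read_path() {
  local read_type="$1" file name
  file=$(basename "$2")
  case "$read_type" in
    shortreads|longreads) ;;
    *) echo "unknown read type: $read_type" >&2; return 1 ;;
  esac
  # Strip any common FASTQ extension, then apply the standard suffix.
  name="${file%.gz}"
  name="${name%.fastq}"
  name="${name%.fq}"
  echo "final_output/${read_type}/${name}.fq.gz"
}
```

For example, `final_read_path shortreads results/SRR28776895_1.fastq.gz` prints `final_output/shortreads/SRR28776895_1.fq.gz`.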

More brainstorming about this issue is required before implementing it. Help is needed to decide the structure (@gpappasunb).

add nf-test

Add nf-tests for pipeline integrity checks

Update NanoPack tools alternatives

Some tools from NanoPack have been replaced by faster tools, as described here: https://github.com/wdecoster/nanopack?tab=readme-ov-file

The task would be to update such tools in the pipeline.

update module to fetch data from sra

Currently, the pipeline decides how to split downloaded data between modules based on the patterns: illumina, pacbio, nanopore.
But what if a downloaded dataset is not from any of these platforms?
Think about how to better approach channel splitting.
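
One way to approach this is to give the pattern matching an explicit fallback, so that data from any other platform is flagged instead of silently dropped. A minimal sketch of that routing logic, assuming the platform arrives as a free-text string from the SRA metadata; the function name `route_platform` and the `unsupported` label are hypothetical:

```shell
# Hypothetical sketch: map a platform string to one of the three pipeline
# modules, with an explicit bucket for everything else.
route_platform() {
  local platform
  # Lower-case the input so that matching is case-insensitive.
  platform=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$platform" in
    *illumina*) echo "illumina" ;;
    *pacbio*)   echo "pacbio" ;;
    *nanopore*) echo "nanopore" ;;
    *)          echo "unsupported" ;;  # e.g. warn the user and skip
  esac
}
```

Inside the pipeline itself, the same idea could probably be expressed with a catch-all condition when branching the channel; what to do with the unsupported bucket (warn, skip or fail) is a design decision left open.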

Add example of non-bacterial dataset analysis (paper review)

Background
This issue is meant to address the comments received on the paper review here.

Description
Generate a new page in the web documentation showing the analysis of a fungal or plant sequencing dataset. Make sure it includes the necessary command lines from input to output, so that one can reproduce the analysis, and also add an overview of the generated results to the web page.

Once done, check how easily we can update the paper to provide an additional Zenodo record for the non-bacterial analysis (ngs-preprocess + MpGAP).

Suggestion for hybrid error correction

Hi there,

Found your wrapper on Twitter, great initiative :). I have a suggestion for your pipeline: it would be worth considering hybrid correction (i.e. combining short and long reads). In my current pipeline I use fmlrc, which works pretty well in combination with ONT reads.

Cheers,

Tuan

Add more parallel jobs

Add the option to run more jobs in parallel, with each job using up to N threads, as is done in bacannot.

add citation information

Add information about citation: https://f1000research.com/articles/12-1205/v1

consider using porechop_abi

Assess and consider switching from porechop, which is deprecated, to porechop_abi, which is actively maintained.

https://github.com/bonsai-team/Porechop_ABI

Change software for filtering longreads

Consider replacing the software used to filter the long reads with a more recent and faster tool.

Currently, NanoFilt is used for the task.

The candidate replacement is nanoq.

standard profile to not load docker

Instead of having the standard profile automatically load Docker, it is better not to load it in any profile by default, so that the pipeline acts as a simple local pipeline.

So, users who want to use one of the available profiles must explicitly select it with -profile docker/singularity/conda.

Enhance documentation (paper review)

Background
This issue is meant to address the comments received on the paper review here.

Description
Create an "Output" page that guides users through the output structure and points to the correct tool-specific links, as is done in the bacannot documentation page, which helps users interpret the generated results, including the directory structure and the relevant links to tool-specific reference material.

new tool for long reads QC

A new tool for long reads quality assessment is now available:

https://github.com/yfukasawa/LongQC

The task is to evaluate the tool and compare it with NanoPack and pycoQC in order to decide whether it is worth including in the pipeline, either as an addition or as a replacement for one of the mentioned tools.

fix bam2fastq source code

PacBio has changed the location of many of its tools, including bam2fastx, which is now in a different conda package:
https://github.com/pacificbiosciences/pbtk/

The purpose of this ticket is to update the references and the location from which the pipeline fetches the code, so that the latest version can be used.