fmalmeida / ngs-preprocess

A pipeline for preprocessing NGS data from Illumina, Nanopore and PacBio technologies

Home Page: https://ngs-preprocess.readthedocs.io/

License: GNU General Public License v3.0

Nextflow 97.60% Dockerfile 2.40%
ngs-preprocess illumina pacbio nextflow pipeline trimgalore nanopack bax2bam porechop reproducible-research

ngs-preprocess's Introduction

Hello 😁 👋

Hello there, my name is Felipe Almeida. I am a Brazilian scientist, bioinformatician, pipeline developer and problem solver. My main interests are bioinformatics, genomic surveillance, precision medicine, and microbial genomics. You can also find me on Twitter (@fmarquesalmeida), Stack Overflow and LinkedIn.

Academic info

I'm a PhD student at the University of Brasilia, in the CompGen (Computational Genomics) laboratory, under the academic guidance of Prof. Georgios J. Pappas Jr., PhD.

Some of my favourite tools:

Nextflow Python R bash

ngs-preprocess's People

Contributors

fmalmeida

ngs-preprocess's Issues

SRA fetch and preprocess of illumina with FASTP hangs

This command

nextflow run fmalmeida/ngs-preprocess   -r dev -latest -profile docker --sra_ids "./input/sra_ids.txt"   --output illumina_single  --shortreads_type "single"   --fastp_additional_parameters " --trim_front1 5 --trim_tail1 5 "

hangs during processing, with the FASTP process stuck at 0 of 1:

[54/9c682c] process > SRA_FETCH:GET_FASTQ (SRR28776895)    [100%] 1 of 1 ✔
[32/48d60d] process > SRA_FETCH:GET_METADATA (SRR28776895) [100%] 1 of 1 ✔
[-        ] process > NANOPORE:PORECHOP                    -
[-        ] process > NANOPORE:FILTER                      -
[-        ] process > NANOPORE:NANOPACK                    -
[-        ] process > PACBIO:BAM2FASTQ                     -
[-        ] process > PACBIO:NANOPACK                      -
[-        ] process > PACBIO:FILTER                        -
[17/78fe41] process > ILLUMINA:FASTP (SRR28776895)         [  0%] 0 of 1

However, three FASTQ files are produced, and running the following command after the previous one completes the preprocessing:

nextflow run fmalmeida/ngs-preprocess   -r dev -latest -profile docker   --shortreads "illumina_single/SRA_FETCH/FASTQ/SRR28776895_data/*.fastq.gz" \
   --output illumina_single  --shortreads_type "single"   --fastp_additional_parameters " --trim_front1 5 --trim_tail1 5 "
executor >  local (3)
[-        ] process > SRA_FETCH:GET_FASTQ            -
[-        ] process > SRA_FETCH:GET_METADATA         -
[-        ] process > NANOPORE:PORECHOP              -
[-        ] process > NANOPORE:FILTER                -
[-        ] process > NANOPORE:NANOPACK              -
[-        ] process > PACBIO:BAM2FASTQ               -
[-        ] process > PACBIO:NANOPACK                -
[-        ] process > PACBIO:FILTER                  -
[64/63cded] process > ILLUMINA:FASTP (SRR28776895_2) [100%] 3 of 3 ✔

My guess is that Nextflow does not automatically pass the downloaded SRA files on to the FASTP step. Perhaps there is a flag I missed.

change to bioconda images

Instead of creating a custom docker image with all tools, reconfigure the pipeline to use the bioconda channels and images, which will enable users to run the tool with conda, docker or singularity.

SRA NCBI fetch and preprocessing only works automatically for Illumina sequences

Using fmalmeida/ngs-preprocess:v2.6.

Nextflow should identify the sequencing platform and route preprocessing to the nanopore, pacbio or illumina workflow accordingly.


ERROR ~ Error executing process > 'SRA_FETCH:GET_FASTQ (SRR9641620)'

Caused by:
  Process `SRA_FETCH:GET_FASTQ (SRR9641620)` terminated with an error exit status (3)

Command executed:

  fasterq-dump \
    --include-technical \
    --split-files \
    --threads 2 \
    --outdir ./SRR9641620_data \
    --progress \
    SRR9641620

Command exit status:
  3

Command output:
  (empty)

Command error:
  2024-05-09T13:37:02 fasterq-dump.3.0.3 err: accession 'SRR9641620' is PACBIO, please use fastq-dump instead
  fasterq-dump quit with error code 3
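
The error message itself suggests falling back to fastq-dump for PacBio accessions. A minimal sketch of that choice follows, assuming a platform label is already available from the run metadata. The function name `pick_sra_downloader` is hypothetical and is not part of the pipeline; how the label is obtained is not shown.

```shell
# Hypothetical helper: choose an sra-tools downloader from a platform label.
# Assumption: the label (e.g. "PACBIO", "ILLUMINA") was read from run metadata.
pick_sra_downloader() {
  # Normalize to upper case so "pacbio" and "PACBIO" are treated alike.
  platform=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  case "$platform" in
    PACBIO*) echo "fastq-dump" ;;    # fasterq-dump refuses PacBio accessions (error above)
    *)       echo "fasterq-dump" ;;  # default used by the failing command above
  esac
}

pick_sra_downloader "PACBIO"     # prints: fastq-dump
pick_sra_downloader "illumina"   # prints: fasterq-dump
```

The helper only prints a tool name; the download flags would still need to be adapted per tool, since fastq-dump and fasterq-dump do not share the same options.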

change structure of output directory

The structure of the output directory is not standardized and needs some changes to give easy access to the final (preprocessed) reads.

It would be nice to have:

  • A final directory, probably called final_output, that will contain all final (trimmed and filtered) FASTQ files, in fq.gz format to standardize filenames.
  • This directory will hold all results and separate the reads into subdirectories (for longreads or shortreads).
  • The other files (quality, merging steps, correction steps, etc.) would then be saved in other directories, one for each step, software or strain. This still needs more thought.

More brainstorming about this issue is still required before implementing it. Help is needed to decide the structure (@gpappasunb).
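
As one possible reading of the proposal, the layout could be created as sketched below. Only final_output and the shortreads/longreads split come from the issue; the `quality_reports` directory and the function name `make_output_layout` are placeholders, since the structure has not been decided.

```shell
# Sketch only: the issue has not settled on a final structure.
make_output_layout() {
  outdir=$1
  # Final (trimmed and filtered) reads, split by read type, stored as *.fq.gz.
  mkdir -p "$outdir/final_output/shortreads" "$outdir/final_output/longreads"
  # Hypothetical per-step directory for everything that is not a final read file.
  mkdir -p "$outdir/quality_reports"
}

# Example: build the layout inside a throwaway directory and list it.
demo_dir=$(mktemp -d)
make_output_layout "$demo_dir"
find "$demo_dir" -type d | sort
```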

add nf-test

Add nf-tests for pipeline integrity checks

update module to fetch data from sra

Currently, the pipeline decides how to split downloaded data between modules by matching the patterns illumina, pacbio and nanopore.
But what if a downloaded dataset is not from any of these platforms?
Think about how to better approach channel splitting.
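
One way to make the unmatched case explicit is a catch-all branch. The sketch below is written as a plain shell function rather than Nextflow channel logic; the function name `route_platform` and the UNSUPPORTED label are hypothetical, while ILLUMINA, PACBIO and NANOPORE mirror the workflow names shown in the pipeline logs.

```shell
# Hypothetical router: map a platform label to the workflow that should handle it.
route_platform() {
  # Lower-case the label so matching is case-insensitive.
  label=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$label" in
    *illumina*) echo "ILLUMINA" ;;
    *pacbio*)   echo "PACBIO" ;;
    *nanopore*) echo "NANOPORE" ;;
    *)          echo "UNSUPPORTED" ;;  # explicit fallback instead of silently dropping data
  esac
}

route_platform "OXFORD_NANOPORE"   # prints: NANOPORE
route_platform "ION_TORRENT"       # prints: UNSUPPORTED
```

With an explicit fallback, the pipeline could warn about or skip unsupported accessions instead of leaving them unrouted.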

Add example of non-bacterial dataset analysis (paper review)

Background
This issue is meant to address the comments received on the paper review here.

Description
Generate a new page in the web documentation showing the analysis of a fungal or plant sequencing dataset. Make sure it includes the necessary command lines from input to output, so that one can reproduce the analysis, and also add an overview of the generated results to the page.

Once done, check how easily we can update the paper to provide an additional Zenodo record for the non-bacterial analysis (ngs-preprocess + MpGAP).

Suggestion for hybrid error correction

Hi there,

Found your wrapper over Twitter, great initiative :). I have a suggestion for your pipeline: it would be of interest to consider hybrid correction (i.e. combining short and long reads). In my current pipeline I use fmlrc, and combined with ONT data it works pretty well.

Cheers,

Tuan

standard profile to not load docker

Instead of having the standard profile of the pipeline load Docker automatically, it is better to load no container or environment engine by default, so that the pipeline acts as a simple local pipeline.

Users who want one of the available profiles must then select it explicitly with -profile docker/singularity/conda.

Enhance documentation (paper review)

Background
This issue is meant to address the comments received on the paper review here.

Description
Create an "Output" page to help users understand the output structure and to point to the correct tool-specific links, as is done in the bacannot documentation page. That page gives users an interpretation of the generated results, including the directory structure and the relevant links to tool-specific reference material.

new tool for long reads QC

A new tool for long reads quality assessment is now available:

The task is to evaluate the tool and compare it with NanoPack and pycoQC, in order to decide whether it is worth including in the pipeline or using as a replacement for one of those tools.
