phac-nml / ncov-dehoster

Removal of human reads from ncov nanopore sequencing data

Nextflow 74.95% Python 20.42% Shell 4.63%

ncov-dehoster's People

Contributors

Stargazers

Watchers

Forkers

bccdc-phl anwarmz james778800 erikadva

ncov-dehoster's Issues

Docs update, Flow update, and Help statement update

Change how the input data and pipeline are parsed to be more in line with recent developments.

Ideal Flow Changes

Validate inputs with groovy script
- Will change how the main.nf script flows (simplifies it)
Help command with groovy script
Print given inputs to log at the start
Fix trace/execution outputs
Add to tests

Docs:

Better description of inputs and of when to use certain flags, such as --flat
Update Manifest

Other:

Add license

Better Fastq Filtering [ Nanopore Minimap2 ]

#17 works as a quick fix for outputs that contain no fastq data, preventing the next process, regenerateFast5s_MM2, from failing.

However, it still creates the fastq file and folder in the final fastq_pass output directory. A change should be made so that the empty fastq file is also filtered out, giving a cleaner output.

Make pipeline work on Illumina gzipped fastq files

This would make the pipeline more versatile and easier to run for most users.

Current Problem:

When I change the .fastq to .fastq.gz in the line:

ncov-dehoster/conf/illumina.config

Line 36 in a786941

 fastq_searchpath.add(params.directory.toString() + '/**' + suffix.toString() + '.fastq') 

it doesn't pull files correctly from the directory.
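To make the intended matching concrete, here is a minimal Python sketch of a search that accepts both plain and gzipped fastq files. It only illustrates the desired behaviour and is not the Nextflow/Groovy config itself; the function name `find_fastqs`, the `FASTQ_SUFFIXES` tuple, and the example suffix `_R1` are assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical list of accepted extensions (assumption, not from the pipeline)
FASTQ_SUFFIXES = (".fastq", ".fastq.gz")


def find_fastqs(directory, suffix=""):
    """Recursively collect files whose names end in suffix + '.fastq' or '.fastq.gz'."""
    matches = []
    for path in Path(directory).rglob("*"):
        if path.is_file() and any(
            path.name.endswith(suffix + ext) for ext in FASTQ_SUFFIXES
        ):
            matches.append(path)
    return sorted(matches)
```

Called as `find_fastqs(directory, "_R1")`, this would return every file under `directory` ending in `_R1.fastq` or `_R1.fastq.gz`.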

not working regenerateFastqFiles.

Hi,
I am really happy to have found such great code. Maybe I am doing something wrong, but I can't get any filtered fastqs after running the command line below. In addition, can the resume function be used with this pipeline? I tried to find contact information but couldn't, so I decided to leave my question here. Please let me know if you would rather not have this question here, and I will delete it.

Thank you,

nextflow run phac-nml/ncov-dehoster -profile conda --nanopore --minimap2 --fastq_directory MG_adapted --human_ref GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --run_name adative_seq -resume --min_length 1

Empty nanopore fastq files cause fast5 regeneration to fail minimap2

Basically, when a dehosted nanopore fastq file is empty (0 reads) in the minimap2 process, the fast5-dehost-regenerate.sh script fails to run.

We either need to change the two processes (regenerateFastqFiles and regenerateFast5s_MM2) or find a way to filter the regenerateFastqFiles tuple on fastq count.
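The second option, filtering on fastq count, could look roughly like the Python sketch below. It assumes the standard four-lines-per-record FASTQ layout and uncompressed files; `count_fastq_reads` and `keep_non_empty` are hypothetical helper names, and in the pipeline itself the equivalent filter would live in the Nextflow workflow rather than in Python.

```python
def count_fastq_reads(fastq_path):
    """Count records, assuming each FASTQ record spans exactly four lines."""
    with open(fastq_path) as handle:
        line_count = sum(1 for _ in handle)
    return line_count // 4


def keep_non_empty(samples):
    """Keep only (sample_name, fastq_path) tuples whose fastq has at least one read."""
    return [(name, path) for name, path in samples if count_fastq_reads(path) > 0]
```

Samples dropped here would never reach the fast5 regeneration step, so the script would not be run on an empty input.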

Flat Output Option [ Nanopore Minimap2]

Allow the user to specify a flat output option.

The current output is organized as follows to make it easier to work with demultiplexed data from the instrument:

YYYYY
    ├── removal_summary.csv
    └── run
        ├── fast5_pass
        │   ├── barcode27
        │   │   ├── barcode27_0.fast5
        │   │   ├── barcode27_10.fast5
        │   │   ├── barcode27_11.fast5
        │   │   ├── barcode27_1.fast5
        │   │   ├── barcode27_2.fast5
        │   │   ├── barcode27_3.fast5
        │   │   ├── barcode27_4.fast5
        │   │   ├── barcode27_5.fast5
        │   │   ├── barcode27_6.fast5
        │   │   ├── barcode27_7.fast5
        │   │   ├── barcode27_8.fast5
        │   │   ├── barcode27_9.fast5
        │   │   └── filename_mapping.txt
        │   ├── barcode51
        │   │   ├── barcode51_0.fast5
        │   │   ├── barcode51_1.fast5
        │   │   ├── barcode51_2.fast5
        │   │   ├── barcode51_3.fast5
        │   │   ├── barcode51_4.fast5
        │   │   ├── barcode51_5.fast5
        │   │   ├── barcode51_6.fast5
        │   │   ├── barcode51_7.fast5
        │   │   ├── barcode51_8.fast5
        │   │   └── filename_mapping.txt
        │   └── barcode96
        │       ├── barcode96_0.fast5
        │       └── filename_mapping.txt
        ├── fastq_pass
        │   ├── barcode27
        │   │   └── barcode27.host_removed.fastq
        │   ├── barcode29
        │   │   └── barcode29.host_removed.fastq
        │   ├── barcode51
        │   │   └── barcode51.host_removed.fastq
        │   └── barcode96
        │       └── barcode96.host_removed.fastq
        └── sequencing_summary.txt

The change would make fastq_pass and fast5_pass flat (no additional directories):

YYYYY
    ├── removal_summary.csv
    └── run
        ├── fast5_pass
        │   ├── barcode27_0.fast5
        │   ├── barcode27_10.fast5
        │   ├── barcode27_11.fast5
        │   ├── barcode27_1.fast5
        │   ├── barcode27_2.fast5
        │   ├── barcode27_3.fast5
        │   ├── barcode27_4.fast5
        │   ├── barcode27_5.fast5
        │   ├── barcode27_6.fast5
        │   ├── barcode27_7.fast5
        │   ├── barcode27_8.fast5
        │   ├── barcode27_9.fast5
        │   ├── barcode51_0.fast5
        │   ├── barcode51_1.fast5
        │   ├── barcode51_2.fast5
        │   ├── barcode51_3.fast5
        │   ├── barcode51_4.fast5
        │   ├── barcode51_5.fast5
        │   ├── barcode51_6.fast5
        │   ├── barcode51_7.fast5
        │   ├── barcode51_8.fast5
        │   ├── barcode96_0.fast5
        │   └── filename_mapping.txt
        ├── fastq_pass
        │   ├── barcode27
        │   ├── barcode27.host_removed.fastq
        │   ├── barcode29.host_removed.fastq
        │   ├── barcode51.host_removed.fastq
        │   └── barcode96.host_removed.fastq
        └── sequencing_summary.txt
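A minimal Python sketch of the flattening step is shown here, assuming each barcode subdirectory contains only regular files. The helper name `flatten_directory` is hypothetical, and merging same-named text files (such as filename_mapping.txt) by appending their contents is an assumption about how the single mapping file in the flat layout would be produced.

```python
import shutil
from pathlib import Path


def flatten_directory(root):
    """Move every file from root's subdirectories into root, then remove the subdirectories."""
    root = Path(root)
    for subdir in sorted(p for p in root.iterdir() if p.is_dir()):
        for item in sorted(subdir.iterdir()):
            target = root / item.name
            if target.exists():
                # Name clash (for example filename_mapping.txt):
                # append the text contents instead of overwriting (assumption)
                with open(target, "a") as out, open(item) as src:
                    out.write(src.read())
                item.unlink()
            else:
                shutil.move(str(item), str(target))
        subdir.rmdir()
```

Running it once on fast5_pass and once on fastq_pass would turn the nested layout into the flat one.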

Reformat Nanopore Host Removal Codeblock

I think that this should work if you want to cut the pass in 30-33.

if (read.reference_name != contig_ID and read.mapping_quality >= remove_minimum_quality) or read.query_name in reads_to_remove_set:
    h_count += 1
    pass

If you never plan on passing an actual value to reads_to_remove_set, you can also just cut that as well and do

if read.reference_name != contig_ID and read.mapping_quality >= remove_minimum_quality:
    h_count += 1
    pass

Originally posted by @ConnorChato in #14 (comment)

This is a significant speed enhancement that should be done sooner rather than later.
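For reference, the condition above can be exercised in isolation with a small self-contained Python sketch. The `Read` dataclass stands in for an alignment record and only carries the three attributes used in the condition; `remove_host_reads` is a hypothetical wrapper, and the contig name in the usage note is just an example value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Read:
    """Stand-in for an alignment record (assumption: only these fields are needed)."""
    query_name: str
    reference_name: Optional[str]
    mapping_quality: int


def remove_host_reads(reads, contig_ID, remove_minimum_quality,
                      reads_to_remove_set=frozenset()):
    """Return (kept_reads, h_count) using the single combined condition."""
    kept = []
    h_count = 0
    for read in reads:
        if (read.reference_name != contig_ID
                and read.mapping_quality >= remove_minimum_quality) \
                or read.query_name in reads_to_remove_set:
            h_count += 1  # counted as host and skipped
            continue
        kept.append(read)
    return kept, h_count
```

For example, `remove_host_reads(reads, "MN908947.3", 10)` would keep reads mapped to that contig, plus any read mapped elsewhere with a mapping quality below 10.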

Keep Full Fastq Read Identifiers [ Nanopore Minimap2 ]

As the title says, keep the full name of each fastq read generated in the nanopore minimap2 pipeline, so that we can do strict demultiplexing after host removal and run other steps that rely on that information.

Example. Currently:

@2f33f219-86dd-4b14-9442-2b1e068b7b22 runid=28fdafe3bad100kjfd94mh2bmfda932 read=84325 ch=426 start_time=2021-04-15T08:56:07.367881+00:00 flow_cell_id=FAR protocol_group_id=YYYY sample_id=YYYY barcode=barcode96 barcode_alias=barcode96 barcode=barcode96
ATGC
+
%$%'

Becomes

@2f33f219-86dd-4b14-9442-2b1e068b7b22
ATGC
+
%$%'

We want to keep the whole header line.
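One possible way to get the full header back is sketched below in Python: index the original headers by read ID, then restore them on the dehosted records. This assumes four-line FASTQ records held as lists of lines with trailing newlines already stripped; the helper names `read_id`, `index_full_headers`, and `restore_headers` are hypothetical and are not part of the pipeline.

```python
def read_id(header_line):
    """Return the read ID: the text after '@' up to the first whitespace."""
    return header_line[1:].split()[0]


def index_full_headers(fastq_lines):
    """Map read ID -> full header line, taking every fourth line as a header."""
    return {read_id(line): line for line in fastq_lines[0::4]}


def restore_headers(dehosted_lines, full_headers):
    """Replace truncated headers with the full ones where the read ID is known."""
    restored = list(dehosted_lines)
    for i in range(0, len(restored), 4):
        restored[i] = full_headers.get(read_id(restored[i]), restored[i])
    return restored
```

Reads whose ID is not found in the index keep their truncated header unchanged.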

Add Nanopore Dehosting

Add in nanopore dehosting support with the following stipulations:

dehosting fast5 files
re-basecalling fastq files
size selection of fastq reads

This is slightly tricky, as guppy is proprietary and must be configured on the system beforehand for this to run properly.
