phac-nml / ncov-dehoster
Removal of human reads from ncov nanopore sequencing data
Want to change how the input data and pipeline are parsed to bring them in line with more recent developments
Ideal Flow Changes
Docs:
--flat
oneOther:
#17 works as a quick fix for outputs that have no fastq data, and it prevents the next process, regenerateFast5s_MM2, from failing.
However, it still creates the fastq file and folder in the final fastq_pass output directory. A change should be made so that the empty fastq file is also filtered out, giving a cleaner output
This will allow more versatility and be easier to run for most users
Current Problem:
When I change .fastq to .fastq.gz in the line at ncov-dehoster/conf/illumina.config (line 36 in a786941), it doesn't pull files correctly from the directory
Hi,
I am really glad to have found such great code. I may be doing something wrong, but I can't get any filtered fastqs after running the command below. Also, can the resume function be used with this pipeline? I tried to find contact information but couldn't, so I decided to leave my question here. Please let me know if you would rather not have this question here, and I will delete it.
Thank you,
nextflow run phac-nml/ncov-dehoster -profile conda --nanopore --minimap2 --fastq_directory MG_adapted --human_ref GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --run_name adative_seq -resume --min_length 1
Basically, when a dehosted nanopore fastq file is empty (0 reads) in the minimap2 process, the fast5-dehost-regenerate.sh script fails to run.
Either the two processes (regenerateFastqFiles and regenerateFast5s_MM2) need to change, or we need a way to filter the regenerateFastqFiles tuple on fastq read count
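One way to picture the "filter on fastq count" option is the Python sketch below. It is only an illustration of the idea, not the pipeline's code: the helper names count_fastq_reads and keep_nonempty are hypothetical, and it assumes each record spans exactly four lines and that a sample is a (name, fastq path) pair.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a FASTQ file, assuming four lines per record."""
    # Assumption: files ending in .gz are gzip-compressed, others are plain text.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def keep_nonempty(samples):
    """Keep only (name, fastq_path) pairs whose FASTQ holds at least one read."""
    return [(name, fastq) for name, fastq in samples if count_fastq_reads(fastq) > 0]
```

Samples dropped by keep_nonempty would never reach the regeneration step, so the script would not be asked to work on an empty file.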
Allow user to specify a flat output option
The current output is organized as follows to make it easier to work with demultiplexed data from the instrument:
YYYYY
├── removal_summary.csv
└── run
    ├── fast5_pass
    │   ├── barcode27
    │   │   ├── barcode27_0.fast5
    │   │   ├── barcode27_10.fast5
    │   │   ├── barcode27_11.fast5
    │   │   ├── barcode27_1.fast5
    │   │   ├── barcode27_2.fast5
    │   │   ├── barcode27_3.fast5
    │   │   ├── barcode27_4.fast5
    │   │   ├── barcode27_5.fast5
    │   │   ├── barcode27_6.fast5
    │   │   ├── barcode27_7.fast5
    │   │   ├── barcode27_8.fast5
    │   │   ├── barcode27_9.fast5
    │   │   └── filename_mapping.txt
    │   ├── barcode51
    │   │   ├── barcode51_0.fast5
    │   │   ├── barcode51_1.fast5
    │   │   ├── barcode51_2.fast5
    │   │   ├── barcode51_3.fast5
    │   │   ├── barcode51_4.fast5
    │   │   ├── barcode51_5.fast5
    │   │   ├── barcode51_6.fast5
    │   │   ├── barcode51_7.fast5
    │   │   ├── barcode51_8.fast5
    │   │   └── filename_mapping.txt
    │   └── barcode96
    │       ├── barcode96_0.fast5
    │       └── filename_mapping.txt
    ├── fastq_pass
    │   ├── barcode27
    │   │   └── barcode27.host_removed.fastq
    │   ├── barcode29
    │   │   └── barcode29.host_removed.fastq
    │   ├── barcode51
    │   │   └── barcode51.host_removed.fastq
    │   └── barcode96
    │       └── barcode96.host_removed.fastq
    └── sequencing_summary.txt
The change would make the fastq_pass and fast5_pass directories flat (no additional subdirectories):
YYYYY
├── removal_summary.csv
└── run
    ├── fast5_pass
    │   ├── barcode27_0.fast5
    │   ├── barcode27_10.fast5
    │   ├── barcode27_11.fast5
    │   ├── barcode27_1.fast5
    │   ├── barcode27_2.fast5
    │   ├── barcode27_3.fast5
    │   ├── barcode27_4.fast5
    │   ├── barcode27_5.fast5
    │   ├── barcode27_6.fast5
    │   ├── barcode27_7.fast5
    │   ├── barcode27_8.fast5
    │   ├── barcode27_9.fast5
    │   ├── barcode51_0.fast5
    │   ├── barcode51_1.fast5
    │   ├── barcode51_2.fast5
    │   ├── barcode51_3.fast5
    │   ├── barcode51_4.fast5
    │   ├── barcode51_5.fast5
    │   ├── barcode51_6.fast5
    │   ├── barcode51_7.fast5
    │   ├── barcode51_8.fast5
    │   ├── barcode96_0.fast5
    │   └── filename_mapping.txt
    ├── fastq_pass
    │   ├── barcode27
    │   ├── barcode27.host_removed.fastq
    │   ├── barcode29.host_removed.fastq
    │   ├── barcode51.host_removed.fastq
    │   └── barcode96.host_removed.fastq
    └── sequencing_summary.txt
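As a rough illustration of what a flat option might do after the run, the Python sketch below moves every file out of the per-barcode subdirectories into the parent directory. It is not the pipeline's implementation: flatten_directory is a hypothetical helper, and merging the per-barcode filename_mapping.txt files by appending them into one file is an assumption made only because the flat layout shows a single mapping file.

```python
import shutil
from pathlib import Path


def flatten_directory(parent, merge_names=("filename_mapping.txt",)):
    """Move files from each immediate subdirectory of `parent` up into `parent`."""
    parent = Path(parent)
    subdirs = sorted(p for p in parent.iterdir() if p.is_dir())
    for subdir in subdirs:
        for item in sorted(subdir.iterdir()):
            target = parent / item.name
            if item.name in merge_names and target.exists():
                # Assumption: mapping files are plain text and can be concatenated.
                with target.open("a") as out, item.open() as src:
                    out.write(src.read())
                item.unlink()
            elif target.exists():
                raise FileExistsError(f"{target} already exists")
            else:
                shutil.move(str(item), str(target))
        subdir.rmdir()  # the subdirectory is empty at this point
```

Calling flatten_directory on fast5_pass and then on fastq_pass would produce a layout similar to the flat tree above. Raising on any other name collision is a deliberate choice in this sketch so that no reads are silently overwritten.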
I think that this should work if you want to cut the pass in 30-33.
if (read.reference_name != contig_ID and read.mapping_quality >= remove_minimum_quality) or read.query_name in reads_to_remove_set:
    h_count += 1
    pass
If you never plan on passing an actual value to reads_to_remove_set, you can also just cut that as well and do
if read.reference_name != contig_ID and read.mapping_quality >= remove_minimum_quality:
    h_count += 1
    pass
Originally posted by @ConnorChato in #14 (comment)
Significant speed enhancement that should be done sooner rather than later
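The combined condition above can be tried in isolation with the short Python sketch below. The Read dataclass is a hypothetical stand-in for an alignment record (the attribute names mirror those used in the snippet), and is_host_read is an illustrative helper, not a function from the repository.

```python
from dataclasses import dataclass


@dataclass
class Read:
    """Hypothetical stand-in for an alignment record."""
    query_name: str
    reference_name: str
    mapping_quality: int


def is_host_read(read, contig_ID, remove_minimum_quality, reads_to_remove_set=frozenset()):
    """Return True when the read should be counted as host and removed."""
    maps_to_host = (
        read.reference_name != contig_ID
        and read.mapping_quality >= remove_minimum_quality
    )
    # The explicit name set is optional; an empty default makes it a no-op.
    return maps_to_host or read.query_name in reads_to_remove_set


def count_host_reads(reads, contig_ID, remove_minimum_quality, reads_to_remove_set=frozenset()):
    """Count reads flagged as host with a single check per read."""
    return sum(
        is_host_read(r, contig_ID, remove_minimum_quality, reads_to_remove_set)
        for r in reads
    )
```

Keeping the set lookup last means it is only evaluated when the mapping check has not already flagged the read.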
As the title says, keep the full name (header line) of each fastq read generated in the nanopore minimap2 pipeline, so that we can do strict demultiplexing after host removal and run other steps that use that information.
Example, currently:
@2f33f219-86dd-4b14-9442-2b1e068b7b22 runid=28fdafe3bad100kjfd94mh2bmfda932 read=84325 ch=426 start_time=2021-04-15T08:56:07.367881+00:00 flow_cell_id=FAR protocol_group_id=YYYY sample_id=YYYY barcode=barcode96 barcode_alias=barcode96 barcode=barcode96
ATGC
+
%$%'
Becomes
@2f33f219-86dd-4b14-9442-2b1e068b7b22
ATGC
+
%$%'
We want to keep the whole header line
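A minimal Python sketch of the desired behavior follows: the record is read and written back with its header line untouched, and the read ID is derived separately when needed. The function names iter_fastq, write_fastq, and read_id are hypothetical, and the sketch assumes plain four-line FASTQ records.

```python
import io


def iter_fastq(handle):
    """Yield (header, sequence, quality) tuples, keeping the full header line."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def write_fastq(records, handle):
    """Write records back out without trimming the header."""
    for header, sequence, quality in records:
        handle.write(f"{header}\n{sequence}\n+\n{quality}\n")


def read_id(header):
    """Return only the read ID (first whitespace-separated field, without '@')."""
    return header.split()[0].lstrip("@")
```

Carrying the full header through the records and calling read_id only for lookups keeps fields such as barcode= available for demultiplexing after host removal.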
Add in nanopore dehosting support with the following stipulations:
Slightly tricky as guppy is proprietary and must be configured on the system beforehand to run this properly