Comments (8)
I'd like to second that request. When I run the pipeline, 99% of the runtime is spent in this step.
from hic.
I agree that this could be a very nice feature indeed. Actually, it might be useful for several nf-core pipelines. Will try to discuss this point with the nf-core core members ;)
from hic.
To give some further context, when inputting very large FASTQ files, the .splitFastq implementation currently used in INPUT_CHECK will:
- Read and decompress every imported FASTQ.gz
- Parse the FASTQ reads and split into chunks of reads
- Recompress the chunks
- Write the chunks to scratch storage
Because all of this currently happens on the machine where the workflow engine itself is running, and FASTQs for this type of data are very large, it ends up taking a long time. Giving the workflow engine instance more resources doesn't help much, since these steps are not parallelized, so additional CPUs make little difference.
Again, I very much appreciate the time and energy you've put into this pipeline, and I'm hopeful it can be applied to larger-scale data more easily with this change. Thanks!
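To make the cost concrete, here is a rough Python sketch of the same four steps run serially in a single process. It is only an illustration, not the Nextflow .splitFastq code: the function name `split_fastq_gz`, the `reads_per_chunk` parameter, and the chunk file naming are all hypothetical, and it assumes well-formed FASTQ with exactly four lines per read.

```python
import gzip
from itertools import islice
from pathlib import Path


def split_fastq_gz(src, out_dir, reads_per_chunk):
    """Split a gzipped FASTQ into gzipped chunks of `reads_per_chunk` reads.

    Hypothetical helper for illustration; assumes four lines per record.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    # Step 1: read and decompress the input as one sequential stream.
    with gzip.open(src, "rt") as reads:
        while True:
            # Step 2: pull the next block of records (4 lines each).
            lines = list(islice(reads, 4 * reads_per_chunk))
            if not lines:
                break
            chunk_path = out_dir / f"chunk_{len(chunk_paths):04d}.fastq.gz"
            # Steps 3 and 4: recompress the chunk and write it to storage.
            with gzip.open(chunk_path, "wt") as out:
                out.writelines(lines)
            chunk_paths.append(chunk_path)
    return chunk_paths
```

Everything here runs in one loop over one stream, so decompression, parsing, recompression, and writing all wait on each other, and extra CPUs on the same machine have nothing to pick up.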
from hic.
Hi @ieres-amgen, can you explain a little more about how you currently implement this .splitFastq?
I'm new to this workflow and tried setting "--split-fastq TRUE" in my YAML file, but the workflow didn't move forward for me.
from hic.
@Krithika-Bhuvan, this thread is about changing the general way split_fastq is implemented, not troubleshooting its current functionality.
Based on what you wrote, my guess is that you're running into issues because you're passing the wrong name for the parameter (it's "split_fastq", not "split-fastq"), but I can't say for certain without more information about the errors you see. If you encounter further issues, I highly recommend heading over to the nf-core Slack and seeking guidance there; that is the best place for troubleshooting.
from hic.
Thank you for the explanation, @ieres-amgen. Is there any way to check whether the splitting process is working? I could not locate any log files related to it during my test, so I can't tell (I used the right tags in the YAML file). Any suggestions on where to look would be helpful. Thank you!
from hic.
Not to my knowledge. Part of the disadvantage of not having this in a dedicated process is that there is no way to check on progress.
from hic.
I'm new to the pipeline, so I've been wondering if I was doing something wrong. Thank you for confirming this! It's just a waiting game now.
from hic.