Comments (14)
Hi Matthias,
For elPrep 4, we made a predictor for peak RAM use based on a set of benchmark runs. More specifically, we made such a predictor for WGS data for the elPrep filter mode. This gave us the following equation for predicting the RAM use (Y, in GB) based on the input BAM size (X, in GB): Y = 15X + 32.
This means elPrep 4 requires about 32 GB of base memory plus 15 times the input BAM size (in GB) for the elPrep filter mode, in the case of WGS data. For estimating the memory use for the sfm mode, you would need to look at the BAM size of the largest split file, which can vary between data sets.
The numbers would look a bit different for WES data. We would also need to update the predictor for elPrep 5.
Does this help? Would it be useful to update a specific predictor for your use case?
Thanks!
Charlotte
from elprep.
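As a rough illustration of the elPrep 4 predictor quoted above (Y = 15X + 32, WGS data, filter mode), here is a minimal shell sketch that turns a BAM file size into an estimate. The function names and `sample.bam` are made up for this example, and it assumes 1 GB = 10^9 bytes, since the thread does not say which unit the benchmarks used.

```shell
#!/bin/sh
# Sketch only: applies the elPrep 4 predictor from this thread (WGS, filter mode).
# Assumption: X is measured in GB with 1 GB = 10^9 bytes.

# bam_size_bytes FILE -> prints the file size in bytes
bam_size_bytes() {
    wc -c < "$1" | tr -d ' '
}

# predict_ram_gb_elprep4 BYTES -> prints the estimated peak RAM in GB (one decimal)
predict_ram_gb_elprep4() {
    LC_ALL=C awk -v bytes="$1" 'BEGIN { printf "%.1f\n", 15 * (bytes / 1e9) + 32 }'
}

# Usage (sample.bam is a placeholder path):
#   predict_ram_gb_elprep4 "$(bam_size_bytes sample.bam)"
```

For the sfm mode, the same calculation would presumably be fed the size of the largest split file rather than the whole input, as described in the comment above.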
Hi Charlotte,
Thanks! I ran the numbers and we're getting somewhat different results. For an exome of about 8 GB we see a RAM usage of about 300 GB on average (3 tests, with 20, 40 and 80 threads). Anecdotally, the more threads we used, the lower the RAM usage was (about 30 GB difference between 20 and 80 threads).
An updated predictor would be most welcome!
cheers
Matthias
NB, the command used was:
```shell
elprep filter \
$1 \
${1%-sort.bam}.bam \
--nr-of-threads 20 \
--mark-duplicates \
--mark-optical-duplicates ${1%-sort.bam}_duplicate_metrics.txt \
--optical-duplicates-pixel-distance 2500 \
--sorting-order coordinate \
--haplotypecaller ${1%-sort.bam}.vcf.gz \
--reference /references/Hsapiens/hg38/seq/hg38.elfasta \
--target-regions /references/Hsapiens/hg38/coverage/capture_regions/CMGG_WES_analysis_ROI_v2.bed \
--log-path $PWD --timed
```
from elprep.
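For readers unfamiliar with the `${1%-sort.bam}` expansions in the command above: this is standard POSIX parameter expansion, which strips the shortest matching suffix from the value. The sketch below shows how the output names are derived; `derive_prefix` and the file names are hypothetical and only serve the illustration.

```shell
#!/bin/sh
# ${var%pattern} removes the shortest match of pattern from the END of var.

# derive_prefix FILE -> prints FILE with a trailing "-sort.bam" removed
derive_prefix() {
    printf '%s\n' "${1%-sort.bam}"
}

input="sample1-sort.bam"            # hypothetical input name
prefix=$(derive_prefix "$input")    # sample1

echo "output BAM:        ${prefix}.bam"
echo "duplicate metrics: ${prefix}_duplicate_metrics.txt"
echo "VCF:               ${prefix}.vcf.gz"
```

Note that if the input name does not end in `-sort.bam`, the expansion leaves it unchanged, so an input called `other.bam` would produce an output named `other.bam.bam`. Quoting the expansions (`"$1"`, `"${1%-sort.bam}.bam"`) also keeps paths with spaces intact.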
Hi Matthias,
I have made a preliminary predictor for elPrep 5 based on benchmarks for data samples we have at our lab: Y = 24X + 3. This, however, is quite far from the numbers you saw in your runs.
I have a couple of questions:
- Would it be possible to do a run with BQSR included? BQSR smooths the quality scores and we have seen that removing the option can have an impact on the computational performance of the haplotype caller step, possibly increasing the RAM use.
- Did you compile the elPrep binary yourself? If so, which version of the Go compiler was used? Or did you download the binary from our website? If so, which version did you test?
- Would it be possible to send us the log files of the elprep runs?
- Would it be possible for us to get access to your data sample so that we can do some tests ourselves?
Thanks a lot!
Best,
Charlotte
from elprep.
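To put "quite far" into numbers, the preliminary elPrep 5 predictor (Y = 24X + 3) can be checked against the exome run reported earlier in the thread (about 8 GB input, about 300 GB peak RAM). This is only back-of-the-envelope arithmetic: the thread does not say which data type or options the elPrep 5 benchmarks covered, so the two figures may not be directly comparable. `predicted_gb` is a made-up helper name.

```shell
#!/bin/sh
# predicted_gb SLOPE INTERCEPT SIZE_GB -> prints SLOPE * SIZE_GB + INTERCEPT
# (integer arithmetic only, which is enough for round numbers)
predicted_gb() {
    echo $(( $1 * $3 + $2 ))
}

size_gb=8         # exome BAM size reported in the thread
observed_gb=300   # average peak RAM reported in the thread

predicted=$(predicted_gb 24 3 "$size_gb")
echo "predicted: ${predicted} GB"                      # 195
echo "observed:  ${observed_gb} GB"
echo "gap:       $(( observed_gb - predicted )) GB"    # 105
```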
Hi Charlotte,
- I'll rerun my testcase with BQSR enabled and keep you posted.
- We're using the precompiled binary from the website, version 5.0.1.
- I'll e-mail you the logs.
- I'll see what I can do with the data. Perhaps I can retry using a GiaB sample, which is better for sharing.
I'll keep you posted!
Matthias
from elprep.
On an unrelated note,
While converting a dbSNP VCF to elsites I noticed the process took over the whole node (80 threads) and consumed about 320 GB of RAM.
Perhaps some kind of warning should be in place so unsuspecting users don't crash their servers trying to prepare data.
Alternatively, an argument to limit resource usage could be added.
Matthias
from elprep.
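Until something like that exists in the tool itself, a user-side guard is one way to get the warning suggested above. The following is a minimal sketch, not an elPrep feature: it assumes a Linux system that exposes `MemAvailable` in `/proc/meminfo`, and the helper names and the 320 threshold (taken from the observation above) are illustrative only.

```shell
#!/bin/sh
# mem_available_gb [MEMINFO_FILE] -> prints MemAvailable in whole GiB
# (/proc/meminfo reports the value in kB; defaults to /proc/meminfo)
mem_available_gb() {
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# check_memory REQUIRED_GB [MEMINFO_FILE]
# Returns 0 if enough memory is available, 1 if not, 2 if it cannot tell.
check_memory() {
    required=$1
    available=$(mem_available_gb "$2")
    if [ -z "$available" ]; then
        echo "warning: could not read MemAvailable" >&2
        return 2
    fi
    if [ "$available" -lt "$required" ]; then
        echo "warning: ${available} GiB available, ${required} GiB wanted" >&2
        return 1
    fi
    return 0
}

# Usage (file names are placeholders):
#   check_memory 320 && elprep vcf-to-elsites dbsnp.vcf dbsnp.elsites
```

This only warns before launch; it does not cap what the process uses once it is running.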
A quick test with BQSR on 80 threads reduces RAM usage by about 20GB (270GB total), so you were right about it having an effect on the requirements!
Matthias
edit: removed off topic remark
from elprep.
@matthdsm I opened new issues for your two side notes. I hope you have been notified of my answers there.
Thanks,
Pascal
from elprep.
Duly noted!
from elprep.
Hi Charlotte,
Are there any updates with respect to the RAM usage estimate?
Thanks
M
from elprep.
Hi Matthias,
We had a last e-mail exchange to get access to a data file on March 17th. As far as I know, there was never a reply?
Thanks!
Charlotte
from elprep.
Right, I lost sight of what had already been done. Let me get back to you!
M
from elprep.
Hi Charlotte,
To get back to this, which compression level do you use for your test input data? That might be the reason your formula doesn't work for our data. Since the input BAM is intermediate data, we only use fast compression (e.g. `samtools view -1`) to save time, which results in a bigger BAM file.
On a related note, what compression level do you use for the output BAMs? I noticed the output BAM is larger than the input, which usually isn't the case when the data is sorted.
M
from elprep.
I ran some tests on our infrastructure and came up with Y = 34X + 20.
from elprep.
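A predictor of this form can also be inverted to ask how large an input BAM a given node could handle: X = (Y - 20) / 34. The sketch below does that for a hypothetical node with 256 GB of RAM. The comment above does not say which elPrep version, mode or data type the tests used, so treat the result as indicative only; `max_bam_gb` is a made-up helper name.

```shell
#!/bin/sh
# max_bam_gb SLOPE INTERCEPT RAM_GB
# Prints the largest X (GB, rounded down to one decimal) for which
# SLOPE * X + INTERCEPT <= RAM_GB; prints 0.0 if even the intercept does not fit.
max_bam_gb() {
    LC_ALL=C awk -v a="$1" -v b="$2" -v y="$3" 'BEGIN {
        x = (y - b) / a
        if (x < 0) x = 0
        printf "%.1f\n", int(x * 10) / 10
    }'
}

# Hypothetical 256 GB node with the coefficients from the comment above
max_bam_gb 34 20 256    # prints 6.9
```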