
elPrep


Overview

elPrep is a high-performance tool for analyzing .sam/.bam files (up to and including variant calling) in sequencing pipelines. The key advantage of elPrep is that it performs only a single pass to process a .sam/.bam file, independent of the number of processing steps that need to be applied in a particular pipeline, which greatly improves runtime performance.

elPrep is designed as an in-memory and multi-threaded application to fully take advantage of the processing power available with modern servers. Its software architecture is based on functional programming techniques, which allows easy composition of multiple alignment filters and optimizations such as loop merging. To make this possible, elPrep proposes a couple of new algorithms. For example, for duplicate marking we devised an algorithm that expresses the operation as a single-pass filter using memoization techniques and hierarchical hash tables for efficient parallel synchronisation. For base quality score recalibration (BQSR) we designed a parallel range-reduce algorithm. For variant calling, we designed a parallel algorithm that overlaps execution as much as possible with other phases in the pipeline.

Our benchmarks show that elPrep executes a 5-step variant calling best practices pipeline (sorting, duplicate marking, base quality score recalibration and application, and variant calling) 6-10 times faster than other tools for whole-exome data, and 8-20 times faster for whole-genome data.

The main advantage of elPrep is very fast execution on high-end servers, such as those available through cloud computing or custom server setups. We do not recommend using elPrep on laptops, desktops, or low-end servers. Please consult the system requirements below for more details.

elPrep is being developed at the ExaScience Life Lab at Imec. For questions, use our mailing list (below), our github page, or contact us via [email protected].

Fig. 1 Improvements with elPrep 5 with respect to runtime, RAM, and disk use for a variant calling best practices pipeline on a 50x Platinum NA12878 WGS sample aligned against hg38. elPrep combines the execution of the 5 pipeline steps for efficient parallel execution.


For more benchmark details, please consult our publication list.

Advantages

The advantages of elPrep include:

  • efficient multi-threaded execution
  • operates mainly in memory; few intermediate files are generated
  • 100% equivalent output to results produced by other widely used tools
  • compatible with existing tools
  • modular, easy to add and remove pipeline steps

Licensing & Availability

elPrep 5 is released and distributed under a dual-licensing scheme:

  1. It is released as an open-source project under the terms of the GNU Affero General Public License version 3 as published by the Free Software Foundation, with Additional Terms. Please see the file LICENSE.txt for a copy of the GNU Affero General Public License and the Additional Terms.
  2. It is also available under the terms of the elPrep Premium License. This is the same code base but lifts some of the AGPL restrictions. Please see the file PREMIUMLICENSE.pdf for details. This license can be acquired via the Flintbox platform.

Go to our terms of use page for detailed information.

elPrep 5 binaries can be compiled from the source code available on GitHub, or installed via anaconda/bioconda:

conda install -c bioconda elprep

GitHub

The elPrep source code is freely available on GitHub. elPrep is implemented in Go and tested for Linux.

Dependencies

elPrep works with the .sam, .bam, and .vcf formats as input/output. There was previously a dependency on samtools to read and write .bam files, but since elPrep 4.0, .bam files are directly supported by elPrep, with no need for samtools to be present anymore. If you need support for .cram files, consider converting them to/from .bam files before/after elPrep using samtools, or other alternatives. There was previously also a dependency on bcftools to read and write .vcf.gz and .bcf files, but since elPrep 5.0, .vcf.gz files are directly supported by elPrep as well. If you need support for .bcf files, consider converting them to/from .vcf.gz files before/after elPrep using bcftools, or other alternatives.
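As a sketch, a .cram input could be handled by wrapping elPrep with samtools conversions. The file names are hypothetical, samtools is assumed to be on the PATH, and hg38.fasta stands for the reference the .cram was created with:

```shell
# CRAM -> BAM before running elPrep.
samtools view -b -T hg38.fasta -o sample.bam sample.cram

# Run an elPrep pipeline on the .bam file.
elprep filter sample.bam sample.out.bam --mark-duplicates --sorting-order coordinate

# BAM -> CRAM after elPrep.
samtools view -C -T hg38.fasta -o sample.out.cram sample.out.bam
```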

elPrep relies on its own .elsites file format for representing known sites during base quality score recalibration. Such .elsites files can be generated from .vcf files using the elPrep vcf-to-elsites command, and from .bed files using bed-to-elsites. elPrep also uses its own .elfasta file format for representing references during base quality score recalibration and variant calling. .elfasta files can be generated from .fasta files using the elPrep fasta-to-elfasta command.

There are no dependencies on other tools.

Building

The following is only relevant information if you wish to build elPrep yourself. It is not necessary to use the elPrep binary we provide.

elPrep (since version 3.0) is implemented in Go. Please make sure you have a working installation of Go. You can install Go from the Go website, or through a package manager: most package managers provide options to install Go as a development tool. Check the documentation of the package manager of your Linux distribution for details.

First checkout the elPrep sources using the following command:

	go get -u github.com/exascience/elprep

This downloads the elPrep Go source code, and creates the elprep binary in your configured Go home folder, for example ~/go/bin/elprep. See the GOPATH variable for your Go home folder.

Add the binary to your path, for example:

      export PATH=$PATH:~/go/bin

Compatibility

elPrep has been developed for Linux and has not been tested for other operating systems. We have tested elPrep with the following Linux distributions:

  • Ubuntu 14.04.5 LTS
  • Manjaro Linux
  • Red Hat Enterprise Linux 6.4 and 6.5
  • Amazon Linux 2

Memory Requirements

RAM

elPrep is designed to operate in memory, i.e. data is stored in RAM during computation. As long as you do not use the in-memory sort, mark duplicates filter, base recalibration, or variant calling, elPrep operates as a streaming tool, and peak memory use is limited to a few GB.

elPrep also provides a tool for splitting .sam files per chromosome, or more precisely per group of chromosomes, and guarantees that processing these split files and then merging the results does not lose information compared to processing the .sam file as a whole. Using the split/merge tool greatly reduces the RAM required to process a .sam file, but comes at the cost of an additional processing step.

We recommend the following minimum amounts of RAM when executing memory-intensive operations such as sorting, duplicate marking, base quality score recalibration, and the haplotype caller:

  • whole-genome 30x: 128 GB RAM using the elprep split/filter/merge mode (sfm)
  • whole-genome 50x: 200 GB RAM using the elprep split/filter/merge mode (sfm)
  • whole-exome 30x: 24 GB RAM using the elprep split/filter/merge mode (sfm)
  • whole-exome 30x: 92 GB RAM using the elprep in-memory mode (filter)

These numbers are only estimates; actual RAM use may differ for your data sets.

Disk Space

elPrep by default does not write any intermediate files, and therefore does not require additional (peak) disk space beyond what is needed for storing the input and output files. If you use the elPrep split and merge tools, elPrep requires additional disk space equal to the size of the input file.

Mailing List and Contact

Use the Google forum for discussions. You need a Google account to subscribe through the forum URL. You can also subscribe without a Google account by sending an email to [email protected].

You can also contact us via [email protected] directly.

For inquiries about commercial licensing options contact us via [email protected].

Citing elPrep

Please cite the following articles:

Herzeel C, Costanza P, Decap D, Fostier J, Wuyts R, Verachtert W (2021) Multithreaded variant calling in elPrep 5. PLoS ONE 16(2): e0244471. https://doi.org/10.1371/journal.pone.0244471

Herzeel C, Costanza P, Decap D, Fostier J, Verachtert W (2019) elPrep 4: A multithreaded framework for sequence analysis. PLoS ONE 14(2): e0209523. https://doi.org/10.1371/journal.pone.0209523

Herzeel C, Costanza P, Decap D, Fostier J, Reumers J (2015) elPrep: High-Performance Preparation of Sequence Alignment/Map Files for Variant Calling. PLoS ONE 10(7): e0132868. https://doi.org/10.1371/journal.pone.0132868

Costanza P, Herzeel C, Verachter W (2019) A comparison of three programming languages for a full-fledged next-generation sequencing tool. BMC Bioinformatics 2019 20:301. https://doi.org/10.1186/s12859-019-2903-5

If performance is below your expectations, please contact us first before reporting your results.

Examples

Variant calling pipeline for whole-genome data (WGS)

The following elprep command shows a 5-step variant calling best practices pipeline on WGS data:

elprep sfm NA12878.input.bam NA12878.output.bam
           --mark-duplicates --mark-optical-duplicates NA12878.output.metrics
           --sorting-order coordinate
           --bqsr NA12878.output.recal --known-sites dbsnp_138.hg38.elsites,Mills_and_1000G_gold_standard.indels.hg38.elsites
           --reference hg38.elfasta
           --haplotypecaller NA12878.output.vcf.gz

The command executes a pipeline that consists of 5 steps: sorting, PCR and optical duplicate marking, base quality score recalibration and application, and variant calling.

We can break up the command as follows:

  • The sfm subcommand tells elprep to run in sfm (split/filter/merge) mode. This is generally the preferred mode for WGS data, unless the data has very low coverage (<= 10x).

  • The input file is "NA12878.input.bam".

  • Output is written to a file "NA12878.output.bam" that contains the result of modifying the input bam file by performing duplicate marking, sorting, and base quality score recalibration and application.

  • The flags --mark-duplicates and --mark-optical-duplicates instruct elprep to perform PCR and optical duplicate marking respectively. The statistics generated by this are written to a file "NA12878.output.metrics".

  • The flag --sorting-order tells elprep to sort the input bam file by coordinate order.

  • The flag --bqsr instructs elprep to perform base quality score recalibration. The statistics generated by this step are written to a file "NA12878.output.recal". The --bqsr flag also needs to know the reference fasta file with which the input bam was created, cf. the "--reference hg38.elfasta" option. Note the file extension ".elfasta": elPrep requires converting the fasta file to this format before running the pipeline, via the command "elprep fasta-to-elfasta hg38.fasta hg38.elfasta". The --bqsr option also needs to know the known variant sites, passed via the "--known-sites dbsnp_138.hg38.elsites" option. Note the file extension ".elsites": elPrep requires converting vcf files to this format before running the pipeline, via the command "elprep vcf-to-elsites dbsnp_138.hg38.vcf dbsnp_138.hg38.elsites".

  • The flag --haplotypecaller instructs elprep to perform variant calling. It uses the same reference fasta as the one passed for --bqsr (via --reference). The result of this step is written to a file "NA12878.output.vcf.gz".

For details, consult the manual reference pages.

Variant calling pipeline for whole-exome data (WES)

The following elprep command shows a 5-step variant calling best practices pipeline on WES data:

elprep sfm NA12878.input.bam  NA12878.output.bam 
           --mark-duplicates --mark-optical-duplicates NA12878.output.metrics 
           --sorting-order coordinate 
           --bqsr NA12878.output.recal --known-sites dbsnp_137.hg19.elsites,Mills_and_1000G_gold_standard.indels.hg19.elsites 
           --reference hg19.elfasta 
           --haplotypecaller NA12878.output.vcf.gz 
           --target-regions nexterarapidcapture_expandedexome_targetedregions.bed 

elPrep uses an internal ".elfasta" format for representing fasta files, which can be created using the "elprep fasta-to-elfasta" command before running the pipeline. Similarly, elPrep uses an internal format for representing vcf files containing known variant sites (.elsites), which can be created using the command "elprep vcf-to-elsites".

For details, consult the manual reference pages.

Manual Reference Pages

Name

elprep filter - a commandline tool for filtering and updating .sam/.bam files and variant calling

Synopsis

elprep filter input.sam output.sam --mark-duplicates --mark-optical-duplicates output.metrics
                                   --sorting-order coordinate 
                                   --bqsr output.recal --reference hg38.elfasta --known-sites dbsnp_138.hg38.elsites 
                                   --haplotypecaller output.vcf.gz

elprep filter input.bam output.bam --mark-duplicates --mark-optical-duplicates output.metrics 
                                   --sorting-order coordinate 
                                   --bqsr output.recal --reference hg38.elfasta --known-sites dbsnp_138.hg38.elsites 
                                   --haplotypecaller output.vcf.gz

elprep filter /dev/stdin /dev/stdout --mark-duplicates --mark-optical-duplicates output.metrics 
                                     --sorting-order coordinate 
                                     --bqsr output.recal --reference hg38.elfasta --known-sites dbsnp_138.hg38.elsites	
                                     --haplotypecaller output.vcf.gz

Description

The elprep filter command requires two arguments: the input file and the output file. The input/output format can be .sam or .bam. elPrep determines the format of the input by analyzing the actual contents of the input file. The format of the output file is determined by looking at the file extension. elPrep also allows using /dev/stdin and /dev/stdout as input and output sources for use with Unix pipes. When doing so, elPrep assumes the output is in .sam format, which can be changed by additional parameters (see below).

The elprep filter command-line tool has three types of command options: filters, which implement actual .sam/.bam manipulations, sorting options, and execution-related options, for example for setting the number of threads. For optimal performance, issue a single elprep filter call that combines all filters you wish to apply.

The order in which command options are passed is ignored. For optimal performance, elPrep always applies filters/operations in the following order:

  1. filter-unmapped-reads or filter-unmapped-reads-strict
  2. filter-mapping-quality
  3. filter-non-exact-mapping-reads or filter-non-exact-mapping-reads-strict
  4. filter-non-overlapping-reads
  5. clean-sam
  6. replace-reference-sequences
  7. replace-read-group
  8. mark-duplicates
  9. mark-optical-duplicates
  10. bqsr
  11. remove-duplicates
  12. remove-optional-fields
  13. keep-optional-fields
  14. haplotypecaller

Sorting is done after filtering.
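For example, several of the filters listed above can be combined in a single call. The flags below are documented in the remainder of this manual; the file names and the mapping-quality threshold are illustrative only:

```shell
# One pass applies all requested filters in elPrep's fixed internal order,
# regardless of the order the flags are given on the command line.
elprep filter input.bam output.bam \
    --filter-mapping-quality 30 \
    --clean-sam \
    --mark-duplicates \
    --sorting-order coordinate
```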

Please also see the elprep sfm command.

Unix pipes

elPrep is compatible with Unix pipes and allows using /dev/stdin and /dev/stdout as input or output sources. elPrep analyzes the input from /dev/stdin to determine if it is in .sam or .bam format, and assumes that output to /dev/stdout is in .sam format, unless specified otherwise (see below).
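A minimal sketch of such a pipe, assuming samtools is installed (file names are hypothetical and the filter chosen is purely illustrative):

```shell
# elPrep detects the format on /dev/stdin; output on /dev/stdout defaults to .sam,
# which samtools converts back to .bam at the end of the pipe.
samtools view -h -u sample.bam \
    | elprep filter /dev/stdin /dev/stdout --filter-unmapped-reads --sorting-order coordinate \
    | samtools view -b -o sample.out.bam -
```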

Filter Command Options

--replace-reference-sequences file

This filter is used for replacing the header of a .sam/.bam file by a new header. The new header is passed as a single argument following the command option. The format of the new header can either be a .dict file or another .sam/.bam file from which you wish to extract the new header.

All alignments in the input file that do not map to a chromosome that is present in the new header are removed. Therefore, there should be some overlap between the old and the new header for this command option to be meaningful. The option is typically used to reorder the reference sequence dictionary in the header.

Replacing the header of a .sam/.bam file may destroy the sorting order of the file. In this case, the sorting order in the header is set to "unknown" by elPrep in the output file (cf. the 'so' tag).

--filter-unmapped-reads

Removes all alignments in the input file that are unmapped. An alignment is determined unmapped when bit 0x4 of its FLAG is set, conforming to the SAM specification.

--filter-unmapped-reads-strict

Removes all alignments in the input file that are unmapped. An alignment is determined unmapped when bit 0x4 of its FLAG is set, conforming to the SAM specification. Also removes alignments where the mapping position (POS) is 0 or where the reference sequence name (RNAME) is *. Such alignments are considered unmapped by the SAM specification, but some alignment programs may not mark the FLAG of those alignments as unmapped.

--filter-mapping-quality mapping-quality

Remove all alignments with mapping quality lower than the given mapping quality.

--filter-non-exact-mapping-reads

Removes all alignments where the mapping is not an exact match with the reference, although soft-clipping is allowed. This filter checks the CIGAR string and only allows occurrences of M and S.

--filter-non-exact-mapping-reads-strict

Removes all alignments where the mapping is not an exact match with reference or not a unique match. This filter checks for each read that the following optional fields are present with the following values: X0=1 (unique mapping), X1=0 (no suboptimal hit), XM=0 (no mismatch), XO=0 (no gap opening), XG=0 (no gap extension).

--filter-non-overlapping-reads bed-file

Removes all reads where the mapping positions do not overlap with any region specified in the bed file. Specifically, either the start or end of the read's mapping position must be contained in an interval, or the read is removed from the output.

This option produces a different result from --target-regions option. For the difference between both options and details on the algorithms, please consult our latest publication.

--replace-read-group read-group-string

This filter replaces or adds read groups to the alignments in the input file. This command option takes a single argument, a string of the form "ID:group1 LB:lib1 PL:illumina PU:unit1 SM:sample1", where the names following ID:, PL:, PU:, etc. can be any user-chosen names conforming to the SAM specification; see SAM Format Specification Section 1.3 for details. The string passed here can be any string conforming to a header line for the @RG tag, omitting the @RG tag itself, and using whitespace as separators instead of TABs.
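For example, using the read-group string shown above (file names hypothetical):

```shell
# Replace or add the @RG read group on all alignments in input.bam.
elprep filter input.bam output.bam \
    --replace-read-group "ID:group1 LB:lib1 PL:illumina PU:unit1 SM:sample1"
```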

--mark-duplicates

This filter marks the duplicate reads in the input file by setting bit 0x400 of their FLAG conforming to the SAM specification. For details on the algorithm and comparison to other tools, please consult our publication list.

--mark-optical-duplicates file

When the --mark-duplicates filter is passed, one can also pass --mark-optical-duplicates. This option makes sure that optical duplicate marking is performed and a metrics file is generated that contains read statistics such as number of unmapped reads, secondary reads, duplicate reads, optical duplicates, library size estimate, etc. For details on the algorithm and comparison to other tools, please consult our publication list.

The metrics file generated by --mark-optical-duplicates is compatible with MultiQC for visualisation.

--optical-duplicates-pixel-distance nr

This option allows specifying the pixel distance that is used for optical duplicate marking. This option is only usable in conjunction with --mark-optical-duplicates. The default value for the pixel distance is 100. In general, a pixel distance of 100 is recommended for data generated using unpatterned flowcells (e.g. HiSeq2500) and a pixel distance of 2500 is recommended for patterned flowcells (e.g. NovaSeq/HiSeq4000).
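For example, for data generated on a patterned flowcell (file names hypothetical):

```shell
# Use the pixel distance recommended for patterned flowcells (e.g. NovaSeq).
elprep filter input.bam output.bam \
    --mark-duplicates --mark-optical-duplicates output.metrics \
    --optical-duplicates-pixel-distance 2500
```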

--remove-duplicates

This filter removes all reads marked as duplicates, i.e. reads whose FLAG bit 0x400 is set, conforming to the SAM specification.

--remove-optional-fields [all | list]

This filter removes for each alignment either all optional fields or all optional fields specified in the given list. The list of optional fields to remove has to be of the form "tag1, tag2, ..." where tag1, tag2, etc are the tags of the optional fields that need to be deleted.

--keep-optional-fields [none | list]

This filter removes for each alignment either none of its optional fields, or all optional fields except those specified in the given list. The list of optional fields to keep has to be of the form "tag1, tag2, ..." where tag1, tag2, etc are the tags of the optional fields that need to be kept in the output.
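As a sketch of both options (the tag names RG, OQ, XA, and XS are examples, and the file names are hypothetical):

```shell
# Keep only the RG and OQ optional fields; drop all others.
elprep filter input.bam output.bam --keep-optional-fields "RG, OQ"

# Alternatively, remove only the XA and XS optional fields.
elprep filter input.bam output.bam --remove-optional-fields "XA, XS"
```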

--clean-sam

This filter fixes alignments in two ways:

  • it soft clips alignments that hang off the end of their reference sequence
  • it sets the mapping quality to 0 if the alignment is unmapped

This filter is similar to the CleanSam command of Picard.

--bqsr recal-file

This filter performs base quality score recalibration. The recal-file is used for logging the recalibration tables computed during base recalibration. This file is compatible with MultiQC for visualisation.

There are additional elprep options that can be used for configuring the base quality score recalibration:

  • --reference elfasta (required)
  • --known-sites list (optional)
  • --quantize-levels nr (optional)
  • --sqq list (optional)
  • --max-cycle nr (optional)
  • --target-regions bed-file (optional)

See detailed descriptions of these options next.

--reference elfasta-file

This option is used to pass a reference file for base quality score recalibration (--bqsr). The reference file must be in the .elfasta format, specific to elPrep.

You can create an .elfasta file from a .fasta file using the elprep command fasta-to-elfasta. For example:

elprep fasta-to-elfasta ucsc.hg19.fasta ucsc.hg19.elfasta

You only need to pass this option once if you are using both the --bqsr and --haplotypecaller options (which both require passing a reference file).

--known-sites list

This option is used to pass a number of known polymorphic sites that will be excluded during base recalibration (--bqsr). The argument is a comma-separated list of files in the .elsites format, specific to elPrep. For example:

--known-sites Mills_and_1000G_gold_standard.indels.hg19.elsites,dbsnp_137.hg19.elsites

You can create .elsites files from .vcf or .bed files using the vcf-to-elsites and bed-to-elsites commands respectively. For example:

elprep vcf-to-elsites dbsnp_137.hg19.vcf dbsnp_137.hg19.elsites

--quantize-levels nr

This option is used to specify the number of levels for quantizing quality scores during base quality score recalibration (--bqsr). The default value is 0.

--sqq list

This option specifies a list of static quantization levels to use for quality scores during base quality score recalibration (--bqsr). The list should be of the form "[nr, nr, nr]". The default value is [].

--max-cycle nr

This option is used to specify the maximum cycle value during base quality score recalibration (--bqsr). The default value is 500.
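As a sketch, the BQSR tuning options above can be combined with --bqsr as follows (the values and file names are illustrative only):

```shell
# BQSR with static quantization levels and a raised maximum cycle value.
elprep filter input.bam output.bam \
    --bqsr output.recal --reference hg38.elfasta \
    --known-sites dbsnp_138.hg38.elsites \
    --sqq "[10, 20, 30]" --max-cycle 600
```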

--target-regions bed-file

This option can be used to restrict which reads the base recalibration operates on, by passing a .bed file that lists which genomic regions to consider. When this option is used, reads that fall outside the specified regions are removed from the output .bam file. This option is for example used when processing exomes.

This option produces a different result from --filter-non-overlapping-reads option. For the difference between both options and details on the algorithms, please consult our latest publication.

--bqsr-tablename-prefix prefix

This option can be used to determine the prefix for the table names when logging the recalibration tables. The default value ensures that the output is compatible with MultiQC. It is normally not necessary to set this option.

--mark-optical-duplicates-intermediate file

This option is used in the context of filtering files created using the elprep split command. It is used internally by the elprep sfm command, but can be used when writing your own split/filter/merge scripts.

This option tells elPrep to perform optical duplicate marking and to write the result to an intermediate metrics file. The intermediate metrics file generated this way can later be merged with other intermediate metrics files, see the merge-optical-duplicates-metrics command.

--bqsr-tables-only table-file

This option is used in the context of filtering files created using the elprep split command. It is used internally by the elprep sfm command, but can be used when writing your own split/filter/merge scripts.

This option tells elPrep to perform base quality score recalibration and to write the result of the recalibration to an intermediate table file. This table file will need to be merged with other intermediate recalibration results during the application of the base quality score recalibration. See the --bqsr-apply option.

--bqsr-apply path

This option is used when filtering files created by the elprep split command. It is used internally by the elprep sfm command, and can be used when writing your own split/filter/merge scripts.

This option is used for applying base quality score recalibration on an input file. It expects a path parameter that refers to a directory that contains intermediate recalibration results for multiple files created using the --bqsr-tables-only option.

--haplotypecaller vcf-file

This option performs variant calling for detecting germline SNPs and indels. The vcf-file is used for storing the vcf output. This file can be gzipped (.vcf.gz).

There are additional elprep options that can be used for configuring the haplotype variant caller:

  • --reference elfasta (required)
  • --reference-confidence [GVCF | BP_RESOLUTION | NONE] (optional)
  • --sample-name name (optional)
  • --activity-profile igv-file (optional)
  • --assembly-regions igv-file (optional)
  • --assembly-region-padding nr (optional)
  • --target-regions (optional)

See detailed descriptions of these options next.

--reference elfasta

This option is used to pass a reference file for variant calling (--haplotypecaller). The reference file must be in the .elfasta format, specific to elPrep.

You can create an .elfasta file from a .fasta file using the elprep command fasta-to-elfasta. For example:

elprep fasta-to-elfasta ucsc.hg19.fasta ucsc.hg19.elfasta

You only need to pass this option once if you are using both the --bqsr and --haplotypecaller options (which both require passing a reference file).

--reference-confidence [GVCF | BP_RESOLUTION | NONE]

This option is used to set the mode for emitting reference confidence scores when performing variant calling (--haplotypecaller). There are three options to choose from:

  • GVCF (default): emit the GVCF format, i.e. the reference model is written with condensed non-variant blocks
  • BP_RESOLUTION: the reference model is emitted site by site
  • NONE: reference confidence calls are not emitted
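For example, to emit a GVCF (the .g.vcf.gz name is a common convention, not a requirement; file names hypothetical):

```shell
# GVCF is the default mode, shown here explicitly for clarity.
elprep filter input.bam output.bam \
    --haplotypecaller output.g.vcf.gz \
    --reference hg38.elfasta \
    --reference-confidence GVCF
```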

--sample-name name

The elPrep haplotypecaller (--haplotypecaller) only works for single samples. In case the input .bam file contains multiple samples, the --sample-name option can be used to select the sample on which to operate.

--activity-profile igv-file

Use this option to output the activity profile calculated by the haplotypecaller to the given file in IGV format.

--assembly-regions igv-file

This option can be used to output the assembly regions calculated by the haplotypecaller to the specified file in IGV format.

--assembly-region-padding nr

This option specifies the number of additional bases to include around each assembly region for variant calling.

--target-regions bed-file

By default, the haplotypecaller scans the full genome for variants. Use this option to restrict the variant caller to specific regions by passing a .bed file. It is for example used when processing exomes.

You only need to pass this option once if you are using both the --bqsr and --haplotypecaller options.

Sorting Command Options

--sorting-order [keep | unknown | unsorted | queryname | coordinate]

This command option determines the order of the alignments in the output file. The command option must be followed by one of five possible orders:

  1. keep: The original order of the input file is preserved in the output file. This is the default setting when the --sorting-order option is not passed. Some filters may change the order of the input, in which case elPrep forces a sort to recover the order of the input file.
  2. unknown: The order of the alignments in the output file is undetermined; elPrep performs no sorting of any form. The order in the header of the output file will be unknown.
  3. unsorted: The alignments in the output file are unsorted; elPrep performs no sorting of any form. The order in the header of the output file will be unsorted.
  4. queryname: The output file is sorted according to the query name. The sort is enforced and guaranteed to be executed. If the original input file is already sorted by query name and you wish to avoid a sort with elPrep, use the keep option instead.
  5. coordinate: The output file is sorted according to coordinate order. The sort is enforced and guaranteed to be executed. If the original input file is already sorted by coordinate order and you wish to avoid a sort with elPrep, use the keep option instead.
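For example, a standalone coordinate sort with no other filters (file names hypothetical; per the input/output rules above, the .bam extension selects .bam output):

```shell
elprep filter input.sam output.bam --sorting-order coordinate
```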

Execution Command Options

--nr-of-threads number

This command option sets the number of threads that elPrep uses during execution. The default number of threads is equal to the number of CPU threads.

It is normally not necessary to set this option. elPrep by default allocates the optimal number of threads.

--timed

This command option is used to time the different phases of the execution of the elprep command, e.g. time spent on reading from file into memory, filtering, sorting, etc.

It is normally not necessary to set this option. It is only useful to get some details on where execution time is spent.

--log-path path

This command option is used to specify a path where elPrep can store log files. The default path is the logs folder in your home path (~/logs).

Format conversion tools

elPrep uses internal formats for representing .vcf, .bed, or .fasta files used by specific filter/sfm options. elPrep provides commands for creating these files from existing .vcf, .bed or .fasta files.

Name

elprep vcf-to-elsites - a commandline tool for converting a .vcf file to an .elsites file

Synopsis

elprep vcf-to-elsites input.vcf output.elsites --log-path /home/user/logs

Description

Converts a .vcf file to an .elsites file. Such a file can be passed to the --known-sites suboption of the --bqsr option.

Options

--log-path path

Sets the path for writing a log file.

Name

elprep bed-to-elsites - a commandline tool for converting a .bed file to an .elsites file

Synopsis

elprep bed-to-elsites input.bed output.elsites --log-path /home/user/logs

Description

Converts a .bed file to an .elsites file. Such a file can be passed to the --known-sites suboption of the --bqsr option.

Options

--log-path path

Sets the path for writing a log file.

Name

elprep fasta-to-elfasta - a commandline tool for converting a .fasta file to an .elfasta file

Synopsis

elprep fasta-to-elfasta input.fasta output.elfasta --log-path /home/user/logs

Description

Converts a .fasta file to an .elfasta file. The --reference suboption of the --bqsr and --haplotypecaller options requires an .elfasta file.

Options

--log-path path

Sets the path for writing a log file.

Split and Merge tools

The elprep split command can be used to split up .sam files into smaller files that store the reads "per chromosome," or more precisely per group of chromosomes. elPrep determines the chromosomes by analyzing the sequence dictionary in the header of the input file and generates a split file for each group of chromosomes; the groups are chosen to be roughly equal in size, and each split file stores all read pairs that map to its group. elPrep additionally creates a file for storing the unmapped reads, and in the case of paired-end data, also a file for storing the pairs whose reads map to different chromosomes. elPrep also duplicates the latter pairs across chromosome files so that preparation pipelines have access to all the information they need to run correctly. Once processed, use the elprep merge command to merge the split files back into a single .sam file.

Splitting the .sam file into smaller files for processing "per chromosome" is useful for reducing the memory pressure as these split files are typically significantly smaller than the input file as a whole. Splitting also makes it possible to parallelize the processing of a single .sam file by distributing the different split files across different processing nodes.

We provide an sfm command that executes a pipeline while silently using the elprep filter and split/merge tools. It is of course possible to write scripts that combine the filter and split/merge tools yourself. We provide a recipe for writing your own split/filter/merge scripts on our GitHub wiki.
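
Conceptually, a manual split/filter/merge script runs three stages: split the input, filter each split file, and merge the results. The Python sketch below only builds the command lines for such a run as argv lists; it does not execute elprep, and the `splits/` and `processed/` directory names as well as the filter options are placeholder assumptions for illustration, not the actual sfm implementation.

```python
def build_pipeline_commands(input_file, output_file, split_files, filter_options):
    """Sketch: construct the argv lists for a manual split/filter/merge run.

    split_files: the file names produced by 'elprep split' in a
    hypothetical 'splits/' directory.
    filter_options: per-file 'elprep filter' options, e.g. ['--mark-duplicates'].
    """
    commands = [["elprep", "split", input_file, "splits/"]]
    for name in split_files:
        # Each split file is filtered independently; results go to a
        # separate 'processed/' directory (an assumption for this sketch).
        commands.append(["elprep", "filter", "splits/" + name,
                        "processed/" + name] + filter_options)
    # Finally, merge the processed split files back into a single output file.
    commands.append(["elprep", "merge", "processed/", output_file])
    return commands
```

A real script would also run each command (e.g. via subprocess) and, for paired-end data, process the spread file produced by elprep split.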

Name

elprep sfm - a commandline tool for filtering and updating .sam/.bam files and variant calling "per chromosome"

Synopsis

elprep sfm input.sam output.sam 
           --mark-duplicates --mark-optical-duplicates output.metrics 
           --sorting-order coordinate 
           --bqsr output.recal --reference hg38.elfasta --known-sites dbsnp_138.hg38.elsites 
           --haplotypecaller output.vcf.gz

elprep sfm input.bam output.bam 
           --mark-duplicates --mark-optical-duplicates output.metrics 
           --sorting-order coordinate 
           --bqsr output.recal --reference hg38.elfasta --known-sites dbsnp_138.hg38.elsites 
           --haplotypecaller output.vcf.gz

Description

The elprep sfm command is a drop-in replacement for the elprep filter command that minimises the use of RAM. For this, it silently calls the elprep split and merge tools to split up the data "per chromosome" for processing, which requires less RAM than processing a .sam/.bam file as a whole (see Split and Merge tools).

Options

The elprep sfm command has the same options as the elprep filter command, with the following additions.

--intermediate-files-output-type [sam | bam]

This command option sets the format of the split files. By default, elprep uses the same format as the input file for the split files. Changing the intermediate file output type may either improve runtime (.sam) or reduce peak disk usage (.bam).

--tmp-path path

This command option is used to specify a path where elPrep can store temporary files that are created (and deleted) by the split and merge commands that are silently called by the elprep sfm command. The default path is the folder from where you call elprep sfm.

--single-end

Use this command option to indicate the sfm command is processing single-end data. This information is important for the split/merge tools to operate correctly. For more details, see the description of the elprep split and elprep merge commands.

--contig-group-size number

This command option is passed to the split tool.

The elprep split command groups the sequence dictionary entries for deciding how to split up the input data. The goal is to end up with groups of sequence dictionary entries (contigs) for which the total length (sum of LN tags) is roughly the same among all groups. By default, the elprep split command identifies the sequence dictionary entry with the largest length (LN) and chooses this as a target size for the groups.

The --contig-group-size option allows configuring a specific group size. This size may be smaller than some of the sequence dictionary entries: elprep split will attempt to create as many groups of contigs of the chosen size as possible, and contigs that are "too long" will be put in their own group.

Configuring the contig group size has an impact on how large the split files are that are generated by the elprep split command. Consequently, this also impacts how much RAM elprep uses for processing the split files. The default group size determines the minimum amount of RAM that is necessary to process a .sam/.bam file without information loss.

The default value for the --contig-group-size option is 0. In that case, elprep split makes groups based on the sequence dictionary entry with the largest length (LN).

Choosing the value 1 for the --contig-group-size option tells elprep split to split the data "per chromosome", i.e. a split file is created for each entry in the sequence dictionary.
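
The grouping heuristic described above can be sketched in Python as follows. This is an illustrative approximation, not elPrep's actual implementation; contigs are (name, LN) pairs as they would be read from the @SQ header lines.

```python
def group_contigs(contigs, group_size=0):
    """Approximate the documented grouping: fill groups up to a target
    total length; contigs at or above the target get their own group.

    group_size=0 means: use the largest contig length as the target,
    mirroring the documented default.
    """
    target = group_size if group_size > 0 else max(ln for _, ln in contigs)
    groups, current, current_len = [], [], 0
    for name, ln in contigs:
        if ln >= target:
            # A "too long" contig is put in its own group.
            groups.append([name])
            continue
        if current and current_len + ln > target:
            groups.append(current)
            current, current_len = [], 0
        current.append(name)
        current_len += ln
    if current:
        groups.append(current)
    return groups
```

With group_size=1, every contig ends up in its own group, matching the documented "per chromosome" behavior.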

Name

elprep split - a commandline tool for splitting .sam/.bam files per chromosome so they can be processed without information loss

Synopsis

elprep split [sam-file | /path/to/input/] /path/to/output/ --output-prefix "split-sam-file" --output-type sam 
--nr-of-threads $threads --single-end

Description

The elprep split command requires two arguments: 1) the input file or a path to multiple input files and 2) a path to a directory where elPrep can store the split files. The input file(s) can be .sam or .bam. It is also possible to use /dev/stdin as the input for using Unix pipes. There are no structural requirements on the input file(s) for using elprep split. For example, it is not necessary to sort the input file, nor is it necessary to convert to .bam or index the input file.

If you pass a path to multiple input files to the elprep split command, elprep attempts to merge the headers, resolving potential conflicts by adhering to the SAM specification. Specifically, while merging headers: 1) the order of sequence dictionaries must be kept (@SQ); 2) read group identifiers must be unique (@RG), and in case of collisions elprep makes the IDs unique and updates optional RG tags in alignments accordingly; 3) program identifiers must be unique (@PG), and elprep changes IDs to be unique in case of collisions and updates optional PG tags in alignments accordingly; 4) comment lines are all merged (any order); 5) the order of the header is updated (GO entry). In case the headers are incompatible and merging violates any of the SAM specification requirements, elPrep produces an error and aborts the execution of the command.
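
Rule 2 above (making read group identifiers unique and recording how alignments must be updated) can be illustrated with a small Python sketch. The ".1", ".2" renaming scheme used here is a hypothetical choice for illustration; the document does not specify how elPrep renames colliding IDs.

```python
def merge_read_groups(per_file_rg_ids):
    """Merge @RG ID lists from several headers, renaming collisions.

    Returns the merged ID list plus, per input file, a mapping from old
    to new IDs, so RG tags in alignments could be rewritten accordingly.
    """
    seen, merged, renamings = set(), [], []
    for rg_ids in per_file_rg_ids:
        mapping = {}
        for rg in rg_ids:
            new_id, n = rg, 1
            while new_id in seen:
                # Hypothetical renaming scheme: append a numeric suffix.
                new_id = f"{rg}.{n}"
                n += 1
            seen.add(new_id)
            merged.append(new_id)
            mapping[rg] = new_id
        renamings.append(mapping)
    return merged, renamings
```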

elPrep creates the output directory denoted by the output path, unless the directory already exists, in which case elPrep may overwrite the existing files in that directory. Please make sure elPrep has the correct permissions for writing that directory.

By default, the elprep split command assumes it is processing paired-end data. The flag --single-end can be used for processing single-end data. The output will look different for paired-end and single-end data.

Paired-end data (default)

The split command outputs two types of files:

  1. a subdirectory "/path/to/output/splits/". The split command groups the entries in the sequence dictionary of the input file and creates a file for each of these groups containing all reads that map to that group.
  2. a "/path/to/output/output-prefix-spread.output-type" file containing all reads whose mate maps to a different entry in the sequence dictionary of the input file.

To process the files created by the elprep split command, one needs to call the elprep filter command for each entry in the path/to/output/splits/ directory as well as the /path/to/output/output-prefix-spread.output-type file. The output files produced this way need to be merged with the elprep merge command. This is implemented by the elprep sfm command.

Single-end data (--single-end)

The split command groups entries in the sequence dictionary of the input file, creates a file for each of these groups that contains all reads that map to that group, and writes those files to the /path/to/output/ directory.

To process the files created by the elprep split --single-end command, one needs to call the elprep filter command for each entry in the /path/to/output/ directory. The output files produced this way need to be merged with the elprep merge command. This is implemented by the elprep sfm command.

Options

--output-prefix name

The split command groups entries in the sequence dictionary. The purpose of this grouping is to create groups for which the lengths of the entries (LN tags) add up to roughly the same total.

The names of the split files created by elprep split are generated by combining a prefix and a chromosome group name. The --output-prefix option sets that prefix.

For example, if the prefix is "NA12878", and the split command creates N groups for the sequence dictionary of the input file, then the names of the split files will be "NA12878-group1.output-type", "NA12878-group2.output-type", "NA12878-group3.output-type", and so on. A separate file for the unmapped reads is created, e.g. "NA12878-unmapped.output-type".

If the user does not specify the --output-prefix option, the name of the input file, minus the file extension, is used as a prefix.
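
The naming convention above can be summarized in a short Python sketch (for illustration only; the exact names are generated by elprep split itself):

```python
import os

def split_file_names(input_file, n_groups, output_type, output_prefix=None):
    """List the expected split file names: one per contig group plus one
    for unmapped reads. The prefix defaults to the input file name
    without its extension, as documented."""
    prefix = output_prefix or os.path.splitext(os.path.basename(input_file))[0]
    names = [f"{prefix}-group{i}.{output_type}" for i in range(1, n_groups + 1)]
    names.append(f"{prefix}-unmapped.{output_type}")
    return names
```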

--output-type [sam | bam]

This command option sets the format of the split files. By default, elprep uses the same format as the input file for the split files.

--nr-of-threads number

This command option sets the number of threads that elPrep uses during execution for parsing/outputting .sam/.bam data. The default number of threads is equal to the number of CPU threads.

It is normally not necessary to set this option. elPrep by default allocates the optimal number of threads.

--single-end

When this command option is set, the elprep split command will treat the data as single-end data. When the option is not used, the elprep split command will treat the data as paired-end data.

--log-path path

Sets the path for writing a log file.

--contig-group-size number

The elprep split command groups the sequence dictionary entries for deciding how to split up the input data. The --contig-group-size option allows configuring a specific group size. See the description of --contig-group-size for the elprep sfm command for more details.

Name

elprep merge - a commandline tool for merging .sam/.bam files created by elprep split

Synopsis

elprep merge /path/to/input/ sam-output-file --nr-of-threads $threads --single-end

Description

The elprep merge command requires two arguments: a path to the files that need to be merged, and an output file. Use this command to merge files created with elprep split. The output file can be .sam or .bam. It is also possible to use /dev/stdout as output when using Unix pipes for connecting other tools.

Options

--nr-of-threads number

This command option sets the number of threads that elPrep uses during execution for parsing/outputting .sam/.bam data. The default number of threads is equal to the number of CPU threads.

It is normally not necessary to set this option. elPrep by default allocates the optimal number of threads.

--single-end

This command option tells the elprep merge command to treat the data as single-end data. When this option is not used, elprep merge assumes the data is paired-end, expecting the data it is merging to have been generated by the elprep split command accordingly.

--log-path path

Sets the path for writing a log file.

Name

elprep merge-optical-duplicates-metrics - a commandline tool for merging intermediate metrics files created by the --mark-optical-duplicates-intermediate option

Synopsis

elprep merge-optical-duplicates-metrics input-file output-file metrics-file /path/to/intermediate/metrics --remove-duplicates

Description

The elprep merge-optical-duplicates-metrics command requires four arguments: the names of the original input and output .sam/.bam files for which the metrics are calculated, the metrics file to which the merged metrics should be written, and a path to the intermediate metrics files that need to be merged (and were generated using --mark-optical-duplicates-intermediate).

Options

--nr-of-threads number

This command option sets the number of threads that elPrep uses during execution for parsing/outputting .sam/.bam data. The default number of threads is equal to the number of CPU threads.

It is normally not necessary to set this option. elPrep by default allocates the optimal number of threads.

--remove-duplicates

Pass this option if the metrics were generated for a file for which the duplicates were removed. This information will be included in the merged metrics file.

Extending elPrep

If you wish to extend elPrep, for example by adding your own filters, please consult our API documentation.

Acknowledgements

Many thanks for testing, bug reports, or contributions:

Amin Ardeshirdavani

Pierre Bourbon

Benoit Charloteaux

Richard Corbett

Didier Croes

Matthias De Smet

Keith James

Leonor Palmeira

Joke Reumers

Geert Vandeweyer

elprep's People

Contributors

caherzee, colindaven, dvrkps, leonorpalmeira, matthdsm, pcostanza, roelwuyts, tanakh

elprep's Issues

Converting a dbsnp vcf to elsites uses a lot of resources

While converting a dbsnp vcf to elsites I noticed the process took over the whole node (80 threads) and consumed about 320GB ram.
Perhaps some kind of warning should be in place so unsuspecting users don't crash their servers trying to prepare data.
Alternatively, an argument to limit resource usage could be added.

Matthias

Originally posted by @matthdsm in #44 (comment)

BQSR filters out reads 'not in target region'

I noticed in the logs that when using the BQSR option, reads not in target region are filtered out. In our case this is unwanted behavior, since we intend to use the output bam as our only archived copy of the data. As such we would like to keep a maximum amount of data.
The read removal doesn't seem to happen when omitting BQSR, so I'm not sure if this is a feature or a bug.

Matthias

Originally posted by @matthdsm in #44 (comment)

BaseRecalibrator maxCycle

Hello,

I was in JOBM2019 conference, and I have seen the presentation of your elprep4 tool and it looks pretty cool. I decided to test it on our data (PGM sequencing).
Unfortunately I ran into an issue I also had with the GATK base recalibrator, about the number of cycles:

2019/07/11 14:11:09 cycle value exceeds maximum cycle value

In GATK there is an option "-maxCycle" that allows avoiding this error by increasing the default (500, from memory). I believe there is no such option in elprep. Am I wrong?

Do you have any advice regarding my issue please ?

Thank you for your guidance.

MapReduce markdup

I’m wanting to use this tool in a mapreduce engine that I use (it’s not Hadoop it’s container based).

I’m wondering about how I might go about marking duplicates ( similar to Picard markdup) in a sharded manner. Currently, I have a pipeline that bins unmapped reads, maps them, and then merges the mapped bam file. I then shard the bam file into padded bins of 1Mb. I’m wondering if it’s possible to use your markdup tool on individual shards. Do you see any way forward here? Reads are currently sorted by chr coordinate, but maybe sorting by read name makes this possible?

Bioconda recipe

Hi,

Following your talk at FOSDEM earlier this week, I've taken the liberty of creating a conda recipe for elPrep. I believe this will add another layer of visibility to the tool, and make it even easier to install.
Please feel free to modify/update/whatever.

bioconda/bioconda-recipes#7576

Cheers
Matthias

Doing just variant calling /haplotypecaller? non-model organism no known-variant file for bqsr

Hi,
I got interested in elprep5 because I have been trying GATK4 but it is taking more than five days which is ridiculous.
I wanted to ask two questions. Since I have tried GATK4 I have a sorted/markeduplicated/bam file. Is there a way to just perform variant calling using elprep?

Also, I have tried to do the mapping and converting to get a .bam file as input, I am using the following job script:

./elprep sfm PA113corr.bam --mark-duplicates --mark-optical-duplicates PA113corr.metrics --sorting-order coordinate
--bqsr PA113corr.recal --haplotypecaller PA113corr.vcf.gz --reference Autosome.elfasta

and I still get and error
elprep version 5.0.2 compiled with go1.16.4 - see http://github.com/exascience/elprep for more information.
2021/06/06 03:18:42 Filename(s) in command line missing.

Thanks;

elprep split behaviour with/without trailing slash on /path/to/output/

elPrep version 2.10 (LispWorks pre-compiled distribution)

I think the documentation for elprep split could be clearer on the importance of the trailing slash on the /path/to/output/. If you omit the trailing slash, everything appears to be written to a single file called e.g. /path/to/output.sam. If the slash is included, then the data are split correctly into the directory.

This could be considered a bug because the non-splitting behaviour is probably not what is intended.

More memory needed than samtools on 1CPU

I was trying to sort a huge file and didn't succeed with elPrep (I tried with different combinations of CPUs, even 1). I went back to samtools and succeeded (on 1 CPU). Did you get any similar feedback that elPrep is using more memory than samtools in some cases?

Warning: The --sqq optional flag is set without using --bqsr.

Hi,
I am trying to use elprep for whole-exome sequencing.
I used this command:
elprep sfm /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/19-16169-A-01-00.sort.bam /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/19-16169-A-01-00.sort.elprep.bam --mark-duplicates --mark-optical-duplicates /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/19-16169-A-01-00.sort.elprep.out_metrics --sorting-order coordinate --bqsr /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/19-16169-A-01-00.sort.elprep.out_recal --sqq 10,20,30 --known-sites /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/dbsnp_138.hg19.elsites,/ngs/datagen/genetique/dev/TEST_GATK/ElPREP/Mills_and_1000G_gold_standard.indels.hg19.elsites --bqsr-reference /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/hg19.elfasta

While running, there is a warning message at the end saying that base recalibration is not requested, although I specified that I wanted it:
2019/07/09 11:07:53 Command line: [elprep filter /dev/stdin /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/19-16169-A-01-00.sort.elprep.sfm.bam --sqq 10,20,30 --bqsr-apply /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/elprep-tabs-2019-07-09T11:01:46+02:00/ --recal-file /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/19-16169-A-01-00.sort.elprep.sfm.out_recal]
2019/07/09 11:07:53 Warning: The --sqq optional flag is set without using --bqsr. This parameter is ignored because base recalibration is not requested.
2019/07/09 11:07:53 Executing command:
elprep filter /dev/stdin /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/19-16169-A-01-00.sort.elprep.sfm.bam --bqsr-apply /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/elprep-tabs-2019-07-09T11:01:46+02:00/ --quantize-levels 0 --recal-file /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/19-16169-A-01-00.sort.elprep.sfm.out_recal --sorting-order keep
2019/07/09 11:07:54 Executing command:
elprep merge /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/elprep-splits-processed-2019-07-09T11:01:46+02:00/ /dev/stdout

If I don't use sfm but filter instead, I get no warning message. Does it mean that the BQSR is not done with sfm?

Thanks

Does the command remove duplicates also does the marking of the duplicates?

Hi,
Should the --remove-duplicates option only be used in combination with --mark-duplicates? Does it also mark the duplicates or just remove previously marked ones?
(When I used both --mark-duplicates and --remove-duplicates, it seemed I got a smaller output file compared to the one I got with only the --remove-duplicates filter.)
Thank you.

panic: interface conversion: interface {} is nil, not string

Using this command (from container quay.io/biocontainers/elprep:4.1.6--0):

    elprep sfm ${bam_file} ${bam_file.baseName}.elprep.bam  \\
    --mark-duplicates --mark-optical-duplicates output.metrics \\
    --bqsr output.recal \\
    --bqsr-reference ${genome_index} \\
    --known-sites ${known_sites} \\
    --nr-of-threads ${task.cpus} \\
    ${params.elprep_options}

I got the following error

 elprep merge /home/juke34/git/test/nf_global_editing/.nextflow/workDir/9e/36a20046eb015afb9584a1738d9f6c/elprep-splits-processed-2021-02-02T13:29:40Z/ /dev/stdout --nr-of-threads 2
panic: interface conversion: interface {} is nil, not string

goroutine 10 [running]:
github.com/exascience/elprep/v4/filters.readGroupCovariate(0xc00010d730, 0xc00019a000, 0x5ba740, 0xc000182200)
	/Users/costanza/develop/go/elprep/filters/bqsr.go:76 +0x205
github.com/exascience/elprep/v4/filters.BaseRecalibratorTables.ApplyBQSRWithMaxCycle.func1.2(0xc00019a000, 0xf01)
	/Users/costanza/develop/go/elprep/filters/bqsr.go:1416 +0x9d
github.com/exascience/elprep/v4/sam.ComposeFilters.func1(0x0, 0x5b2460, 0xc0001a37c0, 0x5b2460, 0xc0001a37c0)
	/Users/costanza/develop/go/elprep/sam/filter-pipeline.go:219 +0xcb
github.com/exascience/pargo/pipeline.feed(0xc000078380, 0xc00000dd60, 0x3, 0x4, 0x0, 0x0, 0x5b2ca0, 0xc00000dda0)
	/Users/costanza/go/pkg/mod/github.com/exascience/[email protected]/pipeline/filter.go:64 +0x5b
github.com/exascience/pargo/pipeline.(*lparnode).Begin.func1(0xc000078400, 0xc000078380, 0x0)
	/Users/costanza/go/pkg/mod/github.com/exascience/[email protected]/pipeline/lparnode.go:80 +0xb3
created by github.com/exascience/pargo/pipeline.(*lparnode).Begin
	/Users/costanza/go/pkg/mod/github.com/exascience/[email protected]/pipeline/lparnode.go:58 +0x188

Feature request: working with limits on VMEM

It would be grand if ElPrep could somehow be made to work with limited virtual memory.

Our cluster consists for 90% of nodes that have ~360GB RAM available. In addition to that we process whole-genome BAM files that are around 90-120X Coverage.

From our ElPrep benchmark we guestimate that ElPrep will need between 400~550GB VMEM to run most of our samples. This would negate the runtime benefit because the jobs would spend more time in the shared-cluster queue.

Hence a feature request to be able to work with limited memory without crashing:
runtime: out of memory: cannot allocate 8192-byte block

One could imagine having ElPrep only start an sfm thread if sufficient memory is available, with the limit set via a memory-limit option.

I doubt we are the only one that would love to take ElPrep into production but are limited by hardware and are looking for a somewhat grey area in the resources X runtime landscape.

Looking forward to your thoughts on the matter.

Cheers,
Chris

exit status 2

Hi,
I run elprep. However I couldn't get a vcf file.

$ elprep sfm Sample.Aligned.sortedByCoord.out.bam Sample.output.bam --mark-duplicates --mark-optical-duplicates Sample.output.metrics --sorting-order coordinate --bqsr Sample.output.recal --known-sites dbsnp_138.hg19.elsites,Mills_and_1000G_gold_standard.indels.hg19.elsites --reference ucsc.hg19.elfasta --haplotypecaller Sample.output.vcf.gz

elprep version 5.0.2 compiled with go1.16.4 - see http://github.com/exascience/elprep for more information.

2021/06/30 19:13:41 Created log file at /home/nmc-ei/logs/elprep/elprep-2021-06-30-19-13-41-430755928-JST.log
2021/06/30 19:13:41 Command line: [elprep sfm Sample.Aligned.sortedByCoord.out.bam Sample.output.bam --mark-duplicates --mark-optical-duplicates Sample.output.metrics --sorting-order coordinate --bqsr Sample.output.recal --known-sites dbsnp_138.hg19.elsites,Mills_and_1000G_gold_standard.indels.hg19.elsites --reference ucsc.hg19.elfasta --haplotypecaller Sample.output.vcf.gz]
2021/06/30 19:13:41 Executing command:
 elprep sfm Sample.Aligned.sortedByCoord.out.bam Sample.output.bam --mark-duplicates --mark-optical-duplicates Sample.output.metrics --optical-duplicates-pixel-distance 100 --bqsr Sample.output.recal --reference ucsc.hg19.elfasta --quantize-levels 0 --max-cycle 500 --known-sites dbsnp_138.hg19.elsites,Mills_and_1000G_gold_standard.indels.hg19.elsites --haplotypecaller Sample.output.vcf.gz --sorting-order coordinate --intermediate-files-output-prefix Sample.Aligned.sortedByCoord.out --intermediate-files-output-type bam
2021/06/30 19:13:41 Splitting...
2021/06/30 19:15:39 Filtering (phase 1)...
2021/06/30 19:16:46 exit status 2

best and thanks for this utility,

eisuke

Support for multiple input files

Hi,

I was wondering if it would be possible to add support for multiple input files. This way elprep could be used to merge multiple split bam files into a single final bam. This approach is also used by GATK.

Thanks
M

Q: support for PacBio "native" format

Hi,
is elPrep also a drop-in replacement for processing BAM files in "PacBio native" format, i.e., BAM files containing unaligned long reads plus additional quality information that is used for downstream steps such as alignment with pbmm2?
Thanks for the info.
+Peter

elprep sfm run failed due to insufficient RAM

Hi,
Would it be possible to make elPrep sfm more stable by not failing due to insufficient RAM, but instead perhaps just prolonging its runtime? It is really resource demanding; e.g., elprep had a failed run with a 150x bam file of 113GB on an instance with 92 vCPUs and 192GB RAM.
Thank you.

Comment: handling RG's in SFM

With regard to handling RGs when merging different bams in sfm mode as mentioned here.
Would it be possible to just add multiple @RG headers to the bam file?
I've noticed bamsormadup does just that, while preserving the RG tag in the reads. This indirectly also leads to "multi-sample" support in the bam file.
In our current case we merge single sample bam files from different lanes, so we'd expect to see those lanes represented in the RG header lines, which would then correspond to the RG tag in the reads.

Just spitballing here.
Cheers
M

Feature request: Add quality filter

Hi,

We're very excited about elPrep, but we are missing a vital filter flag for it to be drop in replacement for our current pipeline.

Would it be possible to add a quality score filter, similar to the -q flag in samtools view?
Additionally, would it be possible to add a filter flag where the user can set a custom value to filter?

Thanks a lot
Matthias

Compilation problems

Hi,

When I tried to install your tools, I get the following error:
$ go get github.com/exascience/elprep
package github.com/exascience/elprep/v4/cmd: cannot find package "github.com/exascience/elprep/v4/cmd" in any of:
/usr/lib/go-1.7/src/github.com/exascience/elprep/v4/cmd (from $GOROOT)
/home/bvalot3/.go/src/github.com/exascience/elprep/v4/cmd (from $GOPATH

The GOPATH variable is set correctly:
$ echo $GOPATH
/home/bvalot3/.go

My configuration is:

  • Debian Strech
  • go version go1.7.4 linux/amd64

Thanks in advance

panic: runtime error: index out of range

I was trying to run elprep with most of the filters on:

2019/01/31 09:59:09 Executing command:
 elprep filter /data/input/DX123_HFJ5KDSXX_L2.sam /data/output/DX123_HFJ5KDSXX_L2.bam --mark-duplicates --mark-optical-duplicates /data/output/DX123_HFJ5KDSXX_L2.opticaldups.txt --optical-duplicates-pixel-distance 100 --remove-duplicates --bqsr /data/output/DX123_HFJ5KDSXX_L2.bqsr.txt --bqsr-reference /data/reference/hg19/ucsc_hg19.fa.elprep --quantize-levels 0 --sorting-order coordinate --nr-of-threads 30 --log-path /data/output
panic: runtime error: index out of range
goroutine 166040 [running]:
runtime/debug.Stack(0xdcc8256380, 0x0, 0xc00011e700)
	/opt/local/lib/go/src/runtime/debug/stack.go:24 +0xa7
github.com/exascience/pargo/internal.WrapPanic(0x5aa000, 0x741450, 0x741450, 0xe1cf34ab)
	/Users/caherzee/go/pkg/mod/github.com/exascience/[email protected]/internal/internal.go:41 +0x45
github.com/exascience/pargo/parallel.RangeReduce.func1.1.1(0xd559ba65a0, 0xd55ce65a50)
	/Users/caherzee/go/pkg/mod/github.com/exascience/[email protected]/parallel/parallel.go:803 +0x43
panic(0x5aa000, 0x741450)
	/opt/local/lib/go/src/runtime/panic.go:513 +0x1b9
github.com/exascience/elprep/v4/filters.computeSnpEvents(0xd0c194a3c0, 0x0, 0x0, 0x0, 0xd0c1933800, 0x1e, 0x100, 0x4, 0x1e, 0x100)
	/Users/caherzee/Documents/Work/Code/elprep/filters/bqsr.go:326 +0x3b3
github.com/exascience/elprep/v4/filters.(*BaseRecalibrator).Recalibrate.func2(0x457055b, 0x469da0b, 0xc00004a220, 0xc00004aca0)
	/Users/caherzee/Documents/Work/Code/elprep/filters/bqsr.go:952 +0x65a
github.com/exascience/pargo/parallel.RangeReduce.func1(0x457055b, 0x469da0b, 0x1, 0xd55ce65a50, 0x96)
	/Users/caherzee/go/pkg/mod/github.com/exascience/[email protected]/parallel/parallel.go:789 +0x2ba
github.com/exascience/pargo/parallel.RangeReduce.func1.1(0xd559ba65a0, 0xd55ce65a50, 0xd0c21c34e8, 0x457055b, 0x469da0b, 0x2, 0x1, 0xd559ba6590)
	/Users/caherzee/go/pkg/mod/github.com/exascience/[email protected]/parallel/parallel.go:806 +0x83
created by github.com/exascience/pargo/parallel.RangeReduce.func1
	/Users/caherzee/go/pkg/mod/github.com/exascience/[email protected]/parallel/parallel.go:801 +0x1d2

The sam file was created by mapping raw reads to reference using bwa
Reference was generated by elprep fasta-to-elfasta

>uname -a
Linux hostname 2.6.32-696.28.1.el6.x86_64 #1 SMP Wed May 9 23:09:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Any ideas?

UPDATE:
It seemed there is something wrong with snp event computing, so I tried to pass to --known-sites with elsites compiled from dbsnp by elprep vcf-to-elsites, which didn't help.

Is this a reasonable way to limit default threads used

We are very excited to try the new features in Version 5. Version 4 has saved us much time and complexity.

I know that by default GOMAXPROCS is set to all threads. This doesn't work well with our shared system. We have had people use a shell script to force a smaller default. However I was curious if the following change would work for the executable itself. I realize that duplicating the 16 is not the best software hygiene (should be a const var that could be changed just once) but this is an easy change for me to make:

cmd/filter.go:	flags.IntVar(&nrOfThreads, "nr-of-threads", 16, "number of worker threads")
cmd/merge.go:	flags.IntVar(&nrOfThreads, "nr-of-threads", 16, "number of worker threads")
cmd/merge-optical-duplicates-metrics.go:	flags.IntVar(&nrOfThreads, "nr-of-threads", 16, "number of worker threads")
cmd/sfm.go:	flags.IntVar(&nrOfThreads, "nr-of-threads", 16, "number of worker threads")
cmd/split.go:	flags.IntVar(&nrOfThreads, "nr-of-threads", 16, "number of worker threads")

best and thanks for this utility,

Jim Henderson, Cal Academy of Sciences

Type error for stream in parse-sam-header on sbcl 1.2.9 OSX

I've got further than in #2 by using the method I described. I've read the code and would expect this to work (via Gray streams), but nevertheless:

elPrep version 2.11. See http://github.com/exascience/elprep for more information.
Executing command:
  elprep /dev/stdin NA12878-chr22-10pct.only_mapped.bam --filter-unmapped-reads --sorting-order keep --gc-on 0 --nr-of-threads 1

debugger invoked on a TYPE-ERROR in thread
#<THREAD "main thread" RUNNING {10037AECC3}>:
  The value
    #S(ELPREP::BUFFERED-ASCII-INPUT-STREAM
       :INDEX 0
       :LIMIT 8192
       :BUFFER #(69 82 82 49 57 52 49 52 55 46 53 53 ...)
       :SECONDARY-BUFFER NIL
       :ELEMENT-TYPE BASE-CHAR
       :STREAM #<SB-SYS:FD-STREAM for "file /dev/fd/0" {10038C60C3}>)
  is not of type
    STREAM.

Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.

(no restarts: If you didn't do this on purpose, please report it as a bug.)

(ELPREP:PARSE-SAM-HEADER #<unavailable argument>) [tl,external]
0] 

"exit status 2" error

Hey,

It's a really exciting variant calling tool, and I would like to use it for an aligner benchmark. But I ran it on an SSD and always got the error "exit status 2". I've attached the log and the commands. Thanks.

Log:

2021/12/10 12:47:43 Splitting...
2021/12/10 12:54:06 Filtering (phase 1)...
2021/12/10 13:05:51 Filtering (phase 2) and variant calling...
2021/12/10 13:05:51 exit status 2

Command:

#1.	ref 
elprep fasta-to-elfasta /ssd-path-to-folder/data/hg37.fna \
/ssd-path-to-folder/data/elprep/hg37.elfasta
 
#2.	sites 
elprep vcf-to-elsites /ssd-path-to-folder/yan/variant-call/v2.19-out/truth_snp.recode.vcf \
/ssd-path-to-folder/data/elprep/hg37_snp.elsites
 
elprep vcf-to-elsites /ssd-path-to-folder/yan/variant-call/v2.19-out/truth_indels.recode.vcf \
/ssd-path-to-folder/data/elprep/hg37_indels.elsites
 
#3.	variant calling
elprep sfm /ssd-path-to-folder/yan/NA12878/accalign.mason.bam \
/ssd-path-to-folder/yan/NA12878/NA12878.output.bam \
--mark-duplicates --mark-optical-duplicates /ssd-path-to-folder/yan/NA12878/NA12878.output.metrics \
--sorting-order coordinate \
--bqsr /ssd-path-to-folder/yan/NA12878/NA12878.output.recal \
--known-sites /ssd-path-to-folder/data/elprep/hg37_indels.elsites,/ssd-path-to-folder/data/elprep/hg37_snp.elsites \
--reference /ssd-path-to-folder/data/elprep/hg37.elfasta \
--haplotypecaller accalign.gatk.vcf.gz

filter mismatches and or percentage identity?

Hello,

thanks for a really fast and efficient tool.

I know this is primarily targeted towards speeding up preprocessing before the GATK workflow, but I wonder whether you have considered filtering on mismatches?

It's actually not that easy to filter by mismatches, especially dynamically. With bamtools you can set a filter as JSON, e.g. NM<4, but to my knowledge you can't do this dynamically per read. That leaves pysam, which is slow and an additional dependency.

With long reads, e.g. ONT, it would be very advantageous to set a read filter defined as percent identity or NM per 100/1000 bp. This would help to better filter out poor reads, which at the moment are not easy to remove. Some aligners (ngmlr) include a percent identity filter, yet the fastest tools such as minimap2 do not.
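
To make the suggestion concrete, here is a minimal Go sketch (hypothetical helper, not elPrep code) of a per-read percent identity filter derived from the NM tag and the number of aligned bases:

```go
package main

import "fmt"

// passesIdentityFilter is a hypothetical per-read filter: it derives percent
// identity from the NM tag (edit distance) and the number of aligned bases,
// the way a dynamic mismatch filter could work inside a tool like elPrep.
func passesIdentityFilter(nm, alignedLen int, minIdentity float64) bool {
	if alignedLen == 0 {
		return false // unmapped or zero-length alignment: reject
	}
	identity := 100.0 * float64(alignedLen-nm) / float64(alignedLen)
	return identity >= minIdentity
}

func main() {
	// An ONT read with 1200 aligned bases and 180 edits:
	// identity = 100 * (1200-180)/1200 = 85%, rejected at a 90% threshold.
	fmt.Println(passesIdentityFilter(180, 1200, 90.0)) // false
	fmt.Println(passesIdentityFilter(30, 1200, 90.0))  // true (97.5%)
}
```

The same ratio expressed per 100/1000 bp is just a rescaling of this formula, so either style of threshold would fit the same filter hook.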

Are there any plans to add features like this?

cheers,
Colin

Question about the paper

I read the article you published benchmarking the elPrep implementations in C++, Java, and Go.
I'm also very interested to know how quickly each version was developed.

I would have used the mailing list/forum, but I can't send any messages to it.

testing?

There is only very light test coverage of the code base here. In my experience, ensuring a safe and robust implementation of BGZF and BAM (and, to a lesser degree, SAM) is non-trivial, so more tests are certainly warranted.

Encapsulate format conversions into elprep

Do you have plans to encapsulate the format conversions (VCF->elsites, FASTA->elfasta, BED->elsites) inside the tool, so the user does not have the overhead of running these commands separately?
These are standard bioinformatics formats, and keeping the special elprep files around can be cumbersome. Thanks :)
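
As a sketch of what such encapsulation could look like (hypothetical, not elPrep's actual code), the tool could dispatch on the file extension and convert internally instead of requiring a separate conversion step:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// needsConversion sketches how a tool could transparently accept the standard
// bioinformatics formats, converting to its internal format on the fly, so
// users never run vcf-to-elsites / fasta-to-elfasta by hand. Hypothetical.
func needsConversion(path string) bool {
	switch filepath.Ext(path) {
	case ".vcf", ".bed", ".fa", ".fasta", ".fna":
		return true // standard format: convert internally before use
	default:
		return false // already an internal format such as .elsites / .elfasta
	}
}

func main() {
	fmt.Println(needsConversion("dbsnp.vcf"))    // true
	fmt.Println(needsConversion("hg38.elfasta")) // false
}
```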

elprep drops into ldb on sbcl 1.2.9 OSX

The following error is reproducible. I traced the problem to (setup-standard-streams); if this is not called, all is well. If I rebind the standard I/O streams to sb-sys:*stderr* instead of using setq, then the problem is resolved (so far).

./elprep CORRUPTION WARNING in SBCL pid 2496(tid 140735277009248):
Memory fault at 0x1bc000 (pc=0x100116a4b2, sp=0x19ff878)
The integrity of this image is possibly compromised.
Continuing with fingers crossed.

...

Help! 11 nested errors. SB-KERNEL:*MAXIMUM-ERROR-DEPTH* exceeded.
Backtrace for: #<SB-THREAD:THREAD "main thread" RUNNING {10037B6B83}>
Help! 11 nested errors. SB-KERNEL:*MAXIMUM-ERROR-DEPTH* exceeded.
Backtrace for: #<SB-THREAD:THREAD "main thread" RUNNING {10037B6B83}>
fatal error encountered in SBCL pid 2496(tid 140735277009248):
%PRIMITIVE HALT called; the party is over.

Welcome to LDB, a low-level debugger for the Lisp runtime environment.
ldb> backtrace
Backtrace:
   0: Foreign function ldb_monitor, fp = 0x19f6f80, ra = 0x109c16
   1: Foreign function call_lossage_handler, fp = 0x19f6f90, ra = 0x10625a
   2: Foreign function lose, fp = 0x19f7080, ra = 0x1064a1
   3: Foreign function handle_trap, fp = 0x19f70b0, ra = 0x108bc2
   4: Foreign function signal_emulation_wrapper, fp = 0x19f7100, ra = 0x112ff7
   5: Foreign function stack_allocation_recover, fp = 0x19f7170, ra = 0x111bf0
   6: Foreign function stack_allocation_recover, fp = 0x19f75e8, ra = 0x111bf0
   7: SB-IMPL::ERROR-ERROR
   8: SB-IMPL::INFINITE-ERROR-PROTECTOR
   9: SB-KERNEL::INTERNAL-ERROR
  10: Foreign function call_into_lisp, fp = 0x19f76f0, ra = 0x11992e
  11: Foreign function funcall2, fp = 0x19f7720, ra = 0x10298c
  12: Foreign function interrupt_internal_error, fp = 0x19f7760, ra = 0x108ada
  13: Foreign function handle_trap, fp = 0x19f7790, ra = 0x108b87
  14: Foreign function signal_emulation_wrapper, fp = 0x19f77e0, ra = 0x112ff7
  15: Foreign function stack_allocation_recover, fp = 0x19f7850, ra = 0x111bf0
  16: Foreign function stack_allocation_recover, fp = 0x19f7cc0, ra = 0x111bf0
  17: 
  18: SB-IMPL::ERROR-ERROR
  19: SB-IMPL::INFINITE-ERROR-PROTECTOR
  20: COMMON-LISP::ERROR
  21: SB-KERNEL::TWO-ARG--
  22: SB-DEBUG::MAP-BACKTRACE
  23: (COMMON-LISP::FLET THUNK66 KEYWORD::IN SB-DEBUG::PRINT-BACKTRACE)
  24: (COMMON-LISP::LAMBDA () KEYWORD::IN SB-DEBUG::FUNCALL-WITH-DEBUG-IO-SYNTAX)
  25: SB-IMPL::CALL-WITH-SANE-IO-SYNTAX
  26: SB-IMPL::%WITH-STANDARD-IO-SYNTAX
  27: SB-DEBUG::PRINT-BACKTRACE
  28: (COMMON-LISP::LAMBDA () KEYWORD::IN SB-IMPL::ERROR-ERROR)
  29: SB-IMPL::%WITH-STANDARD-IO-SYNTAX
  30: SB-IMPL::ERROR-ERROR
  31: SB-IMPL::INFINITE-ERROR-PROTECTOR
  32: COMMON-LISP::ERROR
  33: SB-KERNEL::TWO-ARG--
  34: SB-DEBUG::MAP-BACKTRACE
  35: (COMMON-LISP::FLET THUNK66 KEYWORD::IN SB-DEBUG::PRINT-BACKTRACE)
  36: (COMMON-LISP::LAMBDA () KEYWORD::IN SB-DEBUG::FUNCALL-WITH-DEBUG-IO-SYNTAX)
  37: SB-IMPL::CALL-WITH-SANE-IO-SYNTAX
  38: SB-IMPL::%WITH-STANDARD-IO-SYNTAX
  39: SB-DEBUG::PRINT-BACKTRACE
  40: (COMMON-LISP::LAMBDA () KEYWORD::IN SB-IMPL::ERROR-ERROR)
  41: SB-IMPL::%WITH-STANDARD-IO-SYNTAX
  42: SB-IMPL::ERROR-ERROR
  43: SB-IMPL::INFINITE-ERROR-PROTECTOR
  44: COMMON-LISP::ERROR
  45: SB-SYS::MEMORY-FAULT-ERROR
  46: Foreign function call_into_lisp, fp = 0x19f8630, ra = 0x11992e
  47: Foreign function post_signal_tramp, fp = 0x19f86b8, ra = 0x119b20
  48: SB-IMPL::OUTPUT-CHAR-UTF-8-LINE-BUFFERED
  49: (COMMON-LISP::LAMBDA (COMMON-LISP::&REST COMMON-LISP::REST) KEYWORD::IN SB-IMPL::GET-EXTERNAL-FORMAT)
  50: COMMON-LISP::TERPRI
  51: SB-FORMAT::&-FORMAT-DIRECTIVE-INTERPRETER
  52: SB-FORMAT::INTERPRET-DIRECTIVE-LIST
  53: SB-FORMAT::%FORMAT
  54: COMMON-LISP::FORMAT
  55: SB-DEBUG::%PRINT-DEBUGGER-INVOCATION-REASON
  56: SB-DEBUG::%INVOKE-DEBUGGER
  57: (COMMON-LISP::LAMBDA () KEYWORD::IN SB-DEBUG::FUNCALL-WITH-DEBUG-IO-SYNTAX)
  58: SB-IMPL::CALL-WITH-SANE-IO-SYNTAX
  59: SB-IMPL::%WITH-STANDARD-IO-SYNTAX
  60: COMMON-LISP::INVOKE-DEBUGGER
  61: COMMON-LISP::ERROR
  62: SB-SYS::MEMORY-FAULT-ERROR
  63: Foreign function call_into_lisp, fp = 0x19f8d00, ra = 0x11992e
  64: Foreign function post_signal_tramp, fp = 0x19f8d88, ra = 0x119b20
  65: SB-IMPL::OUTPUT-BYTES/UTF-8
  66: (COMMON-LISP::LAMBDA (COMMON-LISP::&REST COMMON-LISP::REST) KEYWORD::IN SB-IMPL::GET-EXTERNAL-FORMAT)
  67: SB-IMPL::FD-SOUT
  68: SB-IMPL::%WRITE-STRING
  69: SB-PRETTY::OUTPUT-LINE
  70: SB-PRETTY::MAYBE-OUTPUT
  71: COMMON-LISP::PPRINT-NEWLINE
  72: SB-FORMAT::_-FORMAT-DIRECTIVE-INTERPRETER
  73: SB-FORMAT::INTERPRET-DIRECTIVE-LIST
  74: (COMMON-LISP::LABELS BODY-NAME-1166 KEYWORD::IN SB-FORMAT::INTERPRET-FORMAT-LOGICAL-BLOCK)
  75: SB-FORMAT::INTERPRET-FORMAT-LOGICAL-BLOCK
  76: SB-FORMAT::<-FORMAT-DIRECTIVE-INTERPRETER
  77: SB-FORMAT::INTERPRET-DIRECTIVE-LIST
  78: SB-FORMAT::%FORMAT
  79: COMMON-LISP::FORMAT
  80: SB-DEBUG::%INVOKE-DEBUGGER
  81: (COMMON-LISP::LAMBDA () KEYWORD::IN SB-DEBUG::FUNCALL-WITH-DEBUG-IO-SYNTAX)
  82: SB-IMPL::CALL-WITH-SANE-IO-SYNTAX
  83: SB-IMPL::%WITH-STANDARD-IO-SYNTAX
  84: COMMON-LISP::INVOKE-DEBUGGER
  85: COMMON-LISP::ERROR
  86: SB-SYS::MEMORY-FAULT-ERROR
  87: Foreign function call_into_lisp, fp = 0x19f9840, ra = 0x11992e
  88: Foreign function post_signal_tramp, fp = 0x19f98c8, ra = 0x119b20
  89: SB-IMPL::OUTPUT-BYTES/UTF-8
  90: (COMMON-LISP::LAMBDA (COMMON-LISP::&REST COMMON-LISP::REST) KEYWORD::IN SB-IMPL::GET-EXTERNAL-FORMAT)
  91: SB-IMPL::FD-SOUT
  92: SB-IMPL::%WRITE-STRING
  93: SB-PRETTY::OUTPUT-LINE
  94: SB-PRETTY::MAYBE-OUTPUT
  95: COMMON-LISP::PPRINT-NEWLINE
  96: SB-FORMAT::_-FORMAT-DIRECTIVE-INTERPRETER
  97: SB-FORMAT::INTERPRET-DIRECTIVE-LIST
  98: (COMMON-LISP::LABELS BODY-NAME-1166 KEYWORD::IN SB-FORMAT::INTERPRET-FORMAT-LOGICAL-BLOCK)
  99: SB-FORMAT::INTERPRET-FORMAT-LOGICAL-BLOCK
ldb> 

sfm --haplotypecaller introduces duplicate lines in BAM

Hi,

I am testing elPrep, mostly for the variant calling step, hoping to use it as a drop-in replacement for GATK4. I am using a small test BAM with paired-end reads, where some mates map to different chromosomes. I noticed that in sfm mode, every read whose mate maps to a different chromosome gets duplicated in the output BAM. This doesn't affect the GVCFs produced by elPrep, but the resulting BAM is not spec-compliant, and I suspect this may affect other tools working on elPrep-produced BAMs. Details below.

Run elprep in sfm mode:
elprep sfm testin.bam testout_sfm.bam --haplotypecaller test_sfm.g.vcf.gz --reference /data/HumanGenome/hs38DH.elfasta
Run elprep in filter mode:
elprep filter testin.bam testout_filter.bam --haplotypecaller test_filter.g.vcf.gz --reference /data/HumanGenome/hs38DH.elfasta
Compare small BAMs sfm vs filter:
/home/nthierry/Software/BamUtil/bamUtil_1.0.13/bamUtil/bin/bam diff --in1 testout_sfm.bam --in2 testout_filter.bam
-> testout_sfm.bam contains duplicate lines for each read whose mate maps to a different chromosome.

Question: does this impact the elprep GVCFs?
zdiff test_sfm.g.vcf.gz test_filter.g.vcf.gz
-> only diff is the header ##elPrepCommandLine , no consequence for the elprep variant-caller.

The bug doesn't occur if I don't call variants:
elprep filter testin.bam ttt_filter.bam
elprep sfm testin.bam ttt_sfm.bam
/home/nthierry/Software/BamUtil/bamUtil_1.0.13/bamUtil/bin/bam diff --in1 ttt_sfm.bam --in2 ttt_filter.bam
-> no difference

The SAM spec says:
[QNAME] In a SAM file, a read may occupy multiple alignment lines, when its alignment is chimeric or when multiple mappings are given.
[FLAG] For each read/contig in a SAM file, it is required that one and only one line associated with the read satisfies ‘FLAG & 0x900 == 0’
-> the elPrep-produced BAMs with duplicate lines don't seem compliant (AFAICT?)
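
The FLAG rule quoted above is mechanical to check; a small Go sketch (hypothetical helper, not elPrep code) of the "exactly one primary line per read" invariant:

```go
package main

import "fmt"

// primaryCount returns how many alignment lines for one read name are
// "primary" under the SAM spec rule FLAG & 0x900 == 0 (neither secondary
// 0x100 nor supplementary 0x800). A compliant file has exactly one per read.
func primaryCount(flags []int) int {
	n := 0
	for _, f := range flags {
		if f&0x900 == 0 {
			n++
		}
	}
	return n
}

func main() {
	// A duplicated primary line, as reported above, violates the rule:
	fmt.Println(primaryCount([]int{99, 99}))  // 2: non-compliant
	fmt.Println(primaryCount([]int{99, 355})) // 1: compliant (355 has 0x100 set)
}
```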

Regards,
Nicolas

Issue during build of v5.1.0

It seems building from source doesn't work:
https://app.circleci.com/pipelines/github/bioconda/bioconda-recipes/53687/workflows/40af8e10-1d30-4541-9fba-0b3cc7ab98c3/jobs/170646

13:19:42 BIOCONDA INFO (OUT) go: downloading github.com/exascience/pargo v1.1.0
13:19:42 BIOCONDA INFO (OUT) go: downloading github.com/google/uuid v1.2.0
13:19:42 BIOCONDA INFO (OUT) go: downloading golang.org/x/sys v0.0.0-20210514084401-e8d321eab015
13:19:42 BIOCONDA INFO (OUT) go: downloading github.com/bits-and-blooms/bitset v1.2.0
13:19:44 BIOCONDA INFO (OUT) # github.com/exascience/elprep/v5/sam
13:19:44 BIOCONDA INFO (OUT) sam/split-merge.go:237:21: undefined: MergeInputs
13:19:44 BIOCONDA INFO (OUT) sam/split-merge.go:671:21: undefined: MergeInputs

GATK equivalency - version

Hi,

Can you add to the README exactly which GATK version elPrep produces equivalent output to?
Thanks
M

Proper running using sam input file?

Hi,

I read that elprep can work with .sam input, and since it does coordinate sorting, I just mapped my reads to the elfasta-converted reference and used the resulting .sam file as input.
I am a little confused/concerned about whether the final VCF will be correct, given the job log.
I used the following code:
elprep sfm AL91.sam AL91.output.bam --filter-unmapped-reads --nr-of-threads 28 --tmp-path $TMPDIR
--mark-duplicates --mark-optical-duplicates AL91.metrics
--sorting-order coordinate
--bqsr AL91.recal
--reference /users/PHS0338/jpac1984/data/myse-hapog.elfasta
--haplotypecaller AL91.vcf.gz

And the log: I thought that for proper variant calling it had to first convert/sort the .sam and then split.
It has been ~16 hours, and the only output so far is AL91.recal; there is no AL91.metrics yet.

Here is the log.
elprep version 5.1.1 compiled with go1.16.7 - see http://github.com/exascience/elprep for more information.

2022/01/20 20:44:07 Created log file at /users/PHS0338/jpac1984/logs/elprep/elprep-2022-01-20-20-44-07-250202704-EST.log
2022/01/20 20:44:07 Command line: [elprep sfm AL91.sam AL91.output.bam --filter-unmapped-reads --nr-of-threads 28 --tmp-path /tmp/slurmtmp.17532726 --mark-duplicates --mark-optical-duplicates AL91.metrics --sorting-order coordinate --bqsr AL91.recal --reference /users/PHS0338/jpac1984/data/myse-hapog.elfasta --haplotypecaller AL91.vcf.gz]
2022/01/20 20:44:07 Executing command:
elprep sfm AL91.sam AL91.output.bam --filter-unmapped-reads --mark-duplicates --mark-optical-duplicates AL91.metrics --optical-duplicates-pixel-distance 100 --bqsr AL91.recal --reference /users/PHS0338/jpac1984/data/myse-hapog.elfasta --quantize-levels 0 --max-cycle 500 --haplotypecaller AL91.vcf.gz --sorting-order coordinate --nr-of-threads 28 --tmp-path /tmp/slurmtmp.17532726 --intermediate-files-output-prefix AL91 --intermediate-files-output-type sam
2022/01/20 20:44:07 Splitting...
2022/01/20 21:01:22 Filtering (phase 1)...
2022/01/20 21:29:00 Filtering (phase 2) and variant calling...

Hopefully, I am doing the proper procedure and not wasting time.

Best regards;

Juan

UMIs and duplicate marking

Hi,

Are there plans to support UMI-based duplicate marking, as supported in Picard? We're looking into using UMIs in our shallow WGS pipeline and would like to stream into elPrep.
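
For illustration, UMI-aware duplicate marking in the style of Picard's UmiAwareMarkDuplicatesWithMateCigar essentially extends the duplicate key with the read's UMI; a hypothetical Go sketch (not elPrep's actual duplicate-marking code):

```go
package main

import "fmt"

// dupKey sketches UMI-aware duplicate marking: the usual position/orientation
// key is extended with the UMI (RX tag), so fragments with distinct UMIs at
// the same coordinates are kept as independent molecules. Hypothetical.
type dupKey struct {
	refID, pos int
	reverse    bool
	umi        string // from the RX tag; empty when running UMI-unaware
}

func main() {
	seen := map[dupKey]bool{}
	reads := []dupKey{
		{0, 1000, false, "ACGT"},
		{0, 1000, false, "ACGT"}, // same UMI and position: duplicate
		{0, 1000, false, "TTAG"}, // same position, new UMI: not a duplicate
	}
	for _, r := range reads {
		fmt.Println(seen[r]) // true when this read is a duplicate
		seen[r] = true
	}
}
```

A production version would also need UMI error correction (grouping UMIs within a small edit distance), which is where most of the real complexity lives.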

Thanks
Matthias

elprep split fails to split into CRAM format

elPrep version 2.10 (LispWorks pre-compiled distribution)

split sam-file /path/to/output/ 
 [--output-prefix name] 
 [--output-type [sam | bam | cram]] 
 [--nr-of-threads nr] 

When I select CRAM, I get an error reminding me to specify a reference:

An error occurred in elPrep 2.10: When creating CRAM output, either a reference-fasta or a reference-fai must be provided.

However, when I add the required arguments, elprep complains:

elprep split --reference-t Homo_sapiens.GRCh37.NCBI.allchr_MT.fa.fai --output-prefix foo_n --output-type cram --nr-of-threads 12 foo_n.bam  ./elprep/split/
elPrep version 2.10. See http://github.com/exascience/elprep for more information.
Incorrect number of parameters: Expected 2, got 4 (--reference-t Homo_sapiens.GRCh37.NCBI.allchr_MT.fa.fai foo_n.bam ./elprep/split/).
split sam-file /path/to/output/ 
 [--output-prefix name] 
 [--output-type [sam | bam | cram]] 
 [--nr-of-threads nr] 

Secondary .bai files

Hi,
Is it possible for elprep to generate .bai index files? Or do you maybe plan on including this functionality, since these secondary index files are often required by tools for variant calling.

Thank you.

elprep sfm mode exits with out of Memory error

Thanks for the great work with elPrep! It has been really useful in cutting down analysis runtimes!

We have been running elPrep (4.1.5) on a WGS dataset, primarily to use the mark-duplicates and BQSR functionality, with mixed success. A subset of samples works as expected, while some exit with the following runtime out-of-memory error. We would greatly appreciate any input regarding this problem:

fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x5f0611, 0x16)
        /opt/local/lib/go/src/runtime/panic.go:617 +0x72
runtime.sysMap(0x11748000000, 0x4000000, 0x78c078)
        /opt/local/lib/go/src/runtime/mem_linux.go:170 +0xc7
runtime.(*mheap).sysAlloc(0x7746c0, 0x2000, 0x7746d0, 0x1)
        /opt/local/lib/go/src/runtime/malloc.go:633 +0x1cd
runtime.(*mheap).grow(0x7746c0, 0x1, 0x0)
        /opt/local/lib/go/src/runtime/mheap.go:1222 +0x42
runtime.(*mheap).allocSpanLocked(0x7746c0, 0x1, 0x78c088, 0x7f9d8df1a888)
        /opt/local/lib/go/src/runtime/mheap.go:1150 +0x37f
runtime.(*mheap).alloc_m(0x7746c0, 0x1, 0x45002f, 0x7f9d8df1a888)
        /opt/local/lib/go/src/runtime/mheap.go:977 +0xc2
runtime.(*mheap).alloc.func1()
        /opt/local/lib/go/src/runtime/mheap.go:1048 +0x4c
runtime.systemstack(0x0)
        /opt/local/lib/go/src/runtime/asm_amd64.s:351 +0x66
runtime.mstart()
        /opt/local/lib/go/src/runtime/proc.go:1153

The targeted coverage for the dataset is 60X, and the input BAM is roughly ~114 GB. The command line, taken from the logs:

/home/ubuntu/elPrep/elprep sfm INPUTBAM OUTPUTBAM --mark-duplicates --mark-optical-duplicates OUTPUTDUPMETRICS --sorting-order keep --bqsr OUTPUTRECAL --bqsr-reference ucsc.hg19.elfasta --known-sites <knownSiteFiles>

How to call population SNPs/indels?

Hi,

I have 2 questions about haplotypecaller.

  1. How do I execute only the haplotypecaller function? I have analysis-ready BAMs and don't need any other preprocessing steps.

  2. How can I use elprep to call population SNPs/indels (multiple input BAMs, single output VCF)?

Best,
Kun

Unclipped position not present in SAM alignment

Hi,
I'm getting failed tasks when using --mark-optical-duplicates, and a message 'Unclipped position not present in SAM alignment'. Could this be the cause, or should I look for some other reason why the task is failing?

Thank you.

sfm sorting problem

The elprep v4.0.1 sfm functionality results in an incorrectly sorted BAM.

command : /opt/NGS/binaries/elPrep/4.0.1/elprep sfm 'DNA1802266B_recalibrated.bam' /home/shared_data_medgen_frax/Exome_CNV_TestRuns/CNV_tmp_files/dna1802266b/dna1802266b.full.bam --remove-duplicates --filter-mapping-quality '40' --log-path '/home/shared_data_medgen_frax/Exome_CNV_TestRuns/CNV_Job_Output/dna1802266b/' --filter-unmapped-reads --nr-of-threads '16' --sorting-order 'coordinate' --filter-non-overlapping-reads '/opt/NGS/References/hg19/BedFiles/WGS.bed' --replace-reference-sequences '/opt/NGS/References/hg19_masked_par/samtools/0.1.19/hg19.dict'

This results in an error when applying samtools index to the resulting BAM:

[bam_index_core] the alignment is not sorted (ST-K00127:290:HNFYKBBXX:1:1216:29904:8330): 249240352 > 752696 in 2-th chr [bam_index_build2] fail to index the BAM file.

Running elprep filter instead of elprep sfm results in correct sorting, so the problem seems to be related to the sfm merging. I also tried --contig-group-size 1 instead of the default, with similar errors as a result.

Large amount of diskspace required

We are running a few benchmarks on the MarkDuplicates feature of elPrep (v4.1.5) on WGS samples in isolation.

We started with a 'small' WGS sample with a mean coverage of 41 (BAM: 60 GB; mean cov. 41±15, median 43). Your paper suggests it should require 1.6 times the disk space of GATK/Picard, but we found it needed nearly 4.5 times the space (~550 GB).

A large WGS sample with a mean coverage of 70 (BAM: 101 GB; mean cov. 70±24, median 74) needed 4.7 times the disk space (~950 GB).

This scaling means we would not be able to use elPrep on larger WGS samples (120X). We replicated the call on two different compute clusters, and both runs required the same disk space. Is there something we are missing?

Call

/usr/bin/time --verbose elprep sfm "$local_file_path" "$output_file" \
    --mark-duplicates \
    --mark-optical-duplicates "$metrics_file" \
    --optical-duplicates-pixel-distance 2500 \
    --sorting-order keep \
    --log-path $(pwd) \
    --nr-of-threads 12 \
    --tmp-path $TMPDIR \
    --timed

Method

A du --bytes $TMPDIR was run in the background at 5-minute intervals to determine the maximum disk usage.

Estimate RAM usage based on input filesize

Hi,

I'm trying to find a way to get a rough estimate of how much RAM I'll need to run elprep filter, based on the size of the input BAM.

Do you have a way of calculating this, e.g. when submitting a job to a cloud provider?
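
For what it's worth, a crude back-of-the-envelope heuristic (hypothetical, not an official elPrep formula; both factors would need calibrating against your own past runs):

```go
package main

import "fmt"

// estimateRAMGiB is a hypothetical rule of thumb, not an official elPrep
// formula: in-memory filtering tends to need a multiple of the *uncompressed*
// input, so we expand the BAM size by an assumed compression ratio and a
// working-memory multiplier calibrated from previously observed runs.
func estimateRAMGiB(bamGiB, compressionRatio, workingFactor float64) float64 {
	return bamGiB * compressionRatio * workingFactor
}

func main() {
	// e.g. a 60 GiB BAM, assuming ~3x BGZF compression and ~1.5x working memory:
	fmt.Printf("%.0f GiB\n", estimateRAMGiB(60, 3.0, 1.5)) // 270 GiB
}
```

Treat the output as a sizing hint with generous headroom, not a guarantee.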

Thanks
Matthias

[RFE] haplotypecaller: support for custom GVCF GQ slices

Hi,

GATK4 allows specifying custom boundaries for the GQ slices used to group consecutive non-variant positions into non-variant blocks, via --gvcf-gq-bands. This feature is quite useful IMO, as the resulting GVCFs can be much smaller (and much faster to work with downstream). The only "loss" is fine-grained GQ resolution for non-variant blocks, and only in the GQ ranges the user has decided they don't care about.
It would be great if a similar feature could be implemented in elPrep!
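
For reference, the banding itself is simple; a hypothetical Go sketch (not elPrep or GATK code) of mapping a GQ value to a user-specified band, in the spirit of --gvcf-gq-bands:

```go
package main

import "fmt"

// gqBand maps a genotype quality to a band index given user-specified,
// ascending boundaries: consecutive non-variant positions whose GQs fall in
// the same band can then be merged into a single non-variant block.
func gqBand(gq int, bands []int) int {
	for i, b := range bands {
		if gq < b {
			return i
		}
	}
	return len(bands)
}

func main() {
	bands := []int{20, 99} // coarse banding: [0,20), [20,99), [99,inf)
	fmt.Println(gqBand(15, bands), gqBand(45, bands), gqBand(99, bands)) // 0 1 2
}
```

Fewer boundaries mean fewer distinct bands, hence longer merged blocks and smaller GVCFs, which is exactly the trade-off described above.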

Regards,
Nicolas

error while trying to split

Hi.
I have a 100X human BAM file (about 210 GB) in which I want to mark duplicates.

I have a server with about 800 GB of RAM, but judging by your description, it would be unsafe to try marking duplicates without splitting first.

When I try the splitting command like this:
elprep-v2.10/elprep split bam.merged.bam splitFiles --output-prefix bamMerged --output-type sam

I get the following error:
"view: invalid option -- '@'
open: No such file or directory
[main_samview] fail to open file for reading.
view: invalid option -- '@'
[main_samview] fail to open file for reading.
view: invalid option -- '@'
[main_samview] fail to open file for reading.
"
I have a hunch that there is a problem with the samtools version, but it's only a guess. Do you have any ideas?

vcf-to-elsites

Hi, @caherzee
I am trying to convert a VCF file to elsites, but I am getting an error message like this:
missing ID in a VCF meta-information line: PEDIGREE=<Child=NA12877,Mother=NA12890,Father=NA12889>
That's the log for the process:

elprep version 5.1.1 compiled with go1.16.7 - see http://github.com/exascience/elprep for more information.

2022/01/10 14:00:16 Created log file at ../analysis/elprep_VC/logs/elprep/elprep-2022-01-10-14-00-16-701177358-EET.log
2022/01/10 14:00:16 Command line: [elprep vcf-to-elsites NA12877.vcf NA12877.elsites --log-path ../analysis/elprep_VC/]
2022/01/10 14:00:16 missing ID in a VCF meta-information line: PEDIGREE=<Child=NA12877,Mother=NA12890,Father=NA12889>
panic: missing ID in a VCF meta-information line: PEDIGREE=<Child=NA12877,Mother=NA12890,Father=NA12889>

goroutine 1 [running]:
log.Panicf(0x687f09, 0x2d, 0xc000063b90, 0x1, 0x1)
	/opt/conda/conda-bld/elprep_1633347706013/_build_env/go/src/log/log.go:361 +0xc5
github.com/exascience/elprep/v5/vcf.(*StringScanner).ParseMetaInformation(0xc000063c90, 0xc000022440, 0x36)
	/opt/conda/conda-bld/elprep_1633347706013/work/vcf/vcf-files.go:155 +0x2ed
github.com/exascience/elprep/v5/vcf.ParseHeader(0xc0000104e0, 0x2e, 0xc000074120)
	/opt/conda/conda-bld/elprep_1633347706013/work/vcf/vcf-files.go:293 +0x21b
github.com/exascience/elprep/v5/intervals.FromVcfFile(0x7ffdaa54d41c, 0xb, 0x0)
	/opt/conda/conda-bld/elprep_1633347706013/work/intervals/intervals.go:310 +0xb8
github.com/exascience/elprep/v5/cmd.VcfToElsites()
	/opt/conda/conda-bld/elprep_1633347706013/work/cmd/convert.go:47 +0x228
main.main()
	/opt/conda/conda-bld/elprep_1633347706013/work/main.go:75 +0x38b

I am not sure what the issue could be. The VCF file I am using is from the Illumina Platinum Genomes, so can anyone help here?

python script handing of `single-end` flag

Hi,

when running the following command:

elprep-sfm-gnupar.py \
       elprep_sWGS.bam \
        elprep_sWGS_filtered.bam \
        --mark-duplicates \
        --sorting-order coordinate \
        --nr-of-threads $CORES \
        --nr-of-jobs 2 \
        --intermediate-files-output-type sam \
        --intermediate-files-output-prefix elprep \
        --single-end

I get the following error when the filter step is about to start:

flag provided but not defined: -single-end

It seems to me the sfm scripts do not handle the --single-end flag correctly: they pass it on to the filter command, resulting in the error above.

Could you have a look at those?

Thanks
M

Behaviour of mark-duplicates for pairs where one read is mapped to another chromosome

I've used elprep split to split a BAM file by chromosome. The test chunk for one chromosome has these counts:

samtools flagstat in.bam
198799 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
198799 + 0 mapped (100.00%:-nan%)
198799 + 0 paired in sequencing
119421 + 0 read1
79378 + 0 read2
115896 + 0 properly paired (58.30%:-nan%)
178706 + 0 with itself and mate mapped
20093 + 0 singletons (10.11%:-nan%)
61880 + 0 with mate mapped to a different chr
21845 + 0 with mate mapped to a different chr (mapQ>=5)

I've run elprep on this chunk:

elprep --mark-duplicates --sorting-order keep in.bam test1.sam

On completion, I run samtools flagstat on the elprep result:

samtools flagstat test1.sam 
136919 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
84287 + 0 duplicates
136919 + 0 mapped (100.00%:-nan%)
136919 + 0 paired in sequencing
64918 + 0 read1
72001 + 0 read2
115896 + 0 properly paired (84.65%:-nan%)
116826 + 0 with itself and mate mapped
20093 + 0 singletons (14.68%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

As a comparison I've run bammarkduplicates from https://github.com/gt1/biobambam. The result of samtools flagstat for this is:

samtools flagstat test2.bam
198799 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
84287 + 0 duplicates
198799 + 0 mapped (100.00%:-nan%)
198799 + 0 paired in sequencing
119421 + 0 read1
79378 + 0 read2
115896 + 0 properly paired (58.30%:-nan%)
178706 + 0 with itself and mate mapped
20093 + 0 singletons (10.11%:-nan%)
61880 + 0 with mate mapped to a different chr
21845 + 0 with mate mapped to a different chr (mapQ>=5)

I didn't ask elprep to remove any reads, so I was surprised to find that the number of reads in the output was reduced compared to the input. Comparing the three results, I can see that elprep and bammarkduplicates agree on the duplicate count. Subtracting the number of reads after elprep from the initial value (198799 - 136919 = 61880) yields exactly the number of reads with a mate mapped to a different chr.
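
The bookkeeping from the flagstat outputs above can be checked mechanically:

```go
package main

import "fmt"

func main() {
	// Counts taken from the samtools flagstat outputs above: the reads missing
	// from the elprep output exactly match the "mate mapped to a different chr"
	// count, which is what suggests that category was dropped.
	input, output, mateOtherChr := 198799, 136919, 61880
	fmt.Println(input-output == mateOtherChr) // true
}
```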

From this I understand that elprep removes reads whose mates are mapped to a different chr. There was no second file written, so are they discarded? Am I using elprep correctly in this case? Thanks.
