
sambamba's Introduction

BioD

BioD is a fast and memory-efficient bioinformatics library written in the D programming language. Its aim is to:

  • Provide a platform for developing high-performance computational biology applications using the D programming language through
    • Automatic parallelization of tasks where possible
    • Avoiding unnecessary memory allocations

Why BioD?

BioD leverages D programming language features to develop high-performance bioinformatics tools (e.g. sambamba). D is a hybrid low- and high-level language combining object-oriented and functional (OOP/FP) styles, with templating/generics that are far easier to use than those of C++.

D programming language resources

Current development

Our aim is to provide a set of D modules for manipulating and working with biological datasets. BioD provides modules for handling high-throughput data formats, including a fast and easy-to-use native BAM file reader and writer with the ability to iterate over a BAM file a read at a time, a nucleotide at a time (pileup), or via a sliding window.

Note that the current BamReader bails out on recent versions of the LDC compiler. See also https://github.com/biod/BioD/issues/53
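
As a minimal sketch of the read-at-a-time interface (module paths have shifted between BioD releases; this assumes the classic bio.bam.reader layout, and the file name is made up):

import bio.bam.reader;
import std.stdio : writeln;

void main() {
    auto bam = new BamReader("example.bam");   // hypothetical input file
    foreach (read; bam.reads) {                // one alignment record at a time
        writeln(read.name, " at ", read.position);
    }
}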

Install

The current default is to provide the path of the checked-out repository to the D compiler. For example,

DFLAGS = -wi -I. -IBioD -g
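
A full invocation with ldc2 might then look like this (illustrative; the program and output names are made up):

ldc2 -wi -I. -IBioD -g -of=myprog myprog.d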

Build environment

After installing ldc and dub, run

dub
dub test

On a recent Debian (>201911) you can install ldc and compile BioD with

make
make check

It is possible to create a recent build container with the GNU Guix transactional package manager:

guix environment -C guix --ad-hoc ldc dub zlib gdb binutils-gold vim --network

After being dropped into the container, simply run dub.

If you want to use the Makefile instead (which does not require network access), use

guix environment -C guix --ad-hoc ldc zlib gdb make binutils-gold vim --no-grafts
make -j 4
make check

Debugging

When using gdb, switch off these signal handlers:

handle SIGUSR1 SIGUSR2 nostop noprint

This can also be passed in on the command line:

gdb -ex 'handle SIGUSR1 SIGUSR2 nostop noprint' --args ./biod-test-library

Usage

See the examples directory for examples and usage.

Mailing list

The BioD mailing list

Contributing

Simply clone the repository on GitHub and open a pull request.

BioD contributors and support

See contributors. For support use the issue tracker or contact

License

BioD is free software and licensed under the MIT (Expat) license.

BioD includes some files from the undeaD project in ./contrib, which are published under the Boost license. This code should be phased out over time.


sambamba's Issues

Flag

Every alignment record should have properties, with human-readable names, for easy access to the flag bits; these should be taken into account in algorithms, at least for whether the read is mapped or not.

make unittest fails on dmd 2.059 (i386)

Testing Value code...

object.Exception@utils/range.d(21): Amount of elements to prefetch must be positive

./run_unittests(pure @safe bool std.exception.enforce!(bool, "utils/range.d", 21).enforce(bool, lazy const(char)[])+0x2b) [0x81642df]

Less cumbersome type conversion for Value.

At the moment, type conversions are very restrictive. If a Value holds an integer and I want that integer's representation as a string, I have to use to!int first, and then to!string. Moreover, the user should be able to convert a value holding a byte to short, int, long, string, float, and double, and so on. The user really should not care about the actual type, as long as it can be converted to the requested one.

That would make working with tags much easier. For instance, I want to be able to do this:

Value v = 25;
float f = to!float(v); // determine, that value holds int, then convert this int to float
string s = to!string(v); // the same, but convert to string

to!float(to!int(v)) is not what the user wants to type, that's for sure.
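
As a minimal sketch of the proposed behaviour, using std.variant rather than the library's actual tagvalue code (all names here are illustrative):

import std.conv : to;
import std.variant : Algebraic;

alias Value = Algebraic!(byte, short, int, long, float, double, string);

T coerce(T)(Value v) {
    // Dispatch on the runtime type, then reuse std.conv.to, so a Value
    // holding byte/short/int/... converts uniformly to any target type.
    foreach (U; Value.AllowedTypes) {
        if (v.type == typeid(U))
            return to!T(v.get!U);
    }
    throw new Exception("unsupported conversion");
}

unittest {
    Value v = 25;
    assert(coerce!float(v) == 25.0f);   // int -> float, in one step
    assert(coerce!string(v) == "25");   // int -> string, in one step
}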

make benchmark fails

I realise this is a dev tree; even so, I would like to be able to run the benchmark :)

verona:~/tmp/BAMread/benchmarks$ make
dmd ../bamfile.d ../utils/range.d ../chunkinputstream.d ../bgzfrange.d ../samheader.d ../reference.d ../alignment.d ../tagstorage.d ../tagvalue.d ../utils/switchendianness.d ../virtualoffset.d ../randomaccessmanager.d ../bai/read.d ../bai/bin.d ../bai/chunk.d ../bai/utils/algo.d ../utils/algo.d benchmarks.d -ofrun_benchmarks -O -release -inline -version=serial
../randomaccessmanager.d(14): Error: module memoize is in file 'utils/memoize.d' which cannot be read
import path[0] = /usr/include/i386-linux-gnu/dmd/phobos
import path[1] = /usr/include/i386-linux-gnu/dmd/druntime/import
make: *** [serial] Error 1

Get "sambamba-view: not enough data in stream"

Hi,

On occasion I see this error "sambamba-view: not enough data in stream".
It does not happen every time, and if I re-run the command it usually works, but it comes up on a regular basis...

We run this on an Isilon system that gets heavily used, so maybe there is some timeout setting or something, so that if it does not get the data as fast as it wants, it barfs?

Thanks

Thon

support for reading SAM

This can be done relatively easily once there's support for constructing/modifying alignments.

filtering with lucene-like syntax

It is possible to build a simple query parser using Pegged, so that users can write queries to select the reads they need.

E.g., alignments.select("refseq:(chr1 OR chr2) AND region:[10000 TO 15000] AND quality > 50")

The idea is that this could be useful for web apps, and could also allow easy interfacing with dynamic languages, without passing sophisticated objects through an FFI.
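
A toy Pegged grammar showing the shape such a parser could take (the grammar itself is made up; a real query language would also need quoted strings, numeric ranges, and comparison operators):

import pegged.grammar;

mixin(grammar(`
Query:
    Expr   < And ("OR" And)*
    And    < Atom ("AND" Atom)*
    Atom   < Pair / "(" Expr ")"
    Pair   < identifier ":" identifier
`));

void main() {
    auto tree = Query("refseq:chr1 AND region:whole");
    assert(tree.successful);
}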

compiling on CentOS system with no root access

I got stuck with this error:
export LD_LIBRARY_PATH=/home/avilella/src/dmd2/linux/lib64
export PATH=$PATH:/home/avilella/src/dmd2/linux/bin64
cd ~/src/sambamba/latest/sambamba
make clean
make
mkdir -p build/
rdmd --compiler=dmd --build-only -IBioD -version=development -ofbuild/sambamba main.d
sambamba/merge.d(211): Error: non-shared method bio.bam.utils.samheadermerger.SamHeaderMerger.this is not callable using a shared object
sambamba/merge.d(211): Error: no constructor for SamHeaderMerger
make: *** [all] Error 1

Installation script

Many bioinformaticians have never used D, and it can be hard for them to assess the features of a new library. The script should work on all platforms.

weird behaviour when region is of form 'chr' instead of 'chr:beg-end'

With one of the files from 1000 Genomes, using an index generated by samtools: when I specify '20' as the region in sambamba view, it throws an exception while trying to read a non-existent BGZF block. However, if '20:1-100000000' is specified on the command line, everything is OK.

(Perhaps it has to do with some erroneous type conversions between size_t and uint... I will look into that tomorrow.)

Wiki/readme

There should be info about Biogems & BioRuby, as well as a description of how to install and use the library. Having some code examples would be nice.

memory copying/allocation should be avoided

Currently, when constructing alignments, memory is copied from the decompressed blocks. That shouldn't be the case; slices should be used instead. That will save about 10% of iteration time, while also reducing GC contention and making the multithreaded version faster. (By the way, using slices instead of memcpy can really become a killer feature; it's simply impossible in 'traditional' languages.)
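
In D terms the difference is one line (names hypothetical):

// What happens now: allocate and copy.
ubyte[] copy = new ubyte[len];
copy[] = decompressedBlock[offset .. offset + len];

// Proposed: just slice the decompressed buffer. No allocation, no
// copy; the alignment references the block's memory directly.
ubyte[] view = decompressedBlock[offset .. offset + len];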

Indexing BAM

Creating a BAI index is a must; maybe also consider alternative indexing schemes (nested containment lists, for example).

sambamba view slower with more threads

I ran this on an Ubuntu 64-bit VM with 4 cores (running on a Windows 7 64-bit laptop). It seems that running on more cores actually slows down the view command in this instance.

I repeated the experiment on a smaller file at a cluster farm with an Isilon filesystem, and I also get slower results when using more cores.

I also tried it on a node with a solid-state drive partition by copying the .bam and .bai files there, and I get similar results, both in overall time and in taking slightly more time with more cores.

Any ideas?

avilella@ubuntu64:~/src/sambamba/latest/sambamba$ for t in $(seq 4 -1 1); do echo $t===; time ~/src/sambamba/latest/sambamba/build/sambamba view -t $t NA12878.bam 1>/dev/null; done
4===
[info] file size: 4366637089
[info] file size: 4366637089
[info][BGZF range] EOF, current offset is 4366637089
[info][chunkinputstream] len == 0, start offset = 4366637061, end offset = 4366637089
[info][chunkinputstream] len == 0, start offset = 4366637089, end offset = 4366637089
[info][bamRead range] empty, read 0 bytes, expected 4

real 26m28.151s
user 20m1.075s
sys 0m31.534s
3===
[info] file size: 4366637089
[info] file size: 4366637089
[info][BGZF range] EOF, current offset is 4366637089
[info][chunkinputstream] len == 0, start offset = 4366637061, end offset = 4366637089
[info][chunkinputstream] len == 0, start offset = 4366637089, end offset = 4366637089
[info][bamRead range] empty, read 0 bytes, expected 4

real 14m37.343s
user 17m41.930s
sys 0m27.646s
2===
[info] file size: 4366637089
[info] file size: 4366637089
[info][BGZF range] EOF, current offset is 4366637089
[info][chunkinputstream] len == 0, start offset = 4366637061, end offset = 4366637089
[info][chunkinputstream] len == 0, start offset = 4366637089, end offset = 4366637089
[info][bamRead range] empty, read 0 bytes, expected 4

real 13m49.845s
user 16m7.328s
sys 0m21.113s
1===
[info] file size: 4366637089
[info] file size: 4366637089
[info][BGZF range] EOF, current offset is 4366637089
[info][chunkinputstream] len == 0, start offset = 4366637061, end offset = 4366637089
[info][chunkinputstream] len == 0, start offset = 4366637089, end offset = 4366637089
[info][bamRead range] empty, read 0 bytes, expected 4

real 10m40.287s
user 12m16.670s
sys 0m15.185s

I repeated the experiment on a smaller file at a cluster farm with an Isilon filesystem, and this is what I get:
8===
[info] file size: 1647602665
[info] file size: 1647602665
[info][BGZF range] EOF, current offset is 1647602665
[info][chunkinputstream] len == 0, start offset = 1647602637, end offset = 1647602665
[info][chunkinputstream] len == 0, start offset = 1647602665, end offset = 1647602665
[info][bamRead range] empty, read 0 bytes, expected 4

real 1m34.347s
user 1m56.074s
sys 0m4.066s
7===
[info] file size: 1647602665
[info] file size: 1647602665
[info][BGZF range] EOF, current offset is 1647602665
[info][chunkinputstream] len == 0, start offset = 1647602637, end offset = 1647602665
[info][chunkinputstream] len == 0, start offset = 1647602665, end offset = 1647602665
[info][bamRead range] empty, read 0 bytes, expected 4

real 1m26.204s
user 1m57.226s
sys 0m1.818s
6===
[info] file size: 1647602665
[info] file size: 1647602665
[info][BGZF range] EOF, current offset is 1647602665
[info][chunkinputstream] len == 0, start offset = 1647602637, end offset = 1647602665
[info][chunkinputstream] len == 0, start offset = 1647602665, end offset = 1647602665
[info][bamRead range] empty, read 0 bytes, expected 4

real 1m21.599s
user 1m56.810s
sys 0m1.728s

On the SSD partition:
avilella_scratch]$ for t in $(seq 8 -2 1); do echo $t===; time /home/avilella/src/sambamba/latest/sambamba/build/sambamba view -t $t TruSeqNano-350.chr19.bwamem.sorted.bam 1>/dev/null; done
8===
[info] file size: 1395101757
[info] file size: 1395101757
[info][BGZF range] EOF, current offset is 1395101757
[info][chunkinputstream] len == 0, start offset = 1395101729, end offset = 1395101757
[info][chunkinputstream] len == 0, start offset = 1395101757, end offset = 1395101757
[info][bamRead range] empty, read 0 bytes, expected 4

real 1m12.507s
user 1m43.656s
sys 0m2.190s
6===
[info] file size: 1395101757
[info] file size: 1395101757
[info][BGZF range] EOF, current offset is 1395101757
[info][chunkinputstream] len == 0, start offset = 1395101729, end offset = 1395101757
[info][chunkinputstream] len == 0, start offset = 1395101757, end offset = 1395101757
[info][bamRead range] empty, read 0 bytes, expected 4

real 1m8.823s
user 1m41.281s
sys 0m1.879s

range of paired reads

It would be very useful to have an efficient implementation of such a thing.

For 'QNAME' sorting order it's trivial, since paired reads come on adjacent lines; but the typical case is a coordinate-sorted BAM file, and re-sorting takes a lot of time (and disk space as well). Paired reads are usually (relatively) near each other, since reads are short, but not all the time. I wrote a simple utility, and for a sample file it showed that the average distance between paired reads is about 40, but the distribution has an extremely long tail: it may well be that the first read is at the beginning of the file while the second one is at the end. Still, the average suggests that additional sorting is not the best way to handle it.

Picard uses this kind of approach for validating mate existence and fields, and it stores the temporary data on disk (picard/sam/SamFileValidator.java). Basically, when we find one read from a pair, we store its read name; when we find its mate, we remove that read from the map.

The amount of code in the Java implementation of a disk-backed map is overwhelming... but it seems to me those guys just like reinventing the wheel. Kyoto Cabinet comes to mind.
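
A purely in-memory sketch of that scheme (the point of the issue is to spill this map to disk once it grows; process is a hypothetical callback, and reads is any range of alignments with a name property):

BamRead[string] pending;            // read name -> first mate seen

foreach (read; reads) {
    if (auto mate = read.name in pending) {
        process(*mate, read);       // both mates in hand now
        pending.remove(read.name);
    } else {
        pending[read.name] = read;  // park it until the mate shows up
    }
}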

flagstat command line options rejected

The help for 0.3.1 says this:

sambamba-flagstat [options] <input.bam>
-n, --nthreads=NTHREADS
-p, --show-progress

But I can't seem to get -p and -n working together:

These work:

sambamba flagstat input.bam
sambamba flagstat -p input.bam
sambamba flagstat -n=32 input.bam

These don't:

sambamba flagstat -n 32 input.bam   # typical short option in Unix
sambamba flagstat -n=32 -p input.bam
sambamba flagstat -n=32 --show-progress input.bam

I won't go through all the combinations; you get my point.

Increase write buffer sizes

The I/O performance data I gathered also indicates that using a large buffer for writes is helpful. Currently, sambamba view uses the default buffer size on stdout; I obtained a significant improvement by using setvbuf(3) like so:

import std.c.stdio : stdout, setvbuf, _IOFBF;
import std.c.stdlib : malloc;
[...]
char *buf = cast(char*)malloc(1048576);
setvbuf(stdout, buf, _IOFBF, 1048576);

This 1 MB buffer resulted in ~2 seconds being spent in write(2) calls, rather than ~4 with the default buffer size.

Similar results were obtained when using writeBAM with a BufferedFile; the default 8K buffer size resulted in 7-8 seconds spent in write(2) calls, vs. about 2.2 seconds when using a 128K buffer. (No significant improvement was observed with a 1M buffer.)

I'm not sure what the best way to standardize this across the codebase would be, given my limited D experience, so I'm not attempting to submit a pull request for this one at this point.
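
In present-day D the std.c.* modules are gone; the equivalent imports for the snippet above live in the core.stdc packages:

import core.stdc.stdio : stdout, setvbuf, _IOFBF;
import core.stdc.stdlib : malloc;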

Now unittests segfault with GDC

A lot of new functionality was added recently, but all of it was tested only with DMD. I got a segfault when I tried to run the unittests compiled with GDC. Needs investigation.

Progressbar

The sambamba tool lacks a progress bar. Most BAM files are quite large, and it's desirable to see a progress bar so as to estimate processing time. It should look like wget's.

The tool currently works with local files only, so I can add a method to BamFile that returns the current file offset, and call it, say, every 1000 or 10000 alignments to update the progress bar.
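
A sketch of that polling loop (currentOffset is the hypothetical new method mentioned above; fileSize would come from the filesystem):

import std.stdio : stderr;

size_t n;
foreach (read; bam.reads) {
    if (++n % 10_000 == 0) {
        auto pct = 100.0 * bam.currentOffset() / fileSize;
        stderr.writef("\r%5.1f%%", pct);   // wget-style single-line update
    }
    // ... process the read ...
}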

Include issue labels

Some more issue labels would be useful:

in progress
in testing
critical
important
newbie
small fix

validate bin when indexing

The index tool should throw an error if the bin field is set incorrectly, because that would make the resulting index incorrect.

bam output is broken

It turns out that switching to parallel decompression introduced a bug: files are truncated.

pileupChunks is buggy

It seems that the heuristic it relies upon is wrong, or maybe there's a bug in the algorithm. By design, the number of detected SNPs shouldn't depend on the chunk size.

However, there is a dependency:

  chunk size, bytes      |       #SNPs
  -------------------------------------------------
         10_000                    1309
         20_000                     635
         30_000                     964
         40_000                    1037
         50_000                     618
        100_000                     618
        150_000                     436
        200_000                     436

validation tool

It should:

  • be able to read from config file a list of errors to validate together with their priorities
  • have an option to check only top N errors from a list sorted by priority
  • have an option to stop after M errors with priority >= P
  • have several verbosity modes (e.g. only summary, invalid read names + summary, invalid read names + error messages + summary, invalid reads in SAM format + error messages + summary)
  • validate mate existence and paired read fields as Picard does (with the aid of a Kyoto Cabinet on-disk hashtable)
  • effectively use multithreading

Template constraints

As of now, there's a lot of templated code that doesn't restrict its compile-time parameters in any way. Concepts should be introduced that specify the required behaviour, e.g. ReadRange, PileupColumnRange, PileupObject, etc.

Basically, the code in sambamba tries to avoid the overhead of D classes. However, object-oriented thinking is still widely present, as it makes the code much easier to understand and write. The typically used constructs are:

  • structs - instead of classes
  • alias this - instead of inheritance
  • compile-time polymorphism - instead of run-time polymorphism
  • (currently lacking) concepts - instead of interfaces

The purpose of concepts is twofold: making the code of this library easier to read, and introducing somewhat stricter typing.
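
A minimal sketch of what such a concept check could look like (the member names are only examples):

import std.range : ElementType, isInputRange;

// 'ReadRange' concept: an input range whose elements look like reads.
enum isReadRange(R) = isInputRange!R
    && __traits(hasMember, ElementType!R, "name")
    && __traits(hasMember, ElementType!R, "position");

// The constraint documents and enforces the contract at compile time.
auto pileup(R)(R reads) if (isReadRange!R)
{
    // ... build pileup columns from the reads ...
}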

Sorting BAM

External sort algorithms are to be studied and evaluated.

Exceptions

Currently, the Ruby interpreter just crashes when the D runtime throws an exception. That should be fixed.

Alignment modification and output.

Ideally, including tags. At least modifying alignments should be easy: change properties of primitive types in place, and for arrays use std.array.replaceSlice (updating offsets before that). As for tags, the same idea applies, and since I don't currently store any offsets there, it's going to be even a bit easier.

The benefits of this approach are memory reuse and easy output to a BAM file (just do endian conversion if needed, and write the whole chunk).
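
For the array case, std.array.replaceSlice does the splicing; a tiny illustration with made-up offsets:

import std.array : replaceSlice;

auto record = new ubyte[64];           // raw alignment chunk
ubyte[] newField = [1, 2, 3, 4, 5];    // longer replacement value
// Swap bytes 32..36 for the new 5-byte value; any offsets stored
// elsewhere in the record would need updating first.
auto updated = replaceSlice(record, record[32 .. 36], newField);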

Editing SAM header.

Currently, there are no convenient methods for editing the header text, such as addComment(), addRgLine(), and so on. All this stuff should be encapsulated, with access through properties. This becomes important now that the library has BAM output capabilities.

Validation

The library should support validation of every alignment record and of the SAM header. Each entity should have an 'is_valid' property, and then functions like filterValid() and createValidationReport() should be available. Ruby code should have access to those high-level functions.
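
Intended usage would then be as simple as this (a sketch against the proposed is_valid property):

import std.algorithm : filter;

auto validReads = bam.reads.filter!(r => r.is_valid);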

stdin/stdout

The library should support working with standard streams, though stdin clearly has limitations on random access.

pileupChunks gives different probabilities

When changing the chunk size you get the same SNPs, but the probabilities may differ (probably for SNPs on chunk borders). Example:

2474c2474
< 13695679 C T 696.606
---
> 13695679 C T 699.458
4656c4656
< 29623222 T T,C 248.637
---
> 29623222 T T,C 270.62
5438c5438
< 32990049 G T,G 490.482
---
> 32990049 G T,G 496.058
5576c5576
< 33509522 C T,C 492.262
---
> 33509522 C T,C 492.342

Make serial parts of the library faster

Due to Amdahl's law, this is the biggest bottleneck at the moment. Every opportunity should be considered, including using assembly in some places.

Currently, there are two such places: reading the file contents, and reading alignments from the stream composed of decompressed BGZF blocks.
