arq5x / grabix Goto Github PK

a wee tool for random access into BGZF files.

License: MIT License

Makefile 0.39% C 75.49% C++ 18.81% Shell 3.32% Python 2.00%

grabix's Introduction

grabix - a wee tool for random access into BGZF files.

grabix leverages the fantastic BGZF library in samtools to provide random access into text files that have been compressed with bgzip. grabix creates it's own index (.gbi) of the bgzipped file. Once indexed, one can extract arbitrary lines from the file with the grab command. Or choose random lines with the, well, random command.

There's a ton of room for improvement, but I needed something quickly in support of a side project.

Here's a brief example using the simrep.chr1.bed file provided in the repository.

# 1. compress the file with bgzip
bgzip simrep.chr1.bed

# 2. create a grabix index of the file.
#    creates simrep.chr1.bed.gbi
grabix index simrep.chr1.bed.gz

# 3. now, extract the 100th line in the file.
grabix grab simrep.chr1.bed.gz 100
chr1	401285	401444	trf	218

# 4. extract the 100th through 110th lines in the file.
grabix grab simrep.chr1.bed.gz 100 110
chr1	401285	401444	trf	218
chr1	401573	401748	trf	280
chr1	404661	404707	trf	92
chr1	406202	406274	trf	76
chr1	406227	406286	trf	77
chr1	406776	406819	trf	68
chr1	409821	409866	trf	51
chr1	409865	409900	trf	52
chr1	421245	421285	trf	64
chr1	422395	422435	trf	80
chr1	422560	422588	trf	56

You can also use grabix to extract random lines from the file

# extract 10 randome lines from the file using reservoir sampling
grabix random simrep.chr1.bed.gz 10

Is a gzipped file bgzipped?

grabix check simrep.chr1.bed.gz

grabix's People

Contributors

Stargazers

Watchers

Forkers

cc2qe drmjc chapmanb ryanlayer brentp mmterpstra smoe jmarshall alphaneer cih-y2k novapyth crerecombinase

grabix's Issues

Won

Speed up grab

grab can be much faster if we just use a simple getc(). The overhead of getline() is rather absurd. This makes a big difference for large files such as the 1000G VCF.

Expose to Python with Cython interface

Won't Compile !!!

Can you provide the Binary file grabix or add a document . What compiler do you use ? what system do you support ? Windows ? Linux ? I cann't read c++ code . In my system it output error :
grabix.cpp : In fundtion 'int create_grabix_index(std::__cxx11::string) '
error:invalid conversion from 'char' to 'char*' [-fpermissive]
line ->s = '\0'
Can you simply provide the compiled grabix ?
Tks

grabix grab some.vcf.gz 0 returns the 1st row of the file, but should throw an error

grabix grab some.vcf.gz 0 returns the same row as
grabix grab some.vcf.gz 1

Possible off-by-1 error at chunksize boundary

When working on grabix indexed FASTQ files in bcbio, I get an off-by-one error since the latest commit six days ago (version 0.1.5) when crossing the chunksize (which is 10000), possibly due to an earlier off-by-1 fix in 8d81fed that may not have had the intended result (the latter is pure speculation on my side). As a result, I get malformed FASTQ files when grabbing ranges of rows that are multiplies of 4 (the length of a canonical FASTQ record) whenever I perform a grab that starts at more than 10000 lines.

For instance, this is a asking for a grab that starts with the first entry of a FASTQ record (the FASTQ ID), since 9997 % 4 == 1 and it works fine since we start below the chunksize:

$ grabix grab SAMPLE.fq.gz 9997 10012
@2500/1
CATAGCCATGGCCTTTGACAGATACATGGCCATATGTAAACCTCTCCACTACCTGACCATCATGAGCCCAAGAATGTGTCTATACTTTTT
+
GGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGFGGGGGGGGFGGGEGGEGGGGGGEGGGEGGFEGGGGF
@2501/1
CGGTCCGGGGTGGTCCAGTCTGATCCCAACCGGCCCACCCGGGGCATCCGGTGGAAGTCTTCGCCGGAGGATCCGAAGGCAGCATCAACG
+
GGGGGEFFFFBFECFFFFFFGGGGGGGGGGGEFGGGGGGGGGEGG?:EEEECEEEEFEGGG=GEGGGEGGEEFEFBDCDA?DB?DF?EFE
@2502/1
GAGGCTGTGCTGTTTCTCTTTCTCTTCTAGGATATCAAAGTCGTCTTGTGCAGGTATGATGGGCCACCCCAAATGTATTTTGCCTACTGA
+
GGGGGGGGGGGGGGGGGGGGGGFGGGEGDGGGGGGFGGGGBGGGGGGEEGFEFFCEFDEEGEGDGGEGGE=DEEEAGFGGFBFFD=ED=C
@2503/1
AGTCTTATTGTTGGTTTGACTGTGAGTATATGGAACTCTGGCAAAATGAAATGGAGCAAAACACACAAAGAAGACAGCCACGACAACAAA
+
GGGGGGGGGGGGDAGFGBGGGFFGGGEDFEGGGGEGGGGGGGGGGGGGEGGGGDEGGGFDGGGFGGGFDEDGGGGGGGGGGGDGGFGDD=

However, when I start the grab on the exact same file and start 4 rows further along at start position 10001 (10001 % 4 == 1) which should give me the FASTQ ID of the next record as first line, I get:

$ grabix grab SAMPLE.fq.gz 10001 1008
CGGTCCGGGGTGGTCCAGTCTGATCCCAACCGGCCCACCCGGGGCATCCGGTGGAAGTCTTCGCCGGAGGATCCGAAGGCAGCATCAACG
+
GGGGGEFFFFBFECFFFFFFGGGGGGGGGGGEFGGGGGGGGGEGG?:EEEECEEEEFEGGG=GEGGGEGGEEFEFBDCDA?DB?DF?EFE
@2502/1
GAGGCTGTGCTGTTTCTCTTTCTCTTCTAGGATATCAAAGTCGTCTTGTGCAGGTATGATGGGCCACCCCAAATGTATTTTGCCTACTGA
+
GGGGGGGGGGGGGGGGGGGGGGFGGGEGDGGGGGGFGGGGBGGGGGGEEGFEFFCEFDEEGEGDGGEGGE=DEEEAGFGGFBFFD=ED=C
@2503/1

Note that the latter entry does not start with line 10001 but with line 10002, and also ends with line 10009 instead of line 10008. The input FASTQ file can be parsed fine using a hand-made parser that checks for validity. Also, the validity of the input file can be checked here in the first output (see above) that also covers the chunksize boundary and includes all relevant records that are erroneously chunked in the second entry.

Need to a account for headers (e.g., VCF)

grabix hangs on non-bgzipped file

example

echo asdf | gzip -c > t.gz
grabix index t.gz

grabix.cpp:258:58: error: 'getpid' was not declared in this scope

make was exiting with this error :

~/bin/grabix 1063 $ make
gcc -Wall -O2 -o grabix grabix_main.cpp grabix.cpp bgzf.c -lstdc++ -lz
grabix.cpp: In function 'int random(std::string, uint64_t)':
grabix.cpp:258:58: error: 'getpid' was not declared in this scope
         size_t seed = (unsigned)time(0)+(unsigned)getpid();
                                                          ^
make: *** [all] Error 1

Adding #include <unistd.h> to grabix.cpp fixed it.

Grabix hangs when vcfs are not sorted

Hi
I ran a vcf file through vt's (http://genome.sph.umich.edu/wiki/Vt) decomposition and normalisation then tried to run grabix. This hangs, it creates the index file but the file only has one number in it (only first index?) and it hangs.
This issue is also affects gemini (https://github.com/arq5x/gemini) users when using multi core option (--core >1) as it utilises grabix to create the index file before splitting up the job.

Current work around: also use vt's sort before using grabix or multicore gemini.

I used grabix version v0.1.6 and gemini v0.17.2
Example commands:

gemini load -v vcf_file.vqsr.vep.decomposed.normalised.vcf.gz --cores 4  --skip-gerp-bp -t VEP out.db
grabix index vcf_file.vqsr.vep.decomposed.normalised.vcf.gz

Cheers

GEMINI ISSUE

After using this command-$bash master=test.sh.
I am getting the following error
Using gemini found at: /usr/local/bin/gemini
/usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini/config.py:61: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config = yaml.load(in_handle)
CADD scores are being loaded (to skip use:--skip-cadd).
GERP per bp is being loaded (to skip use:--skip-gerp-bp).
Traceback (most recent call last):
File "/usr/local/bin/gemini", line 7, in
gemini_main.main()
File "/usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1249, in main
args.func(parser, args)
File "/usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 204, in load_fn
gemini_load.load(parser, args)
File "/usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 51, in load
l = load_singlecore(args)
File "/usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 86, in load_singlecore
l.populate_from_vcf()
File "/usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 223, in populate_from_vcf
(variant, variant_impacts) = self._prepare_variation(var, anno_keys)
File "/usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 406, in _prepare_variation
rs_ids = annotations.get_dbsnp_info(var)
File "/usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini/annotations.py", line 664, in get_dbsnp_info
for hit in annotations_in_vcf(var, "dbsnp", "vcf", "grch37"):
File "/usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini/annotations.py", line 422, in annotations_in_vcf
if isinstance(h, (cyvcf2.Variant, pysam.VariantRecord)):
AttributeError: 'module' object has no attribute 'VariantRecord'
/usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini/config.py:61: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config = yaml.load(in_handle)
Traceback (most recent call last):
File "/usr/local/bin/gemini", line 7, in
gemini_main.main()
File "/usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1249, in main
args.func(parser, args)
File "/usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 204, in load_fn
gemini_load.load(parser, args)
File "/usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 51, in load
l = load_singlecore(args)
File "/usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_load.py", line 86, in load_singlecore
l.populate_from_vcf()
File "/usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 223, in populate_from_vcf
(variant, variant_impacts) = self._prepare_variation(var, anno_keys)
File "/usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 414, in _prepare_variation
thousandG = annotations.get_1000G_info(var)
File "/usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini/annotations.py", line 737, in get_1000G_info
for hit in annotations_in_vcf(var, "1000g", "vcf", "grch37"):
File "/usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini/annotations.py", line 422, in annotations_in_vcf
if isinstance(h, (cyvcf2.Variant, pysam.VariantRecord)):
AttributeError: 'module' object has no attribute 'VariantRecord'

Help would be appreciated

Tag a releasee

Hey - this is a great little utility. Any interest in tagging a release so this can be distributed via home brew or Linux brew?

Thanks!

grabix on file with only header is wrong

wget -O - https://github.com/arq5x/gemini/raw/master/test/test.query.vcf | grep  ^# | bgzip -c > o.vcf.gz
grabix index o.vcf.gz
grabix grab o.vcf.gz 1 1

Silent failure on FASTQ files with empty sequence entries (i.e. empty lines)

There seems to be an assumption that the number in the second line of a *.gbi index file should be divisible by four. For example, this is checked in "bcbio": https://github.com/chapmanb/bcbio-nextgen/blob/0db025298367c133e8fad462646acd6a774c42ad/bcbio/ngsalign/alignprep.py#L274

Is this assumption (still) valid? I'm asking because I am running into problems when using the grabix master branch with the latest development version of bcbio. However, grabix 0.1.6 seems to work fine (I tried to pinpoint the issue to one of the changes between 0.1.6 and 0.1.7, but was not successful so far. The sample fastq data as generated in test.sh also seems to work well, but it does fail on other (larger) fastq files. Any ideas?

allow bgzipping while indexing

instead of

stuff | bgzip -c > o.vcf.gz
grabix o.vcf.gz

we should have

stuff | grabix indexcompress o

grabix size returns negative on large file

I have a file with 3102468676 lines compressed with pbgzip and indexed with grabix index. grabix size returns -1192498620. It seems that a long int needs to be updated to int64

Need BAM support. Can index BAMs in the same way.

Makefile gaining flexibility to allow for platform-specific hardening

--- grabix-0.0git20130718.orig/Makefile
+++ grabix-0.0git20130718/Makefile
@@ -1,2 +1,7 @@
+LDFLAGS+= -lstdc++ -lz
+CFLAGS+=-Wall
all:

  gcc -Wall -O2 -o grabix grabix.cpp bgzf.c -lstdc++ -lz

  gcc $(CFLAGS) -o grabix grabix.cpp bgzf.c $(LDFLAGS)

+clean:

```
  rm -f grabix
```

What would it take to enable S3 support

Hi all,

what would it take to enable S3 access in grabix? I was hopeful/naive enough to think that it was as simple as replacing bgzf.c and linking against a newer version of htslib (with build-in S3 support via libcrytpo and libcurl) and recompile. index fails immediately because bgzf_open() complains with "could not open file" (that's after commenting bgzf_is_bgzf() out).

Any pointers as to how this could make to work would help!

Thanks,
Andreas

Seed of a man page

One of the bits created for the Debian package. Enjoy.

Steffen

.TH GRABIX 1 "July 18, 2013"
.SH NAME
grabix - program to do something
.SH SYNOPSIS
.B bgzip
.RI bedfile
.br
.B grabix
index
.RI bedfile.gz
.br
.B grabix
grab
.RI bedfile.gz
.RI linenumber

.SH DESCRIPTION
.PP
In biomedical research it is increasing practice to study
the genetic basis of disease. This now frequently comprises
the sequencing of human sequences. The output of the machine
however is redundant, and the real sequence is the best
sequence to explain the redundancy. The exchange of data
happens only with compressed files - to huge and redundant
to perform otherwise. One should avoid uncompression whenever
possible.
.PP
grabix leverages the fantastic BGZF library of the samtools
package to provide random access into text files that have
been compressed with bgzip. grabix creates it's own index
(.gbi) of the bgzipped file. Once indexed, one can extract
arbitrary lines from the file with the grab command. Or
choose random lines with the, well, random command.

.SH SEE ALSO
.BR https://github.com/arq5x/grabix

Know line number from .gbi?

Is there a way I can extract the number of lines in the file from the grabix index (.gbi). That would come in handy..

patch to avoid uninitialised values, correcting cast int->uint64 for long files

--- grabix-0.0git20130718.orig/grabix.cpp
+++ grabix-0.0git20130718/grabix.cpp
@@ -261,7 +261,7 @@
srand(seed);

     // reservoir sample

```
   size_t s, N, result_size;
```

   uint64_t N=0, result_size=0;
 vector<string> sample;
 kstring_t *line;
 line = new kstring_t;

@@ -282,7 +282,7 @@
}
else
{

           s = (int) ((double)rand()/(double)RAND_MAX \* N);

           uint64_t s = (uint64_t) ((double)rand()/(double)RAND_MAX * N);
         if (s < K)
             sample[s] = line->s;
     }

@@ -327,4 +327,4 @@
}

return EXIT_SUCCESS;
-}
\ No newline at end of file
+}

incorrect output

Hi Aaron,
Whilst on the way to getting up to speed on gemini, I've had a play with grabix. Unfortunately it's not playing nicely on our CentOS systems.
Following the examples in the README.md, 'grab' always returns the first row in BED file, N times & 'random' gives me a seg fault:

$ bgzip simrep.chr1.bed
$ zcat simrep.chr1.bed.gz | head -n 10
chr1 10000 10468 trf 789
chr1 10627 10800 trf 346
chr1 10757 10997 trf 434
chr1 11225 11447 trf 273
chr1 11271 11448 trf 187
chr1 11283 11448 trf 199
chr1 19305 19443 trf 242
chr1 20828 20863 trf 70
chr1 30862 30959 trf 79
chr1 44835 44876 trf 73

$ ./grabix index simrep.chr1.bed.gz
-- worked fine

-- this reports the 1st line, not the 100th
$ ./grabix grab simrep.chr1.bed.gz 100
chr1 10000 10468 trf 789

-- this reports the first line 10x
$ ./grabix grab simrep.chr1.bed.gz 100 110
chr1 10000 10468 trf 789
chr1 10000 10468 trf 789
chr1 10000 10468 trf 789
chr1 10000 10468 trf 789
chr1 10000 10468 trf 789
chr1 10000 10468 trf 789
chr1 10000 10468 trf 789
chr1 10000 10468 trf 789
chr1 10000 10468 trf 789
chr1 10000 10468 trf 789
chr1 10000 10468 trf 789

$ ./grabix random simrep.chr1.bed.gz 10
Segmentation fault

some debugging info

$ uname -a
Linux gamma01.local 2.6.32-220.13.1.el6.x86_64 #1 SMP Tue Apr 17 23:56:34 BST 2012 x86_64 x86_64 x86_64 GNU/Linux
$ ldd grabix
linux-vdso.so.1 => (0x00007fff57000000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00000032cac00000)
libz.so.1 => /lib64/libz.so.1 (0x00000032c8c00000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000032ca000000)
libc.so.6 => /lib64/libc.so.6 (0x00000032c7c00000)
libm.so.6 => /lib64/libm.so.6 (0x00000032c8000000)
/lib64/ld-linux-x86-64.so.2 (0x00000032c7800000)
$ cat /etc/centos-release
CentOS release 6.2 (Final)

My C's a little rusty, so hopefully you'll know what's going on far faster than me
cheers,
Mark

Unclear license of grabix.cpp

Please add a COPYRIGHT file and/or add a header for the license of grabix.cpp as in the other source files. I was about to finish Debian packaging of grabix when I found this info missing.

Cheers,