arq5x / bedtools Goto Github PK

View Code? Open in Web Editor NEW

140.0 140.0 87.0 498.62 MB

A powerful toolset for genome arithmetic.

Home Page: http://code.google.com/p/bedtools/

License: GNU General Public License v2.0

Shell 4.12% Python 0.13% C++ 91.34% C 0.56% Makefile 3.85%

bedtools's People

Contributors

Stargazers

Watchers

Forkers

idot mikerlindberg cjfields daler verdurin jakebiesinger bbcf pengyu sneumann genome-vendor bjventers shalintripathi nathanweeks charles-plessy pryvkin rspengle specht dvanic cdelzinga lbthrice lparsons alexpenson nicolas-bertin nitandressa brentp dandaman nachocab theoryno3 jsilter ecwooten dangenet cbirger rashesh7 guitss bergeycm apratap bgruening friedue ahak dfjenkins3 bernhardlutz denizk s-es ralowe tulipnandu kontakter gabimerino mrg7 nala5 dongliangxu jennifershelton yxian9 biocyberman zzygyx9119 kitinje atmaivancevic osmanbeh zhoujj2013 faneechen antmarge peterulz happyshi0402 ankitmaroo gynecoloji shaowenj elentevanyan jeffmylife winbuntu adrien-evo tanyamrc edawson olgabot dougbarrows jmorival anandksrao alexjfortuna aishwaryagogate ariane-lozachmeur ayeaton zjsailor kaplans1 my0916 lawcal821 livygrant97 15101538237ren aj-prognomiq sijiewang05

bedtools's Issues

VCF output truncated, if blank line follows header

Hi Aaron -

I have some VCF files with a blank line after the #CHROM... header line. BEDTools programs operating on these VCFs only report 2 columns (chrom, pos) rather than the entire line as expected.

Below is some troubleshooting that reproduces the problem. They're not full-spec VCFs with the whole header, but they do isolate the problem. I finally narrowed it down: truncation only happens when there is a "##", followed by some text on the same line, and a blank line before the data.

That is, here only a2.vcf and a6.vcf show the truncation. The others, (including a3.vcf which has "##", no following characters, and a blank line) are fine.

Knowing this, I can now fix my files to get them to work, but I don't see anything in the VCF spec that disallows blank lines like that -- hence the bug report.

$ cat b.bed
chr10   3100000 3200000

for i in 0 1 2 3 4 5 6
do
 fn=a$i.vcf
 echo; echo "$fn original:"
 cat $fn
 echo; echo "intersected:"
 bedtools intersect -a $fn -b b.bed; echo;
done

a0.vcf original:
chr10   3127949 T   A   46.3    .   FQ=-28.7;DP4=0,0,4,1;AC1=6.0;VDB=0.08892796;MQ=20;AF1=1.0;DP=5  GT:PL:GQ    1/1:0,0,0:4 1/1:0,0,0:4 1/1:79,15,0:17

intersected:
chr10   3127949 T   A   46.3    .   FQ=-28.7;DP4=0,0,4,1;AC1=6.0;VDB=0.08892796;MQ=20;AF1=1.0;DP=5  GT:PL:GQ    1/1:0,0,0:4 1/1:0,0,0:4 1/1:79,15,0:17


a1.vcf original:
##fileformat=VCFv4.1
chr10   3127949 T   A   46.3    .   FQ=-28.7;DP4=0,0,4,1;AC1=6.0;VDB=0.08892796;MQ=20;AF1=1.0;DP=5  GT:PL:GQ    1/1:0,0,0:4 1/1:0,0,0:4 1/1:79,15,0:17

intersected:
chr10   3127949 T   A   46.3    .   FQ=-28.7;DP4=0,0,4,1;AC1=6.0;VDB=0.08892796;MQ=20;AF1=1.0;DP=5  GT:PL:GQ    1/1:0,0,0:4 1/1:0,0,0:4 1/1:79,15,0:17


a2.vcf original:
##fileformat=VCFv4.1

chr10   3127949 T   A   46.3    .   FQ=-28.7;DP4=0,0,4,1;AC1=6.0;VDB=0.08892796;MQ=20;AF1=1.0;DP=5  GT:PL:GQ    1/1:0,0,0:4 1/1:0,0,0:4 1/1:79,15,0:17

intersected:
chr10   3127949


a3.vcf original:
##

chr10   3127949 T   A   46.3    .   FQ=-28.7;DP4=0,0,4,1;AC1=6.0;VDB=0.08892796;MQ=20;AF1=1.0;DP=5  GT:PL:GQ    1/1:0,0,0:4 1/1:0,0,0:4 1/1:79,15,0:17

intersected:
chr10   3127949 T   A   46.3    .   FQ=-28.7;DP4=0,0,4,1;AC1=6.0;VDB=0.08892796;MQ=20;AF1=1.0;DP=5  GT:PL:GQ    1/1:0,0,0:4 1/1:0,0,0:4 1/1:79,15,0:17


a4.vcf original:
#comment

chr10   3127949 T   A   46.3    .   FQ=-28.7;DP4=0,0,4,1;AC1=6.0;VDB=0.08892796;MQ=20;AF1=1.0;DP=5  GT:PL:GQ    1/1:0,0,0:4 1/1:0,0,0:4 1/1:79,15,0:17

intersected:
chr10   3127949 T   A   46.3    .   FQ=-28.7;DP4=0,0,4,1;AC1=6.0;VDB=0.08892796;MQ=20;AF1=1.0;DP=5  GT:PL:GQ    1/1:0,0,0:4 1/1:0,0,0:4 1/1:79,15,0:17


a5.vcf original:
##fileformat=VCFv4.1
#CHROM
chr10   3127949 T   A   46.3    .   FQ=-28.7;DP4=0,0,4,1;AC1=6.0;VDB=0.08892796;MQ=20;AF1=1.0;DP=5  GT:PL:GQ    1/1:0,0,0:4 1/1:0,0,0:4 1/1:79,15,0:17

intersected:
chr10   3127949 T   A   46.3    .   FQ=-28.7;DP4=0,0,4,1;AC1=6.0;VDB=0.08892796;MQ=20;AF1=1.0;DP=5  GT:PL:GQ    1/1:0,0,0:4 1/1:0,0,0:4 1/1:79,15,0:17


a6.vcf original:
##fileformat=VCFv4.1
#CHROM

chr10   3127949 T   A   46.3    .   FQ=-28.7;DP4=0,0,4,1;AC1=6.0;VDB=0.08892796;MQ=20;AF1=1.0;DP=5  GT:PL:GQ    1/1:0,0,0:4 1/1:0,0,0:4 1/1:79,15,0:17

intersected:
chr10   3127949

-f and -r options for merge?

https://groups.google.com/forum/?fromgroups=#!topic/bedtools-discuss/9cD2Er4wFps

Bug in enforcing sorted files

The following file is tolerated by GentNextBed(sorted=True). It is because the starts are checked but we don't keep track of which chroms have already been seen.

$ cat cluster.bed
chr1    10  20  a   1   +
chr1    25  30  b   2   +
chr1    25  35  c   3   +
chr1    27  33  d   4   -
chr2    27  31  e   5   -
chr1    50  80  f   6   +
chr1    51  81  g   7   -

# strandless clustering works fine:

$ clusterBed -i cluster.bed 
chr1    10  20  a   1   +   1
chr1    25  30  b   2   +   2
chr1    25  35  c   3   +   2
chr1    27  33  d   4   -   2
chr2    27  31  e   5   -   3
chr1    50  80  f   6   +   4
chr1    51  81  g   7   -   4

Controlling delimiter in -nms option of mergeBed

I use mergeBed frequently with the -nms feature to list multiple transcripts that BED intervals belong to, e.g.:

chr1    3195981 3197398 ENSMUST00000162897;ENSMUST00000159265   -
chr1    3203519 3207049 ENSMUST00000159265;ENSMUST00000162897;ENSMUST00000070533    -

This lists the transcripts in ;-separated list. When I call tagBam against a set of BED files that have these semicolon separated lists, you can get results like:

merged_exons:chr10:3426454-3428987,ENSMUST00000078070;ENSMUST00000144622;ENSMUST00000145156;ENSMUST00000165510;ENSMUST00000154998;ENSMUST00000086896;ENSMUST00000105617;ENSMUST00000149708;ENSMUST00000170680,+,;introns:chr10:3323400-3434581,ENSMUST00000086896,1,+,introns:chr10:3366295-3445776,ENSMUST00000165510,1,+,introns:chr10:3374343-3438478,ENSMUST00000129954,1,-

Since tagBam itself uses semicolons to delimit the intervals that a particular BAM read maps to, this makes it a pain to parse, since semicolons delineate the fields of the tagBam optional entry in the BAM, but have additional meaning inside the fields as delimiters of the BED intervals that the reads were mapped against.

Is it possible to add an option to mergeBed that would specify the delimiter? E.g. ; instead of ,? That would be useful and not require post-processing of each BED produced by mergeBed to manually change the delimiter.

A perhaps more general alternative: it would be probably better to allow the user to specify the delimiter for tagBam. That would also solve the problem.

Thanks.

ENH: shuffle within a window or distance

The significance testing used by ENCODE as developed by Bickel et al (http://arxiv.org/pdf/1101.0947.pdf) and available here: http://www.encodestatistics.org/
(IIUC) seems to rely on shuffling within prescribed regions. Part of their method is defining those regions by segmentation, but BEDTools, could ease this type of investigation by allowing one to specify that intervals are only shuffled within the window from which they arise or that they are only shuffled to within a given distance.

The latter should be quite simple to implement, though more difficult within the current framework of bedtools shuffle.

An interface could be like:

bedtools window_shuffle -w windows.bed -i input.bed -g hg18.genome > local-shuffle.bed

with -w exclusive with -d for distance:

bedtools window_shuffle -d 50000 -i input.bed -g hg19.genome > local-shuffle.bed

allow specifying multiple columns and multiple ops to bedtools map (a la groupby)

the current version of bedtools map allows only a single column and op.
it would be nice to specify multiple columns and ops as in groupby.

it would also be nice to share code between these tools so that, ex, all the ops supported by groupby could also be supported by map.

Potential bug causing segfault dump in makewindows

Hi,

I'm trying to make a set of sliding windows using makewindows from a GFF file. Example GFF entry:

chr2    ensGene three_prime_UTR 136907192   136908759   .   -   .   gene_id=ENSMUSG00000027276;region_len=1568;Name=ENSMUST00000028735.utr25;Parent=ENSMUST00000028735;ID=ENSMUST00000028735.utr25

If I try to make 10-bp windows from this, I get a core dump:

$ bedtools makewindows -b myregion.gff -n 10

Yields:

$ bedtools makewindows -b myregion.gff -n 10
    136907191   136907348
    136907348   136907505
    136907505   136907662
    136907662   136907819
    136907819   136907976
    136907976   136908133
    136908133   136908290
    136908290   136908447
    136908447   136908604
    136908604   136908759
*** glibc detected *** bedtools: free(): invalid pointer: 0x00000000012beb10 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7eb96)[0x7fc8d903db96]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSs6assignERKSs+0x79)[0x7fc8d994c6b9]
bedtools[0x41f821]
bedtools[0x420fa5]
bedtools[0x41a389]
bedtools[0x4cb075]
bedtools[0x4c95c5]
bedtools[0x43790a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7fc8d8fe076d]
bedtools[0x406fb9]
======= Memory map: ========
00400000-0054c000 r-xp 00000000 00:1d 4759503922                         /lab/solexa_jaenisch/solexa_jaenisch2/yarden/software/bedtools/bin/bedtools
0074b000-0074d000 r--p 0014b000 00:1d 4759503922                         /lab/solexa_jaenisch/solexa_jaenisch2/yarden/software/bedtools/bin/bedtools
0074d000-0074e000 rw-p 0014d000 00:1d 4759503922                         /lab/solexa_jaenisch/solexa_jaenisch2/yarden/software/bedtools/bin/bedtools
0074e000-0074f000 rw-p 00000000 00:00 0 
012b6000-012d7000 rw-p 00000000 00:00 0                                  [heap]
7fc8d8fbf000-7fc8d9174000 r-xp 00000000 fc:00 937049                     /lib/x86_64-linux-gnu/libc-2.15.so
7fc8d9174000-7fc8d9374000 ---p 001b5000 fc:00 937049                     /lib/x86_64-linux-gnu/libc-2.15.so
7fc8d9374000-7fc8d9378000 r--p 001b5000 fc:00 937049                     /lib/x86_64-linux-gnu/libc-2.15.so
7fc8d9378000-7fc8d937a000 rw-p 001b9000 fc:00 937049                     /lib/x86_64-linux-gnu/libc-2.15.so
7fc8d937a000-7fc8d937f000 rw-p 00000000 00:00 0 
7fc8d937f000-7fc8d9394000 r-xp 00000000 fc:00 917551                     /lib/x86_64-linux-gnu/libgcc_s.so.1
7fc8d9394000-7fc8d9593000 ---p 00015000 fc:00 917551                     /lib/x86_64-linux-gnu/libgcc_s.so.1
7fc8d9593000-7fc8d9594000 r--p 00014000 fc:00 917551                     /lib/x86_64-linux-gnu/libgcc_s.so.1
7fc8d9594000-7fc8d9595000 rw-p 00015000 fc:00 917551                     /lib/x86_64-linux-gnu/libgcc_s.so.1
7fc8d9595000-7fc8d9690000 r-xp 00000000 fc:00 937060                     /lib/x86_64-linux-gnu/libm-2.15.so
7fc8d9690000-7fc8d988f000 ---p 000fb000 fc:00 937060                     /lib/x86_64-linux-gnu/libm-2.15.so
7fc8d988f000-7fc8d9890000 r--p 000fa000 fc:00 937060                     /lib/x86_64-linux-gnu/libm-2.15.so
7fc8d9890000-7fc8d9891000 rw-p 000fb000 fc:00 937060                     /lib/x86_64-linux-gnu/libm-2.15.so
7fc8d9891000-7fc8d9977000 r-xp 00000000 fc:00 1967661                    /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.18
7fc8d9977000-7fc8d9b76000 ---p 000e6000 fc:00 1967661                    /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.18
7fc8d9b76000-7fc8d9b7e000 r--p 000e5000 fc:00 1967661                    /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.18
7fc8d9b7e000-7fc8d9b80000 rw-p 000ed000 fc:00 1967661                    /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.18
7fc8d9b80000-7fc8d9b95000 rw-p 00000000 00:00 0 
7fc8d9b95000-7fc8d9bab000 r-xp 00000000 fc:00 917732                     /lib/x86_64-linux-gnu/libz.so.1.2.3.4
7fc8d9bab000-7fc8d9daa000 ---p 00016000 fc:00 917732                     /lib/x86_64-linux-gnu/libz.so.1.2.3.4
7fc8d9daa000-7fc8d9dab000 r--p 00015000 fc:00 917732                     /lib/x86_64-linux-gnu/libz.so.1.2.3.4
7fc8d9dab000-7fc8d9dac000 rw-p 00016000 fc:00 917732                     /lib/x86_64-linux-gnu/libz.so.1.2.3.4
7fc8d9dac000-7fc8d9dce000 r-xp 00000000 fc:00 937061                     /lib/x86_64-linux-gnu/ld-2.15.so
7fc8d9fa3000-7fc8d9fa8000 rw-p 00000000 00:00 0 
7fc8d9fca000-7fc8d9fce000 rw-p 00000000 00:00 0 
7fc8d9fce000-7fc8d9fcf000 r--p 00022000 fc:00 937061                     /lib/x86_64-linux-gnu/ld-2.15.so
7fc8d9fcf000-7fc8d9fd1000 rw-p 00023000 fc:00 937061                     /lib/x86_64-linux-gnu/ld-2.15.so
7fff7a4bd000-7fff7a4de000 rw-p 00000000 00:00 0                          [stack]
7fff7a5f3000-7fff7a5f4000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
Aborted (core dumped)

My bedtools version is:

$ bedtools --version
bedtools v2.17.0-33-g2a16ea9

Any ideas on this? It's a super helpful feature based on docs so I'd really love to use it.

Thanks, --Yarden

Checked-in .o files in src/utils/gzstream

There are three object files that appear to have been inadvertently checked in:

src/utils/gzstream/gzstream.o
src/utils/gzstream/test_gunzip.o
src/utils/gzstream/test_gzip.o

No big deal as the top-level makefile dependencies call for obj/gzstream.o instead, but it would be as well not to have Mac OS object files in the source distribution...

coverageBed output incorrect with -abam option, correct with -abam stdin

I cant seem to get correct output from coverageBed when providing the bam file directly:

bedfile.bed (51,011 lines)

...
Example1Contig  912832681   912833713   MyFeature
...

using samtools view on bamfile.bam shows no reads in this interval:

$ samtools view bamfile.bam Example1Contig:912832681-912833713
$

Piping this directly into coverageBed reports zero:

$ samtools view bamfile.bam Example1Contig:912832681-912833713 | 
coverageBed -abam stdin -b bedfile.bed | grep MyFeature
Example1Contig  912832681   912833713   MyFeature   0   0   1032    0.0000000

As well as piping the entire bam file into coverageBed:

$ samtools view bamfile.bam | coverageBed -abam stdin -b bedfile.bed | grep MyFeature
Example1Contig  912832681   912833713   MyFeature   0   0   1032    0.0000000

BUT, using the -abam <file> option, a totally different count is registered:

$ coverageBed -abam bamfile.bam -b bedfile.bed | grep MyFeature
Example1Contig  912832681   912833713   MyFeature   1355    1032    1032    1.0000000

Any ideas as to what is going on (or what I missed here!) are appreciated. Thanks,
Nik

error of running genomecov

Hi,

I run the genomecov using the commond line as below:

$ bedtools genomecov -ibam Pp-BWAmemNCBI-90Grmdupl.bam -d > coverage.log

The input bam was generation by using BWA-mem for mapping, picard for sorting and duplicate removing.

Bedtools was runnig for several hours, but after analyzing about 2 million sites (out of 1 billion site), it was corrupted.

I get the error message as below:

*** glibc detected *** bedtools: double free or corruption (out): 0x0000000002597fc0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7eb96)[0x7faa43accb96]
bedtools[0x479f57]
bedtools[0x47a42d]
bedtools[0x47b7ba]
bedtools[0x47e909]
bedtools[0x406bd5]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7faa43a6f76d]
bedtools[0x40aff9]
======= Memory map: ========
00400000-00573000 r-xp 00000000 08:02 44317440 /home/joinet/bin/bedtools
00772000-00774000 r--p 00172000 08:02 44317440 /home/joinet/bin/bedtools
00774000-00775000 rw-p 00174000 08:02 44317440 /home/joinet/bin/bedtools
00775000-00776000 rw-p 00000000 00:00 0
01fe1000-0436e000 rw-p 00000000 00:00 0 [heap]
7faa438fa000-7faa439bc000 rw-p 00000000 00:00 0
7faa43a4e000-7faa43c03000 r-xp 00000000 08:02 41943289 /lib/x86_64-linux-gnu/libc-2.15.so
7faa43c03000-7faa43e02000 ---p 001b5000 08:02 41943289 /lib/x86_64-linux-gnu/libc-2.15.so
7faa43e02000-7faa43e06000 r--p 001b4000 08:02 41943289 /lib/x86_64-linux-gnu/libc-2.15.so
7faa43e06000-7faa43e08000 rw-p 001b8000 08:02 41943289 /lib/x86_64-linux-gnu/libc-2.15.so
7faa43e08000-7faa43e0d000 rw-p 00000000 00:00 0
7faa43e0d000-7faa43e22000 r-xp 00000000 08:02 41946637 /lib/x86_64-linux-gnu/libgcc_s.so.1
7faa43e22000-7faa44021000 ---p 00015000 08:02 41946637 /lib/x86_64-linux-gnu/libgcc_s.so.1
7faa44021000-7faa44022000 r--p 00014000 08:02 41946637 /lib/x86_64-linux-gnu/libgcc_s.so.1
7faa44022000-7faa44023000 rw-p 00015000 08:02 41946637 /lib/x86_64-linux-gnu/libgcc_s.so.1
7faa44023000-7faa4411e000 r-xp 00000000 08:02 41943305 /lib/x86_64-linux-gnu/libm-2.15.so
7faa4411e000-7faa4431d000 ---p 000fb000 08:02 41943305 /lib/x86_64-linux-gnu/libm-2.15.so
7faa4431d000-7faa4431e000 r--p 000fa000 08:02 41943305 /lib/x86_64-linux-gnu/libm-2.15.so
7faa4431e000-7faa4431f000 rw-p 000fb000 08:02 41943305 /lib/x86_64-linux-gnu/libm-2.15.so
7faa4431f000-7faa44401000 r-xp 00000000 08:02 16260627 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
7faa44401000-7faa44600000 ---p 000e2000 08:02 16260627 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
7faa44600000-7faa44608000 r--p 000e1000 08:02 16260627 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
7faa44608000-7faa4460a000 rw-p 000e9000 08:02 16260627 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
7faa4460a000-7faa4461f000 rw-p 00000000 00:00 0
7faa4461f000-7faa44635000 r-xp 00000000 08:02 41946727 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
7faa44635000-7faa44834000 ---p 00016000 08:02 41946727 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
7faa44834000-7faa44835000 r--p 00015000 08:02 41946727 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
7faa44835000-7faa44836000 rw-p 00016000 08:02 41946727 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
7faa44836000-7faa44858000 r-xp 00000000 08:02 41943317 /lib/x86_64-linux-gnu/ld-2.15.so
7faa44a3c000-7faa44a41000 rw-p 00000000 00:00 0
7faa44a54000-7faa44a58000 rw-p 00000000 00:00 0
7faa44a58000-7faa44a59000 r--p 00022000 08:02 41943317 /lib/x86_64-linux-gnu/ld-2.15.so
7faa44a59000-7faa44a5b000 rw-p 00023000 08:02 41943317 /lib/x86_64-linux-gnu/ld-2.15.so
7fff7dad1000-7fff7daf2000 rw-p 00000000 00:00 0 [stack]
7fff7dbc8000-7fff7dbc9000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
core dumped

What could be the problem?

Thank you,

Chih

Does bedtools handle the ##FASTA tag in gff files?

I have been using the bedtools getfasta command with a gff file as the argument to -bed. The gff file contains the ##FASTA section and when I run the bedtools command I get the following message:

It looks as though you have less than 3 columns at line: 4830. Are you sure your files are tab-delimited?

Line 4830 of the gff file is inside the fasta section of the gff file

Fix help for all tools using -abam

It is not clear to users that -abam is a replacement for -a. It should be made more clear.''

https://groups.google.com/forum/?hl=en&fromgroups=#!topic/bedtools-discuss/AeHMIq4rKRQ

Improve the description of -wb

https://groups.google.com/forum/?hl=en&fromgroups=#!topic/bedtools-discuss/X3kZt-Fu0Ao

filename length dependent segfault in tagBam

When filename given to tagBam as -files argument is too long, the program completes but terminates with a Segmentation fault. Test case files available here: tinyurl.com/budlfva

# Long input filename causes segfault "to_be_or_not_to_be_that_is_the_question_whether_tis_nobler_in_the_mind_to_suffer.gff" makes segfault:
$ tagBam -i ./accepted_hits.ribosub.sorted.bam -files ./to_be_or_not_to_be_that_is_the_question_whether_tis_nobler_in_the_mind_to_suffer.gff  -labels gff -intervals -f 1 > /dev/null
/home/yarden/jaen/software/BEDTools-Version-2.15.0/bin/tagBam: line 2: 17451 Segmentation fault      bedtools tag "$@"

If the identical input GFF file is renamed to a shorter filename hamlet.gff, the segfault goes away:

# Identical file with short filename "hamlet.gff" does not segfault
$ cp ./to_be_or_not_to_be_that_is_the_question_whether_tis_nobler_in_the_mind_to_suffer.gff hamlet.gff

$ tagBam -i ./accepted_hits.ribosub.sorted.bam -files ./hamlet.gff  -labels gff -intervals -f 1 > /dev/null
$

The behavior occurs even if the filename is long due to its path length (not just basename length), so an input like: /my/long/path/to/gff/... leads to a segfault as well.

freeze with shuffle -excl

with an genome file like this:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
            "select chrom, size from  hg18.chromInfo" > hg18.genome

an exclusion file like this (excl.bed):

chr1    121100000   124300000   p11.1   acen
chr1    124300000   128000000   q11 acen
chr2    91000000    93300000    p11.1   acen
chr2    93300000    95700000    q11.1   acen
chr3    89400000    91700000    p11.1   acen
chr3    91700000    93200000    q11.1   acen
chr4    48700000    50700000    p11 acen
chr4    50700000    52400000    q11 acen
chr5    45800000    47700000    p11 acen
chr5    47700000    50500000    q11.1   acen
chr6    58400000    60500000    p11.1   acen
chr6    60500000    63400000    q11.1   acen
chr7    57400000    59100000    p11.1   acen
chr7    59100000    61100000    q11.1   acen
chr8    43200000    45200000    p11.1   acen
chr8    45200000    48100000    q11.1   acen
chr9    46700000    51800000    p11.1   acen
chr9    51800000    60300000    q11 acen
chr10   38800000    40300000    p11.1   acen
chr10   40300000    42100000    q11.1   acen
chr11   51400000    52900000    p11.11  acen
chr11   52900000    56400000    q11 acen
chr12   33200000    35400000    p11.1   acen
chr12   35400000    36500000    q11 acen
chr13   13500000    16000000    p11.1   acen
chr13   16000000    18400000    q11 acen
chr14   13600000    15600000    p11.1   acen
chr14   15600000    19100000    q11.1   acen
chr15   14100000    17000000    p11.1   acen
chr15   17000000    18400000    q11.1   acen
chr16   34400000    38200000    p11.1   acen
chr16   38200000    40700000    q11.1   acen
chr17   22100000    22200000    p11.1   acen
chr17   22200000    23200000    q11.1   acen
chr18   15400000    16100000    p11.1   acen
chr18   16100000    17300000    q11.1   acen
chr19   26700000    28500000    p11 acen
chr19   28500000    30200000    q11 acen
chr20   25700000    27100000    p11.1   acen
chr20   27100000    28400000    q11.1   acen
chr21   10000000    12300000    p11.1   acen
chr21   12300000    13200000    q11.1   acen
chr22   11800000    16300000    q11.1   acen
chrX    56600000    59500000    p11.1   acen
chrX    59500000    65000000    q11.1   acen

and an input.bed like this:

chr2    237373226   237378925   0.2431
chr20   6976859 6982547 0.1692
chr20   48108756    48144977    0.1483
chr20   52888740    52894852    0.1443
chr21   19962126    19965043    0.1793
chr3    38411   6057552 0.1728
chr3    6085764 6088062 0.389
chr3    6093366 172418635   0.1707
chr4    20500517    20586746    0.2058
chr4    25339644    25378233    0.1938

the command

 bedtools shuffle -excl excl.bed -i input.bed -g hg18.genome

freezes.

Add -split option to tag sub command

https://groups.google.com/forum/?fromgroups=#!topic/bedtools-discuss/_2ky6yqmxEQ

Enhancement: subtractBed: Add switch to allow -a file to be read in memory and -b from disk

Most of the time that I want to use subtractBed, the b file is very big (e.g.: 2 GiB: 66 million lines) as it contains all known snp positions.

This command is then very slow (file a contains only 10000 positions) and sometimes is unable to run as there is not enough memory to load the b file in memory:

subtractBed -a sample_snps.bed -b allknownsnps.bed

This is the error I get when I run it on a machine with 32GiB of RAM:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

When running the command with strace:

strace -o subtractBed.strace subtractBed -a sample_snps.bed -b allknownsnps.bed

This is the end of the log:

read(3, "17\t67979430\t67979431\tG\tA\tPASS\nch"..., 8191) = 8191
brk(0x6a65bf000)                        = 0x6a65bf000
read(3, "7985837\tG\tC\tMinDP\nchr17\t67985847"..., 8191) = 8191
brk(0x6a65e7000)                        = 0x6a65e7000
read(3, "\tA\tG\tPASS\nchr17\t67992208\t6799220"..., 8191) = 8191
read(3, "\nchr17\t67999172\t67999173\tT\tG\tPAS"..., 8191) = 8191
brk(0x6a660b000)                        = 0x6a65e7000
mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
write(2, "terminate called after throwing "..., 48) = 48
write(2, "std::bad_alloc", 14)          = 14
write(2, "'\n", 2)                      = 2
write(2, "  what():  ", 11)             = 11
write(2, "std::bad_alloc", 14)          = 14
write(2, "\n", 1)                       = 1
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
gettid()                                = 26915
tgkill(26915, 26915, SIGABRT)           = 0
--- SIGABRT (Aborted) @ 0 (0) ---
+++ killed by SIGABRT (core dumped) +++

# Line number lose to where the PC runs out of memory: 58469306
$ grep -n -m1 '^chr17'$'\t''67999172' allknownsnps.bed
58469306:chr17  67999172    67999173    T   G   PASS

# Total number of lines in the file: 66007044
$ wc -l allknownsnps.bed
66007044

It would be nice to use subtractBed like this:

subtractBed -loada -a sample_snps.bed -b allknownsnps.bed

Where -loada loads file a in memory and b from disk.
So if file b is read line by line, it just needs to remove all entries from file a (that is loaded in memory) that are found in file b.

I know it is possible to mimic this behaviour with, the following, but it would be much easier if it was implemented in subtractBed directly:

intersectBed -wb -a allknownsnps.bed -b sample_snps.bed | subtractBed -a sample_snps.bed -b stdin

Handle \r\n and \n line endings.

This will benefit the IGV/bedtools integration.

Invalid -f Argument in windowBed

What steps will reproduce the problem?

windowBed -a file1.bed -b file2.bed -c f 1.0

What is the expected output? What do you see instead?

I expect counts of the number of entries in file B that overlap with file A
and where the entire faction of entry in file A is overlaped by an entry in File B

I get instead an error says *ERROR: Unrecognized parameter: -f
This indicated that the file of operation I am trying to achieve is not supported... However, if is documented in the usage for the tool under the -c Argument:
-c For each entry in A, report the number of overlaps with B.
- Reports 0 for A entries that have no overlap with B.
- Overlaps restricted by -f.

What version of the product are you using? On what operating system?

bedtools v2.16.2, Linux Debian

Add a 'shift' tool.

shift should move an interval left or right by either a fixed number of bases or a specific fraction of a chromosome.

multicov needs to handle split alignments and BED12

https://groups.google.com/group/bedtools-discuss/browse_thread/thread/40a8bbee8dd35c1a

Have chrom_sizes built into the tool

Instead of providing genome (-g) files, have chrom_sizes for several widely-used genomes burned into the tool.

Obvious issue is chromosome naming schemes. Could have:

'hg19'   [chr1, chr2, ...]
'hg19-int'   [1, 2, ...]

Possibly dangerous however, as the advantage of the -g file approach is that the user KNOWS what they are doing, whereas under this model, results could be wrong owing to unknown differences in the chrom naming schemes (e.g., the 7,459 ways that the mitochondria is named)

Add a 'flip' tool.

Like bit flipping, this would flip the strand of every interval.

Finding closest intervals on _both_ sides of a query interval

https://groups.google.com/forum/?fromgroups=#!topic/bedtools-discuss/_Pinr6dLG1E

While closest doesn't do this in one fell swoop, it should be able to do it in two steps via -D, -id, and -iu. However, I cannot get a toy example to work the way it should. It suggests a bug in the -id and -iu options.

Bug in slop if chrom not present in genome file?

https://groups.google.com/forum/?fromgroups=#!topic/bedtools-discuss/PTalkUVQbwk

Support for remote files.

Self explanatory. Use curlib? boost::asio? We would prefer to not use boost::asio in order to minimize external dependencies.

Is there a stable C++ library for reading remote files?

ENH: implement new tool bedtools sample

This would be useful for stuff related to significance of overlap (along with shuffle).

this would have a command-line interface like:

bedtools sample -i stdin -n 500 > out.500.bed

This could be done with sort -R or nl, or something, but those tools are not on most systems that I work with.

This is very easy to implement with reservoir sampling, here;s a python version mostly taken from wikipedia:

    import random
    lines = []
    for i, line in enumerate(bed):
        if i < n:
            lines.append(line)
        else:
            replace_idx = random.randint(0, i)
            if replace_idx < n:
                lines[replace_idx] = line
    print "".join(lines),

Below may be a completely separate tool, but something that would be useful to add to a suite like this in bedtools....

it may also be nice to do shuffling of labels, e.g.

bedtools label_shuffle -col 4 -i stdin

so instead of shuffling positions, we shuffle the labels of the appropriate column. Again, this can be useful for significance testing.

support both -incl and -excl in shuffle

https://groups.google.com/forum/?fromgroups=#!topic/bedtools-discuss/Zi7_tEF8QTE

Add -f option to window

Build fails on VectorOps, groupBy

Build complains about uint32_t. Need to include cstdint and compile with -std=c++0x.

Revision a228d5f

Zero Length features for coverageBed get enlarged

I have the impression, that coverageBed enlarges 0 length features e.g.
test.bed:
chr1 142142832 142142832 test 0 +
coverageBed -a test.bed -b test.bed
chr1 142142831 142142833 test 0 + 1 2 2 1.0000000

Blank BED files throw an error

bedtools apparently throws an error when trying to read empty files, and treats them as though they cannot be opened. When I try to intersect 2 blank files, this is the result:

Error: The requested bed file (blank.bed) could not be opened. Exiting!

I think it would be better to treat the file as containing no features. This has come up in IGV integration, where sometimes one of the temporary files being operated on is blank. The correct result would vary depending on the command so it's much harder to fix on the IGV end.

Incompatibility of clusterBed with BED-converted BAMs?

I never leave the house without Bedtools, and recently discovered the immensely useful feature clusterBed and am finding many applications for it. I'm having trouble applying it to BAM files, though, and wanted to see if there's a way around it or if this utility is incompatible with BAMs.

I have a BAM file generated by tagBam that I'd like to run clusterBed on. I convert it to BED with bamToBed and then sort it in preparation for clusterBed:

bamToBed -split -i mybam.bam | head -n 30 | sortBed -i -

That works great, but the fourth field of IDs is already populated by the conversion. As far as I can tell from the docs, clusterBed uses the fourth field to group the intervals into clusters since it does not merge them into meta-intervals. It looks like clusterBed cannot work with this output:

# BED-converted BAM
# (Subset of "bamToBed -split -i mybam.bam | head -n 30 | sortBed -i -")
$ cat test.bed
chr1    3197300 3197319 102919-4    1   +
chr1    3197300 3197319 302711-1    50  +
chr1    3197300 3197319 417919-1    3   +
chr1    3197300 3197319 578672-1    3   +
chr1    3197300 3197319 754316-1    3   +
chr1    3197301 3197319 560330-1    0   +
chr1    3197302 3197319 366543-1    0   +
chr1    3197307 3197323 125538-3    0   -
chr1    3197307 3197324 444150-1    1   -

Now clusterBed does not appear to correctly group the intervals:

$ clusterBed -i test.bed
chr1    3197300 3197319 102919-4    1   +   1
chr1    3197300 3197319 302711-1    50  +   1
chr1    3197300 3197319 417919-1    3   +   1
chr1    3197300 3197319 578672-1    3   +   1
chr1    3197300 3197319 754316-1    3   +   1
chr1    3197301 3197319 560330-1    0   +   1
chr1    3197302 3197319 366543-1    0   +   1
chr1    3197307 3197323 125538-3    0   -   1
chr1    3197307 3197324 444150-1    1   -   1

Is there a way around this? I'm basically trying to find a way to run clusterBed on BAM files, and am happy to convert them to BEDs in the process. Thanks.

fjoin fails with empty files

This tool is still in beta, yes? with empty files as inputs, the tool eats all my memory and and never stops.

Repro:
touch a.bed
touch b.bed
fjoin -a a.bed -b b.bed

Support "cut"-like column selection in groupby

From user:

Just feel it would be good if multiple columns indication can be also supported simply like 1-12 for columns 1 to 12, instead of writing like -g 1,2,3,4,5,6,7,8,9,10,11,12

coverageBed output positions wrong for bed input files (regression)

dont know what the official issue tracker is now.
identical to: http://code.google.com/p/bedtools/issues/detail?id=62

simple.bed:
chr1 50 55 name score +

What is the expected output?
/bedtools-2.10.1/bin/coverageBed -a simpleBed.bed -b simpleBed.bed
chr1 50 55 name score + 1 5 5 1.0000000

What do you see instead?
/BEDTools-Version-2.11.0/bin/coverageBed -a simpleBed.bed -b simpleBed.bed
chr1 51 54 name score + 1 5 5 1.0000000

What version of the product are you using? On what operating system?
linux 2.11.0

Please provide any additional information below.
No additional information.

output values from a given column in multiinter

Currently multiinter simply ouputs presense/absense.
It would be nice if it supported printing the value from a particular column (e.g. -c 5). to set value_column to 5 (actually 4).

I started this using PointWithDepth, but realized the getNextMergedBed() only returns chrom, start, end.

And that by using GetNextMergedBed(), it is merging intervals from each file beforehand.

I'd like to make it instead call GetNextBed() if c is specified and then use bed.fields[value_column] to create PointWithDepth() in AddInterval() and then use curr_point.depth as the current_value (instead of 1.

Does this seem like a reasonable approach or should I abandon and write a downstream tool to fill the values after calling multiinter as it is now?

coverageBed coverage/number of features discrepancy

Hello,

I am running coverageBed to determine coverage of a specific regions of a genome by RAD tag data (restriction-site associated DNA reduced representation sequences). I am interested in the coverage for specific regions on 5 chromosomes. The default coverageBed output for this sample (see below) shows that, for the region on chr1, there were 973 features in A that overlapped B. My 'features' are 75 bp long, so if the depth was one this would be a maximum total of 72,975 bp of B covered by A. However, the column for number of bases in B with non-zero coverage by features in A shows 124,757. Why the discrepancy? I would expect the number of bases with non-zero coverage to be less than 72,975.

coverageBed output for my sample:

chr1 12700001 21700000 973 124757 8999999 0.0138619
chr11 12800001 16700000 591 73123 3899999 0.0187495
chr5 4800001 6800000 293 37381 1999999 0.0186905
chr6 13036316 30875458 2514 322205 17839142 0.0180617
chr8 20500001 23600000 391 49935 3099999 0.0161081

My command is like this:

$ coverageBed -a sample.bed -b chr_regions.bed > sample_cov_output.tsv

sample.bed is created from a BAM file using this command:

$ bamToBed -i sample.bam > sample.bed

bedtools coverage should have a -sorted option

bedtools coverage is quite slow for reasonably size files -- at least with -abam.

It seems a common thing to want to do something like:

bedtools coverage -abam my.bam -counts \
             -b <(bedtools makewindows -g hg19.genome -w 1000000 -s 10000)

if we know both are sorted, this could improve the performance quite a bit.

bedtools slop "until end of chr" option

This is possibly a feature request, unless there is a better way around it:

After using bedtools slop, which is a great way of creating windows around features, I noticed that if the window goes beyond the beginning or end of a chr, it will report that coordinate as -1. This is specially common in genomes with lots of small scaffolds. The -1 value messes up with other bedtools commands later on. Would it be possible to have an "until end" option so that in those cases, the window goes until the end (or beginning) of the chr instead of giving value -1?

Thanks

BigWig and BigBed support

Add support for BigWig and BigBed.

commit 455c52e54a025279df28 causes regression in fFB range checking

Commit 455c52 causes a regression in behaviour for cases where the bed file specifies a range past the end of the fasta sequence.

In the previous version this situation would result in a warning of out of range. After 455c52, the overflow is ignored and the id and sequence from the following(?) sequence is returned.

Specify and ignore missing values in groupby

https://groups.google.com/forum/?fromgroups=#!topic/bedtools-discuss/2UzYJ7hMk_0

enhancement: allow flag to report multiple overlaps on same line in closestBed

when annotating large regions, it's common to have multiple genes with 0 distance.
those get created as multiple lines in the output. it would be nice to be able to get all
of them on the same line with a boolean flag like --collapse.

then instead of:

chr10 62635514 62636138 chr10:62633793-62636189 2.932e-07 18 chr10 62634711 62634800 uc009xpd.2 0 - 714
chr10 62635514 62636138 chr10:62633793-62636189 2.932e-07 18 chr10 62634711 62634800 uc001jli.2 0 - 714
chr10 62635514 62636138 chr10:62633793-62636189 2.932e-07 18 chr10 62634711 62634800 uc001jlh.2 0 - 714

it would give:

chr10 62635514 62636138 chr10:62633793-62636189 2.932e-07 18 chr10 62634711 62634800 uc009xpd.2,uc001jli.2,uc001jlh.2

warning/error in jaccard when attempting "stdin" as input

Since jaccard does the Union and the Intersection. The stdin stream is exhausted on the first one and empty on the second so jaccard returns incorrect results.
It isn't obvious why this happens from the command line.

shuffle -bedpe ignores the -chrom option.

empty bed files

bedtools exhibits incorrect behavior when the bed file is empty (i.e. no records), saying that the file cannot be opened. I've chased this down to the use of ios::good() after the file has been opened instead of ios::fail(). This occurs in 4 different places in the code.

I've fixed locally, but don't really want to download/fork the 500MB git repo to give you a pull request. Nonetheless, thought I'd let you know.

BED/GFF headers

Right now they have to start with '#' (comment) and are generally discarded from output (eg. subtractBed).

It would be nice if headers could be properly handled and printed to the output.
Maybe add an option that would not parse the first line and just print it accordingly.

bedtools git archive is very large due to old binaries and data files

There are a lot of rather large files that had at one time been committed to the git archive and subsequently deleted. Because git keeps track of these old objects, the archive is ~467 mb on despite the current files being less than about 6mb.

Some of the files and their uncompressed file sizes) that could be removed from the history include:

 1.3M   data/knownPhenotypes.bed                     (7110362: 1 year, 12 months ago)
 1.3M   src/intersectBed/chr1_snp129.bed             (f50878e: 2 years, 7 months ago)
 1.4M   src/intersectBed/new.out                     (f50878e: 2 years, 7 months ago)
 1.6M   BEDTools-User-Manual.pdf                     (b854c68: 1 year, 11 months ago)
 1.6M   src/pairToBed/test                           (620895f: 1 year, 12 months ago)
 1.7M   BEDToolsManual.v2.4.0.pdf                    (e7388c6: 1 year, 11 months ago)
 1.7M   data/DNA.lte200MilliDiv.mm9.bed              (7110362: 1 year, 12 months ago)
 1.9M   src/intersectBed/origTest                    (4034c4c: 2 years, 8 months ago)
 2.4M   src/pairToBed/neither.ids                    (323b6f7: 1 year, 12 months ago)
 2.6M   data/SINES.chr14.mm9.bed                     (7110362: 1 year, 12 months ago)
 2.6M   src/pairToBed/xor                            (323b6f7: 1 year, 12 months ago)
 4.4M   scripts/yeast.gtf                            (8e0db9f: 2 years, 1 month ago)
 4.6M   data/dnaMER1.mm9.bed                         (7110362: 1 year, 12 months ago)
 5.9M   data/microSatellites.mm9.track.bed           (7110362: 1 year, 12 months ago)
 6.9M   data/dnaTransposons.bed                      (7110362: 1 year, 12 months ago)
 7.8M   data/DNAS.mm9.bed                            (7110362: 1 year, 12 months ago)
10.4M   data/snps.hg18.chr21.v130.bed                (9a8611d: 1 year, 11 months ago)
11.7M   data/segDupsLiftedFromMM8.bed                (7110362: 1 year, 12 months ago)
16.7M   data/LTRS.lte200MilliDiv.mm9.bed             (7110362: 1 year, 12 months ago)
17.3M   src/intersectBed/hmec.sam.tier1.bam          (323b6f7: 1 year, 12 months ago)
18.9M   data/LINES.lte200MilliDiv.mm9.bed            (7110362: 1 year, 12 months ago)
19.0M   src/pairToBed/hmec.txt.pairs.bam             (323b6f7: 1 year, 12 months ago)
19.7M   src/pairToBed/simrep.hg18.txt                (323b6f7: 1 year, 12 months ago)
22.6M   data/SINES.lte200MilliDiv.mm9.bed            (7110362: 1 year, 12 months ago)
23.9M   src/b6j.test.bed                             (0a67b92: 2 years, 1 month ago)
30.1M   data/allSimpleRepeats.mm9.bed                (7110362: 1 year, 12 months ago)
35.7M   src/intersectBed/simplerepeats.mm8.bed       (932788f: 1 year, 12 months ago)
36.6M   data/simpleReps.mm9.bed                      (7110362: 1 year, 12 months ago)
39.3M   data/LTRS.mm9.bed                            (7110362: 1 year, 12 months ago)
41.9M   data/LINES.mm9.bed                           (7110362: 1 year, 12 months ago)
51.5M   src/intersectBed/old                         (932788f: 1 year, 12 months ago)
53.6M   src/pairToBed/neither                        (323b6f7: 1 year, 12 months ago)
56.6M   src/coverageBed/old                          (82b7108: 1 year, 12 months ago)
56.6M   src/coverageBed/old.sorted                   (82b7108: 1 year, 12 months ago)
59.9M   data/allTransposons.lte200MilliDiv.bed       (7110362: 1 year, 12 months ago)
65.9M   data/simpleRepeats.mm9.fromRepeatMasker.bed  (7110362: 1 year, 12 months ago)
66.9M   data/SINES.mm9.bed                           (7110362: 1 year, 12 months ago)
69.9M   src/intersectBed/hmec.merged.bam             (db3b2cd: 1 year, 12 months ago)
123.1M  data/allTransposons.bed                      (7110362: 1 year, 12 months ago)
138.3M  data/rnaTransposons.bed                      (7110362: 1 year, 12 months ago)
170.0M  scripts/mouse.gtf                            (8e0db9f: 2 years, 1 month ago)
242.4M  data/repeatMasker.mm9.bed                    (7110362: 1 year, 12 months ago)
278.3M  data/repMaskerAll.txt                        (7110362: 1 year, 12 months ago)

These files could be removed from the history following instructions at http://http://help.github.com/remove-sensitive-data/

The downside of doing this would be a need to rebase (or perhaps it would be easiest to move the old repository to a different name and make a new repository without the above files). The upside would be a much smaller git archive,

Report ERROR, then print HELP

Istvan Albert pointed out that when an error is encountered, bedtools currently prints the error following by the help. This forces the user to have to scroll up quote a bit to see the actual error.

A simple solution is to report the error after the help.