arq5x / bedtools Goto Github PK
View Code? Open in Web Editor NEWA powerful toolset for genome arithmetic.
Home Page: http://code.google.com/p/bedtools/
License: GNU General Public License v2.0
A powerful toolset for genome arithmetic.
Home Page: http://code.google.com/p/bedtools/
License: GNU General Public License v2.0
Hi Aaron -
I have some VCF files with a blank line after the #CHROM...
header line. BEDTools programs operating on these VCFs only report 2 columns (chrom, pos) rather than the entire line as expected.
Below is some troubleshooting that reproduces the problem. They're not full-spec VCFs with the whole header, but they do isolate the problem. I finally narrowed it down: truncation only happens when there is a "##", followed by some text on the same line, and a blank line before the data.
That is, here only a2.vcf
and a6.vcf
show the truncation. The others, (including a3.vcf
which has "##", no following characters, and a blank line) are fine.
Knowing this, I can now fix my files to get them to work, but I don't see anything in the VCF spec that disallows blank lines like that -- hence the bug report.
$ cat b.bed
chr10 3100000 3200000
for i in 0 1 2 3 4 5 6
do
fn=a$i.vcf
echo; echo "$fn original:"
cat $fn
echo; echo "intersected:"
bedtools intersect -a $fn -b b.bed; echo;
done
a0.vcf original:
chr10 3127949 T A 46.3 . FQ=-28.7;DP4=0,0,4,1;AC1=6.0;VDB=0.08892796;MQ=20;AF1=1.0;DP=5 GT:PL:GQ 1/1:0,0,0:4 1/1:0,0,0:4 1/1:79,15,0:17
intersected:
chr10 3127949 T A 46.3 . FQ=-28.7;DP4=0,0,4,1;AC1=6.0;VDB=0.08892796;MQ=20;AF1=1.0;DP=5 GT:PL:GQ 1/1:0,0,0:4 1/1:0,0,0:4 1/1:79,15,0:17
a1.vcf original:
##fileformat=VCFv4.1
chr10 3127949 T A 46.3 . FQ=-28.7;DP4=0,0,4,1;AC1=6.0;VDB=0.08892796;MQ=20;AF1=1.0;DP=5 GT:PL:GQ 1/1:0,0,0:4 1/1:0,0,0:4 1/1:79,15,0:17
intersected:
chr10 3127949 T A 46.3 . FQ=-28.7;DP4=0,0,4,1;AC1=6.0;VDB=0.08892796;MQ=20;AF1=1.0;DP=5 GT:PL:GQ 1/1:0,0,0:4 1/1:0,0,0:4 1/1:79,15,0:17
a2.vcf original:
##fileformat=VCFv4.1
chr10 3127949 T A 46.3 . FQ=-28.7;DP4=0,0,4,1;AC1=6.0;VDB=0.08892796;MQ=20;AF1=1.0;DP=5 GT:PL:GQ 1/1:0,0,0:4 1/1:0,0,0:4 1/1:79,15,0:17
intersected:
chr10 3127949
a3.vcf original:
##
chr10 3127949 T A 46.3 . FQ=-28.7;DP4=0,0,4,1;AC1=6.0;VDB=0.08892796;MQ=20;AF1=1.0;DP=5 GT:PL:GQ 1/1:0,0,0:4 1/1:0,0,0:4 1/1:79,15,0:17
intersected:
chr10 3127949 T A 46.3 . FQ=-28.7;DP4=0,0,4,1;AC1=6.0;VDB=0.08892796;MQ=20;AF1=1.0;DP=5 GT:PL:GQ 1/1:0,0,0:4 1/1:0,0,0:4 1/1:79,15,0:17
a4.vcf original:
#comment
chr10 3127949 T A 46.3 . FQ=-28.7;DP4=0,0,4,1;AC1=6.0;VDB=0.08892796;MQ=20;AF1=1.0;DP=5 GT:PL:GQ 1/1:0,0,0:4 1/1:0,0,0:4 1/1:79,15,0:17
intersected:
chr10 3127949 T A 46.3 . FQ=-28.7;DP4=0,0,4,1;AC1=6.0;VDB=0.08892796;MQ=20;AF1=1.0;DP=5 GT:PL:GQ 1/1:0,0,0:4 1/1:0,0,0:4 1/1:79,15,0:17
a5.vcf original:
##fileformat=VCFv4.1
#CHROM
chr10 3127949 T A 46.3 . FQ=-28.7;DP4=0,0,4,1;AC1=6.0;VDB=0.08892796;MQ=20;AF1=1.0;DP=5 GT:PL:GQ 1/1:0,0,0:4 1/1:0,0,0:4 1/1:79,15,0:17
intersected:
chr10 3127949 T A 46.3 . FQ=-28.7;DP4=0,0,4,1;AC1=6.0;VDB=0.08892796;MQ=20;AF1=1.0;DP=5 GT:PL:GQ 1/1:0,0,0:4 1/1:0,0,0:4 1/1:79,15,0:17
a6.vcf original:
##fileformat=VCFv4.1
#CHROM
chr10 3127949 T A 46.3 . FQ=-28.7;DP4=0,0,4,1;AC1=6.0;VDB=0.08892796;MQ=20;AF1=1.0;DP=5 GT:PL:GQ 1/1:0,0,0:4 1/1:0,0,0:4 1/1:79,15,0:17
intersected:
chr10 3127949
The following file is tolerated by GentNextBed(sorted=True). It is because the starts are checked but we don't keep track of which chroms have already been seen.
$ cat cluster.bed
chr1 10 20 a 1 +
chr1 25 30 b 2 +
chr1 25 35 c 3 +
chr1 27 33 d 4 -
chr2 27 31 e 5 -
chr1 50 80 f 6 +
chr1 51 81 g 7 -
# strandless clustering works fine:
$ clusterBed -i cluster.bed
chr1 10 20 a 1 + 1
chr1 25 30 b 2 + 2
chr1 25 35 c 3 + 2
chr1 27 33 d 4 - 2
chr2 27 31 e 5 - 3
chr1 50 80 f 6 + 4
chr1 51 81 g 7 - 4
I use mergeBed
frequently with the -nms
feature to list multiple transcripts that BED intervals belong to, e.g.:
chr1 3195981 3197398 ENSMUST00000162897;ENSMUST00000159265 -
chr1 3203519 3207049 ENSMUST00000159265;ENSMUST00000162897;ENSMUST00000070533 -
This lists the transcripts in ;
-separated list. When I call tagBam
against a set of BED files that have these semicolon separated lists, you can get results like:
merged_exons:chr10:3426454-3428987,ENSMUST00000078070;ENSMUST00000144622;ENSMUST00000145156;ENSMUST00000165510;ENSMUST00000154998;ENSMUST00000086896;ENSMUST00000105617;ENSMUST00000149708;ENSMUST00000170680,+,;introns:chr10:3323400-3434581,ENSMUST00000086896,1,+,introns:chr10:3366295-3445776,ENSMUST00000165510,1,+,introns:chr10:3374343-3438478,ENSMUST00000129954,1,-
Since tagBam
itself uses semicolons to delimit the intervals that a particular BAM read maps to, this makes it a pain to parse, since semicolons delineate the fields of the tagBam
optional entry in the BAM, but have additional meaning inside the fields as delimiters of the BED intervals that the reads were mapped against.
Is it possible to add an option to mergeBed
that would specify the delimiter? E.g. ;
instead of ,
? That would be useful and not require post-processing of each BED produced by mergeBed
to manually change the delimiter.
A perhaps more general alternative: it would be probably better to allow the user to specify the delimiter for tagBam
. That would also solve the problem.
Thanks.
The significance testing used by ENCODE as developed by Bickel et al (http://arxiv.org/pdf/1101.0947.pdf) and available here: http://www.encodestatistics.org/
(IIUC) seems to rely on shuffling within prescribed regions. Part of their method is defining those regions by segmentation, but BEDTools, could ease this type of investigation by allowing one to specify that intervals are only shuffled within the window from which they arise or that they are only shuffled to within a given distance.
The latter should be quite simple to implement, though more difficult within the current framework of bedtools shuffle.
An interface could be like:
bedtools window_shuffle -w windows.bed -i input.bed -g hg18.genome > local-shuffle.bed
with -w
exclusive with -d
for distance:
bedtools window_shuffle -d 50000 -i input.bed -g hg19.genome > local-shuffle.bed
the current version of bedtools map allows only a single column and op.
it would be nice to specify multiple columns and ops as in groupby.
it would also be nice to share code between these tools so that, ex, all the ops supported by groupby could also be supported by map.
Hi,
I'm trying to make a set of sliding windows using makewindows
from a GFF file. Example GFF entry:
chr2 ensGene three_prime_UTR 136907192 136908759 . - . gene_id=ENSMUSG00000027276;region_len=1568;Name=ENSMUST00000028735.utr25;Parent=ENSMUST00000028735;ID=ENSMUST00000028735.utr25
If I try to make 10-bp windows from this, I get a core dump:
$ bedtools makewindows -b myregion.gff -n 10
Yields:
$ bedtools makewindows -b myregion.gff -n 10
136907191 136907348
136907348 136907505
136907505 136907662
136907662 136907819
136907819 136907976
136907976 136908133
136908133 136908290
136908290 136908447
136908447 136908604
136908604 136908759
*** glibc detected *** bedtools: free(): invalid pointer: 0x00000000012beb10 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7eb96)[0x7fc8d903db96]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSs6assignERKSs+0x79)[0x7fc8d994c6b9]
bedtools[0x41f821]
bedtools[0x420fa5]
bedtools[0x41a389]
bedtools[0x4cb075]
bedtools[0x4c95c5]
bedtools[0x43790a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7fc8d8fe076d]
bedtools[0x406fb9]
======= Memory map: ========
00400000-0054c000 r-xp 00000000 00:1d 4759503922 /lab/solexa_jaenisch/solexa_jaenisch2/yarden/software/bedtools/bin/bedtools
0074b000-0074d000 r--p 0014b000 00:1d 4759503922 /lab/solexa_jaenisch/solexa_jaenisch2/yarden/software/bedtools/bin/bedtools
0074d000-0074e000 rw-p 0014d000 00:1d 4759503922 /lab/solexa_jaenisch/solexa_jaenisch2/yarden/software/bedtools/bin/bedtools
0074e000-0074f000 rw-p 00000000 00:00 0
012b6000-012d7000 rw-p 00000000 00:00 0 [heap]
7fc8d8fbf000-7fc8d9174000 r-xp 00000000 fc:00 937049 /lib/x86_64-linux-gnu/libc-2.15.so
7fc8d9174000-7fc8d9374000 ---p 001b5000 fc:00 937049 /lib/x86_64-linux-gnu/libc-2.15.so
7fc8d9374000-7fc8d9378000 r--p 001b5000 fc:00 937049 /lib/x86_64-linux-gnu/libc-2.15.so
7fc8d9378000-7fc8d937a000 rw-p 001b9000 fc:00 937049 /lib/x86_64-linux-gnu/libc-2.15.so
7fc8d937a000-7fc8d937f000 rw-p 00000000 00:00 0
7fc8d937f000-7fc8d9394000 r-xp 00000000 fc:00 917551 /lib/x86_64-linux-gnu/libgcc_s.so.1
7fc8d9394000-7fc8d9593000 ---p 00015000 fc:00 917551 /lib/x86_64-linux-gnu/libgcc_s.so.1
7fc8d9593000-7fc8d9594000 r--p 00014000 fc:00 917551 /lib/x86_64-linux-gnu/libgcc_s.so.1
7fc8d9594000-7fc8d9595000 rw-p 00015000 fc:00 917551 /lib/x86_64-linux-gnu/libgcc_s.so.1
7fc8d9595000-7fc8d9690000 r-xp 00000000 fc:00 937060 /lib/x86_64-linux-gnu/libm-2.15.so
7fc8d9690000-7fc8d988f000 ---p 000fb000 fc:00 937060 /lib/x86_64-linux-gnu/libm-2.15.so
7fc8d988f000-7fc8d9890000 r--p 000fa000 fc:00 937060 /lib/x86_64-linux-gnu/libm-2.15.so
7fc8d9890000-7fc8d9891000 rw-p 000fb000 fc:00 937060 /lib/x86_64-linux-gnu/libm-2.15.so
7fc8d9891000-7fc8d9977000 r-xp 00000000 fc:00 1967661 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.18
7fc8d9977000-7fc8d9b76000 ---p 000e6000 fc:00 1967661 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.18
7fc8d9b76000-7fc8d9b7e000 r--p 000e5000 fc:00 1967661 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.18
7fc8d9b7e000-7fc8d9b80000 rw-p 000ed000 fc:00 1967661 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.18
7fc8d9b80000-7fc8d9b95000 rw-p 00000000 00:00 0
7fc8d9b95000-7fc8d9bab000 r-xp 00000000 fc:00 917732 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
7fc8d9bab000-7fc8d9daa000 ---p 00016000 fc:00 917732 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
7fc8d9daa000-7fc8d9dab000 r--p 00015000 fc:00 917732 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
7fc8d9dab000-7fc8d9dac000 rw-p 00016000 fc:00 917732 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
7fc8d9dac000-7fc8d9dce000 r-xp 00000000 fc:00 937061 /lib/x86_64-linux-gnu/ld-2.15.so
7fc8d9fa3000-7fc8d9fa8000 rw-p 00000000 00:00 0
7fc8d9fca000-7fc8d9fce000 rw-p 00000000 00:00 0
7fc8d9fce000-7fc8d9fcf000 r--p 00022000 fc:00 937061 /lib/x86_64-linux-gnu/ld-2.15.so
7fc8d9fcf000-7fc8d9fd1000 rw-p 00023000 fc:00 937061 /lib/x86_64-linux-gnu/ld-2.15.so
7fff7a4bd000-7fff7a4de000 rw-p 00000000 00:00 0 [stack]
7fff7a5f3000-7fff7a5f4000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Aborted (core dumped)
My bedtools version is:
$ bedtools --version
bedtools v2.17.0-33-g2a16ea9
Any ideas on this? It's a super helpful feature based on docs so I'd really love to use it.
Thanks, --Yarden
There are three object files that appear to have been inadvertently checked in:
src/utils/gzstream/gzstream.o
src/utils/gzstream/test_gunzip.o
src/utils/gzstream/test_gzip.o
No big deal as the top-level makefile dependencies call for obj/gzstream.o instead, but it would be as well not to have Mac OS object files in the source distribution...
I cant seem to get correct output from coverageBed when providing the bam file directly:
bedfile.bed (51,011 lines)
...
Example1Contig 912832681 912833713 MyFeature
...
using samtools view on bamfile.bam shows no reads in this interval:
$ samtools view bamfile.bam Example1Contig:912832681-912833713
$
Piping this directly into coverageBed reports zero:
$ samtools view bamfile.bam Example1Contig:912832681-912833713 |
coverageBed -abam stdin -b bedfile.bed | grep MyFeature
Example1Contig 912832681 912833713 MyFeature 0 0 1032 0.0000000
As well as piping the entire bam file into coverageBed:
$ samtools view bamfile.bam | coverageBed -abam stdin -b bedfile.bed | grep MyFeature
Example1Contig 912832681 912833713 MyFeature 0 0 1032 0.0000000
BUT, using the -abam <file>
option, a totally different count is registered:
$ coverageBed -abam bamfile.bam -b bedfile.bed | grep MyFeature
Example1Contig 912832681 912833713 MyFeature 1355 1032 1032 1.0000000
Any ideas as to what is going on (or what I missed here!) are appreciated. Thanks,
Nik
Hi,
I run the genomecov using the commond line as below:
$ bedtools genomecov -ibam Pp-BWAmemNCBI-90Grmdupl.bam -d > coverage.log
The input bam was generation by using BWA-mem for mapping, picard for sorting and duplicate removing.
Bedtools was runnig for several hours, but after analyzing about 2 million sites (out of 1 billion site), it was corrupted.
I get the error message as below:
*** glibc detected *** bedtools: double free or corruption (out): 0x0000000002597fc0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7eb96)[0x7faa43accb96]
bedtools[0x479f57]
bedtools[0x47a42d]
bedtools[0x47b7ba]
bedtools[0x47e909]
bedtools[0x406bd5]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7faa43a6f76d]
bedtools[0x40aff9]
======= Memory map: ========
00400000-00573000 r-xp 00000000 08:02 44317440 /home/joinet/bin/bedtools
00772000-00774000 r--p 00172000 08:02 44317440 /home/joinet/bin/bedtools
00774000-00775000 rw-p 00174000 08:02 44317440 /home/joinet/bin/bedtools
00775000-00776000 rw-p 00000000 00:00 0
01fe1000-0436e000 rw-p 00000000 00:00 0 [heap]
7faa438fa000-7faa439bc000 rw-p 00000000 00:00 0
7faa43a4e000-7faa43c03000 r-xp 00000000 08:02 41943289 /lib/x86_64-linux-gnu/libc-2.15.so
7faa43c03000-7faa43e02000 ---p 001b5000 08:02 41943289 /lib/x86_64-linux-gnu/libc-2.15.so
7faa43e02000-7faa43e06000 r--p 001b4000 08:02 41943289 /lib/x86_64-linux-gnu/libc-2.15.so
7faa43e06000-7faa43e08000 rw-p 001b8000 08:02 41943289 /lib/x86_64-linux-gnu/libc-2.15.so
7faa43e08000-7faa43e0d000 rw-p 00000000 00:00 0
7faa43e0d000-7faa43e22000 r-xp 00000000 08:02 41946637 /lib/x86_64-linux-gnu/libgcc_s.so.1
7faa43e22000-7faa44021000 ---p 00015000 08:02 41946637 /lib/x86_64-linux-gnu/libgcc_s.so.1
7faa44021000-7faa44022000 r--p 00014000 08:02 41946637 /lib/x86_64-linux-gnu/libgcc_s.so.1
7faa44022000-7faa44023000 rw-p 00015000 08:02 41946637 /lib/x86_64-linux-gnu/libgcc_s.so.1
7faa44023000-7faa4411e000 r-xp 00000000 08:02 41943305 /lib/x86_64-linux-gnu/libm-2.15.so
7faa4411e000-7faa4431d000 ---p 000fb000 08:02 41943305 /lib/x86_64-linux-gnu/libm-2.15.so
7faa4431d000-7faa4431e000 r--p 000fa000 08:02 41943305 /lib/x86_64-linux-gnu/libm-2.15.so
7faa4431e000-7faa4431f000 rw-p 000fb000 08:02 41943305 /lib/x86_64-linux-gnu/libm-2.15.so
7faa4431f000-7faa44401000 r-xp 00000000 08:02 16260627 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
7faa44401000-7faa44600000 ---p 000e2000 08:02 16260627 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
7faa44600000-7faa44608000 r--p 000e1000 08:02 16260627 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
7faa44608000-7faa4460a000 rw-p 000e9000 08:02 16260627 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
7faa4460a000-7faa4461f000 rw-p 00000000 00:00 0
7faa4461f000-7faa44635000 r-xp 00000000 08:02 41946727 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
7faa44635000-7faa44834000 ---p 00016000 08:02 41946727 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
7faa44834000-7faa44835000 r--p 00015000 08:02 41946727 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
7faa44835000-7faa44836000 rw-p 00016000 08:02 41946727 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
7faa44836000-7faa44858000 r-xp 00000000 08:02 41943317 /lib/x86_64-linux-gnu/ld-2.15.so
7faa44a3c000-7faa44a41000 rw-p 00000000 00:00 0
7faa44a54000-7faa44a58000 rw-p 00000000 00:00 0
7faa44a58000-7faa44a59000 r--p 00022000 08:02 41943317 /lib/x86_64-linux-gnu/ld-2.15.so
7faa44a59000-7faa44a5b000 rw-p 00023000 08:02 41943317 /lib/x86_64-linux-gnu/ld-2.15.so
7fff7dad1000-7fff7daf2000 rw-p 00000000 00:00 0 [stack]
7fff7dbc8000-7fff7dbc9000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
core dumped
What could be the problem?
Thank you,
Chih
I have been using the bedtools getfasta
command with a gff file as the argument to -bed
. The gff file contains the ##FASTA
section and when I run the bedtools
command I get the following message:
It looks as though you have less than 3 columns at line: 4830. Are you sure your files are tab-delimited?
Line 4830 of the gff file is inside the fasta section of the gff file
It is not clear to users that -abam is a replacement for -a. It should be made more clear.''
https://groups.google.com/forum/?hl=en&fromgroups=#!topic/bedtools-discuss/AeHMIq4rKRQ
When filename given to tagBam
as -files
argument is too long, the program completes but terminates with a Segmentation fault. Test case files available here: tinyurl.com/budlfva
# Long input filename causes segfault "to_be_or_not_to_be_that_is_the_question_whether_tis_nobler_in_the_mind_to_suffer.gff" makes segfault:
$ tagBam -i ./accepted_hits.ribosub.sorted.bam -files ./to_be_or_not_to_be_that_is_the_question_whether_tis_nobler_in_the_mind_to_suffer.gff -labels gff -intervals -f 1 > /dev/null
/home/yarden/jaen/software/BEDTools-Version-2.15.0/bin/tagBam: line 2: 17451 Segmentation fault bedtools tag "$@"
If the identical input GFF file is renamed to a shorter filename hamlet.gff
, the segfault goes away:
# Identical file with short filename "hamlet.gff" does not segfault
$ cp ./to_be_or_not_to_be_that_is_the_question_whether_tis_nobler_in_the_mind_to_suffer.gff hamlet.gff
$ tagBam -i ./accepted_hits.ribosub.sorted.bam -files ./hamlet.gff -labels gff -intervals -f 1 > /dev/null
$
The behavior occurs even if the filename is long due to its path length (not just basename length), so an input like: /my/long/path/to/gff/...
leads to a segfault as well.
with an genome file like this:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
"select chrom, size from hg18.chromInfo" > hg18.genome
an exclusion file like this (excl.bed):
chr1 121100000 124300000 p11.1 acen
chr1 124300000 128000000 q11 acen
chr2 91000000 93300000 p11.1 acen
chr2 93300000 95700000 q11.1 acen
chr3 89400000 91700000 p11.1 acen
chr3 91700000 93200000 q11.1 acen
chr4 48700000 50700000 p11 acen
chr4 50700000 52400000 q11 acen
chr5 45800000 47700000 p11 acen
chr5 47700000 50500000 q11.1 acen
chr6 58400000 60500000 p11.1 acen
chr6 60500000 63400000 q11.1 acen
chr7 57400000 59100000 p11.1 acen
chr7 59100000 61100000 q11.1 acen
chr8 43200000 45200000 p11.1 acen
chr8 45200000 48100000 q11.1 acen
chr9 46700000 51800000 p11.1 acen
chr9 51800000 60300000 q11 acen
chr10 38800000 40300000 p11.1 acen
chr10 40300000 42100000 q11.1 acen
chr11 51400000 52900000 p11.11 acen
chr11 52900000 56400000 q11 acen
chr12 33200000 35400000 p11.1 acen
chr12 35400000 36500000 q11 acen
chr13 13500000 16000000 p11.1 acen
chr13 16000000 18400000 q11 acen
chr14 13600000 15600000 p11.1 acen
chr14 15600000 19100000 q11.1 acen
chr15 14100000 17000000 p11.1 acen
chr15 17000000 18400000 q11.1 acen
chr16 34400000 38200000 p11.1 acen
chr16 38200000 40700000 q11.1 acen
chr17 22100000 22200000 p11.1 acen
chr17 22200000 23200000 q11.1 acen
chr18 15400000 16100000 p11.1 acen
chr18 16100000 17300000 q11.1 acen
chr19 26700000 28500000 p11 acen
chr19 28500000 30200000 q11 acen
chr20 25700000 27100000 p11.1 acen
chr20 27100000 28400000 q11.1 acen
chr21 10000000 12300000 p11.1 acen
chr21 12300000 13200000 q11.1 acen
chr22 11800000 16300000 q11.1 acen
chrX 56600000 59500000 p11.1 acen
chrX 59500000 65000000 q11.1 acen
and an input.bed like this:
chr2 237373226 237378925 0.2431
chr20 6976859 6982547 0.1692
chr20 48108756 48144977 0.1483
chr20 52888740 52894852 0.1443
chr21 19962126 19965043 0.1793
chr3 38411 6057552 0.1728
chr3 6085764 6088062 0.389
chr3 6093366 172418635 0.1707
chr4 20500517 20586746 0.2058
chr4 25339644 25378233 0.1938
the command
bedtools shuffle -excl excl.bed -i input.bed -g hg18.genome
freezes.
Most of the time that I want to use subtractBed, the b file is very big (e.g.: 2 GiB: 66 million lines) as it contains all known snp positions.
This command is then very slow (file a contains only 10000 positions) and sometimes is unable to run as there is not enough memory to load the b file in memory:
subtractBed -a sample_snps.bed -b allknownsnps.bed
This is the error I get when I run it on a machine with 32GiB of RAM:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
When running the command with strace:
strace -o subtractBed.strace subtractBed -a sample_snps.bed -b allknownsnps.bed
This is the end of the log:
read(3, "17\t67979430\t67979431\tG\tA\tPASS\nch"..., 8191) = 8191
brk(0x6a65bf000) = 0x6a65bf000
read(3, "7985837\tG\tC\tMinDP\nchr17\t67985847"..., 8191) = 8191
brk(0x6a65e7000) = 0x6a65e7000
read(3, "\tA\tG\tPASS\nchr17\t67992208\t6799220"..., 8191) = 8191
read(3, "\nchr17\t67999172\t67999173\tT\tG\tPAS"..., 8191) = 8191
brk(0x6a660b000) = 0x6a65e7000
mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
write(2, "terminate called after throwing "..., 48) = 48
write(2, "std::bad_alloc", 14) = 14
write(2, "'\n", 2) = 2
write(2, " what(): ", 11) = 11
write(2, "std::bad_alloc", 14) = 14
write(2, "\n", 1) = 1
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
gettid() = 26915
tgkill(26915, 26915, SIGABRT) = 0
--- SIGABRT (Aborted) @ 0 (0) ---
+++ killed by SIGABRT (core dumped) +++
# Line number lose to where the PC runs out of memory: 58469306
$ grep -n -m1 '^chr17'$'\t''67999172' allknownsnps.bed
58469306:chr17 67999172 67999173 T G PASS
# Total number of lines in the file: 66007044
$ wc -l allknownsnps.bed
66007044
It would be nice to use subtractBed like this:
subtractBed -loada -a sample_snps.bed -b allknownsnps.bed
Where -loada loads file a in memory and b from disk.
So if file b is read line by line, it just needs to remove all entries from file a (that is loaded in memory) that are found in file b.
I know it is possible to mimic this behaviour with, the following, but it would be much easier if it was implemented in subtractBed directly:
intersectBed -wb -a allknownsnps.bed -b sample_snps.bed | subtractBed -a sample_snps.bed -b stdin
This will benefit the IGV/bedtools integration.
What steps will reproduce the problem?
What is the expected output? What do you see instead?
I expect counts of the number of entries in file B that overlap with file A
and where the entire faction of entry in file A is overlaped by an entry in File B
I get instead an error says *ERROR: Unrecognized parameter: -f
This indicated that the file of operation I am trying to achieve is not supported... However, if is documented in the usage for the tool under the -c Argument:
-c For each entry in A, report the number of overlaps with B.
- Reports 0 for A entries that have no overlap with B.
- Overlaps restricted by -f.
What version of the product are you using? On what operating system?
bedtools v2.16.2, Linux Debian
shift
should move an interval left or right by either a fixed number of bases or a specific fraction of a chromosome.
Instead of providing genome (-g) files, have chrom_sizes for several widely-used genomes burned into the tool.
Obvious issue is chromosome naming schemes. Could have:
'hg19' [chr1, chr2, ...]
'hg19-int' [1, 2, ...]
Possibly dangerous however, as the advantage of the -g file approach is that the user KNOWS what they are doing, whereas under this model, results could be wrong owing to unknown differences in the chrom naming schemes (e.g., the 7,459 ways that the mitochondria is named)
Like bit flipping, this would flip the strand of every interval.
https://groups.google.com/forum/?fromgroups=#!topic/bedtools-discuss/_Pinr6dLG1E
While closest doesn't do this in one fell swoop, it should be able to do it in two steps via -D, -id, and -iu. However, I cannot get a toy example to work the way it should. It suggests a bug in the -id and -iu options.
Self explanatory. Use curlib? boost::asio? We would prefer to not use boost::asio in order to minimize external dependencies.
Is there a stable C++ library for reading remote files?
This would be useful for stuff related to significance of overlap (along with shuffle).
this would have a command-line interface like:
bedtools sample -i stdin -n 500 > out.500.bed
This could be done with sort -R or nl, or something, but those tools are not on most systems that I work with.
This is very easy to implement with reservoir sampling, here;s a python version mostly taken from wikipedia:
import random
lines = []
for i, line in enumerate(bed):
if i < n:
lines.append(line)
else:
replace_idx = random.randint(0, i)
if replace_idx < n:
lines[replace_idx] = line
print "".join(lines),
Below may be a completely separate tool, but something that would be useful to add to a suite like this in bedtools....
it may also be nice to do shuffling of labels, e.g.
bedtools label_shuffle -col 4 -i stdin
so instead of shuffling positions, we shuffle the labels of the appropriate column. Again, this can be useful for significance testing.
Build complains about uint32_t. Need to include cstdint and compile with -std=c++0x.
Revision a228d5f
I have the impression, that coverageBed enlarges 0 length features e.g.
test.bed:
chr1 142142832 142142832 test 0 +
coverageBed -a test.bed -b test.bed
chr1 142142831 142142833 test 0 + 1 2 2 1.0000000
bedtools apparently throws an error when trying to read empty files, and treats them as though they cannot be opened. When I try to intersect 2 blank files, this is the result:
Error: The requested bed file (blank.bed) could not be opened. Exiting!
I think it would be better to treat the file as containing no features. This has come up in IGV integration, where sometimes one of the temporary files being operated on is blank. The correct result would vary depending on the command so it's much harder to fix on the IGV end.
I never leave the house without Bedtools, and recently discovered the immensely useful feature clusterBed
and am finding many applications for it. I'm having trouble applying it to BAM files, though, and wanted to see if there's a way around it or if this utility is incompatible with BAMs.
I have a BAM file generated by tagBam
that I'd like to run clusterBed
on. I convert it to BED with bamToBed
and then sort it in preparation for clusterBed
:
bamToBed -split -i mybam.bam | head -n 30 | sortBed -i -
That works great, but the fourth field of IDs is already populated by the conversion. As far as I can tell from the docs, clusterBed
uses the fourth field to group the intervals into clusters since it does not merge them into meta-intervals. It looks like clusterBed
cannot work with this output:
# BED-converted BAM
# (Subset of "bamToBed -split -i mybam.bam | head -n 30 | sortBed -i -")
$ cat test.bed
chr1 3197300 3197319 102919-4 1 +
chr1 3197300 3197319 302711-1 50 +
chr1 3197300 3197319 417919-1 3 +
chr1 3197300 3197319 578672-1 3 +
chr1 3197300 3197319 754316-1 3 +
chr1 3197301 3197319 560330-1 0 +
chr1 3197302 3197319 366543-1 0 +
chr1 3197307 3197323 125538-3 0 -
chr1 3197307 3197324 444150-1 1 -
Now clusterBed
does not appear to correctly group the intervals:
$ clusterBed -i test.bed
chr1 3197300 3197319 102919-4 1 + 1
chr1 3197300 3197319 302711-1 50 + 1
chr1 3197300 3197319 417919-1 3 + 1
chr1 3197300 3197319 578672-1 3 + 1
chr1 3197300 3197319 754316-1 3 + 1
chr1 3197301 3197319 560330-1 0 + 1
chr1 3197302 3197319 366543-1 0 + 1
chr1 3197307 3197323 125538-3 0 - 1
chr1 3197307 3197324 444150-1 1 - 1
Is there a way around this? I'm basically trying to find a way to run clusterBed
on BAM files, and am happy to convert them to BEDs in the process. Thanks.
This tool is still in beta, yes? with empty files as inputs, the tool eats all my memory and and never stops.
Repro:
touch a.bed
touch b.bed
fjoin -a a.bed -b b.bed
From user:
Just feel it would be good if multiple columns indication can be also supported simply like 1-12 for columns 1 to 12, instead of writing like -g 1,2,3,4,5,6,7,8,9,10,11,12
dont know what the official issue tracker is now.
identical to: http://code.google.com/p/bedtools/issues/detail?id=62
simple.bed:
chr1 50 55 name score +
What is the expected output?
/bedtools-2.10.1/bin/coverageBed -a simpleBed.bed -b simpleBed.bed
chr1 50 55 name score + 1 5 5 1.0000000
What do you see instead?
/BEDTools-Version-2.11.0/bin/coverageBed -a simpleBed.bed -b simpleBed.bed
chr1 51 54 name score + 1 5 5 1.0000000
What version of the product are you using? On what operating system?
linux 2.11.0
Please provide any additional information below.
No additional information.
Currently multiinter simply ouputs presense/absense.
It would be nice if it supported printing the value from a particular column (e.g. -c 5). to set value_column to 5 (actually 4).
I started this using PointWithDepth, but realized the getNextMergedBed() only returns chrom, start, end.
And that by using GetNextMergedBed(), it is merging intervals from each file beforehand.
I'd like to make it instead call GetNextBed() if c is specified and then use bed.fields[value_column] to create PointWithDepth() in AddInterval() and then use curr_point.depth as the current_value (instead of 1.
Does this seem like a reasonable approach or should I abandon and write a downstream tool to fill the values after calling multiinter as it is now?
Hello,
I am running coverageBed to determine coverage of a specific regions of a genome by RAD tag data (restriction-site associated DNA reduced representation sequences). I am interested in the coverage for specific regions on 5 chromosomes. The default coverageBed output for this sample (see below) shows that, for the region on chr1, there were 973 features in A that overlapped B. My 'features' are 75 bp long, so if the depth was one this would be a maximum total of 72,975 bp of B covered by A. However, the column for number of bases in B with non-zero coverage by features in A shows 124,757. Why the discrepancy? I would expect the number of bases with non-zero coverage to be less than 72,975.
coverageBed output for my sample:
chr1 12700001 21700000 973 124757 8999999 0.0138619
chr11 12800001 16700000 591 73123 3899999 0.0187495
chr5 4800001 6800000 293 37381 1999999 0.0186905
chr6 13036316 30875458 2514 322205 17839142 0.0180617
chr8 20500001 23600000 391 49935 3099999 0.0161081
My command is like this:
$ coverageBed -a sample.bed -b chr_regions.bed > sample_cov_output.tsv
sample.bed is created from a BAM file using this command:
$ bamToBed -i sample.bam > sample.bed
bedtools coverage is quite slow for reasonably size files -- at least with -abam.
It seems a common thing to want to do something like:
bedtools coverage -abam my.bam -counts \
-b <(bedtools makewindows -g hg19.genome -w 1000000 -s 10000)
if we know both are sorted, this could improve the performance quite a bit.
This is possibly a feature request, unless there is a better way around it:
After using bedtools slop, which is a great way of creating windows around features, I noticed that if the window goes beyond the beginning or end of a chr, it will report that coordinate as -1. This is specially common in genomes with lots of small scaffolds. The -1 value messes up with other bedtools commands later on. Would it be possible to have an "until end" option so that in those cases, the window goes until the end (or beginning) of the chr instead of giving value -1?
Thanks
Add support for BigWig and BigBed.
Commit 455c52 causes a regression in behaviour for cases where the bed file specifies a range past the end of the fasta sequence.
In the previous version this situation would result in a warning of out of range. After 455c52, the overflow is ignored and the id and sequence from the following(?) sequence is returned.
when annotating large regions, it's common to have multiple genes with 0 distance.
those get created as multiple lines in the output. it would be nice to be able to get all
of them on the same line with a boolean flag like --collapse.
then instead of:
chr10 62635514 62636138 chr10:62633793-62636189 2.932e-07 18 chr10 62634711 62634800 uc009xpd.2 0 - 714
chr10 62635514 62636138 chr10:62633793-62636189 2.932e-07 18 chr10 62634711 62634800 uc001jli.2 0 - 714
chr10 62635514 62636138 chr10:62633793-62636189 2.932e-07 18 chr10 62634711 62634800 uc001jlh.2 0 - 714
it would give:
chr10 62635514 62636138 chr10:62633793-62636189 2.932e-07 18 chr10 62634711 62634800 uc009xpd.2,uc001jli.2,uc001jlh.2
Since jaccard does the Union and the Intersection. The stdin stream is exhausted on the first one and empty on the second so jaccard returns incorrect results.
It isn't obvious why this happens from the command line.
bedtools exhibits incorrect behavior when the bed file is empty (i.e. no records), saying that the file cannot be opened. I've chased this down to the use of ios::good() after the file has been opened instead of ios::fail(). This occurs in 4 different places in the code.
I've fixed locally, but don't really want to download/fork the 500MB git repo to give you a pull request. Nonetheless, thought I'd let you know.
Right now they have to start with '#' (comment) and are generally discarded from output (eg. subtractBed).
It would be nice if headers could be properly handled and printed to the output.
Maybe add an option that would not parse the first line and just print it accordingly.
There are a lot of rather large files that had at one time been committed to the git archive and subsequently deleted. Because git keeps track of these old objects, the archive is ~467 mb on despite the current files being less than about 6mb.
Some of the files and their uncompressed file sizes) that could be removed from the history include:
1.3M data/knownPhenotypes.bed (7110362: 1 year, 12 months ago)
1.3M src/intersectBed/chr1_snp129.bed (f50878e: 2 years, 7 months ago)
1.4M src/intersectBed/new.out (f50878e: 2 years, 7 months ago)
1.6M BEDTools-User-Manual.pdf (b854c68: 1 year, 11 months ago)
1.6M src/pairToBed/test (620895f: 1 year, 12 months ago)
1.7M BEDToolsManual.v2.4.0.pdf (e7388c6: 1 year, 11 months ago)
1.7M data/DNA.lte200MilliDiv.mm9.bed (7110362: 1 year, 12 months ago)
1.9M src/intersectBed/origTest (4034c4c: 2 years, 8 months ago)
2.4M src/pairToBed/neither.ids (323b6f7: 1 year, 12 months ago)
2.6M data/SINES.chr14.mm9.bed (7110362: 1 year, 12 months ago)
2.6M src/pairToBed/xor (323b6f7: 1 year, 12 months ago)
4.4M scripts/yeast.gtf (8e0db9f: 2 years, 1 month ago)
4.6M data/dnaMER1.mm9.bed (7110362: 1 year, 12 months ago)
5.9M data/microSatellites.mm9.track.bed (7110362: 1 year, 12 months ago)
6.9M data/dnaTransposons.bed (7110362: 1 year, 12 months ago)
7.8M data/DNAS.mm9.bed (7110362: 1 year, 12 months ago)
10.4M data/snps.hg18.chr21.v130.bed (9a8611d: 1 year, 11 months ago)
11.7M data/segDupsLiftedFromMM8.bed (7110362: 1 year, 12 months ago)
16.7M data/LTRS.lte200MilliDiv.mm9.bed (7110362: 1 year, 12 months ago)
17.3M src/intersectBed/hmec.sam.tier1.bam (323b6f7: 1 year, 12 months ago)
18.9M data/LINES.lte200MilliDiv.mm9.bed (7110362: 1 year, 12 months ago)
19.0M src/pairToBed/hmec.txt.pairs.bam (323b6f7: 1 year, 12 months ago)
19.7M src/pairToBed/simrep.hg18.txt (323b6f7: 1 year, 12 months ago)
22.6M data/SINES.lte200MilliDiv.mm9.bed (7110362: 1 year, 12 months ago)
23.9M src/b6j.test.bed (0a67b92: 2 years, 1 month ago)
30.1M data/allSimpleRepeats.mm9.bed (7110362: 1 year, 12 months ago)
35.7M src/intersectBed/simplerepeats.mm8.bed (932788f: 1 year, 12 months ago)
36.6M data/simpleReps.mm9.bed (7110362: 1 year, 12 months ago)
39.3M data/LTRS.mm9.bed (7110362: 1 year, 12 months ago)
41.9M data/LINES.mm9.bed (7110362: 1 year, 12 months ago)
51.5M src/intersectBed/old (932788f: 1 year, 12 months ago)
53.6M src/pairToBed/neither (323b6f7: 1 year, 12 months ago)
56.6M src/coverageBed/old (82b7108: 1 year, 12 months ago)
56.6M src/coverageBed/old.sorted (82b7108: 1 year, 12 months ago)
59.9M data/allTransposons.lte200MilliDiv.bed (7110362: 1 year, 12 months ago)
65.9M data/simpleRepeats.mm9.fromRepeatMasker.bed (7110362: 1 year, 12 months ago)
66.9M data/SINES.mm9.bed (7110362: 1 year, 12 months ago)
69.9M src/intersectBed/hmec.merged.bam (db3b2cd: 1 year, 12 months ago)
123.1M data/allTransposons.bed (7110362: 1 year, 12 months ago)
138.3M data/rnaTransposons.bed (7110362: 1 year, 12 months ago)
170.0M scripts/mouse.gtf (8e0db9f: 2 years, 1 month ago)
242.4M data/repeatMasker.mm9.bed (7110362: 1 year, 12 months ago)
278.3M data/repMaskerAll.txt (7110362: 1 year, 12 months ago)
These files could be removed from the history following instructions at http://http://help.github.com/remove-sensitive-data/
The downside of doing this would be a need to rebase (or perhaps it would be easiest to move the old repository to a different name and make a new repository without the above files). The upside would be a much smaller git archive,
Istvan Albert pointed out that when an error is encountered, bedtools currently prints the error following by the help. This forces the user to have to scroll up quote a bit to see the actual error.
A simple solution is to report the error after the help.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.