hall-lab / speedseq
A flexible framework for rapid genome analysis and interpretation
License: MIT License
I think the -g feature might be broken.
This will greatly reduce memory requirements in merging parallel runs of freebayes
annotate the read depth of intrachromosomal BND variants
speedseq somatic -v fails with following output:
Calling somatic variants...
create temporary directory
/opt/speedseq//bin/freebayes -f /opt/vcp_files/human_g1k_v37.fasta \
--pooled-discrete \
--min-repeat-entropy 1 \
--genotype-qualities \
--min-alternate-fraction 0.05 \
--min-alternate-count 2 \
--region $chrom:$start..$end \
/bamdir/126_normal.bam /bamdir/126_1.bam \
| somatic_filter 10 18 0 \
> /workdir/126_1.$chrom:$start..$end.vcf
cat /workdir/var_command.txt | /opt/speedseq//bin/parallel -j 32
grep "^##" /workdir/126_1.MT:12136..12498.vcf \
| cat - <(echo '##INFO=<ID=SSC,Number=1,Type=Float,Description="Somatic score">') <(grep "^#CHROM" /workdir/126_1.MT:12136..12498.vcf) > /workdir/header.txt
cat /workdir/126_1."$chrom:$start..$end".vcf | grep -v "^#" \
| sort -k1,1 -k2,2n | cat /workdir/header.txt - \
| /opt/speedseq//bin/bgzip -c > /workdir/126_1.vcf.gz
/opt/speedseq//bin/tabix -f -p vcf /workdir/126_1.vcf.gz
[ti_index_core] the chromosome blocks not continuous at line 1567, is the file sorted? [pos12247310]
The VCF file is indeed not correctly sorted. The problem seems to be related to locale settings:
$ env | grep LANG
LANG=en_US.UTF-8
$ sort --help
<snip>
*** WARNING ***
The locale specified by the environment affects sort order.
Set LC_ALL=C to get the traditional sort order that uses
native byte values.
<snip>
I suggest setting LC_ALL=C in SpeedSeq to avoid issues with different locale settings.
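To illustrate the suggestion, here is a minimal Python sketch of the ordering that `LC_ALL=C sort -k1,1 -k2,2n` is meant to produce: chromosome names compared by raw byte value, positions compared numerically. The helper names `c_locale_env` and `sort_vcf_records` are hypothetical and not part of SpeedSeq; the sample records are made up for the example.

```python
import os


def c_locale_env():
    """Copy of the current environment with LC_ALL=C, for passing to subprocesses."""
    env = dict(os.environ)
    env["LC_ALL"] = "C"
    return env


def sort_vcf_records(lines):
    """Sort VCF body lines by chromosome (byte order), then by position (numeric)."""
    def key(line):
        fields = line.split("\t")
        # Encoding to bytes makes the comparison independent of the locale.
        return (fields[0].encode("utf-8"), int(fields[1]))
    return sorted(lines, key=key)


# Made-up records: CHROM, POS, ID, REF, ALT
records = [
    "MT\t12247\t.\tA\tG",
    "1\t500\t.\tC\tT",
    "MT\t310\t.\tG\tA",
    "1\t42\t.\tT\tC",
]
sorted_records = sort_vcf_records(records)
```

The environment returned by `c_locale_env()` could be passed as `env=` to `subprocess` calls that shell out to `sort`, so the result no longer depends on the caller's `LANG`.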
Hi,
There are some problems in the installation of speedseq.
The error log,
$ make
make align
make[1]: Entering directory `/faststorage/home/siyang/USER/yeweijian/PipelineTest/Speedseq/speedseq'
make -C src/bwa
make[2]: Entering directory `/faststorage/home/siyang/USER/yeweijian/PipelineTest/Speedseq/speedseq/src/bwa'
make[2]: Nothing to be done for `all'.
make[2]: Leaving directory `/faststorage/home/siyang/USER/yeweijian/PipelineTest/Speedseq/speedseq/src/bwa'
cp src/bwa/bwa bin
cp src/sambamba bin
make -C src/samblaster
make[2]: Entering directory `/faststorage/home/siyang/USER/yeweijian/PipelineTest/Speedseq/speedseq/src/samblaster'
make[2]: *** No targets specified and no makefile found. Stop.
make[2]: Leaving directory `/faststorage/home/siyang/USER/yeweijian/PipelineTest/Speedseq/speedseq/src/samblaster'
make[1]: *** [samblaster] Error 2
make[1]: Leaving directory `/faststorage/home/siyang/USER/yeweijian/PipelineTest/Speedseq/speedseq'
make: *** [all] Error 2
Could you help?
Hi Colby,
I want to let you know that speedseq is running OK for all my samples except this one, which
gives me a segmentation fault.
I am not sure why this particular one gives me trouble; 50 others finished without error.
Also, the new svtyper now gives me correct genotypes.
/risapps/rhel6/speedseq/0.1.0//bin/lumpyexpress: line 411: 31825 Segmentation fault (core dumped) $LUMPY $PROB_CURVE -t ${TEMP_DIR}/${OUTBASE} -msw $MIN_SAMPLE_WEIGHT -tt $TRIM_THRES $EXCLUDE_BED_FMT $LUMPY_DISC_STRING $LUMPY_SPL_STRING > $OUTPUT
Thanks,
Ming
This is something of a nit but ...
If I already have freebayes, bwa, etc. in my $PATH, it seems redundant to have to update speedseq.config. I see some of the vars have the format:
BEDTOOLS=`which bedtools || true`
Is there a reason that others aren't like that?
Setting $SPEEDSEQ_HOME=/usr/local/
works if the binaries are in the same place, but that's not always the case...
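The fallback behaviour being asked for could look roughly like the sketch below: prefer whatever is on `$PATH`, and only then look under a SpeedSeq home directory. This is an illustration in Python rather than the shell syntax of speedseq.config, and `resolve_tool` and its `speedseq_home` argument are hypothetical names.

```python
import os
import shutil


def resolve_tool(name, speedseq_home=None):
    """Return a path for `name`: $PATH first, then <speedseq_home>/bin, else None."""
    found = shutil.which(name)
    if found:
        return found
    if speedseq_home:
        candidate = os.path.join(speedseq_home, "bin", name)
        # Only accept the fallback if it is an executable regular file.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


# Example: look up bedtools the same way for every tool, with no per-tool special cases.
bedtools_path = resolve_tool("bedtools", speedseq_home="/usr/local")
```

Returning `None` instead of raising mirrors the `|| true` in the config line above: a missing optional tool does not abort the lookup of the others.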
@s-boardman forked from #40
Hi Colby,
I've been able to run this today and no longer have the chromosome naming error.
However, what we now have is what looks like a root error (output below):
/opt/gridware/pkg/apps/speedseq/0.0.3a/gcc-4.4.6+root-5.34.30+samtools-0.1.19+python-2.7.3/bin/cnvnator-multi: error while loading shared libraries: libCore.so.5.34: cannot open shared object file: No such file or directory
Should I open a new ticket or do you think this related?
I am trying to run speedseq aln, and although it appears I have all the prerequisites installed, the process just sits there at 0% CPU indefinitely.
Here are my inputs: (http://clavius.bc.edu/~erik/speedseq/)
http://clavius.bc.edu/~erik/speedseq/chr20_bit.fa
http://clavius.bc.edu/~erik/speedseq/sample005.fa_1.fastq
http://clavius.bc.edu/~erik/speedseq/sample005.fa_2.fastq
Please let me know what obvious thing I'm doing wrong :)
Thanks for creating a wonderful tool.
Is there a way to use BAM files as input for the alignment step?
It's very time-consuming to generate FASTQs from BAMs (sorting the BAMs, then the bam2fastx step).
I'm testing speedseq on our cluster and am running into an issue where calling speedseq sv with CNVnator doesn't find the bam file passed to it.
The command I'm using is:
qsub -V -e e_r -o o_r -b Y -cwd -N speedseq_sv_rd \
speedseq sv -R /mnt/lustre/references/hg19/hg19_validated.fa \
-o NA12877 -g -k -d -v \
-B /mnt/archive/analysis/projects/HiSeq/CNV_Genome_Project/External_Data/Illumina_Platinum_Genomes/NA12877/bam_fromfastq_frombam/NA12877.bam \
-S /mnt/archive/analysis/projects/HiSeq/CNV_Genome_Project/External_Data/Illumina_Platinum_Genomes/NA12877/bam_fromfastq_frombam/NA12877.splitters.bam \
-D /mnt/archive/analysis/projects/HiSeq/CNV_Genome_Project/External_Data/Illumina_Platinum_Genomes/NA12877/bam_fromfastq_frombam/NA12877.discordants.bam
And the error I receive is:
Traceback (most recent call last):
File "/opt/gridware/pkg/apps/speedseq/0.0.3a/gcc-4.4.6+root-5.34.30+samtools-0.1.19+python-2.7.3/bin/cnvnator_wrapper.py", line 350, in <module>
chroms_list = get_chroms_list(args.bam)
File "/opt/gridware/pkg/apps/speedseq/0.0.3a/gcc-4.4.6+root-5.34.30+samtools-0.1.19+python-2.7.3/bin/cnvnator_wrapper.py", line 153, in get_chroms_list
proc = subprocess.Popen(['samtools', 'view', '-H', bam_fn], stdout = subprocess.PIPE)
File "/opt/gridware/pkg/apps/python/2.7.3/gcc-4.4.6/lib/python2.7/subprocess.py", line 679, in __init__
errread, errwrite)
File "/opt/gridware/pkg/apps/python/2.7.3/gcc-4.4.6/lib/python2.7/subprocess.py", line 1249, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
And from the verbose output I can see that this is the echoed CNVnator command:
# run cnvnator-multi
/opt/gridware/pkg/apps/python/2.7.3/gcc-4.4.6/bin/python2.7 /opt/gridware/pkg/apps/speedseq/0.0.3a/gcc-4.4.6+root-5.34.30+samtools-0.1.19+python-2.7.3/bin/cnvnator_wrapper.py \
--cnvnator /opt/gridware/pkg/apps/speedseq/0.0.3a/gcc-4.4.6+root-5.34.30+samtools-0.1.19+python-2.7.3/bin/cnvnator-multi \
-T NA12877.V0rHAeDGFjgn/cnvnator-temp -t 1 -w 100 \
-b /mnt/archive/analysis/projects/HiSeq/CNV_Genome_Project/External_Data/Illumina_Platinum_Genomes/NA12877/bam_fromfastq_frombam/NA12877.bam \
NA12877.V0rHAeDGFjgn/NA12877.bam.readdepth \
-c /mnt/lustre/references/speedseq/annotations/cnvnator_chroms -g GRCh37
Therefore /mnt/archive/analysis/projects/HiSeq/CNV_Genome_Project/External_Data/Illumina_Platinum_Genomes/NA12877/bam_fromfastq_frombam/NA12877.bam
is args.bam and gives an OSError. However, this file definitely exists and was generated by speedseq align without issue. If I run the speedseq sv command without CNVnator (i.e., without the -d flag), the process completes without errors and I get a genotyped VCF.
Any thoughts/ideas would be greatly appreciated!
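One thing worth ruling out: when `subprocess.Popen` is given an argument list, `OSError: [Errno 2]` is raised if the program itself (`samtools` in the traceback) cannot be found on the job's `PATH`. A missing BAM would normally show up as an error printed by samtools instead. The sketch below separates the two cases; `check_header_inputs` is a hypothetical helper, not part of cnvnator_wrapper.py.

```python
import os
import shutil


def check_header_inputs(bam_fn, samtools="samtools"):
    """Return a list of problems; an empty list means both inputs look usable."""
    problems = []
    # Popen raises OSError(ENOENT) when the executable is missing, so check it first.
    if shutil.which(samtools) is None:
        problems.append("executable not found on PATH: %s" % samtools)
    if not os.path.isfile(bam_fn):
        problems.append("BAM file not found: %s" % bam_fn)
    return problems


# Hypothetical inputs chosen so that both checks fail.
problems = check_header_inputs("/no/such/file.bam", samtools="no-such-samtools-xyz")
```

Running a check like this inside the submitted job (rather than on the login node) would show whether the scheduler environment actually exposes samtools.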
Feature request: an option to change read group fields such as ID and SM while using realign.
This could perhaps be implemented using bamaddrg (https://github.com/ekg/bamaddrg) or any other tool you prefer.
It looks like gawk is utilized explicitly in speedseq, but not currently listed as a prerequisite. I'm seeing errors in a Docker image where I did not explicitly install gawk.
Hi,
The HPC staff is installing speedseq for me and she told me that CNVnator can not be installed:
CNVnator failed -- it's missing a header file: TFrame.h
Any ideas?
Thanks,
Ming
Hi Colby,
I want to run the speedseq sv functionality on existing tumor/normal BAM files produced by bwa mem.
I tried using samblaster to generate the splitter and discordant files, but I always get an empty splitter BAM file. Do you have any suggestions or other ways to quickly get these files?
Thank you for your help in advance.
Best,
Ronak
When running the var module, I get this error message:
"/speedseq/bin/freebayes: unrecognized option '--experimental-gls'
did you mean --use-best-n-alleles ? "
It seems the "experimental-gls" option is deprecated.
I hope this does not affect the outcome?
Much easier and more stable:
http://gemini.readthedocs.org/en/latest/content/installation.html#automated-installation
make CNVnator run on multi sample LUMPY alignments.
Note that the current version of bwa mem (0.7.7) occasionally marks extremely distant reads as concordant (~250 Mb insert). This will seriously bias the LUMPY insert distribution, as calculated by pairend_distro.pl.
Hello,
I am trying to run 'speedseq var' on the 'speedseq aln' output BAM file.
I get an error when speedseq var tries to run freebayes. It seems speedseq requires an older version of freebayes.
I have the most recent version of freebayes installed (v9.9.13), and the error I am getting is: unrecognized option --region
Can you please suggest which version of freebayes is compatible with the speedseq var command?
Also, is speedseq being upgraded to use the most recent version of freebayes?
Thanks
Priti
As we know, GATK HaplotypeCaller is used for SNP calling.
As speedseq is used for SNV calling, what is the difference between these two tools?
Is it possible to use speedseq for SNP calling?
I think it will stall with the current config; it should query the entire genome.
Don't forget to change the '-d' flag.
Hi
I found some weird behaviour in BWA, where it maps reads to the wrong sequence.
Here are the reads that cause the problem (pb.fastq):
@M01342:47:000000000-A9VMJ:1:2110:28593:14009 1:N:0:1
ATCGGACCAGGCTTCATTCCC
+
CBCCCCCCCFCCGGGGGGGGG
@M01342:47:000000000-A9VMJ:1:2112:10072:14449 1:N:0:1
ATCGGACCAGGCTTCATTCCC
+
AAABABBBBFAAFGGGGGGGG
Here is my reference genome, brassica_pb.fa (these are actually microRNAs, but it should still work, right?):
bna-miR167c_MIMAT0005628_Brassica_napus_miR167c
TGAAGCTGCCAGCATGATCTA
bna-miR166a_MIMAT0005629_Brassica_napus_miR166a
TCGGACCAGGCTTCATTCCCC
here are my commands:
/home/ctuser/Documents/programmes/bwa-0.7.10/bwa index brassica_pb.fa
/home/ctuser/Documents/programmes/bwa-0.7.10/bwa aln -t 12 -n 0 -k 0 brassica_pb.fa pb.fastq > ./res_BWA_pb_newVersion/pb.sai
/home/ctuser/Documents/programmes/bwa-0.7.10/bwa samse brassica_pb.fa ./res_BWA_pb_newVersion/pb.sai $f > ./res_BWA_pb_newVersion/pb.sam
here is my sam file:
@SQ SN:bna-miR167c_MIMAT0005628_Brassica_napus_miR167c LN:21
@SQ SN:bna-miR166a_MIMAT0005629_Brassica_napus_miR166a LN:21
@PG ID:bwa PN:bwa VN:0.7.10-r789 CL:/home/ctuser/Documents/programmes/bwa-0.7.10/bwa samse brassica_pb.fa ./res_BWA_pb_newVersion/pb.sai pb.fastq
M01342:47:000000000-A9VMJ:1:2110:28593:14009 4 bna-miR167c_MIMAT0005628_Brassica_napus_miR167c 21 25 21M * 0 0 ATCGGACCAGGCTTCATTCCC CBCCCCCCCFCCGGGGGGGGG XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:21
M01342:47:000000000-A9VMJ:1:2112:10072:14449 4 bna-miR167c_MIMAT0005628_Brassica_napus_miR167c 21 25 21M * 0 0 ATCGGACCAGGCTTCATTCCC AAABABBBBFAAFGGGGGGGG XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:21
We can see that both reads have been assigned to the first miR, but they should not have mapped at all (or perhaps to the second one, although I set the parameters to allow no mismatches)!
I tried three different versions of BWA (0.7.5, 0.6.1 and 0.7.10) and got the same behaviour.
Did I do something wrong?
I must say it usually works fine and finds the right miR, but in two cases, including this one, it reported the miR just before the right one. Is it a problem with sequence length?
Thanks for any help
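A note on reading the SAM records above: both carry FLAG 4, and the SAM specification defines bit 0x4 as "segment unmapped", so the RNAME and POS columns on those lines should probably not be read as a real placement. Why bwa fills those columns in here is not established in this thread. The sketch below decodes that bit; `is_unmapped` and `summarize` are hypothetical helpers, and the record is abbreviated from the output above.

```python
SAM_FLAG_UNMAPPED = 0x4  # bit defined by the SAM specification


def is_unmapped(flag):
    """True if the 'segment unmapped' bit is set in a SAM FLAG value."""
    return bool(int(flag) & SAM_FLAG_UNMAPPED)


def summarize(sam_line):
    """Return (QNAME, RNAME, POS, unmapped?) for one tab-separated SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    return fields[0], fields[2], int(fields[3]), is_unmapped(fields[1])


# First record from the SAM file above, with the optional tags omitted.
record = "\t".join([
    "M01342:47:000000000-A9VMJ:1:2110:28593:14009", "4",
    "bna-miR167c_MIMAT0005628_Brassica_napus_miR167c", "21", "25", "21M",
    "*", "0", "0", "ATCGGACCAGGCTTCATTCCC", "CBCCCCCCCFCCGGGGGGGGG",
])
name, rname, pos, unmapped = summarize(record)
```

Filtering on this bit before counting hits per miR would keep such records out of the results.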
git clone --recursive git://github.com/cc2qe/speedseq
…
…
Initialized empty Git repository in /speedseq/src/parallel/.git/
fatal: reference is not a tree: 8a570f867dde07fbe9a025f2eec706d44b82966c
Unable to checkout '8a570f867dde07fbe9a025f2eec706d44b82966c' in submodule path 'src/parallel'
Appears the reference is to a commit that’s not been pushed to the main repo
TypeError: %d format: a number is required, not numpy.float64
Hi there,
I am sorry to bother you again. speedseq realign
was working for me, but after the HPC staff re-installed it, I have a problem with the bamtofastq Python script. Can you tell what's wrong here? Thank you! Note that the same BAM file (which was previously remapped successfully) now gives me the error. My BAM file should contain the CIGAR information for the reads.
Ming
Traceback (most recent call last):
File "/risapps/src6/speedseq//bin/bamtofastq.py", line 157, in <module>
sys.exit(main())
File "/risapps/src6/speedseq//bin/bamtofastq.py", line 153, in main
args.header)
File "/risapps/src6/speedseq//bin/bamtofastq.py", line 64, in bamtofastq
if 5 in [x[0] for x in al.cigar]:
TypeError: 'NoneType' object is not iterable
samblaster: Version 0.1.21
samblaster: Inputting from stdin
samblaster: Outputting to stdout
samblaster: Opening temp/disc_pipe for write.
samblaster: Opening temp/spl_pipe for write.
[gzclose] buffer error
samblaster: Loaded 84 header sequence entries.
Traceback (most recent call last):
File "/risapps/src6/speedseq//bin/bamtofastq.py", line 157, in <module>
sys.exit(main())
File "/risapps/src6/speedseq//bin/bamtofastq.py", line 153, in main
args.header)
File "/risapps/src6/speedseq//bin/bamtofastq.py", line 64, in bamtofastq
if 5 in [x[0] for x in al.cigar]:
TypeError: 'NoneType' object is not iterable
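The traceback shows that `al.cigar` was `None` for at least one alignment, which makes the list comprehension on line 64 fail. A defensive version of that check might look like the sketch below. It assumes the CIGAR is a list of `(operation, length)` tuples as in the traceback's `x[0]` access, with operation code 5 being a hard clip (H) in the BAM encoding; `has_hard_clip` is a hypothetical helper, not the script's actual code.

```python
BAM_CHARD_CLIP = 5  # CIGAR operation code for 'H' in the BAM encoding


def has_hard_clip(cigar):
    """True if any CIGAR operation is a hard clip; tolerates a missing CIGAR."""
    if not cigar:
        # Covers both None and an empty list, e.g. reads with no CIGAR at all.
        return False
    return any(op == BAM_CHARD_CLIP for op, _length in cigar)


# Hypothetical CIGARs: 10H90M, 100M, and a read with no CIGAR at all.
examples = [[(5, 10), (0, 90)], [(0, 100)], None]
flags = [has_hard_clip(c) for c in examples]
```

Whether skipping such reads is the right behaviour for bamtofastq.py is a separate question; the sketch only shows how to avoid the `TypeError`.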
Variant calling fails after several hours with "Please specify a BAM file or files" error message when the bed file contains headers:
$ head -n 5 ~/SureSelect_AllExon_V5_hg19_target_coordinates.bed
browser position chr1:65510-65625
track name="Covered" description="Agilent SureSelect DNA - SureSelectXT Human All Exon V5 - Genomic regions covered by probes" color=0,0,128
chr1 65509 65625 -
chr1 65831 65973 -
chr1 69481 69600 ens|ENST00000335137,ccds|CCDS30547.1,ref|NM_001005484,ref|OR4F5
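Until the pipeline handles this itself, one workaround is to strip the `browser` and `track` lines before passing the BED file in. A minimal sketch, assuming those two keywords and `#` comments are the only header forms present; `strip_bed_headers` is a hypothetical helper.

```python
def strip_bed_headers(lines):
    """Yield only BED data lines, skipping browser/track/comment/blank lines."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        # Compare the first whitespace-delimited token so that only true
        # header keywords are dropped.
        if stripped.split()[0] in ("browser", "track"):
            continue
        yield line


bed_lines = [
    "browser position chr1:65510-65625\n",
    'track name="Covered" color=0,0,128\n',
    "chr1\t65509\t65625\t-\n",
    "chr1\t65831\t65973\t-\n",
]
data_lines = list(strip_bed_headers(bed_lines))
```

Writing `data_lines` to a new file and passing that file to the variant caller leaves the original BED untouched.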
Nice pipeline, especially the CNV part! CNVnator gives many valid CNV calls complementary to lumpy.
When installing ROOT locally for CNVnator, it has to be compiled without a prefix; otherwise it won't run through the entire speedseq pipeline (at least for me). I know it is counterintuitive. Just type
./configure
make
No "make install" is required. Everything gets compiled locally, and sourcing /pathto/root/bin/thisroot.sh links the libraries.
use wider stdev, either 4 or 5 in constructing the insert distribution for lumpy
I am having an issue identical to the closed issue #28. I have sourced the ROOT installation and tried the suggestion from the seqanswers link you found (http://seqanswers.com/forums/showthread.php?t=16665).
When we run the make cnvnator-multi we are getting the error:
g++ -m64 -O3 -DCNVNATOR_VERSION="v0.3" -I/net/gs/vol3/software/modules-sw/ROOT/5.34.14/Linux/RHEL6/x86_64/include -Isrc/samtools -c src/cnvnator.cpp -o src/obj/cnvnator.o
In file included from src/cnvnator.cpp:8:
src/HisMaker.hh:11:20: error: TFrame.h: No such file or directory
I'm wondering if it may be related to the version of Root we are using (5.34.14) or if that matters.
Otherwise, the speedseq installation is working and we can run the speedseq sv without the "-d" flag.
Also, other than reading about the impact on sensitivity and FDR in the LUMPY paper, I don't have a full understanding of how the read depth is used by LUMPY and its impact on the output. I'm sure there is a line in the README or something I have missed. Could you point me to the best spot for that?
Thanks very much for your support and work on this software.
Kind regards,
Seamus Ragan
Particularly useful for low read sequencing analysis.
Hi there,
I was running speedseq realign, and it complains that pysam is not installed. But I do have pysam installed, and the error message looks truncated... I can open Python and import pysam without a problem.
Do you have an idea what's the problem? Thanks.
Sourcing executables from /risapps/rhel6/speedseq/0.0.3/bin/speedseq.config ...
Checking for required python modules ()...
Program: speedseq
Version: 0.0.3a
Author: Colby Chiang ([email protected])
usage:   speedseq [options]
command: align    align FASTQ files with BWA-MEM
         var      call SNV and indel variants with FreeBayes
         somatic  call somatic SNV and indel variants in a tumor/normal pair with FreeBayes
         sv       call SVs with LUMPY
         realign  re-align from a coordinate sorted BAM file
options: -h       show this message
Error: pysam is not installed for
Do auto name-sorting and split/duplicate read extraction
Hi!
In the paper you state that the code is open source
It would be awesome to see an explicit license for the repo 😃
Would it be possible for speedseq_setup to check pre-existing versions against the version that speedseq is trying to install? Additionally, could you pass make -j <num_cores> into the make commands, or run certain steps in parallel, since installation takes a while?
SAM spec allows * in the query and quality strings. Since these fields are not used by LUMPY, we can use a * and greatly reduce file size.
However, sambamba 0.4.7 has a bug that causes it to barf with those BAMs. Need to upgrade to sambamba 0.5.1, which has patched it. However, there are issues with creation and destruction of sort temp directories with sambamba 0.5.1 which need to be resolved before upgrading.
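The field replacement itself is simple. A minimal sketch on text SAM records, where SEQ and QUAL are the 10th and 11th tab-separated columns; `drop_seq_and_qual` is a hypothetical helper, and the record is made up for the example.

```python
def drop_seq_and_qual(sam_line):
    """Replace SEQ and QUAL with '*' in one SAM record; header lines pass through."""
    line = sam_line.rstrip("\n")
    if line.startswith("@"):
        return line
    fields = line.split("\t")
    if len(fields) < 11:
        # Not a complete alignment record; leave it untouched.
        return line
    fields[9] = "*"   # SEQ
    fields[10] = "*"  # QUAL
    return "\t".join(fields)


# Made-up record with one optional tag after the 11 mandatory fields.
record = "read1\t99\t20\t1000\t60\t5M\t=\t1200\t205\tACGTA\tIIIII\tNM:i:0"
slim = drop_seq_and_qual(record)
```

Optional tags after column 11 are preserved, so anything LUMPY reads from them is unaffected.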
Hi,
I used the "speedseq sv" command to call SVs on the test data. It reported that pysam is not installed, although I did install pysam (version 0.8.3) and can import it locally without any errors.
The version of speedseq I used is 0.1.0. Does anyone have the same problem in running speedseq?
Many thanks.
Hi
Just a quick q...what kind of run times are you getting for freebayes on 100bp 30x PE exome-seq, for example?
Cheers
Steve
Thanks for developing speedseq, I've been using it to call SVs in quads and it usually works except occasionally I get the following error (pasted below). Could you please help resolve this issue? N.b. sometimes I resubmit failed jobs and they work.
Error in TFile::WriteBuffer: error writing all requested bytes to file /oasis/tscc/scratch/wb/lumpy/temp/speedseq_sv_74-0115/cnvnator-temp/03C14334-sorted-rmdups-realigned-bqsr.bam.root, wrote 1230 of 4272
Error in TTree::Fill: Failed filling branch:7.rd_parity, nbytes=-1, entry=4405788
This error is symptomatic of a Tree created as a memory-resident Tree
Instead of doing:
TTree *T = new TTree(...)
TFile *f = new TFile(...)
you should do:
TFile *f = new TFile(...)
TTree *T = new TTree(...)
R__unzip: error -5 in inflate (zlib)
Error in TBasket::ReadBasketBuffers: fNbytes = 4272, fKeylen = 73, fObjlen = 31926, noutot = 0, nout=0, nin=4199, nbuf=31926
Error in TBranch::GetBasket: File: /oasis/tscc/scratch/wb/lumpy/temp/speedseq_sv_74-0115/cnvnator-temp/03C14334-sorted-rmdups-realigned-bqsr.bam.root at byte:2721053490, branch:rd_parity, entry:4389825, badread=1, nerrors=1, basketnumber=275
R__unzip: error -5 in inflate (zlib)
Error in TBasket::ReadBasketBuffers: fNbytes = 4272, fKeylen = 73, fObjlen = 31926, noutot = 0, nout=0, nin=4199, nbuf=31926
.
.
.
.
(and so on...)
Speedseq should have a command line parameter to pass the ploidy level to freebayes.
Hi
On Ubuntu Linux, the awk for-loops in the execution script cause problems, because the syntax used is only supported by a GNU awk extension, as described in this Stack Overflow question: http://stackoverflow.com/questions/16921493/awk-illegal-reference-to-array-a
I changed the loops to the following notation: for(i=1;i in fmt;i++)
and it seems to work. Could you adopt that in the source code in order to support all awk flavours?
Thanks
We have installed Root and tested that it works. The SpeedSeq sv also works without the -d option. We have added
source /mnt/pan/Data4/speedseq/root-v5-34/bin/thisroot.sh to the end of the speedseq.config.
When we run the speedseq example (run_speedseq.sh) with the -d option to get CNV, we receive the following message:
--Example script:
../bin/speedseq sv
-o example
-B example.bam
-S example.splitters.bam
-D example.discordants.bam
-R data/human_g1k_v37_20_42220611-42542245.fasta
-d
Below is the message on the terminal:
Calculating read depth
Traceback (most recent call last):
File "/mnt/pan/Data4/local/vxv89/sxs1528/speedseq/speedseq/bin/cnvnator_wrapper.py", line 350, in <module>
chroms_list = get_chroms_list(args.bam)
File "/mnt/pan/Data4/local/vxv89/sxs1528/speedseq/speedseq/bin/cnvnator_wrapper.py", line 153, in get_chroms_list
proc = subprocess.Popen(['samtools', 'view', '-H', bam_fn], stdout = subprocess.PIPE)
File "/home/sxs1528/anaconda/lib/python2.7/subprocess.py", line 710, in __init__
errread, errwrite)
File "/home/sxs1528/anaconda/lib/python2.7/subprocess.py", line 1335, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
Please give us suggestions as what to do next.
Thank you
check for BWA index before alignment rather than hanging forever while throwing a completely uninformative error
Thanks.
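A pre-flight check for the BWA index could be as small as the sketch below. It assumes the index was built with `bwa index <reference>` using the default prefix, which writes the `.amb`, `.ann`, `.bwt`, `.pac` and `.sa` files next to the reference; `missing_bwa_index_files` is a hypothetical helper, and the reference path in the example is made up.

```python
import os

# Files written by `bwa index` alongside the reference (default prefix).
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_bwa_index_files(reference):
    """Return the expected BWA index files that do not exist for `reference`."""
    return [
        reference + suffix
        for suffix in BWA_INDEX_SUFFIXES
        if not os.path.isfile(reference + suffix)
    ]


# Hypothetical reference path; an empty list would mean the index looks complete.
missing = missing_bwa_index_files("/no/such/dir/ref.fa")
if missing:
    message = "BWA index incomplete, missing: " + ", ".join(missing)
```

Failing early with a message like this, before any alignment process starts, avoids the silent hang described above.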