fulcrumgenomics / fgbio
Tools for working with genomic and high throughput sequencing data.
Home Page: http://fulcrumgenomics.github.io/fgbio/
License: MIT License
Hi again,
Just noticed a FORMAT field has no header entry:
$ java -jar fgbio-0.1.2-SNAPSHOT.jar HapCutToVcf -v ../NA12878.1000G.vcf -i na12878.output_haplotype_file -o test.vcf.gz
$ bcftools view test.vcf.gz > /dev/null
[W::vcf_parse_format] FORMAT 'PR' is not defined in the header, assuming Type=String
Instead of this:
val parIter = iter.par
parIter.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(threads))
val result = parIter.map(x => doWork(x))
something like:
val result = iter.withPar(threads).map(x => doWork(x))
and something like this:
val threadPool: ForkJoinPool = ...
val result = iter.withPar(threadPool).map(x => doWork(x))
The latter is useful if we have multiple parallel collections.
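A minimal sketch of what such an extension could look like, assuming Scala 2.12 where parallel collections ship with the standard library (withPar and ParSupport are hypothetical names, not part of fgbio):

```scala
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.{ForkJoinTaskSupport, ParIterable}

// Hypothetical extension method; withPar is not part of fgbio or the stdlib.
implicit class ParSupport[A](private val iter: Iterable[A]) {
  // Run parallel operations on a dedicated pool of the given size.
  def withPar(threads: Int): ParIterable[A] = withPar(new ForkJoinPool(threads))

  // Share an existing pool across multiple parallel collections.
  def withPar(pool: ForkJoinPool): ParIterable[A] = {
    val par = iter.par
    par.tasksupport = new ForkJoinTaskSupport(pool)
    par
  }
}
```

With this in scope, iter.withPar(threads).map(x => doWork(x)) reads exactly as proposed above, and the pool-taking overload covers the multiple-collections case.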
Not sure why, since this isn't usually a problem:
java -Xmx4096m -jar fgbio.jar CallMolecularConsensusReads -i grouped.unsorted.bam -o /dev/stdout -r consensus.rejects.bam -t MI -P false > tmp.bam
Exception in thread "main" htsjdk.samtools.util.RuntimeIOException: /dev/stdout (Permission denied)
at htsjdk.samtools.util.BinaryCodec.close(BinaryCodec.java:610)
at htsjdk.samtools.BAMFileWriter.finish(BAMFileWriter.java:143)
at htsjdk.samtools.SAMFileWriterImpl.close(SAMFileWriterImpl.java:219)
at com.fulcrumgenomics.umi.CallMolecularConsensusReads.execute(CallMolecularConsensusReads.scala:134)
at com.fulcrumgenomics.cmdline.FgBioMain.makeItSo(FgBioMain.scala:70)
at com.fulcrumgenomics.cmdline.FgBioMain.makeItSoAndExit(FgBioMain.scala:53)
at com.fulcrumgenomics.cmdline.FgBioMain$.main(FgBioMain.scala:41)
at com.fulcrumgenomics.cmdline.FgBioMain.main(FgBioMain.scala)
Caused by: java.io.FileNotFoundException: /dev/stdout (Permission denied)
at java.io.RandomAccessFile.open0(Native Method)
at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
at htsjdk.samtools.util.BlockCompressedInputStream.checkTermination(BlockCompressedInputStream.java:471)
at htsjdk.samtools.util.BlockCompressedOutputStream.close(BlockCompressedOutputStream.java:286)
at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
at htsjdk.samtools.util.BinaryCodec.close(BinaryCodec.java:606)
... 7 more
Tracked it down to this: samtools/htsjdk#692
As it stands, the way GroupReadsByUmi checks whether UMIs are equivalent is simply to count the number of bases that are not identical (after upper-casing). This is problematic in a few corner cases:
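That check is essentially a case-insensitive Hamming distance. A minimal sketch of the comparison (illustrative only, not fgbio's actual code) shows why: it treats N like any other base, which is exactly one of the corner cases.

```scala
// Sketch: count mismatching bases between two same-length UMIs, ignoring case.
// Note that N is treated as an ordinary base here.
def umiMismatches(a: String, b: String): Int = {
  require(a.length == b.length, "UMIs must be the same length")
  a.toUpperCase.zip(b.toUpperCase).count { case (x, y) => x != y }
}
```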
My first instinct is that we should do two things:
There are a number of options that are common to many, but not all, tools in fgbio that it would be nice to have standard names for. For example, things that control some aspects of BAM and VCF writing, like whether or not to create indices, what compression level to use, and whether or not to enable async-io.
Perhaps we could convert fgbio to use the same sub-command model as dagr uses, and then define options to fgbio separately from options to the tools? Then we could write things like:
java -jar fgbio.jar --async=true DemuxFastqs --input ...
I was recently experimenting with CallMolecularConsensusReads, and it appears to strip the alignment information out of the resulting BAM when I use it. Is this the desired behavior?
@tfenne what do you think?
It would be great to have it depend on a stable version of dagr for the release, so perhaps depend on fulcrumgenomics/dagr#201?
Requires the spec to be a bit more settled. See:
We've been getting started with an AnnotateBamWithUmis, GroupReadsByUmi, and CallMolecularConsensusReads pipeline to consolidate UMI-tagged input FASTQs, and a couple of conceptual questions and feature requests came up about the grouping process:
Thanks much for putting together these tools.
Hi,
How will the AnnotateBamWithUmis tool work with two UMIs? I have reads with one UMI 5' of the insert and one 3' of it.
Hello,
I was wondering if soft-clipped bases, in Illumina HiSeq 4000 data, are removed before consensus building? We have noticed a lot more long insertions in our analyses when we use all reads for a given UMI, but not in consensuses built from 3 or more reads of a given UMI.
Thanks,
Sakina
Hi folks,
First of all, many thanks for providing this nice package. I just started using fgbio to deal with sequences that use some customized UMIs. I could successfully annotate my bam file with UMI tags. Bam file looks like the following after annotating with AnnotateBamWithUmis:
NS500607:25:HGWNLBGXY:4:11404:10042:8605 99 chr1 836282 60 74M = 836455 247 GTGAATCTAAAACCGTTCTTTAAAAAGAGTTTATTTAGTAGGCTGGGCATGGTGGCTCACACCTGCAATCCCAG AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEE X0:i:1 X1:i:0 MD:Z:74 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U RX:Z:AGGGCTTAACGC
NS500607:25:HGWNLBGXY:4:11406:25080:7099 99 chr1 836282 60 74M = 836455 247 GTGAATCTAAAACCGTTCTTTAAAAAGAGTTTATTTAGTAGGCTGGGCATGGTGGCTCACACCTGCAATCCCAG AAAAAEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEE X0:i:1 X1:i:0 MD:Z:74 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U RX:Z:AGGGCTTAACGC
Please note that mates are also in the bam file, but not shown here.
So the issue is with GroupReadsByUmi: nothing is written to the output BAM (empty file). The following is the log output:
[2016/11/02 17:17:28 | FgBioMain | Info] Executing GroupReadsByUmi from fgbio version 0.1.2-SNAPSHOT as mh1111 JRE 1.8.0_73-b02
[2016/11/02 17:17:28 | GroupReadsByUmi | Info] Filtering and sorting input.
[2016/11/02 17:17:44 | GroupReadsByUmi | Info] Assigning reads to UMIs and outputting.
[2016/11/02 17:17:44 | FgBioMain | Info] GroupReadsByUmi completed. Elapsed time: 0.33 minutes.
I ran the tool as follows:
java -Xmx32g -jar ~/src/fgbio/target/scala-2.11/fgbio-0.1.2-SNAPSHOT.jar GroupReadsByUmi -i input_bam.bam -s edit -e 2 -m 20 -o output_bam.bam --tmp-dir=../tmp
I would be very grateful if you could help me fix this. It is very hard to figure out, as I do not get any error message.
Cheers,
MH
It probably belongs in homebrew-science. See the contributing guide.
class Fgbio < Formula
  desc "Tools for working with genomic and high throughput sequencing data."
  homepage "https://github.com/fulcrumgenomics/fgbio"
  url "https://github.com/fulcrumgenomics/fgbio/archive/0.1.1.tar.gz"
  sha256 "4b2414ab8af5e8cfa384cde44fc9a282cec83c51cd93efe0113866dc2840e696"

  head do
    url "https://github.com/fulcrumgenomics/fgbio.git"
  end

  bottle :unneeded

  depends_on :java => "1.8+"
  depends_on "sbt" => :build

  def install
    system "sbt", "assembly"
    libexec.install Dir["target/scala-*/fgbio-*.jar"][0]
    bin.write_jar_script Dir[libexec/"fgbio-*.jar"][0], "fgbio"
  end

  test do
    assert_match "USAGE", shell_output("#{bin}/fgbio -h 2>&1", 1)
  end
end
See #199
I find this personally very useful, so I'll assign this to myself.
It would be nice to have the counter objects implement the CanBuildFrom interfaces so that you can manipulate their contents and get another counter back without a lot of extra effort.
Similarly, it might be nice to implement immutable Map too, to provide Map-like accessors.
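As a sketch of the idea (SimpleCounter and mapKeys are illustrative names, not fgbio's actual Counter API): transforming keys yields another counter, with colliding keys merged, which is the kind of operation CanBuildFrom support would provide generically.

```scala
// Hypothetical counter sketch; not fgbio's actual Counter implementation.
class SimpleCounter[T](val counts: Map[T, Long]) {
  // Map-like accessor: the count for a key, defaulting to zero.
  def count(t: T): Long = counts.getOrElse(t, 0L)

  // Transform keys and get another counter back, merging any collisions.
  def mapKeys[U](f: T => U): SimpleCounter[U] =
    new SimpleCounter(counts.groupBy { case (k, _) => f(k) }
      .map { case (u, kvs) => u -> kvs.values.sum })
}
```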
E.g. at startup should they report their environment similar to the way Picard tools do? At the end should they print out the runtime and any other useful stats?
This is useful to ensure that there is at least one block per chromosome.
There are really three parts to this proposal:
Simplify the consensus caller (particularly the usage thereof) by removing parameters that we have found are not useful and that overlap with other parameters
Add tags to the consensus reads that are generated so that downstream tools can know a lot more about the properties of each consensus read and its supporting raw reads
Move filtering of consensus reads and bases largely into a separate program that can be run after consensus calling, removing the need to call consensuses multiple times to experiment with different cutoffs
To recap, we have a lot of parameters to consensus calling:
errorRatePreUmi: PhredScore
errorRatePostUmi: PhredScore
minInputBaseQuality: PhredScore
maxRawBaseQuality: PhredScore
rawBaseQualityShift: PhredScore
minConsensusBaseQuality: PhredScore
minReads: Int
minMeanConsensusBaseQuality: PhredScore
requireConsensusForBothPairs: Boolean
I'm finding that in general there are two sets of highly overlapping parameters where I favor one of the set, and would advocate for removal of the others:
errorRatePostUmi, maxRawBaseQuality and rawBaseQualityShift all work to effectively reduce the probability we construct from the raw read base qualities, albeit in slightly different ways. I'm finding that I tend to use errorRatePostUmi exclusively, and then set maxRawBaseQuality to something high that will have little to no effect and rawBaseQualityShift to 0. I propose removing maxRawBaseQuality and rawBaseQualityShift.
minConsensusBaseQuality and minMeanConsensusBaseQuality. While not exactly redundant, I find that I set minConsensusBaseQuality high enough (i.e. close enough to errorRatePreUmi) that minMeanConsensusBaseQuality becomes meaningless. I would propose getting rid of minMeanConsensusBaseQuality, perhaps in favor of introducing a filter on the number or proportion of no-calls generated in a consensus read (more in the filtering section).
It would help significantly to encode information about the consensus reads into tags that are added to the consensus reads in the BAM file for three reasons:
One could generate a single consensus BAM, re-align it, and then filter it different ways to experiment with filtering settings much more easily and quickly
It would allow the information to be used in filtering variants (e.g. you could imagine performing rank-sum-based tests on things like raw-read depth or consensus error rate between reads supporting the ref and alt alleles at a site).
It would aid when manually reviewing false positive or questionable variant calls to have more knowledge about the consensus read.
The real question is what tags we should add. We can break this down into per-read and per-base information. My first pass is that we'd want to encode the following per read:
Then per-base I think we'd want to capture:
This does bring up the question of the best way to store the per-base tags. The first could be stored as unsigned 8-, 16- or 32-bit int[]s. I'd vote for either 8 or 16, as I'm not really sure there's value in discriminating between a consensus at 256 deep vs. deeper, and definitely not between 65k deep and deeper! The disagreeing bases could be stored in the same way, or we could store the error rates as a 32-bit float[] (seems overkill), or (and I'm not keen on this) do some kind of scaling/logging of the values and store them as chars. Maybe we should test and see if using 16- or 32-bit ints has any overhead in BAM vs. 8-bit when all values are << 256?
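One way to frame the width question is to clamp each depth into the narrowest array type that fits; a sketch (packDepths is a hypothetical helper, and the clamping policy is illustrative, not anything fgbio does):

```scala
// Sketch: choose the narrowest array type for per-base depth values.
// Bytes/shorts would be interpreted as unsigned when written to BAM array tags.
def packDepths(depths: Array[Int]): AnyRef = {
  require(depths.nonEmpty, "need at least one depth value")
  val max = depths.max
  if (max <= 0xFF)        depths.map(d => (d min 0xFF).toByte)    // unsigned 8-bit
  else if (max <= 0xFFFF) depths.map(d => (d min 0xFFFF).toShort) // unsigned 16-bit
  else                    depths                                  // fall back to 32-bit
}
```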
Any other suggestions for things we definitely want to store? I can think of some other things, but until I feel a need for them I'm tempted to stop here.
I think this would be really beneficial for the reasons outlined above. Obviously input-read filtering would still have to happen in consensus calling. And I think I would like to keep a min-reads filter there too, for efficiency's sake (e.g. if you know you're never going to go below min=2 or min=3, you can save a lot of compute and file size by enforcing that during consensus building). But beyond that I'd suggest moving everything else off to a post-filtering tool that uses the tags described above and features of the consensus read (number of Ns, quality scores, etc.) to do both per-read and per-base filtering.
Ideally such a tool will run on an aligned BAM in coordinate order; we should discuss whether it is acceptable to produce a semi-invalid BAM (e.g. if one read in a pair is tossed, is it necessary to toss the other one too, or fix its flags?). If the goal is to filter before producing QC metrics and variant calls, I'm not so sure that's important.
I am trying to identify haplotypes and their alleles; my command line is posted below. When I run it I get the following error. Please help me fix it:
calling MAXCUT based haplotype assembly algorithm
./command.sh: line 11: 10635 Segmentation fault
/share/apps/hapcut/extractHAIRS --VCF filtered_variants.vcf --bam aln_sorted.bam --maxIS 600 --ref 2017_V2_ND886_assembly.fa --indels 1 > fragment_matrix_file
/share/apps/hapcut/HAPCUT --fragments fragment_matrix_file --VCF filtered_variants.vcf --output output_haplotype.vcf --maxiter 100 > hapcut.log
The Metric class does not support optional values via Option[Type]. This would be nice as there are definitely cases where some values are optional, and not necessarily present.
Should we leave any existing @PG lines in the header? That's going to make chaining much more complicated.
... so that we can call it when we have completed recording to get a final count.
Hello! I am getting the following error when trying to run HapCutToVcf. I can share some test data but it is quite large.
$ java -jar /illumina/thirdparty/joconnell/fgbio/target/scala-2.11/fgbio-0.1.2-SNAPSHOT.jar HapCutToVcf -v ../NA12878.1000G.vcf -i na12878.output_haplotype_file -o test.vcf
[2016/09/19 20:52:24 | FgBioMain | Info] Executing HapCutToVcf from fgbio version 0.1.2-SNAPSHOT as [email protected] on JRE 1.8.0_40-b26
[2016/09/19 20:52:25 | FgBioMain | Info] HapCutToVcf failed. Elapsed time: 0.07 minutes.
Exception in thread "main" java.lang.NumberFormatException: For input string: "1|0"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:30)
at com.fulcrumgenomics.vcf.HapCutCall$$anonfun$9.apply(HapCutToVcf.scala:503)
at com.fulcrumgenomics.vcf.HapCutCall$$anonfun$9.apply(HapCutToVcf.scala:503)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at com.fulcrumgenomics.vcf.HapCutCall.toVariantContext(HapCutToVcf.scala:503)
at com.fulcrumgenomics.vcf.HapCutAndVcfMergingIterator.next(HapCutToVcf.scala:269)
at com.fulcrumgenomics.vcf.HapCutAndVcfMergingIterator.next(HapCutToVcf.scala:240)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at com.fulcrumgenomics.vcf.HapCutAndVcfMergingIterator.foreach(HapCutToVcf.scala:240)
at com.fulcrumgenomics.vcf.HapCutToVcf.execute(HapCutToVcf.scala:111)
at com.fulcrumgenomics.cmdline.FgBioMain.makeItSo(FgBioMain.scala:70)
at com.fulcrumgenomics.cmdline.FgBioMain.makeItSoAndExit(FgBioMain.scala:53)
at com.fulcrumgenomics.cmdline.FgBioMain$.main(FgBioMain.scala:41)
at com.fulcrumgenomics.cmdline.FgBioMain.main(FgBioMain.scala)
just to check that my inputs are sane:
$ head na12878.output_haplotype_file
BLOCK: offset: 921 len: 195 phased: 2 SPAN: 1719 fragments 3
921 0 1 chr1 568214 C T 1|0:255,0,255:84:26,28,14,16:127 0 0.000000 0.000000
1115 0 1 chr1 569933 G A 1|0:221,0,255:97:36,36,15,10:127 0 0.000000 0.000000
********
BLOCK: offset: 1553 len: 59 phased: 2 SPAN: 2070 fragments 1
1553 0 1 chr1 724869 C A 1|0:91,0,165:16:11,0,0,5:90 0 0.000000 -0.000087
1611 0 1 chr1 726939 G C 1|0:255,0,255:52:13,11,7,21:127 0 0.000000 -0.000087
********
BLOCK: offset: 1952 len: 2 phased: 2 SPAN: 45 fragments 9
1952 0 1 chr1 749644 T C 1|0:41,0,78:35:6,14,11,4:42 0 0.000000 0.000000
$ bcftools view -H ../NA12878.1000G.vcf | head
chr1 10539 . C A 0 . DP=61;MQSB=0.805878;MQ0F=0;AC=0;AN=2;DP4=40,19,0,0;MQ=44 GT:PL:DP:DP4:GQ 0|0:0,178,255:59:40,19,0,0:255
chr1 10642 . G A 56 . DP=20;MQSB=0.818182;MQ0F=0.7;AC=0;AN=2;DP4=2,11,0,0;MQ=0 GT:PL:DP:DP4:GQ 0|0:0,39,29:13:2,11,0,0:28
chr1 11008 . C G 54 . DP=21;MQSB=1;MQ0F=1;AC=0;AN=2;DP4=12,9,0,0;MQ=0 GT:PL:DP:DP4:GQ 0|0:0,63,27:21:12,9,0,0:27
chr1 11012 . C G 52 . DP=21;MQSB=1;MQ0F=1;AC=0;AN=2;DP4=12,8,0,0;MQ=0 GT:PL:DP:DP4:GQ 0|0:0,60,25:20:12,8,0,0:25
chr1 13011 . T G 0 . DP=107;SGB=-0.379885;RPB=1;MQB=1;MQSB=0.55873;BQB=1;MQ0F=0.635514;AC=0;AN=2;DP4=36,66,0,1;MQ=8 GT:PL:DP:DP4:GQ 0|0:0,255,196:103:36,66,0,1:255
chr1 13110 . G A 999 . DP=51;MQSB=0.0687032;MQ0F=0.313726;AC=0;AN=2;DP4=7,42,0,0;MQ=15 GT:PL:DP:DP4:GQ 0|0:0,148,179:49:7,42,0,0:255
chr1 13116 . T G 97 . DP=46;VDB=0.0517943;SGB=-0.693132;RPB=0.000248006;MQB=0.00218214;MQSB=0.29013;BQB=0.0112228;MQ0F=0.413043;AC=2;AN=2;DP4=3,7,3,31;MQ=11 GT:PL:DP:DP4:GQ 1|1:124,38,0:44:3,7,3,31:46
chr1 13118 . A G 94 . DP=46;VDB=0.0555373;SGB=-0.693132;RPB=0.0001001;MQB=0.00136901;MQSB=0.197653;BQB=0.155903;MQ0F=0.456522;AC=2;AN=2;DP4=4,7,3,31;MQ=11 GT:PL:DP:DP4:GQ 1|1:121,35,0:45:4,7,3,31:42
chr1 13259 . G A 999 . DP=65;MQSB=0.896401;MQ0F=0.707692;AC=0;AN=2;DP4=26,37,0,0;MQ=5 GT:PL:DP:DP4:GQ 0|0:0,190,135:63:26,37,0,0:255
chr1 13273 . G C 999 . DP=55;MQSB=0.987527;MQ0F=0.709091;AC=0;AN=2;DP4=22,32,0,0;MQ=6 GT:PL:DP:DP4:GQ 0|0:0,163,132:54:22,32,0,0:255
Remove: src/scripts/com/fulcrumgenomics/metrics/wgsHistogram.R
The previous implementation of NumericTypes.or(values: Array[LogProbability]): LogProbability did not use a rolling sum, but now we do: #50
Figure out a way to make this more efficient. Likely the array is only a few elements long (four in a lot of cases). See the use of this method in the ConsensusCaller.
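For reference, a rolling log-space sum over a short array can be written as a pairwise log-sum-exp; this is a generic sketch, not the actual NumericTypes code:

```scala
import scala.math.{exp, log1p}

// log(exp(a) + exp(b)), computed stably by factoring out the larger term.
def logOr(a: Double, b: Double): Double =
  if (a > b) a + log1p(exp(b - a)) else b + log1p(exp(a - b))

// Rolling sum over an array of log-probabilities (often just four elements).
def logSum(values: Array[Double]): Double = values.reduce(logOr)
```

For four-element arrays the reduce overhead is tiny; unrolling the loop or special-casing length 4 would be the obvious micro-optimization.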
The default mechanism for writing out values in Metric.write causes Boolean values to be emitted as either Y or N. However, the code that reads the values back only appears to recognize true as a valid value for Boolean.True. So upon round trip all boolean values become false.
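A reader that accepts the writer's own encoding would fix the round trip; a sketch (parseBoolean is a hypothetical helper, and fgbio's actual Metric parsing code differs):

```scala
// Sketch: accept both the Y/N encoding the writer emits and true/false,
// so Boolean values survive a write/read round trip.
def parseBoolean(s: String): Boolean = s.trim.toLowerCase match {
  case "y" | "yes" | "true" => true
  case "n" | "no" | "false" => false
  case other                => throw new IllegalArgumentException(s"Not a boolean: $other")
}
```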
Hi folks,
I am getting the following error when I run GroupReadsByUmi:
###############################
[2016/11/21 15:32:07 | GroupReadsByUmi | Info] Sorted 76,000,000 record. Elapsed time: 00:14:55s. Time for last 1,000,000: 11s. Last read position: chr2:198,257,003
[2016/11/21 15:32:19 | GroupReadsByUmi | Info] Sorted 77,000,000 record. Elapsed time: 00:15:07s. Time for last 1,000,000: 12s. Last read position: chr7:148,525,795
[2016/11/21 15:32:24 | GroupReadsByUmi | Info] Assigning reads to UMIs and outputting.
[2016/11/21 15:32:27 | FgBioMain | Info] GroupReadsByUmi failed. Elapsed time: 15.29 minutes.
Exception in thread "main" java.lang.NullPointerException
at com.fulcrumgenomics.umi.GroupReadsByUmi$$anonfun$14.apply(GroupReadsByUmi.scala:477)
at com.fulcrumgenomics.umi.GroupReadsByUmi$$anonfun$14.apply(GroupReadsByUmi.scala:477)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.immutable.List.map(List.scala:285)
at com.fulcrumgenomics.umi.GroupReadsByUmi.assignUmiGroups(GroupReadsByUmi.scala:477)
at com.fulcrumgenomics.umi.GroupReadsByUmi.execute(GroupReadsByUmi.scala:448)
at com.fulcrumgenomics.cmdline.FgBioMain.makeItSo(FgBioMain.scala:70)
at com.fulcrumgenomics.cmdline.FgBioMain.makeItSoAndExit(FgBioMain.scala:53)
at com.fulcrumgenomics.cmdline.FgBioMain$.main(FgBioMain.scala:41)
at com.fulcrumgenomics.cmdline.FgBioMain.main(FgBioMain.scala)
###############################
I could successfully run it on smaller BAM files, e.g., with 17M reads. Could it be because of the large size of my BAM file or some sort of memory issue?
Thanks for your continued support.
Cheers,
MH
It would be really nice if the tools in the consensus chain could all emit @PG lines in the header. Even if they're not stamped on the records, it would be great to capture the command lines used in grouping, calling and filtering.
Hi folks,
I have a question about the write-up on the likelihood model of CallMolecularConsensusReads available here: https://github.com/fulcrumgenomics/fgbio/wiki/Calling-Consensus-Reads
Almost all of the parameters used in the model are described in the write-up, but unfortunately I could not find how Err_{post} and Err_{pre} are determined. More precisely, what are the factors affecting these error rates? And how can one set these two parameters given the quality of the reads grouped by UMI?
Thank you for your continued support,
MH
It would be interesting to try this. The thought being that if quality does tail off at the end of reads we may end up with more errors or masked bases at the end of the read. We could, during filtering, apply some kind of end-trimming to reads after masking and prior to the "too many Ns" test. The trimming could do any of:
FilterConsensusReads works on unmapped BAMs, but does an unnecessary second sort into coordinate order(!) in order to recalculate tags that don't exist. I think we could check, and if we don't see any mapped reads at all, output the file in queryname order without doing the whole second sort.
Could also consider the same for the reference.
Hi folks,
The output of CallMolecularConsensusReads seems incomplete (no coordinate info is included):
C:0 77 * 0 0 * * 0 0 TGGCTGGGCGCGGTGGCTCACGCCTGTAATCCCAGCATATNNGGAGGCTGAGGCGNGNGNNTCNNGNGNNCANG LLLLLMMMMMMMMMMMMMMAMMMMMMMMAMMMMMMM==AM##MM2MMMMMMMMMA#M#M##M;##M#M##MM#M RG:Z:A MI:Z:0
C:0 141 * 0 0 * * 0 0 CCACCTCCCGGGTTCACGCCATTCTCTNGCCTCAGCCACCTGAGTAGCTGGGACTACAGGCNNCNGCCACCNCN LLLLLMMMMMMMMMMMMMMMMMMMMMM#MMMMMMMMMMMMMMMLMMMMMMMMMAMMMMMMM##8#MMMMMM#M# RG:Z:A MI:Z:0
C:1 77 * 0 0 * * 0 0 AGGAGTTCAAGACCAGCCTGGCCAACATGGTGAAACCCCATCTCTACTAAAAGTGCAAAAATTAGCCAGGCATG LLLLLMMMMMMMMMMMMMMMMMMMMMMMMMMMMMLMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM RG:Z:A MI:Z:1
C:1 141 * 0 0 * * 0 0 CCCAGGCTGGAGTGCAGTGGTGTGATATTGGCTCACTGCAACCTCCGCCTTCCGGGTTCAAGCGATTCTCCTGC LLLLLMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM RG:Z:A MI:Z:1
Would you please let me know what you think could be causing this?
Thanks a lot,
MH
Reported by @tfenne: bwa mem runs slowly on reads with even a small number of Ns.
In VanillaUmiConsensusCaller we have a method consensusReadLength, which calculates the output read length based on the longest length at which we still have min-reads available.
I think we should probably quality-trim the source reads prior to invoking this method. We could simply trim off bases below min-input-base-quality at the ends of the reads without doing any harm. We could also do a phred/bwa-style trim where we trim until we're consistently above min-input-base-quality. The goal would be to ensure that we don't create consensus reads with strings of Ns at the end if we have source reads that trail off into low quality at the end. E.g. if we have 10 reads at 100bp and min=3, but all the reads turn to low-quality junk at cycle 80, I believe we'll currently create a 100bp consensus read with 20 Ns at the end.
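The simple version of that trim (drop trailing bases below the threshold) can be sketched as follows; trimTrailingLowQuality is a hypothetical helper, and qualities are Phred scores:

```scala
// Sketch: return how many leading bases to keep after dropping trailing
// bases whose quality is below minQual. A phred/bwa-style running-sum trim
// would be the more aggressive alternative mentioned above.
def trimTrailingLowQuality(quals: Array[Int], minQual: Int): Int = {
  var len = quals.length
  while (len > 0 && quals(len - 1) < minQual) len -= 1
  len
}
```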
It would be helpful to output, at the end of execution, the counts of reads that were filtered out, along with the reasons, in order to help users understand why reads are being dropped and particularly to diagnose pathological situations.
Hi
I am currently trying to handle sequence data that contains UMIs. I came across fgbio and have attempted to use some of its applications with my data. I have successfully annotated my BAMs with UMIs and am now looking to collapse the reads into a consensus sequence. I have tried using the 'GroupReadsByUmi' and 'CallMolecularConsensusReads' applications (I understand that the latter is dependent on the former?) but in both cases I encounter errors:
With regards to the former, the program seems to successfully complete the 'filtering and sorting input' step but stops during the 'Assigning reads to UMIs and outputting' step, giving the error:
Reads out of order @ D00408:358:C8KK0ANXX:7:1105:12567:37050
at scala.Predef$.assert(Predef.scala:165)
at com.fulcrumgenomics.umi.GroupReadsByUmi$$anonfun$execute$8.apply(GroupReadsByUmi.scala:447)
at com.fulcrumgenomics.umi.GroupReadsByUmi$$anonfun$execute$8.apply(GroupReadsByUmi.scala:447)
at scala.collection.immutable.List.foreach(List.scala:381)
at com.fulcrumgenomics.umi.GroupReadsByUmi.execute(GroupReadsByUmi.scala:447)
at com.fulcrumgenomics.cmdline.FgBioMain.makeItSo(FgBioMain.scala:70)
at com.fulcrumgenomics.cmdline.FgBioMain.makeItSoAndExit(FgBioMain.scala:53)
at com.fulcrumgenomics.cmdline.FgBioMain$.main(FgBioMain.scala:41)
at com.fulcrumgenomics.cmdline.FgBioMain.main(FgBioMain.scala)
When I try running the 'CallMolecularConsensusReads' command (ignoring the fact that I haven't grouped reads by UMI) I also get an error:
Alignments added out of order in SAMFileWriterImpl.addAlignment…
…
Sort order is coordinate. Offending records are at [chr1:1257827] and [chr1:1257750]
Do you have any idea what might be causing this issue, and what I can do to get around it? I have tried sorting the UMI-annotated BAM by query name and by coordinate (even though the notes say this shouldn't matter) and it doesn't make a difference.
I would appreciate any help you could provide. I am relatively new to bioinformatics, so I apologise if the solution is obvious.
Thanks
C
Hi folks,
We have successfully used the AnnotateBamWithUmis, GroupReadsByUmi and CallMolecularConsensusReads pipeline and we are very pleased with the fgbio package so far. Hats off to the developers!
I was wondering if you have additional documentation online beyond what is provided by the --help command. For example, in order to properly interpret the consensus reads, we are very interested to learn more about the likelihood model used in CallMolecularConsensusReads. Such information would be very helpful for adjusting input parameters such as --error-rate-pre-umi, --error-rate-post-umi, --max-base-quality, and --base-quality-shift, to name a few. Another important question is about masking consensus bases. Does the package call 'N' solely based on --min-consensus-base-quality?
Thank you very much in advance for your help.
Cheers,
MH
java -Xmx48G -jar /path/to/fgbio/target/scala-2.11/fgbio-0.1.2-SNAPSHOT.jar \
PickIlluminaIndices \
-n 1000000 \
-o /path/to/output.txt \
--vienna-rna-dir /path/to/ViennaRNA-2.2.5/src \
-l 10 \
-e 2 \
--allow-reverses=false \
--allow-reverse-complements=false \
--allow-palindromes=false \
--max-homopolymer=2 \
--min-gc=0.0 \
--max-gc=0.7 \
-t=32 \
--min-delta-g=-10.0 \
--adapters=Nil
I tried both master and the additions on #68. In the latter case, no Exception is thrown inside each execute statement in the IndexSet.remove method, but we get:
INFO 2016-07-22 14:09:16 PickIlluminaIndicesCommand Down to 186400 indices.
INFO 2016-07-22 14:09:16 PickIlluminaIndicesCommand Down to 186300 indices.
INFO 2016-07-22 14:09:16 PickIlluminaIndicesCommand Down to 186200 indices.
INFO 2016-07-22 14:09:16 PickIlluminaIndicesCommand Down to 186100 indices.
INFO 2016-07-22 14:09:17 PickIlluminaIndicesCommand Down to 186000 indices.
INFO 2016-07-22 14:09:17 PickIlluminaIndicesCommand Down to 185900 indices.
INFO 2016-07-22 14:09:17 PickIlluminaIndicesCommand Down to 185800 indices.
[2016/07/22 14:09:17 | FgBioMain | Info] PickIlluminaIndices failed. Elapsed time: 3.20 minutes.
Exception in thread "main" java.lang.IllegalStateException: Unable to remove the index from the ranked indices: {sequence: TACTATTCCA, score: -2143048647, minEditDistance:1}
at com.fulcrumgenomics.util.PickIlluminaIndicesCommand.execute(PickIlluminaIndicesCommand.java:323)
at com.fulcrumgenomics.util.PickIlluminaIndices.execute(PickIlluminaIndices.scala:59)
at com.fulcrumgenomics.cmdline.FgBioMain.makeItSo(FgBioMain.scala:70)
at com.fulcrumgenomics.cmdline.FgBioMain.makeItSoAndExit(FgBioMain.scala:53)
at com.fulcrumgenomics.cmdline.FgBioMain$.main(FgBioMain.scala:41)
at com.fulcrumgenomics.cmdline.FgBioMain.main(FgBioMain.scala)
This does not happen each time I run the tool, so obviously there's a concurrency issue.
The following terms need to be unified in CallMolecularConsensusReads, VanillaUmiConsensusCaller, and VanillaUmiConsensusCallerOptions:
errorRatePreUmi -> errorRatePreLabeling
errorRatePostUmi -> errorRatePostLabeling
maxBaseQuality -> maxRawBaseQuality
baseQualityShift -> rawBaseQualityShift
Right now it outputs an unsorted BAM and it would be more useful to have it sorted by queryname.
I believe there was a tool called FilterConsensusReads. Is this tool still available? Is it considered necessary when dealing with consensus reads?
Currently it is at DEBUG.