hammerlab / guacamole
Spark-based variant calling, with experimental support for multi-sample somatic calling (including RNA) and local assembly
License: Apache License 2.0
I was passing '-loci chr20:11200000-11300000', but my contig was actually called "20". This should be an error.
(summary of @arahuja enumerating types of SVs we are interested in, for reference):
@arahuja seemed to think that we might be able to do a first pass of SVs before a first pass of simple INDELs, since the leading submissions seem to be doing complex things for INDELs but there is a fair amount of low-hanging fruit in SVs.
This is a nice-to-have for the synth4 submission on Saturday, August 2.
We have
def writeVariants(args: Arguments.Output, genotypes: RDD[ADAMGenotype]): Unit =
and
def writeVariantsToPath(path: String, args: ParquetArgs, genotypes: RDD[ADAMGenotype]): Unit = {
Do we need both?
I'm working on this now
When calling variants at only a subset of loci, it would be great if we could push that into an ADAM predicate to avoid loading the whole dataset. This would mean writing a predicate that takes a LociSet and a window size, and filters overlapping reads. It would also involve some refactoring, including the loadReads() and loci() functions in Common, since right now the filtering is done separately after loading.
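A rough sketch of what such a predicate could look like, assuming a LociSet with a per-contig intersection query (the onContig(...).intersects(...) API and class name here are illustrative; hooking this into ADAM's Parquet predicate machinery would be the real work):
import org.bdgenomics.guacamole.LociSet

// Hypothetical read-overlap predicate built from a LociSet and a window size.
class LociOverlapPredicate(loci: LociSet, windowSize: Long) extends Serializable {
  def overlapsRead(contig: String, start: Long, end: Long): Boolean =
    loci.onContig(contig).intersects(start - windowSize, end + windowSize)
}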
Right now only M, I, D are supported.
The test at PileupSuite:170 (branch pileup_testing_i2_seb) fails with:
Not a match, mismatch, deletion, or insertion: =
See also http://picard.sourceforge.net/javadoc/net/sf/samtools/CigarOperator.html
Maybe a git commit hook? Or a github plugin? Jenkins?
All the read indexing logic in Pileup and SlidingWindow assumes that the read's "start" field indicates where in the reference genome the first base of the read's sequence aligns. That may not be the whole story, though: when reads are soft clipped, is the read's start then set to the index of the first non-clipped base? Need to write a Pileup test that includes soft-clipped reads and fix the code if necessary.
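For reference, in the SAM convention the alignment start (POS) points at the first non-clipped base. A small sketch using htsjdk (the library choice here is just for illustration; guacamole may link against a different SAM library) demonstrates the distinction:
import htsjdk.samtools.{ SAMFileHeader, SAMRecord }

object SoftClipDemo extends App {
  val record = new SAMRecord(new SAMFileHeader())
  record.setReadString("AAAACGTACGT") // 11 bases; the first 4 are soft-clipped
  record.setCigarString("4S7M")
  record.setAlignmentStart(100) // POS: position of the first *aligned* base

  println(record.getAlignmentStart) // 100 -- skips past the soft clip
  println(record.getUnclippedStart) // 96  -- extends back over the 4S
}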
Guacamole run script should snapshot:
git prune
All of this should be logged in Cycledash.
Additionally, evaluation script runs should hook into Cycledash and be joinable on the guac-repo-states that generated the data that they're run on.
Some data model changes to support INDELs:
Add INDEL calling to some guac caller (TBD; possibly BayesianLikelihood?)
If you omit the -out argument, you currently get a bunch of JSON blobs printed to stdout giving the genotypes called. It'd be nice if that were pretty printed.
This bam is from the gatk "mini bundle" at http://gatkforums.broadinstitute.org/discussion/2936/tutorial-materials-july-2013
Running
[tim@peach ~/sinai/git/guacamole]$ scripts/guacamole ~/sinai/data/adam/wgs_b37_20.bam.adam OUT.vcf.adam -loci 20:11200000-11300000
Gives this error:
2014-04-17 19:47:08.516 java[22529:d07] Unable to load realm info from SCDynamicStore
2014-04-17 19:47:08 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
--> Starting.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
--> Loaded 0 reference fragments and 7807420 reads
--> Filtered: 7807420 reads total -> 935409 mapped and relevant reads
2014-04-17 19:50:09 WARN TaskSetManager:62 - Lost TID 282 (task 7.0:2)
2014-04-17 19:50:09 ERROR Executor:87 - Exception in task ID 282
com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 752
at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
at com.esotericsoftware.kryo.io.Output.write(Output.java:183)
at org.bdgenomics.adam.serialization.AvroSerializer.write(ADAMKryoRegistrator.scala:50)
at org.bdgenomics.adam.serialization.AvroSerializer.write(ADAMKryoRegistrator.scala:37)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:124)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:221)
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
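This is Kryo running out of serialization buffer. The usual workaround is to raise the buffer size; a minimal sketch, assuming the Spark property name of this era (guacamole also exposes a -spark_kryo_buffer_size flag for this):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "64") // default is much smaller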
Possible contenders:
Scalameter (http://scalameter.github.io/): used by http://scala-miniboxing.org/, but not sure how to connect it to the maven/build process
Google Caliper (https://code.google.com/p/caliper/): Java; used by https://github.com/scalanlp/breeze
For structural variants as in #130 we need to output the following in the VCF file:
In the alternate field:
<DUP>, <INV>, <DEL> or <INS>
In the info field: SVTYPE and END (plus SVLEN and confidence intervals, where applicable)
See http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/VCF%20(Variant%20Call%20Format)%20version%204.0/encoding-structural-variants for more information
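For reference, a symbolic deletion record under this encoding looks roughly like this (coordinates and values illustrative):
20  1234567  .  T  <DEL>  .  PASS  SVTYPE=DEL;END=1234867;SVLEN=-300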
Should probably add a DistributedUtilSuite for these.
Should test running on loci with multiple contigs and a wide range of task counts (e.g. more tasks than loci, 1 task total)
The current QualityAlignedReadsFilter removes all reads below some alignment quality threshold, but does so on a pileup-by-pileup basis. We could remove all of these reads upfront. This would avoid building pileups we will skip anyway and help with read distribution across tasks (we would have a more accurate count of the reads that will actually be processed).
The caveat is that even if we remove these reads upfront, we currently use them to compute a measure of complexity (the % of reads below some alignment quality). But this filter hasn't proved that useful, and can probably be replaced with something else.
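A minimal sketch of the proposed upfront filter, assuming guacamole's MappedRead has an alignmentQuality field (the field name is an assumption):
import org.apache.spark.rdd.RDD
import org.bdgenomics.guacamole.MappedRead

// Drop low-mapping-quality reads once, before any pileups are built,
// instead of re-filtering inside every pileup.
def filterLowQualityReads(reads: RDD[MappedRead], minAlignmentQuality: Int): RDD[MappedRead] =
  reads.filter(_.alignmentQuality >= minAlignmentQuality)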
@arahuja is working on being the right amount of suspicious about variants that, e.g., only appear on one strand, since these may be caused by errors sequencing that strand.
Especially important to verify the atGreaterLocus method in Pileup.Element
The use of takeWhile on sortedReads iterator seems to cause unexpected behavior. https://github.com/hammerlab/guacamole/blob/master/guacamole-core/src/main/scala/org/bdgenomics/guacamole/SlidingReadWindow.scala#L75
From the scala docs it seems this is not recommended -
"Reuse: After calling this method, one should discard the iterator it was called on, and use only the iterator that was returned. Using the old iterator is undefined, subject to change, and may result in changes to the new iterator as well."
http://www.scala-lang.org/api/2.10.4/index.html#scala.collection.Iterator
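A minimal demonstration of the pitfall:
val it = Iterator(1, 2, 3, 4, 5)
val prefix = it.takeWhile(_ < 3).toList // List(1, 2)
// `it` is now in an undefined state: takeWhile consumed the element 3 to
// test the predicate, so it is silently lost from the original iterator.
println(it.toList) // may print List(4, 5), not List(3, 4, 5)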
Because of the way deletions are currently handled in pileups, they have a blank variant allele and appear as below:
variant:
  contig:
    contigName = 20
  position = 16866006
  referenceAllele = T
  variantAllele =
This breaks output to VCF as the variantAllele is null and makes it difficult to compute concordance as this is not the usual representation of a deletion.
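For comparison, the conventional VCF representation anchors a deletion on the preceding reference base, e.g. (assuming, purely for illustration, that the base at 16866005 is A):
20  16866005  .  AT  A  .  PASS  .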
This invocation hangs:
scripts/guacamole threshold -loci chrM:2000-2050 -reads guacamole-core/src/test/resources/chrM.sorted.bam -threshold 0 -out /tmp/OUT.gt.adam -spark_kryo_buffer_size 64 -parallelism 3
but parallelism != 3 works (i.e. makes it to the Kryo error)
(summarizing @arahuja on the subject)
Simple Version
When many SNVs are found close together, consider dropping them under the assumption that something may have been wrong with the reads there. @arahuja has a PoC of this.
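A rough sketch of that idea (window size and neighbor threshold are illustrative, not the actual PoC):
// Drop SNV positions that have too many called neighbors within a window,
// on the theory that dense clusters reflect bad reads rather than variants.
def filterClusteredSnvs(positions: Seq[Long], window: Long = 100, maxNeighbors: Int = 2): Seq[Long] = {
  val sorted = positions.sorted
  sorted.filter { p =>
    sorted.count(q => q != p && math.abs(q - p) <= window) <= maxNeighbors
  }
}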
Complex Version
In regions where SNVs/SVs are found, run more advanced assembly algorithms, e.g. Heng Li's fermi, contrail, or other de Bruijn graph-based ones.
Now that bigdatagenomics/adam#222 is in ADAM, we have the power.
Add scripts to the "scripts" directory to download the test dataset
We need to know how long we're spending doing actual variant calling (threshold.callVariantsAtLocus) vs. Spark data shuffling, etc. One way to do this would be to log a message in callVariantsAtLocus every X loci, and aggregate those logs. Another way would be to rig up a profiler.
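A sketch of the logging approach using a Spark accumulator, so per-worker counts can be aggregated on the driver (names below are placeholders, not guacamole's actual API):
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Count loci processed on the workers; read the total back on the driver.
def logLociProgress(sc: SparkContext, loci: RDD[Long]): Unit = {
  val processed = sc.accumulator(0L)
  loci.foreach { locus =>
    processed += 1L
    // ... callVariantsAtLocus(locus) would run here ...
  }
  println(s"Called variants at ${processed.value} loci")
}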
This line
https://github.com/hammerlab/guacamole/blob/master/guacamole-core/src/main/scala/org/bdgenomics/guacamole/Pileup.scala#L242
does:
assume(locus <= read.record.end.get)
but RichADAMRecord says that it is “exclusive”
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rich/RichADAMRecord.scala#L77
// Returns the exclusive end position if the read is mapped, None otherwise
lazy val end: Option[Long] = {
So I guess we need a strict comparison here (or better naming in ADAM?)
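That is, since end is exclusive, the assertion presumably ought to be:
assume(locus < read.record.end.get)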
With parallelism=0, Guacamole seems to run successfully, but then we have a hang at the end:
threshold -loci chrM:2000-2050 -reads guacamole-core/src/test/resources/chrM.sorted.bam -threshold 0 -out /tmp/OUT.gt.adam -spark_kryo_buffer_size 64 -parallelism 0
--> [Fri May 02 14:46:44 EDT 2014]: Guacamole starting.
2014-05-02 14:46:45.971 java[12456:d07] Unable to load realm info from SCDynamicStore
2014-05-02 14:46:45 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-05-02 14:46:46 WARN ADAMContext:283 - Predicate is ignored when loading a BAM file
--> [3.94 sec. later]: Loaded 38461 reads.
--> [0.56 sec. later]: Filtered to 38461 mapped non-duplicate reads
--> [0.05 sec. later]: Considering 50 loci across 1 contig(s): chrM:2000-2050
--> [0.00 sec. later]: Collecting reads onto spark master.
--> [2.29 sec. later]: Done collecting reads.
--> [1.03 sec. later]: Writing 50 genotypes to: /tmp/OUT.gt.adam.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
--> [1.70 sec. later]: Done writing.
< ... hangs here ... >
The Args4j library (inherited from ADAM) seems pretty clunky. We should figure out how to address some of these annoyances with it, or switch to a better library if there is one (needs investigating).
This is the first thing Spark folks always suggest doing when hitting issues, so I think we should have it done when Sandy visits.
Working on this now.
Our header on every file mentions a NOTICE file, but we don't have one.
@hammer, if you have one we should use, let us know. Thanks!
From "Towards Better Understanding of Artifacts in Variant Calling from High-Coverage Samples":
"On our data with depth d ≈ 50, a maximum depth threshold between d + 3√d and d + 4√d removes many false positives with little effect on the sensitivity. These false positives are mostly caused by copy number variations (CNVs) or paralogous sequences not present in the human reference genome."
To explain further:
Reference: starting at locus 1
TCGATCGA
Read:
TCGA_CCC_TCGA, 4M3I4M
The 3bp insertion is currently considered to be at reference locus 5 and would be evaluated with the other T's at position 5, as opposed to with the A at position 4. Ideally, we'd like to present the insertion with the previous reference base.
Also, we currently create 3 Pileup.Elements, at loci 5, 6, 7, with read bases CCC, CC, and CT respectively. Clearly the last one is incorrect (or has some odd interpretation); in the current variant calling model this should be a single pileup element.
However, this probably requires some more discussion, as there are many tests that explicitly rely on this behavior: https://github.com/hammerlab/guacamole/blob/master/guacamole-core/src/test/scala/org/bdgenomics/guacamole/PileupSuite.scala#L115-L123
Not urgent. At some point we'll want to stop loading all ADAMRecord fields, and instead use parquet projections to load only the fields we actually need.
I'm guessing the performance advantage of this for file loading IO will be less important than the memory and network usage improvements.
We'll have to think about how to do this. We may want to have a way for things to register what ADAMRecord fields they care about.
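A rough sketch of what a projection might look like (class and field names per ADAM of this era are assumptions, and the exact load signature may differ):
import org.bdgenomics.adam.projections.{ ADAMRecordField, Projection }

// Materialize only the columns we actually need.
val readProjection = Projection(
  ADAMRecordField.referenceName,
  ADAMRecordField.start,
  ADAMRecordField.sequence,
  ADAMRecordField.cigar,
  ADAMRecordField.mapq
)
// Passed at load time, e.g.:
// sc.adamLoad[ADAMRecord, UnboundRecordFilter](path, projection = Some(readProjection))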
When attempting to load a guacamole-generated vcf into IGV, I get this error:
@arahuja said: "For some reason if you run it all locally all the field seems properly filled... but when run on the cluster, the GT and INFO fields come out empty"
java.lang.IllegalArgumentException: Duplicate allele added to VariantContext: A
at org.broadinstitute.variant.variantcontext.VariantContext.makeAlleles(VariantContext.java:1391)
at org.broadinstitute.variant.variantcontext.VariantContext.<init>(VariantContext.java:356)
at org.broadinstitute.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:493)
at org.bdgenomics.adam.converters.VariantContextConverter.convert(VariantContextConverter.scala:251)
at org.bdgenomics.adam.rdd.variation.ADAMVariationContext$$anonfun$1.apply(ADAMVariationContext.scala:80)
at org.bdgenomics.adam.rdd.variation.ADAMVariationContext$$anonfun$1.apply(ADAMVariationContext.scala:78)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeShard$1(PairRDDFunctions.scala:632)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$9.apply(PairRDDFunctions.scala:648)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$9.apply(PairRDDFunctions.scala:648)
...
ADAM's VCF output currently gives a directory of VCF fragments. We should support writing out a proper, single VCF file. Possibly this belongs in ADAM, not guacamole.
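One possible stopgap is to merge the fragment directory into a single file after the fact, e.g. with Hadoop's copyMerge (paths illustrative; duplicated per-fragment VCF headers would still need to be handled):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{ FileSystem, FileUtil, Path }

val conf = new Configuration()
val fs = FileSystem.get(conf)
FileUtil.copyMerge(fs, new Path("OUT.vcf.fragments"), fs, new Path("OUT.vcf"), false, conf, null)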
Instead of having boolean flags for the read's classification plus separate state, let's combine that into a single datatype: IsMatch(seq) | Deletion | Insertion(seq) | etc.
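In Scala that would be a sealed trait, roughly (constructor names from the suggestion above; fields illustrative):
sealed trait Alignment
case class IsMatch(sequence: String) extends Alignment
case class Insertion(sequence: String) extends Alignment
case object Deletion extends Alignment
// ... plus cases for mismatches, clipping, etc., replacing the boolean flags.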
Currently we collect the reads to the Spark master and call variants locally. We should implement calling variants on the Spark workers. For now the distributed implementation doesn't have to be super efficient; we just need to convince ourselves that it will be possible to do this efficiently later.
With the introduction of our own Read class, we lost support for reading ADAM data. That should be added back. We can load the data with ADAM and then have a simple function that converts each ADAM record to a guacamole Read.
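A minimal sketch of such a conversion, using a stand-in read type (guacamole's actual Read has more fields, and the ADAM getter names here are assumptions based on its Avro schema of this era):
import org.bdgenomics.adam.avro.ADAMRecord

// Stand-in for guacamole's Read, just to show the shape of the conversion.
case class SimpleRead(sequence: String, isMapped: Boolean,
                      contig: Option[String], start: Option[Long])

def fromAdamRecord(record: ADAMRecord): SimpleRead =
  SimpleRead(
    sequence = record.getSequence.toString,
    isMapped = record.getReadMapped,
    contig = Option(record.getReferenceName).map(_.toString),
    start = Option(record.getStart).map(Long2long)
  )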
There are probably some off-by-one errors in all the interval logic.
Using windowFlatMap, I see the following error when processing reads:
14/07/28 09:54:26 WARN TaskSetManager: Loss was due to java.lang.UnsupportedOperationException
java.lang.UnsupportedOperationException: empty.reduceLeft
at scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.reduceLeft(List.scala:84)
at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:195)
at scala.collection.AbstractTraversable.reduce(Traversable.scala:105)
at org.bdgenomics.adam.util.MdTag.start(MdTag.scala:292)
at org.bdgenomics.adam.util.MdTag.toString(MdTag.scala:392)
at org.bdgenomics.guacamole.MappedReadSerializer.write(Read.scala:432)
at org.bdgenomics.guacamole.MappedReadSerializer.write(Read.scala:419)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:38)
at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:34)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:101)
at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:179)
at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)