hammerlab / guacamole
Spark-based variant calling, with experimental support for multi-sample somatic calling (including RNA) and local assembly
License: Apache License 2.0
I was passing '-loci chr20:11200000-11300000', but my contig was actually called "20". This should be an error.
(summary of @arahuja enumerating types of SVs we are interested in, for reference):
@arahuja seemed to think that we might be able to do a first pass of SVs before a first pass of simple INDELs, since the leading submissions seem to be doing complex things for INDELs but there is a fair amount of low-hanging fruit in SVs.
This is a nice-to-have for the synth4 submission on Saturday, August 2.
We have
def writeVariants(args: Arguments.Output, genotypes: RDD[ADAMGenotype]): Unit =
and
def writeVariantsToPath(path: String, args: ParquetArgs, genotypes: RDD[ADAMGenotype]): Unit = {
Do we need both?
I'm working on this now
When calling variants at only a subset of loci, it would be great if we could push that into an ADAM predicate to avoid loading the whole dataset. This would mean writing a predicate that takes a LociSet and a window size, and filters overlapping reads. It would also involve some refactoring, including the loadReads() and loci() functions in Common, since right now the filtering is done separately after loading.
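A rough sketch of what such a predicate could look like, assuming a LociSet with a per-contig intersection query (the onContig(...).intersects(...) API and class name here are illustrative; hooking this into ADAM's Parquet predicate machinery would be the real work):
import org.bdgenomics.guacamole.LociSet

// Hypothetical read-overlap predicate built from a LociSet and a window size.
class LociOverlapPredicate(loci: LociSet, windowSize: Long) extends Serializable {
  def overlapsRead(contig: String, start: Long, end: Long): Boolean =
    loci.onContig(contig).intersects(start - windowSize, end + windowSize)
}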
Right now only M, I, D are supported.
The test at PileupSuite:170 (branch pileup_testing_i2_seb) fails with:
Not a match, mismatch, deletion, or insertion: =
See also http://picard.sourceforge.net/javadoc/net/sf/samtools/CigarOperator.html
Maybe a git commit hook? Or a github plugin? Jenkins?
All the read indexing logic in Pileup and SlidingWindow assumes that the read's "start" field indicates where in the reference genome the first base of the read's sequence aligns. That may not be the whole story, though: when reads are soft clipped, is the read's start then set to the index of the first non-clipped base? Need to write a Pileup test that includes soft-clipped reads and fix the code if necessary.
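For reference, in the SAM convention the alignment start (POS) points at the first non-clipped base. A small sketch using htsjdk (the library choice here is just for illustration; guacamole may link against a different SAM library) demonstrates the distinction:
import htsjdk.samtools.{ SAMFileHeader, SAMRecord }

object SoftClipDemo extends App {
  val record = new SAMRecord(new SAMFileHeader())
  record.setReadString("AAAACGTACGT") // 11 bases; the first 4 are soft-clipped
  record.setCigarString("4S7M")
  record.setAlignmentStart(100) // POS: position of the first *aligned* base

  println(record.getAlignmentStart) // 100 -- skips past the soft clip
  println(record.getUnclippedStart) // 96  -- extends back over the 4S
}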
Guacamole run script should snapshot:
git prune
All of this should be logged in Cycledash.
Additionally, evaluation script runs should hook into Cycledash and be joinable on the guac-repo-states that generated the data that they're run on.
Some data model changes to support INDELs:
Add INDEL calling to some guac caller (TBD; possibly BayesianLikelihood?)
If you omit the -out argument, you currently get a bunch of JSON blobs printed to stdout giving the genotypes called. It'd be nice if that were pretty printed.
This bam is from the gatk "mini bundle" at http://gatkforums.broadinstitute.org/discussion/2936/tutorial-materials-july-2013
Running
[tim@peach ~/sinai/git/guacamole]$ scripts/guacamole ~/sinai/data/adam/wgs_b37_20.bam.adam OUT.vcf.adam -loci 20:11200000-11300000
Gives this error:
2014-04-17 19:47:08.516 java[22529:d07] Unable to load realm info from SCDynamicStore
2014-04-17 19:47:08 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
--> Starting.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
--> Loaded 0 reference fragments and 7807420 reads
--> Filtered: 7807420 reads total -> 935409 mapped and relevant reads
2014-04-17 19:50:09 WARN TaskSetManager:62 - Lost TID 282 (task 7.0:2)
2014-04-17 19:50:09 ERROR Executor:87 - Exception in task ID 282
com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 752
at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
at com.esotericsoftware.kryo.io.Output.write(Output.java:183)
at org.bdgenomics.adam.serialization.AvroSerializer.write(ADAMKryoRegistrator.scala:50)
at org.bdgenomics.adam.serialization.AvroSerializer.write(ADAMKryoRegistrator.scala:37)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:124)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:221)
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
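This is Kryo running out of serialization buffer. The usual workaround is to raise the buffer size; a minimal sketch, assuming the Spark property name of this era (guacamole also exposes a -spark_kryo_buffer_size flag for this):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "64") // default is much smaller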
Possible contenders:
Scalameter (http://scalameter.github.io/): used by http://scala-miniboxing.org/, but not sure how to connect it to the maven/build process
Google Caliper (https://code.google.com/p/caliper/): Java; used by https://github.com/scalanlp/breeze
For structural variants as in #130 we need to output the following in the VCF file:
In the alternate field:
<DUP>, <INV>, <DEL> or <INS>
In the info field: SVTYPE and END (plus SVLEN and confidence intervals, where applicable)
See http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/VCF%20(Variant%20Call%20Format)%20version%204.0/encoding-structural-variants for more information
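For reference, a symbolic deletion record under this encoding looks roughly like this (coordinates and values illustrative):
20  1234567  .  T  <DEL>  .  PASS  SVTYPE=DEL;END=1234867;SVLEN=-300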
Should probably add a DistributedUtilSuite for these.
Should test running on loci with multiple contigs and a wide range of task counts (e.g. more tasks than loci, 1 task total)
The current QualityAlignedReadsFilter removes all reads below some alignment quality threshold, but does so on a pileup-by-pileup basis. We could remove all of these reads upfront. This would avoid building pileups we will skip anyway and help with read distribution across tasks (we would have a more accurate count of the reads that will actually be processed).
The caveat is that even if we remove these reads upfront, we currently use them to compute a measure of complexity (the % of reads below some alignment quality). But this filter hasn't proved that useful, and can probably be replaced with something else.
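A minimal sketch of the proposed upfront filter, assuming guacamole's MappedRead has an alignmentQuality field (the field name is an assumption):
import org.apache.spark.rdd.RDD
import org.bdgenomics.guacamole.MappedRead

// Drop low-mapping-quality reads once, before any pileups are built,
// instead of re-filtering inside every pileup.
def filterLowQualityReads(reads: RDD[MappedRead], minAlignmentQuality: Int): RDD[MappedRead] =
  reads.filter(_.alignmentQuality >= minAlignmentQuality)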
@arahuja is working on being the right amount of suspicious about variants that, e.g., only appear on one strand, since these may be caused by errors sequencing that strand.
Especially important to verify the atGreaterLocus method in Pileup.Element
The use of takeWhile on sortedReads iterator seems to cause unexpected behavior. https://github.com/hammerlab/guacamole/blob/master/guacamole-core/src/main/scala/org/bdgenomics/guacamole/SlidingReadWindow.scala#L75
From the scala docs it seems this is not recommended -
"Reuse: After calling this method, one should discard the iterator it was called on, and use only the iterator that was returned. Using the old iterator is undefined, subject to change, and may result in changes to the new iterator as well."
http://www.scala-lang.org/api/2.10.4/index.html#scala.collection.Iterator
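A minimal demonstration of the pitfall:
val it = Iterator(1, 2, 3, 4, 5)
val prefix = it.takeWhile(_ < 3).toList // List(1, 2)
// `it` is now in an undefined state: takeWhile consumed the element 3 to
// test the predicate, so it is silently lost from the original iterator.
println(it.toList) // may print List(4, 5), not List(3, 4, 5)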
Because of the way deletions are currently handled in pileups, they have a blank variant allele and appear as below:
variant:
  contig:
    contigName = 20
  position = 16866006
  referenceAllele = T
  variantAllele =
This breaks output to VCF as the variantAllele is null and makes it difficult to compute concordance as this is not the usual representation of a deletion.
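For comparison, the conventional VCF representation anchors a deletion on the preceding reference base, e.g. (assuming, purely for illustration, that the base at 16866005 is A):
20  16866005  .  AT  A  .  PASS  .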
This invocation hangs:
scripts/guacamole threshold -loci chrM:2000-2050 -reads guacamole-core/src/test/resources/chrM.sorted.bam -threshold 0 -out /tmp/OUT.gt.adam -spark_kryo_buffer_size 64 -parallelism 3
but parallelism != 3 works (i.e. makes it to the Kryo error)
(summarizing @arahuja on the subject)
Simple Version
When many SNVs are found close together, consider dropping them under the assumption that something may have been wrong with the reads there. @arahuja has a PoC of this.
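A rough sketch of that idea (window size and neighbor threshold are illustrative, not the actual PoC):
// Drop SNV positions that have too many called neighbors within a window,
// on the theory that dense clusters reflect bad reads rather than variants.
def filterClusteredSnvs(positions: Seq[Long], window: Long = 100, maxNeighbors: Int = 2): Seq[Long] = {
  val sorted = positions.sorted
  sorted.filter { p =>
    sorted.count(q => q != p && math.abs(q - p) <= window) <= maxNeighbors
  }
}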
Complex Version
In regions where SNVs/SVs are found, run more advanced assembly algorithms, e.g. Heng Li's fermi, contrail, or other de Bruijn graph-based ones.
Now that bigdatagenomics/adam#222 is in ADAM, we have the power.
Add scripts to the "scripts" directory to download the test dataset
We need to know how long we're spending doing actual variant calling (threshold.callVariantsAtLocus) vs. Spark data shuffling, etc. One way to do this would be to log a message in callVariantsAtLocus every X loci, and aggregate those logs. Another way would be to rig up a profiler.
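A sketch of the logging approach using a Spark accumulator, so per-worker counts can be aggregated on the driver (names below are placeholders, not guacamole's actual API):
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Count loci processed on the workers; read the total back on the driver.
def logLociProgress(sc: SparkContext, loci: RDD[Long]): Unit = {
  val processed = sc.accumulator(0L)
  loci.foreach { locus =>
    processed += 1L
    // ... callVariantsAtLocus(locus) would run here ...
  }
  println(s"Called variants at ${processed.value} loci")
}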
This line
https://github.com/hammerlab/guacamole/blob/master/guacamole-core/src/main/scala/org/bdgenomics/guacamole/Pileup.scala#L242
does:
assume(locus <= read.record.end.get)
but RichADAMRecord says that it is “exclusive”
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rich/RichADAMRecord.scala#L77
// Returns the exclusive end position if the read is mapped, None otherwise
lazy val end: Option[Long] = {
So I guess we need a strict comparison here (or better naming in ADAM?)
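That is, since end is exclusive, the assertion presumably ought to be:
assume(locus < read.record.end.get)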
With parallelism=0, Guacamole seems to run successfully, but then we have a hang at the end:
threshold -loci chrM:2000-2050 -reads guacamole-core/src/test/resources/chrM.sorted.bam -threshold 0 -out /tmp/OUT.gt.adam -spark_kryo_buffer_size 64 -parallelism 0
--> [Fri May 02 14:46:44 EDT 2014]: Guacamole starting.
2014-05-02 14:46:45.971 java[12456:d07] Unable to load realm info from SCDynamicStore
2014-05-02 14:46:45 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-05-02 14:46:46 WARN ADAMContext:283 - Predicate is ignored when loading a BAM file
--> [3.94 sec. later]: Loaded 38461 reads.
--> [0.56 sec. later]: Filtered to 38461 mapped non-duplicate reads
--> [0.05 sec. later]: Considering 50 loci across 1 contig(s): chrM:2000-2050
--> [0.00 sec. later]: Collecting reads onto spark master.
--> [2.29 sec. later]: Done collecting reads.
--> [1.03 sec. later]: Writing 50 genotypes to: /tmp/OUT.gt.adam.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
--> [1.70 sec. later]: Done writing.
< ... hangs here ... >
The Args4j library (inherited from ADAM) seems pretty clunky. We should figure out how to address some of these annoyances with it, or switch to a better library if there is one (needs investigating).
This is the first thing Spark folks always suggest doing when hitting issues, so I think we should have it done when Sandy visits.
Working on this now.
Our header on every file mentions a NOTICE file, but we don't have one.
@hammer, if you have one we should use, let us know. Thanks!
From "Towards Better Understanding of Artifacts in Variant Calling from High-Coverage Samples":
"On our data with depth d ≈ 50, a maximum depth threshold between d + 3√d and d + 4√d removes many false positives with little effect on the sensitivity. These false positives are mostly caused by copy number variations (CNVs) or paralogous sequences not present in the human reference genome."
To explain further:
Reference: starting at locus 1
TCGATCGA
Read:
TCGA_CCC_TCGA, 4M3I4M
The 3bp insertion is currently considered to be at reference locus 5 and would be evaluated with the other T's at position 5, as opposed to with the A at position 4. Ideally, we'd like to present the insertion with the previous reference base.
Also, we currently create 3 Pileup.Elements, at loci 5, 6, 7, with read bases CCC, CC, and CT respectively. Clearly the last one is incorrect (or has some odd interpretation); in the current variant calling model this should be a single pileup element.
However, this probably requires some more discussion, as there are many tests that explicitly rely on this behavior: https://github.com/hammerlab/guacamole/blob/master/guacamole-core/src/test/scala/org/bdgenomics/guacamole/PileupSuite.scala#L115-L123
Not urgent. At some point we'll want to stop loading all ADAMRecord fields, and instead use parquet projections to load only the fields we actually need.
I'm guessing the performance advantage of this for file loading IO will be less important than the memory and network usage improvements.
We'll have to think about how to do this. We may want to have a way for things to register what ADAMRecord fields they care about.
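A rough sketch of what a projection might look like (class and field names per ADAM of this era are assumptions, and the exact load signature may differ):
import org.bdgenomics.adam.projections.{ ADAMRecordField, Projection }

// Materialize only the columns we actually need.
val readProjection = Projection(
  ADAMRecordField.referenceName,
  ADAMRecordField.start,
  ADAMRecordField.sequence,
  ADAMRecordField.cigar,
  ADAMRecordField.mapq
)
// Passed at load time, e.g.:
// sc.adamLoad[ADAMRecord, UnboundRecordFilter](path, projection = Some(readProjection))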
When attempting to load a guacamole-generated vcf into IGV, I get this error:
@arahuja said: "For some reason if you run it all locally all the field seems properly filled... but when run on the cluster, the GT and INFO fields come out empty"
java.lang.IllegalArgumentException: Duplicate allele added to VariantContext: A
at org.broadinstitute.variant.variantcontext.VariantContext.makeAlleles(VariantContext.java:1391)
at org.broadinstitute.variant.variantcontext.VariantContext.<init>(VariantContext.java:356)
at org.broadinstitute.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:493)
at org.bdgenomics.adam.converters.VariantContextConverter.convert(VariantContextConverter.scala:251)
at org.bdgenomics.adam.rdd.variation.ADAMVariationContext$$anonfun$1.apply(ADAMVariationContext.scala:80)
at org.bdgenomics.adam.rdd.variation.ADAMVariationContext$$anonfun$1.apply(ADAMVariationContext.scala:78)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeShard$1(PairRDDFunctions.scala:632)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$9.apply(PairRDDFunctions.scala:648)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$9.apply(PairRDDFunctions.scala:648)
...
ADAM's VCF output currently gives a directory of VCF fragments. We should support writing out a proper, single VCF file. Possibly this belongs in ADAM, not guacamole.
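One possible stopgap is to merge the fragment directory into a single file after the fact, e.g. with Hadoop's copyMerge (paths illustrative; duplicated per-fragment VCF headers would still need to be handled):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{ FileSystem, FileUtil, Path }

val conf = new Configuration()
val fs = FileSystem.get(conf)
FileUtil.copyMerge(fs, new Path("OUT.vcf.fragments"), fs, new Path("OUT.vcf"), false, conf, null)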
Instead of having boolean flags for the read's classification plus separate state, let's combine that into a single datatype: IsMatch(seq) | Deletion | Insertion(seq) | etc.
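In Scala that would be a sealed trait, roughly (constructor names from the suggestion above; fields illustrative):
sealed trait Alignment
case class IsMatch(sequence: String) extends Alignment
case class Insertion(sequence: String) extends Alignment
case object Deletion extends Alignment
// ... plus cases for mismatches, clipping, etc., replacing the boolean flags.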
Currently we collect the reads to the Spark master and call variants locally. We should implement calling variants on the Spark workers. For now the distributed implementation doesn't have to be super efficient; we just need to convince ourselves that it will be possible to do this efficiently later.
With the introduction of our own Read class, we lost support for reading ADAM data. That should be added back. We can load the data with ADAM and then have a simple function that converts each ADAM record to a guacamole Read.
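A minimal sketch of such a conversion, using a stand-in read type (guacamole's actual Read has more fields, and the ADAM getter names here are assumptions based on its Avro schema of this era):
import org.bdgenomics.adam.avro.ADAMRecord

// Stand-in for guacamole's Read, just to show the shape of the conversion.
case class SimpleRead(sequence: String, isMapped: Boolean,
                      contig: Option[String], start: Option[Long])

def fromAdamRecord(record: ADAMRecord): SimpleRead =
  SimpleRead(
    sequence = record.getSequence.toString,
    isMapped = record.getReadMapped,
    contig = Option(record.getReferenceName).map(_.toString),
    start = Option(record.getStart).map(Long2long)
  )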
There are probably some off-by-one errors in all the interval logic.
Using windowFlatMap, I see the following error when processing reads:
14/07/28 09:54:26 WARN TaskSetManager: Loss was due to java.lang.UnsupportedOperationException
java.lang.UnsupportedOperationException: empty.reduceLeft
at scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.reduceLeft(List.scala:84)
at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:195)
at scala.collection.AbstractTraversable.reduce(Traversable.scala:105)
at org.bdgenomics.adam.util.MdTag.start(MdTag.scala:292)
at org.bdgenomics.adam.util.MdTag.toString(MdTag.scala:392)
at org.bdgenomics.guacamole.MappedReadSerializer.write(Read.scala:432)
at org.bdgenomics.guacamole.MappedReadSerializer.write(Read.scala:419)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:38)
at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:34)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:101)
at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:179)
at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)