
guacamole's People

Contributors

arahuja, armish, danvk, e5c, hammer, iskandr, martijnab, ryan-williams, smondet, tavinathanson, timodonnell


guacamole's Issues

Emit Structural Variants

(Summary of @arahuja enumerating the types of SVs we are interested in, for reference; a classification sketch follows the list.)

  • translocations
    • should be straightforward using paired reads: if the pair ended up on another chromosome, you almost definitely have a translocation.
  • inversions
    • paired reads that are both from the same strand?
    • how can this happen?
  • duplication
    • @arahuja gave an example: the positive strand is mapped after the negative strand (often with no read depth immediately before the locus where the negative strand is mapped), suggesting the positive-strand reads were erroneously mapped to the wrong locus.
  • copy-number variants
    • some confusion on what this means, whether it is an overloaded term:
      • wrong number of chromosomes?
      • "repitition errors", where a segment is repeated the wrong number of times

@arahuja seemed to think we might be able to do a first pass at SVs before a first pass at simple INDELs, since the leading submissions seem to be doing complex things for INDELs while there is a fair amount of low-hanging fruit in SVs.

This is a nice-to-have for the synth4 submission on Saturday, August 2.

Multiple output variants functions

We have

def writeVariants(args: Arguments.Output, genotypes: RDD[ADAMGenotype]): Unit = 

and

def writeVariantsToPath(path: String, args: ParquetArgs, genotypes: RDD[ADAMGenotype]): Unit =

Do we need both?
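
If not, one consolidation sketch (assuming Arguments.Output can stand in for ParquetArgs and exposes the output path; the field name here is invented) is to keep writeVariantsToPath as the implementation and make writeVariants a thin wrapper:

def writeVariants(args: Arguments.Output, genotypes: RDD[ADAMGenotype]): Unit =
  writeVariantsToPath(args.outputPath, args, genotypes) // outputPath is hypothetical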

push down filtering by locus into a predicate

When calling variants at only a subset of loci, it would be great if we could push that into an ADAM predicate to avoid loading the whole dataset. This would mean writing a predicate that takes a LociSet and a window size, and filters overlapping reads. It would also involve some refactoring, including the loadReads() and loci() functions in Common, since right now the filtering is done separately after loading.
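
A sketch of the predicate's shape, with stand-in types for guacamole's LociSet and read record (the real APIs may differ):

trait LociSet { def intersects(contig: String, start: Long, end: Long): Boolean }
case class MappedRead(contig: String, start: Long, end: Long)

def lociPredicate(loci: LociSet, windowSize: Long)(read: MappedRead): Boolean =
  // Pad by windowSize so reads near a target locus still enter the sliding window.
  loci.intersects(read.contig, read.start - windowSize, read.end + windowSize)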

Handle clipped reads

All the read-indexing logic in Pileup and SlidingWindow assumes that the read's "start" field indicates where in the reference genome the first base of the read's sequence aligns. That may not be the whole story, though: when reads are soft-clipped, is the read's start set to the index of the first non-clipped base? We need a Pileup test that includes soft-clipped reads, and to fix the code if necessary.
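
For reference, the SAM spec's answer is yes: the alignment start points at the first aligned base, and leading soft clips consume read bases but not reference. A sketch of recovering the unclipped start, assuming htsjdk-style Cigar objects:

import htsjdk.samtools.{Cigar, CigarOperator}
import scala.collection.JavaConverters._

// Leading soft/hard clips do not consume reference, so the alignment start
// already points at the first aligned base; the unclipped start is earlier.
def unclippedStart(alignmentStart: Long, cigar: Cigar): Long = {
  val leadingClip = cigar.getCigarElements.asScala
    .takeWhile(e => e.getOperator == CigarOperator.SOFT_CLIP ||
                    e.getOperator == CigarOperator.HARD_CLIP)
    .map(_.getLength.toLong)
    .sum
  alignmentStart - leadingClip
}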

Make Guac runs hermetic / reproducible, logged in Cycledash

Guacamole run script should snapshot:

  • everything relevant about local repo:
    • git SHA
    • some representation of local/uncommitted changes
    • better: create a dummy commit that is pushed to origin (not necessarily on any branch, but retrievable later, assuming GitHub doesn't automatically git prune)
    • some representation of host/user the run is occurring on.
  • some or all cmdline args
    • will probably have to be aware of specific guac params, but that is OK
    • hashes of input files

All of this should be logged in Cycledash.
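
A sketch of what the snapshot could look like (helper names invented; error handling and the actual Cycledash upload are omitted):

import scala.sys.process._

def gitSha(): String = Seq("git", "rev-parse", "HEAD").!!.trim
def uncommittedDiff(): String = Seq("git", "diff", "HEAD").!!

def runMetadata(cmdlineArgs: Seq[String]): Map[String, String] = Map(
  "git_sha" -> gitSha(),
  "git_diff" -> uncommittedDiff(), // representation of local/uncommitted changes
  "host" -> java.net.InetAddress.getLocalHost.getHostName,
  "user" -> System.getProperty("user.name"),
  "args" -> cmdlineArgs.mkString(" "))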

Additionally, evaluation script runs should hook into Cycledash and be joinable on the guac-repo-states that generated the data that they're run on.

Emit simple INDELs

Some data model changes to support INDELs:

  • assign locus immediately preceding deletions
  • change data model to support [many reference base pairs] --> [many sample base pairs]

Add calling of INDELs to some guac caller TBD, possibly BayesianLikelihood?
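
For reference, the usual left-anchored representation, shown with a hypothetical Variant case class: a 2 bp deletion of "AC" following (1-based) position 100 is reported at position 100, with the preceding base included in both alleles.

case class Variant(contig: String, position: Long,
                   referenceAllele: String, alternateAllele: String)

// Deletion of "AC": many reference base pairs -> fewer sample base pairs,
// anchored at the base immediately preceding the deleted sequence.
val deletion = Variant("20", 100, "TAC", "T")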

buffer overflow when calling variants on wgs_b37_20.bam.adam

This BAM is from the GATK "mini bundle" at http://gatkforums.broadinstitute.org/discussion/2936/tutorial-materials-july-2013

Running

tim@peach ~/sinai/git/guacamole]$ scripts/guacamole ~/sinai/data/adam/wgs_b37_20.bam.adam OUT.vcf.adam -loci 20:11200000-11300000

Gives this error:

2014-04-17 19:47:08.516 java[22529:d07] Unable to load realm info from SCDynamicStore
2014-04-17 19:47:08 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
--> Starting.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
--> Loaded 0 reference fragments and 7807420 reads


--> Filtered: 7807420 reads total -> 935409 mapped and relevant reads
2014-04-17 19:50:09 WARN  TaskSetManager:62 - Lost TID 282 (task 7.0:2)
2014-04-17 19:50:09 ERROR Executor:87 - Exception in task ID 282
com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 752
        at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
        at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
        at com.esotericsoftware.kryo.io.Output.write(Output.java:183)
        at org.bdgenomics.adam.serialization.AvroSerializer.write(ADAMKryoRegistrator.scala:50)
        at org.bdgenomics.adam.serialization.AvroSerializer.write(ADAMKryoRegistrator.scala:37)
        at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
        at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318)
        at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
        at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
        at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:124)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:221)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
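
A Kryo "Buffer overflow" generally means the serializer's output buffer is smaller than a single serialized task result. One workaround worth trying (not verified against this run) is raising the buffer via the flag guacamole already exposes:

scripts/guacamole ~/sinai/data/adam/wgs_b37_20.bam.adam OUT.vcf.adam -loci 20:11200000-11300000 -spark_kryo_buffer_size 256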

Structural variant output

For structural variants as in #130 we need to output the following in the VCF file:

In the alternate field:

  • type of variant: <DUP>, <INV>, <DEL> or <INS>

In the info field:

  • CIPOS, confidence interval around start position
  • CIEND, confidence interval around end position
  • SVTYPE, the type in the alternate fields
  • END, end position of the variant
  • SVLEN, length of the variant

See http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/VCF%20(Variant%20Call%20Format)%20version%204.0/encoding-structural-variants for more information
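
An illustrative record for a 300 bp deletion (coordinates and values made up; the columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, tab-separated):

20  1234567  .  T  <DEL>  .  PASS  SVTYPE=DEL;END=1234867;SVLEN=-300;CIPOS=-10,10;CIEND=-10,10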

QualityAlignedReadsFilter is wasteful and hurts read/task distribution

The current QualityAlignedReadsFilter removes all reads below some alignment-quality threshold, but does so on a pileup-by-pileup basis. We could remove all of these reads upfront. That would avoid building pileups we will skip anyway, and would help with read distribution across tasks, since we would have an accurate count of the reads that will actually be processed.

The caveat: even after removing these reads, we currently use them to compute a measure of complexity (defined today as the fraction of reads below the alignment-quality threshold). But that measure hasn't proved very useful and can probably be replaced with something else.
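
The upfront version would be a one-liner (field name hypothetical):

// Filter once, before pileup construction, instead of per pileup.
val wellMappedReads = reads.filter(_.alignmentQuality >= minAlignmentQuality)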

SlidingWindow issue with takeWhile on sortedReads

The use of takeWhile on the sortedReads iterator seems to cause unexpected behavior: https://github.com/hammerlab/guacamole/blob/master/guacamole-core/src/main/scala/org/bdgenomics/guacamole/SlidingReadWindow.scala#L75

From the Scala docs, it seems this usage is not recommended:

"Reuse: After calling this method, one should discard the iterator it was called on, and use only the iterator that was returned. Using the old iterator is undefined, subject to change, and may result in changes to the new iterator as well."
http://www.scala-lang.org/api/2.10.4/index.html#scala.collection.Iterator
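
The documented-safe alternative is span, which returns both the consumed prefix and the remaining iterator (a sketch, not the actual SlidingReadWindow code; sortedReads would need to be a var):

val (newReads, remaining) = sortedReads.span(_.start <= currentLocus + halfWindowSize)
sortedReads = remaining // discard the original iterator; use only the returned parts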

Deletion variants have blank variant allele

Because of the way deletions are currently handled in pileups, they have a blank variant allele and appear as below:

variant:
.contig:
..contigName = 20
.position = 16866006
.referenceAllele = T
.variantAllele =

This breaks output to VCF, since the variantAllele is null, and makes it difficult to compute concordance, since this is not the usual representation of a deletion.

fix hang on chrM bam when parallelism=3

This invocation hangs:

scripts/guacamole threshold     -loci chrM:2000-2050     -reads guacamole-core/src/test/resources/chrM.sorted.bam     -threshold 0     -out /tmp/OUT.gt.adam     -spark_kryo_buffer_size 64 -parallelism 3

but parallelism != 3 works (i.e., it makes it to the Kryo error).

Apply assembly algorithms to variant-heavy regions

(summarizing @arahuja on the subject)

Simple Version
When many SNVs are found close together, consider dropping them, on the assumption that something may have been wrong with the reads there. @arahuja has a PoC of this; a sketch follows.
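
A minimal sketch of that idea (window and threshold values are illustrative; the quadratic scan is fine for a sketch):

// Drop SNVs that have too many other SNVs within `window` bases, on the
// theory that dense clusters indicate mismapped or otherwise suspect reads.
def dropDenseSnvs(positions: Seq[Long], window: Long = 100L, maxNeighbors: Int = 2): Seq[Long] =
  positions.filter { p =>
    positions.count(q => q != p && math.abs(q - p) <= window) <= maxNeighbors
  }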

Complex Version
In regions where SNVs/SVs are found, run more advanced assembly algorithms, e.g. Heng Li's fermi, Contrail, or other de Bruijn graph based assemblers.

possible off-by-one locus-check (Pileup.Element)

This line
https://github.com/hammerlab/guacamole/blob/master/guacamole-core/src/main/scala/org/bdgenomics/guacamole/Pileup.scala#L242
does:

assume(locus <= read.record.end.get)

but RichADAMRecord says that it is “exclusive”
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rich/RichADAMRecord.scala#L77

  // Returns the exclusive end position if the read is mapped, None otherwise
  lazy val end: Option[Long] = {

So it seems we need a strict comparison (or better naming in ADAM?).
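
Presumably the fix is just the strict form:

assume(locus < read.record.end.get)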

fix hang after Guacamole has run

With parallelism=0, Guacamole seems to run successfully, but then hangs at the end:

threshold     -loci chrM:2000-2050     -reads guacamole-core/src/test/resources/chrM.sorted.bam     -threshold 0     -out /tmp/OUT.gt.adam     -spark_kryo_buffer_size 64 -parallelism 0

--> [Fri May 02 14:46:44 EDT 2014]: Guacamole starting.

2014-05-02 14:46:45.971 java[12456:d07] Unable to load realm info from SCDynamicStore

2014-05-02 14:46:45 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-05-02 14:46:46 WARN  ADAMContext:283 - Predicate is ignored when loading a BAM file

--> [3.94 sec. later]: Loaded 38461 reads.
--> [0.56 sec. later]: Filtered to 38461 mapped non-duplicate reads
--> [0.05 sec. later]: Considering 50 loci across 1 contig(s): chrM:2000-2050
--> [0.00 sec. later]: Collecting reads onto spark master.
--> [2.29 sec. later]: Done collecting reads.
--> [1.03 sec. later]: Writing 50 genotypes to: /tmp/OUT.gt.adam.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
--> [1.70 sec. later]: Done writing.

< ... hangs here ... >
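
One guess worth checking (unconfirmed): a lingering non-daemon thread, e.g. Spark's, keeps the JVM alive after main returns. Explicitly shutting down at the end of the run would rule that out:

sparkContext.stop() // shut down Spark's threads so the JVM can exit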

Improve command line argument handling

The Args4j library (inherited from ADAM) seems pretty clunky. We should figure out how to address some of these annoyances with it, or switch to a better library if one exists (needs investigation):

  • The single-dash (-) arguments feel ancient. Can Args4j be made to work with GNU-style arguments?
  • Args4j sorts all the arguments alphabetically, so we lose the logical grouping of arguments, and end up with e.g. obscure things at the top.
  • Args4j truncates the help message strings for the arguments, so we can't have detailed descriptions.
  • Need to be able to have a general usage statement / example at the top of the help output for a command instead of just the plain Args4j argument summary.

Pileup should keep inserts that occur between reference (x, y) at pos x

To explain further:
Reference (starting at locus 1): TCGATCGA
Read: TCGACCCTCGA, CIGAR 4M3I4M (insertion set off for clarity: TCGA_CCC_TCGA)

The 3 bp insertion is currently considered to be at reference locus 5, and would be evaluated with the other T's at position 5 rather than the A's at position 4. Ideally, we'd present the insertion with the previous reference base.

Also, we currently create 3 Pileup.Elements, at loci 5, 6, and 7, with readBases CCC, CC, and CT respectively. Clearly the last one is incorrect (or has some odd interpretation); in the current variant-calling model this should be a single pileup element.

However, this probably requires more discussion, as many tests explicitly rely on the current behavior: https://github.com/hammerlab/guacamole/blob/master/guacamole-core/src/test/scala/org/bdgenomics/guacamole/PileupSuite.scala#L115-L123
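
With the ADT proposed in the isMatch/isDeletion/isInsertion issue below, the desired single element might look like this (hypothetical constructor; the real Pileup.Element differs):

// One element anchored at locus 4, carrying the anchor base plus the full
// inserted sequence, instead of three partial elements at loci 5, 6, 7.
val element = Pileup.Element(locus = 4, alignment = Insertion("ACCC"))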

load only the ADAMRecord fields actually used for variant calling

Not urgent. At some point we'll want to stop loading all ADAMRecord fields, and instead use Parquet projections to load only the fields we actually need.

I'm guessing the performance advantage of this for file loading IO will be less important than the memory and network usage improvements.

We'll have to think about how to do this. We may want to have a way for things to register what ADAMRecord fields they care about.
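
A sketch, assuming ADAM-era projection support (the exact enum and field names may differ from what our ADAM version exposes):

import org.bdgenomics.adam.projections.{ADAMRecordField, Projection}

val readProjection = Projection(
  ADAMRecordField.sequence,
  ADAMRecordField.start,
  ADAMRecordField.cigar,
  ADAMRecordField.readMapped)
// Passing this projection to the Parquet load restricts IO to these columns.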

SomaticThresholdVariantCallerSuite fails with a "duplicate allele" error (in the refactoring-and-optimizations branch)

java.lang.IllegalArgumentException: Duplicate allele added to VariantContext: A
at org.broadinstitute.variant.variantcontext.VariantContext.makeAlleles(VariantContext.java:1391)
at org.broadinstitute.variant.variantcontext.VariantContext.<init>(VariantContext.java:356)
at org.broadinstitute.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:493)
at org.bdgenomics.adam.converters.VariantContextConverter.convert(VariantContextConverter.scala:251)
at org.bdgenomics.adam.rdd.variation.ADAMVariationContext$$anonfun$1.apply(ADAMVariationContext.scala:80)
at org.bdgenomics.adam.rdd.variation.ADAMVariationContext$$anonfun$1.apply(ADAMVariationContext.scala:78)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeShard$1(PairRDDFunctions.scala:632)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$9.apply(PairRDDFunctions.scala:648)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$9.apply(PairRDDFunctions.scala:648)
...
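
A plausible cause (unverified): emitting a genotype whose alternate allele equals the reference allele (here both A), which ADAM's converter then adds to the VariantContext twice.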

enable proper VCF output

ADAM's VCF output currently gives a directory of VCF fragments. We should support writing out a proper VCF file. Possibly this belongs in ADAM not guacamole.
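
A stopgap until then, assuming the fragments come from one part-file per partition: coalesce to a single partition before writing (slow for large outputs, since one worker writes everything):

genotypes.coalesce(1) // one partition => one output fragment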

Make ADT for isMatch/isDeletion/isInsertion

Instead of representing a read's classification with separate boolean flags and associated state, let's combine it into a single datatype: IsMatch(seq) | Deletion | Insertion(seq) | etc.
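
One possible shape for the ADT (a sketch, not final naming):

sealed trait Alignment
case class Match(base: Char) extends Alignment
case class Mismatch(base: Char) extends Alignment
case class Insertion(sequence: String) extends Alignment
case object Deletion extends Alignment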

support parallelism > 0 when calling variants

Currently we collect the reads to the Spark master and call variants locally. We should implement calling variants on the Spark workers. For now the distributed implementation doesn't have to be super efficient; we just need to convince ourselves that it will be possible to do this efficiently later.
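
A first cut that ignores efficiency (taskForLocus and callVariants are hypothetical helpers): bucket reads by a task id derived from position, group them, and call per bucket on the workers.

val genotypes = reads
  .keyBy(read => taskForLocus(read.start)) // assign each read to a task bucket
  .groupByKey()                            // shuffle reads to their buckets
  .flatMap { case (_, taskReads) => callVariants(taskReads) }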

support loading Reads in ADAM format

With the introduction of our own Read class, we lost support for reading ADAM data. That should be added back. We can load the data with ADAM and then have a simple function that converts each ADAM record to a guacamole Read.
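
The converter's shape might be something like this (hypothetical; guacamole's actual Read fields differ, and the getters follow the Avro-generated ADAMRecord class):

def fromAdamRecord(record: ADAMRecord): Read =
  Read(
    sequence = record.getSequence.toString,
    baseQualities = record.getQual.toString,
    isMapped = record.getReadMapped) // plus start, cigar, etc. as needed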

mdTag serialization issue

Using the windowFlatMap, I see the following error when processing reads:

14/07/28 09:54:26 WARN TaskSetManager: Loss was due to java.lang.UnsupportedOperationException
java.lang.UnsupportedOperationException: empty.reduceLeft
at scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.reduceLeft(List.scala:84)
at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:195)
at scala.collection.AbstractTraversable.reduce(Traversable.scala:105)
at org.bdgenomics.adam.util.MdTag.start(MdTag.scala:292)
at org.bdgenomics.adam.util.MdTag.toString(MdTag.scala:392)
at org.bdgenomics.guacamole.MappedReadSerializer.write(Read.scala:432)
at org.bdgenomics.guacamole.MappedReadSerializer.write(Read.scala:419)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:38)
at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:34)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:101)
at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:179)
at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
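
The trace points at MdTag.start calling reduceLeft on an empty collection, which suggests we're serializing a read whose MdTag has no match/mismatch entries; a guard for empty tags in the serializer (or upstream) seems like the likely fix.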
