Giter VIP home page Giter VIP logo

Comments (9)

ctsa avatar ctsa commented on September 2, 2024

Thanks for your feedback Brad,

I agree this is a good opportunity to make our output easier to interpret.

We're currently using (almost) the same somatic calling output format from the original Strelka release in 2012. At that time, the objections to filling in the GT field directly were:

  1. For the tumor sample, expressing a somatic variant as a heterozygous variant, ie. 0/1, is misleading and somewhat limiting when we encounter more complex cases.
  2. For the normal sample, the most likely genotype is a useful artifact of the somatic variant modeling process, but the normal sample model in the somatic caller is less complete/less accurate than the dedicated germline caller.

I think the practical benefit of simpler interpretation for most somatic calls and moving closer to a community standard should outweigh these reservations. Let me start the conversation here and then see if we can move to a broader discussion.

from strelka.

chapmanb avatar chapmanb commented on September 2, 2024

Chris -- brilliant, thank you for considering this. I'm agreed on the downsides, VCF representation here is imperfect. However, I'm also agreed on the upside that is helps to have a consistent way of representing calls across different methods. We use separate germline calling for identifying pre-existing mutations from the normal, and strictly use the normal in the pair for somatic detection purposes so shouldn't have any issues with your second point. Thanks again for the discussion and thoughts.

from strelka.

ctsa avatar ctsa commented on September 2, 2024

I had an initial discussion with Sangtae and others on this yesterday. We have some preliminary proposals but I also want to get more detail from you on which tools are asserting that a genotype is required (below).

Our preliminary thoughts on this:

  • In the context of highly mutable loci (STRs…), we’re moving away from explicitly genotyping all alleles, even in germline calling, towards a unary model. In this model we can make an assertion on a given allele and then provide counts/likelihoods for “all other alleles” as a pool.
  • Implicitly, most somatic callers are already operating in a mode closer to the unary model.
    • Significant simplifications for somatic calling:
      • Don’t have to represent multiple candidate somatic alleles on the same record (allowing separate filters, etc….)
      • Don’t have to resolve transitive overlap of multiple (germline and/or candidate somatic) indel alleles into a single merged VCF record.
        • As a result of the above, we don’t have to represent the somatic allele as part of a VCF record at a different position.
        • Filters, again, become complicated in any VCF record merging multiple overlapping indel alleles, so advantageous to avoid this.

This leads us down a road of some proposals that could be very simple (Normal GT is always “./.” and Tumor GT is always “./1”), and others that may take some time to gain consensus on and/or implement.

For the short term, another side of this question we should be examining is whether the downstream tools in question have a good reason to require a GT field for somatic alleles?

If you could provide a short list of a few major tools that have this requirement I’d be interested in following up on this discussion.

from strelka.

chapmanb avatar chapmanb commented on September 2, 2024

Chris;
Thanks much for this proposal. The assumptions and simplifications all make sense to me. In my mind this is similar to how we collapse likelihoods into explicit genotype calls -- the likelihoods are an actual representation of what is going on but to productively communicate them we need to make a decision at some point and represent it to users.

The main place where I caught missing genotypes being an issue was in rtg vcfeval (https://github.com/RealTimeGenomics/rtg-tools). We could work around this with --squash-ploidy as well but I think it's a good point to include them as downstream users are also used to looking at them.

For your simplified proposal, why do you prefer the half no calls instead of 0/0, 0/1 ref calls? For somatic calls we're asserting that it's present in tumor and not normal so seems like we have enough evidence to classify as reference at this position, even if it's not an explicit germline call. Half calls have been problematic in the past, although I haven't used them recently (not much CGI data these days). I'd like to have the imperfect representation be something matching what most other callers do.

from strelka.

ctsa avatar ctsa commented on September 2, 2024

Thanks Brad for your comments.

It might be best to motivate my above points with a concrete example. The image below shows a segment from a real tumor cell line (normal on top, tumor below). We have orthogonal data suggesting that the 2 base deletion in the tumor sample is likely to be a real somatic variant, and the normal is heterzygous for a 1 base deletion. As our STR models improve these types of examples are not difficult to find.

image

In terms of VCF representation, as you point out, my aside on the simplest possible solution (using unknown/half genotypes) is not ideal. What we want in principal is an extension of the <*> allele introduced for gVCF representation in VCF 4.3, but it would need to include not only "all unspeci fied alternate alleles", but also the reference. If we had such an abstract alternate allele then the above somatic call can be represented, but this would break the AD/ADF/ADR counts, because this symbolic ALT would include REF counts.

This suggests either just using the "reference allele", as understood to represent all alleles besides the somatic allele, OR working with custom tags outside of the regular FORMAT/GT tag system (as strelka originally elected to do).

Note in the former proposal, the putative somatic indel in the above IGV snapshot would be reported as REF=ATT, ALT=A, NORMAL/GT=0/0, TUMOR/GT=0/1, where (outside the letter of the VCF spec) we understand that allele "0" means the union of the reference and all deletion alleles besides the 2 base deletion here. I'd be curious to hear your feedback on such a representation?

from strelka.

chapmanb avatar chapmanb commented on September 2, 2024

Chris;
Thanks for the example and detailed explanation. Your proposal for having 0 represent reference as anything non-somatic seems reasonable and inline with what is happening implicitly otherwise. As a discussion point, if we tried to represent this explicitly I guess it would be something like:

REF=ATT, ALT=AT,A, NORMAL/GT=0/1, TUMOR/GT=1/2

but TUMOR/GT should really be a mix of 1/2 and 0/2 (and I guess that 3 and 4 bp deletion) so I see your point in making 0 represent anything non-normal. I think there is value in also capturing this in the INFO fields so we have the explicit calls even if GT has the simplified approach. Thanks again for the examples and great discussion on this.

from strelka.

mkkuhner avatar mkkuhner commented on September 2, 2024

I am using Strelka upstream of snpEff to determine the predicted severity of mutations in tumor. For this purpose I want to know what changed in going from the germline to the tumor: NOT what changed in going from ref to tumor. I am frustrated that every tool I try focuses on the latter. (snpEff itself does not handle cases where ref was A, germline was GG, tumor was AG correctly in non-coding regions; it refuses to admit that a mutation has happened....)

So I'd like to put in a plug for not losing this information.

from strelka.

DarioS avatar DarioS commented on September 2, 2024

Interesting example. I thought homopolymers were sequences that Illumina's sequencing technology struggled to correctly sequence. Are such results published in a journal yet?

from strelka.

famosab avatar famosab commented on September 2, 2024

I also ran into this when trying to evaluate the vcfs with rtg vcfeval. I used the following command:

rtg vcfeval --threads 12 --ref-overlap --all-records --output-mode ga4gh --baseline results/variants/imgag-somatic-10perc.truth.cov-medium.vcf.gz --calls results/stratified-variants/sarek-strelka-vep-10perc/medium.vcf.gz --output results/vcfeval/sarek-strelka-vep-10perc/medium --template resources/reference/genome-sdf --squash-ploidy --sample truth,TUMOR

which always lead to:

Error: VCF record does not contain GT field

Is there any hint, how I could handle this (or even adapt the strelka output) in regards to benchmarking results of different callers / mappers?

from strelka.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.