<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

I've added functionality to allow annotation of any <a href="https://github.com/lunter

Adding AD Field in VCF _and_ info about VAF_CR field,about luntergroup/octopus

Comments (10)

dancooke commented on August 30, 2024

Hi Chris,

Thanks for your interest. The VAF_CR field is a credible interval for the variant allele frequency as calculated from the calling model used for inference (i.e. not empirically). You can read more about it here. Note the MAP_VAF is a new annotation that is not in v0.4.1-alpha.

As for your question regarding an allele depth annotation - I have always been a little hesitant to include this type of field in regular output as it is not always well understood what it means, or how it is calculated, which can lead to confusion and misuse.

What is the value of having the allele depth field? I can think of two uses:

For further inference, like in your case where you want to infer the VAF.
To filter variant calls.

In both these cases there are better alternatives to an allele depth field because allele depth does not contain any measure of uncertainty. To calculate allele depth, we need some model of how to assign reads to alleles, but the model used for calling variants in the first instance should always be better than the one used to calculate allele depth. Allele depth will likely be calculated either by: 1) explicitly checking each read sequence for each called allele, hopefully against realigned reads. 2) Using maximum likelihood. The first is not a good method since it ignores alignment errors. The latter is better but is still contains no measure of uncertainty, and also assumes the calls made are correct. The model used for calling alleles in the first instance should suffer neither of these problems as it should always be able to consider more possible alleles than the number called (the ploidy), and ideally does not make a hard assignment of reads to these possibilities; it should assign probabilities instead.

So for either uses, there should always be more informative statistics we can derive from the calling model. Note this will not be true for other statistics that are not modelled during calling (e.g. strand bias in Octopus). Octopus provides VAF_CR and MAP_VAF (in upcoming v0.5-beta) for somatic VAF inference. These are derived from the calling model not empirically. Most importantly they indicate the uncertainty in the VAF; a point estimate of VAF is almost useless if the credible interval is very wide. For filtering, there should always be better statistics which can be derived directly from the calling model than an empirical allele depth (e.g. QUAL and GQ).

Having said all that, Octopus does actually calculate allele depth statistics, using maximum likelihood. These are used for filtering (e.g. AF is available in v0.4.1, and AD will appear in v0.5). The first reason for this is that they are sometimes more convenient than other calling model based statistics when using threshold filters. But more importantly, they are another data point for use with machine learning filtering methods, which is the direction I'm trying to take Octopus (e.g. random forest filtering).

Another case where Octopus implicitly computes allele depth is in the split evidence BAM. Here reads are assigned to haplotypes using maximum likelihood and output to separate BAMs. Currently this feature only works for single-sample calling but I intend to extend this to multi-sample calling, although probably not in time for the next release.

I hope this is all clear. In summary, if you just want a point estimate of VAF, then use MAP_VAF, or the median of VAF_CR until v0.5-beta. But ideally at least look at VAF_CR too as this tells you the uncertainty in MAP_VAF. Ultimately what I may do is just report the complete posterior distribution for the VAFs - a Dirichlet distribution.

Best,
Dan

from octopus.

ChristopheLegendre commented on August 30, 2024

Hi Dan,

Thanks for your reply. This is a great explanation that makes it clearer for people like me with a more biological-background than a statistical-background.

Note: I totally forgot to look at the Wiki page; Thanks for reminding me about it. Sorry about that; I would not have asked the question.

Yes, we use AD and VAF for filtering variants in our Somatic pipeline. That is mainly why I asked for AD in the VCF, in order to keep consistency in our analysis. Also, AD would be useful for us when having multi tumor samples from the same individual.

Of course, as you pointed out that AD and VAF have some caveats and alternatives exist; we will be glad to get the best of what Octopus has to offer in order to improve our variant calling, and we might consider the new models to filter out our calls in our next pipeline iteration.

In the meantime, AD and VAF will be very useful for us and we are eager to use the version v0.5. Thanks for adding it 👍

Best,
Chris.

from octopus.

ChristopheLegendre commented on August 30, 2024

Hi Dan,

Congratulations on the release of your article. Really great work.

I know I closed that issue a while ago hoping that AD would have been added into the 0.5 release version.
I just ran the version 0.5.2, and even though you describe AD in the variant filtering section of the user manual, I could not see the AD field in the output vcf.

I know that newer ways exist to filter variants within Octopus as you described above, but this AD field has been used quite often for filtering. As a reference and for back-compatibility, it would be useful to have that field.

So, instead of adding the flag permanently, may I suggest the use of an option that enables adding this field to the FORMAT columns for each sample? That way, only users interested in that field will have it in the VCF.

Thanks for considering my suggestion,
Best,
Chris.

from octopus.

dancooke commented on August 30, 2024

Hi Chris,

Yes, the AD field can now be used for variant filtering. It's also possible to get the value in the INFO column by using the --training-annotations command (e.g. --training-annotations AD), however, this command is primarily intended for training the random forest classifier (and debugging), as it clears existing INFO fields. I'm working on moving the sample-level annotations to the FORMAT columns.

I'll have a look at annotating the VCF with user-requested annotations, although this would likely only be ones used for filtering.

Cheers,
Dan

from octopus.

ChristopheLegendre commented on August 30, 2024

Hi Dan, Happy New Year to You. I just saw you have released a new version of Octopus (0.5.3). Thanks. I know that the issue for adding the AD in either INFO or FORMAT columns is still open in the list of issues, but did you add that AD in the latest version 0.5.3 yet? Unless I am mistaken, it is not there in the version 0.5.2-Beta AD is used quite often in our lab and All the variant callers we have used so far have information about AD, or a way of getting that information. I am sorry to ask that again :-( :-) Thanks in advance Chris

from octopus.

dancooke commented on August 30, 2024

Hi Chris,

I didn't have the chance to address this for v0.5.3-beta, which was mostly bug fixes and installation improvements. Hopefully for the next release...

Dan

from octopus.

ChristopheLegendre commented on August 30, 2024

Hi Dan,
:-)
Thanks, I am looking forward to it...
Cheers,
Chris

from octopus.

dancooke commented on August 30, 2024

I've added functionality to allow annotation of any measures used for filtering in 4d7fa77 (available on develop branch). I've modified the command-line interface slightly to accommodate this: --training-annotations is renamed --annotations. You can either request all annotations by just adding the --annotations option, or select specific annotations (e.g. --annotations AD AF SB). Note that the requested annotations must be used as part of the filtering classifier. For example, the default germline filter expression (i.e. --filter-expresion) is "QUAL < 10 | MQ < 10 | MP < 10 | AF < 0.05 | SB > 0.98 | BQ < 15 | DP < 1", so you can only request the annotations MQ, MP, AF, SB, BQ, DP. If you're using random forest filtering (recommended) then this includes all possible measures so you can request any annotation.

Let me know if you have any trouble.
Dan

from octopus.

ChristopheLegendre commented on August 30, 2024

Hi Dan,

Thanks for having worked on it.

I installed the "develop" branch of Octopus.
Running the command octopus --help, I could see the new option --annotations like this:
--annotations [=arg(=active)] Annotations to write to final VCF.

So I guess I used the correct branch to install octopus :-) and call somatic variant with the options you suggested: --annotations AD AF SB

part1) using `--annotations AD AF SB`

I have some trouble finding the flags (AD and AF) in the final VCF;
But it worked for the flag SB. SB got added to all the vcf records and to the vcf HEADER.

for `SB`

Line added in header is:
##FORMAT=<ID=SB,Number=1,Type=String,Description="Strand bias of reads based on haplotype support">
SB added to record:
22 51080757 . A G 28.04 DP AC=1;AN=5;DP=8;MP=22.61;MQ=60;MQ0=0;NS=2;PP=2.96;SOMATIC GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:SB:FT 0|0|1:118:5:60:51080757:99:0.38:0.14,0.64:0.678:PASS 0|0:118:3:60:51080757:99:.:.,.:.:DP

note: SB has value only in one of the two samples in my vcf; I called somatic variants with octopus and only the Tumor sample got a SB value; the other got "." a dot;
Q1: is it expected? or should all the samples in the vcf have an SB value?

for `AF`

I assume that it is the same as the flag MAP_VAF; (you previously mentionned MAP_VAF in these issue above ) So it was already there by default under the name MAP_VAF and not AF;
Q2: Am I correct to assume that MAP_VAF is identical to AF in octopus?

for `AD`

Allele_Depth (AD) is absent from the VCF ; It has neither been written in the vcf header nor in the records.

part2) using `--annotations`

Then I ran Octopus with option--annotations only, and the following flags where added to the FORMAT field: SMQ:SB:SD:BQ:MF:FRF:SOMATIC:REFCALL, but no AD or AF.

My command line is:
octopus --caller population --reference GRCh37.fa --reads T.chr22.bam N.chr22.bam --normal-sample NORMAL --threads 18 --max-reference-cache-footprint 4GB --target-read-buffer-footprint 6GB --working-directory octopus_22 --regions 22 --output octopus_chr22_TUMOR.vcf --ignore-unmapped-contigs --assembly-candidate-generator OFF --somatics-only OFF --phasing-level minimal --legacy --bamout realigned_bams --annotations

Q3: Is there any option I shoud use specifically to get the falg AD added to the vcf?

Q4: Is there a way to know for sure that I installed and used the correct develop version of Octopus? (sorry for this trivial question, but octopus --version returned octopus 0.5.3-beta, so I am afraid I might have installed the wrong version )

Hope it helps.
Let me know if you need further information

Thanks,
Best,
Chris

from octopus.

dancooke commented on August 30, 2024

Hi Chris,

The annotationAD is not output because it is not used by any of the threshold classifiers that are used for filtering somatic calls (i.e. --filter-expression, --somatic-filter-expression, and --refcall-filter-expression). You should have received a warning when you requested this annotation specifically. A simple way to get AD would be to add AD < 1 to --filter-expression or --somatic-filter-expression. Alternatively, you could just use random forest filtering, as this implicitly uses all annotations, and is the recommend approach for filtering somatic calls.

The SB annotation is missing (i.e. .) for the normal sample in your example because it was not genotyped for the alternative allele, so the strand bias is not calculable.

AF is not the same as MAP_VAF for two reasons:

AF is empirically calculated by hard assigning reads to called haplotypes, while MAP_VAF is probabilistically calculated using the core Variational Bayes calling model. This goes back to what I described in my first post on this issue.
By default, AF will be calculated using all well-formed reads overlapping the region where-as MAP_VAF is calculated using filtered reads. Filtered reads can also be used for variant call filtering by including the --use-calling-reads-for-filtering option.

Your point about the version number is a good one - I'll see if I can add more informative version numbers.

Out of interest, what was your reason for including --phasing-level minimal and --assembly-candidate-generator OFF in your command? Finally, a minor point, but --somatics-only does not require an argument as it is a command line flag.

Best
Dan

from octopus.

Adding AD Field in VCF _and_ info about VAF_CR field about octopus HOT 10 CLOSED

Comments (10)

part1) using `--annotations AD AF SB`

for `SB`

for `AF`

for `AD`

part2) using `--annotations`

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Comments (10)

part1) using --annotations AD AF SB

for SB

for AF

for AD

part2) using --annotations

Related Issues (20)

Recommend Projects

Recommend Topics

Recommend Org

part1) using `--annotations AD AF SB`

for `SB`

for `AF`

for `AD`

part2) using `--annotations`