Giter VIP home page Giter VIP logo

Comments (14)

ESDeutekom avatar ESDeutekom commented on August 26, 2024 1

Dear @pgrosu and @kishwarshafin,

Thank you very much for the added information. This certainly helps me along.

Kind regards,

from deepvariant.

depristo avatar depristo commented on August 26, 2024

We'd be happy to exclude commonly known non-analyzable contigs on b38. Do you have a proposed list of such contig names, along with why they should be excluded?

The MT: we don't expect, like most variant callers, to properly handle the structure of genetic variation within the mitochrondira. You can read a bit about the challenges of MT calling here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4201154/

from deepvariant.

jjfarrell avatar jjfarrell commented on August 26, 2024

Thanks! Attached is a list from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

This list has all the GRCh38 contigs after dropping chr1-chr22, chrX, chrY and chrM.

GRCh38 has lots of additional contigs which are described here:
https://software.broadinstitute.org/gatk/documentation/article.php?id=7857
https://software.broadinstitute.org/gatk/blog?id=8180

GRCh38.exclude.contigs.txt

from deepvariant.

depristo avatar depristo commented on August 26, 2024

Thanks @jjfarrell. We've added a bug and will fix for the next release.

from deepvariant.

rpoplin avatar rpoplin commented on August 26, 2024

Nice! This is fixed now and will go out with the next release of DeepVariant. Thanks for the great suggestion.

from deepvariant.

PedalheadPHX avatar PedalheadPHX commented on August 26, 2024

@rpoplin @depristo Is there a specific reason to exclude all of the super contigs in GRCh38 from deepvariant analysis? I agree with removing the HLA, EBV, M/MT and decoys, but removing the placed and unplaced contigs doesn't seem justified. Other callers do work on these regions so it doesn't seem like they are truly "Common problematic contigs on GRCh38" as stated in the release notes for v0.5.0.

Does it affect training or calling if they were included? I'd suggest the user should be able to provide the list of regions to exclude or include as its taken a bit of work to sort out why deepvariant is not making calls on all contigs in the provided BAM files like a number of other callers we are testing

from deepvariant.

pichuan avatar pichuan commented on August 26, 2024

Thanks @PedalheadPHX
Reopening this bug so it's easier to track that we have an unanswered question.
Also assigning to @AndrewCarroll

from deepvariant.

AndrewCarroll avatar AndrewCarroll commented on August 26, 2024

Hi @jjfarrell

I hadn't encountered users who meaningfully used the unplaced, but non-decoy contigs. The exclude list is current coded into DeepVariant and not modifiable. However, I agree that it would be reasonable to assess adding those in.

The main reason a decoy or unplaced contig should be excluded is if it aggregates a large amount of incorrectly mapped coverage. It makes sense for us to investigate how much the placed and unplaced contigs are affected by that and decide based on those results.

Training will not be affected, because these contigs do not have truth label variants that I am aware of. The model will be the same.

We'll consider this a potential change for the next release, which will hopefully be in the next month or two. I hope that timing will be adequate.

from deepvariant.

ESDeutekom avatar ESDeutekom commented on August 26, 2024

Dear @AndrewCarroll ,

I know this issue was closed, but I did not want to open a new one for a question this trivial. Do I understand correctly the team has now implemented some form of alt contig consideration, as the documentation now says: "Second, the BAM file provided to --reads should be aligned to a "compatible" version of the genome reference provided as the --ref. By compatible here we mean the BAM and FASTA share at least a reasonable set of common contigs, as DeepVariant will only process contigs shared by both the BAM and reference. "" [https://github.com/google/deepvariant/blob/r1.5/docs/deepvariant-details.md].

Thanks in advance

from deepvariant.

pgrosu avatar pgrosu commented on August 26, 2024

Dear @ESDeutekom,

There are a few things to unpack here. The goal is to make sure that the BAM file and reference FASTA file match in their contigs and bases to ensure correct variants get called between the two, and if local realignment is required that the regions match appropriately. Thus, a couple of steps get performed:

$1)$ So in order to be sure there is proper consensus between the two, a first pass over the reference gets processed to remove contigs based on an internal list of excluded contigs found in the following file:

https://github.com/google/deepvariant/blob/r1.5/deepvariant/exclude_contigs.py

This is done in the reference first, because most likely these would not exist in most BAM files.

$2)$ Now given the remaining contigs from the reference, these are used to find common contigs that exist in the BAM file as well, to ensure there is proper overlap. All of this (including $Step \; 1$ above) is performed by the _ensure_consistent_contigs function.

The two steps summarize the statement you noticed in the documentation.

Now the only excluding option I see is for exclude_regions, though the rest of the above excluded contigs are hard-coded as exclude_contigs=exclude_contigs.EXCLUDED_HUMAN_CONTIGS. To exclude a region, that would come as a space-separated list of regions, which can be region literals (e.g., chr20:10-20) or paths to BED/BEDPE files.

Hope it helps,
Paul

from deepvariant.

pichuan avatar pichuan commented on August 26, 2024

Dear @AndrewCarroll ,

I know this issue was closed, but I did not want to open a new one for a question this trivial. Do I understand correctly the team has now implemented some form of alt contig consideration, as the documentation now says: "Second, the BAM file provided to --reads should be aligned to a "compatible" version of the genome reference provided as the --ref. By compatible here we mean the BAM and FASTA share at least a reasonable set of common contigs, as DeepVariant will only process contigs shared by both the BAM and reference. "" [https://github.com/google/deepvariant/blob/r1.5/docs/deepvariant-details.md].

Thanks in advance

Hi @ESDeutekom ,

Note that the documentation you quoted actually has been there for a while now, since 2018: https://github.com/google/deepvariant/blob/r0.6/docs/deepvariant-details.md
If this wording is confusing, we're happy to improve it. Please let us know which part is particularly confusing.

If your question is whether DeepVariant takes alt contigs into account, the answer also hasn't changed - not really. In general, when choosing between mapping to GRCh38 with alt contigs or without alt contigs, we empirically find that it's better to use the latter.

Hope this helps.

from deepvariant.

ESDeutekom avatar ESDeutekom commented on August 26, 2024

Dear @pgrosu and @pichuan,

Thank you for your answers. This does help me further along. With regards to the following:

In general, when choosing between mapping to GRCh38 with alt contigs or without alt contigs, we empirically find that it's better to use the latter.

Is this stated somwhere, or published with comparative results? I am curious to get some insight and elaboration on why this would happen.

Kind regards,
Eva

from deepvariant.

kishwarshafin avatar kishwarshafin commented on August 26, 2024

@ESDeutekom , this article https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use from Heng describes different caveats of using different reference genomes. As suggested in that article and from our previous experience we found GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz to be the most suitable.

from deepvariant.

pgrosu avatar pgrosu commented on August 26, 2024

Hi Eva,

Usually such comparisons are rare, as Kishwar suggested through Heng's article. If you are curious about such a comparison, below is a table generated on GIAB samples through GATK's HaplotypeCaller on GRCh38 and GRCh38 with masked alt regions:

image

Hope it helps,
~p

from deepvariant.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.