Comments (5)

hpages commented on July 17, 2024

A deeper problem with the way VariantAnnotation handles the ALT field is that it replaces both the dots (.) and asterisks (*) with an empty string at load time, making it impossible to distinguish between a "no-alleles" situation and a deletion situation. For example, with a VCF file that contains the following lines:

#CHROM  POS     ID      REF     ALT     ...
20      17330   .       T       A       ...
20      20500   .       C       *       ...
20      1110696 .       A       G,T     ...
20      1230237 .       T       .       ...
20      1234567 .       GTC     G,*,GTCT        ...

ALT <- alt(readVcf(...)) will look like this (after turning it into a list of ordinary character vectors with as.list(as(ALT, "CharacterList"))):

[[1]]
[1] "A"

[[2]]
[1] ""

[[3]]
[1] "G" "T"

[[4]]
[1] ""

[[5]]
[1] "G"    ""     "GTCT"

Unfortunately, ALT[[2]] and ALT[[4]] are both represented by an empty string, so the information about whether each one is a "no-alleles" situation or a deletion is lost 😞
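The loss of information can be reproduced in base R without reading a VCF file. The sketch below is only an illustration: blank_alt is a hypothetical stand-in for the load-time behavior described above, not VariantAnnotation's actual code, and alt_column simply copies the ALT column of the example lines.

```r
# Hypothetical stand-in for the load-time normalization described above;
# this is NOT VariantAnnotation's actual code.
blank_alt <- function(alt_field) {
  # Split a comma-separated ALT field into its individual alleles.
  alleles <- strsplit(alt_field, ",", fixed = TRUE)[[1]]
  # Blank both the dot (no alleles) and the asterisk (deletion).
  alleles[alleles %in% c(".", "*")] <- ""
  alleles
}

# ALT column of the five example records.
alt_column <- c("A", "*", "G,T", ".", "G,*,GTCT")
loaded <- lapply(alt_column, blank_alt)

# The deletion (record 2) and the "no alleles" case (record 4) now compare equal.
identical(loaded[[2]], loaded[[4]])
```

Once both markers have been blanked, nothing in the loaded values says which one was originally there, so a writer cannot restore them.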

Should the package preserve the asterisks as they are and not blank them like it currently does?

It could, but that's not necessarily the easiest way to go.

I suggest that we don't change how asterisks (i.e. deletions) get loaded: they would still be loaded as empty strings, but these empty strings would be turned back into asterisks at write time. However, that means we need to change the way we handle the dots (i.e. no alleles). One possibility is to preserve them. Another possibility is to represent the "no-alleles" situation with an empty (zero-length) DNAStringSet object or character vector (not the same as an empty string!). So for the above example, ALT would look like this:

> ALT
DNAStringSetList of length 5
[[1]] A
[[2]] 
[[3]] G T
[[4]] DNAStringSet object of length 0
[[5]] G  GTCT

> as.list(as(ALT, "CharacterList"))
[[1]]
[1] "A"

[[2]]
[1] ""

[[3]]
[1] "G" "T"

[[4]]
character(0)

[[5]]
[1] "G"    ""     "GTCT"

Then those zero-length DNAStringSet objects or character vectors would be turned back into dots at write time.

IMO using a zero-length DNAStringSet object or character vector makes more sense than preserving the dots at load time. If there are no alleles, then the DNAStringSet object or character vector that represents them is empty. It's very natural and keeps things consistent because then the number of alleles is always equal to the length of the DNAStringSet object or character vector that is used to represent them.
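To make the proposal concrete, here is a minimal base R sketch using plain character vectors. The helpers load_alt and write_alt are hypothetical names invented for this illustration and are not part of VariantAnnotation; the sketch only assumes that a dot stands alone in the ALT field.

```r
# Hypothetical helpers sketching the proposal; not VariantAnnotation's API.
load_alt <- function(alt_field) {
  # No alleles: represent the dot as a zero-length character vector.
  if (alt_field == ".") return(character(0))
  alleles <- strsplit(alt_field, ",", fixed = TRUE)[[1]]
  # Deletion: keep loading the asterisk as an empty string.
  alleles[alleles == "*"] <- ""
  alleles
}

write_alt <- function(alleles) {
  # A zero-length vector goes back to a dot.
  if (length(alleles) == 0L) return(".")
  # An empty string goes back to an asterisk.
  alleles[alleles == ""] <- "*"
  paste(alleles, collapse = ",")
}

# ALT column of the five example records.
alt_column <- c("A", "*", "G,T", ".", "G,*,GTCT")
loaded <- lapply(alt_column, load_alt)

# The number of alleles is simply the length of each element: 1 1 2 0 3
lengths(loaded)

# Writing the loaded values back reproduces the original column.
round_trip <- vapply(loaded, write_alt, character(1))
identical(round_trip, alt_column)
```

With this representation the two cases stay distinguishable after loading, and the write step can restore both markers without guessing.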

H.

from variantannotation.

hpages commented on July 17, 2024

@DarioS, @lawremi, @mtmorgan, @vjcitn: Does anybody want to try to come up with a fix? I'm not a co-author of VariantAnnotation and I don't maintain it either. I was just sharing my analysis of the problem above, with suggestions on how to address it.

DarioS commented on July 17, 2024

That seems like a suitable solution at first glance. I could implement it, unless someone else thinks that there is a better way.

vjcitn commented on July 17, 2024

I'd suggest that @DarioS take a shot at it; it sounds like a clear plan to me. When the PR is available I will review it and push.

DarioS commented on July 17, 2024

The code that changes asterisks into empty strings is easy to find: it is flat[grepl("*", flat, fixed = TRUE)] <- "" in the function .formatALT. The dots, however, seem to be changed into empty strings in C code, which I have not worked with since I was an undergraduate student. The result returned by the C function scan_vcf_character already has the conversion applied, but I can't see which statement performs it. Can anyone experienced in C programming identify it?
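For readers who want to see what that R statement does in isolation, here is a small sketch. The vector flat below is made up for illustration and is only assumed to resemble the flattened ALT values; the "<*>" entry is included purely to show how substring matching behaves.

```r
# `flat` is a made-up stand-in for the flattened ALT values.
flat <- c("A", "*", "G", "T", "<*>", "GTCT")

# fixed = TRUE makes "*" a literal character rather than a regex quantifier.
# grepl() matches substrings, so any element containing an asterisk is blanked,
# not only elements that are exactly "*".
flat[grepl("*", flat, fixed = TRUE)] <- ""
flat
```

If only exact asterisks should be affected, a comparison such as flat == "*" would be the narrower test. Whether that distinction matters for .formatALT depends on what values actually reach it, which this sketch does not establish.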
