
Comments (4)

kianfar77 commented on August 18, 2024

@Hoeze Thanks for raising this issue. @Hoeze @henrydavidge I think this is more a discussion about our variant schema: should alternateAlleles and the INFO fields that are Number=A be merged into an array of structs? I think this cuts both ways. Burying integers and strings in an array of structs instead of keeping them in simple arrays brings its own awkwardness. We currently do not check the countType when reading a VCF, so alternate-allele-specific arrays are not separated from other arrays in any way and we do not know whether a given array is of that type. In the split_multiallelics transformer I decide this solely from the number of elements in the array (if it equals the number of alternate alleles, split; if not, repeat the whole array). I think if we figure out a way to tag the INFO fields that are alternate-allele-specific, the rest can be done by zipping programmatically. @henrydavidge Can StructField metadata be used for this?
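For readers following along, the length-based heuristic described above can be sketched in plain Python. This is an illustration only, not Glow's split_multiallelics code: rows are modelled as dicts, `split_by_length` is a hypothetical helper, and keeping split values as single-element lists is an assumption.

```python
def split_by_length(row, array_fields):
    """Yield one row per alternate allele, guessing per-allele fields by length."""
    alts = row["alternateAlleles"]
    for i, alt in enumerate(alts):
        out = dict(row)  # shallow copy; untouched fields are repeated as-is
        out["alternateAlleles"] = [alt]
        for name in array_fields:
            values = row[name]
            if len(values) == len(alts):
                # Length matches the number of alternate alleles:
                # assume the field is Number=A and keep only this allele's value.
                out[name] = [values[i]]
        yield out


row = {
    "contigName": "1",
    "alternateAlleles": ["A", "T"],
    "INFO_AC": [3, 5],        # Number=A: one value per alternate allele
    "INFO_CIPOS": [-10, 10],  # Number=2 in the VCF spec: not per-allele
}
for split_row in split_by_length(row, ["INFO_AC", "INFO_CIPOS"]):
    print(split_row)
```

With two alternate alleles, the Number=2 field in this example has the same length as a Number=A field and is split as well, which is the ambiguity that tagging allele-specific fields would remove.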

@Hoeze In any case, split_multiallelics will be more complicated than exploding arrays, as its main job is handling the revision of calls and colex-ordered fields in the genotypes column.

from glow.

Hoeze commented on August 18, 2024

This applies especially to all INFO fields that are annotated with Number=A, i.e. one value per alternate allele.

Maybe it would be easier to have all allele-specific columns in one large struct column?

root
 |-- contigName: string (nullable = true)
 |-- start: long (nullable = true)
 |-- end: long (nullable = true)
 |-- referenceAllele: string (nullable = true)
 |-- alternateAlleles: array (nullable = true) # this is the struct array that contains every allele-specific annotation
 |    |-- element: struct (containsNull = false)
 |    |    |-- alleleString: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- INFO_controls_AC_nfe_seu: integer (nullable = true)
[...]
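A minimal sketch of how such a struct array could be built from the current parallel arrays, in plain Python with rows modelled as dicts. The helper `zip_allele_fields` and the explicit list of per-allele fields are assumptions for illustration; in Spark the same idea would use a zip over the arrays.

```python
def zip_allele_fields(row, allele_fields):
    """Merge alternateAlleles and the given per-allele arrays into one list of structs."""
    alts = row["alternateAlleles"]
    for name in allele_fields:
        if len(row[name]) != len(alts):
            raise ValueError(f"{name} does not have one value per alternate allele")
    # Drop the old parallel arrays; they are replaced by the struct array.
    out = {k: v for k, v in row.items() if k not in allele_fields}
    out["alternateAlleles"] = [
        {"alleleString": alt, **{name: row[name][i] for name in allele_fields}}
        for i, alt in enumerate(alts)
    ]
    return out


row = {
    "contigName": "1",
    "referenceAllele": "G",
    "alternateAlleles": ["A", "T"],
    "INFO_controls_AC_nfe_seu": [3, 5],
}
print(zip_allele_fields(row, ["INFO_controls_AC_nfe_seu"]))
```

The length check makes the per-allele assumption explicit: a field that is not really one value per alternate allele fails loudly instead of being zipped silently.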


henrydavidge commented on August 18, 2024

This is an interesting suggestion. This could make interoperability with VCF a bit awkward since the VCF header type would also change after splitting. cc @kianfar77 for thoughts.

@Hoeze, I'm curious whether you necessarily need to convert all the arrays to scalars. Is there a query that you can't write against the array types, or is it just more verbose? Btw, if you do need scalar types, you should be able to convert all array-typed (or all Number=A-typed) columns programmatically.
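As a rough sketch of what "programmatically" could look like, here is a plain-Python version that takes a mapping from field name to its VCF header Number and turns every Number=A array into a scalar. The helper name `scalarize_number_a`, the `header_numbers` mapping, and the dict-based rows are assumptions for illustration, not part of Glow.

```python
def scalarize_number_a(row, header_numbers):
    """Replace single-element Number=A arrays (and alternateAlleles) with scalars."""
    alts = row["alternateAlleles"]
    if len(alts) != 1:
        raise ValueError("expected exactly one alternate allele per row")
    out = dict(row)
    out["alternateAlleles"] = alts[0]
    for name, number in header_numbers.items():
        if number == "A" and name in row:
            out[name] = row[name][0]  # one value per alt allele, and there is one alt
    return out


# Hypothetical header information: field name -> Number from the VCF header.
header_numbers = {"INFO_AC": "A", "INFO_AF": "A", "INFO_DP": "1"}
row = {"alternateAlleles": ["T"], "INFO_AC": [4], "INFO_AF": [0.25], "INFO_DP": 30}
print(scalarize_number_a(row, header_numbers))
```

Driving the conversion from the header rather than from array lengths avoids touching arrays that merely happen to have one element.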


Hoeze commented on August 18, 2024

Thanks for your answer @henrydavidge.

When we work with VCFs, we do so with one variant per row.
Multiple alternate alleles at the same position are an edge case that we have never used.

Therefore, until now I have written a bunch of .withColumn("alternateAlleles", f.col("alternateAlleles")[0]) calls for every VCF.
This leads to a number of problems:

  • When I do not look at the header, I cannot be sure whether a column is really an array of length num_alt_alleles
  • It requires writing another set of column casts for each VCF
  • A struct of equal-length arrays is a very bad representation if I want to work with alternative alleles.
    For example, if I want to filter for variant quality, I have to subset every single array by the result of the filter expression.
  • A combined explosion of the alt_allele dimension is not possible either: I first need to zip all the equal-length arrays into one struct.
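To make the filtering problem concrete, here is a plain-Python sketch with rows modelled as dicts. The helper `filter_parallel_arrays` and the field names are hypothetical; the point is that one mask has to be applied to every per-allele array to keep them aligned.

```python
def filter_parallel_arrays(row, allele_fields, keep):
    """Keep only the alleles where keep(row, i) is true, in every parallel array."""
    n_alts = len(row["alternateAlleles"])
    mask = [keep(row, i) for i in range(n_alts)]
    out = dict(row)
    # Every per-allele array, including alternateAlleles itself, needs the same mask.
    for name in ["alternateAlleles", *allele_fields]:
        out[name] = [value for value, m in zip(row[name], mask) if m]
    return out


row = {
    "alternateAlleles": ["A", "T", "C"],
    "INFO_AC": [3, 0, 7],
    "INFO_AF": [0.1, 0.0, 0.3],
}
# Drop alleles that were never observed (allele count of zero).
print(filter_parallel_arrays(row, ["INFO_AC", "INFO_AF"], lambda r, i: r["INFO_AC"][i] > 0))
```

Forgetting a single field in `allele_fields` leaves that array misaligned with the others, which is the fragility described in the list above.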

In comparison, with Array[Struct{<alt_allele annotations>}] you have the following guarantees:

  • All per-alt-allele annotations are collected in a single entry of the array
  • No confusion about which columns are really per-allele and which are not
  • You can directly filter the array for certain alternative alleles
  • You get split_multiallelic for free by exploding the array
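A small plain-Python sketch of the last two points, with rows as dicts and each allele as a dict of its annotations. `filter_alleles` and `explode_alleles` are hypothetical helpers; the explode here only covers the allele-level fields and does not touch genotypes, so it is not a full replacement for split_multiallelics.

```python
def filter_alleles(row, predicate):
    """Keep only the allele structs for which predicate(allele) is true."""
    out = dict(row)
    out["alternateAlleles"] = [a for a in row["alternateAlleles"] if predicate(a)]
    return out


def explode_alleles(row):
    """Yield one row per allele struct; all other fields are repeated."""
    for allele in row["alternateAlleles"]:
        out = dict(row)
        out["alternateAlleles"] = [allele]
        yield out


row = {
    "contigName": "1",
    "alternateAlleles": [
        {"alleleString": "A", "INFO_AC": 3},
        {"alleleString": "T", "INFO_AC": 0},
    ],
}
observed = filter_alleles(row, lambda allele: allele["INFO_AC"] > 0)
print(observed)
for single in explode_alleles(row):
    print(single)
```

Because each allele carries its own annotations, neither operation needs to know which fields are per-allele.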

The data type also does not change. On the contrary, it becomes even more explicit:
All columns in alternateAlleles can be assumed to be of type Number=A.


Thinking about this, I am becoming more and more convinced that this representation would significantly improve the workflow with Glow.
