Comments (4)
@Hoeze Thanks for raising this issue. @Hoeze @henrydavidge I think this is more a discussion of our variant schema: should alternateAlleles and the INFO fields that are Number=A be merged into an array of structs? I think this cuts both ways. Burying integers and strings in an array of structs instead of keeping them in simple arrays brings its own awkwardness. When reading a VCF, we currently do not check the count type of a field to separate alternate-allele-specific arrays from other arrays, so we do not know whether a given array is of that type. In the split_multiallelics transformer I decide this solely based on the number of elements in the array (if it equals the number of alternate alleles, split; if not, repeat the whole array). I think if we figure out a way to tag the INFO fields that are alternate-allele-specific, the rest can be done by zipping programmatically. @henrydavidge Can StructField metadata be used for this?
@Hoeze In any case, split_multiallelics will be more complicated than exploding arrays, as its main job is handling the revision of calls and of colex-ordered fields in the genotypes column.
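The length-based heuristic described above can be sketched in plain Python. This is only an illustration, not Glow's actual implementation: the function name `split_info_by_length` and the sample field names are hypothetical. It also shows why tagging would help, since any array that merely happens to have as many elements as there are alternate alleles gets split as well.

```python
from typing import Any, Dict, List


def split_info_by_length(info: Dict[str, Any], num_alts: int) -> List[Dict[str, Any]]:
    """Return one INFO dict per alternate allele.

    A list-valued field whose length equals num_alts is treated as
    per-allele and split; every other field is repeated unchanged.
    """
    rows = []
    for i in range(num_alts):
        row = {}
        for name, value in info.items():
            if isinstance(value, list) and len(value) == num_alts:
                row[name] = value[i]  # looks per-allele: take the i-th entry
            else:
                row[name] = value     # anything else: repeat as-is
        rows.append(row)
    return rows


# Hypothetical record with two alternate alleles. "RANGE" stands for a
# made-up fixed-size field (two values) that is NOT per-allele.
info = {"AC": [3, 5], "DP": 40, "RANGE": [0, 100]}
rows = split_info_by_length(info, num_alts=2)
print(rows)
```

In this example `RANGE` is split although it is not per-allele, which is the ambiguity that an explicit tag on the field would remove.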
This applies especially to all INFO fields that are annotated with Number=A (that is, one value per alternate allele).
Maybe it would be easier to have all allele-specific columns in one large struct column?
root
 |-- contigName: string (nullable = true)
 |-- start: long (nullable = true)
 |-- end: long (nullable = true)
 |-- referenceAllele: string (nullable = true)
 |-- alternateAlleles: array (nullable = true)  # this is the struct array that contains every allele-specific annotation
 |    |-- element: struct (containsNull = false)
 |    |    |-- alleleString: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- INFO_controls_AC_nfe_seu: integer (nullable = true)
 [...]
This is an interesting suggestion. It could make interoperability with VCF a bit awkward, since the VCF header type would also change after splitting. cc @kianfar77 for thoughts.
@Hoeze, I'm curious why you necessarily need to convert all the arrays to scalars. Is there a query that you can't write against the array types, or is it just more verbose? Btw, if you do need scalar types, you should be able to convert all array-typed (or all Number=A typed) columns programmatically.
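A minimal sketch of what such a programmatic conversion could look like, written in plain Python rather than Spark. It assumes a hypothetical mapping `header_numbers` from column name to the `Number` value declared in the VCF header, and rows that each carry exactly one alternate allele; neither the mapping nor the function name comes from Glow.

```python
from typing import Any, Dict


def scalarize_number_a(row: Dict[str, Any], header_numbers: Dict[str, str]) -> Dict[str, Any]:
    """Replace every Number=A array in a biallelic row by its single element."""
    out = dict(row)
    for name, number in header_numbers.items():
        if number == "A" and name in out:
            values = out[name]
            if len(values) != 1:
                # More than one alternate allele: refuse instead of dropping data.
                raise ValueError(f"{name}: expected 1 value, got {len(values)}")
            out[name] = values[0]
    return out


# Hypothetical row and header information.
row = {"alternateAlleles": ["T"], "INFO_AC": [3], "INFO_DP": 40}
header_numbers = {"INFO_AC": "A", "INFO_DP": "1"}
scalar = scalarize_number_a(row, header_numbers)
print(scalar)
```

Driving the conversion from the header avoids guessing from array lengths, but it still has to be repeated for every VCF.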
Thanks for your answer @henrydavidge.
When we work with VCFs, we do so with one variant per row.
Multiple alternate alleles at the same position are an edge case that we have never used.
Therefore, until now I have been writing a bunch of .withColumn("alternateAlleles", f.col("alternateAlleles")[0]) calls for every VCF.
This leads to a number of problems:
- Unless I look at the header, I cannot be sure whether a column really is an array of length num_alt_alleles.
- It requires writing another set of column casts for each VCF.
- A struct of equal-length arrays is a very bad representation if I want to work with alternative alleles. For example, if I want to filter for variant quality, I have to subset every single array by the result of the filter expression.
- A combined explosion along the alt_allele dimension is not possible either. I first need to zip all the equal-length arrays into one struct array.
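The zipping step from the last point can be sketched in plain Python. The function name `zip_per_allele` and the column names are hypothetical, and the caller has to say which columns are per-allele, which is exactly the information the flat schema does not carry. In Spark, the built-in `arrays_zip` function plays a similar role.

```python
from typing import Any, Dict, List


def zip_per_allele(row: Dict[str, Any], per_allele_fields: List[str]) -> List[Dict[str, Any]]:
    """Turn equal-length per-allele arrays into one list of structs (dicts)."""
    lengths = {len(row[name]) for name in per_allele_fields}
    if len(lengths) != 1:
        raise ValueError("per-allele fields differ in length, or none were given")
    (n,) = lengths
    # One struct per alternate allele, holding the i-th value of every field.
    return [{name: row[name][i] for name in per_allele_fields} for i in range(n)]


# Hypothetical multiallelic row.
row = {"alternateAlleles": ["T", "G"], "INFO_AC": [3, 5], "INFO_AF": [0.1, 0.2]}
structs = zip_per_allele(row, ["alternateAlleles", "INFO_AC", "INFO_AF"])
print(structs)
```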
In comparison, with Array[Struct{<alt_allele annotations>}] you have the following guarantees:
- All per-alt-allele annotations are collected in a single entry of the array
- No confusion about which columns are really per-allele and which are not
- You can directly filter the array for certain alternative alleles
- You get split_multiallelics for free by exploding the array
The data type also does not change. On the contrary, it becomes even more explicit: all columns in alternateAlleles can be assumed to be of type Number=A.
Thinking about this, I am becoming more and more convinced that this representation would significantly improve the workflow with Glow.
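To illustrate the filtering and exploding points, here is a plain-Python model of the proposed layout, with each alternate allele stored as one dict. The helper names `filter_alleles` and `explode_alleles` and the field `INFO_AC` are hypothetical; this is a sketch of the idea, not Glow code.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def filter_alleles(row: Row, keep: Callable[[Row], bool]) -> Row:
    """Keep only the alternate alleles for which keep(allele) is true."""
    return {**row, "alternateAlleles": [a for a in row["alternateAlleles"] if keep(a)]}


def explode_alleles(row: Row) -> List[Row]:
    """Return one row per alternate allele, each holding a single struct."""
    return [{**row, "alternateAlleles": a} for a in row["alternateAlleles"]]


# Hypothetical variant with two alternate alleles.
variant = {
    "contigName": "1",
    "start": 100,
    "referenceAllele": "A",
    "alternateAlleles": [
        {"alleleString": "T", "INFO_AC": 3},
        {"alleleString": "G", "INFO_AC": 0},
    ],
}

observed = filter_alleles(variant, lambda a: a["INFO_AC"] > 0)
exploded = explode_alleles(variant)
print(observed["alternateAlleles"])
print(len(exploded))
```

This covers only the per-allele annotations. As noted earlier in the thread, the genotypes column would still need its calls revised after a split.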