PR #83 surfaced a question about the JSON format for frequencies data. The Auspice frequencies panel needs only tip frequencies to work, and rather than try to shoehorn tips into the previous frequencies data model, it felt better to make something more targeted. However, @huddlej brought up the issue that it's going to be really annoying to support two different frequencies formats. This is my attempt to identify a merged JSON format that works for just _tipfrequencies.json but also works for the combined _frequencies.json with all nodes in the tree, annotated clades and mutations.
Current _frequencies.json is almost completely flat:
{
  "pivots": [ 2015.0833, 2015.1667, ... ],
  "global_HA1:40N": [ 0.0003, 0.0002, ...],
  "north_america_clade:2024": [ 0.0, 0.0, ...],
  "africa_6b.1": [ 0.1069, 0.1075, ...],
  ...
}
You have to know how the keys are constructed. Each begins with a region code (NA, africa, etc.) followed by an underscore (_) and then one of three things:
- A mutation, coded as the protein (HA1, SigPep, etc.) followed by a colon (:) and the mutation (40N, etc.).
- A generic clade in the tree, with clade:2024 indicating clade 2024 in the tree.
- An annotated clade in the tree, like 6b.1 or 3c3.a.
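To illustrate how much a consumer has to know about these keys, here is a rough sketch of a parser for them. The name parse_key and the REGIONS tuple are hypothetical and not part of nextflu or auspice; the region codes are just the ones that appear in the example above. Because a region code can itself contain an underscore (north_america), the key cannot simply be split on the first underscore, so the sketch assumes the list of regions is known up front.

```python
# Hypothetical list of region codes, taken from the example keys above.
# The real pipeline would need to supply its own list.
REGIONS = ("global", "north_america", "africa")


def parse_key(key, regions=REGIONS):
    """Split a flat _frequencies.json key into (region, kind, name)."""
    # Try longer region codes first so that a code which is a prefix of
    # another code cannot shadow it.
    for region in sorted(regions, key=len, reverse=True):
        prefix = region + "_"
        if key.startswith(prefix):
            rest = key[len(prefix):]
            break
    else:
        raise ValueError(f"no known region prefix in {key!r}")

    if rest.startswith("clade:"):
        # Generic clade, e.g. "clade:2024" -> clade index "2024"
        return region, "clade", rest[len("clade:"):]
    if ":" in rest:
        # Mutation, e.g. "HA1:40N"
        return region, "mutation", rest
    # Anything else is treated as an annotated clade, e.g. "6b.1"
    return region, "annotation", rest


print(parse_key("global_HA1:40N"))
print(parse_key("north_america_clade:2024"))
print(parse_key("africa_6b.1"))
```

Note that the top-level pivots key does not follow this scheme, so a caller would have to skip it before parsing.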
Currently proposed _tipfrequencies.json has a bit more structure:
{
  "pivots": [ 2015.0833, 2015.1667, ... ],
  "A/Dnipro/768/2016": {
    "frequencies": [0.004389, 0.004405, ...],
    "weight": 1.0
  }
}
My proposed flat spec would be:
{
  "pivots": [ 2015.0833, 2015.1667, ... ],
  "HA1:40N": {
    "frequencies": [ 0.0003, 0.0002, ...],
    "region": "global"
  },
  "clade:2024": {
    "frequencies": [ 0.0, 0.0, ...],
    "region": "north_america"
  },
  "6b.1": {
    "frequencies": [ 0.1069, 0.1075, ...],
    "region": "africa"
  },
  "A/Dnipro/768/2016": {
    "frequencies": [0.004389, 0.004405, ...],
    "weight": 1.0
  }
}
This makes it easy to reach into _frequencies.json and grab a tip by strain name, a clade by clade index, an annotated clade by clade_annotation, or a mutation by mutation name.
Note that the flat format makes it difficult to collect all, say, annotated clades from _frequencies.json alone. This would require adding hierarchy or marking these with something like anno:6b.1 and strain:A/Dnipro/768/2016 (but I think hierarchy is significantly better than this labeling).
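As a rough illustration of that difficulty, this is the kind of elimination logic a consumer of the flat spec might end up writing to pull out annotated clades. The function name annotated_clades is hypothetical, the frequency arrays are truncated to the two values shown above, and the heuristics (tips carry a weight, mutations and generic clades contain a colon) are assumptions read off the example rather than guarantees of the format.

```python
# Flat spec example from above, with frequency arrays truncated to the
# two values shown in the proposal.
flat = {
    "pivots": [2015.0833, 2015.1667],
    "HA1:40N": {"frequencies": [0.0003, 0.0002], "region": "global"},
    "clade:2024": {"frequencies": [0.0, 0.0], "region": "north_america"},
    "6b.1": {"frequencies": [0.1069, 0.1075], "region": "africa"},
    "A/Dnipro/768/2016": {"frequencies": [0.004389, 0.004405], "weight": 1.0},
}


def annotated_clades(flat_frequencies):
    """Collect annotated clades from the flat spec by ruling out everything else."""
    result = {}
    for key, entry in flat_frequencies.items():
        if key == "pivots":
            continue  # shared time axis, not a frequency entry
        if "weight" in entry:
            continue  # assumption: only tips carry a weight
        if ":" in key:
            continue  # assumption: mutations and clade:N always contain a colon
        result[key] = entry
    return result


print(list(annotated_clades(flat)))
```

Every one of these checks is a convention the reader has to know in advance, which is the same problem the current key scheme has.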
My proposed hierarchical spec would be:
{
  "pivots": [ 2015.0833, 2015.1667, ... ],
  "mutations": {
    "HA1:40N": {
      "frequencies": [ 0.0003, 0.0002, ...],
      "region": "global"
    }
  },
  "clades": {
    "2024": {
      "frequencies": [ 0.0, 0.0, ...],
      "region": "north_america"
    }
  },
  "annotations": {
    "6b.1": {
      "frequencies": [ 0.1069, 0.1075, ...],
      "region": "africa"
    }
  },
  "tips": {
    "A/Dnipro/768/2016": {
      "frequencies": [0.004389, 0.004405, ...],
      "weight": 1.0
    }
  }
}
In this case _tipfrequencies.json gains a tips dictionary at the top level as well and corresponds to a simple subset of what's in _frequencies.json.
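A minimal sketch of what that subset relationship could look like in practice, assuming the hierarchical layout above. The helper name tip_subset is hypothetical, and the frequency arrays are again truncated to the two values shown in the proposal.

```python
# Hierarchical spec example from above, with truncated frequency arrays.
frequencies = {
    "pivots": [2015.0833, 2015.1667],
    "mutations": {
        "HA1:40N": {"frequencies": [0.0003, 0.0002], "region": "global"},
    },
    "clades": {
        "2024": {"frequencies": [0.0, 0.0], "region": "north_america"},
    },
    "annotations": {
        "6b.1": {"frequencies": [0.1069, 0.1075], "region": "africa"},
    },
    "tips": {
        "A/Dnipro/768/2016": {"frequencies": [0.004389, 0.004405], "weight": 1.0},
    },
}


def tip_subset(full):
    """Build the _tipfrequencies.json content from the combined structure."""
    # Only the shared time axis and the tips dictionary are kept.
    return {"pivots": full["pivots"], "tips": full["tips"]}


tips_only = tip_subset(frequencies)

# Collecting a whole category is a single lookup, with no key parsing.
print(list(frequencies["annotations"]))
print(tips_only["tips"]["A/Dnipro/768/2016"]["weight"])
```

The same reader code for tips would then work against either file, since both expose pivots and tips at the top level.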
The latter hierarchical spec seems more self-documenting to me. I might slightly prefer it, but it's a mild preference. Implementing either spec in nextflu/auspice would be very easy.
Can I get votes from people who work with frequencies, i.e. @sidneymbell, @huddlej, @jameshadfield and @rneher, on which they prefer?