Scripts, config, and snakefiles for seasonal-flu nextstrain builds
We need to know which strains and clades do not have titer measurements from different assays, passage types, and sources. We previously wrote a script to annotate the number of titer measurements per node as a node data JSON file. This script's output plus an Auspice config JSON coloring entry allowed us to color the tree by total titers or counts binned into larger categories.
We should implement this annotation in the refactored workflow for the private builds, at least. We could also use the Auspice config JSON to set up meaningful bins for the counts as a legend scale.
It would also be helpful to be able to filter the tree by presence/absence of measurements in the different categories, so we could focus on only tips that have no measurements.
Context
HI assays do not perform as well for recent H3N2 viruses as FRA/HINT assays (3c3.A is the exception), and HI assays produce different titer model parameters than the FRA/HINT assays. We know these assays measure different aspects of viral infection and that FRA/HINT assays were historically performed when HI assays didn't work, making direct comparisons between these data biased.
Although I had previously thought that FRA/HINT data were more numerous than HI data, the number of measurements submitted since Jan 1, 2019 is nearly the same for both assay types. It turns out that the majority of our FRA/HINT data come from the CDC and the majority of the HI data come from VIDRL. This is an important distinction because the live builds only include data from the CDC and are thus relatively depleted for HI data. Here is the breakdown of recent data based on the inclusion date:
>>> hi.query("inclusion_date > '2019-01-01'")["center"].value_counts()
vidrl 21415
crick 6621
cdc 3108
>>> fra.query("inclusion_date > '2019-01-01'")["center"].value_counts()
cdc 18649
vidrl 3987
niid 3851
crick 1097
Description
We should try building a 2y H3N2 HA tree with HI data and another with FRA data, keeping all other parameters of the builds the same (including the forecasting model). Then, we should compare the trees and the corresponding titer model results head-to-head. We should look for:
I cloned the seasonal-flu repo and installed the Nextstrain CLI with the conda setup.
I tried to run the example build (nextstrain build . targets/flu_seasonal_h3n2_ha_12y), but it does not work, producing a "MissingRuleException."
Before the refactored workflow update, I was able to build with the same steps and example data.
I'd like to run the same with the updated, refactored workflow.
Steps to reproduce the current behavior:
curl -fsSL --proto '=https' https://nextstrain.org/cli/installer/linux | bash; printf '\n%s\n' 'eval "$("/home/limh25/.nextstrain/cli-standalone/nextstrain" init-shell bash)"' >> ~/.bashrc; eval "$("/home/limh25/.nextstrain/cli-standalone/nextstrain" init-shell bash)"
python3 -m pip install nextstrain-cli; nextstrain setup conda; nextstrain setup --set-default conda
(nextstrain) [limh25@machine1 seasonal-flu]$ nextstrain build . targets/flu_seasonal_h3n2_ha_12y --configfile /app/limh25/seasonal-flu/profiles/example/builds.yaml --cores 8 --printshellcmds --show-failed-logs --reason --verbose
Building DAG of jobs...
Full Traceback (most recent call last):
File "/home/limh25/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/__init__.py", line 771, in snakemake
success = workflow.execute(
File "/home/limh25/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/workflow.py", line 741, in execute
dag.init()
File "/home/limh25/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/dag.py", line 210, in init
self.file2jobs(file),
File "/home/limh25/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/dag.py", line 1984, in file2jobs
raise MissingRuleException(targetfile)
snakemake.exceptions.MissingRuleException: No rule to produce targets/flu_seasonal_h3n2_ha_12y (if you use input functions make sure that they don't raise unexpected exceptions).
MissingRuleException:
No rule to produce targets/flu_seasonal_h3n2_ha_12y (if you use input functions make sure that they don't raise unexpected exceptions).
unlocking
removed all locks
(nextstrain) [limh25@machine1 seasonal-flu]$ nextstrain check-setup
nextstrain-cli is up to date!
Testing your setup…
# docker is not supported
✘ no: docker is installed
✘ no: docker run works
? unknown: containers have access to >2 GiB of memory
✔ yes: image is new enough for this CLI version
# conda is supported
✔ yes: operating system is supported
✔ yes: runtime data dir doesn't have spaces
✔ yes: runtime appears set up
✔ yes: snakemake is installed and runnable
✔ yes: augur is installed and runnable
✔ yes: auspice is installed and runnable
# singularity is not supported
✘ no: singularity is installed
✘ no: singularity version None ≥ 3.0.0
✘ no: singularity works
# ambient is not supported
✘ no: snakemake is installed and runnable
✘ no: augur is installed and runnable
✘ no: auspice is installed and runnable
# aws-batch is not supported
✘ no: job description "nextstrain-job" exists
✘ no: job queue "nextstrain-job-queue" exists
✘ no: S3 bucket "nextstrain-jobs" exists
Supported Nextstrain runtimes: conda
All good! Default runtime (conda) is supported.
There was no problem running the example build before the refactoring update. Now it does not work as expected.
Please help me run the example build.
Please also let me know if any further command-line outputs are needed to debug.
Thank you.
The current auspice configuration file (config/auspice_config.json) is not always appropriate for all builds in the Snakefile. The most obvious example of this is that the auspice config exports a color-by for “RBS mutations” even though A/H3N2 HA builds are the only ones that define RBS mutations.
This feels like an issue that should be resolved at the level of the Snakefile and not at the augur or auspice level (it's not necessary for those tools to know that this is how we are using them). Possible solutions at the Snakefile-level include:
- Define a separate Auspice config for every build (e.g., config/auspice_config_{lineage}_{segment}_{resolution}.json)
- Define a custom Auspice config only for the builds that need one (e.g., config/auspice_config_h3n2_ha.json) and fall back to a shared default config for the rest
- Generate per-build configs from a shared template

The first solution seems the least ideal in that most of these configs will be identical and any update to the shared content between them will require the same updates to many files.
The second solution still duplicates shared content but benefits from defining build-specific conditional logic in Python in the Snakefile. For seasonal flu, it seems likely that the custom config will be the exception and not the rule, so a little duplicated content may not be the worst outcome.
The third solution is guaranteed to avoid content duplication, but it does move the build-specific conditional logic into a templating language. This might be the best solution in the long run if each build ends up having its own slight modifications to the shared settings. In the short term, though, it feels like an over-engineered approach.
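For the second solution, a Snakemake input function could pick a build-specific config when one exists (a sketch; the function name, wildcard names, and file layout are assumptions, not existing workflow code):

```python
import os

# Sketch of the second solution: prefer a build-specific Auspice config
# when one exists on disk, otherwise fall back to the shared config.
def auspice_config_path(wildcards):
    custom = f"config/auspice_config_{wildcards.lineage}_{wildcards.segment}.json"
    if os.path.exists(custom):
        return custom
    return "config/auspice_config.json"
```

A rule's `input:` section could then reference `auspice_config_path` directly, keeping the conditional logic in Python rather than in duplicated config files.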
The subsampling logic for strain selection in original augur supported a "priority" sampling for specific regions. This type of prioritization is helpful for custom flu builds where one wants to investigate evolutionary patterns in specific regions. The current strain selection script for seasonal flu exposes the same --sampling flag used by old augur, but this flag is ignored.
To enable preferred sampling of regions by the current script, some of the code from the original subsampling script needs to be migrated.
We currently annotate full official clade names for all lineages, but for H3N2 and H1N1pdm the longer official clade names can be difficult to read in the Nextstrain view. We also often need to propose/follow new subclades before these are officially recognized.
A similar situation exists for SARS-CoV-2 where we support Nextstrain clades, PANGO lineages, and "emerging lineages" as alternate branch labels in Auspice.
We should set up parallel "clade" names (i.e., Nextstrain clades) and "full clade" names (i.e., official WHO clade names) in an analogous fashion to how we have "clade" and "emerging lineage" in ncov. This would be both a coloring and toggle under the "branch labels".
We should default to showing the Nextstrain clade names which we'll abbreviate to "2a.2", etc... but have the full official clade names available. Nextclade will similarly have two fields for flu clades.
The full official clade names are defined in files named like clades_h3n2_ha.tsv while the Nextstrain clades are in files named like nextstrain_clades_h3n2_ha.tsv. We need to either define a separate clade annotation rule for the Nextstrain clades or modify the existing clades rule to be parameterized by the set of clade definitions to use.
The ncov workflow defines separate rules to annotate clades and emerging lineages. This workflow also needs an additional rule to add branch labels for emerging lineages to the Auspice JSON. However, if we could first merge @jameshadfield's excellent Augur PR to add support for arbitrary branch labels in augur clades, we could avoid this additional complexity in the seasonal flu workflow.
We also need to redefine Nextstrain clades as shortened versions of the full official clade names, keeping the official capitalization and abbreviating on dot. For example:
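A sketch of the abbreviation rule (keeping the last two dot-delimited components is an assumption based on the "2a.2" example above):

```python
# Sketch: shorten a full WHO clade name by keeping only the last few
# dot-delimited components, preserving the official capitalization.
# Keeping two components is an assumption based on the "2a.2" example.
def abbreviate_clade(full_name, keep=2):
    parts = full_name.split(".")
    return ".".join(parts[-keep:])
```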
Before we can deploy newly built Auspice JSONs to the Nextstrain group, we need to rename them to a unique name that reflects the date they were built. The builds need to be renamed from names like h1n1pdm_2y_titers_ha.json to flu_seasonal_YYYY-MM-DD_h1n1pdm_2y_titers_ha.json where YYYY-MM-DD is the date the builds were run. We currently rename these files manually to include the date.
The nextstrain-public builds provide an example of custom rules to rename the JSONs prior to deployment. The new names for the JSONs get defined as a template string in the build configuration parameter called auspice_name. The rename rules get invoked by the custom deploy rule, which is the primary target of the workflow we run from GitHub Actions.
We should follow the same pattern for the nextflu-private builds, except we need to calculate the current date when the workflow runs. One approach could be to assign the current date to a global variable in the rename.smk Snakemake file, define an auspice_name for the builds with a non-wildcard placeholder for the date like "flu_seasonal_{{date}}_{lineage}_{{segment}}_{resolution}", and then pass the date value to the format call on the filename strings (e.g., this one and this one).
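A minimal sketch of that approach, assuming wildcards are filled in first and the escaped date placeholder is substituted in a second format call (function and variable names are illustrative):

```python
import datetime

# Compute the build date once, when the workflow is parsed.
build_date = datetime.date.today().strftime("%Y-%m-%d")

def deployed_filename(auspice_name, wildcards, date=None):
    """Fill in build wildcards first, then the escaped {{date}} placeholder."""
    date = date or build_date
    return auspice_name.format(**wildcards).format(date=date) + ".json"
```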
TODO:

- Define an auspice_name string for builds in profiles/nextflu-private.yaml with reference to the date (datetime.date.today().strftime("%Y-%m-%d"))
- Set up a dedicated deploy user for the group (nextflu-private-bot)
- Pass credentials to the build with --env NEXTSTRAIN_USERNAME and --env NEXTSTRAIN_PASSWORD. These environment variables should already exist in the GitHub Action thanks to setup by the pathogen-repo-build action
- Add a nextstrain login command to the custom Snakemake deploy rule, which will rely on credentials in environment variables
- Rename forecast JSONs from names like h3n2_2y_ha_cell_fra_cTiter_x-ne_star_forecast-tip-frequencies.json to flu_seasonal_YYYY-MM-DD_h3n2_2y_ha_cell_fra_cTiter_x-ne_star_forecast-tip-frequencies.json (or just store as auspice/forecasts/{build_name}/{model}.json and copy to auspice/forecasts/{build_date}/{build_name}/{model}.json)

Some discussion with Becky:
Egg-passaged sequences currently appear in seasonal flu builds. These should be filtered out during the augur filter step by excluding viruses with an "egg" passage history in the metadata.
Currently, epitope (ep) and non-epitope (ne) coloring for H1N1pdm NA is broken:
Context
To enable SARS-CoV-2 narratives based on data at specific timepoints, we've maintained dated Auspice JSONs that are automatically generated as part of the standard Nextstrain workflow. These dated builds follow the naming format of auspice/ncov_{build_name}_{date}.json
where the date is in YYYY-MM-DD
format.
Description
To support similar narrative reports on the status of influenza, we should create dated copies of builds for the four lineages. Since our reporting usually only needs recent context, we can start by creating these copies only for the 2y builds of HA and NA. With tree/meta, tip frequencies, and root sequence JSONs, this approach should produce 24 JSONs per day (4 lineages * 2 segments * 3 JSONs per build).
Possible solution
Following the ncov workflow example, the resulting dated JSONs would be named like:
flu_seasonal_h3n2_ha_2y_2021-04-20.json
And would produce a URL like https://nextstrain.org/flu/seasonal/h3n2/ha/2y/2021-04-20.
An alternate solution that would be more consistent with the hierarchical grouping standard of HTTP APIs would place the date earlier in the name such that all builds from that date could be displayed below. An example dated JSON would look like:
flu_seasonal_2021-04-20_h3n2_ha_2y.json
And the URL would look like https://nextstrain.org/flu/seasonal/2021-04-20/h3n2/ha/2y. When we eventually support browsing builds by exploring the URL's API, this hierarchical layout would enable a page at https://nextstrain.org/flu/seasonal/2021-04-20/ that could show all builds from that day.
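The two layouts differ only in where the date lands in the name; a sketch (the helper is illustrative, not existing workflow code):

```python
# Sketch: compose a dated Auspice JSON filename under either layout.
# build_name is e.g. "h3n2_ha_2y"; date is e.g. "2021-04-20".
def dated_name(build_name, date, date_first=False):
    if date_first:
        return f"flu_seasonal_{date}_{build_name}.json"
    return f"flu_seasonal_{build_name}_{date}.json"
```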
Open question: Should we produce these dated builds for all trees on the private site, too, or only the public site?
Context
To understand circulation patterns of flu within a specific country, we need to know which divisions (e.g., states, provinces, etc.) strains were collected in. We currently only annotate country-level geographic resolution in our Auspice JSONs.
Description
We should export division as an annotation in Auspice JSONs. To do this, we need to:

- Add "division" to the list of geo_resolutions in each Auspice config JSON
- Add a division color-by so we can color and filter by division annotations from the metadata

My guess is that the lat/long database from ncov will be substantially better than the current flu database, so we could start by copying from ncov and filling in blanks as we go.
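The geo_resolutions change might look like the following Auspice config fragment (a sketch; only the relevant key is shown):

```json
{
  "geo_resolutions": ["country", "division"]
}
```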
Context
Phylogenetic trees of HA/NA help us identify emerging clades that should be tracked by routine surveillance. For each of these emerging clades, we need a way to identify candidate strains for testing by experimental assays like HI, FRA, HINT, etc.
Description
Identify one or more prototype virus strains for each putative clade (identified as in #65). The prototypes must be real strains. These could represent strains identified computationally or experimentally as the best representatives.
Possible solution
Produce reference alignments with currently circulating diversity and identify the closest strain(s) to each auto-assigned clade. Most likely, these strains will need to be manually curated after the clades are automatically identified.
Consider labeling the auto-assigned clades (from #65) with the names of the prototype strains we curate here.
Despite our myriad quality filters in the workflow including a global clock rate filter, local clock rate filter, an outliers list, and dropping sequences with poor alignments, our trees still occasionally include low-quality sequences that would have been flagged with a "bad" QC status by Nextclade. In most of my recent experiences with this issue, the low-quality sequences have too many private mutations.
In the best case, these low-quality sequences look strange in the tree. In the worst case, these sequences break the date inference for internal nodes and produce an invalid time tree topology, requiring the builds to be run again.
Since we already plan to migrate away from Nextalign to Nextclade, we should align sequences with Nextclade and produce the metadata output file with QC statuses. We should filter any sequences that have a QC status of "bad".
We could approach this functionality in a couple of ways:
- Run Nextclade within the phylogenetic workflow, filter low-quality sequences (e.g., with augur filter on the alignment sequences and "metadata" from Nextclade), and pass the filtered alignment to the tree rule. We'd probably want to merge the original metadata records with the Nextclade metadata prior to filtering, much like the merge we do in the flu_frequencies workflow.

OR

- Run Nextclade upstream of the phylogenetic workflows, store the merged metadata centrally, and let each build reuse those QC annotations.
The benefit of the first approach is that we could implement it now without much additional infrastructure planning, since the changes all happen inside the phylogenetic workflow. The annotations would be very fast, since we'd only run Nextclade on the subsampled data. The main disadvantage is the additional complexity to the workflow and the redundant runs of Nextclade across multiple builds for the same lineages and segments.
The benefits of the second approach are that it would introduce no complexity to the existing workflow and it would produce a valuable resource that other current workflows (like flu_frequencies) and future workflows (forecasts?) could benefit from. The disadvantage is the additional infrastructural complexity of setting up the Nextclade runs with GitHub Actions for different references (e.g., A/Wisconsin/67/2005 or A/Darwin/6/2021 for H3N2) and storing the merged metadata in S3 in a way that allows us to unambiguously grab the outputs from the desired Nextclade dataset.
I think we want to end up at the second approach eventually, so maybe it is worth the extra planning effort to figure that approach out now instead of using the first approach.
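Under either approach, the QC filter itself is simple. A sketch with pandas, assuming Nextclade's TSV output with its seqName and qc.overallStatus columns:

```python
import pandas as pd

# Sketch: keep only sequences whose Nextclade overall QC status is not "bad".
def qc_passing_strains(nextclade_tsv):
    metadata = pd.read_csv(nextclade_tsv, sep="\t")
    passing = metadata[metadata["qc.overallStatus"] != "bad"]
    return passing["seqName"].tolist()
```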
To make these flu builds consistent with previous augur builds, I'm suggesting the following modifications to the current Snakefile:

- Use augur filter to filter sequences by length and outlier status prior to selecting strains
- Modify scripts/select_strains.py to take both sequences and metadata as inputs and to output selected sequences
- Remove the filter rule from the Snakefile (it is no longer needed after the aforementioned modifications)

If select_strains.py outputs filtered sequences themselves instead of strain names, that will remove the need for an --include-only type of option for the augur filter command and make the flu-specific script more consistent with existing augur commands rather than the other way around.
Context
We currently define clades for all Nextstrain flu builds using the same manually curated clade definition files. However, for surveillance purposes, it would be helpful to have a way to identify and talk about new clades before they have reached high enough frequency to get manually annotated.
Description
Assign more granular clade ids to all clades in trees that have at least one amino acid mutation.
Possible solution
Consider using the find_clades.py script (or something like it) from the flu-forecasting project to automatically assign a unique id to each distinct amino acid haplotype. These haplotypes could span all of HA or could be focused on HA1. These annotations could be added just to the nextflu private builds, initially, to avoid confusion with the manually curated clades. Alternately, we could include a description of how these clades are identified, so the annotations would make more sense in the public builds.
Follow up to #126
Will need to look through netlify-cli's changelog to figure out what has changed between 17.2.2 and 17.5.1 (or later).
Current Behavior
When I run the select_strains.py script on a metadata file that has missing values in the age column, it fails with the error:
Traceback (most recent call last):
File "scripts/select_strains.py", line 292, in <module>
metadata = parse_metadata(args.segments, args.metadata, date_format=args.date_format)
File "scripts/select_strains.py", line 207, in parse_metadata
if age_str[-1]=='y':
IndexError: string index out of range
Expected behavior
Similar to the ncov workflow, missing data in the metadata should be permitted and the build should still run.
How to reproduce
Steps to reproduce the current behavior:
augur parse
scripts/construct-recency-from-submission-date.py
augur filter
scripts/select_strains.py
Possible solution
Line 207 in select_strains.py currently assumes that each entry has text. Adding an if statement to allow for empty values could be one solution.
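A sketch of such a guard; the 'y'/'m' suffix handling mirrors the kind of parsing at line 207, but the exact suffixes the script supports are an assumption:

```python
# Sketch: parse an age string like "34y" or "6m", returning None for
# missing or empty values instead of raising IndexError.
def parse_age(age_str):
    if not isinstance(age_str, str) or not age_str:
        return None
    if age_str[-1] == "y":
        return float(age_str[:-1])
    if age_str[-1] == "m":
        return float(age_str[:-1]) / 12.0
    return None
```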
Your environment, if running Nextstrain locally:

- auspice 2.7.0
- nextstrain.cli 3.0.3

We currently annotate clades based on mutations in the HA tree for each subtype. To enable comparisons between HA trees and trees from other segments (like NA), we need to map HA's clade annotations onto the trees from these other segments.
The current HA clades JSON is the same file we'd want to use in the augur export for NA and other non-HA segments (just strain names matched to clades), but if there are any internal nodes in the NA tree that happen to match the HA tree, we'd want to omit those to avoid coloring the wrong internal nodes in NA.
To omit internal nodes, we could create a utility script in scripts/strip_internal_nodes.py that takes any "node data" JSON file from modular augur and outputs the corresponding JSON file with all internal nodes stripped. We could then define one rule for HA clade annotations in the Snakefile and a separate rule for non-HA clade annotations that uses the HA JSON as its input.
The resulting Snakemake rules might look like the following:
```
rule ha_clades:
    message: "Annotating HA clades"
    input:
        tree = rules.refine.output.tree,
        nt_muts = rules.ancestral.output,
        aa_muts = rules.translate.output,
        clade_definitions = "config/clades_{lineage}_ha.tsv"
    output:
        clades = "results/clades_{lineage}_ha_{resolution}.json"

rule non_ha_clades:
    message: "Annotating non-HA clades ({wildcards.segment})"
    input:
        ha_clades = "results/clades_{lineage}_ha_{resolution}.json"
    output:
        non_ha_clades = "results/clades_{lineage}_{segment}_{resolution}.json"
```
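The core of the proposed strip_internal_nodes.py could look like this (a sketch; it assumes internal nodes carry TreeTime's "NODE" name prefix):

```python
# Sketch: strip internal nodes (TreeTime names them with a "NODE" prefix)
# from an augur node-data JSON's "nodes" mapping, keeping only tips.
def strip_internal_nodes(node_data):
    tips_only = {name: values
                 for name, values in node_data.get("nodes", {}).items()
                 if not name.startswith("NODE")}
    return {**node_data, "nodes": tips_only}
```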
I am working on using Nextstrain for global surveillance of influenza. As part of our bioinformatics workflow, we already assign each sequenced virus to a clade (e.g., 3C.2a1b.2a.2) and we can pull this as part of the tabular metadata for a build. Currently, augur clades recomputes this for each node using the definitions file.
I would like to write a clades.json directly from tabular metadata instead of recomputing the clade based on augur clades.
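Producing that node-data JSON from metadata could look like this (a sketch; column names are assumptions):

```python
# Sketch: build an augur node-data "clades" JSON directly from tabular
# metadata rows, instead of recomputing clades from a definitions file.
def clades_json_from_metadata(rows, strain_column="strain", clade_column="clade"):
    nodes = {row[strain_column]: {"clade_membership": row[clade_column]}
             for row in rows}
    return {"nodes": nodes}
```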
An option for augur clades to write the JSON from strain and clade columns would address this.

In the near future, we would like to support region-specific flu builds and also enable other researchers to use this repository to define their own custom flu builds in the same way the ncov repository allows custom workflows.
Seasonal flu builds should follow the same general pattern as ncov builds where a standard workflow depends on an initial selection of sequences (based on some subsampling logic), inferring and annotating phylogenies, and then running build-specific steps to finalize the analysis.
Users should be able to configure their own workflows by modifying configuration files (e.g., YAML or JSON) and/or defining custom Snakemake profiles.
Existing parallel Nextstrain builds ("live" and "WHO") should be executable through this framework, such that we are users of our own workflow configuration system.
See Nextstrain's ncov regional builds definitions for a real example of what we would like to support for flu eventually.
Specific solutions will require more discussion, but the basic first steps toward addressing this issue would be similar to what we had to do for ncov back in March/April:
- Split the workflow into workflow/*.smk files, including a common.smk for shared functions and builds.smk for the main build logic.

Since the same strain can appear multiple times in GISAID, we should provide the unique GISAID accessions we use for each sequence that appears in our trees. We also modify the original strain names, in some cases, to standardize leading zero padding in ids, etc., such that users cannot find records in GISAID's search interface without the accession.
Hi,
I am trying to follow the tutorial under 'Quickstart with GISAID data' so that I can get familiar with the flu Nextstrain build later. I aim to build an H3N2 instance for a period of 10 years. I completed the steps where I have both metadata and FASTA sequences saved with the suggested names metadata.xls and raw_sequences_ha.fasta, respectively. However, when I tried to run nextstrain build --forceall . --configfile profiles/gisaid/builds.yaml, I got the message below.
Building DAG of jobs...
AmbiguousRuleException:
Rules parse and prepare_sequences are ambiguous for the file data/h3n2/ha.fasta.
Consider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive.
Wildcards:
parse: lineage=h3n2,segment=ha
prepare_sequences: lineage=h3n2,segment=ha
Expected input files:
parse: data/h3n2/raw_ha.fasta
prepare_sequences: data/h3n2/raw_sequences_ha.fasta
Expected output files:
parse: data/h3n2/ha.fasta data/h3n2/metadata_ha.tsv
prepare_sequences: data/h3n2/ha.fasta
I later ran nextstrain view auspice, but it didn't display the data contained in raw_sequences_ha.fasta. I wonder where it could have gone wrong. Any advice and feedback would be greatly appreciated!
Thanks,
Swan
We need to calculate tip-frequencies.json via KDE frequencies in order to restore the frequencies panel. I believe this functionality should be part of augur frequencies where --tree is passed in. And then maybe have flags for --method kde and --tips-only or some such.
We should have lbi calculated for all nodes in the tree. I suspect this should just be a simple script in scripts/.
@corneliusroemer experimented with IQ-TREE's constraint tree option to prevent IQ-TREE from putting clades in the wrong place for Nextclade reference trees. This seems like a good way to ensure correct trees, especially in 6m builds that may be lacking context sequences.
One issue brought up in initial Slack thread:
One issue to sort out would be the delimiter in sequence names; right now IQ-TREE renames all / as some weird string.
Current Behavior
The time tree view on nextflu works as expected:
When one deselects the "timetree" option to view the divergence tree, the tree collapses to one position on the x-axis:
Expected behavior
The divergence view should show the branch length in divergence along the x-axis.
How to reproduce
Steps to reproduce the current behavior:
Additional context
It is not clear if this issue is with the seasonal-flu build's tree JSONs or the nextflu code itself. In the latter case, we should transfer this issue to the nextflu repository.
Currently, besides reference genomes, select_strains.py is only passing through viruses that possess both HA and NA segments (due to our use of --segments ha na in the snakefile). For time pivots back about 3-6 months this is okay, and there are usually enough strains with both HA and NA to fill sampling bins. However, some strains only get HA sequenced. This was especially obvious looking just now, where there are a number of H3s from January with just HA. These tend to be uploaded by groups that are not CCs.
I would propose to modify select_strains.py so that a "complete" genome (as in possessing all entries in --segments) becomes another factor in priority rather than a hard constraint.
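The proposed change could weight completeness into the priority score rather than filtering on it (a sketch; the function and weighting are illustrative, not the script's actual interface):

```python
# Sketch: treat genome completeness as a priority factor instead of a
# hard requirement, so HA-only strains can still fill sampling bins.
def strain_priority(available_segments, required_segments, base_priority=0.0):
    completeness = (len(set(required_segments) & set(available_segments))
                    / len(required_segments))
    return base_priority + completeness
```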
The current script to import tip clades from HA to non-HA segments identifies internal nodes by checking that the first four characters of the node name match "NODE". The fact that internal nodes have this specific name is a design decision from TreeTime. This approach tries to infer the data structure from the data contents when information about the structure is already available in an external Newick file.
I'd suggest we pass the original HA tree Newick to this script and use the Bio.Phylo is_terminal method to identify internal nodes instead of testing for specific text. This change will prevent the script from silently breaking when/if TreeTime names internal nodes something different in the future and will avoid confusion when someone would otherwise have to figure out where the "NODE" text comes from (I admit this took me a little while to track down). Although this change couples the script to the Bio.Phylo interface, that interface is something more easily grepped for and replaced during a future refactor.
The Newick tree for the current segment is already an input for the clades rule, so it would not be difficult to force that input tree to always be the HA tree instead.
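A sketch of the tree-based check (assumes Biopython is available; in the workflow the Newick would be read from the HA tree file rather than a string):

```python
from io import StringIO
from Bio import Phylo

# Sketch: read the HA Newick tree and collect internal node names with
# Bio.Phylo's is_terminal, instead of matching the "NODE" text prefix.
def internal_node_names(newick_string):
    tree = Phylo.read(StringIO(newick_string), "newick")
    return {clade.name for clade in tree.find_clades()
            if not clade.is_terminal() and clade.name}
```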
Context
H3N2 has an emerging clade from A1b/135K that has reached high enough frequency to start tracking. We agreed to name this clade A1b/186D and annotate it based on its 135K, 186D, and 190N mutations.
H1N1pdm has several previously identified clades that we haven't annotated yet. These include:
And a newer clade:
Description
We need to add annotations for these new clades to the corresponding clades TSV files in the builds and confirm visually that they work as expected.
We produce monthly reports describing seasonal influenza circulation patterns. A substantial portion of time spent on these reports is tedious copying and pasting or running commands manually that could be automated.
The following specific tasks will minimize the manual work associated with reports and allow us to focus on the science.
Fewer than 1% of strains sampled between 2017 and 2019 have ambiguous dates, but when these strains are included in a seasonal flu build, the pipeline assigns them two different effective dates. The first date is inferred by TreeTime. The second date is read from the metadata and converted to a numerical value prior to frequency estimation.
These differences lead to examples like A/Malaysia/RP0118/2019 which has an ambiguous date of 2019-XX-XX, a TreeTime-inferred date of January 1, 2019, and a brute-force guess from metadata of June 2019. For this example, the user interface confusingly displays two different dates as shown below. This problem manifests in the forecasting pipeline where tips with non-zero frequencies are used to project the population one year into the future. In the example below, the strain with an ambiguous date is included in the forecast because of its frequency estimation date even though it was most likely circulating 6 months earlier.
We could fix this issue in a couple ways:
The first solution seems to be the simplest and since fewer than 1% of recent sequences have ambiguous dates, this filter shouldn’t adversely affect the quality of the flu builds. This solution only requires a change to the seasonal flu build.
The second solution requires a change to a core augur interface. It might be nice in the long run to estimate frequencies from the most accurate information available, though. Tip frequency estimation already requires the Newick tree from augur refine, so adding the branch_lengths.json as an input to frequencies wouldn’t require any major rewiring of existing pipelines.
My preference is for the first solution.
I think there's a minor typo in the README:
These antigenic data come in four flavors depending on the assay that passage history of the antigens.
Should possibly be: "assay and passage history"
Submitting this here per conversations on Slack and with Tom.
The latest H3N2 WHO builds failed multiple times when they were run on AWS, seemingly because they exceed the disk limit (22GB) allocated to instances on AWS. They were submitted using:
python batch.py --system batch --version who
The logs for the failed run can be found here. The build seems to fail during the final zip step, which points toward an issue with disk space. This will hopefully be helped by nextstrain/augur#278, but as a short-term solution can also be helped by subdividing WHO builds by more than just subtype.
Current Behavior
Titer records define strain names for test and reference viruses, but we do not automatically cross-check these names with existing names in GISAID/GenBank records. As a result, there can be misspelled strain names in the titer records that lead us to omit these measurements from analyses when they do not match the corresponding sequence record's strain name.
Tal from the Bloom lab has helpfully provided a list of misspellings that he detected in our Crick titers:
"Misspelled Virus Name","Correct Virus Name"
"A/Camb/925256/2020","A/Cambodia/925256/2020"
"A/Christchurch/4/1985","A/ChristChurch/4/1985"
"A/Christchurch/515/2019","A/ChristChurch/515/2019"
"A/Christchurch/515/2019-egg","A/ChristChurch/515/2019-egg"
"A/CotedIvore/544/2016","A/CoteDIvoire/544/2016"
"A/Eng/538/2018","A/England/538/2018"
"A/Greecd/4/2017","A/Greece/4/2017"
"A/Hk/5738/2014","A/HongKong/5738/2014"
"A/Hk/656/2018","A/HongKong/656/2018"
"A/Hk/675/2018","A/HongKong/675/2018"
"A/Lyon/CHU/R1811667/2018","A/Lyon/CHU-R1811667/2018"
"A/Lyon/CHU/R181282/2018","A/Lyon/CHU-R181282/2018"
"A/Lyon/CHU/R1813393/2018","A/Lyon/CHU-R1813393/2018"
"A/Lyon/CHU/R190259/2019","A/Lyon/CHU-R190259/2019"
"A/Lyon/CHU/R190377/2019","A/Lyon/CHU-R190377/2019"
"A/Lyon/CHU/R1914685/2019","A/Lyon/CHU-R1914685/2019"
"A/Lyon/CHU/R1915450/2019","A/Lyon/CHU-R1915450/2019"
"A/Lyon/EHPAD/108/2019","A/Lyon/EHPAD-108/2019"
"A/Nor/2516/2018","A/Norway/2516/2018"
"A/Nor/2620/2018","A/Norway/2620/2018"
"A/Nor/4436/2016","A/Norway/4436/2016"
"A/Norway/3806-egg","A/Norway/3806/2016-egg"
"A/Singapore/INFIMH-16-001/2016","A/Singapore/INFIMH-16-0019/2016"
"A/Singapore/INFIMH-16-001/2016-egg","A/Singapore/INFIMH-16-0019/2016-egg"
"A/Singapore/Infimh-16-0019/2016","A/Singapore/INFIMH-16-0019/2016"
"A/Singapore/Infimh-16-0019/2016-egg","A/Singapore/INFIMH-16-0019/2016-egg"
"A/StEtienne/1912/2018","A/Saint-Etienne/1912/2018"
"A/StEtienne/1998/2018","A/Saint-Etienne/1998/2018"
"A/StEtienne/2539/2020","A/Saint-Etienne/2539/2020"
"A/Stock/6/2014","A/Stockholm/6/2014"
"A/Switz/8060/2017-egg","A/Switzerland/8060/2017-egg"
"A/Switzerlandz/8060/2017-egg","A/Switzerland/8060/2017-egg"
Expected behavior
Misspelled strains in the list above should match their sequence strain names.
Possible solution
In addition to manually correcting these records in our database, we should also consider flagging any titer records with potential misspellings. One easy check would be for records whose test or reference strains don't have corresponding records in the sequence database.
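The cross-check proposed above could be sketched roughly as follows. This is an illustrative sketch, not an existing fauna/augur command; the function name and the case-insensitive matching rule are assumptions, motivated by misspellings in the list above that differ only by case (e.g. "Infimh" vs. "INFIMH").

```python
def flag_unmatched_strains(titer_strains, sequence_strains):
    """Return titer strain names with no matching sequence record.

    Matching is case-insensitive, since several known misspellings
    (e.g. "A/Singapore/Infimh-16-0019/2016") differ only by case.
    This helper and its matching rule are illustrative assumptions.
    """
    known = {name.lower() for name in sequence_strains}
    return sorted(s for s in titer_strains if s.lower() not in known)

titers = ["A/Greecd/4/2017", "A/Cambodia/925256/2020"]
sequences = ["A/Greece/4/2017", "A/Cambodia/925256/2020"]
print(flag_unmatched_strains(titers, sequences))  # → ['A/Greecd/4/2017']
```

A fuzzy comparison (e.g. edit distance against known strain names) could extend this to suggest likely corrections rather than only flagging mismatches.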
The H3N2 NA build on nextstrain.org contains a major mutation (I469T) that falls outside the range available for genotype selection in the 'Color By' option. The selectable range is 1 - 468, so the mutation at position 469 cannot be selected.
This link should work:
https://nextstrain.org/flu/seasonal/h3n2/na/2y?branchLabel=aa&c=gt-NA_469
But nextstrain will return the standard tree. See NODE_0001125 for the mutation I am referring to.
Change the range to 1 - 469
Context
The workflow uses a custom Python script to perform codon-aware alignment, to resolve issues with variable sites in mafft's alignments. Since @rneher wrote this script, he and @ivan-aksamentov developed nextalign. Even though nextalign was developed with SARS-CoV-2 in mind, it works well for H3N2 HA sequences, too, and is way faster.
Description / proposed solution
We should try using nextalign for our flu builds. We'll need to set up the corresponding FASTA reference files for all of the lineages and segments or maybe implement GenBank file support in nextalign (whichever is easier...I can guess which though!). Then, we can take advantage of the codon-aware alignment functionality and also run alignments with multiple threads to speed up that step of our builds.
Once we have nextalign in place, we could start to do analyses that previously would have taken too long like creating multiple sequence alignments of all amino acid sequences for HA and running the titer substitution model on all available sequences and titers.
Hi,
I'd like to build the analysis for a custom flu dataset.
It seems the documentation and tutorials are not detailed enough for the flu build.
For instance, I was able to run the zika virus tutorial, but the input file formats look different.
The seasonal flu build using the example data in this repository works fine, but again, the input file format looks different.
So the questions are, if I want to run the analysis using a custom dataset:
The example data tells me some information about the format, but I was unable to run the nextstrain build on my own sequence and titer data.
Thank you for any advice.
Strain submission to GISAID and other databases often lags by one to two months from the actual collection date. When we make new seasonal flu builds, it would be helpful to identify which strains were recently added to the database.
The ncov builds address this problem by adding a "recency" color-by attribute. This categorical attribute bins submission dates into useful groups for ncov analysis.
For flu, a similar categorical attribute that highlights sequences submitted in the last month or before should be sufficient.
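A minimal sketch of such a categorical attribute is below. The specific bin edges (one week / one month / older) and category labels are illustrative assumptions, not the exact bins used by the ncov builds.

```python
from datetime import date

def recency_category(submission_date, build_date):
    """Bin a submission date relative to the build date.

    The bin edges and labels here are illustrative assumptions;
    the real builds would choose bins meaningful for flu submissions.
    """
    days = (build_date - submission_date).days
    if days <= 7:
        return "last week"
    if days <= 30:
        return "last month"
    return "older"

print(recency_category(date(2021, 3, 28), date(2021, 4, 1)))  # → last week
print(recency_category(date(2021, 3, 10), date(2021, 4, 1)))  # → last month
print(recency_category(date(2021, 1, 5), date(2021, 4, 1)))   # → older
```

Emitting these categories per strain as a node data JSON would let Auspice expose recency as a color-by, mirroring the ncov approach.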
Require Augur version 15.0.0 and update the pipeline to use the --min-date/--max-date options in augur filter instead of filtering by date within select_strains.
We need to calculate ep, ne, and rb for each node in the tree. I know that @huddlej has a clean way to do this well. I don't know whether this is better as augur scores or augur sequence-traits, or if it's better as a separate script in seasonal-flu/scripts/.
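A rough sketch of the per-branch part of such a calculation is below. The epitope site set is a toy subset and the node-data layout is an assumption for illustration; the real scores would use the published epitope/RBS site definitions and typically accumulate counts along the path from the root to each node rather than per branch.

```python
# Hedged sketch: count epitope ("ep") and non-epitope ("ne") mutations per
# branch, given each node's amino-acid mutations relative to its parent.
# EPITOPE_SITES is a toy subset of 1-based HA1 positions, not a real site map.
EPITOPE_SITES = {145, 155, 156, 158, 159, 189, 193}

def score_nodes(node_mutations):
    """node_mutations: {node_name: ["K145N", ...]} -> node-data-style dict."""
    nodes = {}
    for name, muts in node_mutations.items():
        positions = [int(m[1:-1]) for m in muts]  # strip ancestral/derived residues
        ep = sum(1 for p in positions if p in EPITOPE_SITES)
        nodes[name] = {"ep": ep, "ne": len(positions) - ep}
    return nodes

print(score_nodes({"NODE_0000001": ["K145N", "T131K"], "strain_a": []}))
```

Wrapping the result as `{"nodes": ...}` would make it a node data JSON consumable by augur export, whichever home (augur scores vs. a standalone script) the calculation ends up in.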
Currently augur titers sub outputs a JSON that has entries for avidity, potency, substitution, and titers, but lacks a nodes entry that could be used to decorate the final tree.json via augur export. This change needs to happen to augur titers sub first, and then the seasonal build can be updated.
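The shape of the proposed change could look something like this. The per-strain field name ("cTiterSub") and the source of the per-strain values are assumptions for illustration; only the existing top-level keys (avidity, potency, substitution, titers) come from the issue.

```python
# Sketch: derive a "nodes" entry from per-strain antigenic advance values so
# the titer model JSON can be consumed as node data by `augur export`.
def add_nodes_entry(titer_model, advance_by_strain):
    """Return a copy of the titer-sub model JSON with a node-data entry.

    The "cTiterSub" field name is a hypothetical choice for illustration.
    """
    out = dict(titer_model)  # shallow copy; existing entries are kept as-is
    out["nodes"] = {
        strain: {"cTiterSub": value}
        for strain, value in advance_by_strain.items()
    }
    return out

model = {"avidity": {}, "potency": {}, "substitution": {}, "titers": {}}
decorated = add_nodes_entry(model, {"A/HongKong/4801/2014": 1.8})
print(sorted(decorated))  # 'nodes' now sits alongside the existing keys
```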
We currently estimate tip frequencies up to the current date of a given build, but the latest available sequences are often collected one or two months prior to the current date. As a result, the frequencies misrepresent what we know about the present. For H3N2, this lack of data also makes the forecasts less reliable because they end up being 14-15 month forecasts instead of 12-month forecasts.
Instead of using the current date as the max date for tip frequencies, we should dynamically determine the max date from the available sequences in the analysis, rounding up to the first day of the next month. For example, if the latest available sequence is from April 15, we would set the max date to May 1.
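The rounding rule described above is simple enough to sketch directly; the function name is hypothetical.

```python
from datetime import date

def frequencies_max_date(latest_collection_date):
    """Round the latest collection date up to the first day of the next month."""
    year, month = latest_collection_date.year, latest_collection_date.month
    if month == 12:
        return date(year + 1, 1, 1)
    return date(year, month + 1, 1)

print(frequencies_max_date(date(2020, 4, 15)))   # → 2020-05-01
print(frequencies_max_date(date(2020, 12, 20)))  # → 2021-01-01
```

The resulting date would then be passed as the max date to the tip frequency estimation step instead of the build's current date.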
Current Behavior
@kistlerk noticed that the demarcated region for HA1 is not divisible by 3 in the HA reference files for Vic and Yam.
Expected behavior
The CDS for HA1 should be 57..1094 rather than 57..1095, and SigPep should be 12..56.
CDS coordinates in GenBank format are 1-based and closed. To calculate the amino acids in a given range, you have to add 1 to the range before dividing by 3. For H3N2's HA, we get the 16 amino acids of SigPep by (48 - 1 + 1) / 3 and the 329 of HA1 with (1035 - 49 + 1) / 3.
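The arithmetic above can be captured in a small helper that also catches ranges that aren't divisible by 3; the function name is hypothetical.

```python
def cds_amino_acid_length(start, end):
    """Amino acids encoded by a 1-based, closed GenBank CDS range."""
    length = end - start + 1
    if length % 3 != 0:
        raise ValueError(f"{start}..{end} spans {length} nt, not divisible by 3")
    return length // 3

print(cds_amino_acid_length(1, 48))     # H3N2 SigPep → 16
print(cds_amino_acid_length(49, 1035))  # H3N2 HA1 → 329
print(cds_amino_acid_length(57, 1094))  # corrected Vic/Yam HA1 → 346
# cds_amino_acid_length(57, 1095) raises: 1039 nt is not divisible by 3.
```

Running such a check over all reference files would catch any other demarcated regions with off-by-one coordinates.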
You already upload raw FASTA to AWS in rule upload_sequences. It would be useful for our https://github.com/neherlab/flu_frequencies repo to be able to download metadata.tsv as produced by your rule parse. I will duplicate your parse rule for now, but in the future it would be nice to reduce code duplication and parse in one place. Is there a reason you currently upload only the raw data and not the parsed data?
@rneher ---
I don't think I introduced this bug, but you first might want to check one of your existing JSONs to confirm it's present there as well. If I build flu/who/h3n2/ha/2y/cell/hi, the substitution model inference of tip antigenic advancement looks good via auspice v2. However, the display in auspice v1 (nextflu/auspice) doesn't register these antigenic changes. I thought this might be due to how substitutions are encoded in flu_who_h3n2_ha_2y_cell_hi_titer-sub-model.json, but that actually looks good. Thinking now it may be entirely an issue with auspice v1.
Context
The most recent vaccine strains for the upcoming Northern Hemisphere season aren't included in live builds.
Description
These strains need to be added to the vaccines JSON and references files for H3N2 and H1N1pdm.
We currently assign clade labels to trees in our main phylogenetic workflow using the augur clades command and the influenza clade nomenclature TSVs. However, clade assignments vary for some samples between these public/private trees and the Nextclade trees. Clade assignments can also vary between different runs of public/private trees from the same time period, because our random subsampling logic produces trees with different sample compositions. These mismatches can cause confusion for users who look at both Nextclade outputs for their own data and the public/private Nextstrain trees.
Since Nextclade already provides a standard clade label interface, we should use Nextclade to annotate clades in our main phylogenetic workflow instead of augur clades. This change will ensure that samples are assigned to the same clade regardless of the sample composition of a given public/private tree.
In the short term, we could replace our nextalign alignment with nextclade using the corresponding reference dataset for each subtype. We would need to replace the current augur clades command with functionality like @corneliusroemer proposed in nextstrain/augur#1329 that allows us to assign clades to internal nodes and branches, for complete backward compatibility of clade display in Auspice. Instead of inferring clades for internal nodes as a discrete trait, we could consider assigning clades with Nextclade to the inferred ancestral sequences for those nodes.
In the long (medium?) term, we could run Nextclade during our "data upload to S3" workflow, upload the alignments and Nextclade annotations joined with metadata, and then start our workflows with those files. This approach would allow us to skip the alignment and clades steps of the current workflow and it would provide useful Nextclade data on S3 that we need for other analyses like flu frequencies, etc.