Scripts, config, and snakefiles for seasonal-flu nextstrain builds
We need to know which strains and clades do not have titer measurements from different assays, passage types, and sources. We previously wrote a script to annotate the number of titer measurements per node as a node data JSON file. This script's output plus an Auspice config JSON coloring entry allowed us to color the tree by total titers or counts binned into larger categories.
We should implement this annotation in the refactored workflow for the private builds, at least. We could also use the Auspice config JSON to set up meaningful bins for the counts as a legend scale.
It would also be helpful to be able to filter the tree by presence/absence of measurements in the different categories, so we could focus on only tips that have no measurements.
Context
HI assays do not perform as well for recent H3N2 viruses as FRA/HINT assays (3c3.A is the exception), and HI assays produce different titer model parameters than the FRA/HINT assays. We know these assays measure different aspects of viral infection and that FRA/HINT assays were historically performed when HI assays didn't work, making direct comparisons between these data biased.
Although I had previously thought that FRA/HINT data were more numerous than HI data, the number of measurements submitted since Jan 1, 2019 is nearly the same for both assay types. It turns out that the majority of our FRA/HINT data come from the CDC and the majority of the HI data come from VIDRL. This is an important distinction because the live builds only include data from the CDC and are thus relatively depleted for HI data. Here is the breakdown of recent data based on the inclusion date:
>>> hi.query("inclusion_date > '2019-01-01'")["center"].value_counts()
vidrl 21415
crick 6621
cdc 3108
>>> fra.query("inclusion_date > '2019-01-01'")["center"].value_counts()
cdc 18649
vidrl 3987
niid 3851
crick 1097
Description
We should try building a 2y H3N2 HA tree with HI data and another with FRA data, keeping all other parameters of the builds the same (including the forecasting model). Then, we should compare the trees and the corresponding titer model results head-to-head. We should look for:
I cloned the seasonal-flu repo and installed the Nextstrain CLI with the conda setup.
I tried to run the example build (nextstrain build . targets/flu_seasonal_h3n2_ha_12y), but it does not work, producing a "MissingRuleException."
Before the refactored workflow update, I was able to build with the same steps and example data.
I'd like to run the same with the updated, refactored workflow.
Steps to reproduce the current behavior:
curl -fsSL --proto '=https' https://nextstrain.org/cli/installer/linux | bash; printf '\n%s\n' 'eval "$("/home/limh25/.nextstrain/cli-standalone/nextstrain" init-shell bash)"' >> ~/.bashrc; eval "$("/home/limh25/.nextstrain/cli-standalone/nextstrain" init-shell bash)"
python3 -m pip install nextstrain-cli; nextstrain setup conda; nextstrain setup --set-default conda
(nextstrain) [limh25@machine1 seasonal-flu]$ nextstrain build . targets/flu_seasonal_h3n2_ha_12y --configfile /app/limh25/seasonal-flu/profiles/example/builds.yaml --cores 8 --printshellcmds --show-failed-logs --reason --verbose
Building DAG of jobs...
Full Traceback (most recent call last):
File "/home/limh25/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/__init__.py", line 771, in snakemake
success = workflow.execute(
File "/home/limh25/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/workflow.py", line 741, in execute
dag.init()
File "/home/limh25/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/dag.py", line 210, in init
self.file2jobs(file),
File "/home/limh25/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/dag.py", line 1984, in file2jobs
raise MissingRuleException(targetfile)
snakemake.exceptions.MissingRuleException: No rule to produce targets/flu_seasonal_h3n2_ha_12y (if you use input functions make sure that they don't raise unexpected exceptions).
MissingRuleException:
No rule to produce targets/flu_seasonal_h3n2_ha_12y (if you use input functions make sure that they don't raise unexpected exceptions).
unlocking
removed all locks
(nextstrain) [limh25@machine1 seasonal-flu]$ nextstrain check-setup
nextstrain-cli is up to date!
Testing your setup…
# docker is not supported
✘ no: docker is installed
✘ no: docker run works
? unknown: containers have access to >2 GiB of memory
✔ yes: image is new enough for this CLI version
# conda is supported
✔ yes: operating system is supported
✔ yes: runtime data dir doesn't have spaces
✔ yes: runtime appears set up
✔ yes: snakemake is installed and runnable
✔ yes: augur is installed and runnable
✔ yes: auspice is installed and runnable
# singularity is not supported
✘ no: singularity is installed
✘ no: singularity version None ≥ 3.0.0
✘ no: singularity works
# ambient is not supported
✘ no: snakemake is installed and runnable
✘ no: augur is installed and runnable
✘ no: auspice is installed and runnable
# aws-batch is not supported
✘ no: job description "nextstrain-job" exists
✘ no: job queue "nextstrain-job-queue" exists
✘ no: S3 bucket "nextstrain-jobs" exists
Supported Nextstrain runtimes: conda
All good! Default runtime (conda) is supported.
There was no problem running the example build before the refactoring update. Now it does not work as expected.
Please help me run the example build.
Please also let me know if any further command-line outputs are needed to debug.
Thank you.
The current auspice configuration file (config/auspice_config.json) is not always appropriate for all builds in the Snakefile. The most obvious example of this is that the auspice config exports a color-by for “RBS mutations” even though A/H3N2 HA builds are the only ones that define RBS mutations.
This feels like an issue that should be resolved at the level of the Snakefile and not at the augur or auspice level (it's not necessary for those tools to know that this is how we are using them). Possible solutions at the Snakefile-level include:
- Define a separate Auspice config for every build (e.g., config/auspice_config_{lineage}_{segment}_{resolution}.json)
- Define a custom Auspice config only for the builds that need one (e.g., config/auspice_config_h3n2_ha.json) and fall back to a shared default config for the rest
- Generate per-build configs from a shared template

The first solution seems the least ideal in that most of these configs will be identical and any update to the shared content between them will require the same updates to many files.
The second solution still duplicates shared content but benefits from defining build-specific conditional logic in Python in the Snakefile. For seasonal flu, it seems likely that the custom config will be the exception and not the rule, so a little duplicated content may not be the worst outcome.
The third solution is guaranteed to avoid content duplication, but it does move the build-specific conditional logic into a templating language. This might be the best solution in the long run if each build ends up having its own slight modifications to the shared settings. In the short term, though, it feels like an over-engineered approach.
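For the second solution, a Snakemake input function could pick a build-specific config when one exists (a sketch; the function name, wildcard names, and file layout are assumptions, not existing workflow code):

```python
import os

# Sketch of the second solution: prefer a build-specific Auspice config
# when one exists on disk, otherwise fall back to the shared config.
def auspice_config_path(wildcards):
    custom = f"config/auspice_config_{wildcards.lineage}_{wildcards.segment}.json"
    if os.path.exists(custom):
        return custom
    return "config/auspice_config.json"
```

A rule's `input:` section could then reference `auspice_config_path` directly, keeping the conditional logic in Python rather than in duplicated config files.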
The subsampling logic for strain selection in original augur supported a "priority" sampling for specific regions. This type of prioritization is helpful for custom flu builds where one wants to investigate evolutionary patterns in specific regions. The current strain selection script for seasonal flu exposes the same --sampling flag used by old augur, but this flag is ignored.
To enable preferred sampling of regions by the current script, some of the code from the original subsampling script needs to be migrated.
We currently annotate full official clade names for all lineages, but for H3N2 and H1N1pdm the longer official clade names can be difficult to read in the Nextstrain view. We also often need to propose/follow new subclades before these are officially recognized.
A similar situation exists for SARS-CoV-2 where we support Nextstrain clades, PANGO lineages, and "emerging lineages" as alternate branch labels in Auspice.
We should set up parallel "clade" names (i.e., Nextstrain clades) and "full clade" names (i.e., official WHO clade names) in an analogous fashion to how we have "clade" and "emerging lineage" in ncov. This would be both a coloring and toggle under the "branch labels".
We should default to showing the Nextstrain clade names which we'll abbreviate to "2a.2", etc... but have the full official clade names available. Nextclade will similarly have two fields for flu clades.
The full official clade names are defined in files named like clades_h3n2_ha.tsv while the Nextstrain clades are in files named like nextstrain_clades_h3n2_ha.tsv. We need to either define a separate clade annotation rule for the Nextstrain clades or modify the existing clades rule to be parameterized by the set of clade definitions to use.
The ncov workflow defines separate rules to annotate clades and emerging lineages. This workflow also needs an additional rule to add branch labels for emerging lineages to the Auspice JSON. However, if we could first merge @jameshadfield's excellent Augur PR to add support for arbitrary branch labels in augur clades, we could avoid this additional complexity in the seasonal flu workflow.
We also need to redefine Nextstrain clades as shortened versions of the full official clade names, keeping the official capitalization and abbreviating on dot. For example:
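A sketch of the abbreviation rule (keeping the last two dot-delimited components is an assumption based on the "2a.2" example above):

```python
# Sketch: shorten a full WHO clade name by keeping only the last few
# dot-delimited components, preserving the official capitalization.
# Keeping two components is an assumption based on the "2a.2" example.
def abbreviate_clade(full_name, keep=2):
    parts = full_name.split(".")
    return ".".join(parts[-keep:])
```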
Before we can deploy newly built Auspice JSONs to the Nextstrain group, we need to rename them to a unique name that reflects the date they were built. The builds need to be renamed from names like h1n1pdm_2y_titers_ha.json to flu_seasonal_YYYY-MM-DD_h1n1pdm_2y_titers_ha.json where YYYY-MM-DD is the date the builds were run. We currently rename these files manually to include the date.
The nextstrain-public builds provide an example of custom rules to rename the JSONs prior to deployment. The new names for the JSONs get defined as a template string in the build configuration parameter called auspice_name. The rename rules get invoked by the custom deploy rule, which is the primary target of the workflow we run from GitHub Actions.
We should follow the same pattern for the nextflu-private builds, except we need to calculate the current date when the workflow runs. One approach could be to assign the current date to a global variable in the rename.smk Snakemake file, define an auspice_name for the builds with a non-wildcard placeholder for the date like "flu_seasonal_{{date}}_{lineage}_{{segment}}_{resolution}", and then pass the date value to the format call on the filename strings (e.g., this one and this one).
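A minimal sketch of that approach, assuming wildcards are filled in first and the escaped date placeholder is substituted in a second format call (function and variable names are illustrative):

```python
import datetime

# Compute the build date once, when the workflow is parsed.
build_date = datetime.date.today().strftime("%Y-%m-%d")

def deployed_filename(auspice_name, wildcards, date=None):
    """Fill in build wildcards first, then the escaped {{date}} placeholder."""
    date = date or build_date
    return auspice_name.format(**wildcards).format(date=date) + ".json"
```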
TODO:

- Define an auspice_name string for builds in profiles/nextflu-private.yaml with reference to the date (datetime.date.today().strftime("%Y-%m-%d"))
- Set up a dedicated deploy user for the group (nextflu-private-bot)
- Pass credentials to the build with --env NEXTSTRAIN_USERNAME and --env NEXTSTRAIN_PASSWORD. These environment variables should already exist in the GitHub Action thanks to setup by the pathogen-repo-build action
- Add a nextstrain login command to the custom Snakemake deploy rule, which will rely on credentials in environment variables
- Rename forecast JSONs from names like h3n2_2y_ha_cell_fra_cTiter_x-ne_star_forecast-tip-frequencies.json to flu_seasonal_YYYY-MM-DD_h3n2_2y_ha_cell_fra_cTiter_x-ne_star_forecast-tip-frequencies.json (or just store as auspice/forecasts/{build_name}/{model}.json and copy to auspice/forecasts/{build_date}/{build_name}/{model}.json)

Some discussion with Becky:
Egg-passaged sequences currently appear in seasonal flu builds. These should be filtered out during the augur filter step by excluding viruses with an "egg" passage history in the metadata.
Currently, epitope (ep) and non-epitope (ne) coloring for H1N1pdm NA is broken:
Context
To enable SARS-CoV-2 narratives based on data at specific timepoints, we've maintained dated Auspice JSONs that are automatically generated as part of the standard Nextstrain workflow. These dated builds follow the naming format of auspice/ncov_{build_name}_{date}.json
where the date is in YYYY-MM-DD
format.
Description
To support similar narrative reports on the status of influenza, we should create dated copies of builds for the four lineages. Since our reporting usually only needs recent context, we can start by creating these copies only for the 2y builds of HA and NA. With tree/meta, tip frequencies, and root sequence JSONs, this approach should produce 24 JSONs per day (4 lineages * 2 segments * 3 JSONs per build).
Possible solution
Following the ncov workflow example, the resulting dated JSONs would be named like:
flu_seasonal_h3n2_ha_2y_2021-04-20.json
And would produce a URL like https://nextstrain.org/flu/seasonal/h3n2/ha/2y/2021-04-20.
An alternate solution that would be more consistent with the hierarchical grouping standard of HTTP APIs would place the date earlier in the name such that all builds from that date could be displayed below. An example dated JSON would look like:
flu_seasonal_2021-04-20_h3n2_ha_2y.json
And the URL would look like https://nextstrain.org/flu/seasonal/2021-04-20/h3n2/ha/2y. When we eventually support browsing builds by exploring the URL's API, this hierarchical layout would enable a page at https://nextstrain.org/flu/seasonal/2021-04-20/ that could show all builds from that day.
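The two layouts differ only in where the date lands in the name; a sketch (the helper is illustrative, not existing workflow code):

```python
# Sketch: compose a dated Auspice JSON filename under either layout.
# build_name is e.g. "h3n2_ha_2y"; date is e.g. "2021-04-20".
def dated_name(build_name, date, date_first=False):
    if date_first:
        return f"flu_seasonal_{date}_{build_name}.json"
    return f"flu_seasonal_{build_name}_{date}.json"
```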
Open question: Should we produce these dated builds for all trees on the private site, too, or only the public site?
Context
To understand circulation patterns of flu within a specific country, we need to know which divisions (e.g., states, provinces, etc.) strains were collected in. We currently only annotate country-level geographic resolution in our Auspice JSONs.
Description
We should export division as an annotation in Auspice JSONs. To do this, we need to:

- Add "division" to the list of geo_resolutions in each Auspice config JSON
- Add a division color-by so we can color and filter by division annotations from the metadata

My guess is that the lat/long database from ncov will be substantially better than the current flu database, so we could start by copying from ncov and filling in blanks as we go.
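The geo_resolutions change might look like the following Auspice config fragment (a sketch; only the relevant key is shown):

```json
{
  "geo_resolutions": ["country", "division"]
}
```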
Context
Phylogenetic trees of HA/NA help us identify emerging clades that should be tracked by routine surveillance. For each of these emerging clades, we need a way to identify candidate strains for testing by experimental assays like HI, FRA, HINT, etc.
Description
Identify one or more prototype virus strains for each putative clade (identified as in #65). The prototypes must be real strains. These could represent strains identified computationally or experimentally as the best representatives.
Possible solution
Produce reference alignments with currently circulating diversity and identify the closest strain(s) to each auto-assigned clade. Most likely, these strains will need to be manually curated after the clades are automatically identified.
Consider labeling the auto-assigned clades (from #65) with the names of the prototype strains we curate here.
Despite our myriad quality filters in the workflow including a global clock rate filter, local clock rate filter, an outliers list, and dropping sequences with poor alignments, our trees still occasionally include low-quality sequences that would have been flagged with a "bad" QC status by Nextclade. In most of my recent experiences with this issue, the low-quality sequences have too many private mutations.
In the best case, these low-quality sequences look strange in the tree. In the worst case, these sequences break the date inference for internal nodes and produce an invalid time tree topology, requiring the builds to be run again.
Since we already plan to migrate away from Nextalign to Nextclade, we should align sequences with Nextclade and produce the metadata output file with QC statuses. We should filter any sequences that have a QC status of "bad".
We could approach this functionality in a couple of ways:
- Run Nextclade within the phylogenetic workflow, filter low-quality sequences (e.g., with augur filter on the alignment sequences and "metadata" from Nextclade), and pass the filtered alignment to the tree rule. We'd probably want to merge the original metadata records with the Nextclade metadata prior to filtering, much like the merge we do in the flu_frequencies workflow.

OR

- Run Nextclade upstream of the phylogenetic workflows, store the merged metadata centrally, and let each build reuse those QC annotations.
The benefit of the first approach is that we could implement it now without much additional infrastructure planning, since the changes all happen inside the phylogenetic workflow. The annotations would be very fast, since we'd only run Nextclade on the subsampled data. The main disadvantage is the additional complexity to the workflow and the redundant runs of Nextclade across multiple builds for the same lineages and segments.
The benefits of the second approach are that it would introduce no complexity to the existing workflow and it would produce a valuable resource that other current workflows (like flu_frequencies) and future workflows (forecasts?) could benefit from. The disadvantage is the additional infrastructural complexity of setting up the Nextclade runs with GitHub Actions for different references (e.g., A/Wisconsin/67/2005 or A/Darwin/6/2021 for H3N2) and storing the merged metadata in S3 in a way that allows us to unambiguously grab the outputs from the desired Nextclade dataset.
I think we want to end up at the second approach eventually, so maybe it is worth the extra planning effort to figure that approach out now instead of using the first approach.
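Under either approach, the QC filter itself is simple. A sketch with pandas, assuming Nextclade's TSV output with its seqName and qc.overallStatus columns:

```python
import pandas as pd

# Sketch: keep only sequences whose Nextclade overall QC status is not "bad".
def qc_passing_strains(nextclade_tsv):
    metadata = pd.read_csv(nextclade_tsv, sep="\t")
    passing = metadata[metadata["qc.overallStatus"] != "bad"]
    return passing["seqName"].tolist()
```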
To make these flu builds consistent with previous augur builds, I'm suggesting the following modifications to the current Snakefile:

- Use augur filter to filter sequences by length and outlier status prior to selecting strains
- Modify scripts/select_strains.py to take both sequences and metadata as inputs and to output selected sequences
- Remove the filter rule from the Snakefile (it is no longer needed after the aforementioned modifications)

If select_strains.py outputs filtered sequences themselves instead of strain names, that will remove the need for an --include-only type of option for the augur filter command and make the flu-specific script more consistent with existing augur commands rather than the other way around.
Context
We currently define clades for all Nextstrain flu builds using the same manually curated clade definition files. However, for surveillance purposes, it would be helpful to have a way to identify and talk about new clades before they have reached high enough frequency to get manually annotated.
Description
Assign more granular clade ids to all clades in trees that have at least one amino acid mutation.
Possible solution
Consider using the find_clades.py script (or something like it) from the flu-forecasting project to automatically assign a unique id to each distinct amino acid haplotype. These haplotypes could span all of HA or could be focused on HA1. These annotations could be added just to the nextflu private builds, initially, to avoid confusion with the manually curated clades. Alternately, we could include a description of how these clades are identified, so the annotations would make more sense in the public builds.
Follow up to #126
Will need to look through netlify-cli's changelog to figure out what has changed between 17.2.2 and 17.5.1 (or later).
Current Behavior
When I run the select_strains.py script on a metadata file that has missing values in the age column, it fails with the error:
Traceback (most recent call last):
File "scripts/select_strains.py", line 292, in <module>
metadata = parse_metadata(args.segments, args.metadata, date_format=args.date_format)
File "scripts/select_strains.py", line 207, in parse_metadata
if age_str[-1]=='y':
IndexError: string index out of range
Expected behavior
Similar to the ncov workflow, missing data in the metadata should be permitted and the build should still run.
How to reproduce
Steps to reproduce the current behavior:
augur parse
scripts/construct-recency-from-submission-date.py
augur filter
scripts/select_strains.py
Possible solution
Line 207 in select_strains.py currently assumes that each entry has text. Adding an if statement to allow for empty values could be one solution.
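A sketch of such a guard; the 'y'/'m' suffix handling mirrors the kind of parsing at line 207, but the exact suffixes the script supports are an assumption:

```python
# Sketch: parse an age string like "34y" or "6m", returning None for
# missing or empty values instead of raising IndexError.
def parse_age(age_str):
    if not isinstance(age_str, str) or not age_str:
        return None
    if age_str[-1] == "y":
        return float(age_str[:-1])
    if age_str[-1] == "m":
        return float(age_str[:-1]) / 12.0
    return None
```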
Your environment, if running Nextstrain locally:

- auspice 2.7.0
- nextstrain.cli 3.0.3

We currently annotate clades based on mutations in the HA tree for each subtype. To enable comparisons between HA trees and trees from other segments (like NA), we need to map HA's clade annotations onto the trees from these other segments.
The current HA clades JSON is the same file we'd want to use in the augur export for NA and other non-HA segments (just strain names matched to clades), but if there are any internal nodes in the NA tree that happen to match the HA tree, we'd want to omit those to avoid coloring the wrong internal nodes in NA.
To omit internal nodes, we could create a utility script in scripts/strip_internal_nodes.py that takes any "node data" JSON file from modular augur and outputs the corresponding JSON file with all internal nodes stripped. We could then define one rule for HA clade annotations in the Snakefile and a separate rule for non-HA clade annotations that uses the HA JSON as its input.
The resulting Snakemake rules might look like the following:
```
rule ha_clades:
    message: "Annotating HA clades"
    input:
        tree = rules.refine.output.tree,
        nt_muts = rules.ancestral.output,
        aa_muts = rules.translate.output,
        clade_definitions = "config/clades_{lineage}_ha.tsv"
    output:
        clades = "results/clades_{lineage}_ha_{resolution}.json"

rule non_ha_clades:
    message: "Annotating non-HA clades ({wildcards.segment})"
    input:
        ha_clades = "results/clades_{lineage}_ha_{resolution}.json"
    output:
        non_ha_clades = "results/clades_{lineage}_{segment}_{resolution}.json"
```
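The core of the proposed strip_internal_nodes.py could look like this (a sketch; it assumes internal nodes carry TreeTime's "NODE" name prefix):

```python
# Sketch: strip internal nodes (TreeTime names them with a "NODE" prefix)
# from an augur node-data JSON's "nodes" mapping, keeping only tips.
def strip_internal_nodes(node_data):
    tips_only = {name: values
                 for name, values in node_data.get("nodes", {}).items()
                 if not name.startswith("NODE")}
    return {**node_data, "nodes": tips_only}
```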
I am working on using Nextstrain for global surveillance of influenza. As part of our bioinformatics workflow, we already assign each sequenced virus to a clade (e.g., 3C.2a1b.2a.2) and we can pull this as part of the tabular metadata for a build. Currently, augur clades recomputes this for each node using the definitions file.
I would like to write a clades.json directly from tabular metadata instead of recomputing the clade based on augur clades.
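Producing that node-data JSON from metadata could look like this (a sketch; column names are assumptions):

```python
# Sketch: build an augur node-data "clades" JSON directly from tabular
# metadata rows, instead of recomputing clades from a definitions file.
def clades_json_from_metadata(rows, strain_column="strain", clade_column="clade"):
    nodes = {row[strain_column]: {"clade_membership": row[clade_column]}
             for row in rows}
    return {"nodes": nodes}
```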
An option for augur clades to write the JSON from strain and clade columns would address this.

In the near future, we would like to support region-specific flu builds and also enable other researchers to use this repository to define their own custom flu builds in the same way the ncov repository allows custom workflows.
Seasonal flu builds should follow the same general pattern as ncov builds where a standard workflow depends on an initial selection of sequences (based on some subsampling logic), inferring and annotating phylogenies, and then running build-specific steps to finalize the analysis.
Users should be able to configure their own workflows by modifying configuration files (e.g., YAML or JSON) and/or defining custom Snakemake profiles.
Existing parallel Nextstrain builds ("live" and "WHO") should be executable through this framework, such that we are users of our own workflow configuration system.
See Nextstrain's ncov regional builds definitions for a real example of what we would like to support for flu eventually.
Specific solutions will require more discussion, but the basic first steps toward addressing this issue would be similar to what we had to do for ncov back in March/April:
- Split the workflow into workflow/*.smk files, including a common.smk for shared functions and builds.smk for the main build logic.

Since the same strain can appear multiple times in GISAID, we should provide the unique GISAID accessions we use for each sequence that appears in our trees. We also modify the original strain names, in some cases, to standardize leading zero padding in ids, etc., such that users cannot find records in GISAID's search interface without the accession.
Hi,
I am trying to follow the tutorial under 'Quickstart with GISAID data' so that I can get familiar with the flu Nextstrain build later. I aim to build an H3N2 instance for a period of 10 years. I completed the steps where I have both metadata and FASTA sequences saved with the suggested names metadata.xls and raw_sequences_ha.fasta, respectively. However, when I tried to run nextstrain build --forceall . --configfile profiles/gisaid/builds.yaml, I got the message below.
Building DAG of jobs...
AmbiguousRuleException:
Rules parse and prepare_sequences are ambiguous for the file data/h3n2/ha.fasta.
Consider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive.
Wildcards:
parse: lineage=h3n2,segment=ha
prepare_sequences: lineage=h3n2,segment=ha
Expected input files:
parse: data/h3n2/raw_ha.fasta
prepare_sequences: data/h3n2/raw_sequences_ha.fasta
Expected output files:
parse: data/h3n2/ha.fasta data/h3n2/metadata_ha.tsv
prepare_sequences: data/h3n2/ha.fasta
I later ran nextstrain view auspice, but it didn't display the data contained in raw_sequences_ha.fasta. I wonder where it could have gone wrong. Any advice and feedback would be greatly appreciated!
Thanks,
Swan
We need to calculate tip-frequencies.json via KDE frequencies in order to restore the frequencies panel. I believe this functionality should be part of augur frequencies where --tree is passed in. And then maybe have flags for --method kde and --tips-only or some such.
We should have lbi calculated for all nodes in the tree. I suspect this should just be a simple script in scripts/.
@corneliusroemer experimented with IQ-TREE's constraint tree option to prevent IQ-TREE from putting clades in the wrong place for Nextclade reference trees. This seems like a good way to ensure correct trees, especially in 6m builds that may be lacking context sequences.
One issue brought up in initial Slack thread:
One issue to sort out would be the delimiter in sequence names; right now IQ-TREE renames all / as some weird string.
Current Behavior
The time tree view on nextflu works as expected:
When one deselects the "timetree" option to view the divergence tree, the tree collapses to one position on the x-axis:
Expected behavior
The divergence view should show the branch length in divergence along the x-axis.
How to reproduce
Steps to reproduce the current behavior:
Additional context
It is not clear if this issue is with the seasonal-flu build's tree JSONs or the nextflu code itself. In the latter case, we should transfer this issue to the nextflu repository.
Currently, besides reference genomes, select_strains.py is only passing through viruses that possess both HA and NA segments (due to our use of --segments ha na in the snakefile). For time pivots back about 3-6 months this is okay, and there are usually enough strains with both HA and NA to fill sampling bins. However, some strains only get HA sequenced. This was especially obvious looking just now, where there are a number of H3s from January with just HA. These tend to be uploaded by groups that are not CCs.
I would propose to modify select_strains.py so that a "complete" genome (as in possessing all entries in --segments) becomes another factor in priority rather than a hard constraint.
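The proposed change could weight completeness into the priority score rather than filtering on it (a sketch; the function and weighting are illustrative, not the script's actual interface):

```python
# Sketch: treat genome completeness as a priority factor instead of a
# hard requirement, so HA-only strains can still fill sampling bins.
def strain_priority(available_segments, required_segments, base_priority=0.0):
    completeness = (len(set(required_segments) & set(available_segments))
                    / len(required_segments))
    return base_priority + completeness
```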
The current script to import tip clades from HA to non-HA segments identifies internal nodes by checking that the first four characters of the node name match "NODE". The fact that internal nodes have this specific name is a design decision from TreeTime. This approach tries to infer the data structure from the data contents when information about the structure is already available in an external Newick file.
I'd suggest we pass the original HA tree Newick to this script and use the Bio.Phylo is_terminal method to identify internal nodes instead of testing for specific text. This change will prevent the script from silently breaking when/if TreeTime names internal nodes something different in the future and will avoid confusion when someone would otherwise have to figure out where the "NODE" text comes from (I admit this took me a little while to track down). Although this change couples the script to the Bio.Phylo interface, that interface is something more easily grepped for and replaced during a future refactor.
The Newick tree for the current segment is already an input for the clades rule, so it would not be difficult to force that input tree to always be the HA tree instead.
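A sketch of the tree-based check (assumes Biopython is available; in the workflow the Newick would be read from the HA tree file rather than a string):

```python
from io import StringIO
from Bio import Phylo

# Sketch: read the HA Newick tree and collect internal node names with
# Bio.Phylo's is_terminal, instead of matching the "NODE" text prefix.
def internal_node_names(newick_string):
    tree = Phylo.read(StringIO(newick_string), "newick")
    return {clade.name for clade in tree.find_clades()
            if not clade.is_terminal() and clade.name}
```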
Context
H3N2 has an emerging clade from A1b/135K that has reached high enough frequency to start tracking. We agreed to name this clade A1b/186D and annotate it based on its 135K, 186D, and 190N mutations.
H1N1pdm has several previously identified clades that we haven't annotated yet. These include:
And a newer clade:
Description
We need to add annotations for these new clades to the corresponding clades TSV files in the builds and confirm visually that they work as expected.
We produce monthly reports describing seasonal influenza circulation patterns. A substantial portion of time spent on these reports is tedious copying and pasting or running commands manually that could be automated.
The following specific tasks will minimize the manual work associated with reports and allow us to focus on the science.
Fewer than 1% of strains sampled between 2017 and 2019 have ambiguous dates, but when these strains are included in a seasonal flu build, the pipeline assigns them two different effective dates. The first date is inferred by TreeTime. The second date is read from the metadata and converted to a numerical value prior to frequency estimation.
These differences lead to examples like A/Malaysia/RP0118/2019 which has an ambiguous date of 2019-XX-XX, a TreeTime-inferred date of January 1, 2019, and a brute-force guess from metadata of June 2019. For this example, the user interface confusingly displays two different dates as shown below. This problem manifests in the forecasting pipeline where tips with non-zero frequencies are used to project the population one year into the future. In the example below, the strain with an ambiguous date is included in the forecast because of its frequency estimation date even though it was most likely circulating 6 months earlier.
We could fix this issue in a couple ways:
The first solution seems to be the simplest and since fewer than 1% of recent sequences have ambiguous dates, this filter shouldn’t adversely affect the quality of the flu builds. This solution only requires a change to the seasonal flu build.
The second solution requires a change to a core augur interface. It might be nice in the long run to estimate frequencies from the most accurate information available, though. Tip frequency estimation already requires the Newick tree from augur refine, so adding the branch_lengths.json as an input to frequencies wouldn’t require any major rewiring of existing pipelines.
My preference is for the first solution.
I think there's a minor typo in the README:
These antigenic data come in four flavors depending on the assay that passage history of the antigens.
Should possibly be: "assay and passage history"
Submitting this here per conversations on Slack and with Tom.
The latest H3N2 WHO builds failed multiple times when they were run on AWS, seemingly because they exceed the disk limit (22GB) allocated to instances on AWS. They were submitted using:
python batch.py --system batch --version who
The logs for the failed run can be found here. The build seems to fail during the final zip step, which points toward an issue with disk space. This will hopefully be helped by nextstrain/augur#278, but as a short-term solution can also be helped by subdividing WHO builds by more than just subtype.
Current Behavior
Titer records define strain names for test and reference viruses, but we do not automatically cross-check these names with existing names in GISAID/GenBank records. As a result, there can be misspelled strain names in the titer records that lead us to omit these measurements from analyses when they do not match the corresponding sequence record's strain name.
Tal from the Bloom lab has helpfully provided a list of misspellings that he detected in our Crick titers:
"Misspelled Virus Name","Correct Virus Name"
"A/Camb/925256/2020","A/Cambodia/925256/2020"
"A/Christchurch/4/1985","A/ChristChurch/4/1985"
"A/Christchurch/515/2019","A/ChristChurch/515/2019"
"A/Christchurch/515/2019-egg","A/ChristChurch/515/2019-egg"
"A/CotedIvore/544/2016","A/CoteDIvoire/544/2016"
"A/Eng/538/2018","A/England/538/2018"
"A/Greecd/4/2017","A/Greece/4/2017"
"A/Hk/5738/2014","A/HongKong/5738/2014"
"A/Hk/656/2018","A/HongKong/656/2018"
"A/Hk/675/2018","A/HongKong/675/2018"
"A/Lyon/CHU/R1811667/2018","A/Lyon/CHU-R1811667/2018"
"A/Lyon/CHU/R181282/2018","A/Lyon/CHU-R181282/2018"
"A/Lyon/CHU/R1813393/2018","A/Lyon/CHU-R1813393/2018"
"A/Lyon/CHU/R190259/2019","A/Lyon/CHU-R190259/2019"
"A/Lyon/CHU/R190377/2019","A/Lyon/CHU-R190377/2019"
"A/Lyon/CHU/R1914685/2019","A/Lyon/CHU-R1914685/2019"
"A/Lyon/CHU/R1915450/2019","A/Lyon/CHU-R1915450/2019"
"A/Lyon/EHPAD/108/2019","A/Lyon/EHPAD-108/2019"
"A/Nor/2516/2018","A/Norway/2516/2018"
"A/Nor/2620/2018","A/Norway/2620/2018"
"A/Nor/4436/2016","A/Norway/4436/2016"
"A/Norway/3806-egg","A/Norway/3806/2016-egg"
"A/Singapore/INFIMH-16-001/2016","A/Singapore/INFIMH-16-0019/2016"
"A/Singapore/INFIMH-16-001/2016-egg","A/Singapore/INFIMH-16-0019/2016-egg"
"A/Singapore/Infimh-16-0019/2016","A/Singapore/INFIMH-16-0019/2016"
"A/Singapore/Infimh-16-0019/2016-egg","A/Singapore/INFIMH-16-0019/2016-egg"
"A/StEtienne/1912/2018","A/Saint-Etienne/1912/2018"
"A/StEtienne/1998/2018","A/Saint-Etienne/1998/2018"
"A/StEtienne/2539/2020","A/Saint-Etienne/2539/2020"
"A/Stock/6/2014","A/Stockholm/6/2014"
"A/Switz/8060/2017-egg","A/Switzerland/8060/2017-egg"
"A/Switzerlandz/8060/2017-egg","A/Switzerland/8060/2017-egg"
Expected behavior
Misspelled strains in the list above should match their sequence strain names.
Possible solution
In addition to manually correcting these records in our database, we should also consider flagging any titer records with potential misspellings. One easy check would be for records whose test or reference strains don't have corresponding records in the sequence database.
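The cross-check proposed above could be sketched roughly as follows. This is an illustrative sketch, not an existing fauna/augur command; the function name and the case-insensitive matching rule are assumptions, motivated by misspellings in the list above that differ only by case (e.g. "Infimh" vs. "INFIMH").

```python
def flag_unmatched_strains(titer_strains, sequence_strains):
    """Return titer strain names with no matching sequence record.

    Matching is case-insensitive, since several known misspellings
    (e.g. "A/Singapore/Infimh-16-0019/2016") differ only by case.
    This helper and its matching rule are illustrative assumptions.
    """
    known = {name.lower() for name in sequence_strains}
    return sorted(s for s in titer_strains if s.lower() not in known)

titers = ["A/Greecd/4/2017", "A/Cambodia/925256/2020"]
sequences = ["A/Greece/4/2017", "A/Cambodia/925256/2020"]
print(flag_unmatched_strains(titers, sequences))  # → ['A/Greecd/4/2017']
```

A fuzzy comparison (e.g. edit distance against known strain names) could extend this to suggest likely corrections rather than only flagging mismatches.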
The H3N2 NA build on nextstrain.org contains a major mutation (I469T) that falls outside the range available for genotype selection in the 'Color By' option. The selectable range is 1 - 468, so the mutation at position 469 cannot be selected.
This link should work:
https://nextstrain.org/flu/seasonal/h3n2/na/2y?branchLabel=aa&c=gt-NA_469
But nextstrain will return the standard tree. See NODE_0001125 for the mutation I am referring to.
Change the range to 1 - 469
Context
The workflow uses a custom Python script to perform codon-aware alignment, to resolve issues with variable sites in mafft's alignments. Since @rneher wrote this script, he and @ivan-aksamentov developed nextalign. Even though nextalign was developed with SARS-CoV-2 in mind, it works well for H3N2 HA sequences, too, and is way faster.
Description / proposed solution
We should try using nextalign for our flu builds. We'll need to set up the corresponding FASTA reference files for all of the lineages and segments or maybe implement GenBank file support in nextalign (whichever is easier...I can guess which though!). Then, we can take advantage of the codon-aware alignment functionality and also run alignments with multiple threads to speed up that step of our builds.
Once we have nextalign in place, we could start to do analyses that previously would have taken too long like creating multiple sequence alignments of all amino acid sequences for HA and running the titer substitution model on all available sequences and titers.
Hi,
I'd like to build the analysis for a custom flu dataset.
It seems the documentation and tutorials are not detailed enough for the flu build.
For instance, I was able to run the zika virus tutorial, but the input file formats look different.
The seasonal flu build using the example data in this repository works fine, but again, the input file format looks different.
So the questions are, if I want to run the analysis using a custom dataset:
The example data tells me some information about the format, but I was unable to run the nextstrain build on my own sequence and titer data.
Thank you for any advice.
Strain submission to GISAID and other databases often lags by one to two months from the actual collection date. When we make new seasonal flu builds, it would be helpful to identify which strains were recently added to the database.
The ncov builds address this problem by adding a "recency" color-by attribute. This categorical attribute bins submission dates into useful groups for ncov analysis.
For flu, a similar categorical attribute that highlights sequences submitted in the last month or before should be sufficient.
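A minimal sketch of such a categorical attribute is below. The specific bin edges (one week / one month / older) and category labels are illustrative assumptions, not the exact bins used by the ncov builds.

```python
from datetime import date

def recency_category(submission_date, build_date):
    """Bin a submission date relative to the build date.

    The bin edges and labels here are illustrative assumptions;
    the real builds would choose bins meaningful for flu submissions.
    """
    days = (build_date - submission_date).days
    if days <= 7:
        return "last week"
    if days <= 30:
        return "last month"
    return "older"

print(recency_category(date(2021, 3, 28), date(2021, 4, 1)))  # → last week
print(recency_category(date(2021, 3, 10), date(2021, 4, 1)))  # → last month
print(recency_category(date(2021, 1, 5), date(2021, 4, 1)))   # → older
```

Emitting these categories per strain as a node data JSON would let Auspice expose recency as a color-by, mirroring the ncov approach.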
Require Augur version 15.0.0 and update the pipeline to use the --min-date/--max-date options in augur filter instead of filtering by date within select_strains.
We need to calculate ep, ne, and rb for each node in the tree. I know that @huddlej has a clean way to do this well. I don't know whether this is better as augur scores or augur sequence-traits, or if it's better as a separate script in seasonal-flu/scripts/.
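A rough sketch of the per-branch part of such a calculation is below. The epitope site set is a toy subset and the node-data layout is an assumption for illustration; the real scores would use the published epitope/RBS site definitions and typically accumulate counts along the path from the root to each node rather than per branch.

```python
# Hedged sketch: count epitope ("ep") and non-epitope ("ne") mutations per
# branch, given each node's amino-acid mutations relative to its parent.
# EPITOPE_SITES is a toy subset of 1-based HA1 positions, not a real site map.
EPITOPE_SITES = {145, 155, 156, 158, 159, 189, 193}

def score_nodes(node_mutations):
    """node_mutations: {node_name: ["K145N", ...]} -> node-data-style dict."""
    nodes = {}
    for name, muts in node_mutations.items():
        positions = [int(m[1:-1]) for m in muts]  # strip ancestral/derived residues
        ep = sum(1 for p in positions if p in EPITOPE_SITES)
        nodes[name] = {"ep": ep, "ne": len(positions) - ep}
    return nodes

print(score_nodes({"NODE_0000001": ["K145N", "T131K"], "strain_a": []}))
```

Wrapping the result as `{"nodes": ...}` would make it a node data JSON consumable by augur export, whichever home (augur scores vs. a standalone script) the calculation ends up in.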
Currently augur titers sub outputs a JSON that has entries for avidity, potency, substitution, and titers, but lacks a nodes entry that could be used to decorate the final tree.json via augur export. This change needs to happen to augur titers sub first, and then the seasonal build can be updated.
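The shape of the proposed change could look something like this. The per-strain field name ("cTiterSub") and the source of the per-strain values are assumptions for illustration; only the existing top-level keys (avidity, potency, substitution, titers) come from the issue.

```python
# Sketch: derive a "nodes" entry from per-strain antigenic advance values so
# the titer model JSON can be consumed as node data by `augur export`.
def add_nodes_entry(titer_model, advance_by_strain):
    """Return a copy of the titer-sub model JSON with a node-data entry.

    The "cTiterSub" field name is a hypothetical choice for illustration.
    """
    out = dict(titer_model)  # shallow copy; existing entries are kept as-is
    out["nodes"] = {
        strain: {"cTiterSub": value}
        for strain, value in advance_by_strain.items()
    }
    return out

model = {"avidity": {}, "potency": {}, "substitution": {}, "titers": {}}
decorated = add_nodes_entry(model, {"A/HongKong/4801/2014": 1.8})
print(sorted(decorated))  # 'nodes' now sits alongside the existing keys
```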
We currently estimate tip frequencies up to the current date of a given build, but the latest available sequences are often collected one or two months prior to the current date. As a result, the frequencies misrepresent what we know about the present. For H3N2, this lack of data also makes the forecasts less reliable because they end up being 14-15 month forecasts instead of 12-month forecasts.
Instead of using the current date as the max date for tip frequencies, we should dynamically determine the max date from the available sequences in the analysis, rounding up to the first day of the next month. For example, if the latest available sequence is from April 15, we would set the max date to May 1.
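The rounding rule described above is simple enough to sketch directly; the function name is hypothetical.

```python
from datetime import date

def frequencies_max_date(latest_collection_date):
    """Round the latest collection date up to the first day of the next month."""
    year, month = latest_collection_date.year, latest_collection_date.month
    if month == 12:
        return date(year + 1, 1, 1)
    return date(year, month + 1, 1)

print(frequencies_max_date(date(2020, 4, 15)))   # → 2020-05-01
print(frequencies_max_date(date(2020, 12, 20)))  # → 2021-01-01
```

The resulting date would then be passed as the max date to the tip frequency estimation step instead of the build's current date.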
Current Behavior
@kistlerk noticed that the demarcated region for HA1 is not divisible by 3 in the HA reference files for Vic and Yam.
Expected behavior
The CDS for HA1 should be 57..1094 rather than 57..1095, and SigPep should be 12..56.
CDS coordinates in GenBank format are 1-based and closed. To calculate the amino acids in a given range, you have to add 1 to the range before dividing by 3. For H3N2's HA, we get the 16 amino acids of SigPep by (48 - 1 + 1) / 3 and the 329 of HA1 with (1035 - 49 + 1) / 3.
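The arithmetic above can be captured in a small helper that also catches ranges that aren't divisible by 3; the function name is hypothetical.

```python
def cds_amino_acid_length(start, end):
    """Amino acids encoded by a 1-based, closed GenBank CDS range."""
    length = end - start + 1
    if length % 3 != 0:
        raise ValueError(f"{start}..{end} spans {length} nt, not divisible by 3")
    return length // 3

print(cds_amino_acid_length(1, 48))     # H3N2 SigPep → 16
print(cds_amino_acid_length(49, 1035))  # H3N2 HA1 → 329
print(cds_amino_acid_length(57, 1094))  # corrected Vic/Yam HA1 → 346
# cds_amino_acid_length(57, 1095) raises: 1039 nt is not divisible by 3.
```

Running such a check over all reference files would catch any other demarcated regions with off-by-one coordinates.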
You already upload raw FASTA to AWS in rule upload_sequences. It would be useful for our https://github.com/neherlab/flu_frequencies repo to be able to download metadata.tsv as produced by your rule parse. I will duplicate your parse rule for now, but in the future it would be nice to reduce code duplication and parse in one place. Is there a reason you currently upload only the raw data and not the parsed data?
@rneher ---
I don't think I introduced this bug, but you first might want to check one of your existing JSONs to confirm it's present there as well. If I build flu/who/h3n2/ha/2y/cell/hi, the substitution model inference of tip antigenic advancement looks good via auspice v2. However, the display in auspice v1 (nextflu/auspice) doesn't register these antigenic changes. I thought this might be due to how substitutions are encoded in flu_who_h3n2_ha_2y_cell_hi_titer-sub-model.json, but that actually looks good. Thinking now it may be entirely an issue with auspice v1.
Context
The most recent vaccine strains for the upcoming Northern Hemisphere season aren't included in live builds.
Description
These strains need to be added to the vaccines JSON and references files for H3N2 and H1N1pdm.
We currently assign clade labels to trees in our main phylogenetic workflow using the augur clades command and the influenza clade nomenclature TSVs. However, clade assignments vary for some samples between these public/private trees and the Nextclade trees. Clade assignments can also vary between different runs of public/private trees from the same time period, because our random subsampling logic produces trees with different sample compositions. These mismatches can cause confusion for users who look at both Nextclade outputs for their own data and the public/private Nextstrain trees.
Since Nextclade already provides a standard clade label interface, we should use Nextclade to annotate clades in our main phylogenetic workflow instead of augur clades. This change will ensure that samples are assigned to the same clade regardless of the sample composition of a given public/private tree.
In the short term, we could replace our nextalign alignment with nextclade using the corresponding reference dataset for each subtype. We would need to replace the current augur clades command with functionality like @corneliusroemer proposed in nextstrain/augur#1329 that allows us to assign clades to internal nodes and branches, for complete backward compatibility of clade display in Auspice. Instead of inferring clades for internal nodes as a discrete trait, we could consider assigning clades with Nextclade to the inferred ancestral sequences for those nodes.
In the long (medium?) term, we could run Nextclade during our "data upload to S3" workflow, upload the alignments and Nextclade annotations joined with metadata, and then start our workflows with those files. This approach would allow us to skip the alignment and clades steps of the current workflow and it would provide useful Nextclade data on S3 that we need for other analyses like flu frequencies, etc.