
seasonal-flu's Introduction

nextstrain.org/flu


This is the Nextstrain build for seasonal influenza viruses, available online at nextstrain.org/flu.

The build encompasses fetching data, preparing it for analysis, doing quality control, performing analyses, and saving the results in a format suitable for visualization (with auspice). This involves running components of Nextstrain such as fauna and augur.

All influenza virus-specific steps and functionality for the Nextstrain pipeline should be housed in this repository.

This build is more complicated than other standard Nextstrain builds because all four currently circulating seasonal influenza lineages (A/H3N2, A/H1N1pdm, B/Vic, and B/Yam) are analyzed using the same Snakefile with appropriate wildcards. In addition, we run analyses of both the HA and NA segments of the influenza virus genome and analyze datasets that span different time intervals (e.g., 2, 3, or 6 years). Furthermore, the Nextstrain analysis of influenza virus evolution also uses antigenic and serological data from different WHO collaborating centers.

The different builds for the general public and the different WHO collaborating centers are configured via separate config files. The Nextstrain build configs (upload, nextstrain-public, private.nextflu.org) are used for our semi-automated builds through our GitHub Action workflows.

Example build

You can run an example build using the example data provided in this repository.

First follow the standard installation instructions for Nextstrain's suite of software tools.

Then run the example build via:

nextstrain build .  --configfile profiles/ci/builds.yaml

When the build has finished running, view the output Auspice trees via:

nextstrain view auspice/

Quickstart with GISAID data

Navigate to GISAID. Select the "EpiFlu" link in the top navigation bar and then select "Search" from the EpiFlu navigation bar. From the search interface, select A/H3N2 human samples collected in the last six months, as shown in the example below.

Search for recent A/H3N2 data

Also, under the "Required Segments" section at the bottom of the page, select "HA". Then select the "Search" button. Select the checkbox in the top-left corner of the search results (the same row with the column headings), to select all matching records as shown below.

Select all matching records from search results

Select the "Download" button. From the "Download" window that appears, select "Isolates as XLS (virus metadata only)" and then select the "Download" button.

Download metadata

Create a new directory for these data in the seasonal-flu working directory.

mkdir -p data/h3n2/

Save the XLS file you downloaded (e.g., gisaid_epiflu_isolates.xls) as data/h3n2/metadata.xls. Return to the GISAID "Download" window, and select "Sequences (DNA) as FASTA". In the "DNA" section, select the checkbox for "HA". In the "FASTA Header" section, enter only Isolate name. Leave all other sections at the default values.

Download sequences

Select the "Download" button. Save the FASTA file you downloaded (e.g., gisaid_epiflu_sequences.fasta) as data/h3n2/raw_sequences_ha.fasta.

Run the Nextstrain workflow for these data to produce an annotated phylogenetic tree of recent A/H3N2 HA data. If you have installed Nextstrain with the Docker runtime, run the following command.

nextstrain build . --configfile profiles/gisaid/builds.yaml

If you have installed Nextstrain with the Conda runtime, run the following command instead.

nextstrain build . --configfile profiles/gisaid/builds.yaml \
  --use-conda --conda-frontend mamba

When the workflow finishes running, visualize the resulting tree with the following command.

nextstrain view auspice

Explore the configuration file for this workflow by opening profiles/gisaid/builds.yaml in your favorite text editor. This configuration file determines how the workflow runs, including how samples get selected for the tree. Try changing the maximum number of sequences retained by subsampling from 100 to 500 and the geographic grouping from region to country. Rerun your analysis by adding the --forceall flag to the end of the nextstrain build command you ran above. How did those changes to the configuration file change the tree?

Explore the other configuration files in profiles/, to see other examples of how you can build your own Nextstrain workflows for influenza.

History

  • Prior to March 31, 2023, we selected strains for each build using a custom Python script called select_strains.py. With the merge of the refactored workflow, we have since used a configuration file to define the augur filter query logic we want for strain selection per build.

seasonal-flu's People

Contributors

barneypotter24, huddlej, j23414, jameshadfield, joverlee521, kistlerk, rneher, trvrb, tsibley, victorlin


seasonal-flu's Issues

Some strain names in titer records are misspelled

Current Behavior

Titer records define strain names for test and reference viruses, but we do not automatically cross-check these names with existing names in GISAID/GenBank records. As a result, there can be misspelled strain names in the titer records that lead us to omit these measurements from analyses when they do not match the corresponding sequence record's strain name.

Tal from the Bloom lab has helpfully provided a list of misspellings that he detected in our Crick titers:

"Misspelled Virus Name","Correct Virus Name"
"A/Camb/925256/2020","A/Cambodia/925256/2020"
"A/Christchurch/4/1985","A/ChristChurch/4/1985"
"A/Christchurch/515/2019","A/ChristChurch/515/2019"
"A/Christchurch/515/2019-egg","A/ChristChurch/515/2019-egg"
"A/CotedIvore/544/2016","A/CoteDIvoire/544/2016"
"A/Eng/538/2018","A/England/538/2018"
"A/Greecd/4/2017","A/Greece/4/2017"
"A/Hk/5738/2014","A/HongKong/5738/2014"
"A/Hk/656/2018","A/HongKong/656/2018"
"A/Hk/675/2018","A/HongKong/675/2018"
"A/Lyon/CHU/R1811667/2018","A/Lyon/CHU-R1811667/2018"
"A/Lyon/CHU/R181282/2018","A/Lyon/CHU-R181282/2018"
"A/Lyon/CHU/R1813393/2018","A/Lyon/CHU-R1813393/2018"
"A/Lyon/CHU/R190259/2019","A/Lyon/CHU-R190259/2019"
"A/Lyon/CHU/R190377/2019","A/Lyon/CHU-R190377/2019"
"A/Lyon/CHU/R1914685/2019","A/Lyon/CHU-R1914685/2019"
"A/Lyon/CHU/R1915450/2019","A/Lyon/CHU-R1915450/2019"
"A/Lyon/EHPAD/108/2019","A/Lyon/EHPAD-108/2019"
"A/Nor/2516/2018","A/Norway/2516/2018"
"A/Nor/2620/2018","A/Norway/2620/2018"
"A/Nor/4436/2016","A/Norway/4436/2016"
"A/Norway/3806-egg","A/Norway/3806/2016-egg"
"A/Singapore/INFIMH-16-001/2016","A/Singapore/INFIMH-16-0019/2016"
"A/Singapore/INFIMH-16-001/2016-egg","A/Singapore/INFIMH-16-0019/2016-egg"
"A/Singapore/Infimh-16-0019/2016","A/Singapore/INFIMH-16-0019/2016"
"A/Singapore/Infimh-16-0019/2016-egg","A/Singapore/INFIMH-16-0019/2016-egg"
"A/StEtienne/1912/2018","A/Saint-Etienne/1912/2018"
"A/StEtienne/1998/2018","A/Saint-Etienne/1998/2018"
"A/StEtienne/2539/2020","A/Saint-Etienne/2539/2020"
"A/Stock/6/2014","A/Stockholm/6/2014"
"A/Switz/8060/2017-egg","A/Switzerland/8060/2017-egg"
"A/Switzerlandz/8060/2017-egg","A/Switzerland/8060/2017-egg"

Expected behavior

Misspelled strains in the list above should match their sequence strain names.

Possible solution

In addition to manually correcting these records in our database, we should also consider flagging any titer records with potential misspellings. One easy check would be for records whose test or reference strains don't have corresponding records in the sequence database.
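That "easy check" might be sketched as follows. This is a hypothetical illustration, not the fauna implementation: the data structures are plain Python collections, and the punctuation/case-insensitive normalization used to suggest likely fixes is an assumption about how near-matches could be detected.

```python
# Sketch: flag titer strain names that have no matching sequence record.
# Data structures here are hypothetical; real records live in our databases.

def normalize(name):
    """Collapse case and punctuation so near-matches can be suggested."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def flag_misspellings(titer_strains, sequence_strains):
    """Return {unmatched titer strain: suggested match or None}."""
    known = set(sequence_strains)
    by_normalized = {normalize(s): s for s in sequence_strains}
    flagged = {}
    for strain in titer_strains:
        if strain in known:
            continue
        # No exact sequence record; suggest a normalized near-match if any.
        flagged[strain] = by_normalized.get(normalize(strain))
    return flagged
```

For example, `flag_misspellings(["A/Singapore/Infimh-16-0019/2016"], ["A/Singapore/INFIMH-16-0019/2016"])` would flag the titer name and suggest the correctly capitalized sequence name, while abbreviations like "A/Camb/…" would be flagged with no suggestion.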

Color strains by database submission date

Strain submission to GISAID and other databases often lags by one to two months from the actual collection date. When we make new seasonal flu builds, it would be helpful to identify which strains were recently added to the database.

The ncov builds address this problem by adding a "recency" color-by attribute. This categorical attribute bins submission dates into useful groups for ncov analysis.

For flu, a similar categorical attribute that highlights sequences submitted in the last month or before should be sufficient.
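Such a categorical attribute might be computed like this. The bin labels and cutoffs below are assumptions for illustration, not the ncov definitions:

```python
from datetime import date

# Sketch of a "recency" color-by: bin database submission dates into
# categories. The labels and cutoffs are assumed, not ncov's exact bins.

def recency(submission_date, today):
    """Categorize a submission date relative to today."""
    days = (today - submission_date).days
    if days < 0:
        return "future"      # guard against malformed metadata
    if days <= 7:
        return "last week"
    if days <= 30:
        return "last month"
    return "older"
```

Applied per strain, this yields a categorical column suitable for an Auspice color-by.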

Annotate Nextstrain clade names by default with official full clade names as alternate annotation

Context

We currently annotate full official clade names for all lineages, but for H3N2 and H1N1pdm the longer official clade names can be difficult to read in the Nextstrain view. We also often need to propose/follow new subclades before these are officially recognized.

A similar situation exists for SARS-CoV-2 where we support Nextstrain clades, PANGO lineages, and "emerging lineages" as alternate branch labels in Auspice.

Description

We should set up parallel "clade" names (i.e., Nextstrain clades) and "full clade" names (i.e., official WHO clade names) in an analogous fashion to how we have "clade" and "emerging lineage" in ncov. This would be both a coloring and toggle under the "branch labels".

We should default to showing the Nextstrain clade names which we'll abbreviate to "2a.2", etc... but have the full official clade names available. Nextclade will similarly have two fields for flu clades.

Possible solution

The full official clade names are defined in files named like clades_h3n2_ha.tsv while the Nextstrain clades are in files named like nextstrain_clades_h3n2_ha.tsv. We need to either define a separate clade annotation rule for the Nextstrain clades or modify the existing clades rule to be parameterized by the set of clade definitions to use.

The ncov workflow defines separate rules to annotate clades and emerging lineages. This workflow also needs an additional rule to add branch labels for emerging lineages to the Auspice JSON. However, if we could first merge @jameshadfield's excellent Augur PR to add support for arbitrary branch labels in augur clades, we could avoid this additional complexity in the seasonal flu workflow.

We also need to redefine Nextstrain clades as shortened versions of the full official clade names, keeping the official capitalization and abbreviating on dot. For example:

  • 3C.2a1b.2a.2 → 2a.2
  • 3C.2a1b.1b → 1b
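Based on the two examples above, the shortening could amount to stripping a shared prefix. A minimal sketch, assuming the prefix to strip (here "3C.2a1b.") is known per lineage:

```python
# Sketch: derive short Nextstrain clade names from full official clade names
# by stripping a known shared prefix. The default prefix is an assumption
# inferred from the examples above, not a general rule.

def shorten_clade(full_name, prefix="3C.2a1b."):
    """Return the abbreviated clade name, preserving official capitalization."""
    if full_name.startswith(prefix):
        return full_name[len(prefix):]
    return full_name
```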

Use FRA data instead of HI data for live H3N2 builds

Context

HI assays do not perform as well for recent H3N2 viruses as FRA/HINT assays do (3c3.A is the exception), and HI assays produce different titer model parameters than FRA/HINT assays. We know these assays measure different aspects of viral infection and that FRA/HINT assays were historically performed when HI assays didn't work, making direct comparisons between these data biased.

Although I had previously thought that FRA/HINT data were more numerous than HI data, the number of measurements submitted since Jan 1, 2019 is nearly the same for both assay types. It turns out that the majority of our FRA/HINT data come from the CDC and the majority of the HI data come from VIDRL. This is an important distinction because the live builds only include data from the CDC and are thus relatively depleted for HI data. Here is the breakdown of recent data based on the inclusion date:

>>> hi.query("inclusion_date > '2019-01-01'")["center"].value_counts()
vidrl    21415
crick     6621
cdc       3108
>>> fra.query("inclusion_date > '2019-01-01'")["center"].value_counts()
cdc      18649
vidrl     3987
niid      3851
crick     1097

Description

We should try building a 2y H3N2 HA tree with HI data and another with FRA data, keeping all other parameters of the builds the same (including the forecasting model). Then, we should compare the trees and the corresponding titer model results head-to-head. We should look for:

  1. major changes in overall tree topology based on strains selected in an assay-specific manner
  2. changes in the titer model range and specific branches/substitutions with antigenic advance by these models
  3. changes in forecasts

Consistently filter sequences using augur filter

To make these flu builds consistent with previous augur builds, I'm suggesting the following modifications to the current Snakefile:

  • Use augur filter to filter sequences by length and outlier status prior to selecting strains
  • Modify scripts/select_strains.py to take both sequences and metadata as inputs and to output selected sequences
  • Remove current Python-only filter rule from the Snakefile (it is no longer needed after the aforementioned modifications)

If select_strains.py outputs filtered sequences themselves instead of strain names, that will remove the need for an --include-only type of option for the augur filter command and make the flu-specific script more consistent with existing augur commands rather than the other way around.

subsampling in flu

Some discussion with Becky:

  • she has prevalence data scraped from FluNet that we could use for subsampling.
  • we could use this for subsampling and calculation of global frequencies.
  • this becomes more important as they are preferentially sampling rare lineages.

Strains with ambiguous dates are assigned two different dates in auspice output

Fewer than 1% of strains sampled between 2017 and 2019 have ambiguous dates, but when these strains are included in a seasonal flu build, the pipeline assigns them two different effective dates. The first date is inferred by TreeTime. The second date is read from the metadata and converted to a numerical value prior to frequency estimation.

These differences lead to examples like A/Malaysia/RP0118/2019 which has an ambiguous date of 2019-XX-XX, a TreeTime-inferred date of January 1, 2019, and a brute-force guess from metadata of June 2019. For this example, the user interface confusingly displays two different dates as shown below. This problem manifests in the forecasting pipeline where tips with non-zero frequencies are used to project the population one year into the future. In the example below, the strain with an ambiguous date is included in the forecast because of its frequency estimation date even though it was most likely circulating 6 months earlier.

image

We could fix this issue in a couple ways:

  1. Exclude all strains with ambiguous dates
  2. Use TreeTime inferred dates for frequency estimation

The first solution seems to be the simplest, and since fewer than 1% of recent sequences have ambiguous dates, this filter shouldn't adversely affect the quality of the flu builds. This solution only requires a change to the seasonal flu build.

The second solution requires a change to a core augur interface. It might be nice in the long run to estimate frequencies from the most accurate information available, though. Tip frequency estimation already requires the Newick tree from augur refine, so adding the branch_lengths.json as an input to frequencies wouldn’t require any major rewiring of existing pipelines.

My preference is for the first solution.
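A minimal sketch of that first option, assuming metadata dates follow the 2019-XX-XX masking convention shown above and live in a "date" column (both assumptions for illustration):

```python
# Sketch: exclude strains whose collection dates are ambiguous.
# Assumes dates use the "XX" masking convention (e.g., "2019-XX-XX")
# and that each metadata row is a dict with a "date" key.

def has_ambiguous_date(date_string):
    """True for dates like '2019-XX-XX' where month or day is masked."""
    return "X" in date_string.upper()

def exclude_ambiguous(metadata_rows):
    """Keep only rows whose date field is fully resolved."""
    return [row for row in metadata_rows if not has_ambiguous_date(row["date"])]
```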

Batched WHO jobs exceed disk limit of AWS instances

Submitting this here per conversations on Slack and with Tom.

The latest H3N2 WHO builds failed multiple times when they were run on AWS, seemingly because they exceed the disk limit (22GB) allocated to instances on AWS. They were submitted using:

python batch.py --system batch --version who

The logs for the failed run can be found here. The build seems to fail during the final zip step, which points toward an issue with disk space. This will hopefully be resolved by nextstrain/augur#278, but as a short-term solution it can also be mitigated by subdividing the WHO builds by more than just subtype.

Add "division"-level geographic resolution to Auspice JSONs

Context
To understand circulation patterns of flu within a specific country, we need to know which divisions (e.g., states, provinces, etc.) strains were collected in. We currently only annotate country-level geographic resolution in our Auspice JSONs.

Description
We should export division as an annotation in Auspice JSONs. To do this, we need to:

  • add "division" to the list of geo_resolutions in each Auspice config JSON
  • add a division color-by so we can color and filter by division annotations from the metadata
  • add latitudes/longitudes for all divisions in our available metadata

My guess is that the lat/long database from ncov will be substantially better than the current flu database, so we could start by copying from ncov and filling in the blanks as we go.

Identify internal nodes with the HA tree when importing tip clades

The current script to import tip clades from HA to non-HA segments identifies internal nodes by checking that the first four characters of the node name match "NODE". The fact that internal nodes have this specific name is a design decision from TreeTime. This approach tries to infer the data structure from the data contents when information about the structure is already available in an external Newick file.

I'd suggest we pass the original HA tree Newick to this script and use the Bio.Phylo is_terminal method to identify internal nodes instead of testing for specific text. This change will prevent the script from silently breaking when/if TreeTime names internal nodes something different in the future and will avoid confusion when someone would otherwise have to figure out where the "NODE" text comes from (I admit this took me a little while to track down). Although this change couples the script to the Bio.Phylo interface, that interface is something more easily grepped for and replaced during a future refactor.

The Newick tree for the current segment is already an input for the clades rule, so it would not be difficult to force that input tree to always be the HA tree instead.
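The underlying idea, deriving internal/terminal status from the tree structure rather than from the "NODE" name prefix, can be sketched without Biopython using a tiny Newick tip extractor; the real script would use Bio.Phylo's is_terminal instead of this regex:

```python
import re

# Sketch: classify nodes as internal or terminal from the Newick structure
# itself, not from their names. In Newick, names immediately after "(" or ","
# label tips; names after ")" label internal nodes.

def tip_names(newick):
    """Extract terminal (tip) names from a Newick string."""
    return set(re.findall(r"[(,]\s*([^(),:;]+)", newick))

def is_internal(node_name, newick):
    """True if the named node is not a tip of the given tree."""
    return node_name not in tip_names(newick)
```

With this approach, a tip that happened to be named "NODE_something" would still be classified correctly.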

Annotate titers per node

Context

We need to know which strains and clades do not have titer measurements from different assays, passage types, and sources. We previously wrote a script to annotate the number of titer measurements per node as a node data JSON file. This script's output plus an Auspice config JSON coloring entry allowed us to color the tree by total titers or counts binned into larger categories.

Description

We should implement this annotation in the refactored workflow for the private builds, at least. We could also use the Auspice config JSON to set up meaningful bins for the counts as a legend scale.

It would also be helpful to be able to filter the tree by presence/absence of measurements in the different categories, so we could focus on only tips that have no measurements.

Shared auspice config JSON is not appropriate for all builds

The current auspice configuration file (config/auspice_config.json) is not always appropriate for all builds in the Snakefile. The most obvious example of this is that the auspice config exports a color-by for “RBS mutations” even though A/H3N2 HA builds are the only ones that define RBS mutations.

This feels like an issue that should be resolved at the level of the Snakefile and not at the augur or auspice level (it's not necessary for those tools to know that this is how we are using them). Possible solutions at the Snakefile-level include:

  • define one separate config JSON per build (duplicating any shared content) and reference these files by wildcards (config/auspice_config_{lineage}_{segment}_{resolution}.json)
  • define one default config JSON shared by most builds and allow build-specific copies of this default to be defined through a Snakemake input function (e.g., config/auspice_config_h3n2_ha.json)
  • define a single templated config JSON (as with jinja, etc.) that contains all shared and build-specific content and render one config JSON per build from this template using conditional logic

The first solution seems the least ideal in that most of these configs will be identical and any update to the shared content between them will require the same updates to many files.

The second solution still duplicates shared content but benefits from defining build-specific conditional logic in Python in the Snakefile. For seasonal flu, it seems likely that the custom config will be the exception and not the rule, so a little duplicated content may not be the worst outcome.

The third solution is guaranteed to avoid content duplication, but it does move the build-specific conditional logic into a templating language. This might be the best solution in the long run if each build ends up having its own slight modifications to the shared settings. In the short term, though, it feels like an over-engineered approach.

select_strains.py fails if metadata age column has missing values

Current Behavior
When I run the select_strains.py script on a metadata file that has missing values in the age column, it fails with the error:

Traceback (most recent call last):
  File "scripts/select_strains.py", line 292, in <module>
    metadata = parse_metadata(args.segments, args.metadata, date_format=args.date_format)
  File "scripts/select_strains.py", line 207, in parse_metadata
    if age_str[-1]=='y':
IndexError: string index out of range

Expected behavior
Similar to the ncov workflow: where missing data in the metadata is permitted and the build still runs.

How to reproduce
Steps to reproduce the current behavior:

  1. Download data from a proprietary database (in a format I believe to be consistent with Nextstrain)
  2. augur parse
  3. Manually edit "region" data in Python to match Nextstrain's naming conventions
  4. scripts/construct-recency-from-submission-date.py
  5. augur filter
  6. scripts/select_strains.py

Possible solution
Line 207 in select_strains.py currently assumes that each entry has text. Adding an if statement to allow for empty values could be one solution.
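The guard might look like the following. The "y"/"m" suffix convention is inferred from the traceback above; the actual parsing in select_strains.py may differ:

```python
# Sketch: parse an age string like "34y" or "6m" into years, returning None
# for missing or empty values instead of raising IndexError.

def parse_age(age_str):
    if not age_str or not age_str.strip():
        return None
    age_str = age_str.strip()
    if age_str[-1] == "y":
        return float(age_str[:-1])
    if age_str[-1] == "m":
        # Months to fractional years.
        return float(age_str[:-1]) / 12.0
    return None
```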

Your environment: if running Nextstrain locally

  • Operating system: centos linux
  • Version (e.g. auspice 2.7.0): nextstrain.cli 3.0.3

Substitution model not loading on nextflu/auspice

@rneher ---

I don't think I introduced this bug. But you first might want to check one of your existing JSONs to confirm it's present there as well. If I build flu/who/h3n2/ha/2y/cell/hi, the substitution model inference of tip antigenic advancement looks good via auspice v2:

screen shot 2019-01-30 at 8 14 10 am

However, the display in auspice v1 (nextflu/auspice) doesn't register these antigenic changes:

screen shot 2019-01-30 at 8 15 24 am

I thought this might have been how substitutions are encoded in flu_who_h3n2_ha_2y_cell_hi_titer-sub-model.json, but this actually looks good. Thinking now it may be entirely an issue with auspice v1.

Use nextalign instead of custom Python codon alignment script

Context
The workflow uses a custom Python script to perform codon-aware alignment, to resolve issues with variable sites in mafft's alignments. Since @rneher wrote this script, he and @ivan-aksamentov developed nextalign. Even though nextalign was developed with SARS-CoV-2 in mind, it works well for H3N2 HA sequences, too, and is way faster.

Description / proposed solution
We should try using nextalign for our flu builds. We'll need to set up the corresponding FASTA reference files for all of the lineages and segments or maybe implement GenBank file support in nextalign (whichever is easier...I can guess which, though!). Then, we can take advantage of the codon-aware alignment functionality and also run alignments with multiple threads to speed up that step of our builds.

Once we have nextalign in place, we could start to do analyses that previously would have taken too long like creating multiple sequence alignments of all amino acid sequences for HA and running the titer substitution model on all available sequences and titers.

Use augur filter --min/max-date

Require Augur version 15.0.0 and update the pipeline to use the --min/max-date options in augur filter instead of filtering by date within select_strains.

Tip frequencies are lacking

We need to calculate tip-frequencies.json via KDE frequencies in order to restore the frequencies panel. I believe this functionality should be part of augur frequencies where --tree is passed in. And then maybe have flags for --method kde and --tips-only or some such.

cTiterSub is lacking

Currently augur titers sub outputs a JSON that has entries for avidity, potency, substitution, titers, but lacks a nodes entry that could be used to decorate the final tree.json via augur export.

This change needs to happen to augur titers sub and then the seasonal build can be updated.

CDS coordinates are incorrect for Vic and Yam hemagglutinin

Current Behavior

@kistlerk noticed that the demarcated region for HA1 is not divisible by 3 in the HA reference files for Vic and Yam.

Expected behavior

The CDS for HA1 should be 57..1094 rather than 57..1095, and SigPep should be 12..56.

CDS coordinates in GenBank format are 1-based and closed. To calculate the amino acids in a given range, you have to add 1 to the range before dividing by 3. For H3N2's HA, we get the 16 amino acids of SigPep by (48 - 1 + 1) / 3 and the 329 of HA1 with (1035 - 49 + 1) / 3.
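The arithmetic can be checked with a small helper over 1-based, closed coordinates, which also confirms that the corrected Vic/Yam ranges encode whole codons:

```python
# Sketch: amino acids encoded by a GenBank-style 1-based, closed
# nucleotide coordinate range.

def aa_length(start, end):
    nt = end - start + 1          # closed interval includes both endpoints
    assert nt % 3 == 0, f"range {start}..{end} is not divisible by 3"
    return nt // 3
```

For H3N2 HA this gives aa_length(1, 48) == 16 for SigPep and aa_length(49, 1035) == 329 for HA1; the corrected Vic/Yam coordinates give aa_length(12, 56) == 15 and aa_length(57, 1094) == 346.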

Assign clades to nodes via tabular metadata

Context

I am working on using Nextstrain for global surveillance of influenza. As part of our bioinformatics workflow, we already assign each sequenced virus to a clade (e.g., 3C.2a1b.2a.2) and we can pull this as part of the tabular metadata for a build. Currently, augur clades recomputes this for each node using the definitions file.

Description

I would like to write to a clades.json directly from tabular metadata instead of recomputing the clade based on augur clades.

Possible solution

  1. Function in the API to write an augur formatted json based on tabular data.
  2. Logic within augur clades to write the json from strain and clade columns.
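The first option might be sketched like this. The {"nodes": {strain: {...}}} layout and the clade_membership key follow the general augur node-data convention, but treating this as an assumption about the exact schema augur clades emits:

```python
import json

# Sketch: build an augur-style node-data structure directly from tabular
# metadata rows (dicts with "strain" and a clade column), skipping rows
# without a clade assignment.

def clades_json_from_metadata(rows, clade_column="clade"):
    return {
        "nodes": {
            row["strain"]: {"clade_membership": row[clade_column]}
            for row in rows
            if row.get(clade_column)
        }
    }

# Writing it out would then be, e.g.:
# with open("clades.json", "w") as fh:
#     json.dump(clades_json_from_metadata(rows), fh, indent=2)
```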

Refactor workflows to follow Nextstrain standards from ncov workflow

Context

In the near future, we would like to support region-specific flu builds and also enable other researchers to use this repository to define their own custom flu builds in the same way the ncov repository allows custom workflows.

Description

Seasonal flu builds should follow the same general pattern as ncov builds where a standard workflow depends on an initial selection of sequences (based on some subsampling logic), inferring and annotating phylogenies, and then running build-specific steps to finalize the analysis.

Users should be able to configure their own workflows by modifying configuration files (e.g., YAML or JSON) and/or defining custom Snakemake profiles.

Existing parallel Nextstrain builds ("live" and "WHO") should be executable through this framework, such that we are users of our own workflow configuration system.

Examples

See Nextstrain's ncov regional builds definitions for a real example of what we would like to support for flu eventually.

Possible solutions

Specific solutions will require more discussion, but the basic first steps toward addressing this issue would be similar to what we had to do for ncov back in March/April:

  • Move all hardcoded parameters out of Snakefiles and into a configuration file.
  • Define a single top-level Snakefile that references component Snakefiles in workflow/*.smk files including a common.smk for shared functions and builds.smk for the main build logic.
  • Rewrite the WHO builds logic using pre- and post-main workflow rules, using dependencies and a custom profile to manage which rules are executed instead of running a separate Snakefile.
  • Organize outputs by named builds with custom parameters in a configuration file instead of using wildcards
  • Add support for running the workflow on a SLURM cluster and through AWS Batch by annotating rule-specific resources (memory, disk, and threads). (4d2cd21)
  • Define per-rule conda environments to allow custom dependency definitions outside of the canonical Nextstrain environment or Docker image. (4d2cd21)
  • Document workflow configuration parameters and basic usage through a tutorial (similar to the ncov tutorial).

ep, ne and rb scores are lacking

We need to calculate ep, ne and rb for each node in the tree. I know that @huddlej has a clean way to do this well. I don't know whether this is better as augur scores or augur sequence-traits or if it's better as a separate script in seasonal-flu/scripts/.

LBI is lacking

We should have lbi calculated for all nodes in the tree. I suspect this should just be a simple script in scripts/.

Annotate more granular clades on nextflu site

Context
We currently define clades for all Nextstrain flu builds using the same manually curated clade definition files. However, for surveillance purposes, it would be helpful to have a way to identify and talk about new clades before they have reached high enough frequency to get manually annotated.

Description
Assign more granular clade ids to all clades in trees that have at least one amino acid mutation.

Possible solution
Consider using the find_clades.py script (or something like it) from the flu-forecasting project to automatically assign a unique id to each distinct amino acid haplotype. These haplotypes could span all of HA or could be focused on HA1. These annotations could be added just to the nextflu private builds, initially, to avoid confusion with the manually curated clades. Alternately, we could include a description of how these clades are identified, so the annotations would make more sense in the public builds.

Enable preferred sampling of specific regions during strain selection

The subsampling logic for strain selection in original augur supported a "priority" sampling for specific regions. This type of prioritization is helpful for custom flu builds where one wants to investigate evolutionary patterns in specific regions. The current strain selection script for seasonal flu exposes the same --sampling flag used by old augur, but this flag is ignored.

To enable preferred sampling of regions by the current script, some of the code from the original subsampling script needs to be migrated.

Identify "prototype virus" for each putative clade

Context
Phylogenetic trees of HA/NA help us identify emerging clades that should be tracked by routine surveillance. For each of these emerging clades, we need a way to identify candidate strains for testing by experimental assays like HI, FRA, HINT, etc.

Description
Identify one or more prototype virus strains for each putative clade (identified as in #65). The prototypes must be real strains. These could represent strains identified computationally or experimentally as the best representatives.

Possible solution
Produce reference alignments with currently circulating diversity and identify the closest strain(s) to each auto-assigned clade. Most likely, these strains will need to be manually curated after the clades are automatically identified.

Consider labeling the auto-assigned clades (from #65) with the names of the prototype strains we curate here.

Prioritize rather than filter to "complete" genomes

Currently, besides reference genomes, select_strains.py only passes through viruses that possess both HA and NA segments (due to our use of --segments ha na in the Snakefile). For time pivots going back about 3-6 months this is okay, and there are usually enough strains with both HA and NA to fill sampling bins. However, some strains only get their HA sequenced. This was especially obvious just now: there are a number of H3s from January with just HA. These tend to be uploaded by groups that are not CCs.

I would propose modifying select_strains.py so that a "complete" genome (as in possessing all entries in --segments) becomes another factor in prioritization rather than a hard constraint.
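One hedged sketch of completeness as a priority factor rather than a filter; the weight and scoring scheme are assumptions, and a real implementation would combine this with the other sampling priorities in select_strains.py:

```python
# Sketch: score a strain so that "complete" genomes (all required segments
# present) are preferred during subsampling without excluding incomplete ones.
# The linear weighting is an assumption for illustration.

def completeness_priority(strain_segments, required_segments=("ha", "na"),
                          weight=1.0):
    """Return a score in [0, weight] proportional to segment completeness."""
    present = sum(1 for seg in required_segments if seg in strain_segments)
    return weight * present / len(required_segments)
```

A strain with both HA and NA would score 1.0 and an HA-only strain 0.5, so complete genomes fill sampling bins first but HA-only strains remain available when bins would otherwise go empty.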

Annotate auspice trees for non-HA segments with HA clade annotations

We currently annotate clades based on mutations in the HA tree for each subtype. To enable comparisons between HA trees and trees from other segments (like NA), we need to map HA's clade annotations onto the trees from these other segments.

The current HA clades JSON is the same file we'd want to use in the augur export for NA and other non-HA segments (it just maps strain names to clades), but if any internal node names in the NA tree happen to match names in the HA tree, we'd want to omit them to avoid coloring the wrong internal nodes in NA.

To omit internal nodes, we could create a utility script in scripts/strip_internal_nodes.py that takes any "node data" JSON file from modular augur and outputs the corresponding JSON file with all internal nodes stripped. We could then define one rule for HA clade annotations in the Snakefile and a separate rule for non-HA clade annotations that uses the HA JSON as its input.

The resulting Snakemake rules might look like the following:

rule ha_clades:
    message: "Annotating HA clades"
    input:
        tree = rules.refine.output.tree,
        nt_muts = rules.ancestral.output,
        aa_muts = rules.translate.output,
        clade_definitions = "config/clades_{lineage}_ha.tsv"
    output:
        clades = "results/clades_{lineage}_ha_{resolution}.json"
    shell:
        """
        augur clades \
            --tree {input.tree} \
            --mutations {input.nt_muts} {input.aa_muts} \
            --clades {input.clade_definitions} \
            --output-node-data {output.clades}
        """

rule non_ha_clades:
    message: "Annotating non-HA clades ({wildcards.segment})"
    input:
        ha_clades = "results/clades_{lineage}_ha_{resolution}.json"
    output:
        non_ha_clades = "results/clades_{lineage}_{segment}_{resolution}.json"
    shell:
        """
        # The script and its arguments are the proposal above, not existing code.
        python3 scripts/strip_internal_nodes.py \
            --node-data {input.ha_clades} \
            --output {output.non_ha_clades}
        """

Instructions for custom flu dataset

Hi,

I'd like to build the analysis for a custom flu dataset.
It seems the documentation and tutorials are not detailed enough for the flu build.
For instance, I was able to run the zika virus tutorial, but the input file formats look different.
The seasonal flu build using the example data in this repository works fine, but again, the input file format looks different.

So the questions are, if I want to run the analysis using a custom dataset:

  1. Where should I put the sequence file, and what should the filename be?
  2. What information should the headers of the FASTA sequences contain?
  3. Where should the metadata file go, and what is its format? (tab-separated? which columns?)
  4. How can I use a custom titer dataset? (Should the filename match the sequence file? What is the last column, which is often 'hi'?)
  5. Where should I put the reference (vaccine) sequences so that they are marked with an 'X' in the visualization?

The example data tells me some information about the format, but I was unable to run the nextstrain build on my own sequence and titer data.
Thank you for any advice.

Create dated copies of 2y builds

Context

To enable SARS-CoV-2 narratives based on data at specific timepoints, we've maintained dated Auspice JSONs that are automatically generated as part of the standard Nextstrain workflow. These dated builds follow the naming format of auspice/ncov_{build_name}_{date}.json where the date is in YYYY-MM-DD format.

Description

To support similar narrative reports on the status of influenza, we should create dated copies of builds for the four lineages. Since our reporting usually only needs recent context, we can start by creating these copies only for the 2y builds of HA and NA. With tree/meta, tip frequencies, and root sequence JSONs, this approach should produce 24 JSONs per day (4 lineages × 2 segments × 3 JSONs per build).

Possible solution

Following the ncov workflow example, the resulting dated JSONs would be named like:

flu_seasonal_h3n2_ha_2y_2021-04-20.json

And would produce a URL like https://nextstrain.org/flu/seasonal/h3n2/ha/2y/2021-04-20.
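Under this first naming scheme, a small helper could enumerate the expected dated filenames. The `_tip-frequencies` and `_root-sequence` suffixes are assumed from the usual Auspice sidecar naming convention; the helper itself is illustrative:

```python
from datetime import date

LINEAGES = ["h3n2", "h1n1pdm", "vic", "yam"]
SEGMENTS = ["ha", "na"]
# "" is the main tree/meta dataset JSON; the others are its sidecars.
SIDECARS = ["", "_tip-frequencies", "_root-sequence"]

def dated_build_files(build_date):
    """Return the dated Auspice JSON filenames for all 2y builds on a day."""
    stamp = build_date.isoformat()  # YYYY-MM-DD
    return [
        f"flu_seasonal_{lineage}_{segment}_2y_{stamp}{sidecar}.json"
        for lineage in LINEAGES
        for segment in SEGMENTS
        for sidecar in SIDECARS
    ]

files = dated_build_files(date(2021, 4, 20))
print(len(files))  # 24, matching the 4 lineages × 2 segments × 3 JSONs above
```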

An alternate solution that would be more consistent with the hierarchical grouping standard of HTTP APIs would place the date earlier in the name such that all builds from that date could be displayed below. An example dated JSON would look like:

flu_seasonal_2021-04-20_h3n2_ha_2y.json

And the URL would look like https://nextstrain.org/flu/seasonal/2021-04-20/h3n2/ha/2y. When we eventually support browsing builds by exploring the URL's API, this hierarchical layout would enable a page at https://nextstrain.org/flu/seasonal/2021-04-20/ that could show all builds from that day.

Open question: Should we produce these dated builds for all trees on the private site, too, or only the public site?

Example build not working. MissingRuleException

Current Behavior

Cloned seasonal-flu repo, installed nextstrain cli with conda setup.
I tried to run the example build (nextstrain build . targets/flu_seasonal_h3n2_ha_12y), but it does not work, producing a "MissingRuleException."

Before the refactored workflow update, I was able to build with the same steps and example data.
I'd like to run the same with the updated, refactored workflow.

How to reproduce

Steps to reproduce the current behavior:

  1. Clone seasonal-flu repo
  2. conda create -n nextstrain # python 3.9.12
  3. Install NextStrain according to the documentation. curl -fsSL --proto '=https' https://nextstrain.org/cli/installer/linux | bash; printf '\n%s\n' 'eval "$("/home/limh25/.nextstrain/cli-standalone/nextstrain" init-shell bash)"' >> ~/.bashrc; eval "$("/home/limh25/.nextstrain/cli-standalone/nextstrain" init-shell bash)"
  4. python3 -m pip install nextstrain-cli; nextstrain setup conda; nextstrain setup --set-default conda
  5. Copy example data to data/
  6. Run the example build.
(nextstrain) [limh25@machine1 seasonal-flu]$ nextstrain build . targets/flu_seasonal_h3n2_ha_12y --configfile /app/limh25/seasonal-flu/profiles/example/builds.yaml --cores 8 --printshellcmds --show-failed-logs --reason --verbose

Building DAG of jobs...
Full Traceback (most recent call last):
  File "/home/limh25/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/__init__.py", line 771, in snakemake
    success = workflow.execute(
  File "/home/limh25/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/workflow.py", line 741, in execute
    dag.init()
  File "/home/limh25/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/dag.py", line 210, in init
    self.file2jobs(file),
  File "/home/limh25/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/dag.py", line 1984, in file2jobs
    raise MissingRuleException(targetfile)
snakemake.exceptions.MissingRuleException: No rule to produce targets/flu_seasonal_h3n2_ha_12y (if you use input functions make sure that they don't raise unexpected exceptions).

MissingRuleException:
No rule to produce targets/flu_seasonal_h3n2_ha_12y (if you use input functions make sure that they don't raise unexpected exceptions).
unlocking
removed all locks

Your environment: if running Nextstrain locally

  • Operating system: Amazon Linux 2
  • Version:
    (nextstrain) [limh25@machine1 seasonal-flu]$ nextstrain version
    nextstrain.cli 7.1.0, python 3.9.12
(nextstrain) [limh25@machine1 seasonal-flu]$ nextstrain check-setup
nextstrain-cli is up to date!

Testing your setup…

# docker is not supported
✘ no: docker is installed
✘ no: docker run works
? unknown: containers have access to >2 GiB of memory
✔ yes: image is new enough for this CLI version

# conda is supported
✔ yes: operating system is supported
✔ yes: runtime data dir doesn't have spaces
✔ yes: runtime appears set up
✔ yes: snakemake is installed and runnable
✔ yes: augur is installed and runnable
✔ yes: auspice is installed and runnable

# singularity is not supported
✘ no: singularity is installed
✘ no: singularity version None ≥ 3.0.0
✘ no: singularity works

# ambient is not supported
✘ no: snakemake is installed and runnable
✘ no: augur is installed and runnable
✘ no: auspice is installed and runnable

# aws-batch is not supported
✘ no: job description "nextstrain-job" exists
✘ no: job queue "nextstrain-job-queue" exists
✘ no: S3 bucket "nextstrain-jobs" exists

Supported Nextstrain runtimes: conda

All good!  Default runtime (conda) is supported.

There was no problem running the example build before the refactoring update; now it does not work as expected.
Please help me run the example build, and let me know if any further command-line output is needed to debug.

Thank you.

Estimate tip frequencies up to the month of latest available strains

We currently estimate tip frequencies up to the current date of a given build, but the latest available sequences are often collected one or two months prior to the current date. As a result, the frequencies misrepresent what we know about the present. For H3N2, this lack of data also makes the forecasts less reliable because they end up being 14-15 month forecasts instead of 12-month forecasts.

Instead of using the current date as the max date for tip frequencies, we should dynamically determine the max date from the available sequences in the analysis, rounding up to the first day of the next month. For example, if the latest available sequence is from April 15, we would set the max date to May 1.
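The rounding rule might be sketched as follows (function name hypothetical; the real workflow would feed the latest collection date from the build's metadata):

```python
from datetime import date

def frequencies_max_date(latest_collection_date):
    """Round the latest collection date up to the first day of the next month."""
    year, month = latest_collection_date.year, latest_collection_date.month
    if month == 12:
        return date(year + 1, 1, 1)
    return date(year, month + 1, 1)

frequencies_max_date(date(2021, 4, 15))  # date(2021, 5, 1), per the example above
```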

Divergence tree does not display properly on the x-axis for nextflu site

Current Behavior
The time tree view on nextflu works as expected:

[screenshot: time tree renders correctly]

When one deselects the "timetree" option to view the divergence tree, the tree collapses to one position on the x-axis:

[screenshot: divergence tree collapsed to a single x-axis position]

Expected behavior

The divergence view should show the branch length in divergence along the x-axis.

How to reproduce
Steps to reproduce the current behavior:

  1. Navigate to nextflu site
  2. Deselect "timetree" checkbox under "Options" in the left-hand navigation

Additional context

It is not clear if this issue is with the seasonal-flu build's tree JSONs or the nextflu code itself. In the latter case, we should transfer this issue to the nextflu repository.

Annotate new clades for H3N2 and H1N1pdm

Context
H3N2 has an emerging clade derived from A1b/135K that has reached high enough frequency to start tracking. We agreed to name this clade A1b/186D and to annotate it based on the mutations 135K, 186D, and 190N.

H1N1pdm has several previously identified clades that we haven't annotated yet. These include:

  • 5A: 129D and 185I
  • 5B: 160M

And a newer clade:

  • 187A/189E

Description
We need to add annotations for these new clades to the corresponding clades TSV files in the builds and confirm visually that they work as expected.
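Assuming the clade definitions follow augur clades' four-column TSV format (clade name, gene, site, alternate amino acid — an assumption based on the config/clades_{lineage}_ha.tsv files referenced elsewhere in this repo, with HA1 numbering for the sites), the new H3N2 clade might be encoded as:

```tsv
clade	gene	site	alt
A1b/186D	HA1	135	K
A1b/186D	HA1	186	D
A1b/186D	HA1	190	N
```

The H1N1pdm clades listed above would get analogous rows in their lineage's TSV.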

Upload metadata.tsv for downstream use

You already upload the raw FASTA to AWS in the rule upload_sequences.

It would be useful for our https://github.com/neherlab/flu_frequencies repo to be able to download metadata.tsv as produced by your rule parse.

I will duplicate your parse for now, but in the future it would be nice to reduce code duplication and parse in one place.

Is there a reason you are currently only uploading the raw data and not the parsed data?

H3N2 NA aa length is wrong

Current Behavior

The H3N2 NA build on nextstrain.org contains a major mutation (I469T) that is not selectable via the 'Color By' genotype option: the selectable range of positions is 1-468, and position 469 falls outside it.

Expected behavior

This link should work:
https://nextstrain.org/flu/seasonal/h3n2/na/2y?branchLabel=aa&c=gt-NA_469

But nextstrain will return the standard tree. See NODE_0001125 for the mutation I am referring to.

Possible solution

Change the range to 1 - 469
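A quick check of which annotated mutations fall outside a given protein length illustrates the off-by-one (the helper and its inputs are hypothetical, not part of the build):

```python
def out_of_range_mutations(mutations, protein_length):
    """Flag amino-acid mutations annotated beyond the protein length.

    `mutations` are strings like "I469T": reference residue, 1-based
    position, alternate residue.
    """
    return [m for m in mutations if int(m[1:-1]) > protein_length]

out_of_range_mutations(["K142R", "I469T"], 468)  # ["I469T"]
```

With the range extended to 469, the same check returns nothing for I469T.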
