
augur-build's Issues

Filter cluster output

@miparedes @cassiawag ---

Could you add two command line options to extract_cluster_fastas.py? These are:

  • --min-size: Take a parameter for the minimum number of genomes to include in an output cluster. I expect this will usually be set to --min-size 2, as a lot of the downstream machinery will break if there's just a single sample (i.e. I expect augur tree to break if handed a FASTA alignment with just a single element).
  • --filter-to-seattle: It should make life easier downstream if we only export cluster FASTAs that contain "seattle" viruses. You'll need to identify clusters that contain viruses with region: seattle by importing the metadata tsv.

I'd suggest doing these as two separate feature branches / PRs.
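For what it's worth, a rough sketch of how the two options could hang together (the cluster dict shape and the metadata column names here are assumptions, not the script's actual internals):

import argparse
import csv

def seattle_strains(metadata_tsv):
    # Strains whose region column reads "seattle"; the `strain` and `region`
    # column names are assumptions about the metadata TSV.
    with open(metadata_tsv) as fh:
        return {row["strain"] for row in csv.DictReader(fh, delimiter="\t")
                if row.get("region", "").lower() == "seattle"}

def filter_clusters(clusters, min_size, seattle=None):
    # clusters: dict of cluster id -> list of strain names (assumed shape).
    return {cid: strains for cid, strains in clusters.items()
            if len(strains) >= min_size
            and (seattle is None or any(s in seattle for s in strains))}

parser = argparse.ArgumentParser()
parser.add_argument("--min-size", type=int, default=1,
                    help="minimum number of genomes per output cluster")
parser.add_argument("--filter-to-seattle", action="store_true",
                    help="only export clusters containing seattle viruses")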

Create reproducible coloring for census tracts

@miparedes ---

We have hundreds of census tracts here: https://github.com/seattleflu/augur-build/blob/master/config/lat_longs.tsv#L49. I'd like a coloring that places nearby census tracts with similar colors, the way that we do things for, say, Zika: https://nextstrain.org/zika. For Zika, I did this color scale by hand. This is obviously impossible for the hundreds of census tracts. Instead, we'd like to automate this. My suggested algorithm:

  1. Load a mapping of census tract to lat/long
  2. Perform PCA on these lat/long values to orient the majority of variation along PC1, with PC2 orthogonal
  3. Use PC1 to pick colors along the official Nextstrain ramp: https://github.com/nextstrain/auspice/blob/master/src/util/globals.js#L128
  4. If you rank order PC1 you can slot the first X census tracts to #511EA8, the second X census tracts to #4928B4, etc... where X is chosen to equally distribute census tracts along the 36 elements of the color ramp.

This should end up as a Python script that takes a lat_longs.tsv and also a label (like region, country or location) and produces a colors.tsv output.
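A rough sketch of what I have in mind (the lat_longs.tsv column layout is an assumption, and RAMP should be the full 36-element ramp pasted from globals.js):

import argparse
import numpy as np
import pandas as pd

# First two entries of the Nextstrain ramp from above; paste in the full
# 36-element list from auspice's src/util/globals.js.
RAMP = ["#511EA8", "#4928B4"]

parser = argparse.ArgumentParser()
parser.add_argument("--lat-longs", required=True)
parser.add_argument("--label", default="location", help="eg region, country or location")
parser.add_argument("--output", default="colors.tsv")
args = parser.parse_args()

# Assumed lat_longs.tsv columns: label type, name, latitude, longitude (no header).
df = pd.read_csv(args.lat_longs, sep="\t", header=None,
                 names=["type", "name", "latitude", "longitude"])
df = df[df["type"] == args.label]

# PCA by hand: center the coordinates, then project onto the first
# principal axis from the SVD.
coords = df[["latitude", "longitude"]].to_numpy(dtype=float)
centered = coords - coords.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = centered @ vt[0]

# Rank-order PC1 and slot equal-sized blocks of census tracts into the ramp.
order = np.argsort(pc1)
bins = np.array_split(order, len(RAMP))
with open(args.output, "w") as fh:
    for color, idx in zip(RAMP, bins):
        for i in idx:
            fh.write(f"{args.label}\t{df['name'].iloc[i]}\t{color}\n")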

No hurry on this. I've wanted a script like this for a while as it would be generally useful in Nextstrain. And I think it could be a good programming challenge.

Swap out reference for a more recent virus

@cassiawag @miparedes ---

I'm using reference virus genome A/Beijing/32/1992 to root each cluster phylogeny. It would work to use augur refine to generate a timetree and root things this way, but this is pretty slow. To increase the accuracy of this method we should use a more recent reference virus. If you look at https://nextstrain.org/flu/seasonal/h3n2/ha/12y you'll see that everything circulating finds a common ancestor after A/Perth/16/2009. This should make a good choice.

You'll want to find matching GenBank files via FluDB (https://www.fludb.org/brc/vaccineRecommend.spg?decorator=influenza) and use these to replace config/reference_h3n2_ha.gb, etc... These need to have CDS and gene features for each segment; not all GenBank files do.

This is not blocking things, but should improve accuracy of rooting.

Compare genetic distance to geographic distance

@cassiawag ---

I'd like to measure whether we see a signal of genetically similar viruses being geographically adjacent. A good way to do this would be to take all the viruses with region = Seattle and do pairwise measurements to compare geographic distance between location (census tract) and genetic distance. You can get location from data/metadata_h3n2_ha.tsv. To measure geographic distance you can get lat/longs for each location from config/lat_longs.tsv. To measure genetic distance you can use aligned_h3n2_genome_2y.fasta.

Please write this as a separate Python script in a new folder analyses/. It should take as inputs --metadata, --alignment and --lat-longs and produce a PNG output via Matplotlib.

This can start as a Jupyter notebook that is turned into a script once it's working.
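A sketch of what the script could look like (the metadata column names and the lat_longs.tsv layout are assumptions):

import argparse
import math
from itertools import combinations

import matplotlib
matplotlib.use("Agg")  # write PNGs without a display
import matplotlib.pyplot as plt
import pandas as pd
from Bio import SeqIO

def hamming(a, b):
    # Pairwise mismatches, counted over unambiguous bases only.
    return sum(x != y for x, y in zip(a, b) if x in "ACGT" and y in "ACGT")

def haversine(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometers.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi, dlam = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

parser = argparse.ArgumentParser()
parser.add_argument("--metadata", required=True)
parser.add_argument("--alignment", required=True)
parser.add_argument("--lat-longs", required=True)
parser.add_argument("--output", default="genetic_vs_geographic.png")
args = parser.parse_args()

meta = pd.read_csv(args.metadata, sep="\t")
seattle = meta[meta["region"].str.lower() == "seattle"].set_index("strain")["location"]
lat_longs = pd.read_csv(args.lat_longs, sep="\t", header=None,
                        names=["type", "name", "latitude", "longitude"])
lat_longs = lat_longs[lat_longs["type"] == "location"].set_index("name")
seqs = {r.id: str(r.seq).upper() for r in SeqIO.parse(args.alignment, "fasta")
        if r.id in seattle.index}

genetic, geographic = [], []
for a, b in combinations(seqs, 2):
    loc_a, loc_b = seattle[a], seattle[b]
    if loc_a not in lat_longs.index or loc_b not in lat_longs.index:
        continue
    genetic.append(hamming(seqs[a], seqs[b]))
    geographic.append(haversine(lat_longs.at[loc_a, "latitude"], lat_longs.at[loc_a, "longitude"],
                                lat_longs.at[loc_b, "latitude"], lat_longs.at[loc_b, "longitude"]))

plt.scatter(geographic, genetic, s=8, alpha=0.4)
plt.xlabel("geographic distance (km)")
plt.ylabel("pairwise genetic distance (nt differences)")
plt.savefig(args.output, dpi=200)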

Genome reference

@miparedes @cassiawag ---

For the clusters build, we need a reference genome to align our concatenated genomes against. Right now we have segment-specific .gb files like config/reference_h3n2_ha.gb.

Could you construct a new .gb file for the concatenated genome? The segments should be concatenated in the same order as the --nt-muts arguments to extract_cluster_fastas.py, so following:

python3 scripts/extract_cluster_fastas.py \
    --clusters segments-results/clustering_h3n2_2y.json \
    --nt-muts segments-results/nt-muts_h3n2_ha_2y.json \
              segments-results/nt-muts_h3n2_na_2y.json \
              segments-results/nt-muts_h3n2_pb2_2y.json \
              segments-results/nt-muts_h3n2_pb1_2y.json \
              segments-results/nt-muts_h3n2_pa_2y.json \
              segments-results/nt-muts_h3n2_np_2y.json \
              segments-results/nt-muts_h3n2_mp_2y.json \
              segments-results/nt-muts_h3n2_ns_2y.json \
    --min-size 2 \
    --output clusters-data/h3n2_2y_cluster0.fasta

which corresponds to

['ha', 'na', 'pb2', 'pb1', 'pa', 'np', 'mp', 'ns']

We need to preserve the CDS and gene labeling in the .gb files and also keep numbering intact. This will be a good task for BioPython (https://biopython.org/wiki/SeqIO), which can be used to read in all of these features; then some coordinate arithmetic in Python produces the concatenated feature set.

This will be a new script to generate this combined genome reference .gb from existing reference files.
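A sketch of how this could look with BioPython (the file paths and the output record's header fields are placeholders):

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqFeature import CompoundLocation, FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

SEGMENTS = ["ha", "na", "pb2", "pb1", "pa", "np", "mp", "ns"]  # --nt-muts order

def shifted(location, offset):
    # Shift every part of a (possibly compound) location; mp and ns have
    # spliced CDS, so per-part shifting matters here.
    parts = [FeatureLocation(int(p.start) + offset, int(p.end) + offset, p.strand)
             for p in location.parts]
    return parts[0] if len(parts) == 1 else CompoundLocation(parts)

def concatenate(paths):
    seq, features, offset = "", [], 0
    for path in paths:
        record = SeqIO.read(path, "genbank")
        for f in record.features:
            if f.type in ("gene", "CDS"):  # preserve gene/CDS labeling
                features.append(SeqFeature(shifted(f.location, offset),
                                           type=f.type, qualifiers=f.qualifiers))
        seq += str(record.seq)
        offset += len(record.seq)
    out = SeqRecord(Seq(seq), id="genome", name="genome",
                    description="concatenated H3N2 reference", features=features)
    out.annotations["molecule_type"] = "DNA"
    return out

if __name__ == "__main__":
    paths = [f"config/reference_h3n2_{s}.gb" for s in SEGMENTS]
    SeqIO.write(concatenate(paths), "config/reference_h3n2_genome.gb", "genbank")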

Clean up incoming ID3C data

@joverlee521 ---

There are a small handful of upstream fixes we need to the shipping views.

  1. The date field in v2/shipping/augur-build-metadata was formatted as 2019-09-25T19:37:35.483+00:00. This should just read 2019-09-25. I've fixed this on the augur side here: https://github.com/seattleflu/augur-build/blob/master/scripts/download_sfs_metadata.py#L25 for the time being.
  2. Our strain names should match those used by the rest of the world rather than just being a long UUID. I'd like to match the existing format as closely as possible. Strains in the US are geographically labeled by state, like B/Washington/2/2019. This means that sample UUID fe1a1206-21ef-45ff-8be0-9d7643eef879 would be strain A/Washington/43eef879/2019, i.e. taking A or B depending on flu A or flu B and taking year from date.
  3. We need neighborhood (within Seattle proper) / puma (outside Seattle proper) for location. I believe that @kairstenfay may have started on this already in ID3C.
  4. Include age_range_coarse as a field in the shipping view.
  5. Restrict rows in shipping.augur-build-metadata to only those samples that have sequencing data.

Edited to update format for strain name in item 2 and to include items 4 and 5.
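For item 2, the naming rule inferred from the single example above would be something like this sketch:

def strain_name(uuid, flu_type, date):
    # fe1a1206-21ef-45ff-8be0-9d7643eef879, "A", "2019-09-25"
    #   -> "A/Washington/43eef879/2019"
    # Takes the last eight characters of the UUID (per the example) and the
    # year from the collection date.
    return f"{flu_type}/Washington/{uuid[-8:]}/{date[:4]}"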

Calculate genetic distance to closest sample

@miparedes ---

This is similar to #20. We want to take every Seattle virus and calculate the genetic distance to the closest sample in the entire dataset. I.e., take every virus with region=Seattle, compare its genome's genetic distance to every other virus, and report the distance that's closest. You can get region from data/metadata_h3n2_ha.tsv. To measure genetic distance you can use aligned_h3n2_genome_2y.fasta.

Please write this as a separate Python script in a new folder analyses/. It should take as inputs --metadata and --alignment and produce a PNG output via Matplotlib.

This can start as a Jupyter notebook that is turned into a script once it's working.
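A minimal sketch (the metadata column names are assumptions; it's quadratic in the number of sequences, which should be fine at this scale):

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd
from Bio import SeqIO

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b) if x in "ACGT" and y in "ACGT")

meta = pd.read_csv("data/metadata_h3n2_ha.tsv", sep="\t")
seattle = set(meta.loc[meta["region"].str.lower() == "seattle", "strain"])
seqs = {r.id: str(r.seq).upper()
        for r in SeqIO.parse("aligned_h3n2_genome_2y.fasta", "fasta")}

# For each Seattle virus, distance to its nearest neighbor in the dataset.
closest = [min(hamming(seqs[s], seq) for other, seq in seqs.items() if other != s)
           for s in seattle if s in seqs]

plt.hist(closest, bins=30)
plt.xlabel("distance to closest sample (nt differences)")
plt.ylabel("number of Seattle viruses")
plt.savefig("closest_distance.png", dpi=200)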

Include script to hide ancestral nodes based on cluster designation

@jameshadfield ---

Currently, clustering across segments proceeds as:

  1. Identify genomic constellations / clusters via connected_components.py and add cluster label to each tip via its node data output (rule clustering)
  2. Make a genome FASTA file for each cluster (rule clusters_fasta)
  3. Align each cluster with augur align (rule align_clusters)
  4. Build tree for each cluster with augur tree (rule tree_clusters)
  5. Reroot each of these trees with augur refine (rule refine_clusters)
  6. Stitch individual rooted cluster trees together with a basal polytomy (rule aggregate_cluster_trees)
  7. Run augur refine to estimate timetree, but keep topology fixed (rule refine_aggregated)

The output here is then a single tree where each cluster forms a monophyletic clade. I believe the only remaining thing we need here is a script to "hide" nodes basal to each of these monophyletic cluster clades. I imagine this works as follows:

Take as input:

  • the output of (7), eg results/aggregated/tree_h3n2_genome_2y.nwk
  • the output of (1) cluster labeling node data, eg results/clustering_h3n2_2y.json

Output:

  • node data JSON that specifies basal nodes as hidden, eg results/aggregated/mask_h3n2_genome_2y.json.

Then this "mask" would be handed to export_aggregated to result in an exploded auspice JSON. Does this seem like a reasonable approach?
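If that sounds right, the script could be roughly the following (the clustering JSON shape and the hidden node attribute that auspice consumes are assumptions on my part):

import json
from Bio import Phylo

tree = Phylo.read("results/aggregated/tree_h3n2_genome_2y.nwk", "newick")
# Assumed augur node-data shape: {"nodes": {"<strain>": {"cluster": <int>}, ...}}
with open("results/clustering_h3n2_2y.json") as fh:
    clusters = json.load(fh)["nodes"]

node_data = {"nodes": {}}
for node in tree.get_nonterminals():
    tip_clusters = {clusters.get(tip.name, {}).get("cluster")
                    for tip in node.get_terminals()}
    # An internal node whose descendant tips span more than one cluster sits
    # basal to the monophyletic cluster clades, so mark it hidden.
    if len(tip_clusters) > 1:
        node_data["nodes"][node.name] = {"hidden": "always"}

with open("results/aggregated/mask_h3n2_genome_2y.json", "w") as fh:
    json.dump(node_data, fh, indent=2)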

Missing header information in genome reference

@cassiawag ---

There's a small fix needed to the genome reference .gb file. If you look at the top of the ha.gb file you'll see:

LOCUS       A/Beijing/32/1992               1701 bp    DNA              VRL 02-MAY-2006
DEFINITION  Influenza A virus (A/Beijing/32/1992(H3N2)) hemagglutinin gene,
            complete cds.
ACCESSION   U26830
VERSION     U26830.1  GI:857407

However, the top of the genome.gb file looks like:

LOCUS       .                      13350 bp    DNA              VRL 01-JAN-1980
DEFINITION  .
ACCESSION   <unknown id>
VERSION     <unknown id>

Can you use the strain name, ie A/Beijing/32/1992, to set all four of these fields? It should look like:

LOCUS       A/Beijing/32/1992            13350 bp    DNA              VRL 01-JAN-1980
DEFINITION  A/Beijing/32/1992
ACCESSION   A/Beijing/32/1992
VERSION     A/Beijing/32/1992

This is not at all blocking, but would make a couple things nicer.
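In BioPython terms this is something like the sketch below; my understanding is that when writing GenBank, LOCUS comes from record.name, DEFINITION from record.description, and ACCESSION/VERSION are derived from record.id, though BioPython may warn about the long LOCUS name:

def set_header(record, strain):
    # Set all four header fields from the strain name, eg "A/Beijing/32/1992".
    record.name = strain          # LOCUS
    record.description = strain   # DEFINITION
    record.id = strain            # ACCESSION and VERSION
    return record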

Nucleotide length in reference .gb file

@cassiawag ---

One more small fix to the .gb file produced by create_ref_genome. The current file looks like:

FEATURES             Location/Qualifiers
     source          1..1701
                     /mol_type="genomic RNA"

This 1701 refers to the total length of the nucleotide sequence in the GenBank file. For the concatenated genome it should be 13350 instead.

This is a small fix and there's no rush.
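A sketch of the fix, assuming the record is built as in the concatenation script above:

from Bio.SeqFeature import FeatureLocation, SeqFeature

def full_length_source(record):
    # Replace any inherited source feature with one spanning the whole
    # concatenated sequence (1..13350 instead of 1..1701).
    record.features = [f for f in record.features if f.type != "source"]
    record.features.insert(0, SeqFeature(FeatureLocation(0, len(record.seq)),
                                         type="source",
                                         qualifiers={"mol_type": ["genomic RNA"]}))
    return record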

Make filter-to-region optional in extract_cluster_fastas.py

@miparedes ---

It would be useful to be able to look at how the genome clustering and tree building perform across the entire dataset (not just clusters with Seattle sequences, even if that's the primary analysis). Could you revise extract_cluster_fastas.py to take an additional optional argument called --filter-to-region, where we'd call --filter-to-region seattle in the primary Snakemake build? This could also be called as --filter-to-region oceania, etc... If it's left out, no filtering is done based on region metadata.

Additionally, please make --metadata an optional parameter. It should be okay to call extract_cluster_fastas.py with just --clusters, --nt-muts and --output-dir. If --filter-to-region is given without --metadata, the script should throw an error.
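A sketch of the revised argument handling, with parser.error covering the --filter-to-region-without---metadata case:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--clusters", required=True)
parser.add_argument("--nt-muts", nargs="+", required=True)
parser.add_argument("--output-dir", required=True)
parser.add_argument("--metadata", help="metadata TSV; only needed for region filtering")
parser.add_argument("--filter-to-region",
                    help="only export clusters containing viruses from this region")
args = parser.parse_args()

if args.filter_to_region and not args.metadata:
    parser.error("--filter-to-region requires --metadata")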
