seattleflu / augur-build
This project forked from nextstrain/seasonal-flu
the (in development) augur build for understanding influenza dynamics in Seattle
Home Page: https://seattleflu.org
@miparedes @cassiawag ---
Could you add two command line options to extract_cluster_fastas.py? These are:
--min-size: Take a parameter for the minimum number of genomes to include in an output cluster. I expect this will usually be set to --min-size 2. A lot of the downstream machinery will break if there's just a single sample (ie I expect augur tree to break if handed a FASTA alignment with just a single element).
--filter-to-seattle: It should make life easier downstream if we only export cluster FASTAs that contain "seattle" viruses. You'll need to identify clusters that contain viruses with region: seattle by importing the metadata tsv.
I'd suggest doing these as two separate feature branches / PRs.
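A minimal sketch of the two filters described above, assuming clusters are already loaded as a mapping from cluster id to strain names and metadata as a mapping from strain name to region. The function name `filter_clusters` and both data shapes are hypothetical; the real script's internals may differ.

```python
def filter_clusters(clusters, regions=None, min_size=1, region=None):
    """Keep clusters with at least `min_size` strains and, when `region` is
    given, at least one strain from that region.

    clusters: dict of cluster id -> list of strain names (assumed shape)
    regions:  dict of strain name -> region string (assumed shape)
    """
    kept = {}
    for cluster_id, strains in clusters.items():
        if len(strains) < min_size:
            continue  # too small for downstream tree building
        if region is not None:
            in_region = any(
                (regions or {}).get(strain, "").lower() == region
                for strain in strains
            )
            if not in_region:
                continue  # no virus from the requested region
        kept[cluster_id] = strains
    return kept


# Toy data for illustration only.
example_clusters = {0: ["s1", "s2"], 1: ["s3"], 2: ["s4", "s5"]}
example_regions = {"s1": "seattle", "s2": "oceania", "s4": "europe", "s5": "europe"}
example_kept = filter_clusters(
    example_clusters, example_regions, min_size=2, region="seattle"
)
```

Here cluster 1 is dropped for size and cluster 2 for containing no Seattle virus.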
@miparedes ---
We have hundreds of census tracts here: https://github.com/seattleflu/augur-build/blob/master/config/lat_longs.tsv#L49. I'd like a coloring that places nearby census tracts with similar colors, the way that we do things for, say, Zika: https://nextstrain.org/zika. For Zika, I did this color scale by hand. This is obviously impossible for the hundreds of census tracts. Instead, we'd like to automate this. My suggested algorithm:
#511EA8, the second X census tracts to #4928B4, etc... where X is chosen to equally distribute census tracts along the 36 elements of the color ramp.
This should end up as a Python script that takes a lat_longs.tsv and also a label (like region, country or location) and produces a colors.tsv output.
No hurry on this. I've wanted a script like this for a while as it would be generally useful in Nextstrain. And I think it could be a good programming challenge.
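The binning step can be sketched as below. The ordering step is not spelled out here, so sorting by latitude and then longitude is purely an assumption to make the example run; any ordering that keeps nearby tracts adjacent could be swapped in. The function names are hypothetical, the ramp is passed in rather than hard-coded, and the three-column output (label, value, color) is assumed to be what colors.tsv expects.

```python
def assign_colors(lat_longs, ramp):
    """Map each location to a ramp color so that each color covers an
    (almost) equal share of locations.

    lat_longs: dict of location name -> (latitude, longitude)
    ramp:      ordered list of hex colors
    """
    # Assumption: latitude-then-longitude order stands in for "nearby".
    ordered = sorted(lat_longs, key=lambda name: lat_longs[name])
    total = len(ordered)
    colors = {}
    for index, name in enumerate(ordered):
        # Integer arithmetic spreads the locations evenly over the ramp.
        colors[name] = ramp[index * len(ramp) // total]
    return colors


def format_colors_tsv(label, colors):
    """Render one tab-separated line per location: label, value, color."""
    return "\n".join(f"{label}\t{name}\t{color}" for name, color in colors.items())


# Toy input: four made-up tracts and only the two ramp colors named above.
toy_tracts = {
    "tract_a": (47.50, -122.30),
    "tract_b": (47.55, -122.31),
    "tract_c": (47.60, -122.32),
    "tract_d": (47.65, -122.33),
}
toy_colors = assign_colors(toy_tracts, ["#511EA8", "#4928B4"])
toy_tsv = format_colors_tsv("location", toy_colors)
```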
@cassiawag @miparedes ---
I'm using reference virus genome A/Beijing/32/1992 to root each cluster phylogeny. It would work to use augur refine to generate a timetree and root things this way, but this is pretty slow. In order to increase the accuracy of this method we should be using a more recent reference virus. If you look at https://nextstrain.org/flu/seasonal/h3n2/ha/12y you'll see that everything circulating finds a common ancestor after A/Perth/16/2009. This should make a good choice.
You'll want to find matching Genbank files via FluDB (https://www.fludb.org/brc/vaccineRecommend.spg?decorator=influenza) and use these to replace config/reference_h3n2_ha.gb, etc... These need to have CDS and gene for each segment. Not all Genbank files do.
This is not blocking things, but should improve accuracy of rooting.
@cassiawag ---
I'd like to measure whether we see a signal of genetically similar viruses being geographically adjacent. A good way to do this would be to take all the viruses with region = Seattle and do pairwise measurements to compare geographic distance between location (census tract) and genetic distance. You can get location from data/metadata_h3n2_ha.tsv. To measure geographic distance you can get lat/longs for each location from config/lat_longs.tsv. To measure genetic distance you can use aligned_h3n2_genome_2y.fasta.
Please write this as a separate Python script in a new folder analyses/. It should take as inputs --metadata, --alignment and --lat-longs and produce a PNG output via Matplotlib.
This can start as a Jupyter notebook that is turned into a script once it's working.
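One way the pairwise comparison could be sketched, using only the standard library. Great-circle (haversine) distance and Hamming distance that skips gaps and Ns are assumptions; the issue does not name either metric. The function names and input shapes are hypothetical, and file parsing and plotting are left out.

```python
import math
from itertools import combinations


def haversine_km(point_a, point_b):
    """Great-circle distance in km between two (latitude, longitude) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*point_a, *point_b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def hamming(seq_a, seq_b, ignore="-N"):
    """Count mismatched sites between two aligned sequences, skipping any
    site where either sequence has a gap or an N."""
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in ignore and b not in ignore
    )


def pairwise_distances(sequences, locations, lat_longs):
    """Return (geographic km, genetic distance) for every pair of strains.

    sequences: dict of strain -> aligned sequence
    locations: dict of strain -> location name
    lat_longs: dict of location name -> (latitude, longitude)
    """
    pairs = []
    for first, second in combinations(sorted(sequences), 2):
        geo = haversine_km(lat_longs[locations[first]], lat_longs[locations[second]])
        gen = hamming(sequences[first], sequences[second])
        pairs.append((geo, gen))
    return pairs


# Toy data for illustration only.
toy_pairs = pairwise_distances(
    {"s1": "ACGT", "s2": "ACGA", "s3": "AC-T"},
    {"s1": "tract_a", "s2": "tract_a", "s3": "tract_b"},
    {"tract_a": (47.60, -122.33), "tract_b": (47.70, -122.33)},
)
```

The resulting pairs could then be passed to a Matplotlib scatter plot and saved as a PNG.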
@miparedes @cassiawag ---
For the clusters build, we need a reference genome to align our concatenated genomes against. Right now we have segment-specific .gb files like:
Could you construct a new .gb file for the concatenated genome? This should be in the same order as the --nt-muts argument to extract_cluster_fastas.py, so following:
python3 scripts/extract_cluster_fastas.py --clusters segments-results/clustering_h3n2_2y.json --nt-muts segments-results/nt-muts_h3n2_ha_2y.json segments-results/nt-muts_h3n2_na_2y.json segments-results/nt-muts_h3n2_pb2_2y.json segments-results/nt-muts_h3n2_pb1_2y.json segments-results/nt-muts_h3n2_pa_2y.json segments-results/nt-muts_h3n2_np_2y.json segments-results/nt-muts_h3n2_mp_2y.json segments-results/nt-muts_h3n2_ns_2y.json --min-size 2 --output clusters-data/h3n2_2y_cluster0.fasta
which corresponds to
['ha', 'na', 'pb2', 'pb1', 'pa', 'np', 'mp', 'ns']
We need to preserve the CDS and gene labeling in the .gb files and also keep numbering intact. This will be a good task for BioPython (https://biopython.org/wiki/SeqIO) which can be used to read in all these features and then you can do some math in Python to produce the concatenated feature set.
This will be a new script to generate this combined genome reference .gb from existing reference files.
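The coordinate math can be sketched without BioPython: each segment's features are shifted by the total length of the segments placed before it. The data shape, the function name `shift_features`, and the toy NA length and feature coordinates are assumptions for illustration; only the segment order and the 1701 bp HA length come from this thread.

```python
SEGMENT_ORDER = ['ha', 'na', 'pb2', 'pb1', 'pa', 'np', 'mp', 'ns']


def shift_features(segments, order=SEGMENT_ORDER):
    """Concatenate per-segment features into genome coordinates.

    segments: dict of segment name -> {"length": int,
              "features": [(feature_type, name, start, end), ...]}
              with 1-based inclusive coordinates (assumed shape).
    Returns the shifted feature list and the total genome length.
    """
    combined = []
    offset = 0
    for segment in order:
        record = segments[segment]
        for feature_type, name, start, end in record["features"]:
            # Shift by the length of everything placed before this segment.
            combined.append((feature_type, name, start + offset, end + offset))
        offset += record["length"]
    return combined, offset


# Toy example with two segments; the NA numbers are made up.
toy_segments = {
    "ha": {"length": 1701, "features": [("CDS", "HA", 1, 1701)]},
    "na": {"length": 1410, "features": [("CDS", "NA", 4, 1410)]},
}
toy_features, toy_total = shift_features(toy_segments, order=["ha", "na"])
```

In a real script the same offsets would be applied to the feature locations read from each GenBank record before writing the combined file.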
@joverlee521 ---
There is a small handful of upstream fixes we need to make to the shipping views.
1. The date field in v2/shipping/augur-build-metadata was formatted as 2019-09-25T19:37:35.483+00:00. This should just read 2019-09-25. I've fixed this on the augur side here: https://github.com/seattleflu/augur-build/blob/master/scripts/download_sfs_metadata.py#L25 for the time being.
2. B/Washington/2/2019. This means that sample UUID fe1a1206-21ef-45ff-8be0-9d7643eef879 would be strain A/Washington/43eef879/2019, ie taking A or B depending on flu A or flu B and taking year from date.
3. neighborhood (within Seattle proper) / puma (outside Seattle proper) for location. I believe that @kairstenfay may have started on this already in ID3C.
4. age_range_coarse as a field in the shipping view.
5. shipping.augur-build-metadata to only those samples that have sequencing data.
Edited to update format for strain name in item 2 and to include items 4 and 5.
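The date and strain-name formats from the first two items can be illustrated as follows. This is a sketch only: the helper names are hypothetical, and taking the last eight characters of the UUID is inferred from the single example given above rather than from a stated rule.

```python
def format_date(timestamp):
    """Reduce an ISO 8601 timestamp such as 2019-09-25T19:37:35.483+00:00
    to its date part, 2019-09-25."""
    return timestamp[:10]


def strain_name(sample_uuid, flu_type, timestamp, place="Washington"):
    """Build a strain name like A/Washington/43eef879/2019.

    flu_type is "A" or "B"; the suffix is assumed to be the last eight
    characters of the sample UUID, and the year comes from the date.
    """
    suffix = sample_uuid[-8:]
    year = format_date(timestamp)[:4]
    return f"{flu_type}/{place}/{suffix}/{year}"


example_strain = strain_name(
    "fe1a1206-21ef-45ff-8be0-9d7643eef879",
    "A",
    "2019-09-25T19:37:35.483+00:00",
)
```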
@miparedes ---
This is similar to #20. We want to take every Seattle virus and calculate the genetic distance to the closest sample in the entire dataset. Ie take every virus with region=Seattle and look at genetic distance for its genome to every other virus. Report the distance that's closest. You can get region from data/metadata_h3n2_ha.tsv. To measure genetic distance you can use aligned_h3n2_genome_2y.fasta.
Please write this as a separate Python script in a new folder analyses/. It should take as inputs --metadata and --alignment and produce a PNG output via Matplotlib.
This can start as a Jupyter notebook that is turned into a script once it's working.
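A rough sketch of the nearest-neighbor calculation, assuming sequences and regions are already loaded into dictionaries. Hamming distance that skips gaps and Ns is an assumption (the issue does not fix a metric), and the names below are hypothetical.

```python
def site_differences(seq_a, seq_b, ignore="-N"):
    """Count mismatched sites between two aligned sequences, skipping any
    site where either sequence has a gap or an N."""
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in ignore and b not in ignore
    )


def closest_distances(sequences, regions, focal_region="seattle"):
    """For each strain in the focal region, return the smallest genetic
    distance to any other strain in the dataset.

    sequences: dict of strain -> aligned sequence
    regions:   dict of strain -> region string
    """
    closest = {}
    for strain, sequence in sequences.items():
        if regions.get(strain, "").lower() != focal_region:
            continue
        distances = [
            site_differences(sequence, other)
            for name, other in sequences.items()
            if name != strain
        ]
        if distances:
            closest[strain] = min(distances)
    return closest


# Toy data for illustration only.
toy_closest = closest_distances(
    {"s1": "ACGTAC", "s2": "ACGTAA", "s3": "TTGTAC", "s4": "ACGTAC"},
    {"s1": "seattle", "s2": "seattle", "s3": "oceania", "s4": "europe"},
)
```

This compares every focal strain against the whole alignment, which is quadratic in the number of sequences; that may be acceptable for a two-year dataset but is worth checking. The values could then be drawn as a Matplotlib histogram.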
@jameshadfield ---
Currently, clustering across segments proceeds as:
1. connected_components.py and add cluster label to each tip via its node data output (rule clustering)
2. cluster (rule clusters_fasta)
3. augur align (rule align_clusters)
4. augur tree (rule tree_clusters)
5. augur refine (rule refine_clusters)
6. (rule aggregate_cluster_trees)
7. augur refine to estimate timetree, but keep topology fixed (rule refine_aggregated)
The output here is then a single tree where each cluster forms a monophyletic clade. I believe the only remaining thing we need here is a script to "hide" nodes basal to each of these monophyletic cluster clades. I imagine this works as follows:
Take as input:
results/aggregated/tree_h3n2_genome_2y.nwk
results/clustering_h3n2_2y.json
Output:
hidden, eg results/aggregated/mask_h3n2_genome_2y.json.
Then this "mask" would be handed to export_aggregated to result in an exploded auspice JSON. Does this seem like a reasonable approach?
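To make the proposal concrete, here is a sketch of the masking logic on a tree given as a plain child map rather than a parsed Newick file. A node is treated as basal when its descendant tips span more than one cluster. The function name, the input shapes, and the value "always" for hidden are all assumptions; only the idea of a node data JSON with a hidden attribute comes from the proposal above.

```python
import json


def build_mask(children, tip_clusters, root):
    """Return node data marking every internal node whose descendant tips
    belong to more than one cluster.

    children:     dict of node name -> list of child node names (tips absent)
    tip_clusters: dict of tip name -> cluster label
    """
    hidden = {}

    def clusters_below(node):
        kids = children.get(node, [])
        if not kids:
            return {tip_clusters[node]}
        seen = set()
        for kid in kids:
            seen |= clusters_below(kid)
        if len(seen) > 1:
            # Assumed attribute value; adjust to whatever export expects.
            hidden[node] = {"hidden": "always"}
        return seen

    clusters_below(root)
    return {"nodes": hidden}


# Toy tree: two cluster clades joined at the root.
toy_mask = build_mask(
    {"root": ["node_1", "node_2"], "node_1": ["a", "b"], "node_2": ["c", "d"]},
    {"a": 0, "b": 0, "c": 1, "d": 1},
    "root",
)
toy_mask_json = json.dumps(toy_mask, indent=2)
```

The recursion is fine for small trees; a very deep tree would need an iterative traversal.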
@cassiawag ---
There's a small fix needed to the genome reference .gb file. If you look at the top of the ha.gb file you'll see:
LOCUS A/Beijing/32/1992 1701 bp DNA VRL 02-MAY-2006
DEFINITION Influenza A virus (A/Beijing/32/1992(H3N2)) hemagglutinin gene,
complete cds.
ACCESSION U26830
VERSION U26830.1 GI:857407
However, the top of the genome.gb file looks like:
LOCUS . 13350 bp DNA VRL 01-JAN-1980
DEFINITION .
ACCESSION <unknown id>
VERSION <unknown id>
Can you use the strain name, ie A/Beijing/32/1992, to set all four of these fields? It should look like:
LOCUS A/Beijing/32/1992 13350 bp DNA VRL 01-JAN-1980
DEFINITION A/Beijing/32/1992
ACCESSION A/Beijing/32/1992
VERSION A/Beijing/32/1992
This is not at all blocking, but would make a couple things nicer.
@cassiawag ---
One more small fix to the create_ref_genome.gb function. The current file looks like:
FEATURES Location/Qualifiers
source 1..1701
/mol_type="genomic RNA"
The 1701 here refers to the total length of the nucleotide sequence in the Genbank file. This should be 13350 instead.
This is a small fix and there's no rush.
@miparedes ---
It would be useful to be able to look at how the genome clustering and tree building performs across the entire dataset (not just clusters with Seattle sequences, even if that's the primary analysis). Could you revise extract_cluster_fastas.py to take an additional optional argument called --filter-to-region where we'd call --filter-to-region seattle in the primary Snakemake build? This could also be called as --filter-to-region oceania etc... Or if it's left out, there is no filtering done based on region metadata.
Additionally, please make --metadata an optional parameter. It should be okay to call extract_cluster_fastas.py with just --clusters, --nt-muts and --output-dir. If --filter-to-region is called without --metadata being supplied the script should throw an error.
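The argument handling requested above could look roughly like this with argparse. The flag names come from this thread; the help strings, the `parse_args` wrapper, and the --min-size default of 1 are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse command line options for a cluster FASTA extraction script."""
    parser = argparse.ArgumentParser(description="Extract per-cluster FASTA files.")
    parser.add_argument("--clusters", required=True, help="clustering JSON")
    parser.add_argument("--nt-muts", nargs="+", required=True,
                        help="nt-muts JSON files, one per segment")
    parser.add_argument("--metadata", help="metadata TSV (optional)")
    parser.add_argument("--filter-to-region",
                        help="only export clusters with a virus from this region")
    parser.add_argument("--min-size", type=int, default=1,
                        help="minimum genomes per cluster (default is an assumption)")
    parser.add_argument("--output-dir", required=True, help="output directory")
    args = parser.parse_args(argv)

    # Region filtering needs region labels, which only the metadata provides.
    if args.filter_to_region and not args.metadata:
        parser.error("--filter-to-region requires --metadata")
    return args


# Example call without any region filtering.
example_args = parse_args(
    ["--clusters", "clusters.json", "--nt-muts", "ha.json", "na.json",
     "--output-dir", "clusters-data"]
)
```

Using `parser.error` prints the usage message and exits with status 2, which is one reasonable reading of "throw an error"; raising an exception would work too.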