
augur-build's Issues

Filter cluster output

@miparedes @cassiawag ---

Could you add two command line options to extract_cluster_fastas.py? These are:

  • --min-size: Take a parameter for the minimum number of genomes to include in an output cluster. I expect this will usually be set to --min-size 2, as a lot of the downstream machinery will break if there's just a single sample (i.e. I expect augur tree to break if handed a FASTA alignment with just a single element).
  • --filter-to-seattle: It should make life easier downstream if we only export cluster FASTAs that contain "seattle" viruses. You'll need to identify clusters that contain viruses with region: seattle by importing the metadata tsv.

I'd suggest doing these as two separate feature branches / PRs.
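For what it's worth, a rough sketch of how the two options could hang together (the cluster dict shape and the metadata column names here are assumptions, not the script's actual internals):

import argparse
import csv

def seattle_strains(metadata_tsv):
    # Strains whose region column reads "seattle"; the `strain` and `region`
    # column names are assumptions about the metadata TSV.
    with open(metadata_tsv) as fh:
        return {row["strain"] for row in csv.DictReader(fh, delimiter="\t")
                if row.get("region", "").lower() == "seattle"}

def filter_clusters(clusters, min_size, seattle=None):
    # clusters: dict of cluster id -> list of strain names (assumed shape).
    return {cid: strains for cid, strains in clusters.items()
            if len(strains) >= min_size
            and (seattle is None or any(s in seattle for s in strains))}

parser = argparse.ArgumentParser()
parser.add_argument("--min-size", type=int, default=1,
                    help="minimum number of genomes per output cluster")
parser.add_argument("--filter-to-seattle", action="store_true",
                    help="only export clusters containing seattle viruses")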

Create reproducible coloring for census tracts

@miparedes ---

We have hundreds of census tracts here: https://github.com/seattleflu/augur-build/blob/master/config/lat_longs.tsv#L49. I'd like a coloring that places nearby census tracts with similar colors, the way that we do things for, say, Zika: https://nextstrain.org/zika. For Zika, I did this color scale by hand. This is obviously impossible for the hundreds of census tracts. Instead, we'd like to automate this. My suggested algorithm:

  1. Load a mapping of census tract to lat/long
  2. Perform PCA on these lat/long values to orient the majority of variation along PC1, with PC2 orthogonal
  3. Use PC1 to pick colors along the official Nextstrain ramp: https://github.com/nextstrain/auspice/blob/master/src/util/globals.js#L128
  4. If you rank order PC1 you can slot the first X census tracts to #511EA8, the second X census tracts to #4928B4, etc... where X is chosen to equally distribute census tracts along the 36 elements of the color ramp.

This should end up as a Python script that takes a lat_longs.tsv and also a label (like region, country or location) and produces a colors.tsv output.
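A rough sketch of what I have in mind (the lat_longs.tsv column layout is an assumption, and RAMP should be the full 36-element ramp pasted from globals.js):

import argparse
import numpy as np
import pandas as pd

# First two entries of the Nextstrain ramp from above; paste in the full
# 36-element list from auspice's src/util/globals.js.
RAMP = ["#511EA8", "#4928B4"]

parser = argparse.ArgumentParser()
parser.add_argument("--lat-longs", required=True)
parser.add_argument("--label", default="location", help="eg region, country or location")
parser.add_argument("--output", default="colors.tsv")
args = parser.parse_args()

# Assumed lat_longs.tsv columns: label type, name, latitude, longitude (no header).
df = pd.read_csv(args.lat_longs, sep="\t", header=None,
                 names=["type", "name", "latitude", "longitude"])
df = df[df["type"] == args.label]

# PCA by hand: center the coordinates, then project onto the first
# principal axis from the SVD.
coords = df[["latitude", "longitude"]].to_numpy(dtype=float)
centered = coords - coords.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = centered @ vt[0]

# Rank-order PC1 and slot equal-sized blocks of census tracts into the ramp.
order = np.argsort(pc1)
bins = np.array_split(order, len(RAMP))
with open(args.output, "w") as fh:
    for color, idx in zip(RAMP, bins):
        for i in idx:
            fh.write(f"{args.label}\t{df['name'].iloc[i]}\t{color}\n")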

No hurry on this. I've wanted a script like this for a while as it would be generally useful in Nextstrain. And I think it could be a good programming challenge.

Swap out reference for a more recent virus

@cassiawag @miparedes ---

I'm using reference virus genome A/Beijing/32/1992 to root each cluster phylogeny. It would work to use augur refine to generate a timetree and root things this way, but this is pretty slow. To increase the accuracy of this method we should use a more recent reference virus. If you look at https://nextstrain.org/flu/seasonal/h3n2/ha/12y you'll see that everything circulating finds a common ancestor after A/Perth/16/2009. This should make a good choice.

You'll want to find matching GenBank files via FluDB (https://www.fludb.org/brc/vaccineRecommend.spg?decorator=influenza) and use these to replace config/reference_h3n2_ha.gb, etc... These need to have CDS and gene features for each segment; not all GenBank files do.

This is not blocking things, but should improve accuracy of rooting.

Compare genetic distance to geographic distance

@cassiawag ---

I'd like to measure whether we see a signal of genetically similar viruses being geographically adjacent. A good way to do this would be to take all the viruses with region = Seattle and do pairwise measurements to compare geographic distance between location (census tract) and genetic distance. You can get location from data/metadata_h3n2_ha.tsv. To measure geographic distance you can get lat/longs for each location from config/lat_longs.tsv. To measure genetic distance you can use aligned_h3n2_genome_2y.fasta.

Please write this as a separate Python script in a new folder analyses/. It should take as inputs --metadata, --alignment and --lat-longs and produce a PNG output via Matplotlib.

This can start as a Jupyter notebook that is turned into a script once it's working.
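A sketch of what the script could look like (the metadata column names and the lat_longs.tsv layout are assumptions):

import argparse
import math
from itertools import combinations

import matplotlib
matplotlib.use("Agg")  # write PNGs without a display
import matplotlib.pyplot as plt
import pandas as pd
from Bio import SeqIO

def hamming(a, b):
    # Pairwise mismatches, counted over unambiguous bases only.
    return sum(x != y for x, y in zip(a, b) if x in "ACGT" and y in "ACGT")

def haversine(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometers.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi, dlam = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

parser = argparse.ArgumentParser()
parser.add_argument("--metadata", required=True)
parser.add_argument("--alignment", required=True)
parser.add_argument("--lat-longs", required=True)
parser.add_argument("--output", default="genetic_vs_geographic.png")
args = parser.parse_args()

meta = pd.read_csv(args.metadata, sep="\t")
seattle = meta[meta["region"].str.lower() == "seattle"].set_index("strain")["location"]
lat_longs = pd.read_csv(args.lat_longs, sep="\t", header=None,
                        names=["type", "name", "latitude", "longitude"])
lat_longs = lat_longs[lat_longs["type"] == "location"].set_index("name")
seqs = {r.id: str(r.seq).upper() for r in SeqIO.parse(args.alignment, "fasta")
        if r.id in seattle.index}

genetic, geographic = [], []
for a, b in combinations(seqs, 2):
    loc_a, loc_b = seattle[a], seattle[b]
    if loc_a not in lat_longs.index or loc_b not in lat_longs.index:
        continue
    genetic.append(hamming(seqs[a], seqs[b]))
    geographic.append(haversine(lat_longs.at[loc_a, "latitude"], lat_longs.at[loc_a, "longitude"],
                                lat_longs.at[loc_b, "latitude"], lat_longs.at[loc_b, "longitude"]))

plt.scatter(geographic, genetic, s=8, alpha=0.4)
plt.xlabel("geographic distance (km)")
plt.ylabel("pairwise genetic distance (nt differences)")
plt.savefig(args.output, dpi=200)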

Genome reference

@miparedes @cassiawag ---

For the clusters build, we need a reference genome to align our concatenated genomes against. Right now we have segment-specific .gb files like config/reference_h3n2_ha.gb.

Could you construct a new .gb file for the concatenated genome? The segments should be concatenated in the same order as the --nt-muts arguments to extract_cluster_fastas.py, so following:

python3 scripts/extract_cluster_fastas.py \
    --clusters segments-results/clustering_h3n2_2y.json \
    --nt-muts segments-results/nt-muts_h3n2_ha_2y.json \
              segments-results/nt-muts_h3n2_na_2y.json \
              segments-results/nt-muts_h3n2_pb2_2y.json \
              segments-results/nt-muts_h3n2_pb1_2y.json \
              segments-results/nt-muts_h3n2_pa_2y.json \
              segments-results/nt-muts_h3n2_np_2y.json \
              segments-results/nt-muts_h3n2_mp_2y.json \
              segments-results/nt-muts_h3n2_ns_2y.json \
    --min-size 2 \
    --output clusters-data/h3n2_2y_cluster0.fasta

which corresponds to

['ha', 'na', 'pb2', 'pb1', 'pa', 'np', 'mp', 'ns']

We need to preserve the CDS and gene labeling in the .gb files and also keep numbering intact. This will be a good task for BioPython (https://biopython.org/wiki/SeqIO), which can be used to read in all of these features; then some coordinate arithmetic in Python produces the concatenated feature set.

This will be a new script to generate this combined genome reference .gb from existing reference files.
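A sketch of how this could look with BioPython (the file paths and the output record's header fields are placeholders):

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqFeature import CompoundLocation, FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

SEGMENTS = ["ha", "na", "pb2", "pb1", "pa", "np", "mp", "ns"]  # --nt-muts order

def shifted(location, offset):
    # Shift every part of a (possibly compound) location; mp and ns have
    # spliced CDS, so per-part shifting matters here.
    parts = [FeatureLocation(int(p.start) + offset, int(p.end) + offset, p.strand)
             for p in location.parts]
    return parts[0] if len(parts) == 1 else CompoundLocation(parts)

def concatenate(paths):
    seq, features, offset = "", [], 0
    for path in paths:
        record = SeqIO.read(path, "genbank")
        for f in record.features:
            if f.type in ("gene", "CDS"):  # preserve gene/CDS labeling
                features.append(SeqFeature(shifted(f.location, offset),
                                           type=f.type, qualifiers=f.qualifiers))
        seq += str(record.seq)
        offset += len(record.seq)
    out = SeqRecord(Seq(seq), id="genome", name="genome",
                    description="concatenated H3N2 reference", features=features)
    out.annotations["molecule_type"] = "DNA"
    return out

if __name__ == "__main__":
    paths = [f"config/reference_h3n2_{s}.gb" for s in SEGMENTS]
    SeqIO.write(concatenate(paths), "config/reference_h3n2_genome.gb", "genbank")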

Clean up incoming ID3C data

@joverlee521 ---

There are a small handful of upstream fixes we need to the shipping views.

  1. The date field in v2/shipping/augur-build-metadata was formatted as 2019-09-25T19:37:35.483+00:00. This should just read 2019-09-25. I've fixed this on the augur side here: https://github.com/seattleflu/augur-build/blob/master/scripts/download_sfs_metadata.py#L25 for the time being.
  2. Our strain names should match those used by the rest of the world rather than just being a long UUID. I'd like to match the existing format as closely as possible. Strains in the US are geographically labeled by state, like B/Washington/2/2019. This means that sample UUID fe1a1206-21ef-45ff-8be0-9d7643eef879 would be strain A/Washington/43eef879/2019, i.e. taking A or B depending on flu A or flu B and taking year from date.
  3. We need neighborhood (within Seattle proper) / puma (outside Seattle proper) for location. I believe that @kairstenfay may have started on this already in ID3C.
  4. Include age_range_coarse as a field in the shipping view.
  5. Restrict rows in shipping.augur-build-metadata to only those samples that have sequencing data.

Edited to update format for strain name in item 2 and to include items 4 and 5.
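For item 2, the naming rule inferred from the single example above would be something like this sketch:

def strain_name(uuid, flu_type, date):
    # fe1a1206-21ef-45ff-8be0-9d7643eef879, "A", "2019-09-25"
    #   -> "A/Washington/43eef879/2019"
    # Takes the last eight characters of the UUID (per the example) and the
    # year from the collection date.
    return f"{flu_type}/Washington/{uuid[-8:]}/{date[:4]}"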

Calculate genetic distance to closest sample

@miparedes ---

This is similar to #20. We want to take every Seattle virus and calculate the genetic distance to the closest sample in the entire dataset. I.e., take every virus with region=Seattle, compare its genome's genetic distance to every other virus, and report the distance that's closest. You can get region from data/metadata_h3n2_ha.tsv. To measure genetic distance you can use aligned_h3n2_genome_2y.fasta.

Please write this as a separate Python script in a new folder analyses/. It should take as inputs --metadata and --alignment and produce a PNG output via Matplotlib.

This can start as a Jupyter notebook that is turned into a script once it's working.
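A minimal sketch (the metadata column names are assumptions; it's quadratic in the number of sequences, which should be fine at this scale):

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd
from Bio import SeqIO

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b) if x in "ACGT" and y in "ACGT")

meta = pd.read_csv("data/metadata_h3n2_ha.tsv", sep="\t")
seattle = set(meta.loc[meta["region"].str.lower() == "seattle", "strain"])
seqs = {r.id: str(r.seq).upper()
        for r in SeqIO.parse("aligned_h3n2_genome_2y.fasta", "fasta")}

# For each Seattle virus, distance to its nearest neighbor in the dataset.
closest = [min(hamming(seqs[s], seq) for other, seq in seqs.items() if other != s)
           for s in seattle if s in seqs]

plt.hist(closest, bins=30)
plt.xlabel("distance to closest sample (nt differences)")
plt.ylabel("number of Seattle viruses")
plt.savefig("closest_distance.png", dpi=200)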

Include script to hide ancestral nodes based on cluster designation

@jameshadfield ---

Currently, clustering across segments proceeds as:

  1. Identify genomic constellations / clusters via connected_components.py and add cluster label to each tip via its node data output (rule clustering)
  2. Make a genome FASTA file for each cluster (rule clusters_fasta)
  3. Align each cluster with augur align (rule align_clusters)
  4. Build tree for each cluster with augur tree (rule tree_clusters)
  5. Reroot each of these trees with augur refine (rule refine_clusters)
  6. Stitch individual rooted cluster trees together with a basal polytomy (rule aggregate_cluster_trees)
  7. Run augur refine to estimate timetree, but keep topology fixed (rule refine_aggregated)

The output here is then a single tree where each cluster forms a monophyletic clade. I believe the only remaining thing we need here is a script to "hide" nodes basal to each of these monophyletic cluster clades. I imagine this works as follows:

Take as input:

  • the output of (7), eg results/aggregated/tree_h3n2_genome_2y.nwk
  • the output of (1) cluster labeling node data, eg results/clustering_h3n2_2y.json

Output:

  • node data JSON that specifies basal nodes as hidden, eg results/aggregated/mask_h3n2_genome_2y.json.

Then this "mask" would be handed to export_aggregated to result in an exploded auspice JSON. Does this seem like a reasonable approach?
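If that sounds right, the script could be roughly the following (the clustering JSON shape and the hidden node attribute that auspice consumes are assumptions on my part):

import json
from Bio import Phylo

tree = Phylo.read("results/aggregated/tree_h3n2_genome_2y.nwk", "newick")
# Assumed augur node-data shape: {"nodes": {"<strain>": {"cluster": <int>}, ...}}
with open("results/clustering_h3n2_2y.json") as fh:
    clusters = json.load(fh)["nodes"]

node_data = {"nodes": {}}
for node in tree.get_nonterminals():
    tip_clusters = {clusters.get(tip.name, {}).get("cluster")
                    for tip in node.get_terminals()}
    # An internal node whose descendant tips span more than one cluster sits
    # basal to the monophyletic cluster clades, so mark it hidden.
    if len(tip_clusters) > 1:
        node_data["nodes"][node.name] = {"hidden": "always"}

with open("results/aggregated/mask_h3n2_genome_2y.json", "w") as fh:
    json.dump(node_data, fh, indent=2)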

Missing header information in genome reference

@cassiawag ---

There's a small fix needed to the genome reference .gb file. If you look at the top of the ha.gb file you'll see:

LOCUS       A/Beijing/32/1992               1701 bp    DNA              VRL 02-MAY-2006
DEFINITION  Influenza A virus (A/Beijing/32/1992(H3N2)) hemagglutinin gene,
            complete cds.
ACCESSION   U26830
VERSION     U26830.1  GI:857407

However, the top of the genome.gb file looks like:

LOCUS       .                      13350 bp    DNA              VRL 01-JAN-1980
DEFINITION  .
ACCESSION   <unknown id>
VERSION     <unknown id>

Can you use the strain name, ie A/Beijing/32/1992, to set all four of these fields? It should look like:

LOCUS       A/Beijing/32/1992            13350 bp    DNA              VRL 01-JAN-1980
DEFINITION  A/Beijing/32/1992
ACCESSION   A/Beijing/32/1992
VERSION     A/Beijing/32/1992

This is not at all blocking, but would make a couple things nicer.
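In BioPython terms this is something like the sketch below; my understanding is that when writing GenBank, LOCUS comes from record.name, DEFINITION from record.description, and ACCESSION/VERSION are derived from record.id, though BioPython may warn about the long LOCUS name:

def set_header(record, strain):
    # Set all four header fields from the strain name, eg "A/Beijing/32/1992".
    record.name = strain          # LOCUS
    record.description = strain   # DEFINITION
    record.id = strain            # ACCESSION and VERSION
    return record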

Nucleotide length in reference .gb file

@cassiawag ---

One more small fix to the .gb file produced by create_ref_genome. The current file looks like:

FEATURES             Location/Qualifiers
     source          1..1701
                     /mol_type="genomic RNA"

This 1701 refers to the total length of the nucleotide sequence in the GenBank file. For the concatenated genome it should be 13350 instead.

This is a small fix and there's no rush.
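A sketch of the fix, assuming the record is built as in the concatenation script above:

from Bio.SeqFeature import FeatureLocation, SeqFeature

def full_length_source(record):
    # Replace any inherited source feature with one spanning the whole
    # concatenated sequence (1..13350 instead of 1..1701).
    record.features = [f for f in record.features if f.type != "source"]
    record.features.insert(0, SeqFeature(FeatureLocation(0, len(record.seq)),
                                         type="source",
                                         qualifiers={"mol_type": ["genomic RNA"]}))
    return record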

Make filter-to-region optional in extract_cluster_fastas.py

@miparedes ---

It would be useful to be able to look at how the genome clustering and tree building perform across the entire dataset (not just clusters with Seattle sequences, even if that's the primary analysis). Could you revise extract_cluster_fastas.py to take an additional optional argument called --filter-to-region, where we'd call --filter-to-region seattle in the primary Snakemake build? This could also be called as --filter-to-region oceania, etc... If it's left out, no filtering is done based on region metadata.

Additionally, please make --metadata an optional parameter. It should be okay to call extract_cluster_fastas.py with just --clusters, --nt-muts and --output-dir. If --filter-to-region is given without --metadata, the script should throw an error.
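A sketch of the revised argument handling, with parser.error covering the --filter-to-region-without---metadata case:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--clusters", required=True)
parser.add_argument("--nt-muts", nargs="+", required=True)
parser.add_argument("--output-dir", required=True)
parser.add_argument("--metadata", help="metadata TSV; only needed for region filtering")
parser.add_argument("--filter-to-region",
                    help="only export clusters containing viruses from this region")
args = parser.parse_args()

if args.filter_to_region and not args.metadata:
    parser.error("--filter-to-region requires --metadata")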
