nextstrain / augur Goto Github PK

Pipeline components for real-time phylodynamic analysis

Home Page: https://docs.nextstrain.org/projects/augur/

License: GNU Affero General Public License v3.0

Python 80.75% Shell 0.70% Perl 15.31% Makefile 0.03% TeX 2.91% Raku 0.31%

augur's Introduction

This repository is archived and contains the content used to build the documentation and splash page found in nextstrain.org. This content can now be found here.

License and copyright

Source code to Nextstrain is made available under the terms of the GNU Affero General Public License (AGPL). Nextstrain is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

augur's People

Contributors

Stargazers

Watchers

Forkers

tmcjp sdwfrost ktheyss burkesquires zachcp alliblk emmahodcroft stevenweaver neavemj misgaxiong sclipman avantonder oasisye hgwu80 nathaliax3 vikash84 chalam artoria2e5 charesredhat brianpardy florianzwagemaker cnyuanh micahstubbs pythseq jxshi camerondevine balwierz danielsoneg godotgildor imb-fmg elebow willingc zsailer groutr tacaswell quansight-labs rgommers harmsm cclauss jasongrout pushups jharman25 terrycojones ayultimate lcwheeler dharrison-salesforce roblanf erickguan echelon9 alpha29 wfvining jgamblegeorge jstoja dkain bonomali fasemoreakinyemi biocyberman elimerin asassheng c3g aysunrhn migrau eduardoatlasia damnik-jain lifecn jiexiongxie ashang akshaysu12 gandalfone yilinxu novapyth erickfigue tomkinsc darrenwee ttung angiehinrichs jh36 saikirankv bohuspollak majensen wook2014 felixonmars jtboing global-localhost global19 global19-atlassian-net dandye rleir matt-sd-watson victorlin lananhle askarlupka carlosbogo tanxiaoqin888 sidneymbell ammaraziz dlukacevic-idm jjesuscovidmxlibcmx babarlelephant qianqli

augur's Issues

minor bugs

A running list of minor bugs found during analysis of H7N9 data.

reference genome gene co-ords misparsed for complex genbank
spaces in strain names throw alignment errors
country_cmap must be hardcoded
geo_inference fails if # countries < 2

Mumps prepare and process file names

We have prepare.mumps.py, but flu.prepare.py and zika.prepare.py. This should be resolved to be consistent across viruses.

rebuilds to keep up to date with auspice

Auspice is moving forward and some augur builds need to be updated. Note that they do not break in auspice, however they are missing out on functionality.

Please edit this list in place as we do things

Ebola:

needs citations exported TB: needs title, journal and puburl updated in fauna.
~~needs filters exported~~
~~needs title~~
~~needs maintainer~~

Zika:

all up to date

Mumps:

just needs to be rerun
“live” manifest needs to include mumps dataset

Dengue:
* needs citations exported TB: Dengue title, journal, puburl isn't being exported, like in Zika: https://github.com/nextstrain/fauna/blob/master/vdb/zika_download.py#L13. Fauna needs updating here. Not high prioirty.
* needs filters exported
* needs title
* needs maintainer
* defaults are exported as arrays not strings

Avian:

~~needs citations processed in fauna~~ TB: title, journal and puburl will never come through GISAID. We should figure out what to do here.
~~needs citations exported~~
~~needs filters exported~~
~~needs title~~
~~needs maintainer~~

Flu:

~~needs citations processed in fauna~~ TB: title, journal and puburl will never come through GISAID. We should figure out what to do here.
~~needs citations exported~~
~~needs filters exported~~
~~defaults are exported as arrays not strings~~

H7N9 build fails

H7N9 build is currently failing for me. The fault occurs during RAxML branch length estimation:

RAxML was called as follows:

raxml -f e -T 2 -s temp.phyx -n branches -c 25 -m GTRGAMMA -p 344312987 -t raxml_tree.newick 


RAxML expects to read a taxon label in the tree file
but the taxon label is an empty string.

RAxML will print the context of the error and then exit:

/1051:0.00060,(((((:0.00055,(A/GD-103/20

Here, the input tree raxml_tree.newick is poorly formatted. An entire taxon label has been dropped (as above). At fault then, is the previous RAxML command to build the tree topology.

Ebola metadata?

Hi there. Looking here https://github.com/nextstrain/augur/blob/master/ebola/ebola.py#L117, it would seem that some metadata for Ebola is being imported and transformed. I tried to locate these source files in the repo, but they do not seem to be available.

s3 script should handle non-JSON data

Currently, when trying to push / pull images (the splash images live on S3): WARNING:__main__:Skipping unsupported file type for file 'img_zika.png'

While compression should (probably) be enforced, I think we should "guess" the contentType.

Relevant code: #63 (diff)

meta JSON for segmented viruses should include regions & colours for all segments

Currently the colour maps and GPS co-ordinates encoded in the meta JSON refer to the strains in the corresponding tree JSON. Ideally, these entries would list the values for all of the segments. While this would increase the size a little bit, it would allow auspice to display multiple trees without having to download multiple metadata JSONs. I'm not sure how hard this is to accomplish with the current prepare / process machinery.

augur treetime is too complex

@rneher ---

I'm of the opinion that the augur treetime module is overly complex. If you just look at the arguments we see 28 lines and significantly more complexity than the other modules. Much of this complexity seems necessary, but I would try to tame things a bit.

I had initially been reading the command as augur timetree and expecting it to do the one Unix-y task of taking a ML subs tree and outputting a timetree. I would then have augur ancestral to do ancestral state inference. This streamlines both modules and surfaces the hidden ancestral state functionality.

Right now, after augur treetime is run there will be node_data.json with things like sequence, mutations, clock_length and mutation_length. I like the pattern where augur export stitches together multiple JSON files keyed off of NODE_0000071, etc... node_data.json is opaque, but if augur timetree produced node_dates.json and augur ancestral produced node_nt_seqs.json it would be immediately clear what sort of data it is we're talking about. This would also make it clearer that to run augur translate you need node_nt_seqs.json as input, which would produce node_aa_seqs.json as output.

Also, there is an issue with BEAST tree input. It should be obvious how to proceed from BEAST Newick. I don't think having one function like this is obvious, especially when almost everyone will think that treetime is just to get a clock (the name doesn't suggest at all that it can do ancestral state reconstruction).

Along these lines, what about having augur tree label nodes in the output Newick as NODE_0000071, etc... to prevent any issues downstream? Then augur ancestral could be run without having to run treetime first to get labels.

Also, checking with @emmahodcroft, @huddlej, @jameshadfield and @tsibley to see how they feel about augur treetime vs augur timetree / augur ancestral

NA flu display

Currently, there are a couple issues with NA flu display. With updated manifest_guest.json in place in auspice you can see that:

Clade coloring does not work because of changes to augur.

Current auspice:

It should look something like:

Need to infer clade_membership for HA and then take these attrs and import them over to the same strain in NA.

ColorBys ep, ne, rb and glyc shouldn't be exported for NA. These only make sense for HA. This is also done in augur.

Threading this here because although it manifests in auspice, the fixes lie in augur.

Augur mask should be able to take FASTA input

Current augur mask only excepts VCF input. It should equally apply to FASTA input (in which case it replaces ATGC with N). Fortunately, I think the command line interface can stay exactly as is with a --sequences input and a --ouput output. If FASTA is given as input then FASTA should be given as output.

Not an immediate priority, but should exist for completeness.

remove base init.py file

Is the base __init__.py file (https://github.com/nextstrain/augur/blob/master/__init__.py) a hangover from past versions of augur? As far as my extremely limited understanding of python modules goes, I don't think it should exist.

Please tag releases

Hi,
I'm writing you on behalf of the Debian Med team which is a group inside Debian with the objective to package free software in life sciences and medicine for official Debian. It would be interesting for us to package treetime. It would help if you could tag your releases to enable us automatically detecting what you consider as user targeting release and any ot such updates.
Thanks for considering, Andreas.

Trouble translating Zika sequences

@rneher ---

I'm having an issue with current Zika builds on master. To reproduce run standard Zika pipeline:

Download Zika sequences from fauna: python vdb/zika_download.py -db vdb -v zika --fstem zika --resolve_method choose_genbank
Go to augur/zika/.
Run python zika.prepare.py
Run python zika.process.py

After alignment, I get a bunch of Trouble translating USVI/24/2016 warnings. The resulting zika_tree.json files list mutations to gap characters like "T4188-":

This shows up on the live site as tooltips saying there are >10 mutations on a branch with the majority being these mutations to gap. This is a particular issue with the new genomes in fauna, but the issue can be seen in the live build:

http://data.nextstrain.org/zika_tree.json

Search for "T249-". This also manifests as AA mutations to X like "E58X".

I tried to work on this in the branch translation-bug, but was having trouble figuring out what was going on with things like remove_terminal_gaps: https://github.com/nextstrain/augur/blob/master/base/sequences_process.py#L163

I was thinking that this might derive from the work you did to make internal gaps not be stripped out for flu B/Vic. If you could take a look it would be super helpful. No huge hurry here. Doesn't look broken just somewhat strange.

python install errors

Conda install is not working on current augur master (493250ca100f6f9ea0d8228f8ad8668a30bdfbea).

Steps to reproduce:

conda create -n nextstrain python=3.6
source activate nextstrain
pip install --process-dependency-links .

which augur indicates it's been installed as a python package (and this doesn't exist outside of this conda env)
augur parse -h shows out-of-date arguments , so that running the zika snakefile results in augur: error: unrecognized arguments: --output-sequences results/sequences.fasta --output-metadata results/metadata.tsv

Running python setup.py install to try in vain to update augur results in a successful install but running augur -h results in

Traceback (most recent call last):
  File "/Users/james/miniconda3/envs/nextstrain/bin/augur", line 4, in <module>
    __import__('pkg_resources').run_script('augur==0.1.0', 'augur')
  File "/Users/james/miniconda3/envs/nextstrain/lib/python3.6/site-packages/pkg_resources/__init__.py", line 654, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/Users/james/miniconda3/envs/nextstrain/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1425, in run_script
    .format(**locals()),
pkg_resources.ResolutionError: Script 'scripts/augur' not found in metadata at '/Users/james/miniconda3/envs/nextstrain/lib/python3.6/site-packages/augur-0.1.0.dist-info'

(note that python bin/augur -h works fine)

P.S. out-of-date augur parse help (out of date c/w python bin/augur parse -h)

usage: augur parse [-h] --sequences SEQUENCES [--fields FIELDS [FIELDS ...]]
                   [--output OUTPUT] [--metadata METADATA] [--sep SEP]
                   [--fix-dates {dayfirst,monthfirst}]

optional arguments:
  -h, --help            show this help message and exit
  --sequences SEQUENCES, -s SEQUENCES
                        sequences in fasta format (default: None)
  --fields FIELDS [FIELDS ...]
                        fields in fasta header (default: None)
  --output OUTPUT       output file (default: None)
  --metadata METADATA   output file (default: None)
  --sep SEP             separator of fasta header (default: |)
  --fix-dates {dayfirst,monthfirst}
                        attempt to parse dates and output them in standard
                        format (default: None)

Add an argument to fitness model class and process for DMS data

DMS data paths are currently hardcoded into the fitness model. We need a command line argument to process.py and to the fitness model class to support passing files that are not stored in version control.

add a visualise / view command

There should be a augur visualise or augur view command which, assuming auspice is a sibling directory, copies the JSONs to auspice/data, starts the dev server via npm run start:local and open up a browser window to localhost:4000.

I think this is what @tsibley's nextstrain view command does (?), but I think it'd be really useful as a part of augur.

General organization

The general vision would be to have a number of different modules (or some other name if you'd prefer) that can be applied to decorate the current data structure. For example, the frequencies module could decorate nodes with frequency trajectories given a tree with dated tips. The ancestral module could decorate nodes with sequence states given a tree with sequences at the tips. We want modules that can be included or not in a pipeline without messing up downstream modules. I could even imagine explicit build requirements, where the frequencies module requires that the tree module have ran. These modules could live in the modules/ directory.

There would then be a separate set of pipelines (or some other name if you'd prefer) that load these modules. These would live in a pipelines/ directory, within which would be things like pipelines/zika/ and pipelines/flu. I actually think that titers would be a general module rather than living in the flu/ directory as the titers module may work well for dengue or B cells. I'm calling these pipelines rather than viruses because we could well end up with Salmonella or whatever as entries.

I think it makes the most sense for metadata to live alongside pipelines, like pipelines/zika/zika.py and pipelines/zika/metadata/, but I'm not sure.

how to subsample segmented viruses

Part of the way prepare was written allowed segmented viruses to be subsampled such that both segments had the same samples taken forward for analysis (or at least a high overlap). How should one achieve this in modular augur? This is critical for tanglegrams and also fitness models using multiple segments.

Add error checking in config file

Code block from an older version:

valid_defaults = {'colorBy': ['region', 'country', 'num_date', 'ep', 'ne', 'rb', 'genotype', 'cHI'],
                  'layout': ['radial', 'rectangular', 'unrooted'],
                  'geoResolution': ['region', 'country', 'division'],
                  'distanceMeasure': ['num_date', 'div']}
for param, value in defaults.items():
    try:
        assert param in valid_defaults and value in valid_defaults[param]
    except:
         print('ERROR: invalid default options provided. Try one of the following instead:', valid_defaults)
         print('Export will continue; default display options can be corrected in the meta.JSON file.')
         continue
meta_json["defaults"] = defaults

case sensitive error

somehow getting both shanghai and Shanghai as division traits in H7N9. This may be a fauna error.

Write out stripped alignment

Currently, FASTA is just copied after alignment:

https://github.com/nextstrain/augur/blob/master/base/sequences_process.py#L102

However, this missed the stripping to reference that happens in process and that's important:

https://github.com/nextstrain/augur/blob/master/base/process.py#L167

Rather than writing to FASTA in sequences.process.align: https://github.com/nextstrain/augur/blob/master/base/sequences_process.py#L76

Should instead write to FASTA in process.align here:

https://github.com/nextstrain/augur/blob/master/base/process.py#L175

I'd suggest adding a new utility function to sequences_process that's the reverse of try_restore_align_from_disk https://github.com/nextstrain/augur/blob/master/base/sequences_process.py#L126, something like write_align_to_disk.

Zika reference sequence is not always pulled from metadata genbank file.

Currently, Augur looks for whether the reference strain is present in your fasta file or not. If it is not present, then the reference strain is pulled in from the genbank file in metadata. However, if your fasta already contains the reference sequence, augur does not overwrite this record with the genbank file sequence, but rather keeps the reference strain sequence in your fasta file.

This is problematic for Zika builds specifically because the current reference strain that is used for alignment and trimming, PF13/251013_18, has two associated sequences in Fauna, one of which is wrong. If you pull data from Fauna without specifying --resolve_method choose_genbank you will download the incorrect sequence for PF13/251013_18 (wrong sequence is 11555 nts long, correct one is 10769 nts long, and sequenced by Troesemeier et al). Then the wrong reference does not get overwritten, and the multiple sequence alignment is mapped and trimmed to an incorrect sequence.

S3 deployment with pseudo-API

Create data.nextstrain.org, which is a S3 instance to hold augur data for consumption in auspice

The pseudo-API is a file structure along the line of:

/zika/dataHex/augurVersionId/*.json

where dataHex is related to the time when the analysis was run - i.e. the amount of data available at that point in time (it could be YYMMDD). This should also be added to the metadata / tree - i.e. each sequence should detail when it was first available for analysis (different from when it was sampled).

This way auspice can access a specific data build.

zika reference sequence

Kristof Theys pointed out that our zika reference sequence is not correct and provided updated coordinates.

Proposal for tree JSONs rather than Newick + node JSONs

@rneher, @tsibley, @jameshadfield, @emmahodcroft, @huddlej, @barneypotter24 ---

This issue is related to #133, but is definitely a separate decision.

I've been thinking about the use of Newick + assortment of node JSONs and would like to propose moving to a single unified tree JSON. This single tree JSON would be same format as accepted by auspice. I see multiple reasons for this switch:

This is the biggest one I believe. The current build flow is non-linear with many steps taking multiple inputs and producing multiple outputs. Here is the current Zika flow starting from alignment:

augur treetime takes two inputs and produces two outputs, etc... And then everything is stitched together by augur export.

If we moved to a single tree JSON we get a completely linear flow:

I believe this fits much more closely with standard bioinformatics expectations. People are used to taking a FASTA and running a command and taking the resulting FASTA and running it through a second command, etc... FASTA ---> FASTA ---> FASTA. I believe this is a much more comprehensible of a build flow and will be easier for others to work with.

I understand that it's useful having a Newick file so that its immediately easy to look at the tree. I would propose two things here.

First, it's easy to have a augur json-to-newick module that could trivially make this conversion. Could also have augur json-to-fasta and augur json-to-vcf that pulls out sequences.

However, more importantly, it should be very little work to make auspice take any of the intermediate trees. Even the output of augur tree could be immediately visualized in auspice. This is super convenient and actually more useful than Newick. For Newick you could use FigTree to view overall structure, but with auspice we'd be able to see the results of augur traits before getting all the way to augur export. Ideally we could make a simple drag-and-drop function to look at intermediate trees in auspice. See nextstrain/auspice#586 for how tree jsons could be viewed in auspice.

This avoids the issue with rerooting and node_data.json. In current setup you have to run augur treetime to get an appropriate node_data.json file. To then run augur traits etc... However, in the proposed switch you could have tree.json produced by augur tree and then immediately run augur traits and then export to auspice. I can picture this sort of thing being a common workflow.
This reduces file and format proliferation. At the end of the pipeline you don't have to worry about a bunch of JSONs to stitch together. Also, we already use tree.json in auspice. Including node_data.json is an additional format. This also simplifies many of the command line interfaces by reducing number of input and number of outputs.

restore alignment broken

Restore alignment no longer works for flu. It is saved without reference but a reference is a condition for successful recovery.

Is is written out here (in most cases without reference):
https://github.com/nextstrain/augur/blob/master/base/process.py#L198

and after reload checked for existing reference:
https://github.com/nextstrain/augur/blob/master/base/sequences_process.py#L136

not sure why reference is needed on reload.

Augur export is messing with branch lengths

Edited to pin the blame on augur export.

Branch lengths are wrong in current augur export. Take a look at https://nextstrain.org/zika?l=clock (go to staging server). You'll see:

This is also apparent in that toggling between "time" and "divergence" does nothing.

This can be reproduced from the example_data/zika.fasta input as well, resulting in:

I confirmed that this was introduced in PR #165. Immediately prior to #165 (commit 341e43d) I get:

@jameshadfield: Could you investigate? I suspect it's just a misplaced xvalue or tvalue. It looks like divergence has been overwritten with time.

S3 script standard out

@huddlej ---

The S3 script is excellent. One small suggestion:

Enabling --verbose is great, but the current behavior where if -v is not flagged then you get absolutely nothing in the standard out is weird. I had no idea if the script worked or not. I'd think you'd want just one line: "pushing ... to ...".

pandas version

why do we restrict pandas version to <0.18 (9d1b078). 0.17 is >2y old and the current stable version is 0.23. any known incompatibilities @trvrb @tsibley

Frequencies JSON spec

PR #83 surfaced an issue of JSON format for frequencies data. Auspice frequencies panel wants just tip frequencies to work and rather than try to shoe-horn tips into the previous frequencies data model, it felt better to make something more targeted. However, @huddlej brought up the issue that it's going to be really annoying supporting two different frequencies formats. This is my attempt to identify a merged JSON format that works just for _tipfrequencies.json but also works for the combined _frequencies.json with all nodes in the tree, annotated clades and mutations.

Current _frequencies.json is almost completely flat:

{
  "pivots": [ 2015.0833, 2015.1667, ... ],
  "global_HA1:40N": [ 0.0003, 0.0002, ...],
  "north_america_clade:2024": [ 0.0, 0.0, ...],
  "africa_6b.1": [ 0.1069, 0.1075, ...],
  ...
}

You have to know how the keys are constructed. Each begins with a region code (NA, africa, etc...) followed by _ then one of three things:

A mutation, coded as protein (HA1, SigPep, etc...) followed by : followed by mutation (40N, etc...)
A generic clade in the tree with clade:2024 indicating clade 2024 in the tree.
Annotated clade in the tree like 6b.1 or 3c3.a.

Currently proposed _tipfrequencies.json has a bit more structure:

{
  "pivots": [ 2015.0833, 2015.1667, ... ],
  "A/Dnipro/768/2016": {
    "frequencies": [0.004389,  0.004405, ...],
    "weight": 1.0
  }
}

My proposed flat spec would be:

{
  "pivots": [ 2015.0833, 2015.1667, ... ],
  "HA1:40N": {
    "frequencies": [ 0.0003, 0.0002, ...],
    "region": "global"
  },
  "clade:2024": {
    "frequencies": [ 0.0, 0.0, ...],
    "region": "north_america"
  },
  "6b.1": {
    "frequencies": [ 0.1069, 0.1075, ...],
    "region": "africa"
  },
  "A/Dnipro/768/2016": {
    "frequencies": [0.004389,  0.004405, ...],
    "weight": 1.0
  },
}

This makes it easy to reach into _frequencies.json and grab tip by strain name, grab clade by clade index, grab annotated clade by clade_annotation or grab mutation by mutation name.

Note that the flat format makes it difficult to just from _frequencies.json collect all, say, annotated clades. This would require adding hierarchy or marking these with something like anno:6b.1 and strain:A/Dnipro/768/2016 (but I think hierarchy is significantly better than this labeling).

My proposed hierarchical spec would be:

{
  "pivots": [ 2015.0833, 2015.1667, ... ],
  "mutations": {
    "HA1:40N": {
      "frequencies": [ 0.0003, 0.0002, ...],
      "region": "global"
    },
   },
  "clades": {
    "2024": {
      "frequencies": [ 0.0, 0.0, ...],
      "region": "north_america"
    },
  },
  "annotations": {
    "6b.1": {
      "frequencies": [ 0.1069, 0.1075, ...],
      "region": "africa"
    },
  },
  "tips": {
    "A/Dnipro/768/2016": {
      "frequencies": [0.004389,  0.004405, ...],
      "weight": 1.0
    },
  },
}

In this case _tipfrequencies.json gains a tips dictionary at the top-level as well and corresponds to a simple subset of what's in _frequencies.json.

The latter hierarchical spec seems more self-documenting to me. I might slightly prefer it, but it's a mild preference. Implementing either spec in nextflu/auspice would be very easy.

Can I get votes from people that work with frequencies, ie. @sidneymbell, @huddlej, @jameshadfield and @rneher for which they prefer?

remove_dir method fails intermittently on NFS

While running augur on our shared cluster, I occasionally get a traceback like the following after the newick tree is built.

# [...clipped...]
Saved new tree to ../processed/flu_h3n2_ha_3y.newick
Traceback (most recent call last):
  File "flu.process.py", line 74, in <module>
    runner.build_tree()
  File "../base/process.py", line 389, in build_tree
    self.tree.build_newick(newick_file, **self.config["newick_tree_options"])
  File "../base/tree.py", line 123, in build_newick
    remove_dir(self.run_dir)
  File "../base/io_util.py", line 22, in remove_dir
    shutil.rmtree(dname)
  File "/home/jhuddles/miniconda2/envs/augur/lib/python2.7/shutil.py", line 256, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/home/jhuddles/miniconda2/envs/augur/lib/python2.7/shutil.py", line 254, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: 'temp_20170712-162039_74870'

After the script fails, I can confirm that the temporary directory is empty. This appears to be an old issue with shutils and NFS. A common workaround appears to be wrapping shutils.rmtree in a loop to account for the NFS latency that leads to the exception (see also easybuild's similar fix to the problem.

If there are no objections, I'd like to make a similar modification to the remove_dir method in augur.

H1N1pdm failing due to mismatched epitope mask length

@huddlej ---

Could you take another pass at sequence length issues? I made a couple hacks to just get a build running:

This worked for H3, Vic and Yam, but now H1 is throwing the following error:

Calculating scores with epitope mask 'canton' and glycosylation mask 'ha1_globular_head_h1n1pdm'.
Traceback (most recent call last):
  File "flu.process.py", line 617, in <module>
    glyc_mask_version=runner.config["glyc_mask_version"]
  File "/fh/fast/bedford_t/nextflu/augur-live/builds/flu/scores.py", line 150, in calculate_sequence_scores
    node.attr['ep'] = mask_distance(total_aa_seq, root_total_aa_seq, epitope_mask)
  File "/fh/fast/bedford_t/nextflu/augur-live/builds/flu/scores.py", line 104, in mask_distance
    sites_A = mask_sites(aaA, mask)
  File "/fh/fast/bedford_t/nextflu/augur-live/builds/flu/scores.py", line 100, in mask_sites
    return aa[mask[:len(aa)]]
IndexError: boolean index did not match indexed array along dimension 0; dimension is 567 but corresponding boolean dimension is 566

Provide consistent minimal interface for all virus prepare scripts

There is some variation in the arguments provided by each virus's prepare.py script from 1 argument available in ebola.prepare.py to 9 in flu.prepare.py. For automated execution of augur scripts, it would be useful if each prepare script provided the same few options to the user include:

--sequences: one or more FASTA files to use for analysis
--auspice_prefix: a prefix for all auspice output files to allow automated scripts to predictably name output as desired
--viruses_per_month or -v: the number of viruses per month to sample; this exists for most scripts but the long form of this option for flu is named --viruses_per_month_seq

s3 script to remove data

there should be a way to "remove" files - simple prefix matching (like with pull) should work well, e.g. s3.py remove nextstrain-staging --prefix zika*

Flu vaccine distances

@sidneymbell ---

Thank you for working on the flu vaccine distances. These numbers would be generally good to have moving forward and I'd really like things incorporated into augur. Do you think you could do this? I thought it would work to make a new function in flu.process.py to export a JSON file to auspice/ that contains all the relevant data. We want every tip in the tree to have measurements against all available vaccine strains and use three different metrics. This is what I was picturing:

[
  {
    "strain": "A/Incheon/677/2006",
    "num_date": 2006.867,
    "region": "japan_korea",
    "ep": {
      "A/Perth/16/2009": 4,
      "A/Victoria/361/2011": 6
    },
    "cTiter": {
      "A/Perth/16/2009": 2.3,
      "A/Victoria/361/2011": 4.2
    },
    "cTiterSub": {
      "A/Perth/16/2009": 1.9,
      "A/Victoria/361/2011": 3.7
    }
  }
]

This would then make a simple lookup table for downstream analyses. I could work from this output as could Frank.

I didn't want to include this in the tree.json as it felt like it would add unnecessary cruft.

cc @huddlej

Validate JSON schema

The idea being that if this looks OK auspice should be able to visualize it

Not trivial

augur export case question

While trying to get WNV through modular augur i'm running into the issue where:

demes are specified in the metadata TSV in uppercase
augur export converts these to lowercase to find a match in the lat-long database (code here)
augur export exports the deme names in lowercase
auspice can't match the uppercase deme name (encoded on the tree node) with the lowercase deme name in the metadata->geo mapping.

More generally, we've had these kind of errors in the past, how should we proceed?
(Note that auspice should get the deme names etc encoded in the form that one wants them displayed)

Multiple Python bindings for git

There are two Python bindings for git used in augur:

GitPython which provides the git module and is used by the base/sequences_prepare.py augur module
pygit2 which provides the pygit2 module and is used by the base/process.py module

~~Neither git module is listed in the Dockerfile or requirements.txt. Is there a preference for which module to use?~~. Sorry, gitpython is in requirements.txt for the rejig branch!

Cannot add new attributes each node

Hi thanks for the software. It's been fun to use on new virus species. However, for our study we would like to be able to add host as an attribute to each virus node. So we tried to just modify the *.prepare.py file and added 'host' into 'header_fields'. It is correctly picked up by the software and is written into the prepared json file. However, during processing, the process.py doesn't seem to be able to pick up the host attribute and would instead fill it in with the default root value.

So we would like to know if there is a proper way to add new attributes to each node so that we can color the virus stains not only by the default classes, but also like host, town, sex etc.

augur tree vs augur unrooted

Related to #133, but separate enough that I wanted a separate issue.

Some of my semantic confusion lies in augur tree. I think it could be easier for people to immediately understand what's happening if we renamed this to augur unrooted. In the Snakefile it would produce tree_unrooted.nwk instead of tree_raw.nwk.

I think this flags things more appropriately to say that this tree is not yet rooted. And then we can make it clear that the tree produced by augur treetime has been rooted. This should make the change in topology between tree_raw.nwk and tree.nwk less of a surprise.

mugration / treetime sequence storage error

There is a treetime bug introduced between
4020af46fa00feaa0a081d550eaccb2f0df9f4f0 (currently used for augur/master) and treetime 383ec6ffd0a777047e14d6289c8353081c8c5655 (currently used for augur/rejig).

Treetime & vanilla mugration model works fine but specifying a root_state causes the following traceback:

  File "zika/zika.py", line 132, in <module>
    zika.tree.geo_inference('region', root_state = 'southeast_asia')
  File "./base/tree.py", line 284, in geo_inference
    store_compressed=False, pc=5.0, marginal=True, normalized_rate=False)
  File "../treetime_augur/treetime/treeanc.py", line 320, in infer_ancestral_sequences
  File "../treetime_augur/treetime/treeanc.py", line 347, in reconstruct_anc
  File "../treetime_augur/treetime/treeanc.py", line 559, in _ml_anc_marginal
AttributeError: 'Clade' object has no attribute 'cseq'

I'm guessing this is due to the compressed sequences storage being incompatable with how the "extra_clade" is defined https://github.com/nextstrain/augur/blob/master/base/tree.py#L277

how to reproduce: update treetime in augur/master & build zika

cc @rneher

H7N9 scripts and outputs are not named or organized consistently with other viruses

While writing code to automate builds of all nextstrain virus sites, I noticed several inconsistencies between H7N9 and other viruses:

fauna commands refer to “h7n9” while augur refers to “H7N9”
prepare and process scripts are not consistently named like the other viruses using “prepare.H7N9.py” instead of “H7N9.prepare.py”, for instance
H7N9’s prepare script builds all eight of its segments at once without providing any --segments argument
Output files from prepare and process add an “avian_” prefix even though there isn’t a higher level “avian” virus directory structure in augur like there is for “flu”

Renaming the prepare and process scripts should address the first two issues without significantly affecting the output from the scripts. Adding a --segments argument with a default to all eight segments should similarly avoid regressions while allowing users to specify one or more segments.

The last issue might involve a more significant structural change. It seems like it would make more sense for the H7N9 scripts and data to be grouped under an "avian" directory much like the "flu" scripts and data are. Then the prepare and process scripts would look like "avian.prepare.py" and would accept a --lineage argument to allow users to specify "h7n9" just as they can specify "h3n2" for flu now. These changes would make the augur side of the analysis consistent with the naming hierarchy in auspice where "avian" is the top-level virus and "h7n9" is a lineage within that heading.

Since this code is mostly maintained by @jameshadfield, what do you think about these changes? @trvrb might be interested, too.

Mask nonsense sites

Sometimes sequences come in with sites that RAxML doesn't know what to do with. For example,

>A/Ireland/61451/2017
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNIRLHAGGTTTTCGCTCAAAAAATTCCT
GGAAATGACAATAGCACGGCAACGCTGTGCCTTGGGCACCATGCAGTACCAAACGGAACG

This currently breaks the build with RAxML erroring out. These sites should be masked to to N. Here's a list of nucleotide ambiguity codes http://reverse-complement.com/ambiguity.html. These are all okay. Anything that's not on this list should be masked.

Running TB

@emmahodcroft ---

Currently the only files in builds/tb are Snakefile and data/config.json. However, the snakefile refers to:

seq = "data/lee_2015.vcf.gz",
meta = "data/meta.tsv",
exclude = "data/dropped_strains.txt",
mask = "data/Locus_to_exclude_Mtb.bed",
ref = "data/ref.fasta",
drms = "data/DRMs.txt",
sites = "data/drm_sites.txt",
generef = "data/Mtb_H37Rv_NCBI_Annot.gff",
genes = "data/genes.txt",
colors = "data/color.tsv",
config = "data/config.json"

Could you commit all these files to the repo? Except perhaps data/lee_2015.vcf.gz (which you could share separately). I'd like to be able to test TB to make sure I'm not introducing breaking changes.

Strain is duplicated in output tree.json

Currently, output tree.json files have strain listed for each node as both a direct attribute and as an attribute under attr, like so:

"strain": "NGA/2016/ISTH_0187", 
"tvalue": 362.23867, 
"yvalue": 266.0, 
"attr": {
	"raw_date": "2016-01-17"
	"strain": "NGA/2016/ISTH_0187", 
	"country": "nigeria", 
	"region": "africa", 
	"accesion": "MH053478", 
	"host_species": "human",

This is confusing. As described here: https://github.com/nextstrain/augur/blob/master/docs/auspice_output.md#tree-json, strain should live directly under a node.

Store titer records in TiterRecord class and export in structured JSON format

The goals of this proposal are:

remove need for eval call when loading titers from JSON in augur’s process step
add support for multiple user-defined titer attributes such as the passaging details available from the CDC titers
improve documentation of the JSON format for titers by explicitly naming fields instead of relying on slightly ambiguous dictionary format
encapsulate logic about individual titer records along the lines of a SequenceRecord from BioPython such that each record knows how to export itself to JSON and also report other details about itself

The current JSON format looks like this:

"titers": {
    "('A/Acores/11/2013', ('A/Alabama/5/2010', 'F27/10'))": [
      80.0
    ],
    "('A/Acores/11/2013', ('A/Athens/112/2012', 'F16/12'))": [
      640.0
    ]
}

In this format, each record is a key/value pair where the key is a tuple of test strain, reference strain, and serum id that has been converted to a string for JSON compatibility. The value of each pair is a list of floating point values corresponding to raw titer measurements.

The new format should be a list of dictionaries where each dictionary corresponds to a TiterRecord instance in JSON format. Each entry in the TiterRecord should be explicitly named to remove ambiguity about the data and enable additional fields to be added in the future. For example, the following format can support inclusion of the optional “source” and “assay” fields that was originally omitted from each record.

"titers": [
    {
        "assay": "hi",
        "test_strain": "A/Acores/11/2013",
        "reference_strain": "A/Alabama/5/2010",
        "serum": "F27/10",
        "source": "NIMR_Sep2013_7-11.csv",
        "values": [
            80.0
        ]
    },
    {
        "assay": "hi",
        "test_strain": "A/Acores/11/2013",
        "reference_strain": "A/Athens/112/2012",
        "serum": "F16/12",
        "source": "NIMR_Sep2013_7-11.csv",
        "values": [
            640.0
        ]
    }
]

The records from this JSON format map directly to attributes of the TiterRecord Python class. In addition to these attributes, the TiterRecord class would expose the following methods.

class TiterRecord(object):
    def __init__(self, test_strain, reference_strain, serum, values, **kwargs):
        """Builds a new TiterRecord instance.

        Args:
            test_strain (str): name of the test strain
            reference_strain (str): name of the reference strain
            serum (str): name of the serum
            values (list): a list of raw floating point titer measurements
            kwargs (dict): additional attributes of the TiterRecord instance

        Returns:
            TiterRecord: an instance of the record class populated with the given strains, serum, and values

        >>> record = TiterRecord(test_strain="strain_a", reference_strain="strain_b", serum="serum_a", values=[80.0], assay="hi")
        >>> record.test_strain
        'strain_a'
        >>> record.values[0]
        80.0
        >>> hasattr(record, "assay")
        True
        >>> record.assay
        'hi'
        >>> hasattr(record, "source")
        False
        >>> record_dict = {"test_strain": "strain_a", "reference_strain": "strain_b", "serum": "serum_a", "values": [80.0], "assay": "hi"}
        >>> record = TiterRecord(**record_dict)
        >>> record.test_strain
        'strain_a'
        """
        pass

    def to_dict(self):
        """Returns the current instance as a dictionary.

        Returns:
            dict: attributes of the current instance as key/value pairs

        >>> record = TiterRecord(test_strain="strain_a", reference_strain="strain_b", serum="serum_a", values=[80.0], assay="hi")
        >>> sorted(record.to_dict().items())
        [('assay', 'hi'), ('reference_strain', 'strain_b'), ('serum', 'serum_a'), ('test_strain', 'strain_a'), ('values', [80.0])]
        """
        pass

The primary distinction between a TiterRecord and a dict is that the former class has required attributes. TiterRecord instances will not know how to export themselves into the dictionary format used by augur with tuple keys and list values; the TiterModel class should know how to convert a list of TiterRecord instances into that format. The TiterModel class should also know how to build a list of TiterRecord instances from a tab-delimited file of measurements.

make test(s) for basel dataset(s)

Currently we have a simple test in augur/tests/zika so to check if things are (mostly!) working with code changes one just has to run rm -rf results auspice; snakemake in that directory.

Could we have a simple (i.e. quick to run) example of TB or some other datasets y'all use so that we avoid breaking things on your end (e.g. see #161 (comment))

Thanks!

Add versions, tags & --version cmd

As we move builds into separate repositories, augur versioning becomes important. I imagine:

https://stackoverflow.com/a/7071358 seems to imply this should be in _version.py
Does every push to master bump up the version?
We should tag each version so that it's easy to checkout certain versions
augur --version should print the version
a changelog would also be a good idea at this point
I think we start from 3.0.0

Flu prepare script throws exception when file prefix isn't provided by the user

Running flu’s prepare script from master without a file prefix throws an exception as shown below.

$ python flu.prepare.py -v 1 -l h3n2 -r 2y --sequences ../../fauna/data/flu_h3n2_ha.fasta --titers ../../fauna/data/flu_h3n2_titers.tsv
"Preparing lineage h3n2, segments: ['ha'], resolution: 2y"
[INFO] 2017-09-11 09:47:52,535 - base.titer_model - Read titers from ../../fauna/data/flu_h3n2_titers.tsv, found:
[INFO] 2017-09-11 09:47:52,535 - base.titer_model -  --- 4849 strains
[INFO] 2017-09-11 09:47:52,535 - base.titer_model -  --- 7900 data sources
[INFO] 2017-09-11 09:47:52,543 - base.titer_model -  --- 77960 total measurements
Traceback (most recent call last):
  File "flu.prepare.py", line 136, in <module>
    config = make_config(params.lineage, resolution, params)
  File "flu.prepare.py", line 75, in make_config
    file_prefix = "flu_{}_{}_{}".format(lineage, segment, resolution)
UnboundLocalError: local variable 'segment' referenced before assignment

VCF CLI questions