caporaso-lab / genome-sampler

Home Page: https://caporasolab.us/genome-sampler/

License: BSD 3-Clause "New" or "Revised" License

Python 99.29% Makefile 0.13% TeX 0.49% HTML 0.09%

hacktoberfest

genome-sampler's Introduction

genome-sampler

Tools for sampling viral genomes across time of isolation, location of isolation, and genome divergence.

To get started with genome-sampler, see our installation and usage tutorial.

To learn about genome-sampler, read our paper.

This software is a QIIME 2 plugin implementing the subsampling workflow that was initially developed for the Arizona COVID-19 Genomics Union analysis of SARS-CoV-2 genomes.

If you're interested in contributing to genome-sampler, please review the software project's code of conduct, which is adapted from the Contributor Covenant, version 1.4.

If you use genome-sampler in any published work, please cite our paper.

genome-sampler's People

Contributors

ebolyen andrewsanchez gregcaporaso nbokulich oddant1 wasade keegan-evans pmcharrison reveurmichael lizgehret

genome-sampler's Issues

add stand-alone locale sampler

It may be useful to have one that is not connected to near-neighbor sampling.

dereplicate context sequences in sample_diversity

This should greatly speed up this step and doesn't impact the results in any meaningful way (different sequence ids might show up in the final set, but that doesn't matter).
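A minimal sketch of the dereplication idea, assuming sequences are held in a plain dict of id to sequence string. The function name `dereplicate` and the returned structures are hypothetical and are not part of genome-sampler:

```python
def dereplicate(seqs):
    """Collapse identical sequences, keeping the first id seen for each.

    `seqs` maps sequence id -> sequence string (hypothetical input shape).
    Returns (unique, members): `unique` maps each representative id to its
    sequence, and `members` maps each representative id to every id that
    shares that sequence.
    """
    representative = {}  # upper-cased sequence -> id of first record seen
    members = {}         # representative id -> ids with the same sequence
    for seq_id, seq in seqs.items():
        rep = representative.setdefault(seq.upper(), seq_id)
        members.setdefault(rep, []).append(seq_id)
    unique = {rep: seqs[rep] for rep in members}
    return unique, members
```

Only the representatives in `unique` would need to be clustered. Which id represents a group of duplicates depends on input order, which matches the note above that different sequence ids might show up in the final set.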

vsearch actions need different percent_id parameters in Snakemake

sample-diversity should be 0.9990 by default and sample-neighbors should be 0.9999 by default

Error in rule view_context_seqs

I am trying to run genome-sampler on a dataset of 350k genomes.
However, I got the following error:

[Mon Feb 15 09:51:35 2021]
Error in rule view_context_seqs:
jobid: 1
output: results_15fev21/context-seqs.qzv
shell:

qiime feature-table tabulate-seqs       --i-data results_15fev21/context-seqs.qza       --o-visualization results_15fev21/context-seqs.qzv

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

One interesting thing is that the computer froze for a few seconds and the RAM was full. Do you think this is a memory issue?
My computer has 64 GB of RAM.

Stochastic test failure in TestSubsampleNeighbors.test_sample_cluster_missing_locales

=================================== FAILURES ===================================
__________ TestSubsampleNeighbors.test_sample_cluster_missing_locales __________

self = <genome_sampler.tests.test_sample_neighbors.TestSubsampleNeighbors testMethod=test_sample_cluster_missing_locales>

    def test_sample_cluster_missing_locales(self):
        columns = ['context_id', 'n_mismatches', 'locale']
        cluster = pd.DataFrame([['c4', 5, 'abc'],
                                ['c2', 0, float('nan')],
                                ['c99', 1, float('nan')],
                                ['c42', 2, 'abc']],
                               columns=columns)
    
        count_obs_c4 = 0
        count_obs_c2 = 0
        count_obs_c99 = 0
        count_obs_c42 = 0
    
        for _ in range(self._N_TEST_ITERATIONS):
            obs = _sample_cluster(cluster, 3, np.random.RandomState())
            self.assertEqual(len(obs), 3)
            if 'c4' in obs:
                count_obs_c4 += 1
            if 'c2' in obs:
                count_obs_c2 += 1
            if 'c99' in obs:
                count_obs_c99 += 1
            if 'c42' in obs:
                count_obs_c42 += 1
    
        # c4 and c42 all have locale "abc" and c99 and c2 have unknown locale,
        # so we expect to see c99 and c2 more frequently
        self.assertTrue(count_obs_c99 > count_obs_c4)
        self.assertTrue(count_obs_c99 > count_obs_c42)
>       self.assertTrue(count_obs_c2 > count_obs_c4)
E       AssertionError: False is not true

have filter_seqs return a log

It'd be useful to know which and how many sequences were filtered for being too long, too short, or too ambiguous.
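One way such a log could look, as a rough sketch only. The function `filter_with_log`, its parameter names, and the reason strings are all hypothetical, and this does not reproduce how filter_seqs is actually implemented:

```python
from collections import Counter

UNAMBIGUOUS = set("ACGT")


def filter_with_log(seqs, min_length, max_length, max_proportion_ambiguous):
    """Filter sequences and record why each rejected sequence was dropped.

    `seqs` maps sequence id -> sequence string (hypothetical input shape).
    Returns (kept, reasons, counts): the retained sequences, a dict of
    rejected id -> reason, and a Counter of how often each reason occurred.
    """
    kept = {}
    reasons = {}
    for seq_id, seq in seqs.items():
        n = len(seq)
        if n < min_length:
            reasons[seq_id] = "too short"
        elif n > max_length:
            reasons[seq_id] = "too long"
        else:
            n_ambiguous = sum(1 for c in seq.upper() if c not in UNAMBIGUOUS)
            # Guard against empty sequences when min_length is 0.
            if n > 0 and n_ambiguous / n > max_proportion_ambiguous:
                reasons[seq_id] = "too ambiguous"
            else:
                kept[seq_id] = seq
    return kept, reasons, Counter(reasons.values())
```

The per-id `reasons` answers "which", and the `Counter` answers "how many" for each category.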

move overridable configuration parameters from Snakefile to YAML file

This was suggested by one of our reviewers at F1000.

summarize-selections improvements

The summary should include the number of unique ids across the provided selections, so the viewer knows how many total context sequences they'll have in their downstream analysis.
It could also include visual summaries of the samples (e.g., a map or a tree).

update mask format and parser

Should use pyvcf pending W-L/ProblematicSites_SARS-CoV2#4

Add tests of this transformer when this is addressed - these were inadvertently left out of #56.

add documentation section on running in parallel

I've received questions from multiple users on how to run in parallel, so we should add a specific section to the docs on this. Here is some text copied from my replies that we can use in this section:

genome-sampler can be run in parallel to speed it up. This is done in different ways depending on whether you're running the steps individually or through Snakemake.

If you're using Snakemake, you need to edit Snakefile and set the N_THREADS value to the number of threads you'd like genome-sampler to use.

If you're running the steps individually you can pass the --p-n-threads option to several of the commands. For example, sample-diversity is the slowest step in the workflow. You can provide the --p-n-threads parameter to run it in parallel:

qiime genome-sampler sample-diversity \
 --i-context-seqs filtered-context-seqs.qza \
 --p-percent-id 0.9995 \
 --o-selection diversity-selection.qza \
 --p-n-threads n

When running this command, you should set n to be the number of available processors or cores on a single node of your system. For example, I work on a cluster that has nodes with 28 cores, so when I submit this job I would run:

qiime genome-sampler sample-diversity \
 --i-context-seqs filtered-context-seqs.qza \
 --p-percent-id 0.9995 \
 --o-selection diversity-selection.qza \
 --p-n-threads 28

This uses all of the resources on a single node of the cluster for me. In the future we'll be adding support for splitting workflows like this across multiple cluster nodes, but we do not have this support at this time.

add pangolin action for lineage assignment

This will likely take FeatureData[Sequence] as input and generate FeatureData[Taxonomy] as output, mirroring taxonomy assignment in QIIME 2 microbiome workflows.

add IDSelection.combine method

This could pull code from the combine_selections action so it could be reused elsewhere.

Add MetadataColumn[Time] type to QIIME 2

We are getting by with Categorical in the plugin, but this feature would be pretty handy overall.

allow user to define target size for subsampled collection of sequences

This came up on the QIIME 2 Forum. It would require some rethinking of how our subsamples are defined since (as far as I know) this isn't something we could achieve with vsearch (e.g., for sample-diversity), but this has come up for me as well, so I think it's something we'll run into again.

Allow sample-random to take a proportion to be randomly sampled

Currently sample-random can only take the number of items you want to randomly sample. Evan and I ran into an instance while working on the benchmark where it would have been useful to be able to instead specify a proportion of the total number of samples that you want.
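A sketch of how a proportion could be accepted alongside a count. The function `sample_ids` and its parameters `n`, `proportion`, and `seed` are hypothetical; the rounding rule and the floor of one item are assumptions, not decisions made in this issue:

```python
import random


def sample_ids(ids, n=None, proportion=None, seed=None):
    """Randomly sample ids by absolute count or by proportion of the total."""
    if (n is None) == (proportion is None):
        raise ValueError("Provide exactly one of n or proportion.")
    ids = list(ids)
    if proportion is not None:
        if not 0 < proportion <= 1:
            raise ValueError("proportion must be in the interval (0, 1].")
        # Assumption: round to the nearest count, but sample at least one id.
        n = max(1, round(proportion * len(ids)))
    if n > len(ids):
        raise ValueError("Cannot sample more ids than are available.")
    # Sampling is without replacement, so no id is returned twice.
    return random.Random(seed).sample(ids, n)
```

For example, `sample_ids(ids, proportion=0.25)` on 20 ids would request 5 of them, while `sample_ids(ids, n=5)` keeps the existing count-based behavior.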

CI: rename actions/jobs/etc

update install instructions in docs

Possibly to use packages.qiime2.org.

running the tutorial yields a KeyError.

[Wed Aug 26 08:16:24 2020]
rule import_context_seqs:
    input: context-seqs.fasta
    output: context-seqs.qza
    jobid: 4

Traceback (most recent call last):
  File "/home/ctbrown/miniconda3/envs/genome-sampler/lib/python3.6/site-packages/qiime2/sdk/util.py", line 90, in parse_format
    format_record = pm.formats[format_str]
KeyError: 'GISAIDDNAFASTAFormat'

any thoughts?

reduce duplication of code in `GISAIDDNAFASTAFormat` validator

We could port this format to q2-types as a less strict DNAFastaFormat, and DNAFastaFormat could become our strict DNA Fasta Format. These could share a common base class, or we could have a function that creates these formats given ValidationSet and FASTADNAValidator.

Add MRCA to neighbor selection

This used to be done by finding the closest non-identical sequence by hamming distance.

Add GitHub issue templates

It would be useful to collect some basic information when new issues come in. We could use the templates defined here as a basis:

https://github.com/qiime2/template-repo/tree/master/github_templates/templates/ISSUE_TEMPLATE

add action for importing sequences from NCBI

add equality operator for IDSelection

This will be useful in tests. Update unit tests when this has been added.

add action to support vcf-based masking

As discussed on Virological. This will enable us to construct an automated pipeline from context/focal sequences through a phylogenetic tree with complete data provenance.

Rename repo and plugin/package to `genome-sampler`

illustrate alignment, masking, and tree building in tutorial

Include any discussion of limitations.

select sequences with fewest degenerate characters in sample-diversity

We'll need to see if vsearch has options for this. If not, we should be able to achieve this by sorting sequences based on their fraction of degenerate characters prior to clustering, so that the sequences with the smallest fraction of degenerate characters are most likely to become cluster centroids.
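The pre-sorting step could be sketched as follows. The helper names are hypothetical, "degenerate" is taken here to mean any character other than A, C, G, or T, and whether the clustering step preserves this input order is an assumption that would need to be checked against vsearch:

```python
def degenerate_fraction(seq):
    """Fraction of characters in `seq` that are not A, C, G, or T."""
    if not seq:
        return 1.0  # treat an empty sequence as the worst candidate
    return sum(1 for c in seq.upper() if c not in "ACGT") / len(seq)


def sort_by_degeneracy(seqs):
    """Return (id, sequence) pairs ordered from least to most degenerate.

    `seqs` maps sequence id -> sequence string (hypothetical input shape).
    The sort is stable, so ties keep their original relative order.
    """
    return sorted(seqs.items(), key=lambda item: degenerate_fraction(item[1]))
```

Writing the sequences out in this order before clustering would place the cleanest sequences first, where they are most likely to be picked as centroids by an order-dependent clustering method.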

add smaller data set for testing

The current Snakemake workflow takes a long time to run. We'll need a smaller data set (shorter sequences, I think) for efficient testing.

factor number of degenerate characters into selection approach in sample-neighbors

Sequences with fewer N characters should be considered better candidates for inclusion.

filter context sequences with ids that are not in metadata

I'm running an analysis with yesterday's GISAID downloads and there is at least one sequence id that is present in the sequence data but not in the metadata. We should add a note to the documentation on how to handle this and maybe add it to the Snakemake file as well. Here's how I'm handling this now:

qiime feature-table filter-seqs --i-data context-seqs.qza --m-metadata-file context-metadata.tsv --o-filtered-data context-seqs-w-metadata.qza

generalize label_seqs input type

This should be FeatureData[Sequence|AlignedSequence], pending a framework bug fix. @ebolyen or @Oddant1, it doesn't look like there is an issue yet for that framework bug. Can you add a link to it here when there is one?

Bootstrap jupyterbook builds

See https://github.com/executablebooks/quantecon-mini-example/tree/master/.github/workflows
for reference.

Phase 1: build https://github.com/caporaso-lab/q2-covid-19/blob/master/docs/methods.md "as-is" (updating as we replace scripts with QIIME 2 actions).
Phase 2: integrate https://github.com/qiime2/sphinx-ext-qiime2 (once ready), so that we can get q2cli-executed outputs (just like on docs.qiime2.org).
Phase 3: replace q2cli commands with usage examples (will require new directives in sphinx-ext-qiime2).

handle ? and U characters in GISAID format files

I came across both in GISAID downloaded sequence data today, and they caused imports to fail. I think this should be handled by first replacing any U or u characters in sequences with a T, and then dropping any sequences that still contain characters outside of the IUPAC DNA alphabet (the ? probably implies N, but I'm not comfortable silently making that assumption in general).
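The two-step handling proposed above could look roughly like this. The function `clean_sequences` is hypothetical, and the alphabet used here is the set of IUPAC nucleotide codes without gap characters, which is an assumption about what the format should accept:

```python
# IUPAC nucleotide codes; gap characters are deliberately not included here.
IUPAC_DNA = set("ACGTRYSWKMBDHVN")


def clean_sequences(seqs):
    """Replace U/u with T/t, then drop sequences with non-IUPAC characters.

    `seqs` maps sequence id -> sequence string (hypothetical input shape).
    Returns (cleaned, dropped): the retained sequences and the list of ids
    that were removed, so nothing is discarded silently.
    """
    cleaned = {}
    dropped = []
    for seq_id, seq in seqs.items():
        seq = seq.replace("U", "T").replace("u", "t")
        if set(seq.upper()) <= IUPAC_DNA:
            cleaned[seq_id] = seq
        else:
            dropped.append(seq_id)
    return cleaned, dropped
```

A sequence containing `?` ends up in `dropped` rather than being rewritten to `N`, in line with the concern about making that assumption silently.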

illustrate sampling of focal sequences in tutorial

This came up in peer review.

tutorial modifications

update name of doc file from methods.md to tutorial.md
break commands over multiple lines, like in the Q2 docs
add tutorial data to Snakemake folder, update instructions for accessing that data
[ ] "Note however that usually you would perform some manual filtering and trimming between these two steps, so these two commands likely won't get you a publication quality phylogeny.": note that we are working on adding more support for these steps (this is now part of #67)
Add suggestions on modifications that need to be made to Snakemake file to apply to non-tutorial data (e.g., update file names and N_THREADS)