genome-sampler's Introduction

genome-sampler

Tools for sampling viral genomes across time of isolation, location of isolation, and genome divergence.

To get started with genome-sampler, see our installation and usage tutorial.

To learn about genome-sampler, read our paper.

This software is a QIIME 2 plugin implementing the subsampling workflow that was initially developed for the Arizona COVID-19 Genomics Union analysis of SARS-CoV-2 genomes.

If you're interested in contributing to genome-sampler, please review the software project's code of conduct, which is adapted from the Contributor Covenant, version 1.4.

If you use genome-sampler in any published work, please cite our paper.

genome-sampler's People

Contributors

ebolyen, gregcaporaso, nbokulich, oddant1, q2d2, thermokarst

genome-sampler's Issues

Error in rule view_context_seqs

I am trying to run genome-sampler on a dataset of 350k genomes.
However, I got the following error:

[Mon Feb 15 09:51:35 2021]
Error in rule view_context_seqs:
jobid: 1
output: results_15fev21/context-seqs.qzv
shell:

qiime feature-table tabulate-seqs       --i-data results_15fev21/context-seqs.qza       --o-visualization results_15fev21/context-seqs.qzv

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

One interesting thing is that the computer froze for a few seconds and the RAM was full. Do you believe it is a RAM issue?
My computer has 64 GB of RAM.

Stochastic test failure in TestSubsampleNeighbors.test_sample_cluster_missing_locales

=================================== FAILURES ===================================
__________ TestSubsampleNeighbors.test_sample_cluster_missing_locales __________

self = <genome_sampler.tests.test_sample_neighbors.TestSubsampleNeighbors testMethod=test_sample_cluster_missing_locales>

    def test_sample_cluster_missing_locales(self):
        columns = ['context_id', 'n_mismatches', 'locale']
        cluster = pd.DataFrame([['c4', 5, 'abc'],
                                ['c2', 0, float('nan')],
                                ['c99', 1, float('nan')],
                                ['c42', 2, 'abc']],
                               columns=columns)
    
        count_obs_c4 = 0
        count_obs_c2 = 0
        count_obs_c99 = 0
        count_obs_c42 = 0
    
        for _ in range(self._N_TEST_ITERATIONS):
            obs = _sample_cluster(cluster, 3, np.random.RandomState())
            self.assertEqual(len(obs), 3)
            if 'c4' in obs:
                count_obs_c4 += 1
            if 'c2' in obs:
                count_obs_c2 += 1
            if 'c99' in obs:
                count_obs_c99 += 1
            if 'c42' in obs:
                count_obs_c42 += 1
    
        # c4 and c42 all have locale "abc" and c99 and c2 have unknown locale,
        # so we expect to see c99 amd c2 more frequently
        self.assertTrue(count_obs_c99 > count_obs_c4)
        self.assertTrue(count_obs_c99 > count_obs_c42)
>       self.assertTrue(count_obs_c2 > count_obs_c4)
E       AssertionError: False is not true

have filter_seqs return a log

It'd be useful to know which and how many sequences were filtered for being too long, too short, or too ambiguous.
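
One possible shape for such a log is sketched below. This is a minimal illustration and not the plugin's implementation: the function name filter_with_log, its parameters, and the reason strings are all hypothetical, and "ambiguous" is assumed here to mean any character other than A, C, G, or T.

```python
from collections import Counter


def filter_with_log(seqs, min_length, max_length, max_ambiguous_fraction):
    """Filter sequences and record why each dropped sequence was removed.

    seqs maps sequence id -> sequence string. All names and thresholds
    here are hypothetical, not genome-sampler's actual parameters.
    Returns (kept, log), where log maps each dropped id to a reason.
    """
    kept = {}
    log = {}
    for seq_id, seq in seqs.items():
        seq = seq.upper()
        # Empty sequences are treated as too short (also avoids dividing by 0).
        if len(seq) == 0 or len(seq) < min_length:
            log[seq_id] = 'too short'
            continue
        if len(seq) > max_length:
            log[seq_id] = 'too long'
            continue
        # Assumption: anything other than A, C, G, T counts as ambiguous.
        n_ambiguous = sum(1 for c in seq if c not in 'ACGT')
        if n_ambiguous / len(seq) > max_ambiguous_fraction:
            log[seq_id] = 'too ambiguous'
            continue
        kept[seq_id] = seq
    return kept, log


def summarize_log(log):
    """Count how many sequences were dropped for each reason."""
    return Counter(log.values())
```

The per-id log answers "which" and the Counter summary answers "how many"; a real action would presumably return this as an artifact or visualization rather than a dict.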

summarize-selections improvements

  • should include the number of unique ids across the provided selections to let the viewer know how many total context sequences they'll have in their downstream analysis
  • visual summaries of the samples (e.g., a map, a tree)
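
The unique-id count from the first item could be computed along these lines. This is only a sketch: it assumes each selection can be reduced to a collection of selected sequence ids, and count_unique_ids is a hypothetical helper name.

```python
def count_unique_ids(selections):
    """Count the distinct ids across several selections.

    selections is an iterable of id collections, one per selection
    (an assumed representation, not the plugin's artifact format).
    """
    unique_ids = set()
    for ids in selections:
        unique_ids.update(ids)
    return len(unique_ids)
```

Because selections can overlap, this count can be smaller than the sum of the individual selection sizes, which is why reporting it separately would help the viewer.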

add documentation section on running in parallel

I've received questions from multiple users on how to run in parallel, so we should add a specific section to the docs on this. Here is some text copied from my replies that we can use in this section:

genome-sampler can be run in parallel to speed it up. This is done in different ways depending on whether you're running the steps individually or through Snakemake.

If you're using Snakemake, you need to edit Snakefile and set the N_THREADS value to the number of threads you'd like genome-sampler to use.

If you're running the steps individually you can pass the --p-n-threads option to several of the commands. For example, sample-diversity is the slowest step in the workflow. You can provide the --p-n-threads parameter to run it in parallel:

qiime genome-sampler sample-diversity \
 --i-context-seqs filtered-context-seqs.qza \
 --p-percent-id 0.9995 \
 --o-selection diversity-selection.qza \
 --p-n-threads n

When running this command, you should set n to be the number of available processors or cores on a single node of your system. For example, I work on a cluster that has nodes with 28 cores, so when I submit this job I would run:

qiime genome-sampler sample-diversity \
 --i-context-seqs filtered-context-seqs.qza \
 --p-percent-id 0.9995 \
 --o-selection diversity-selection.qza \
 --p-n-threads 28

This uses all of the resources on a single node of the cluster for me. In the future we'll be adding support for splitting workflows like this across multiple cluster nodes, but we do not have this support at this time.
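
For users who script their runs, the thread count could be filled in automatically. The sketch below builds the same command as above as an argument list; sample_diversity_command is a hypothetical helper, and os.cpu_count() reports the logical CPUs of the machine, which on a shared cluster node may be more than the cores actually allocated to the job.

```python
import os


def sample_diversity_command(n_threads=None):
    """Build the sample-diversity command shown above as an argument list.

    If n_threads is not given, fall back to the machine's CPU count
    (an assumption; a scheduler may allocate fewer cores than this).
    """
    if n_threads is None:
        # os.cpu_count() returns None when the count cannot be determined.
        n_threads = os.cpu_count() or 1
    return [
        'qiime', 'genome-sampler', 'sample-diversity',
        '--i-context-seqs', 'filtered-context-seqs.qza',
        '--p-percent-id', '0.9995',
        '--o-selection', 'diversity-selection.qza',
        '--p-n-threads', str(n_threads),
    ]
```

The returned list can be passed to subprocess.run, or joined with spaces to print the command for a job script.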

Allow sample-random to take a proportion to be randomly sampled

Currently sample-random can only take the number of items you want to randomly sample. Evan and I ran into an instance while working on the benchmark where it would have been useful to be able to instead specify a proportion of the total number of samples that you want.
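
A rough sketch of how a proportion could be turned into a count and sampled is shown below. It is an illustration only: sample_random_proportion and its parameters are hypothetical, it uses Python's standard random module rather than whatever sample-random uses internally, and rounding to the nearest integer with a minimum of one item is an assumed policy.

```python
import random


def sample_random_proportion(ids, proportion, seed=None):
    """Randomly sample a proportion of ids without replacement.

    proportion must be in (0, 1]. The count is rounded with Python's
    round() and clamped to at least 1 item (assumed policy).
    """
    if not 0 < proportion <= 1:
        raise ValueError('proportion must be in the interval (0, 1]')
    ids = list(ids)
    # Clamp so an empty input yields an empty sample.
    n = min(len(ids), max(1, round(len(ids) * proportion)))
    return random.Random(seed).sample(ids, n)
```

Passing a seed makes the sample reproducible, which matters for benchmarks like the one mentioned above.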

running the tutorial yields a KeyError.

[Wed Aug 26 08:16:24 2020]
rule import_context_seqs:
    input: context-seqs.fasta
    output: context-seqs.qza
    jobid: 4

Traceback (most recent call last):
  File "/home/ctbrown/miniconda3/envs/genome-sampler/lib/python3.6/site-packages/qiime2/sdk/util.py", line 90, in parse_format
    format_record = pm.formats[format_str]
KeyError: 'GISAIDDNAFASTAFormat'

any thoughts?

reduce duplication of code in `GISAIDDNAFASTAFormat` validator

We could port this format to q2-types as a less strict DNAFastaFormat, and DNAFastaFormat could become our strict DNA Fasta Format. These could share a common base class, or we could have a function that creates these formats given ValidationSet and FASTADNAValidator.

select sequences with fewest degenerate characters in sample-diversity

We'll need to see if vsearch has options for this. If not, we should be able to achieve this by sorting sequences based on their fraction of degenerate characters prior to clustering, so that the sequences with the smallest fraction of degenerate characters are most likely to become cluster centroids.
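
The sorting fallback could look roughly like this. It is a sketch under two assumptions: "degenerate" means any character other than A, C, G, or T, and the clustering step processes sequences in the order given. The function names are hypothetical.

```python
def degenerate_fraction(seq):
    """Fraction of characters in seq that are not A, C, G, or T."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(1 for c in seq if c not in 'ACGT') / len(seq)


def sort_by_degenerate_fraction(seqs):
    """Return (id, sequence) pairs, least degenerate first.

    seqs maps sequence id -> sequence string. sorted() is stable, so
    sequences with equal fractions keep their original relative order.
    """
    return sorted(seqs.items(), key=lambda item: degenerate_fraction(item[1]))
```

Writing the sequences out in this order before clustering would make the cleanest sequences the earliest centroid candidates, if the clustering tool honors input order.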

add smaller data set for testing

The current Snakemake workflow takes a long time to run. We'll need a smaller data set (shorter sequences, I think) for efficient testing.

filter context sequences with ids that are not in metadata

I'm running an analysis with yesterday's GISAID downloads and there is at least one sequence id that is present in the sequence data but not in the metadata. We should add a note to the documentation on how to handle this and maybe add it to the Snakemake file as well. Here's how I'm handling this now:

qiime feature-table filter-seqs --i-data context-seqs.qza --m-metadata-file context-metadata.tsv --o-filtered-data context-seqs-w-metadata.qza
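
Before filtering, it can help to see which ids are affected. The sketch below is a hypothetical helper, not part of genome-sampler or QIIME 2, and assumes the sequence ids and metadata ids have already been read into Python collections.

```python
def ids_missing_metadata(sequence_ids, metadata_ids):
    """Return the sorted sequence ids that have no metadata entry."""
    return sorted(set(sequence_ids) - set(metadata_ids))
```

An empty result means the filter-seqs command above would not drop anything.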

generalize label_seqs input type

This should be FeatureData[Sequence|AlignedSequence], pending a framework bug fix. @ebolyen or @Oddant1, it doesn't look like there is an issue yet for that framework bug. Can you add a link to it here when there is one?

Bootstrap jupyterbook builds

See https://github.com/executablebooks/quantecon-mini-example/tree/master/.github/workflows
for reference.

Phase 1: build https://github.com/caporaso-lab/q2-covid-19/blob/master/docs/methods.md "as-is" (updating as we replace scripts with QIIME 2 actions).
Phase 2: integrate https://github.com/qiime2/sphinx-ext-qiime2 (once ready), so that we can get q2cli-executed outputs (just like on docs.qiime2.org).
Phase 3: replace q2cli commands with usage examples (will require new directives in sphinx-ext-qiime2).

handle ? and U characters in GISAID format files

I came across both in GISAID downloaded sequence data today, and they caused imports to fail. I think this should be handled by first replacing any U or u characters in sequences with a T, and then dropping any sequences that still contain characters that are outside of the IUPAC DNA alphabet (the ? probably implies N, but I'm not comfortable with generally, silently making that assumption).
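
The two-step handling proposed above could be sketched as follows. This is an illustration, not the format's actual validator: clean_sequences is a hypothetical name, and the alphabet is assumed to be the IUPAC DNA characters without gap characters, since the input is unaligned.

```python
# IUPAC DNA alphabet: the four bases plus the degenerate codes.
# Assumption: gap characters are excluded because input is unaligned.
IUPAC_DNA = set('ACGTRYSWKMBDHVN')


def clean_sequences(seqs):
    """Replace U/u with T/t, then drop sequences with non-IUPAC characters.

    seqs maps sequence id -> sequence string.
    Returns (kept, dropped_ids) so that nothing is removed silently.
    """
    kept = {}
    dropped_ids = []
    for seq_id, seq in seqs.items():
        seq = seq.replace('U', 'T').replace('u', 't')
        if set(seq.upper()) <= IUPAC_DNA:
            kept[seq_id] = seq
        else:
            # For example, a sequence containing '?' ends up here.
            dropped_ids.append(seq_id)
    return kept, dropped_ids
```

Returning the dropped ids keeps the behavior explicit: a sequence containing ? is reported as dropped rather than quietly reinterpreted as containing N.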

tutorial modifications

  • update name of doc file from methods.md to tutorial.md
  • break commands over multiple lines, like in the Q2 docs
  • add tutorial data to Snakemake folder, update instructions for accessing that data
  • [ ] For the sentence "Note however that usually you would perform some manual filtering and trimming between these two steps, so these two commands likely won’t get you a publication quality phylogeny.": note that we are working on adding more support for these steps (this is now part of #67)
  • Add suggestions on modifications that need to be made to Snakemake file to apply to non-tutorial data (e.g., update file names and N_THREADS)
