
sars-cov-3's Introduction

Daily analyses of SARS-CoV-2 genomic data

This project is a part of a larger effort with the Galaxy team: covid19.galaxyproject.org

TL;DR

  1. Analysis of all current SARS-CoV-2 genomes for evidence of natural selection
  2. Divergence and diversity of SARS-CoV-2 genomes over time, both overall and by region

Analysis pipeline

  1. We collect data from the GISAID database daily. These are mostly full-genome sequences collected from different platforms and different regions. See here for a summary of the sequence data. The metadata on the sequences can be found in data/db/master-no-fasta.json; in accordance with GISAID data usage policies, we do not distribute sequence data here.

  2. We extract full-genome sequences from human hosts and map them to the reference genes using a simple codon-aware pipeline. At this step we also compress the data to retain a single copy of each unique haplotype in the gene, and filter out sequences that have too many (>0.5%) uncalled/unresolved (N) bases.

  3. We reconstruct maximum-likelihood (ML) phylogenies on the compressed data using raxml-ng.

  4. We estimate gene-by-gene distances to compute diversity and divergence using TN93, summarized here.

  5. We run several HyPhy dN/dS-based selection analyses on each gene. We restrict these analyses to internal branches of the tree to filter out within-host evolution.

When analyzing intra-species or intra-host data, dN/dS estimates may be inflated because not all observed sequence variation reflects substitutions; some of it consists of mutations that have not yet been filtered by selection. In other words, dN/dS may be elevated by intra-species or intra-host polymorphism that need not be attributable to positive selection. One simple approach to mitigating this undesirable effect is to restrict site-specific analyses to internal branches only. Internal branches encompass at least one step that is visible to selection (transmission and/or multiple rounds of replication) and are therefore less likely to contain spurious polymorphic variants.
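As a rough illustration of the internal-versus-terminal distinction described above, here is a minimal Python sketch. The nested-tuple tree representation and the `split_branches` helper are hypothetical, chosen only for illustration; this is not how HyPhy represents trees or selects branches.

```python
# Hypothetical toy representation: a leaf is a string (a sampled sequence),
# an internal node is a tuple of subtrees.

def split_branches(node, path="root"):
    """Return (internal, terminal) branch labels found below `node`.

    A branch is 'terminal' if it leads directly to a sampled sequence,
    and 'internal' if it leads to a node that has descendants of its own.
    """
    internal, terminal = [], []
    if isinstance(node, str):
        return internal, terminal
    for i, child in enumerate(node):
        label = f"{path}.{i}"
        if isinstance(child, str):
            terminal.append(child)      # branch ending at a tip
        else:
            internal.append(label)      # branch ending at an ancestral node
            sub_internal, sub_terminal = split_branches(child, label)
            internal.extend(sub_internal)
            terminal.extend(sub_terminal)
    return internal, terminal


tree = (("seqA", "seqB"), ("seqC", ("seqD", "seqE")))
internal, terminal = split_branches(tree)
print("internal:", internal)
print("terminal:", terminal)
```

In this toy tree, only the three branches leading to ancestral nodes would be kept for site-level tests; the five branches leading to individual sequences, where unfiltered polymorphism is most likely to appear, would be excluded.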

  6. These analyses include SLAC, FEL, MEME, and PRIME (the last of which tests for conservation or change in specific biochemical properties at a site) to identify which sites may be experiencing positive selection, and which properties may be important to preserve or change at those sites. The up-to-date summary is hosted here
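The filtering and compression described in step 2 might look roughly like the following Python sketch. The 0.5% threshold comes from the text above; the function names, the `(name, sequence)` record format, and the exact-string definition of a haplotype are assumptions made for illustration, not the project's actual implementation (a real pipeline would operate on codon-aware alignments and would likely treat ambiguous bases more carefully).

```python
MAX_N_FRACTION = 0.005  # 0.5% uncalled bases, as stated in step 2


def n_fraction(seq):
    """Fraction of uncalled (N) bases in a sequence; empty input counts as all-N."""
    if not seq:
        return 1.0
    return seq.upper().count("N") / len(seq)


def filter_and_compress(records):
    """Drop sequences with too many N bases, then keep one copy per unique haplotype.

    `records` is an iterable of (name, sequence) pairs (hypothetical format).
    Returns a dict mapping each retained unique sequence to the list of
    names that share it.
    """
    haplotypes = {}
    for name, seq in records:
        seq = seq.upper()
        if n_fraction(seq) > MAX_N_FRACTION:
            continue  # too many unresolved bases
        haplotypes.setdefault(seq, []).append(name)
    return haplotypes


records = [
    ("s1", "ATG" * 100),               # 300 bases, no N
    ("s2", "ATG" * 100),               # identical to s1
    ("s3", "ATG" * 99 + "ATN"),        # 1 N in 300 bases (~0.33%), kept
    ("s4", "ATG" * 98 + "NNNATG"),     # 3 N in 300 bases (1%), dropped
]
compressed = filter_and_compress(records)
for seq, names in compressed.items():
    print(len(names), "copies:", names)
```

Keeping the list of names per haplotype means that copy counts can still be recovered after compression, which matters if downstream summaries need to weight haplotypes by how often they were sampled.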
