The sars-cov-3 from stevenweaver

Daily analyses of SARS-CoV-2 genomic data

This project is a part of a larger effort with the Galaxy team: covid19.galaxyproject.org

TL;DR

Analysis pipeline

We collect data from the GISAID database daily. These are mostly full-genome sequences collected on different platforms and from different regions. See here for a summary of the sequence data. The metadata on the sequences can be found in data/db/master-no-fasta.json; in accordance with GISAID data usage policies, we do not distribute sequence data here.
We extract full-genome sequences from human hosts and map them to the reference genes using a simple codon-aware pipeline. At this step we also compress the data to retain a single copy of each unique haplotype in each gene, and filter out sequences that have too many (>0.5%) uncalled/unresolved (N) bases.
We reconstruct ML phylogenies on the compressed data using raxml-ng.
We estimate gene-by-gene distances using TN93 to compute diversity and divergence, summarized here.
We run several HyPhy dN/dS-based selection analyses on each gene. We restrict these analyses to internal branches of the tree to filter out within-host evolution.
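
The filtering and haplotype compression described in the extraction step above can be illustrated with a minimal Python sketch. This is not the pipeline's actual code: the function names (`n_fraction`, `filter_and_compress`) and the `(id, sequence)` input format are assumptions made for illustration, and only the 0.5% threshold comes from the text.

```python
from collections import defaultdict


def n_fraction(seq: str) -> float:
    """Fraction of positions that are uncalled (N), ignoring case."""
    if not seq:
        return 1.0  # treat an empty sequence as fully unresolved
    return seq.upper().count("N") / len(seq)


def filter_and_compress(records, max_n_fraction=0.005):
    """Drop sequences with too many N bases, then keep one copy per haplotype.

    records: iterable of (sequence_id, sequence) pairs for a single gene
             (hypothetical input format).
    Returns a dict mapping each retained unique sequence to the list of
    sequence IDs that share it.
    """
    haplotypes = defaultdict(list)
    for seq_id, seq in records:
        if n_fraction(seq) > max_n_fraction:
            continue  # more than 0.5% uncalled bases: filtered out
        haplotypes[seq.upper()].append(seq_id)
    return dict(haplotypes)


clean = "ACGT" * 100            # 400 bases, no N
one_n = "N" + clean[1:]         # 1/400 = 0.25% N, kept
noisy = "NNN" + clean[3:]       # 3/400 = 0.75% N, dropped
records = [("a", clean), ("b", clean.lower()), ("c", noisy), ("d", one_n)]
compressed = filter_and_compress(records)
```

Here `compressed` holds two unique haplotypes: `clean` (shared by `a` and `b`) and `one_n` (carried by `d` alone). Keeping the list of IDs per haplotype means copy numbers can still be recovered after compression.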

When analyzing intra-species or intra-host data, dN/dS estimates may be inflated because not all observed sequence variation is due to substitutions; some of it is simply mutations that have not yet been filtered by selection. In other words, dN/dS may be elevated by intra-species / intra-host polymorphism that need not be attributable to positive selection. One simple approach to mitigating this undesirable effect is to restrict site-specific analyses to internal branches only. This is because internal branches encompass at least one step that is visible to selection (transmission and/or multiple rounds of replication), and are less likely to contain spurious polymorphic variants.
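
To make the internal/terminal distinction concrete, here is a toy Python sketch that splits the branches of a small rooted tree into the two classes. It is only an illustration: the nested-tuple tree representation and the `classify_branches` helper are assumptions, not part of this repository or of HyPhy, which handles branch selection itself.

```python
def leaf_names(node):
    """Collect the leaf names under a node (a leaf is a str, a clade is a tuple)."""
    if isinstance(node, str):
        return [node]
    names = []
    for child in node:
        names.extend(leaf_names(child))
    return names


def classify_branches(tree):
    """Split branches into internal and terminal ones.

    A branch is identified by the node it leads to: a terminal branch leads
    to a leaf, an internal branch leads to a clade. Internal branches are
    labeled by the sorted tuple of leaf names beneath them.
    """
    internal, terminal = [], []

    def visit(node):
        for child in node:
            if isinstance(child, str):
                terminal.append(child)
            else:
                internal.append(tuple(sorted(leaf_names(child))))
                visit(child)

    visit(tree)
    return internal, terminal


# Hypothetical five-sequence tree: ((A,B),C) and (D,E) joined at the root.
tree = ((("A", "B"), "C"), ("D", "E"))
internal, terminal = classify_branches(tree)
```

For this tree, `internal` contains the branches above the clades `(A, B, C)`, `(A, B)`, and `(D, E)`, while `terminal` contains the five branches leading to individual sequences. Restricting an analysis to internal branches means ignoring changes that occur only on the terminal ones.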

These analyses include SLAC, FEL, MEME, and PRIME (the latter tests for conservation or change of specific biochemical properties at a site) to identify which sites may be experiencing positive selection, and which properties may be important to preserve or change at those sites. The up-to-date summary is hosted here.
