DISSEQT.jl

The DISSEQT.jl package is a Julia implementation of the pipeline described in the paper DISSEQT – DIStribution based modeling of SEQuence Space Time dynamics.

Installation

Start Julia (1.1 or later) and enter the Package REPL by pressing ]. Then install DISSEQT.jl by entering:

add https://github.com/rasmushenningsson/SynapseClient.jl.git
add https://github.com/rasmushenningsson/DISSEQT.jl.git

Also see installation instructions for SynapseClient.jl if you want to enable the Synapse features in DISSEQT.
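For scripted or non-interactive setups, the same installation can likely be done through the Pkg API instead of the Package REPL. The following is a minimal sketch under that assumption; `install_disseqt` and `disseqt_repos` are hypothetical names introduced here, and only the two repository URLs come from the instructions above.

```julia
using Pkg

# Repository URLs copied from the installation instructions above.
disseqt_repos = [
    "https://github.com/rasmushenningsson/SynapseClient.jl.git",
    "https://github.com/rasmushenningsson/DISSEQT.jl.git",
]

# Hypothetical helper: adds each repository in the listed order
# (SynapseClient.jl first, as in the Package REPL instructions).
function install_disseqt(repos=disseqt_repos)
    for url in repos
        Pkg.add(Pkg.PackageSpec(url=url))
    end
end

# install_disseqt()  # uncomment to install; requires network access
```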

Examples

The complete analysis of deep sequencing data from the DISSEQT paper is available at the collaborative science platform Synapse here. In order to view and download files you must create a Synapse account. (The scripts are also available in the examples folder for reference, but note that you will need a Synapse account to access the data files.)

The Provenance system in Synapse makes it possible to trace the steps used to produce every result in Synapse, showing how the analysis was done (which script was called) and listing all input files. All example scripts below upload their results to Synapse. The scripts themselves are also automatically uploaded to ensure that all analyses can be rerun elsewhere. (To run the scripts locally, you need to make the Julia packages used by each script available, e.g. by running add JLD in the Julia Package REPL.) The steps in the DISSEQT pipeline are outlined below:

Alignment

If you already have BAM files, you can start the DISSEQT pipeline from the next step. Note however that DISSEQT performs iterative alignment. That is, if the consensus sequence of an aligned mutant swarm is different from the reference sequence used during alignment, it is realigned using the new consensus sequence as the reference. The process is repeated until the reference does not change. Iterative alignment improves inference of codon frequencies close to consensus changes.
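The realignment loop described above can be sketched as a fixed-point iteration. This is an illustration only, not DISSEQT's implementation: `iterative_align`, the `align` and `consensus` callbacks, the `maxiter` cap and the toy stand-ins are all assumptions introduced here.

```julia
# `align` and `consensus` are hypothetical callbacks, not DISSEQT functions:
#   align(sample, reference) -> an alignment object
#   consensus(alignment)     -> the consensus sequence of that alignment
function iterative_align(align, consensus, sample, reference; maxiter=10)
    for iter in 1:maxiter
        alignment = align(sample, reference)
        newref = consensus(alignment)
        # Stop once the consensus equals the reference used for this alignment.
        if newref == reference
            return (alignment=alignment, reference=reference, iterations=iter)
        end
        reference = newref
    end
    error("reference did not stabilize within $maxiter iterations")
end

# Toy stand-ins: each round corrects the first position where the
# reference differs from the sample's true consensus.
toy_align(sample, ref) = (sample=sample, ref=ref)
function toy_consensus(aln)
    i = findfirst(k -> aln.sample[k] != aln.ref[k], 1:length(aln.ref))
    i === nothing && return aln.ref
    string(aln.ref[1:i-1], aln.sample[i], aln.ref[i+1:end])
end

result = iterative_align(toy_align, toy_consensus, "ACGT", "AGGA")
println(result.reference, " after ", result.iterations, " iterations")
```

In the toy run the reference differs from the sample at two positions, so two rounds update the reference and a third confirms that it no longer changes.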

To run alignment locally, you need to have bwa, samtools and fastq-mcf installed and available in your path.
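Before starting a run, it can be useful to confirm that the three tools are actually visible to Julia. The sketch below uses `Sys.which` from Julia's standard library; the helper name `missing_tools` is made up for this example.

```julia
# Tools named in the text above.
required_tools = ["bwa", "samtools", "fastq-mcf"]

# Return the tools that cannot be found on the PATH.
# Sys.which returns `nothing` when no matching executable is found.
missing_tools(tools) = [t for t in tools if Sys.which(t) === nothing]

notfound = missing_tools(required_tools)
if isempty(notfound)
    println("All alignment tools found.")
else
    println("Missing from PATH: ", join(notfound, ", "))
end
```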

Example scripts and other relevant files for running alignment using DISSEQT can be found here. It is recommended to use one script for each run. The Reference Genomes and Adapter files are also needed. The iterative alignment typically took about 1-4 minutes per sample on a modest desktop computer. Alignment for multiple samples can be run in parallel to cut total runtime.
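One possible way to process several samples concurrently from a single Julia session is `asyncmap`, sketched below. `process_sample` and the sample names are hypothetical placeholders; the example scripts themselves may organize parallel runs differently.

```julia
# Hypothetical placeholder for the per-sample work, e.g. launching the
# external alignment tools for one sample and waiting for them to finish.
process_sample(name) = "aligned:" * name

samples = ["sampleA", "sampleB", "sampleC"]

# asyncmap runs up to `ntasks` calls concurrently and returns the results
# in input order. It uses tasks rather than threads, so it helps mainly
# when each call spends its time waiting on external processes.
results = asyncmap(process_sample, samples; ntasks=2)
println(results)
```

For work that is CPU-bound inside Julia itself, separate Julia processes (for example via the Distributed standard library) would be the more suitable choice.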

The outputs of the Alignment step are BAM Files. The consensus sequence and a detailed alignment log for each sample are saved in the same folder. An overview log file, AlignUtils.log, is also created.

An optional step for quality control is to create Read Coverage graphs by running another script.

[Figure: Read Coverage]

Codon Frequency Inference

Codon frequencies are inferred from BAM Files. An example script can be found here. The output is one Mutant Swarm File per sample and an overview log file. Runtimes for the codon frequency inference were similar to alignment at about 1-4 minutes per sample on a modest desktop computer. Inference for multiple samples can be run in parallel to cut total runtime.
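To give a rough idea of what a codon frequency is, the toy sketch below counts the codons observed at a single site and normalizes the counts. It assumes the reads have already been reduced to one codon per read at that site; DISSEQT's actual inference works from BAM Files and is not reproduced here. `codon_frequencies` and the input data are made up for illustration.

```julia
# Count how often each codon occurs among the reads covering one site
# and convert the counts to frequencies.
function codon_frequencies(codons)
    counts = Dict{String,Int}()
    for c in codons
        counts[c] = get(counts, c, 0) + 1
    end
    total = length(codons)
    Dict(c => n / total for (c, n) in counts)
end

# Toy input: codons observed at a single site in eight reads.
observed = ["GCT", "GCT", "GCT", "GCT", "GCT", "GCT", "GCC", "GTT"]
codonfreqs = codon_frequencies(observed)
println(codonfreqs)
```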

All the later steps run in a matter of minutes for the whole data set.

Limit of Detection

DISSEQT determines the Limit of Detection for each codon at each site per experiment. To account for differences between runs, a Metadata table with details about the samples is needed for this step. The Limit of Detection script for the Fitness Landscape data set can be found here.

Dimension Estimation

A Talus Plot used for dimension estimation of the Fitness Landscape data set can be created using this script.

[Figure: Talus Plot]

Sequence Space Representation

One of the core features of the DISSEQT pipeline is to produce a low-dimensional representation of the Mutant Swarms in Sequence Space, which is useful for plotting and downstream analysis. The script for the Fitness Landscape data set can be found here. First, a high-dimensional representation is created using the inferred codon frequencies and estimated per-variant limits of detection. Second, SubMatrix Selection SVD is used for dimension reduction, based on the number of dimensions determined in the Talus Plot.

[Figure: SubMatrixSelectionSVD plot]
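The two stages can be illustrated with a small sketch. Several things here are assumptions rather than the pipeline's method: the data are made up, the way the limits of detection enter the representation (flooring followed by a log transform) is a guess, and a plain truncated SVD stands in for SubMatrix Selection SVD, which is not reimplemented.

```julia
using LinearAlgebra

# Toy data (made up): rows are samples, columns are codon variants.
freqmatrix = [0.010  0.0001  0.20;
              0.020  0.0000  0.10;
              0.005  0.0030  0.40;
              0.040  0.0002  0.05]

# Hypothetical per-variant limits of detection, one per column.
lod = [0.001 0.001 0.01]

# Assumption: frequencies below the limit of detection are raised to it
# and then log-transformed, so undetected variants are treated alike.
X = log2.(max.(freqmatrix, lod))

# Center each variant (column) across samples.
Xc = X .- sum(X, dims=1) ./ size(X, 1)

# Plain truncated SVD as a stand-in for SubMatrix Selection SVD.
F = svd(Xc)
k = 2                             # in the pipeline, read off the Talus Plot
Y = F.U[:, 1:k] .* F.S[1:k]'      # samples x k low-dimensional coordinates
println(size(Y))
```

Each row of `Y` holds the coordinates of one sample in the reduced space, which is the kind of output that plotting and downstream analysis would consume.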

Downstream Analysis

Based on the low-dimensional representation, fitness landscapes and sequence space visualizations are created in this script. An evaluation of the ability of different models to predict fitness is performed in this script.

[Figure: Fitness Landscape]

Contact

If you have problems running DISSEQT, please open an issue in the Issue Tracker or contact [email protected]. A docker file will be published soon.
