Installation and usage instructions can be found in the wiki.

Orchid

A framework for cancer variant annotation, classification, analysis

Introduction

Please refer to the following publication for a detailed description of the software:
Bioinformatics, btx709, https://doi.org/10.1093/bioinformatics/btx709


What is orchid?

The objective of orchid is to facilitate meaningful biological and clinical interpretation of tumor genetic data through the use of machine learning. For example, orchid could be used to classify aggressive vs. non-aggressive prostate cancer or determine the tissue-of-origin from the cell-free DNA molecules of a patient with cancer.


What is a 'tumor mutational profile'?

In the orchid framework, we define a tumor mutational profile as the annotated set of mutations within a tumor. A typical tumor might contain thousands of mutations. Most are presumed to be irrelevant to disease because they arise from an important hallmark of cancer: an unstable genome. However, a crucial subset of these mutations is considered fundamental to carcinogenesis, or at least significantly involved, making them potential biomarkers for clinical classification (e.g. tumor aggressiveness). Orchid adopts a comprehensive approach to variant analysis, employing machine learning algorithms to collectively analyze all mutations. This methodology exposes nuanced mutational patterns and helps tease apart biological complexity.


What is an 'annotated set of mutations'?

Annotations are numeric or categorical values that are associated with a particular mutation. For example, mutation 'A' may change the amino acid sequence of a protein, so we can annotate it with one category of amino acid consequences: a 'non-synonymous single nucleotide polymorphism' or 'nsSNP'. On the other hand, mutation 'B' may change a codon, but not the corresponding amino acid, so we would annotate it with another amino acid consequence category: a 'synonymous SNP'. Biologically speaking, an nsSNP is more likely to change the function of a protein than a synonymous SNP. In the machine learning world, annotations like these are called features. If we gather many mutations across a tumor (or tumors) and annotate each mutation with many features, we end up with a set of annotated mutations, which we call a tumor mutational profile.
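
As a rough illustration of the idea, the sketch below annotates two made-up mutations with a categorical amino acid consequence and a numeric score, producing one feature row per mutation. The field names, the `annotate` helper, and the `conservation` values are assumptions for this example only; they are not orchid's schema or API.

```python
# Hypothetical mutations: identifiers, amino acids, and scores are invented.
mutations = [
    {"id": "A", "ref_aa": "R", "alt_aa": "H", "conservation": 0.92},
    {"id": "B", "ref_aa": "L", "alt_aa": "L", "conservation": 0.15},
]


def amino_acid_consequence(ref_aa, alt_aa):
    """Categorical annotation: does the mutation change the amino acid?"""
    return "synonymous SNP" if ref_aa == alt_aa else "nsSNP"


def annotate(mutation):
    """Turn one mutation into a row of features (categorical and numeric)."""
    consequence = amino_acid_consequence(mutation["ref_aa"], mutation["alt_aa"])
    return {
        "id": mutation["id"],
        "consequence": consequence,                      # categorical feature
        "is_nsSNP": 1 if consequence == "nsSNP" else 0,  # same feature, encoded as 0/1
        "conservation": mutation["conservation"],        # numeric feature
    }


# The list of annotated rows plays the role of a (tiny) mutational profile.
profile = [annotate(m) for m in mutations]
for row in profile:
    print(row)
```

A real profile would have many more rows and columns, but the shape is the same: one row per mutation, one column per feature.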

To date, many regulatory and coding features of the human genome have been cataloged. If we gather enough biological data to annotate mutations found in a tumor genome, we may be able to understand the mutational process in cancer. For development and publication of this software, we used quite a few public biological databases (see here; Note: This page is now archived). In practice, any annotation source can be used.

Here's an example of a mutational profile: [figure: Mutational Profile]

Mutations are arranged in rows and corresponding feature values in columns. The values here are normalized and colored white to orange (low to high). There is also a final column of sample labels, which is ultimately used for training and validation. NOTE: You may notice a lot of correlated feature vectors. Before training an ML model, it's important to reduce feature correlation as much as possible!
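
One simple way to reduce feature correlation is a greedy filter: keep a feature only if its absolute Pearson correlation with every feature already kept is below a threshold. The sketch below is a minimal standard-library version of that idea, not the procedure orchid itself uses; the feature names, values, and the 0.9 threshold are assumptions chosen for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant column is treated as uncorrelated here
    return cov / (sd_x * sd_y)


def drop_correlated(columns, threshold=0.9):
    """Greedily keep features whose |r| with all kept features is < threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical normalized feature columns (one value per mutation).
features = {
    "conservation": [0.1, 0.4, 0.5, 0.9],
    "conservation_scaled": [0.2, 0.8, 1.0, 1.8],  # a rescaled copy, so r is 1
    "gc_content": [0.7, 0.2, 0.9, 0.3],
}

kept = drop_correlated(features)
print(kept)
```

Because the filter is greedy, the result depends on column order: the first feature of a correlated group is the one that survives.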

Getting Started

  1. Download this code and install prerequisites
  2. Obtain tumor and annotation data
  3. Build the database
  4. Perform machine learning

Please refer to the wiki to begin!

NOTICE: This software requires the use of other code and/or data that must be obtained with respect to its license or copyright. Generally speaking, this implies orchid's use is restricted to non-commercial activities. Orchid itself is licensed under the MIT license requiring only preservation of copyright and license notices. Please see the LICENSE file for more details.

orchid's Issues

Can't Connect to MySQL Server for Tissue of Origin Example Dataset

[Screenshot: jupyter-notebook11(cannotConnectToWitteLab-error)]

Hi,

I have come across an error connecting to the MySQL server on wittelab.ucf.edu that was used for the Tissue of Origin Example dataset for the Orchid software when I tried to run it in the jupyter notebook for performing Machine Learning as specified in the wiki. I have attached a screenshot of my error message. Please let me know if there is any way of resolving this issue. Thanks!

Persistent output for load() and encode() functions

The load and encode functions clear the output to display progress updates, but the previous output lines are useful. An output array could be included in a MutationMatrix and reprinted on each iteration using a custom print function.
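
A minimal sketch of this idea, assuming a small helper object rather than changes to MutationMatrix itself: the class name `ProgressLog` and its methods are hypothetical, and the `clear` callable is injected so the example runs anywhere. In a notebook, `clear` could be something like `lambda: IPython.display.clear_output(wait=True)`.

```python
class ProgressLog:
    """Remembers earlier output lines so they survive a cleared display."""

    def __init__(self, clear=None, write=print):
        self.lines = []                       # the persistent "output array"
        self.clear = clear or (lambda: None)  # no-op unless a clearer is given
        self.write = write

    def log(self, line):
        """Record a line that should stay visible across progress updates."""
        self.lines.append(line)

    def show(self, progress):
        """Clear the display, reprint the saved lines, then the progress line."""
        self.clear()
        for line in self.lines:
            self.write(line)
        self.write(progress)


log = ProgressLog()
log.log("Loaded 41 mutations")
log.show("Encoding: 50%")
log.show("Encoding: 100%")
```

Each call to `show` reprints the saved lines before the current progress line, so earlier messages are not lost when the display is cleared.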

Persistent mysql databases in docker image

May have to base the image on mysql and update the instructions on how to mount volumes. Since docker is the alternative setup, this is low priority for now. Please comment if you'd like this feature; there is a strong chance I'll implement it!

Add wig support

It is possibly easiest to convert wig files to bed in the feature directory, with a process run during database population.
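
The conversion itself could look roughly like the sketch below, which handles the `variableStep` and `fixedStep` declarations of the WIG format and shifts the 1-based WIG positions to 0-based, half-open BED intervals. The function name `wig_to_bed` is hypothetical and this is not part of orchid's workflow; defaulting a missing `step` to 1 is also an assumption.

```python
def wig_to_bed(lines):
    """Yield (chrom, start, end, value) BED rows from an iterable of WIG lines."""
    mode = chrom = None
    span = step = 1
    position = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip comments and track/browser header lines
        fields = line.split()
        if fields[0] in ("variableStep", "fixedStep"):
            # Declaration line, e.g. "fixedStep chrom=chr2 start=11 step=10"
            mode = fields[0]
            params = dict(f.split("=", 1) for f in fields[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            if mode == "fixedStep":
                position = int(params["start"])
                step = int(params.get("step", 1))  # assumption: default to 1
            continue
        if mode == "variableStep":
            start, value = int(fields[0]), float(fields[1])
        elif mode == "fixedStep":
            start, value = position, float(fields[0])
            position += step
        else:
            raise ValueError("data line found before a declaration line")
        # WIG positions are 1-based; BED intervals are 0-based and half-open.
        yield (chrom, start - 1, start - 1 + span, value)


wig = [
    "variableStep chrom=chr1 span=2",
    "101 0.5",
    "201 1.5",
    "fixedStep chrom=chr2 start=11 step=10",
    "3.0",
    "4.0",
]
rows = list(wig_to_bed(wig))
for chrom, start, end, value in rows:
    print(f"{chrom}\t{start}\t{end}\t{value}")
```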

More refined ROC/PR curves

It would be useful to have options for plotting the ROC curve of each CV fold as well as the mean curve and its standard deviation.

This should work in a multi-class scenario, where all classes could optionally be plotted together or separately, or in a one class scenario.
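
The numbers behind such a plot could be computed along the lines of the sketch below: build a ROC curve per fold, interpolate each onto a shared false-positive-rate grid, then take the mean and standard deviation of the true positive rate at each grid point. This is a standard-library illustration with made-up fold data, not orchid code; all function names are hypothetical, plotting is left out, and each fold is assumed to contain at least one positive and one negative label. A multi-class version could apply the same steps per class in a one-vs-rest fashion.

```python
import statistics


def roc_points(labels, scores):
    """ROC points (fpr, tpr) for binary labels, from high to low threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def tpr_at(points, fpr):
    """Linearly interpolate the TPR of one ROC curve at a given FPR."""
    lo = max(p for p in points if p[0] <= fpr)
    above = [p for p in points if p[0] > fpr]
    if not above:
        return lo[1]
    hi = min(above)
    frac = (fpr - lo[0]) / (hi[0] - lo[0])
    return lo[1] + frac * (hi[1] - lo[1])


def mean_roc(folds, grid):
    """Mean and population std of TPR across folds at each FPR in grid."""
    curves = [roc_points(labels, scores) for labels, scores in folds]
    per_point = [[tpr_at(c, fpr) for c in curves] for fpr in grid]
    means = [statistics.mean(t) for t in per_point]
    stds = [statistics.pstdev(t) for t in per_point]
    return means, stds


# Two hypothetical CV folds: (true labels, classifier scores).
folds = [
    ([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]),
    ([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]),
]
grid = [0.0, 0.5, 1.0]
means, stds = mean_roc(folds, grid)
print(means, stds)
```

The per-fold curves from `roc_points`, the mean curve, and a band of one standard deviation around it are the three things the plotting options would then draw.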

Problem with ./make_database.sh

N E X T F L O W ~ version 0.23.4
Launching /export/home/craig/orchid/workflow/annotate.nf [elated_banach] - revision: fe1e111a6e
======================== Run Info ==================================================
Database: mysql://root:orchid123@localhost:3306/feature_test
Mutations: 41
Number of chunks / process: 1

[warm up] executor > local
[b7/afe46e] Submitted process > makeTabixes (splitting data)
[36/cb64ee] Submitted process > makeBeds (splitting data)
[73/b446ec] Submitted process > updateMetadata (saving feature info)
WARN: Process makeBeds (splitting data) terminated with an error exit status (1) -- Execution is retried (1)
[76/a6ef5b] Re-submitted process > makeBeds (splitting data)
ERROR ~ Error executing process > 'makeBeds (splitting data)'

Caused by:
Process makeBeds (splitting data) terminated with an error exit status (1)

Command executed:

mysql -u$MYSQL_USER -h$MYSQL_IP -P$MYSQL_PORT -D$MYSQL_DB -NB -e "SELECT CONCAT('chr',chromosome), start-1, end, ssm_id, '' FROM ssm" > variants.bed
sort-bed variants.bed > sorted_variants.bed

Command exit status:
1

Command output:
(empty)

Command error:

*****ERROR: Unrecognized parameter: variants.bed *****
I've tried entering the command by hand and get ERROR 1049 (42000): Unknown database 'orchid_20171215'

Thanks!

CNV feature conversion

There is a /code/etc/parse_cnv.sh script to prepare CNV data for annotation. If it doesn't take too long to run, it may be better to include it in the annotation process itself so a user doesn't have to call it manually before running the orchid-db workflow.
