License: MIT

ExpressYeaself

Authors: Joe Abbott, Keertana Krishnan, Guoyao Chen.


Overview

ExpressYeaself is an open source scientific software package that aims to quickly and accurately predict the contribution a promoter sequence makes to the expression of genes in Saccharomyces cerevisiae (brewer's yeast).

This will allow the costly and time-consuming trial-and-error processes in the development and synthesis of biotherapeutics to be streamlined. Our goal is to use machine learning and data mining to make recommendations on which promoter sequences are likely to contribute to high levels of gene expression, and which are not.

For further details on the scientific background of our project and the back-end operation of our package, please see our use cases (doc/use_cases.md).


Current Features

  1. Raw data [1], consisting of ~62 million sequences and their associated expression levels, can be processed in an automated, efficient, and highly tunable way, according to a large number of processing parameters.

  2. Processed data is manipulated for input into our neural networks by one-hot encoding. This allows relationships between motifs within nucleotide sequences - and the effect they have on the expression level - to be learned on a deep level.

  3. Three different models have been trained on this encoded data:

    • 1-dimensional convolutional neural network (1DCNN)
    • 1-dimensional locally connected network (1DLOCCON)
    • Long short-term memory (LSTM) network, a type of recurrent neural network.
  4. These trained models can then be used to make predictions on the extent to which each promoter sequence in a file will contribute to a gene's expression level.

This means that a large input file of promoter sequences with potential for use in biotherapeutic drug design can be rapidly evaluated for how likely each sequence is to be effective.
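
To make the one-hot encoding step in feature 2 concrete, here is a minimal sketch in plain Python. It is an illustration only, not the package's own `encode_sequences` module: the function name `one_hot_encode`, the A/C/G/T channel order, and the all-zero row for unknown bases such as `N` are all assumptions.

```python
# Illustrative sketch only: the names and the channel order below are assumptions,
# not the API of expressyeaself/encode_sequences.py.
NUCLEOTIDES = "ACGT"  # assumed channel order


def one_hot_encode(sequence):
    """Return one 4-element row per base; unknown bases (e.g. 'N') become all zeros."""
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    encoded = []
    for base in sequence.upper():
        row = [0] * len(NUCLEOTIDES)
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    # Print each base next to its encoded row.
    for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
        print(base, row)
```

Each sequence becomes a (length x 4) matrix, which is the kind of input a 1-dimensional convolutional or recurrent layer can scan for motifs.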


Future Work

  • We are currently developing data mining tools to identify and extract so-called 'magic motifs' from our raw data.
  • These are short nucleotide sequences found within the complete promoter sequences that produce the highest expression levels.
  • Identifying and extracting these will allow us to recommend which motifs a promoter sequence should contain to give a high expression level of the gene being promoted.
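
Since these tools are still in development, the following is only a rough sketch of one way such motif mining could work, and not our implementation. It compares how often each k-mer appears in the highest-expression sequences with how often it appears overall. The function names, the k-mer length, and the `top_fraction` and `min_count` parameters are all hypothetical.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count how many sequences contain each k-mer (at most once per sequence)."""
    counts = Counter()
    for seq in sequences:
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    return counts


def enriched_kmers(records, k=6, top_fraction=0.1, min_count=2):
    """Rank k-mers by enrichment in the highest-expression sequences.

    records: list of (sequence, expression_level) pairs (hypothetical input format).
    Returns (kmer, enrichment_ratio) pairs, most enriched first.
    """
    ranked = sorted(records, key=lambda r: r[1], reverse=True)
    n_top = max(1, int(len(ranked) * top_fraction))
    top = [seq for seq, _ in ranked[:n_top]]
    everything = [seq for seq, _ in ranked]

    top_counts = count_kmers(top, k)
    all_counts = count_kmers(everything, k)

    results = []
    for kmer, n in top_counts.items():
        if n < min_count:
            continue  # skip k-mers seen too rarely to be meaningful
        top_freq = n / len(top)
        all_freq = all_counts[kmer] / len(everything)
        results.append((kmer, top_freq / all_freq))
    return sorted(results, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    toy = [("AAATATAAAG", 9.0), ("CCTATAAACC", 8.5),
           ("GGGGCCCCGG", 1.0), ("ACGTACGTAC", 0.5)]
    print(enriched_kmers(toy, k=6, top_fraction=0.5))
```

A ratio above 1 means the k-mer is over-represented among the high-expression sequences. On real data, a statistical test and correction for multiple comparisons would be needed before calling anything a motif.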

Configuration

Prerequisites

  • Python 3.6.7 or later
  • Conda version 4.6.8 or later
  • Git (to clone the repository from GitHub)
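
If you would like to check these requirements before installing, a small script along the following lines may help. It is a sketch rather than part of the package: the helper names are made up, and it assumes that `conda --version` prints something like `conda 4.6.8` and that the `git` and `conda` executables are on your PATH.

```python
import re
import shutil
import subprocess
import sys

MIN_PYTHON = (3, 6, 7)
MIN_CONDA = (4, 6, 8)


def parse_version(text):
    """Extract up to three numeric components, e.g. 'conda 4.6.8' -> (4, 6, 8)."""
    return tuple(int(n) for n in re.findall(r"\d+", text)[:3])


def check_prerequisites():
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if sys.version_info[:3] < MIN_PYTHON:
        problems.append("Python 3.6.7 or later is required")
    if shutil.which("git") is None:
        problems.append("git was not found on PATH")
    conda = shutil.which("conda")
    if conda is None:
        problems.append("conda was not found on PATH")
    else:
        # Capture stdout and stderr together, since the version may be on either.
        result = subprocess.run([conda, "--version"], stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT, universal_newlines=True)
        if parse_version(result.stdout) < MIN_CONDA:
            problems.append("Conda 4.6.8 or later is required")
    return problems
```

Calling `check_prerequisites()` returns the list of problems, which you can print or inspect. Versions are compared as tuples of integers so that, for example, 4.10.3 is correctly treated as newer than 4.6.8.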

Installation

Execute the following commands in your computer's terminal application to install our package:

  1. Clone the ExpressYeaself repository:

    git clone https://github.com/yeastpro/ExpressYeaself.git

  2. Navigate into the repository: cd ExpressYeaself

  3. Install our virtual environment: conda env create -f environment.yml

  4. Enter the virtual environment: conda activate yeast

  5. Download the raw data: chmod +x download_data.sh && ./download_data.sh

Getting started

Now that you have installed our package and downloaded the raw data, you are ready to start using our features! You can use our interactive notebooks to take you through the process:

  • Navigate into the directory containing our interactive guides:
    cd expressyeaself/interaction/
  • To start processing the data, use Jupyter to open our first interactive notebook:
    jupyter notebook 1_how_to_process_raw_data.ipynb &
  • Follow the instructions in the notebook, choose your parameters, and process the data.
  • When you're done, save and exit the notebook.
  • You can then start to encode your data and train your model:
    jupyter notebook 2_how_to_train_model.ipynb &

Directory Structure

ExpressYeaself (master)  
|---doc  
    |---technology_reviews
        |--1_sequencing_software_packages.md
        |--2_neural_network_packages.md
    |--timeline.md
    |--use_cases.md
|---example  
    |---Abf1TATA_data
        |--Abf1TATA_scaffold.txt
    |---native_data
        |--native_data.txt
    |---pTpA_data
        |--pTpA_scaffold.txt
    |---processed_data
        |--10000_from_20190610100252461788_homogeneous_deflanked_sequences_inserted_into_Abf1TATA_scaffold_with_exp_levels.txt.gz
        |--10000_from_20190611170757656183_homogeneous_deflanked_sequences_with_exp_levels.txt.gz
        |--10000_from_20190612130111781831_percentiles_els_binarized_homogeneous_deflanked_sequences_with_exp_levels.txt.gz
    |--__init__.py
    |--series_matrix_GSE104878-GPL17143.txt
|---expressyeaself  
    |---interaction
        |--1_how_to_process_raw_data.ipynb
        |--context.py
    |---models
        |---1d_cnn
            |---saved_models
                |--1d_cnn_classifier_onehot.hdf5
                |--1d_cnn_parallel_onehot.hdf5
                |--1d_cnn_sequential_onehot.hdf5
            |--1D_CNN_builder.ipynb
            |--context.py
            |--native_sample.txt
        |---1d_loccon
            |--1d_locally_connected.ipynb
            |--context.py
            |--loc_con_1d.py
        |---lstm
            |--context.py
            |--lstm_model_function.py
        |---prediction_results
            |--__init__.py
    |---tests
        |--__init__.py
        |--context.py
        |--test_build_promoter.py
        |--test_encode_sequences.py
        |--test_organize_data.py
        |--test_process_data.py
        |--test_utilities.py
    |--__init__.py
    |--build_promoter.py
    |--construct_neural_net.py
    |--encode_sequences.py
    |--organize_data.py
    |--process_data.py
    |--utilities.py
    |--version.py  
|--.coveragerc
|--.gitignore  
|--.travis.yml
|--LICENSE  
|--README.md 
|--download_data.sh
|--environment.yml
|--requirements.txt
|--runtests.sh 

Contributions

Any contributions to the project are warmly welcomed! If you discover any bugs, please report them in the issues section of this repository and we'll work to sort them out as soon as possible. If you have data that you think will be good to train our model on, please contact one of the authors.


References

[1] Carl G. de Boer et al., Deciphering cis-regulatory logic with 100 million synthetic promoters, doi: http://dx.doi.org/10.1101/224907, 2017.


License

ExpressYeaself is licensed under the MIT license.


Troubleshooting

  • Module not found errors:
    • Make sure you're in our virtual environment!
    • Re-enter it with: conda activate yeast
  • Permission denied errors when running shell scripts from terminal:
    • You need to grant yourself access to execute the scripts.
    • Do so with: chmod +x <filename>.sh
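
If you are unsure whether the virtual environment is active, you can also check from Python. The sketch below relies on the `CONDA_DEFAULT_ENV` environment variable, which conda sets when an environment is activated. The function names are hypothetical, and the environment name `yeast` is taken from the installation steps.

```python
import os


def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None if none is active."""
    environ = os.environ if environ is None else environ
    return environ.get("CONDA_DEFAULT_ENV")


def check_environment(expected="yeast", environ=None):
    """Return a short message saying whether the expected environment is active."""
    env = active_conda_env(environ)
    if env == expected:
        return f"OK: running inside the '{expected}' environment."
    return f"Active conda environment is {env!r}; run: conda activate {expected}"


if __name__ == "__main__":
    print(check_environment())
```

The optional `environ` argument exists only so the check can be exercised with a plain dictionary instead of the real process environment.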
