protmlp's Introduction

protMLP

Predict from a protein sequence the growth temperature of the species of origin.

See: Sauer & Wang. Using machine learning to predict organismal growth temperatures from protein primary sequences (2019) https://doi.org/10.1101/677328

Installation and Requirements

This has been developed and tested on Ubuntu 18.04 LTS. The scripts should work on any system supporting Python 3, so long as the external programs are installed properly.

  1. Download these scripts. This is easiest using git to clone the repository.
git clone https://github.com/DavidBSauer/protMLP
  2. Install the requirements. These scripts depend upon Python 3 with the following Python packages also installed: numpy, scipy, matplotlib, biopython, pandas, tqdm, pydot, graphviz, and tensorflow.

To install everything on Ubuntu (or another system that uses the apt package manager), go into the downloaded directory and run the pre-made bash script.

cd protMLP
./Ubuntu_setup.bash

Step 1 - Retrieve the species of origin for each protein sequence.

Take in an MSA file and assign the species of each sequence from UniProt (from a local copy or via the web). Removes fragments and gap-including sequences, then generates training, testing, and validation MSAs.

Run locally as:

python3 step1.py -fa fasta_sequence_file.fa -ld local_copy_of_Uniprot.dat

Run over web as:

python3 step1.py -fa fasta_sequence_file.fa -w
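
The filtering and splitting described above can be illustrated with a minimal sketch. This is not the code in step1.py: the function names, the gap characters, and the 80/10/10 split fractions are all assumptions made for illustration, and the actual criteria used by the script may differ.

```python
import random

def gap_fraction(aligned_seq, gap_chars="-."):
    """Fraction of alignment columns in which this sequence has a gap."""
    if not aligned_seq:
        return 1.0  # treat an empty sequence as all gaps
    return sum(ch in gap_chars for ch in aligned_seq) / len(aligned_seq)

def split_records(records, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle records and split them into train, test, and validation lists.

    The fractions are hypothetical defaults, not values taken from step1.py.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(len(shuffled) * fractions[0])
    n_test = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # remainder goes to validation
    return train, test, validation
```

Records here are simple (identifier, aligned sequence) tuples. A gap filter could be applied before splitting, for example by keeping only records whose gap_fraction is below a chosen cutoff.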

Step 2 - Remove protein sequences outside of a provided growth temperature range.

Take in a species-Tg file (species and their growth temperatures, Tg) and MSA files. Assign a Tg to every sequence based on its species of origin, then remove sequences outside the provided Tg range.

python3 step2.py -sq MSA_file.fa -t all_merged_12_10_2012.txt -r all
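
As a rough sketch of the idea behind this step, the following assigns a Tg to each sequence by species name and keeps only those inside a range. It assumes a tab-separated "species, Tg" file layout and case-insensitive species matching. Both are assumptions for illustration, as are the function names, and none of this is taken from step2.py.

```python
def parse_species_tg(lines):
    """Parse 'species<TAB>Tg' lines into a dict (assumed file layout)."""
    species_tg = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # blank or malformed line
        try:
            species_tg[parts[0].strip().lower()] = float(parts[1])
        except ValueError:
            continue  # non-numeric Tg, e.g. a header row
    return species_tg

def filter_by_tg(records, species_tg, tg_min=float("-inf"), tg_max=float("inf")):
    """Keep records whose species has a known Tg within [tg_min, tg_max].

    records: iterable of (seq_id, species, sequence) tuples.
    Returns a list of (seq_id, sequence, tg) tuples.
    """
    kept = []
    for seq_id, species, sequence in records:
        tg = species_tg.get(species.strip().lower())
        if tg is not None and tg_min <= tg <= tg_max:
            kept.append((seq_id, sequence, tg))
    return kept
```

With the default bounds, every sequence whose species has a known Tg is retained. Sequences from species missing in the table are dropped.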

Step 3 - One-hot encode the protein sequences and train MLPs

One-hot encode the protein sequences, then calculate a linear regression and train MLPs. Optionally remove amino acids which are not correlated with Tg and/or balance the training data.

python3 step3.py -tr training_file.fa -te testing_file.fa -vd validation_file.fa
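
For readers unfamiliar with one-hot encoding of aligned sequences, here is a minimal pure-Python sketch. The 21-symbol alphabet (20 amino acids plus a gap) and the choice to leave unknown symbols as all zeros are assumptions for illustration. The encoding in step3.py may differ, for example by only including residues observed in the training MSA.

```python
# Assumed alphabet: the 20 standard amino acids plus the gap character.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"

def one_hot(aligned_seq, alphabet=ALPHABET):
    """Flatten an aligned sequence into a 0/1 vector.

    Each alignment position occupies len(alphabet) consecutive slots,
    with a single 1 marking the residue found at that position.
    """
    index = {aa: i for i, aa in enumerate(alphabet)}
    vector = [0] * (len(aligned_seq) * len(alphabet))
    for pos, aa in enumerate(aligned_seq.upper()):
        col = index.get(aa)
        if col is not None:  # unknown symbols (e.g. X) stay all-zero
            vector[pos * len(alphabet) + col] = 1
    return vector
```

Because every sequence in an MSA has the same length, all encoded vectors have the same size and can be stacked into a single input matrix for a regression or MLP.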

Predicting Tg from sequences

Predict the Tg of a provided set of sequences in FASTA format. Note: to get meaningful results the sequences must be aligned to the training MSA.

python3 predictor.py -sq sequences.fa -t NN_AA_template.txt -m model.h5

Predict Tg of point mutants to a provided sequence

Given a protein sequence, predict the Tg of every point mutant, substituting in each of the amino acids observed at each position of the training MSA.

Note: this program can also predict compound (double, triple, etc.) mutants. However, mutational space increases exponentially with the number of mutations, and therefore requires exponentially more CPU time and memory to calculate. If the program crashes, try decreasing the batch size.

python3 point_mutant_screening.py -sq sequences.fa -t NN_AA_template.txt -m model.h5 -n 1
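
To give a sense of how quickly the mutational space grows, the sketch below counts and enumerates mutants. It is an illustration only, not the logic of point_mutant_screening.py. The function names are hypothetical, and count_mutants assumes the same number of alternative residues at every position, which will not hold for a real MSA.

```python
from itertools import combinations, product
from math import comb

def count_mutants(n_positions, n_alternatives, n_mutations):
    """Number of variants with exactly n_mutations substitutions,
    assuming n_alternatives possible replacements at every position."""
    return comb(n_positions, n_mutations) * n_alternatives ** n_mutations

def enumerate_mutants(sequence, allowed, n_mutations=1):
    """Yield every variant of sequence with exactly n_mutations substitutions.

    allowed[i] is the collection of residues permitted at position i,
    e.g. those observed in that column of the training MSA.
    """
    for positions in combinations(range(len(sequence)), n_mutations):
        # Exclude the wild-type residue so each chosen position really changes.
        choices = [[aa for aa in allowed[p] if aa != sequence[p]] for p in positions]
        for substitutions in product(*choices):
            mutant = list(sequence)
            for p, aa in zip(positions, substitutions):
                mutant[p] = aa
            yield "".join(mutant)
```

For a 300-residue protein with 19 alternatives per position, count_mutants gives 5,700 single mutants but 16,190,850 double mutants, which is why compound screens are far more expensive.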

protmlp's Issues

all_merged_12_10_2012.txt in step2.py

I was wondering whether this file (all_merged_12_10_2012.txt) and species_Tg.txt in the repository are interchangeable?

Also, would supplying species_oxygen_mode.txt or species_pH.txt as the input to step2.py be enough for oxygen and pH prediction?

Thank you
