MiniFold

Abstract

Introduction: The Protein Folding Problem (predicting a protein structure from its sequence) is an interesting one since DNA sequence data available is becoming cheaper and cheaper at an unprecedented rate, even faster than Moore's law 1. Recent research has applied Deep Learning techniques in order to accurately predict the structure of polypeptides [2, 3].
Methods: In this work, we present an attempt to imitate the AlphaFold system for protein prediction architecture [3]. We use 1-D Residual Networks (ResNets) to predict dihedral torsion angles and 2-D ResNets to predict distance maps between the protein amino-acids[4]. We use the CASP7 ProteinNet dataset section for training and evaluation of the model [5]. An open-source implementation of the system described can be found here.
Results: We are able to obtain distance maps and torsion angle predictions for a protein given it's sequence and PSSM. Our angle prediction model scores a 0.39 of MAE (Mean Absolute Error), and 0.39 and 0.43 R^2 coefficients for Phi and Psi respectively, whereas SoTA is around 0.69 (Phi) and 0.73 (Psi). Our methods do not include post-processing of Deep Learning outputs, which can be very noisy.
Conclusion: We have shown the potential of Deep Learning methods and its possible application to solve the Protein Folding Problem. Despite technical limitations, Neural Networks are able to capture relations between the data. Although our visually pleasant results, our system lacks components such as the protein structure prediction from both dihedral torsion angles and the distance map of a given protein and the post-processing of our predictions in order to reduce noise.

Citation

@misc{ericalcaide2019
  title = {MiniFold: a DeepLearning-based Mini Protein Folding Engine},
  publisher = {GitHub},
  journal = {GitHub repository},
  author = {Alcaide, Eric},
  year = {2019},
  howpublished = {\url{https://github.com/EricAlcaide/MiniFold/}},
  doi = {10.5281/zenodo.3774491},
  url = {https://doi.org/10.5281/zenodo.3774491}
}

Introduction

DeepMind, a company affiliated with Google and specialized in AI, presented a novel algorithm for Protein Structure Prediction at CASP13 (a competition which goal is to find the best algorithms that predict protein structures in different categories).

The Protein Folding Problem is an interesting one since there's tons of DNA sequence data available and it's becoming cheaper and cheaper at an unprecedented rate (faster than Moore's law). The cells build the proteins they need through transcription (from DNA to RNA) and translation (from RNA to Aminocids (AAs)). However, the function of a protein does not depend solely on the sequence of AAs that form it, but also their spatial 3D folding. Thus, it's hard to predict the function of a protein from its DNA sequence. AI can help solve this problem by learning the relations that exist between a determined sequence and its spatial 3D folding.

The DeepMind work presented @ CASP was not a technological breakthrough (they did not invent any new type of AI) but an engineering one: they applied well-known AI algorithms to a problem along with lots of data and computing power and found a great solution through model design, feature engineering, model ensembling and so on. DeepMind has no plan to open source the code of their model nor set up a prediction server.

Based on the premise exposed before, the aim of this project is to build a model suitable for protein 3D structure prediction inspired by AlphaFold and many other AI solutions that may appear and achieve SOTA results.

Methods

Proposed Architecture

The methods implemented are inspired by DeepMind's original post. Two different residual neural networks (ResNets) are used to predict angles between adjacent aminoacids (AAs) and distance between every pair of AAs of a protein. For distance prediction a 2D Resnet was used while for angles prediction a 1D Resnet was used.

Image from DeepMind's original blogpost.

Implementation_details can be found here together with a detailed explanation. A sample result of our distance predictor:

Ground truth (left) and predicted distances (right) by MiniFold.

Ground truth (left) and predicted distances (right) by AlphaFold.

Reproducing the results

Here are the following steps in order to run the code locally or in the cloud:

Clone the repo: git clone https://github.com/EricAlcaide/MiniFold
Install dependencies: pip install -r requirements.txt
Get & format the data
1. Download data here (select CASP7 text-based format)
2. Extract/Decompress the data in any directory
3. Create the /data folder inside the MiniFold directory and copy the training_30, training_70 and training90 files to it. Change extensions to .txt.
Execute data preprocessing notebooks (preprocessing folder) in the following order (we plan to release simple scripts instead of notebooks very soon):
1. get_proteins_under_200aa.jl *source_path* *destin_path*: - selects proteins under 200 residues from the source_path file (alternatively can be declared in the script itself) - (you will need the Julia programming language v1.0 in order to run it)
  1. Alternatively: julia_get_proteins_under_200aa.ipynb (you will need Julia as well as iJulia)
2. get_angles_from_coords_py.ipynb - calculates dihedral angles from raw coordinates
3. angle_data_preparation_py.ipynb
Run the models!
1. For angles prediction: models/predicting_angles.ipynb
2. For distance prediction:
  1. models/distance_pipeline/pretrain_model_pssm_l_x_l.ipynb
  2. models/distance_pipeline/pipeline_caller.py

If you encounter any errors during installation, don't hesitate and open an issue.

Discussion

Future

There is plenty of ideas that could not be tried in this project due to computational and time constraints. In a brief way, some promising ideas or future directions are listed below:

Train with crops of 64x64 AAs, not windows of 200x200 AAs and average at prediction time.
Use data from Multiple Sequence Alignments (MSA) such as paired changes bewteen AAs.
Use distance map as potential input for angle prediction or vice versa.
Train with more data
Use predictions as constraints to a Protein Structure Prediction pipeline (CNS, Rosetta Solve or others).
Set up a prediction script/pipeline from raw text/FASTA file

Limitations

This project has been developed mainly during 3 weeks by 1 person and, therefore, many limitations have appeared. They will be listed below in order to give a sense about what this project is and what it's not.

No usage of Multiple Sequence Alignments (MSA): The methods developed in this project don't use MSA nor MSA-based features as input.
Computing power/memory: Development of the project has taken part in a computer with the following specifications: Intel i7-6700k, 8gb RAM, NVIDIA GTX-1060Ti 6gb and 256gb of storage. The capacity for data exploration, processing, training and evaluating the models is limited.
GPU/TPUs for training: The models were trained and evaluated on a single GPU. No cloud servers were used.
Time: Three weeks of development during spare time.
Domain expertise: No experts in the field of genomics, proteomics or bioinformatics. The author knows the basics of Biochemistry and Deep Learning.
Data: The average paper about Protein Structure Prediction uses a personalized dataset acquired from the Protein Data Bank (PDB). No such dataset was used. Instead, we used a subset of the ProteinNet dataset from CASP7. Our models are trained with just 150 proteins (distance prediction) and 600 proteins (angles prediction) due to memory constraints.

Due to these limitations and/or constraints, the precission/accuracy the methods here developed can achieve is limited when compared against State Of The Art algorithms.

References

Contribute

Hey there! New ideas are welcome: open/close issues, fork the repo and share your code with a Pull Request. Clone this project to your computer:

git clone https://github.com/EricAlcaide/MiniFold

By participating in this project, you agree to abide by the thoughtbot code of conduct

sailfish009 / minifold Goto Github PK

minifold's Introduction

MiniFold

Abstract

Citation

Introduction

Methods

Proposed Architecture

Reproducing the results

Discussion

Future

Limitations

References

Contribute

Meta

minifold's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent