AI-Bind

AI-Bind is a deep-learning pipeline that provides interpretable binding predictions for never-before-seen proteins and ligands. AI-Bind is capable of rapidly screening large chemical libraries and guiding computationally expensive auto-docking simulations by prioritizing protein-ligand pairs for validation. The pipeline requires only simple chemical features as input, such as the amino-acid sequence of a protein and the isomeric SMILES of a ligand, which helps to overcome limitations associated with the lack of available 3D protein structures.

Preprint available at: https://arxiv.org/abs/2112.13168

Why AI-Bind?

Shortcomings of Existing ML Models in Predicting Protein-Ligand Binding

Our interest in predicting binding for never-before-seen proteins and ligands led us to split the test performance of existing machine learning models (e.g., DeepPurpose) into three components:

(a) Transductive test: when both proteins and ligands from the test dataset are present in the training data,

(b) Semi-inductive test: when only the ligands from the test dataset are present in the training data, and

(c) Inductive test: when both proteins and ligands from the test dataset are absent from the training data.
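The three cases above can be sketched as a small classifier over test pairs. This is only an illustration of the definitions, not code from the AI-Bind repository; the function name `classify_test_pairs` and the node identifiers are hypothetical.

```python
def classify_test_pairs(train_pairs, test_pairs):
    """Label each (ligand, protein) test pair by which of its nodes appear in training."""
    train_ligands = {ligand for ligand, _ in train_pairs}
    train_proteins = {protein for _, protein in train_pairs}
    labels = {}
    for ligand, protein in test_pairs:
        ligand_seen = ligand in train_ligands
        protein_seen = protein in train_proteins
        if ligand_seen and protein_seen:
            labels[(ligand, protein)] = "transductive"
        elif ligand_seen:
            labels[(ligand, protein)] = "semi-inductive"  # seen ligand, unseen protein
        elif not protein_seen:
            labels[(ligand, protein)] = "inductive"  # both nodes unseen
        else:
            # Unseen ligand with a seen protein: not one of the three cases listed above.
            labels[(ligand, protein)] = "other"
    return labels


# Hypothetical identifiers, for illustration only.
train = [("L1", "P1"), ("L2", "P2")]
test = [("L1", "P2"), ("L2", "P9"), ("L7", "P8")]
labels = classify_test_pairs(train, test)
```

Note that a transductive pair such as ("L1", "P2") is itself absent from training; only its two nodes have been seen.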

We find that only inductive test performance is a dependable metric for evaluating how well a machine learning model has learned binding from the structural features of proteins and ligands. Most models mainly report transductive test performance, which amounts to predicting unseen links in the protein-ligand interaction network used in training. We explore how ML models achieve transductive performance comparable to that of much simpler algorithms (namely, network configuration models), which completely ignore the molecular structures and use only degree information to make binding predictions.
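To make the idea of a degree-only predictor concrete, here is a deliberately simplified stand-in. It is not the Duplex Configuration Model used in this repository; it only scores a pair from the fraction of positive annotations each node has in the training data, and all names and data are hypothetical.

```python
from collections import defaultdict


def degree_ratios(labeled_pairs):
    """Per node, the fraction of its training annotations that are positive (binding)."""
    positives = defaultdict(int)
    totals = defaultdict(int)
    for ligand, protein, binds in labeled_pairs:
        for node in (("ligand", ligand), ("protein", protein)):
            totals[node] += 1
            positives[node] += int(binds)
    return {node: positives[node] / totals[node] for node in totals}


def degree_only_score(ratios, ligand, protein, default=0.5):
    """Average the two node ratios; unseen nodes fall back to an uninformative default."""
    ligand_ratio = ratios.get(("ligand", ligand), default)
    protein_ratio = ratios.get(("protein", protein), default)
    return (ligand_ratio + protein_ratio) / 2


# Hypothetical training annotations: (ligand, protein, binds).
train = [("L1", "P1", 1), ("L1", "P2", 1), ("L2", "P1", 0), ("L2", "P2", 0)]
ratios = degree_ratios(train)
seen_score = degree_only_score(ratios, "L1", "P1")
unseen_score = degree_only_score(ratios, "L9", "P9")
```

A predictor of this kind has nothing to go on for nodes it has never seen, which is why its apparent success is confined to the transductive setting.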

What does AI-Bind offer?

The AI-Bind pipeline maximizes inductive test performance by including network-derived negatives in the training data and introducing unsupervised pre-training for the molecular embeddings. The pipeline is validated via three different neural architectures: VecNet, VAENet, and the Siamese model. The best-performing architecture in AI-Bind is VecNet, which uses Mol2vec and ProtVec to embed ligands and proteins, respectively. These embeddings are fed into a decoder (a multi-layer perceptron), which predicts the binding probability.
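The data flow described above (two embeddings, concatenated, passed through a multi-layer perceptron ending in a probability) can be sketched in plain Python. Everything specific here is an assumption made for illustration: the embedding sizes (300 for the ligand, 100 for the protein), the single hidden layer of 32 units, and the random untrained weights. The actual VecNet architecture is defined in AIBind.py.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def random_layer(n_in, n_out, rng):
    """Untrained placeholder weights, for shape illustration only."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict_binding(ligand_vec, protein_vec, layers):
    """Concatenate both embeddings and decode them into a probability."""
    hidden = ligand_vec + protein_vec
    for weights, biases in layers[:-1]:
        hidden = relu(dense(hidden, weights, biases))
    weights, biases = layers[-1]
    return sigmoid(dense(hidden, weights, biases)[0])


rng = random.Random(0)
ligand_vec = [rng.gauss(0.0, 1.0) for _ in range(300)]   # stands in for a Mol2vec embedding
protein_vec = [rng.gauss(0.0, 1.0) for _ in range(100)]  # stands in for a ProtVec embedding
layers = [random_layer(400, 32, rng), random_layer(32, 1, rng)]
probability = predict_binding(ligand_vec, protein_vec, layers)
```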

Interpretability of AI-Bind and Identifying Active Binding Sites

We mutate certain building blocks (amino acid trigrams) of the protein structure to recognize the regions influencing the binding predictions the most and identify them as the potential binding sites. Below, we validate the AI-Bind predicted active binding sites on the human protein TRIM59 by visualising the results of the auto-docking simulations and mapping the predicted sites to the amino acid residues where the ligands bind. AI-Bind predicted binding sites can guide the users in creating an optimal grid for the auto-docking simulations, further reducing simulation time.
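A minimal sketch of the trigram-mutation scan follows, assuming a generic `predict(sequence)` callable that returns a binding probability. The replacement trigram ("AAA"), the toy predictor, and the example sequence are all hypothetical; the mutation scheme actually used is in VecNet-Protein-Trigrams-Study-GitHub.ipynb.

```python
def binding_profile(sequence, predict, replacement="AAA"):
    """For each trigram start position, record how much the prediction drops when mutated."""
    baseline = predict(sequence)
    profile = []
    for start in range(len(sequence) - 2):
        mutated = sequence[:start] + replacement + sequence[start + 3:]
        profile.append((start, baseline - predict(mutated)))
    return profile


def toy_predict(sequence):
    """Toy stand-in for a trained model: binding depends on one motif being present."""
    return 0.9 if "HWK" in sequence else 0.2


profile = binding_profile("MGSHWKLLT", toy_predict)
# Positions whose mutation changes the prediction are candidate binding-site regions.
influential_starts = [start for start, drop in profile if drop > 0.5]
```

In this toy case every trigram that overlaps the motif is flagged, while the flanking trigrams are not.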

Setting up AI-Bind and Predicting Protein-Ligand Binding (Guidelines for end users)

Hardware set-up for AI-Bind

We trained and tested all our models on a Google Cloud Platform server with an Intel Broadwell CPU and NVIDIA Tesla T4 GPU(s). AI-Bind uses Python 3.6.6 and CUDA 9.0.

Using Docker

Please use this docker for running AI-Bind: https://hub.docker.com/r/omairs/foodome2

Using requirements file

All Python modules and corresponding versions required for AI-Bind are listed here: requirements.txt

Use pip install -r requirements.txt to install the related packages.

rdkit version used in AI-Bind: '2017.09.1' (For installation, check the documentation here: https://www.rdkit.org/docs/Install.html, command: conda install -c rdkit rdkit)

Make sure the VecNet-User-Frontend.ipynb notebook and the three files in the AIBind folder (AIBind.py, __init__.py and import_modules.py) are in the same folder.

Download and save the data files under /data. Download link: https://zenodo.org/record/7226641

Alternative Installation using Docker

  1. Download the Docker file named "Predictions.dockerfile".
  2. In your terminal, move to the directory containing the Docker file and run: docker build -t aibindpred -f ./Predictions.dockerfile ./
  3. To run the image as a container: docker run -it --gpus all --name aibindpredcontainer -p 8888:8888 aibindpred
     You may clone the git repository inside the container, or attach your local volume while running the container: docker run -it --gpus all --name aibindpredcontainer -p 8888:8888 -v ./local_directory:/home aibindpred
  4. To execute additional shells inside the container, run: docker exec -it aibindpredcontainer /bin/bash
  5. To run a Jupyter notebook instance inside the container, run: jupyter notebook --ip=0.0.0.0 --port=8888 --allow-root

The steps above will install all necessary packages and create the environment to run binding predictions using AI-Bind.

Running predictions from the frontend

  1. Organize your data file in a dataframe format with the columns 'InChiKey', 'SMILE' and 'target_aa_code'. Save this dataframe in a .csv file.
  2. Run the notebook titled VecNet-User-Frontend.ipynb to make the binding predictions. Predicted binding probabilities will be available under the column header 'Averaged Predictions'.
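As a sketch of step 1, the input file can be written with the three required column names using only the standard library. The helper name `write_input_csv` is hypothetical, the ligand shown is aspirin, and the protein sequence is a made-up placeholder rather than a real target.

```python
import csv
import io

REQUIRED_COLUMNS = ["InChiKey", "SMILE", "target_aa_code"]


def write_input_csv(rows, handle):
    """Write ligand-protein rows with the column names the frontend notebook expects."""
    writer = csv.DictWriter(handle, fieldnames=REQUIRED_COLUMNS)
    writer.writeheader()
    for row in rows:
        missing = [column for column in REQUIRED_COLUMNS if not row.get(column)]
        if missing:
            raise ValueError(f"row is missing values for: {missing}")
        writer.writerow(row)


rows = [
    {
        "InChiKey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",  # aspirin
        "SMILE": "CC(=O)OC1=CC=CC=C1C(=O)O",
        "target_aa_code": "MKTAYIAKQR",  # placeholder sequence, not a real protein
    }
]

buffer = io.StringIO()  # replace with open("my_pairs.csv", "w", newline="") to write a file
write_input_csv(rows, buffer)
csv_text = buffer.getvalue()
```

The resulting file can then be loaded in the notebook, for example with pandas.read_csv.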

Code and Data

Data Files

All data files are available here: https://zenodo.org/record/7226641

  1. /data/sars-busters-consolidated/Database files: Contains protein-ligand binding data derived from DrugBank, BindingDB and DTC (Drug Target Commons).
  2. /data/sars-busters-consolidated/chemicals: Contains ligands used in training and testing of AI-Bind with embeddings.
  3. /data/sars-busters-consolidated/GitData/DeepPurpose and Configuration Model: Train-test data related to 5-fold cross-validation of Transformer-CNN (DeepPurpose) and the Duplex Configuration Model.
  4. /data/sars-busters-consolidated/GitData/interactions: Contains the network-derived negatives dataset used in training the AI-Bind neural networks.
  5. /data/sars-busters-consolidated/GitData: Contains trained VecNet model, binding predictions on viral and human proteins associated with COVID-19, and a summary of the results from the auto-docking simulations.
  6. /data/sars-busters-consolidated/master_files: Contains the absolute negative (non-binding) protein-ligand pairs used in testing of AI-Bind.
  7. /data/sars-busters-consolidated/targets: Contains the proteins used in training and testing of AI-Bind with associated embeddings.
  8. /data/sars-busters-consolidated/interactions: Contains the positive (binding) protein-ligand pairs derived from DrugBank, NCFD (Natural Compounds in Food Database), BindingDB and DTC.
  9. /data/sars-busters-consolidated/Auto Docking: Contains all files and results from the validation of AI-Bind on COVID-19 related viral and human proteins.
  10. /data/sars-busters-consolidated/Binding Probability Profile Validation: Contains the files visualizing the active binding sites from auto-docking simulations.
  11. /data/sars-busters/Mol2vec: Pre-trained Mol2vec and ProtVec models are available here.
  12. /data/sars-busters-consolidated/s4pred: Includes the code and files for predicting the secondary structure of TRIM59.

Code

Here we describe the Jupyter Notebooks, Python Modules and MATLAB scripts used in AI-Bind.

AIBind

  1. AIBind.py: Contains the Python class for AI-Bind. Includes all the neural architectures: VecNet, VAENet and Siamese Model.
  2. import_modules.py: Contains all the necessary Python modules to run AI-Bind.

Configuration-Model-5-fold

  1. Configuration Model - Cross-Validation.ipynb: Computes the 5-fold cross-validation performance of the Duplex Configuration Model on BindingDB data used in DeepPurpose.
  2. configuration_bipartite.m: Contains the MATLAB implementation of the Duplex Configuration Model.
  3. runscriptposneg.m: Runs the Duplex Configuration Model using the degree sequences of the ligands and the proteins. Output files summat10.csv and summat01.csv are used in calculating the performance of the configuration model.

DeepPurpose-5-fold

  1. Deep Purpose - Final DataSet - Unseen Targets.ipynb: We execute a 5-fold cross-validation over unseen targets (Semi-Inductive Test) on DeepPurpose using the network-derived negatives.
  2. Deep Purpose - Final DataSet - Unseen Nodes.ipynb: We execute a 5-fold cross-validation over unseen nodes (Inductive Test) on DeepPurpose using the network-derived negatives.
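One way to build an unseen-nodes (inductive) fold is to partition the nodes, not the edges, so that no test ligand or test protein appears in training. The sketch below is an assumption about how such a split can be done, not a copy of the notebook logic; the function names and identifiers are hypothetical. Pairs that mix a held-out node with a training node are dropped from both sets.

```python
import random


def assign_folds(nodes, k, seed=0):
    """Shuffle nodes reproducibly and deal them into k folds."""
    shuffled = sorted(nodes)
    random.Random(seed).shuffle(shuffled)
    return {node: index % k for index, node in enumerate(shuffled)}


def inductive_fold(pairs, ligand_fold, protein_fold, fold):
    """Test pairs have both nodes held out; train pairs have neither node held out."""
    train, test = [], []
    for ligand, protein in pairs:
        ligand_held_out = ligand_fold[ligand] == fold
        protein_held_out = protein_fold[protein] == fold
        if ligand_held_out and protein_held_out:
            test.append((ligand, protein))
        elif not ligand_held_out and not protein_held_out:
            train.append((ligand, protein))
        # Mixed pairs are discarded so that neither set leaks nodes into the other.
    return train, test


ligands = [f"L{i}" for i in range(6)]
proteins = [f"P{i}" for i in range(6)]
pairs = [(ligand, protein) for ligand in ligands for protein in proteins]
ligand_fold = assign_folds(ligands, k=3, seed=1)
protein_fold = assign_folds(proteins, k=3, seed=2)
train, test = inductive_fold(pairs, ligand_fold, protein_fold, fold=0)
```

For a semi-inductive (unseen targets) fold, the same idea applies with only the proteins partitioned.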

MolTrans

  1. example_inductive_AI_Bind_data.py: We run the inductive test on MolTrans using the network-derived negative samples that are used in training AI-Bind.
  2. example_inductive_BindingDB.py: We run the inductive test on MolTrans using the BindingDB data that is used in the MolTrans paper.
  3. example_semi_inductive.py: This script can be used to run semi-inductive tests on MolTrans.
  4. example_transductive.py: This script can be used to run transductive tests on MolTrans.

DeepPurpose-and-Confuguration-Model

  1. DeepPurpose Rerun - Transformer CNN.ipynb: We train-test DeepPurpose using the benchmark BindingDB data. Multiple experiments on DeepPurpose have been carried out here, which includes randomly shuffling the chemical structures and degree analysis of DeepPurpose performance.
  2. Configuration Models on DeepPurpose data.ipynb: We explore the performance of the Duplex Configuration Model on the BindingDB dataset used in DeepPurpose.

EigenSpokes

  1. Eigen Spokes Analysis.ipynb: We run the EigenSpokes analysis here on the combined adjacency matrix (square adjacency matrix with ligands and targets in both rows and columns).
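The combined adjacency matrix mentioned above can be sketched as follows: ligands and targets share one index, so the bipartite interactions fill the two off-diagonal blocks of a symmetric square matrix. The function name and the pairs are hypothetical, and the spectral analysis itself is not shown.

```python
def combined_adjacency(pairs):
    """Build a square, symmetric 0/1 matrix indexed by ligands first, then targets."""
    ligands = sorted({ligand for ligand, _ in pairs})
    targets = sorted({target for _, target in pairs})
    nodes = [("ligand", name) for name in ligands] + [("target", name) for name in targets]
    index = {node: position for position, node in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0] * size for _ in range(size)]
    for ligand, target in pairs:
        i = index[("ligand", ligand)]
        j = index[("target", target)]
        matrix[i][j] = 1
        matrix[j][i] = 1  # mirror the entry so the matrix stays symmetric
    return nodes, matrix


pairs = [("L1", "P1"), ("L1", "P2"), ("L2", "P2")]
nodes, matrix = combined_adjacency(pairs)
```

Because the network is bipartite, the ligand-ligand and target-target blocks stay zero.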

Emergence-of-shortcuts

  1. Without_and_with_constant_fluctuations_p_bind=0.16.ipynb: Creates and runs the configuration model on the toy unipartite network based on the protein sample in BindingDB. Here we explore two scenarios related to the association between degree and dissociation constant - without any fluctuation and constant fluctuations over the dissociation constant values.
  2. With_varying_fluctuations.ipynb: Creates and runs the configuration model on the toy unipartite network based on the protein sample in BindingDB, where the fluctuations over the dissociation constant values follow similar trends as in the BindingDB data.

Engineered-Features

  1. Underdstanding Engineered Features.ipynb: We explore the explainability of the engineered features (simple features representing the ligand and protein molecules).
  2. VecNet Engineered Features - Mol2vec and Protvec Important Dimensions.ipynb: Identifies the most important dimensions in Mol2vec and ProtVec embeddings, in terms of protein-ligand binding.
  3. VecNet Engineered Features Concat Original Features.ipynb: Explores the performance of VecNet after concatenating the original protein and ligand embeddings.
  4. VecNet Engineered Features.ipynb: Replaces Mol2vec and ProtVec embeddings with simple engineered features in VecNet architecture and explores its performance.

Identifying-active-binding-sites

  1. VecNet-Protein-Trigrams-Study-GitHub.ipynb: We mutate the amino acid trigrams on the protein and observe the fluctuations in VecNet predictions. This process helps us identify the potential active binding sites on the amino acid sequence.

Random Input Tests

  1. VecNet-Unseen_Nodes-RANDOM.ipynb: Runs VecNet on unseen nodes (Inductive Test) where the ligand and the protein embeddings are replaced by Gaussian random inputs.
  2. VecNet-Unseen_Nodes-T-RANDOM-Only.ipynb: Runs VecNet on unseen nodes (Inductive Test) where the protein embeddings are replaced by Gaussian random inputs.
  3. VecNet-Unseen_Targets-RANDOM.ipynb: Runs VecNet on unseen targets (Semi-Inductive Test) where the ligand and the protein embeddings are replaced by Gaussian random inputs.
  4. VecNet-Unseen_Targets-T-RANDOM-Only.ipynb: Runs VecNet on unseen targets (Semi-Inductive Test) where the protein embeddings are replaced by Gaussian random inputs.
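The random-input control can be sketched as swapping each embedding for a Gaussian vector of the same length, so that any remaining predictive power must come from something other than molecular structure. The mean of 0 and standard deviation of 1 are assumptions, since the parameters are not stated here, and the embeddings shown are placeholders.

```python
import random


def randomize_embeddings(embeddings, seed=0):
    """Replace every embedding with Gaussian noise of the same dimensionality."""
    rng = random.Random(seed)
    return {
        key: [rng.gauss(0.0, 1.0) for _ in vector]
        for key, vector in embeddings.items()
    }


# Placeholder embeddings keyed by node identifier.
embeddings = {"L1": [0.1, 0.2, 0.3], "P1": [0.4, 0.5]}
randomized = randomize_embeddings(embeddings)
```

Passing only the protein embeddings through this function corresponds to the "T-RANDOM-Only" variants, and passing both ligand and protein embeddings corresponds to the "RANDOM" variants.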

Siamese

  1. Siamese_Unseen_Nodes.ipynb: We create the network-derived negatives dataset and execute a 5-fold cross-validation on unseen nodes (Inductive test) here.
  2. Siamese_Unseen_Targets.ipynb: We execute a 5-fold cross-validation on unseen targets (Semi-Inductive test) here.

VAENet

  1. VAENet-Unseen_Nodes.ipynb: We create the network-derived negatives and execute a 5-fold cross-validation on unseen nodes (Inductive test) here.
  2. VAENet-Unseen_Targets.ipynb: We execute a 5-fold cross-validation on unseen targets (Semi-Inductive test) here.

Validation

  1. SARS-CoV-2 Predictions Analysis VecNet.ipynb: Auto-docking validation of top and bottom 100 predictions made by VecNet on SARS-CoV-2 viral proteins and human proteins associated with COVID-19.
  2. Binding_Probability_Profile_Golden_Standar_Validation.py: Validation of the AI-Bind derived binding locations with gold standard protein binding data.
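Selecting the top and bottom predictions for docking validation amounts to ranking pairs by predicted probability. A small sketch with hypothetical data and a hypothetical helper name, using n=2 in place of the 100 used above:

```python
def top_and_bottom(predictions, n=100):
    """Return the n highest- and n lowest-scoring (ligand, protein, probability) tuples."""
    ranked = sorted(predictions, key=lambda item: item[2], reverse=True)
    return ranked[:n], ranked[-n:]


# Hypothetical predictions, for illustration only.
predictions = [
    ("L1", "P1", 0.91),
    ("L2", "P1", 0.12),
    ("L3", "P1", 0.55),
    ("L4", "P1", 0.03),
    ("L5", "P1", 0.78),
]
top, bottom = top_and_bottom(predictions, n=2)
```

If fewer than 2n predictions are available, the two lists overlap, so n should be chosen accordingly.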

VecNet

  1. VecNet-Unseen_Nodes.ipynb: We create the network-derived negatives, execute a 5-fold cross-validation on unseen nodes (Inductive test), and make predictions on SARS-CoV-2 viral proteins and human proteins associated with COVID-19.
  2. VecNet-Unseen_Targets.ipynb: We execute a 5-fold cross-validation on unseen targets (Semi-Inductive test) here.

External Resources

  1. Learn auto-docking using Autodock Vina: https://www.youtube.com/watch?v=BLbXkhqbebs
  2. Learn to visualize active binding sites using PyMOL: https://www.youtube.com/watch?v=mBlMI82JRfI

Cite AI-Bind

If you find AI-Bind useful in your research, please add the following citation:

@article{Chatterjee2023,
  doi = {10.1038/s41467-023-37572-z},
  url = {https://doi.org/10.1038/s41467-023-37572-z},
  year = {2023},
  month = apr,
  publisher = {Springer Science and Business Media {LLC}},
  volume = {14},
  number = {1},
  author = {Ayan Chatterjee and Robin Walters and Zohair Shafi and Omair Shafi Ahmed and Michael Sebek and Deisy Gysi and Rose Yu and Tina Eliassi-Rad and Albert-L{\'{a}}szl{\'{o}} Barab{\'{a}}si and Giulia Menichetti},
  title = {Improving the generalizability of protein-ligand binding predictions with {AI}-Bind},
  journal = {Nature Communications}
}

ai-bind's Issues

Trouble loading model and running prediction

Hi,

First of all, thanks for all your work. I was able to get everything up and running; however, I am running into an issue when trying to predict. I looked at the Jupyter Notebook file and there are a couple of lines that throw errors in the Loading Pre-Trained Vec and Prediction section:

vecnet_object = AIBind.AIBind(
    interactions_location='/root/data/Network_Derived_Negatives.csv',
    interactions=None,
    interaction_y_name='Y',

    absolute_negatives_location=None,
    absolute_negatives=None,

    drugs_location=None,
    drugs_dataframe=drugs,
    drug_inchi_name='InChiKey',
    drug_smile_name='SMILE',

    targets_location=None,
    targets_dataframe=targets,
    target_seq_name='target_aa_code',

    mol2vec_location=None,
    mol2vec_model=None,

    protvec_location=None,
    protvec_model=None,

    nodes_test=targets_test,
    nodes_validation=targets_validation,

    edges_test=edges_test,
    edges_validation=edges_validation,

    model_out_dir='/root/data/',

    debug=False)

This runs okay. However, I am confused by the next line, where you seem to be trying to allocate a pickle file to the previously created class instance?

with open('/root/data/VecNet_unseen_nodes.pickle', 'rb') as file: vecnet_object = pkl.load(file)

Is this correct? Because this gives me the following error:

Traceback (most recent call last):
File "", line 2, in
ModuleNotFoundError: No module named 'AIBind.AIBind'; 'AIBind' is not a package

Also, I cannot follow the prediction example.
You read in the test.csv with the following line:

nodes_df = pd.read_csv('Test.csv')

However, nodes_df is then not used in the example:

unseen_nodes_example_5fold_average = vecnet_object.get_fold_averaged_prediction_results(
    model_name=None,
    version_number=None,
    model_paths=['/root/data/VecNet_unseen_nodes.pickle'],
    optimal_validation_model=None,
    test_sets=[targets_test[1].dropna()],
    get_drug_embed=True,
    pred_drug_embeddings=nodes_df['InChiKey'],
    get_target_embed=True,
    pred_target_embeddings=nodes_df['target_aa_code'],
    drug_filter_list=[],
    target_filter_list=[],
    return_dataframes=True)

Can you please elaborate how to load the csv-data into the prediction?

Thanks!

Michael

Pre-Trained Models for Multiple Target Classes

Hello,

I am impressed with the project and its scope. I am currently exploring options for leveraging pre-trained models, specifically ones that might have been trained on the entire BindingDB dataset, or similar. My goal is to make predictions for several target classes without the need for pretraining, similar to the functionality provided by DeepPurpose.

From what I have read, it seems that the existing models are predominantly focused on SARS-CoV-2. Could you please provide some insights or confirm whether there are other models available that cover a broader range of targets?

Thank you for your assistance.

Regards,

Marawan

pip installing requirements file leads to error

Hello guys,

I tried pip installing the requirements file, but I'm getting an error because the specified version of bazel seems to not exist:

(base) pwengert@Peters-Macbook-Pro AI-Bind-main % pip install -r requirements.txt
...
ERROR: Could not find a version that satisfies the requirement bazel==0.0.0.20200723 (from versions: none)
ERROR: No matching distribution found for bazel==0.0.0.20200723

Do you know where I could get this file?

Best,
Peter

Difficulty running notebook

Hello again,

I hope you are doing well.

I've created a csv file with the InChiKey, SMILE, and target_aa_code columns and I'm trying to run the VecNet-User-Frontend.ipynb notebook, but I'm coming across a bunch of issues.

There doesn't appear to be any /root/data/ directory, but loads of files in the script are supposedly stored there.
For example, under the pre-trained vecnet section, I get:

FileNotFoundError Traceback (most recent call last)
Cell In[15], line 1
----> 1 with open('/root/data/VecNet_unseen_nodes.pickle', 'rb') as file:
2 vecnet_object = pkl.load(file)

When I try to ignore that first section and get right to the "Prediction" portion of the notebook I end up getting the error "NameError: name 'vecnet_object' is not defined", presumably because I couldn't run the pre-trained vecnet section.
Is there a place where I can find the root/data directory and download all its contents?

As an aside, I'm new to using jupyter notebooks, so I'm not sure which parts of the notebook I have to run in order to get things to work. I obviously run the imports, but I'm not sure if I have to run things like the GPU settings subheader. This subheader throws an error because I'm using a macbook pro, which doesn't use nvidia. Therefore, nvidia-smi isn't a valid command on my machine. I skipped over this part because I just want to use the pre-trained model. Is this valid?

Best,
Peter
