
kexinhuang12345 / deeppurpose

912 stars · 30 watchers · 263 forks · 14.8 MB

A Deep Learning Toolkit for DTI, Drug Property, PPI, DDI, Protein Function Prediction (Bioinformatics)

Home Page: https://doi.org/10.1093/bioinformatics/btaa1005

License: BSD 3-Clause "New" or "Revised" License

Languages: Jupyter Notebook 97.21%, Python 2.79%
drug-repurposing deep-learning drug-target-interactions toolkit covid19 virtual-screening drug-discovery ppi ddi dti-prediction

deeppurpose's Introduction


A Deep Learning Library for Compound and Protein Modeling
DTI, Drug Property, PPI, DDI, Protein Function Prediction

Applications in Drug Repurposing, Virtual Screening, QSAR, Side Effect Prediction and More



This repository hosts DeepPurpose, a PyTorch-based deep learning toolkit for molecular modeling and prediction, covering drug-target interaction (DTI) prediction, compound property prediction, protein-protein interaction (PPI) prediction, and protein function prediction. We focus on DTI and its applications in drug repurposing and virtual screening, but support various other molecular encoding tasks. Most workflows need only a few lines of code, making deep learning accessible for life science research.

News!

  • [05/21] 0.1.2 supports 5 new graph neural network based models for compound encoding (DGL_GCN, DGL_NeuralFP, DGL_GIN_AttrMasking, DGL_GIN_ContextPred, DGL_AttentiveFP), implemented using DGL-LifeSci! An example is provided here, and a minimal sketch follows this list.
  • [12/20] DeepPurpose is now supported by the TDC data loader, which contains a large collection of ML-for-therapeutics datasets, including many drug property and DTI datasets. Here is a tutorial!
  • [12/20] DeepPurpose can now be installed via pip!
  • [11/20] DeepPurpose is published in Bioinformatics!
  • [11/20] Added 5 more pretrained models on BindingDB IC50 units (around 1 million data points).
  • [10/20] Google Colab installation instructions are provided here. Thanks to @hima111997!
  • [10/20] Using DeepPurpose, we made a human-in-the-loop molecular design web UI, check it out! [Website, paper]
  • [09/20] DeepPurpose now supports three more tasks: DDI, PPI, and protein function prediction! Simply call from DeepPurpose import DDI/PPI/ProteinPred to use them; check out the examples below!
  • [07/20] A simple web UI for DTI prediction can be created in under 10 lines using Gradio! A demo is provided here.
  • [07/20] A blog post is up on the Towards Data Science Medium column, check it out!
  • [07/20] Two tutorials are online, walking through DeepPurpose's framework for drug-target interaction prediction and drug property prediction (DTI, Drug Property).
  • [05/20] Support for drug property prediction on screening data that has no target protein, such as bacteria! An example using RDKit2D with a DNN, trained and repurposed for Pseudomonas aeruginosa (MIT AI Cures' open task), is provided as a demo.
  • [05/20] Now supports hyperparameter tuning via Bayesian optimization through the Ax platform! A demo is provided here.
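A minimal sketch of the graph encoders mentioned above (assuming the DGL/dgllife dependencies are installed; X_drugs and y stand in for any SMILES array and label array):

from DeepPurpose import CompoundPred as models
from DeepPurpose.utils import data_process, generate_config

# Any of the five DGL encoders can be swapped in by name:
# 'DGL_GCN', 'DGL_NeuralFP', 'DGL_GIN_AttrMasking', 'DGL_GIN_ContextPred', 'DGL_AttentiveFP'
drug_encoding = 'DGL_GCN'

train, val, test = data_process(X_drug = X_drugs, y = y,
                                drug_encoding = drug_encoding,
                                split_method = 'random')

config = generate_config(drug_encoding = drug_encoding, train_epoch = 3)
model = models.model_initialize(**config)
model.train(train, val, test)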

Features

  • 15+ powerful encodings for drugs and proteins, ranging from deep neural networks on classic cheminformatics fingerprints, CNNs, and transformers to message-passing graph neural networks, with 50+ combined models! Most encoding combinations have not appeared in existing works. All of this in under 10 lines, but with lots of flexibility! Switching encodings is as simple as changing the encoding name (see the sketch after this list)!

  • Realistic and user-friendly design:

    • supports DTI, DDI, PPI, molecular property prediction, and protein function prediction!
    • automatically identifies whether to run a drug-target binding affinity (regression) or drug-target interaction (binary classification) task.
    • supports cold-target and cold-drug settings for robust model evaluation, plus a single-target high-throughput screening assay data setup.
    • many dataset loading/downloading/unzipping scripts to ease tedious preprocessing, including antiviral drugs, COVID-19 targets, BindingDB, DAVIS, KIBA, ...
    • many pretrained checkpoints.
    • easy monitoring of the training process, with detailed training metrics output such as test-set figures (AUCs) and tables; early stopping is also supported.
    • detailed output records, such as a ranked list of repurposing results.
    • various evaluation metrics: ROC-AUC, PR-AUC, and F1 for binary tasks; MSE, R-squared, and Concordance Index for regression tasks.
    • label unit conversion for skewed label distributions such as Kd.
    • time estimates for computationally expensive encodings.
    • PyTorch-based; supports CPU, single-GPU, and multi-GPU training.
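To make the encoding-switch point concrete, here is a sketch of the only lines that change between two runs (X_drug, X_target, and y are placeholders for your data arrays):

from DeepPurpose import DTI as models
from DeepPurpose.utils import data_process, generate_config

# Run 1: CNN on SMILES + CNN on protein sequence
drug_encoding, target_encoding = 'CNN', 'CNN'
# Run 2: message-passing GNN + amino acid composition, just by renaming
# drug_encoding, target_encoding = 'MPNN', 'AAC'

# Everything downstream stays identical:
train, val, test = data_process(X_drug, X_target, y,
                                drug_encoding, target_encoding,
                                split_method = 'random')
model = models.model_initialize(**generate_config(drug_encoding, target_encoding))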

NOTE: We are actively looking for constructive advice, user feedback, and experience reports on using DeepPurpose! Please open an issue or contact us.

Cite Us

If you find this package useful, please cite our paper:

@article{huang2020deeppurpose,
  title={DeepPurpose: A Deep Learning Library for Drug-Target Interaction Prediction},
  author={Huang, Kexin and Fu, Tianfan and Glass, Lucas M and Zitnik, Marinka and Xiao, Cao and Sun, Jimeng},
  journal={Bioinformatics},
  year={2020}
}

Installation

Try it on Binder! Binder is a cloud Jupyter Notebook interface that installs our environment dependencies for you.

Binder

Video tutorial to install Binder.

We recommend installing locally, since a Binder session must be rebuilt on every launch. For a local setup, we recommend installing from pip:

pip

conda create -n DeepPurpose python=3.6
conda activate DeepPurpose
conda install -c conda-forge notebook
pip install git+https://github.com/bp-kelley/descriptastorus 
pip install DeepPurpose
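To sanity-check the install (an optional step, not part of the official instructions), the main modules should import cleanly:

python -c "from DeepPurpose import utils, dataset, DTI; print('DeepPurpose imports OK')"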

Build from Source

First time:

git clone https://github.com/kexinhuang12345/DeepPurpose.git ## Download code repository
cd DeepPurpose ## Change directory to DeepPurpose
conda env create -f environment.yml  ## Build virtual environment with all packages installed using conda
conda activate DeepPurpose ## Activate conda environment (use "source activate DeepPurpose" for anaconda 4.4 or earlier) 
jupyter notebook ## open the jupyter notebook with the conda env

## run our code, e.g. click a file in the DEMO folder
... ...

conda deactivate ## when done, exit conda environment 

In the future:

cd DeepPurpose ## Change directory to DeepPurpose
conda activate DeepPurpose ## Activate conda environment
jupyter notebook ## open the jupyter notebook with the conda env

## run our code, e.g. click a file in the DEMO folder
... ...

conda deactivate ## when done, exit conda environment 

Video tutorial to install locally from source.

Example

Case Study 1(a): A Framework for Drug-Target Interaction Prediction, in Fewer than 10 Lines of Code.

In addition to DTI prediction, we also provide repurpose and virtual_screening functions to rapidly generate predictions.

Click here for the code!
from DeepPurpose import DTI as models
from DeepPurpose.utils import *
from DeepPurpose.dataset import *

import os
SAVE_PATH = './saved_path'
if not os.path.exists(SAVE_PATH):
    os.makedirs(SAVE_PATH)


# Load data: an array of SMILES for drugs, an array of amino acid sequences for targets,
# and an array of binding values / 0-1 labels.
# e.g. ['Cc1ccc(CNS(=O)(=O)c2ccc(s2)S(N)(=O)=O)cc1', ...], ['MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDIDTH...', ...], [0.46, 0.49, ...]
# In this example, BindingDB with the Kd binding score is used.
X_drug, X_target, y = process_BindingDB(download_BindingDB(SAVE_PATH),
                                        y = 'Kd',
                                        binary = False,
                                        convert_to_log = True)

# Type in the encoding names for drug/protein.
drug_encoding, target_encoding = 'CNN', 'Transformer'

# Data processing, here we select cold protein split setup.
train, val, test = data_process(X_drug, X_target, y, 
                                drug_encoding, target_encoding, 
                                split_method='cold_protein', 
                                frac=[0.7,0.1,0.2])

# Generate new model using default parameters; also allow model tuning via input parameters.
config = generate_config(drug_encoding, target_encoding, transformer_n_layer_target = 8)
net = models.model_initialize(**config)

# Train the new model.
# Detailed output, including a tidy table of validation loss and metrics plus AUC curve figures, is stored in the ./result folder.
net.train(train, val, test)

# Or simply load a pretrained model from a model directory path or a reproduced model name such as DeepDTA.
net = models.model_pretrained(MODEL_PATH_DIR or MODEL_NAME)

# Repurpose using the trained model or pre-trained model
# In this example, loading repurposing dataset using Broad Repurposing Hub and SARS-CoV 3CL Protease Target.
X_repurpose, drug_name, drug_cid = load_broad_repurposing_hub(SAVE_PATH)
target, target_name = load_SARS_CoV_Protease_3CL()

_ = models.repurpose(X_repurpose, target, net, drug_name, target_name)

# Virtual screening using the trained model or pre-trained model 
X_repurpose, drug_name, target, target_name = ['CCCCCCCOc1cccc(c1)C([O-])=O', ...], ['16007391', ...], ['MLARRKPVLPALTINPTIAEGPSPTSEGASEANLVDLQKKLEEL...', ...], ['P36896', 'P00374']

_ = models.virtual_screening(X_repurpose, target, net, drug_name, target_name)

Case Study 1(b): A Framework for Drug Property Prediction, in Fewer than 10 Lines of Code.

Many datasets come as high-throughput screening data, which contain only drugs and their activity scores. This can be formulated as a drug property prediction task. We also provide a repurpose function to predict over a large space of drugs.

Click here for the code!
from DeepPurpose import CompoundPred as models
from DeepPurpose.utils import *
from DeepPurpose.dataset import *


import os
SAVE_PATH = './saved_path'
if not os.path.exists(SAVE_PATH):
    os.makedirs(SAVE_PATH)


# load AID1706 Assay Data
X_drugs, _, y = load_AID1706_SARS_CoV_3CL()

drug_encoding = 'rdkit_2d_normalized'
train, val, test = data_process(X_drug = X_drugs, y = y,
                                drug_encoding = drug_encoding,
                                split_method = 'random',
                                random_seed = 1)

config = generate_config(drug_encoding = drug_encoding, 
                         cls_hidden_dims = [512], 
                         train_epoch = 20, 
                         LR = 0.001, 
                         batch_size = 128,
                        )
model = models.model_initialize(**config)
model.train(train, val, test)

X_repurpose, drug_name, drug_cid = load_broad_repurposing_hub(SAVE_PATH)

_ = models.repurpose(X_repurpose, model, drug_name)

Case Study 1(c): A Framework for Drug-Drug Interaction Prediction, in Fewer than 10 Lines of Code.

DDI is very important for drug safety profiling and the success of clinical trials. This framework predicts interactions based on the chemical structures of drug pairs.

Click here for the code!
from DeepPurpose import DDI as models
from DeepPurpose.utils import *
from DeepPurpose.dataset import *

# load DB Binary Data
X_drugs, X_drugs_, y = read_file_training_dataset_drug_drug_pairs("toy_data/ddi.txt")

drug_encoding = 'rdkit_2d_normalized'
train, val, test = data_process(X_drug = X_drugs, X_drug_ = X_drugs_, y = y,
                                drug_encoding = drug_encoding,
                                split_method = 'random',
                                random_seed = 1)

config = generate_config(drug_encoding = drug_encoding, 
                         cls_hidden_dims = [512], 
                         train_epoch = 20, 
                         LR = 0.001, 
                         batch_size = 128,
                        )

model = models.model_initialize(**config)
model.train(train, val, test)

Case Study 1(d): A Framework for Protein-Protein Interaction Prediction, in Fewer than 10 Lines of Code.

PPI is important for studying the relations among targets.

Click here for the code!
from DeepPurpose import PPI as models
from DeepPurpose.utils import *
from DeepPurpose.dataset import *

# load DB Binary Data
X_targets, X_targets_, y = read_file_training_dataset_protein_protein_pairs("toy_data/ppi.txt")

target_encoding = 'CNN'
train, val, test = data_process(X_target = X_targets, X_target_ = X_targets_, y = y,
                                target_encoding = target_encoding,
                                split_method = 'random',
                                random_seed = 1)

config = generate_config(target_encoding = target_encoding, 
                         cls_hidden_dims = [512], 
                         train_epoch = 20, 
                         LR = 0.001, 
                         batch_size = 128,
                        )

model = models.model_initialize(**config)
model.train(train, val, test)

Case Study 1(e): A Framework for Protein Function Prediction, in Fewer than 10 Lines of Code.

Protein function prediction covers various useful labels such as GO terms and structural classifications. It is also useful for screening biologic drugs.

Click here for the code!
from DeepPurpose import ProteinPred as models
from DeepPurpose.utils import *
from DeepPurpose.dataset import *

# load DB Binary Data
X_targets, y = read_file_protein_function()

target_encoding = 'CNN'
train, val, test = data_process(X_target = X_targets, y = y,
                                target_encoding = target_encoding,
                                split_method = 'random',
                                random_seed = 1)

config = generate_config(target_encoding = target_encoding, 
                         cls_hidden_dims = [512], 
                         train_epoch = 20, 
                         LR = 0.001, 
                         batch_size = 128,
                        )

model = models.model_initialize(**config)
model.train(train, val, test)

Case Study 2(a): Antiviral Drug Repurposing for SARS-CoV2 3CLPro, Using One Line.

Given a new target sequence (e.g., SARS-CoV2 3CL Protease), retrieve a list of repurposing candidates from a curated library of 81 antiviral drugs. The binding score is the predicted Kd value, aggregated from five models pretrained on BindingDB! (Caution: this is currently for educational purposes only. The pretrained DTI models cover a small dataset and thus cannot generalize to every new unseen protein. For the best results, train your own model with customized data.)

Click here for the code!
from DeepPurpose import oneliner
from DeepPurpose.dataset import *
oneliner.repurpose(*load_SARS_CoV2_Protease_3CL(), *load_antiviral_drugs(no_cid = True))
----output----
Drug Repurposing Result for SARS-CoV2 3CL Protease
+------+----------------------+------------------------+---------------+
| Rank |      Drug Name       |      Target Name       | Binding Score |
+------+----------------------+------------------------+---------------+
|  1   |      Sofosbuvir      | SARS-CoV2 3CL Protease |     190.25    |
|  2   |     Daclatasvir      | SARS-CoV2 3CL Protease |     214.58    |
|  3   |      Vicriviroc      | SARS-CoV2 3CL Protease |     315.70    |
|  4   |      Simeprevir      | SARS-CoV2 3CL Protease |     396.53    |
|  5   |      Etravirine      | SARS-CoV2 3CL Protease |     409.34    |
|  6   |      Amantadine      | SARS-CoV2 3CL Protease |     419.76    |
|  7   |      Letermovir      | SARS-CoV2 3CL Protease |     460.28    |
|  8   |     Rilpivirine      | SARS-CoV2 3CL Protease |     470.79    |
|  9   |      Darunavir       | SARS-CoV2 3CL Protease |     472.24    |
|  10  |      Lopinavir       | SARS-CoV2 3CL Protease |     473.01    |
|  11  |      Maraviroc       | SARS-CoV2 3CL Protease |     474.86    |
|  12  |    Fosamprenavir     | SARS-CoV2 3CL Protease |     487.45    |
|  13  |      Ritonavir       | SARS-CoV2 3CL Protease |     492.19    |
....

Case Study 2(b): Repurposing Using Customized Training Data, with One Line.

Given a new target sequence (e.g., SARS-CoV 3CL Pro), train on new data (the AID1706 bioassay), and then retrieve a list of repurposing candidates from a proprietary library (e.g., antiviral drugs). The model can be trained from scratch or finetuned from a pretrained checkpoint!

Click here for the code!
from DeepPurpose import oneliner
from DeepPurpose.dataset import *

oneliner.repurpose(*load_SARS_CoV_Protease_3CL(), *load_antiviral_drugs(no_cid = True),  *load_AID1706_SARS_CoV_3CL(), \
		split='HTS', convert_y = False, frac=[0.8,0.1,0.1], pretrained = False, agg = 'max_effect')
----output----
Drug Repurposing Result for SARS-CoV 3CL Protease
+------+----------------------+-----------------------+-------------+-------------+
| Rank |      Drug Name       |      Target Name      | Interaction | Probability |
+------+----------------------+-----------------------+-------------+-------------+
|  1   |      Remdesivir      | SARS-CoV 3CL Protease |     YES     |     0.99    |
|  2   |      Efavirenz       | SARS-CoV 3CL Protease |     YES     |     0.98    |
|  3   |      Vicriviroc      | SARS-CoV 3CL Protease |     YES     |     0.98    |
|  4   |      Tipranavir      | SARS-CoV 3CL Protease |     YES     |     0.96    |
|  5   |     Methisazone      | SARS-CoV 3CL Protease |     YES     |     0.94    |
|  6   |      Letermovir      | SARS-CoV 3CL Protease |     YES     |     0.88    |
|  7   |     Idoxuridine      | SARS-CoV 3CL Protease |     YES     |     0.77    |
|  8   |       Loviride       | SARS-CoV 3CL Protease |     YES     |     0.76    |
|  9   |      Baloxavir       | SARS-CoV 3CL Protease |     YES     |     0.74    |
|  10  |     Ibacitabine      | SARS-CoV 3CL Protease |     YES     |     0.70    |
|  11  |     Taribavirin      | SARS-CoV 3CL Protease |     YES     |     0.65    |
|  12  |      Indinavir       | SARS-CoV 3CL Protease |     YES     |     0.62    |
|  13  |   Podophyllotoxin    | SARS-CoV 3CL Protease |     YES     |     0.60    |
....

Demos

Check out 10+ demos & tutorials to get started:

  • Dataset Tutorial: how to use the dataset loaders and read customized data
  • Drug Repurposing for 3CLPro: one-liner repurposing for 3CLPro
  • Drug Repurposing with Customized Data: one-liner repurposing with AID1706 bioassay data, training from scratch
  • Virtual Screening for BindingDB IC50: one-liner virtual screening
  • Reproduce DeepDTA: reproduce DeepDTA on the DAVIS dataset and show how to use the 10-lines framework
  • Virtual Screening for DAVIS and Correlation Plot: one-liner virtual screening, evaluated on an unseen dataset via a correlation plot
  • Binary Classification for DAVIS using CNNs: binary classification on the DAVIS dataset with CNN encodings, using the 10-lines framework
  • Pretraining Model Tutorial: how to load pretrained models

and more in the DEMO folder!

Contact

Please contact [email protected] or [email protected] for help or submit an issue.

Encodings

Currently, we support the following encodings:

Drug encodings:

  • Morgan: Extended-Connectivity Fingerprints (ECFP)
  • Pubchem: PubChem substructure-based fingerprints
  • Daylight: Daylight-type fingerprints
  • rdkit_2d_normalized: normalized Descriptastorus 2D descriptors
  • ESPF: Explainable Substructure Partition Fingerprint
  • ErG: 2D pharmacophore descriptions for scaffold hopping
  • CNN: Convolutional Neural Network on SMILES
  • CNN_RNN: a GRU/LSTM on top of a CNN on SMILES
  • Transformer: Transformer encoder on ESPF
  • MPNN: message-passing neural network
  • DGL_GCN: Graph Convolutional Network
  • DGL_NeuralFP: Neural Fingerprint
  • DGL_GIN_AttrMasking: pretrained GIN with Attribute Masking
  • DGL_GIN_ContextPred: pretrained GIN with Context Prediction
  • DGL_AttentiveFP: AttentiveFP (Xiong et al., 2020)

Target encodings:

  • AAC: amino acid composition up to 3-mers
  • PseudoAAC: pseudo amino acid composition
  • Conjoint_triad: conjoint triad features
  • Quasi-seq: quasi-sequence order descriptor
  • ESPF: Explainable Substructure Partition Fingerprint
  • CNN: Convolutional Neural Network on the target sequence
  • CNN_RNN: a GRU/LSTM on top of a CNN on the target sequence
  • Transformer: Transformer encoder on ESPF

Data

DeepPurpose supports the following dataset loaders for now and more will be added:

Public Drug-Target Binding Benchmark Dataset

  • BindingDB: download_BindingDB() to download the data and process_BindingDB() to process it
  • DAVIS: load_process_DAVIS() to download and process the data
  • KIBA: load_process_KIBA() to download and process the data
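For instance, loading DAVIS looks like this (a sketch following the loader signature used in the tutorials; the path and flag values are illustrative defaults):

from DeepPurpose.dataset import load_process_DAVIS

# Downloads on first call; by default converts Kd to the log scale
X_drugs, X_targets, y = load_process_DAVIS(path = './data', binary = False)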

Repurposing Dataset

  • Curated Antiviral Drugs Library: load_antiviral_drugs() to load and process the data
  • Broad Repurposing Hub: load_broad_repurposing_hub() to download and process the data

Bioassay Data for COVID-19 (Thanks to MIT AI Cures)

  • AID1706: load_AID1706_SARS_CoV_3CL() to load and process the data

COVID-19 Targets

  • SARS-CoV 3CL Protease: load_SARS_CoV_Protease_3CL()
  • SARS-CoV2 3CL Protease: load_SARS_CoV2_Protease_3CL()
  • SARS-CoV2 RNA Polymerase: load_SARS_CoV2_RNA_polymerase()
  • SARS-CoV2 Helicase: load_SARS_CoV2_Helicase()
  • SARS-CoV2 3-to-5 Exonuclease: load_SARS_CoV2_3to5_exonuclease()
  • SARS-CoV2 endoRNAse: load_SARS_CoV2_endoRNAse()
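Each target loader returns the amino acid sequence and a target name, ready to be passed to repurpose, following the pattern in Case Study 1(a):

from DeepPurpose.dataset import load_SARS_CoV2_Helicase

target, target_name = load_SARS_CoV2_Helicase()
print(target_name, '-', len(target), 'residues')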

DeepPurpose also supports reading from a user's txt file. It assumes the following data formats.

Click here for the format expected!

For drug target pairs:

Drug1_SMILES Target1_Seq Score/Label
Drug2_SMILES Target2_Seq Score/Label
....

Then, use

from DeepPurpose import dataset
X_drug, X_target, y = dataset.read_file_training_dataset_drug_target_pairs(PATH)

For bioassay training data:

Target_Seq
Drug1_SMILES Score/Label
Drug2_SMILES Score/Label
....

Then, use

from DeepPurpose import dataset
X_drug, X_target, y = dataset.read_file_training_dataset_bioassay(PATH)

For drug property prediction training data:

Drug1_SMILES Score/Label
Drug2_SMILES Score/Label
....

Then, use

from DeepPurpose import dataset
X_drug, y = dataset.read_file_compound_property(PATH)

For protein function prediction training data:

Target1_Seq Score/Label
Target2_Seq Score/Label
....

Then, use

from DeepPurpose import dataset
X_target, y = dataset.read_file_protein_function(PATH)

For drug-drug pairs:

Drug1_SMILES Drug1_SMILES_ Score/Label
Drug2_SMILES Drug2_SMILES_ Score/Label
....

Then, use

from DeepPurpose import dataset
X_drug, X_drug_, y = dataset.read_file_training_dataset_drug_drug_pairs(PATH)

For protein-protein pairs:

Target1_Seq Target1_Seq_ Score/Label
Target2_Seq Target2_Seq_ Score/Label
....

Then, use

from DeepPurpose import dataset
X_target, X_target_, y = dataset.read_file_training_dataset_protein_protein_pairs(PATH)

For drug repurposing library:

Drug1_Name Drug1_SMILES 
Drug2_Name Drug2_SMILES
....

Then, use

from DeepPurpose import dataset
X_drug, X_drug_names = dataset.read_file_repurposing_library(PATH)

For target sequence to be repurposed:

Target_Name Target_seq 

Then, use

from DeepPurpose import dataset
Target_seq, Target_name = dataset.read_file_target_sequence(PATH)

For virtual screening library:

Drug1_SMILES Drug1_Name Target1_Seq Target1_Name
Drug2_SMILES Drug2_Name Target2_Seq Target2_Name
....

Then, use

from DeepPurpose import dataset
X_drug, X_target, X_drug_names, X_target_names = dataset.read_file_virtual_screening_drug_target_pairs(PATH)
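To connect these file readers to training, a small end-to-end sketch (the file path is hypothetical; the encodings are arbitrary picks):

from DeepPurpose import utils, dataset

# Hypothetical custom file in the drug-target pair format shown above
X_drug, X_target, y = dataset.read_file_training_dataset_drug_target_pairs('./my_dti_data.txt')

train, val, test = utils.data_process(X_drug, X_target, y,
                                      drug_encoding = 'Morgan', target_encoding = 'AAC',
                                      split_method = 'random', frac = [0.7, 0.1, 0.2])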

Check out the Dataset Tutorial.

Pretrained models

We provide more than 10 pretrained models. Please see the Pretraining Model Tutorial for how to load them. It is as simple as:

from DeepPurpose import DTI as models
net = models.model_pretrained(model = 'MPNN_CNN_DAVIS')
# or, from a local directory:
net = models.model_pretrained(FILE_PATH)

The list of available pretrained models:

A model name consists of the drug encoding, then the target encoding, then the training dataset.

Note that the DTI models for BindingDB and DAVIS are trained on the log scale, but DeepPurpose lets you convert between the log scale (e.g., pIC50) and the original scale via the convert_y argument.
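A sketch of how convert_y pairs with the prediction helpers from the case studies (X_repurpose, target, drug_name, and target_name are placeholders as before; the keyword is an assumption based on the documented API):

from DeepPurpose import DTI as models

net = models.model_pretrained(model = 'MPNN_CNN_DAVIS')
# convert_y = True reports scores on the original (nM) scale;
# convert_y = False keeps them on the log scale the model was trained on.
_ = models.repurpose(X_repurpose, target, net, drug_name, target_name, convert_y = True)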

Click here for the models supported!
Model Name
CNN_CNN_BindingDB_IC50
Morgan_CNN_BindingDB_IC50
Morgan_AAC_BindingDB_IC50
MPNN_CNN_BindingDB_IC50
Daylight_AAC_BindingDB_IC50
CNN_CNN_DAVIS
CNN_CNN_BindingDB
Morgan_CNN_BindingDB
Morgan_CNN_KIBA
Morgan_CNN_DAVIS
MPNN_CNN_BindingDB
MPNN_CNN_KIBA
MPNN_CNN_DAVIS
Transformer_CNN_BindingDB
Daylight_AAC_DAVIS
Daylight_AAC_KIBA
Daylight_AAC_BindingDB
Morgan_AAC_BindingDB
Morgan_AAC_KIBA
Morgan_AAC_DAVIS

Documentations

https://deeppurpose.readthedocs.io is under active development.

Disclaimer

The output list should be inspected manually by experts before proceeding to wet-lab validation. This work is still under active development and has limitations; please do not act on the predicted drugs directly.

deeppurpose's People

Contributors

0ling, alex-golts, chao1224, cyrusmaher, futianfan, gumgo91, haokaixina, hima111997, jeanpaulrsoucy, kexinhuang12345, la1av1a, lucasmglass, markcheung, navanchauhan, printomi, pykao, skviswa


deeppurpose's Issues

pretrained model not found

[screenshot]

I tried to download the models manually using wget on Colab, using the link in the code:

[screenshot]

When I tried to open the link directly, it gave me this:

[screenshot]

How can I do transfer learning based on your pre-trained DTI models

Thank you so much for your great work. If I have only a small amount of data (for example, 100 drug-protein pairs) for a specific problem, I can't train on my data from scratch. I just want to know how I can use your pre-trained models for transfer learning. Many thanks.
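One plausible finetuning pattern using the public API described above (a sketch, not an official answer; X_drug, X_target, and y stand in for the ~100 pairs, and the encodings must match the checkpoint's name):

from DeepPurpose import DTI as models
from DeepPurpose.utils import data_process

# Load a released checkpoint (names from the README's pretrained list)
net = models.model_pretrained(model = 'MPNN_CNN_BindingDB')

# Encode the small dataset with the SAME encodings the checkpoint used
train, val, test = data_process(X_drug, X_target, y,
                                drug_encoding = 'MPNN', target_encoding = 'CNN',
                                split_method = 'random', frac = [0.8, 0.1, 0.1])

# Continue training from the pretrained weights
net.train(train, val, test)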

Where did you save "pretrained models on BindingDB IC50"

When I tried your DEMO "oneliner-3CLpro-finetuning-AID1706.ipynb", I got

FileNotFoundError: [Errno 2] No such file or directory: './save_folder/pretrained_models/DeepPurpose_BindingDB/model_MPNN_CNN/config.pkl'

I couldn't find ./save_folder/. In your readme, you said " [11/20] Added 5 more pretrained models on BindingDB IC50 Units (around 1Million data points)"

Thank you

Missing models.py

Cool package! The models.py file seems to be missing from master, so from DeepPurpose import models doesn't work.

error when loading a pretrained model

The error is AttributeError: 'DBTA' object has no attribute 'lower'.
My code is:

config = utils.generate_config(
    drug_encoding='CNN',
    target_encoding='CNN',
    result_folder='DeepPurpose_model/Human/DeepDTA/d/0',
    **model_settings['DeepDTA']['config']
)

model = models.model_initialize(**config)
model = models.model_pretrained('DeepPurpose_model/Human/DeepDTA/d/0', model)
print(model)

how to generate drug or protein embeddings

Hi, I am working on a related project and trying to use DeepPurpose to generate drug and protein embeddings for other downstream tasks.
I would like to ask: is there a function/method in DeepPurpose to generate representation vectors given a list of drugs or proteins, instead of directly predicting an affinity score?

Question regarding the DAVIS dataset

Hi Kexin,

The DAVIS dataset has 68 drugs, 379 proteins, and 30,056 interactions. That looks weird to me: if there is at most one interaction between each drug and each protein, the maximum number of interactions would be 68 x 379 = 25,772. How can we have more than 25,772 interactions?

Best,
Po-Yu

CNN_Transformer_DAVIS pre-trained model link not present in utils.py

Command:

net = models.model_pretrained(model = 'CNN_Transformer_DAVIS')
net.config
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
<ipython-input-55-0074dc9e707f> in <module>()
----> 1 net = models.model_pretrained(model = 'CNN_Transformer_DAVIS')
      2 net.config

1 frames
/content/DeepPurpose/DeepPurpose/utils.py in download_pretrained_model(model_name, save_dir)
    786 
    787         pretrained_dir = os.path.join(save_dir, 'pretrained_model')
--> 788         pretrained_dir_ = wget.download(url, pretrained_dir)
    789 
    790         print('Downloading finished... Beginning to extract zip file...')

UnboundLocalError: local variable 'url' referenced before assignment

This is because there is no elif branch assigning the download link for CNN_Transformer_DAVIS, even though it is listed in the README's pretrained models section.

Did you fine-tune every model?

Thank you so much for your great repo. In your demos, you always set epochs=100 for training. If we want to use some of the models, do we need to fine-tune the hyperparameters and retrain them?

questions about usage

How can I use a DBTA model I trained myself to predict a new pair, and what is the input format?
Thanks! :)

error in mpnn_feature_collate_func

Please note that I may have been using this function completely wrong (I called it outside of where it's supposed to be called), but I figured I should submit the bug report anyway.

TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
     31 t_start = time()
     32 for epo in range(train_epoch):
---> 33     for i, (v_d, v_p, label) in enumerate(training_generator):
     34         if self.target_encoding == 'Transformer':
     35             v_p = v_p

C:\ProgramData\Anaconda3\envs\multiPurpose\lib\site-packages\torch\utils\data\dataloader.py in __next__(self)
    343
    344     def __next__(self):
--> 345         data = self._next_data()
    346         self._num_yielded += 1
    347         if self._dataset_kind == _DatasetKind.Iterable and \

C:\ProgramData\Anaconda3\envs\multiPurpose\lib\site-packages\torch\utils\data\dataloader.py in _next_data(self)
    383     def _next_data(self):
    384         index = self._next_index()  # may raise StopIteration
--> 385         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    386         if self._pin_memory:
    387             data = _utils.pin_memory.pin_memory(data)

C:\ProgramData\Anaconda3\envs\multiPurpose\lib\site-packages\torch\utils\data\_utils\fetch.py in fetch(self, possibly_batched_index)
     45         else:
     46             data = self.dataset[possibly_batched_index]
---> 47         return self.collate_fn(data)

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\DTI.py in mpnn_collate_func(x)
    219     mpnn_feature = [i[0] for i in x]
    220     #print("len(mpnn_feature)", len(mpnn_feature), "len(mpnn_feature[0])", len(mpnn_feature[0]))
--> 221     mpnn_feature = mpnn_feature_collate_func(mpnn_feature)
    222     from torch.utils.data.dataloader import default_collate
    223     x_remain = [[i[1], i[2]] for i in x]

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\DTI.py in mpnn_feature_collate_func(x)
    212 def mpnn_feature_collate_func(x):
    213     ## first version
--> 214     return [torch.cat([x[j][i] for j in range(len(x))], 0) for i in range(len(x[0]))]
    215
    216 def mpnn_collate_func(x):

TypeError: object of type 'numpy.float64' has no len()

The latest version of BindingDB

Hi,

I think the BindingDB version used in this repo is BindingDB_All_2020m2. Would you mind updating to version BindingDB_All_2020m10? I can make a PR if you think this is a good idea.

Best,
Ken

utils.py exists in two places

There's a version of utils.py in the root directory and a newer version in the DeepPurpose/DeepPurpose directory. Should the one in the root directory be deleted?

error in GetSequenceOrderCouplingNumber

When using the Quasi-seq encoding on the BindingDB dataset, I ran into the following error:

Drug Target Interaction Prediction Mode...
in total: 1073803 drug-target pairs
encoding drug...
unique drugs: 549205
encoding protein...
unique target sequence: 5078

KeyError                                  Traceback (most recent call last)
<ipython-input> in <module>
      1 train, val, test = utils.data_process(X_drugs, X_targets, y,
      2                                       drug_encoding, target_encoding,
----> 3                                       split_method='cold_drug', frac=[0.7,0.1,0.2])

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\utils.py in data_process(X_drug, X_target, y, drug_encoding, target_encoding, split_method, frac, random_seed, sample_frac, mode, X_drug_, X_target_)
    419     if DTI_flag:
    420         df_data = encode_drug(df_data, drug_encoding)
--> 421         df_data = encode_protein(df_data, target_encoding)
    422     elif DDI_flag:
    423         df_data = encode_drug(df_data, drug_encoding, 'SMILES 1', 'drug_encoding_1')

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\utils.py in encode_protein(df_data, target_encoding, column_name, save_column_name)
    317         df_data[save_column_name] = [AA_dict[i] for i in df_data[column_name]]
    318     elif target_encoding == 'Quasi-seq':
--> 319         AA = pd.Series(df_data[column_name].unique()).apply(GetQuasiSequenceOrder)
    320         AA_dict = dict(zip(df_data[column_name].unique(), AA))
    321         df_data[save_column_name] = [AA_dict[i] for i in df_data[column_name]]

~\anaconda3\envs\multiPurpose\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   4198         else:
   4199             values = self.astype(object)._values
-> 4200             mapped = lib.map_infer(values, f, convert=convert_dtype)
   4201
   4202         if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\pybiomed_helper.py in GetQuasiSequenceOrder(ProteinSequence, maxlag, weight)
   1908     """
   1909     result = dict()
-> 1910     result.update(GetQuasiSequenceOrder1SW(ProteinSequence, maxlag, weight, _Distance1))
   1911     result.update(GetQuasiSequenceOrder2SW(ProteinSequence, maxlag, weight, _Distance1))
   1912     result.update(

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\pybiomed_helper.py in GetQuasiSequenceOrder1SW(ProteinSequence, maxlag, weight, distancematrix)
   1794     for i in range(maxlag):
   1795         rightpart = rightpart + GetSequenceOrderCouplingNumber(
-> 1796             ProteinSequence, i + 1, distancematrix
   1797         )
   1798     AAC = GetAAComposition(ProteinSequence)

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\pybiomed_helper.py in GetSequenceOrderCouplingNumber(ProteinSequence, d, distancematrix)
   1601         temp1 = ProteinSequence[i]
   1602         temp2 = ProteinSequence[i + d]
-> 1603         tau = tau + math.pow(distancematrix[temp1 + temp2], 2)
   1604     return round(tau, 3)
   1605

KeyError: 'mg'

AttributeError in virtual_screening

Code

X_drug = []
X_drug_names = []
file = open("./data/drugs.csv")
for aline in file:
  values = aline.split(",")
  X_drug.append(values[-1])
  print("Loaidng Drug",values[0])
  X_drug_names.append(values[0])
file.close()

target, target_name = dataset.load_SARS_CoV2_Protease_3CL()

net = models.model_pretrained(model = 'Transformer_CNN_BindingDB')
net.config

models.virtual_screening(X_drug, target, net, X_drug_names, target_name)

Error

virtual screening...
in total: 133 drug-target pairs
encoding drug...
unique drugs: 124
drug encoding finished...
encoding protein...
unique target sequence: 1
protein encoding finished...
Done.
predicting...
---------------
Virtual Screening Result
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-76-a764e04fe067> in <module>()
----> 1 models.virtual_screening(X_drug, target, net, X_drug_names, target_name)

/content/DeepPurpose/DeepPurpose/models.py in virtual_screening(X_repurpose, target, model, drug_names, target_names, result_folder, convert_y, output_num_max, verbose)
    459                         f_d = max([len(o) for o in drug_names]) + 1
    460                         f_p = max([len(o) for o in target_names]) + 1
--> 461                         for i in range(target.shape[0]):
    462                                 if model.binary:
    463                                         if y_pred[i] > 0.5:

AttributeError: 'str' object has no attribute 'shape'

I haven't gone through the entire codebase yet, but should it be the length of the string rather than its shape?

models importing issue

First of all, I would like to say I appreciate your work. I am facing a small error importing models from DeepPurpose; the other DeepPurpose modules work fine.

from DeepPurpose import models


ImportError                               Traceback (most recent call last)
<ipython-input> in <module>
----> 1 from DeepPurpose import models

ImportError: cannot import name 'models'

error in calcPubChemFingerPart1

When running data_process on the BindingDB dataset, I'm getting the following error:


AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>
      1 train, val, test = utils.data_process(X_drugs, X_targets, y,
      2                                       drug_encoding, target_encoding,
----> 3                                       split_method='cold_drug', frac=[0.7,0.1,0.2])

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\utils.py in data_process(X_drug, X_target, y, drug_encoding, target_encoding, split_method, frac, random_seed, sample_frac, mode, X_drug_, X_target_)
    418
    419     if DTI_flag:
--> 420         df_data = encode_drug(df_data, drug_encoding)
    421         df_data = encode_protein(df_data, target_encoding)
    422     elif DDI_flag:

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\utils.py in encode_drug(df_data, drug_encoding, column_name, save_column_name)
    265         df_data[save_column_name] = [unique_dict[i] for i in df_data[column_name]]
    266     elif drug_encoding == 'Pubchem':
--> 267         unique = pd.Series(df_data[column_name].unique()).apply(calcPubChemFingerAll)
    268         unique_dict = dict(zip(df_data[column_name].unique(), unique))
    269         df_data[save_column_name] = [unique_dict[i] for i in df_data[column_name]]

~\anaconda3\envs\multiPurpose\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   4198         else:
   4199             values = self.astype(object)._values
-> 4200             mapped = lib.map_infer(values, f, convert=convert_dtype)
   4201
   4202         if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\pybiomed_helper.py in calcPubChemFingerAll(s)
   3377
   3378 def calcPubChemFingerAll(s):
-> 3379     mol = Chem.MolFromSmiles(s)
   3380     AllBits = [0]*881
   3381     res1 = list(calcPubChemFingerPart1(mol).ToBitString())

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\pybiomed_helper.py in calcPubChemFingerPart1(mol, **kwargs)
   2690         if count == 0:
   2691             res[i + 1] = mol.HasSubstructMatch(patt)
-> 2692         else:
   2693             print('ne')
   2694             matches = mol.GetSubstructMatches(patt)

AttributeError: 'NoneType' object has no attribute 'GetSubstructMatches'

why are there different results when I use the same inputs in the repurpose and virtual_screening functions?

I used the repurpose and virtual_screening functions from oneliner.py. The drugs and the protein were the same in both cases, and I used the pretrained models; however, the results were different.

Why did this happen? The inputs (drug SMILES and protein sequence) and the models are the same, so shouldn't the results be the same too?
In the case of virtual_screening, I used one sequence but wrote it many times.

the input file for repurpose was as follows:
smile_files:
drug_name1 drug_smiles1
drug_name2 drug_smiles2
drug_name3 drug_smiles3
...

protein:
I used this function load_SARS_CoV2_Helicase()

the input file for virtual_screening was as follows:

input_file:
drug_smile1 protein_sequence
drug_smile2 protein_sequence
drug_smile3 protein_sequence
...

error in GetSequenceOrderCouplingNumber

I'm calling GetQuasiSequenceOrder for every protein in the BindingDB list and running into this error. I understand this may be happening because I'm calling the functions directly rather than using data_process, but I'd still like to be able to call the encoding functions on their own.


KeyError                                  Traceback (most recent call last)
<ipython-input> in <module>
      8 for func in prot_func_list:
      9     save_column_name = func.__name__
---> 10     AA = pd.Series(df_data[column_name].unique()).apply(func)
     11     AA_dict = dict(zip(df_data[column_name].unique(), AA))
     12     df_data[save_column_name] = [AA_dict[i] for i in df_data[column_name]]

~\anaconda3\envs\multiPurpose\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   4198         else:
   4199             values = self.astype(object)._values
-> 4200             mapped = lib.map_infer(values, f, convert=convert_dtype)
   4201
   4202         if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\pybiomed_helper.py in GetQuasiSequenceOrder(ProteinSequence, maxlag, weight)
   1908     """
   1909     result = dict()
-> 1910     result.update(GetQuasiSequenceOrder1SW(ProteinSequence, maxlag, weight, _Distance1))
   1911     result.update(GetQuasiSequenceOrder2SW(ProteinSequence, maxlag, weight, _Distance1))
   1912     result.update(

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\pybiomed_helper.py in GetQuasiSequenceOrder1SW(ProteinSequence, maxlag, weight, distancematrix)
   1794     for i in range(maxlag):
   1795         rightpart = rightpart + GetSequenceOrderCouplingNumber(
-> 1796             ProteinSequence, i + 1, distancematrix
   1797         )
   1798     AAC = GetAAComposition(ProteinSequence)

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\pybiomed_helper.py in GetSequenceOrderCouplingNumber(ProteinSequence, d, distancematrix)
   1601         temp1 = ProteinSequence[i]
   1602         temp2 = ProteinSequence[i + d]
-> 1603         tau = tau + math.pow(distancematrix[temp1 + temp2], 2)
   1604     return round(tau, 3)
   1605

KeyError: 'IX'

config params error

Hi, it seems that the config in the demo no longer works :(

I used the following cfg:

'MPNNAACDTA': {
      'drug_encoding': 'MPNN',
      'target_encoding': 'AAC',
      'cls_hidden_dims': [1024, 1024, 512],
      'train_epoch': 100,
      'LR': 0.001,
      'batch_size': 128,
      'hidden_dim_drug': 128,
      'hidden_dim_protein': 128,
      'input_dim_protein': 128,
      'mlp_hidden_dims_target': [128],
      'mpnn_hidden_size': 128,
      'mpnn_depth': 3,
      'cnn_target_filters': [32, 64, 96],
      'cnn_target_kernels': [4, 8, 12]
  }

and it raised the error RuntimeError: mat1 dim 1 must match mat2 dim 0.

My DeepPurpose version is 0.0.5.

Error when I chose 'ErG' as my drug_encoding

After I changed drug_encoding from 'CNN' to 'ErG' in "DeepDTA_Reproduce_KIBA.ipynb" and ran this cell:
model.train(train, val, test)
I got:

AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>
----> 1 model = models.model_initialize(**config)
      2 model.train(train, val, test)

~/projects/DeepPurpose/DeepPurpose/DTI.py in model_initialize(**config)
     57
     58 def model_initialize(**config):
---> 59     model = DBTA(**config)
     60     return model
     61

~/projects/DeepPurpose/DeepPurpose/DTI.py in __init__(self, **config)
    267         self.model_drug = MPNN(config['hidden_dim_drug'], config['mpnn_depth'])
    268     else:
--> 269         raise AttributeError('Please use one of the available encoding method.')
    270
    271     if target_encoding == 'AAC' or target_encoding == 'PseudoAAC' or target_encoding == 'Conjoint_triad' or target_encoding == 'Quasi-seq' or target_encoding == 'ESPF':

AttributeError: Please use one of the available encoding method.

Convert from nM to p

Hi,

I am confused about your convert_y_unit function. I think this function mainly converts Kd to pKd and pKd back to Kd. However, why don't you just use y = -np.log10(y*1e-9) here?

def convert_y_unit(y, from_, to_):
	# basis as nM

	if from_ == 'nM':
		y = y
	elif from_ == 'p':
		y = 10**(-y) / 1e-9

	if to_ == 'p':
		y = -np.log10(y*1e-9 + 1e-10)
	elif to_ == 'nM':
		y = y

	return y

print(convert_y_unit(convert_y_unit(100, 'p', 'nM'), 'nM', 'p'))
print(convert_y_unit(convert_y_unit(100, 'nM', 'p'), 'p', 'nM'))
print(convert_y_unit(100, 'p', 'p'))
print(convert_y_unit(100, 'nM', 'nM'))

It gave me:

10.0
100.09999999999994
10.0
100

I think the answer should be 100 for all four combinations of convert_y_unit calls.

What is the purpose of adding 1e-10 in the log function?

Best,
Ken
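For reference, the drift above comes entirely from the 1e-10 term (presumably a guard so that y = 0 nM never hits log10(0)); a quick check of the 100 nM round trip reproduces the reported numbers:

import numpy as np

y_nM = 100
p = -np.log10(y_nM * 1e-9 + 1e-10)   # 6.99957..., not exactly 7
back = 10 ** (-p) / 1e-9             # 100.0999..., matching the output above
print(p, back)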

Model configuration error in Tutorial_2_Drug_Property_Pred_Assay_Data

Hello again,
I am having trouble initializing a model using the code in "Tutorial 2: Training a Drug Property Prediction Model from Scratch for Assay Data". Here are the errors I'm getting:

config = utils.generate_config(drug_encoding = drug_encoding, 
                         cls_hidden_dims = [1024,1024,512], 
                         train_epoch = 5, 
                         LR = 0.001, 
                         batch_size = 128,
                         hidden_dim_drug = 128,
                         mpnn_hidden_size = 128,
                         mpnn_depth = 3
                        )

model = models.model_initialize(**config)
model

AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>
----> 1 model = models.model_initialize(**config)
      2 model

~\Dropbox\Work\insight\omic\DeepPurpose\DeepPurpose\DTI.py in model_initialize(**config)
     57
     58 def model_initialize(**config):
---> 59     model = DBTA(**config)
     60     return model
     61

~\Dropbox\Work\insight\omic\DeepPurpose\DeepPurpose\DTI.py in __init__(self, **config)
    259         self.model_protein = transformer('protein', **config)
    260     else:
--> 261         raise AttributeError('Please use one of the available encoding method.')
    262
    263     self.model = Classifier(self.model_drug, self.model_protein, **config)

AttributeError: Please use one of the available encoding method.

model.train(train, val, test)

Let's use CPU/s!
--- Data Preparation ---
--- Go for Training ---


KeyError                                  Traceback (most recent call last)
C:\ProgramData\Anaconda3\envs\multiPurpose\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2890         try:
-> 2891             return self._engine.get_loc(casted_key)
   2892         except KeyError as err:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'target_encoding'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input> in <module>
----> 1 model.train(train, val, test)

~\Dropbox\Work\insight\omic\DeepPurpose\DeepPurpose\DTI.py in train(self, train, val, test, verbose)
    392         t_start = time()
    393         for epo in range(train_epoch):
--> 394             for i, (v_d, v_p, label) in enumerate(training_generator):
    395                 if self.target_encoding == 'Transformer':
    396                     v_p = v_p

C:\ProgramData\Anaconda3\envs\multiPurpose\lib\site-packages\torch\utils\data\dataloader.py in __next__(self)
    343
    344     def __next__(self):
--> 345         data = self._next_data()
    346         self._num_yielded += 1
    347         if self._dataset_kind == _DatasetKind.Iterable and \

C:\ProgramData\Anaconda3\envs\multiPurpose\lib\site-packages\torch\utils\data\dataloader.py in _next_data(self)
    383     def _next_data(self):
    384         index = self._next_index()  # may raise StopIteration
--> 385         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    386         if self._pin_memory:
    387             data = _utils.pin_memory.pin_memory(data)

C:\ProgramData\Anaconda3\envs\multiPurpose\lib\site-packages\torch\utils\data\_utils\fetch.py in fetch(self, possibly_batched_index)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

~\Dropbox\Work\insight\omic\DeepPurpose\DeepPurpose\utils.py in __getitem__(self, index)
    519         if self.config['drug_encoding'] == 'CNN' or self.config['drug_encoding'] == 'CNN_RNN':
    520             v_d = drug_2_embed(v_d)
--> 521         v_p = self.df.iloc[index]['target_encoding']
    522         if self.config['target_encoding'] == 'CNN' or self.config['target_encoding'] == 'CNN_RNN':
    523             v_p = protein_2_embed(v_p)

C:\ProgramData\Anaconda3\envs\multiPurpose\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
    880
    881         elif key_is_scalar:
--> 882             return self._get_value(key)
    883
    884         if (

C:\ProgramData\Anaconda3\envs\multiPurpose\lib\site-packages\pandas\core\series.py in _get_value(self, label, takeable)
    989
    990         # Similar to Index.get_value, but we do not fall back to positional
--> 991         loc = self.index.get_loc(label)
    992         return self.index._get_values_for_loc(self, loc, label)
    993

C:\ProgramData\Anaconda3\envs\multiPurpose\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2891             return self._engine.get_loc(casted_key)
   2892         except KeyError as err:
-> 2893             raise KeyError(key) from err
   2894
   2895         if tolerance is not None:

KeyError: 'target_encoding'

How did you train MPNN_CNN_BindingDB_IC50?

Hi,

I am trying to train an MPNN/CNN model using around 1.2M IC50 interactions from the BindingDB dataset (2021m0). The first problem I encountered was a memory issue with the MPNN drug encoder: to train the model on all interactions, I need to set MAX_ATOM = 700, which runs out of memory even though my server has 252 GB.

Do you know how you solved this kind of issue to train MPNN_CNN_BindingDB_IC50 successfully? Did you train the model with the previous (non-parallel) version of the MPNN drug encoder? Or did you ignore the interactions with long SMILES sequences?

Best,
Ken Kao

How to obtain the target protein's amino acid sequence (t) and the drug's SMILES strings (d)

I am a novice in DTI research. In order to follow "Tutorial_1_DTI_Prediction", I want to know how to get an array of drug SMILES strings (d) and an array of target protein amino acid sequences (t).

Suppose I have found the following using DrugBank data:

Drug ID   Target ID   Score
DB08604   P0AEK4      0.931528
DB07181   P0AEK4      0.931504
DB08642   P16184      0.931335
DB03233   P0A884      0.931334
DB07411   P0AEK4      0.931313
DB07209   P27338      0.931300
DB03072   P0AEK4      0.931230
DB02727   Q9Y296      0.931186
DB06840   Q9Y296      0.931151
DB07972   P0AEK4      0.931095
DB08700   P0AEK4      0.931029
DB07647   P0AEK4      0.931003
DB01861   P96945      0.930968
...

Questions:
1. How do I get the target protein's amino acid sequence (t) for a large number of Target IDs?
2. How do I get the drug's SMILES strings for a large number of Drug IDs?
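The Target IDs above look like UniProt accessions, so one common approach outside DeepPurpose (a sketch, assuming network access; not part of the toolkit) is UniProt's REST FASTA endpoint. SMILES for DrugBank IDs generally require a licensed DrugBank download or a mapping to PubChem CIDs.

import requests

def fetch_uniprot_seq(accession):
    # Fetch the amino acid sequence for a UniProt accession in FASTA format
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    # Drop the FASTA header line and join the sequence lines
    return "".join(r.text.splitlines()[1:])

X_target = [fetch_uniprot_seq(acc) for acc in ['P0AEK4', 'P16184', 'P0A884']]
print(X_target[0][:60])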

BUG

auc, auprc, f1, logits = self.test_(testing_generator, model_max, test = True)

The function test_ returns 5 items, not 4, when binary is True, so this line should be:

auc, auprc, f1, log_loss, logits = self.test_(testing_generator, model_max, test = True)

try/except of max_atoms/bond error

Greetings sir,
I was doing VS using the virtual_screening function when it gave me this error. The same drugs were used with a different protein without producing this error.
Traceback (most recent call last):
  File "/lfs01/workdirs/cairo029u1/deeppurpose/DeepPurpose/DeepPurpose/utils.py", line 264, in smiles2mpnnfeature
    assert atoms_completion_num >= 0 and bonds_completion_num >= 0
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "play_VS.py", line 20, in <module>
    play(dest_repur_db, dest_vs_db, dest_save + '/')
  File "play_VS.py", line 9, in play
    save_dir = dest_save)
  File "/lfs01/workdirs/cairo029u1/deeppurpose/DeepPurpose/DeepPurpose/oneliner.py", line 261, in virtual_screening
    y_pred = models.virtual_screening(X_repurpose, target, model, drug_names, target_name, convert_y = convert_y, result_folder = result_folder_path, verbose = False)
  File "/lfs01/workdirs/cairo029u1/deeppurpose/DeepPurpose/DeepPurpose/DTI.py", line 163, in virtual_screening
    model.drug_encoding, model.target_encoding, 'virtual screening')
  File "/lfs01/workdirs/cairo029u1/deeppurpose/DeepPurpose/DeepPurpose/utils.py", line 578, in data_process_repurpose_virtual_screening
    split_method='repurposing_VS')
  File "/lfs01/workdirs/cairo029u1/deeppurpose/DeepPurpose/DeepPurpose/utils.py", line 499, in data_process
    df_data = encode_drug(df_data, drug_encoding)
  File "/lfs01/workdirs/cairo029u1/deeppurpose/DeepPurpose/DeepPurpose/utils.py", line 364, in encode_drug
    unique = pd.Series(df_data[column_name].unique()).apply(smiles2mpnnfeature)
  File "/share/apps/conda_envs/DeepPurpose/lib/python3.7/site-packages/pandas/core/series.py", line 3848, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2329, in pandas._libs.lib.map_infer
  File "/lfs01/workdirs/cairo029u1/deeppurpose/DeepPurpose/DeepPurpose/utils.py", line 266, in smiles2mpnnfeature
    raise Exception("increase MAX_ATOM and MAX_BOND in utils")
Exception: increase MAX_ATOM and MAX_BOND in utils

Do you include MolTrans in this repo?

You have another DTI repo called MolTrans. I think you didn't include that model in this toolkit, am I right? If it isn't included, what is the difference between the DTI model of MolTrans and the models in this repo? Which one is better? Thanks a lot.

error loading BindingDB data in load_data_tutorial

I was getting a different error before (not sure how to reproduce it, unfortunately); here's the error I'm getting now:

data_path = dataset.download_BindingDB('./data/')

Beginning to download dataset...
100% [......................................................................] 327218168 / 327218168Beginning to extract zip file...
Done!

X_drugs, X_targets, y = dataset.process_BindingDB(path = data_path, df = None, y = 'Kd', binary = False, convert_to_log = True, threshold = 30

File "", line 1
X_drugs, X_targets, y = dataset.process_BindingDB(path = data_path, df = None, y = 'Kd', binary = False, convert_to_log = True, threshold = 30
^
SyntaxError: unexpected EOF while parsing

Filename Issue in Tutorial_2_Drug_Property_Pred_Assay_Data.ipynb

Hi Kexin,

There is an issue with loading the HIV data. After I ran the following commands:

X_drugs, y, drugs_index = dataset.load_HIV(path = './data')
print('Drug 1: ' + X_drugs[0])
print('Score 1: ' + str(y[0]))

it gave me a FileNotFoundError.

The code in /DeepPurpose/dataset.py tries to find the file hiv.csv under the data folder. However, unzipping the hiv.zip file produces HIV.csv, so there is a filename mismatch (hiv.csv vs HIV.csv).

After I changed the filename from HIV.csv to hiv.csv, the error went away.

Best,
Ken

Encountering a RuntimeError When Running Tutorial_1_DTI_Prediction

Dear Kexin Huang,

This is amazing work. Thank you for making DTI prediction easier for both scientists and engineers.

I tried to run Tutorial_1_DTI_Prediction but it gives me an error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-8-4686be42c026> in <module>
----> 1 model.train(train, val, test)

~/DeepPurpose/DeepPurpose/DTI.py in train(self, train, val, test, verbose)
    438                     #score = self.model(v_d, v_p.float().to(self.device))
    439 
--> 440                 score = self.model(v_d, v_p)
    441                 label = Variable(torch.from_numpy(
    442                     np.array(label)).float()).to(self.device)

~/anaconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
    150             return self.module(*inputs[0], **kwargs[0])
    151         replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
--> 152         outputs = self.parallel_apply(replicas, inputs, kwargs)
    153         return self.gather(outputs, self.output_device)
    154 

~/anaconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py in parallel_apply(self, replicas, inputs, kwargs)
    160 
    161     def parallel_apply(self, replicas, inputs, kwargs):
--> 162         return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
    163 
    164     def gather(self, outputs, output_device):

~/anaconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py in parallel_apply(modules, inputs, kwargs_tup, devices)
     83         output = results[i]
     84         if isinstance(output, ExceptionWrapper):
---> 85             output.reraise()
     86         outputs.append(output)
     87     return outputs

~/anaconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/_utils.py in reraise(self)
    392             # (https://bugs.python.org/issue2651), so we work around it.
    393             msg = KeyErrorMessage(msg)
--> 394         raise self.exc_type(msg)

RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/ken/anaconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/ken/anaconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ken/DeepPurpose/DeepPurpose/DTI.py", line 48, in forward
    v_D = self.model_drug(v_D)
  File "/home/ken/anaconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ken/DeepPurpose/DeepPurpose/encoders.py", line 267, in forward
    n_a = atoms_bonds[i,0].item()
RuntimeError: CUDA error: device-side assert triggered

I think it might be a CUDA parallelism error. Could you please guide me on how to solve this problem?

The CUDA version is 10.2.89, and the driver version is 450.66.

I ran the data parallelism tutorial from PyTorch and it works for me.

Best,
Ken

How can I know if the model is overfitting?

In your demos and tutorials, you always set epoch=100, the learning rate is a constant, and you don't show a comparison between training and validation losses. I saw code for early stopping somewhere, but I don't know how to set it. Do you have a learning rate scheduling function? Thank you!

What is the cutoff for a good binding score?

Hi!
I have used the DeepPurpose library to screen a database, and now I want to select the best-binding drugs based on the binding score.
Is there a threshold on the binding score below which I should select drugs?

Thanks

Training Configuration of the Pre-trained MPNN_CNN

Hi Kexin Huang,

I am using the provided pre-trained MPNN_CNN model. When I looked into its model configuration file, it looked weird to me:

{'input_dim_drug': 1024,
'input_dim_protein': 8420,
'hidden_dim_drug': 128,
'hidden_dim_protein': 256,
'cls_hidden_dims': [1024, 1024, 512],
'batch_size': 16,
'train_epoch': 1,
'LR': 0.001,
'drug_encoding': 'MPNN',
'target_encoding': 'CNN',
'result_folder': './result/',
'binary': False,
'mpnn_hidden_size': 128,
'mpnn_depth': 3,
'cnn_target_filters': [32, 64, 96],
'cnn_target_kernels': [4, 8, 12],
'num_workers': 0,
'decay': 0}

Did you really train this model for only 1 epoch with a batch size of 16?

Best regards,
Po-Yu Kao

How to use DeepPurpose for virtual screening?

Greetings sir,

I want to use DeepPurpose for virtual screening, using drugs downloaded from databases against a certain protein.

Can you give me information on how to do this, such as how to prepare the drugs and the protein?

Thanks

errors when I ran "MPNN_AAC_Kiba.ipynb"

I got this error when I ran "MPNN_AAC_Kiba.ipynb"

RuntimeError: CUDA error: device-side assert triggered

It happened again when I ran "case-study-II-Virtual-Screening-for-BindingDB-IC50.ipynb"

import error in Tutorial_2_Drug_Property_Pred_Assay_Data

Hi,
When trying to run the first cell of "Tutorial 2: Training a Drug Property Prediction Model from Scratch for Assay Data", I run into the following error:

ImportError                               Traceback (most recent call last)
<ipython-input> in <module>
----> 3 from DeepPurpose import utils, dataset, property_pred

ImportError: cannot import name 'property_pred' from 'DeepPurpose' (C:\Users\Julia\Dropbox\Work\insight\omic\DeepPurpose\DeepPurpose\__init__.py)

I am using Windows but have tried WSL and Amazon Linux, and the error persists.

Pre-trained Transformer for Drugs

Dear Kexin,

According to the MT-DTI paper, they pre-trained the transformer on 97,092,853 molecules with canonical SMILES from PubChem. I am just curious: if I call drug_encoding='Transformer', does your code use pre-trained weights?

Thank you for your answer.

Best,
Po-Yu Kao

The training epochs of KIBA

Hi, Kexin. I'm writing to ask about reproducing DeepPurpose results. I want to get the result for MPNN+AAC on the KIBA dataset, but it seems that 150 epochs aren't enough for KIBA, while they work for DAVIS. I can only get a C-index of 0.73, much lower than in your paper. So I wonder how many epochs should be set when training on KIBA.
