
Transfer Learning for Drug-Target Interaction Prediction

Abstract

Utilizing AI-driven approaches for drug-target interaction (DTI) prediction requires large volumes of training data, which are not available for the majority of target proteins. In this study, we investigate the use of deep transfer learning for the prediction of interactions between drug candidate compounds and understudied target proteins with scarce training data. The idea here is to first train a deep neural network classifier with a generalized source training dataset of large size and then to reuse this pre-trained neural network as an initial configuration for re-training/fine-tuning purposes with a small-sized specialized target training dataset. To explore this idea, we selected six protein families that have critical importance in biomedicine: kinases, G-protein-coupled receptors (GPCRs), ion channels, nuclear receptors, proteases, and transporters. In two independent experiments, the protein families of transporters and nuclear receptors were individually set as the target datasets, while the remaining five families were used as the source datasets. Several size-based target family training datasets were formed in a controlled manner to assess the benefit provided by the transfer learning approach.

Here, we present a systematic evaluation of our approach by pre-training a feed-forward neural network with source training datasets and applying different modes of transfer learning from the pre-trained source network to a target dataset. The performance of deep transfer learning is evaluated and compared with that of training the same deep neural network from scratch. We found that when the training dataset contains fewer than 100 compounds, transfer learning outperforms the conventional strategy of training the system from scratch, suggesting that transfer learning is advantageous for predicting binders to under-studied targets.

The source code and datasets are available in this repository. Our web-based service containing the ready-to-use pre-trained models is accessible at https://tl4dti.kansil.org.

[Figure: tl4dti_figure2]

Figure. Overview of the model training phase. We first trained a source neural network model using the bioactivity dataset of the selected source protein family (Stage I). This pre-trained source model was then used in the context of transfer learning for further training (i.e., fine-tuning) with a small-sized target bioactivity dataset that belongs to another protein family (Stage II). We also trained an FNN with exactly the same configuration from scratch (the reference model), as well as a shallow classifier (the base model), on this same target training dataset.
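To make the two-stage procedure concrete, here is a minimal PyTorch sketch of the idea. It is illustrative only, not the repository's exact implementation: the hidden layer sizes follow the --chln default of 1200_300 documented below, the 300-dimensional input assumes chemprop feature vectors, and all file names are placeholders.

import torch
import torch.nn as nn

class FNN(nn.Module):
    """Feed-forward classifier with two hidden layers (1200 and 300 neurons)."""
    def __init__(self, in_dim=300, h1=1200, h2=300, n_classes=2, dropout=0.1):
        super().__init__()
        self.hidden1 = nn.Sequential(nn.Linear(in_dim, h1), nn.ReLU(), nn.Dropout(dropout))
        self.hidden2 = nn.Sequential(nn.Linear(h1, h2), nn.ReLU(), nn.Dropout(dropout))
        self.out = nn.Linear(h2, n_classes)

    def forward(self, x):
        return self.out(self.hidden2(self.hidden1(x)))

# Stage I: pre-train on the large source-family dataset (e.g., kinase),
# then save the learned weights.
source_model = FNN()
# ... training loop over the source bioactivity data goes here ...
torch.save(source_model.state_dict(), "source_kinase.pt")

# Stage II: initialize the target model with the pre-trained weights and
# fine-tune on the small target-family dataset (e.g., transporter).
target_model = FNN()
target_model.load_state_dict(torch.load("source_kinase.pt"))
for p in target_model.hidden1.parameters():  # optional freezing, cf. --ff 1 --fl 1
    p.requires_grad = False
# ... fine-tuning loop over the target bioactivity data goes here ...

Freezing a hidden layer keeps the features learned from the source family intact while only the remaining layers adapt to the target family.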

Descriptions of folders and files in the TransferLearning4DTI repository

  • bin folder includes the source code of TransferLearning4DTI.

  • training_files folder contains the various training/test datasets, formatted for inspection and for use in future studies

    • gpcr contains training data points, features, and data splits for the GPCR dataset. compound_feature_vectors/ecfp4.tsv includes ECFP4 features of ligands. compound_feature_vectors/chemprop.tsv includes chemprop features of ligands. comp_targ_binary.tsv is a csv file where each line is formatted as <compound_id>,<target_id>,<active/inactive> (see the parsing sketch after this list).
    • ionchannel contains training data points, features, and data splits for the Ion Channel dataset. compound_feature_vectors/ecfp4.tsv includes ECFP4 features of ligands. compound_feature_vectors/chemprop.tsv includes chemprop features of ligands. comp_targ_binary.tsv is a csv file where each line is formatted as <compound_id>,<target_id>,<active/inactive>.
    • kinase contains training data points, features, and data splits for the Kinase dataset. compound_feature_vectors/ecfp4.tsv includes ECFP4 features of ligands. compound_feature_vectors/chemprop.tsv includes chemprop features of ligands. comp_targ_binary.tsv is a csv file where each line is formatted as <compound_id>,<target_id>,<active/inactive>.
    • nuclearreceptor contains training data points, features, and data splits for the Nuclear Receptor dataset. compound_feature_vectors/ecfp4.tsv includes ECFP4 features of ligands. compound_feature_vectors/chemprop.tsv includes chemprop features of ligands. comp_targ_binary.tsv is a csv file where each line is formatted as <compound_id>,<target_id>,<active/inactive>.
    • protease contains training data points, features, and data splits for the Protease dataset. compound_feature_vectors/ecfp4.tsv includes ECFP4 features of ligands. compound_feature_vectors/chemprop.tsv includes chemprop features of ligands. comp_targ_binary.tsv is a csv file where each line is formatted as <compound_id>,<target_id>,<active/inactive>.
    • transporter contains training data points, features, and data splits for the Transporter dataset. compound_feature_vectors/ecfp4.tsv includes ECFP4 features of ligands. compound_feature_vectors/chemprop.tsv includes chemprop features of ligands. comp_targ_binary.tsv is a csv file where each line is formatted as <compound_id>,<target_id>,<active/inactive>.
  • Download the remaining training files here
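As a quick illustration, the bioactivity file of any family can be read with a few lines of Python. This is a sketch that follows the comma-separated line format described above; adjust the delimiter if your copy of the file is actually tab-separated.

import csv

# Each line of comp_targ_binary.tsv holds <compound_id>,<target_id>,<active/inactive>.
with open("training_files/transporter/comp_targ_binary.tsv") as f:
    for compound_id, target_id, label in csv.reader(f):
        print(compound_id, target_id, label)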

Dependencies

How to reproduce the performance comparison results in the TransferLearning4DTI article

  • Clone the Git Repository

  • Run the commands below for each dataset

Explanation of Parameters

--chln: number of neurons in compound hidden layers (default: 1200_300)

--lr: learning rate (default: 0.0001)

--bs: batch size (default: 256)

--td: the name of the target dataset (default: transporter)

--sd: the name of the source dataset (default: kinase)

--do: dropout rate (default: 0.1)

--en: the name of the experiment (default: my_experiment)

--model: model name (default: fc_2_layer)

--epoch: number of epochs (default: 100)

--sf: subset flag (default: 0)

--tlf: transfer learning flag (default: 0)

--ff: freeze flag (default: 0)

--fl: hidden layer to be frozen (default: 1)

--el: layer to be extracted (default: 0)

--ss: subset size (default: 10)

--cf: compound features separated by underscore character (default: chemprop)

--setting: Determines the setting (1: train_val_test, 2: extract layer train_val_test, 3: training_test, 4: only training, 5: extract layer train and test) (default: 1)

--et: external test dataset (default: -)

--nc: number of result classes (default: 2)

--train: (1) train or (0) extract features (default: 1)
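For orientation, here is a minimal argparse skeleton that mirrors the flags above. This is an illustrative sketch only; the repository's actual parser may differ in argument types and in details not listed here.

import argparse

parser = argparse.ArgumentParser(description="TransferLearning4DTI training options")
parser.add_argument("--chln", default="1200_300", help="neurons in compound hidden layers")
parser.add_argument("--lr", type=float, default=0.0001, help="learning rate")
parser.add_argument("--bs", type=int, default=256, help="batch size")
parser.add_argument("--td", default="transporter", help="target dataset")
parser.add_argument("--sd", default="kinase", help="source dataset")
parser.add_argument("--do", type=float, default=0.1, help="dropout rate")
parser.add_argument("--en", default="my_experiment", help="experiment name")
parser.add_argument("--model", default="fc_2_layer", help="model name")
parser.add_argument("--epoch", type=int, default=100, help="number of epochs")
parser.add_argument("--sf", type=int, default=0, help="subset flag")
parser.add_argument("--tlf", type=int, default=0, help="transfer learning flag")
parser.add_argument("--ff", type=int, default=0, help="freeze flag")
parser.add_argument("--fl", type=int, default=1, help="hidden layer to be frozen")
parser.add_argument("--el", type=int, default=0, help="layer to be extracted")
parser.add_argument("--ss", type=int, default=10, help="subset size")
parser.add_argument("--cf", default="chemprop", help="compound features, underscore-separated")
parser.add_argument("--setting", type=int, default=1, help="training/evaluation setting")
parser.add_argument("--et", default="-", help="external test dataset")
parser.add_argument("--nc", type=int, default=2, help="number of result classes")
parser.add_argument("--train", type=int, default=1, help="1: train, 0: extract features")
args = parser.parse_args()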

Option 1 - Reproduce the performance results

Create a small Transporter dataset with a subset size of 6:

python create_small_dataset.py --d transporter --ss 6

Obtain baseline performance results for the created dataset:

python baseline_training.py --setting 2 --tlf 0 --td transporter --ss 6 --en 0 --sf 1

Obtain from-scratch training performance results for the same dataset:

python main_training.py --setting 3 --epoch 50 --ss 6 --en 0 --tlf 0 --sf 1 --td transporter

Extract the output of the first hidden layer (use --el 2 for the output of the second hidden layer):

python main_training.py --setting 5 --train 0 --epoch 50 --ss 6 --en 0 --el 1 --tlf 1 --sf 1 --td transporter --sd kinase

Obtain shallow classifier performance results using the output of the first hidden layer (use --el 2 for the output of the second hidden layer):

python baseline_training.py --setting 2 --tlf 1 --el 1 --td transporter --sd kinase --ss 6 --en 0 --sf 1
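Conceptually, this step fits a shallow model on the activations extracted in the previous step. Here is a scikit-learn sketch with synthetic stand-in data; the actual classifier used by baseline_training.py may differ.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef

# Stand-ins for the extracted first-hidden-layer activations (1200-dimensional
# under the --chln default) and binary active/inactive labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(96, 1200)), rng.integers(0, 2, 96)
X_test, y_test = rng.normal(size=(40, 1200)), rng.integers(0, 2, 40)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("MCC:", matthews_corrcoef(y_test, clf.predict(X_test)))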

Obtain full fine-tuning performance results for the same dataset:

python main_training.py --setting 3 --epoch 50 --ss 6 --en 0 --tlf 1 --sf 1 --td transporter --sd kinase

Obtain performance results for fine-tuning with layer 1 frozen (use --fl 2 to freeze layer 2):

python main_training.py --setting 3 --epoch 50 --ss 6 --en 0 --ff 1 --fl 1 --tlf 1 --sf 1 --td transporter --sd kinase
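To repeat the whole Option 1 sequence for several subset sizes, the commands can be chained with a small driver script. This is a hypothetical convenience wrapper, not part of the repository:

import subprocess

for ss in (6, 12, 48):  # subset sizes to evaluate
    common = ["--ss", str(ss), "--en", "0", "--sf", "1", "--td", "transporter"]
    # Create the small dataset, then train from scratch and with transfer learning.
    subprocess.run(["python", "create_small_dataset.py", "--d", "transporter", "--ss", str(ss)], check=True)
    subprocess.run(["python", "main_training.py", "--setting", "3", "--epoch", "50", "--tlf", "0", *common], check=True)
    subprocess.run(["python", "main_training.py", "--setting", "3", "--epoch", "50", "--tlf", "1", "--sd", "kinase", *common], check=True)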

Option 2 - Fine-tune a model on your own training dataset

Extract features for your training dataset with the chemprop tool by running the command below:

--trainisc: the name of the training file; it should contain id, smiles, and compound columns (default: input/train.csv)

--name: the name of your protein or protein family (default: new_family)

--sc: the name of the source checkpoint (default: kinase)

python convert_chemprop_ftune.py --trainisc input/train.csv --name your_family_name --sc kinase

Train a model on your dataset without transfer learning:

python main_training.py --setting 4 --td your_family_name 

Train a model on your dataset with transfer learning:

python main_training.py --setting 4 --td your_family_name --sd kinase --tlf 1
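For reference, a minimal input/train.csv with the three columns described above can be produced as follows. The values are illustrative placeholders, and the exact semantics of the compound column are an assumption here (shown as the activity label):

import os
import pandas as pd

os.makedirs("input", exist_ok=True)
# Hypothetical two-row training file with the id, smiles and compound columns
# expected by convert_chemprop_ftune.py (column semantics assumed, not confirmed).
pd.DataFrame({
    "id": ["CHEMBL25", "CHEMBL521"],
    "smiles": ["CC(=O)Oc1ccccc1C(=O)O", "CC(C)Cc1ccc(cc1)C(C)C(=O)O"],
    "compound": ["active", "inactive"],
}).to_csv("input/train.csv", index=False)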

Option 3 - Predict on your test dataset

Extract features for your test dataset with the chemprop tool by running the command below (it creates a test_chemprop file under the output directory):

--testisf: the name of the test file; it should contain id and smiles columns (default: input/test.csv)

--sc: the name of the source checkpoint (default: kinase)

python convert_chemprop_predict.py --testisf input/test.csv --sc kinase

To get predictions for your test dataset without training, run the following command:

python main_training.py --setting 6 --sd kinase --et output/test_chemprop.tsv --tlf 1

To get predictions for your test dataset with training, run the following command:

python main_training.py --setting 4 --td transporter --sd kinase --et output/test_chemprop.tsv --tlf 1
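Similarly, a minimal input/test.csv with the id and smiles columns can be written as follows (values are illustrative placeholders):

import os
import pandas as pd

os.makedirs("input", exist_ok=True)
# Hypothetical test file with only the id and smiles columns expected by
# convert_chemprop_predict.py.
pd.DataFrame({
    "id": ["CHEMBL25", "CHEMBL521"],
    "smiles": ["CC(=O)Oc1ccccc1C(=O)O", "CC(C)Cc1ccc(cc1)C(C)C(=O)O"],
}).to_csv("input/test.csv", index=False)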

Option 4 - Fine-tune on your training dataset and get predictions for your test dataset using the fine-tuned model (training and testing)

Extract features for your training and test datasets with the chemprop tool by running the command below (it creates a test_chemprop file under the output directory):

--trainisc: the name of the training file; it should contain id, smiles, and compound columns (default: input/train.csv)

--name: the name of your protein or protein family (default: new_family)

--testisf: the name of the test file; it should contain id and smiles columns (default: input/test.csv)

--sc: the name of the source checkpoint (default: kinase)

python convert_chemprop_ftune_predict.py --trainisc input/train.csv --name your_family_name --sc kinase --testisf input/test.csv

To get predictions for your test dataset, run one of the following commands.

Full fine-tuning predictions:

python main_training.py --setting 4 --td your_family_name --sd kinase --et output/test_chemprop.tsv --tlf 1

Predictions from fine-tuning with a frozen layer:

python main_training.py --setting 4 --td your_family_name --sd kinase --et output/test_chemprop.tsv --tlf 1 --ff 1 --fl 1

Output of the scripts

main_training.py creates a folder named after the experiment (given with the --en argument) under the result_files folder. One file is created under result_files/<experiment_name>: performance_results.txt, which contains the best performance results for the test dataset. Sample output files for the Transporter dataset are given under result_files/transporter.

Results

Each table below reports test performance (mean ± deviation) for a given target training set size, comparing training from scratch, the shallow classifier, and the three transfer learning modes.

Training set size: 4,000

Metric      scratch          shallow          Mode 1           Mode 2           Mode 3
MCC         0.531 ± 0.005    0.533 ± 0.008    0.522 ± 0.010    0.518 ± 0.008    0.534 ± 0.011
AUROC       0.770 ± 0.003    0.771 ± 0.004    0.764 ± 0.005    0.763 ± 0.005    0.771 ± 0.005
Precision   0.684 ± 0.013    0.680 ± 0.006    0.682 ± 0.017    0.681 ± 0.018    0.690 ± 0.015
Recall      0.763 ± 0.023    0.773 ± 0.009    0.752 ± 0.028    0.751 ± 0.018    0.768 ± 0.010
F1-Score    0.721 ± 0.005    0.724 ± 0.005    0.715 ± 0.007    0.712 ± 0.007    0.723 ± 0.006
Accuracy    0.771 ± 0.004    0.771 ± 0.004    0.767 ± 0.007    0.766 ± 0.006    0.772 ± 0.005

Training set size: 1,000

Metric      scratch          shallow          Mode 1           Mode 2           Mode 3
MCC         0.481 ± 0.014    0.517 ± 0.014    0.494 ± 0.011    0.489 ± 0.011    0.523 ± 0.013
AUROC       0.745 ± 0.007    0.765 ± 0.007    0.751 ± 0.005    0.749 ± 0.006    0.768 ± 0.007
Precision   0.644 ± 0.014    0.661 ± 0.013    0.661 ± 0.012    0.659 ± 0.009    0.667 ± 0.014
Recall      0.756 ± 0.025    0.786 ± 0.020    0.744 ± 0.018    0.740 ± 0.019    0.788 ± 0.017
F1-Score    0.695 ± 0.009    0.717 ± 0.008    0.700 ± 0.006    0.697 ± 0.008    0.720 ± 0.007
Accuracy    0.743 ± 0.008    0.760 ± 0.008    0.753 ± 0.007    0.751 ± 0.005    0.763 ± 0.007

Training set size: 400

Metric      scratch          shallow          Mode 1           Mode 2           Mode 3
MCC         0.400 ± 0.008    0.488 ± 0.019    0.469 ± 0.015    0.464 ± 0.015    0.495 ± 0.016
AUROC       0.705 ± 0.004    0.750 ± 0.010    0.739 ± 0.007    0.736 ± 0.008    0.753 ± 0.008
Precision   0.584 ± 0.011    0.647 ± 0.014    0.645 ± 0.017    0.639 ± 0.016    0.649 ± 0.014
Recall      0.749 ± 0.029    0.778 ± 0.026    0.734 ± 0.027    0.738 ± 0.029    0.788 ± 0.025
F1-Score    0.656 ± 0.006    0.702 ± 0.011    0.686 ± 0.009    0.684 ± 0.010    0.706 ± 0.009
Accuracy    0.695 ± 0.007    0.744 ± 0.010    0.740 ± 0.009    0.736 ± 0.009    0.748 ± 0.010

Training set size: 96

Metric      scratch          shallow          Mode 1           Mode 2           Mode 3
MCC         0.383 ± 0.017    0.410 ± 0.031    0.427 ± 0.030    0.423 ± 0.028    0.419 ± 0.027
AUROC       0.694 ± 0.009    0.709 ± 0.017    0.718 ± 0.016    0.716 ± 0.014    0.714 ± 0.014
Precision   0.551 ± 0.013    0.609 ± 0.020    0.614 ± 0.023    0.612 ± 0.022    0.610 ± 0.021
Recall      0.807 ± 0.031    0.755 ± 0.062    0.726 ± 0.051    0.723 ± 0.044    0.770 ± 0.064
F1-Score    0.654 ± 0.009    0.659 ± 0.021    0.664 ± 0.020    0.662 ± 0.018    0.663 ± 0.019
Accuracy    0.669 ± 0.013    0.709 ± 0.013    0.716 ± 0.016    0.714 ± 0.015    0.710 ± 0.015

Training set size: 48

Metric      scratch          shallow          Mode 1           Mode 2           Mode 3
MCC         0.368 ± 0.025    0.373 ± 0.041    0.410 ± 0.033    0.405 ± 0.031    0.385 ± 0.035
AUROC       0.685 ± 0.014    0.689 ± 0.022    0.709 ± 0.017    0.706 ± 0.016    0.696 ± 0.018
Precision   0.541 ± 0.018    0.590 ± 0.032    0.597 ± 0.031    0.594 ± 0.031    0.586 ± 0.029
Recall      0.807 ± 0.050    0.725 ± 0.096    0.738 ± 0.060    0.735 ± 0.056    0.740 ± 0.105
F1-Score    0.647 ± 0.015    0.630 ± 0.028    0.657 ± 0.020    0.655 ± 0.018    0.643 ± 0.023
Accuracy    0.658 ± 0.019    0.690 ± 0.022    0.702 ± 0.022    0.700 ± 0.021    0.690 ± 0.022

Training set size: 12

Metric      scratch          shallow          Mode 1           Mode 2           Mode 3
MCC         0.293 ± 0.082    0.280 ± 0.080    0.367 ± 0.032    0.370 ± 0.029    0.307 ± 0.062
AUROC       0.638 ± 0.048    0.640 ± 0.041    0.686 ± 0.017    0.687 ± 0.015    0.652 ± 0.034
Precision   0.490 ± 0.044    0.536 ± 0.055    0.563 ± 0.034    0.575 ± 0.035    0.543 ± 0.050
Recall      0.856 ± 0.073    0.675 ± 0.167    0.745 ± 0.079    0.725 ± 0.082    0.687 ± 0.135
F1-Score    0.619 ± 0.028    0.577 ± 0.053    0.637 ± 0.022    0.634 ± 0.023    0.596 ± 0.046
Accuracy    0.589 ± 0.069    0.638 ± 0.047    0.673 ± 0.025    0.680 ± 0.021    0.645 ± 0.045

Training set size: 6

Metric      scratch          shallow          Mode 1           Mode 2           Mode 3
MCC         0.227 ± 0.109    0.214 ± 0.130    0.367 ± 0.029    0.372 ± 0.027    0.269 ± 0.098
AUROC       0.600 ± 0.061    0.605 ± 0.065    0.686 ± 0.015    0.688 ± 0.014    0.632 ± 0.050
Precision   0.460 ± 0.051    0.494 ± 0.071    0.556 ± 0.031    0.559 ± 0.029    0.515 ± 0.060
Recall      0.889 ± 0.088    0.694 ± 0.209    0.764 ± 0.076    0.768 ± 0.075    0.734 ± 0.163
F1-Score    0.600 ± 0.029    0.551 ± 0.074    0.640 ± 0.018    0.643 ± 0.018    0.585 ± 0.061
Accuracy    0.536 ± 0.091    0.594 ± 0.076    0.668 ± 0.025    0.671 ± 0.022    0.615 ± 0.065

Training set size: 2

Metric      scratch          shallow          Mode 1           Mode 2           Mode 3
MCC         0.200 ± 0.122    0.126 ± 0.168    0.365 ± 0.023    0.369 ± 0.023    0.227 ± 0.124
AUROC       0.589 ± 0.065    0.558 ± 0.081    0.683 ± 0.013    0.685 ± 0.013    0.603 ± 0.065
Precision   0.454 ± 0.053    0.453 ± 0.107    0.554 ± 0.033    0.558 ± 0.035    0.500 ± 0.087
Recall      0.883 ± 0.111    0.622 ± 0.275    0.767 ± 0.088    0.773 ± 0.087    0.692 ± 0.254
F1-Score    0.592 ± 0.033    0.490 ± 0.127    0.638 ± 0.018    0.640 ± 0.017    0.543 ± 0.111
Accuracy    0.523 ± 0.098    0.548 ± 0.088    0.665 ± 0.026    0.668 ± 0.027    0.583 ± 0.086
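The size dependence in these tables is easiest to see when plotted. The sketch below charts the MCC rows for the scratch and Mode 2 columns, with values copied from the tables above:

import matplotlib.pyplot as plt

sizes = [2, 6, 12, 48, 96, 400, 1000, 4000]
mcc_scratch = [0.200, 0.227, 0.293, 0.368, 0.383, 0.400, 0.481, 0.531]
mcc_mode2 = [0.369, 0.372, 0.370, 0.405, 0.423, 0.464, 0.489, 0.518]

plt.plot(sizes, mcc_scratch, marker="o", label="scratch")
plt.plot(sizes, mcc_mode2, marker="s", label="Mode 2 (transfer)")
plt.xscale("log")  # sizes span three orders of magnitude
plt.xlabel("Target training set size")
plt.ylabel("MCC")
plt.legend()
plt.show()

Consistent with the abstract, the transfer learning curve stays well above the scratch curve once the training set drops below roughly 100 compounds.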

Datasets for AKT1

Family               # of active    # of inactive    Training    Test
2.7.11.1 (Source)    14,025         14,025           23,376      4,674
AKT1 (Target)        1,633          1,633            100         3,166

Candidate drugs for AKT1.

ID         Drug Name                       Score
DB11577    Indigotindisulfonic acid        0.998
DB09462    Glycerin                        0.996
DB11127    Selenious acid                  0.995
DB15479    Zirconium chloride hydroxide    0.994
DB03175    Propyl alcohol                  0.994
DB11387    Chloroform                      0.992
DB01839    Propylene glycol                0.987
DB11343    Silanol                         0.986
DB00898    Ethanol                         0.982
DB11091    Hydrogen peroxide               0.982

Datasets for CDK1

Family                # of active    # of inactive    Training    Test
2.7.11.22 (Source)    694            694              1,158       230
CDK1 (Target)         674            674              100         1,248

Candidate drugs for CDK1.

ID         Drug Name                       Score
DB11577    Indigotindisulfonic acid        0.998
DB09462    Glycerin                        0.996
DB11127    Selenious acid                  0.995
DB15479    Zirconium chloride hydroxide    0.994
DB03175    Propyl alcohol                  0.994
DB11387    Chloroform                      0.992
DB01839    Propylene glycol                0.987
DB11343    Silanol                         0.986
DB00898    Ethanol                         0.982
DB11091    Hydrogen peroxide               0.982

Datasets for CDK2

Family                # of active    # of inactive    Training    Test
2.7.11.22 (Source)    694            694              1,158       230
CDK2 (Target)         1,176          1,176            100         2,252

Candidate drugs for CDK2.

ID         Drug Name                   Score
DB11577    Indigotindisulfonic acid    0.972
DB11730    Ribociclib                  0.944
DB15442    Trilaciclib                 0.932
DB09073    Palbociclib                 0.888
DB00413    Pramipexole                 0.887
DB11092    Stannous fluoride           0.882
DB11732    Lasmiditan                  0.849
DB00360    Sapropterin                 0.844
DB06193    Pixantrone                  0.840
DB12941    Darolutamide                0.814

Datasets for LOX

Family              # of active    # of inactive    Training    Test
1.-.-.- (Source)    10,280         10,280           17,134      3,426
LOX (Target)        31             31               52          10

Candidate drugs for LOX.

ID               Drug Name
CHEMBL1201222    LISDEXAMFETAMINE
CHEMBL3989949    CENOBAMATE
CHEMBL1201141    IBUPROFEN LYSINE
CHEMBL1909285    NITREFAZOLE
CHEMBL2096646    AMINOSALICYLATE SODIUM
CHEMBL3989691    ELTROMBOPAG OLAMINE
CHEMBL3182733    PRAMIPEXOLE DIHYDROCHLORIDE
CHEMBL1201785    HEXAMINOLEVULINATE HYDROCHLORIDE
CHEMBL3989777    TRIENTINE HYDROCHLORIDE
CHEMBL2107703    FENTICONAZOLE NITRATE

Contributors

alperendalkiran, tuncadogan
