Swift Dock 🚀

In this study, we explored several machine learning (ML) models that predict the docking scores of ligands against a specific target protein, with the aim of reducing the number of explicit docking calculations required. Our primary goal is a regression model that predicts the docking score of every ligand in a chemical library for a given target protein, trained only on scores obtained by explicitly docking a small subset of the molecules.

Among the ML models:

🧠 An LSTM-based neural network, an architecture common in sequence tasks such as natural language processing and speech recognition. Combined with an attention mechanism, it extracts features from the ligand representation. It is implemented in PyTorch.
🌳 XGBoost, Decision Tree Regression, and Stochastic Gradient Descent regression, from the XGBoost and scikit-learn libraries.
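
To illustrate what an attention mechanism adds on top of a recurrent encoder, here is a minimal, framework-free sketch of attention pooling: each per-token hidden state gets a score, the scores are normalized with a softmax, and the weighted sum becomes one fixed-size vector. This is an illustration under stated assumptions, not the architecture used in this repository; the helper name attention_pool, the dot-product scoring, and the toy vectors are all hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Pool a sequence of hidden states into one vector (hypothetical helper).

    hidden_states: list of equal-length vectors, e.g. one per ligand token.
    query: vector of the same length; in a real model it would be learned.
    """
    # Dot-product score between the query and every hidden state.
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(query)
    # Weighted sum of the hidden states, one output value per dimension.
    pooled = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return pooled, weights


# Toy values for illustration only; they are not real model activations.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(states, query=[1.0, 1.0])
print(pooled, weights)
```

In a trained network the query (or a small scoring layer) is learned, so the weights indicate which parts of the ligand sequence the model relies on most.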

Setting up the Environment 🛠️

Ensure Python 3.7 is installed 🐍
Create a virtual environment and execute pip install -r requirements.txt 📦
Navigate to 'swifty' and run sudo chmod -R 777 logs 📑

Setting up the Environment - Apple Silicon 🍎

Ensure Python 3.8 is installed 🐍
Create a virtual environment and execute pip install -r apple-silcon-requirements.txt 📦
Navigate to 'swifty' and run sudo chmod -R 777 logs 📑
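
The two setup variants above differ only in the Python version and the requirements file. The sketch below picks the right pair for the current machine. It is a hypothetical helper, not part of the repository; it assumes a natively running interpreter (which reports Darwin/arm64 on Apple Silicon) and treats the stated Python versions as minimums, which is also an assumption.

```python
import platform
import sys


def setup_plan(system, machine):
    """Return (minimum Python version, pip command) for a platform.

    The version numbers and file names are the ones given in the setup steps;
    the function itself is a hypothetical helper, not part of the repository.
    """
    if system == "Darwin" and machine == "arm64":
        # Apple Silicon Macs report Darwin/arm64 under a native interpreter.
        return (3, 8), ["pip", "install", "-r", "apple-silcon-requirements.txt"]
    return (3, 7), ["pip", "install", "-r", "requirements.txt"]


min_version, pip_command = setup_plan(platform.system(), platform.machine())
# Assumption: the documented version is treated as a lower bound.
version_ok = sys.version_info[:2] >= min_version
print("Python version OK:", version_ok)
print("Install with:", " ".join(pip_command))
```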

Training Using LSTM 🧠

Build & Validate 🛠️

Add your target to the 'dataset' folder. Follow the format in sample_input.csv.
Example: Let's say you want to train the LSTM model on sample_input with the mac descriptor and a training set size of 50, without cross-validation. First, navigate to src/models and run the command below. Note: the available descriptors are mac, onehot, and morgan_onehot_mac.

Command

python main_lstm.py --input sample_input --descriptors mac --training_sizes 50 --cross_validation False

Command Format

python main_lstm.py --input <YOUR_INPUT_FILE> --descriptors <DESCRIPTOR> --training_sizes <TRAINING_SIZE> --cross_validation <CROSS_VALIDATION>
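
To make the command format concrete, here is a minimal argparse sketch that accepts the same flags. It is not the parser in main_lstm.py; the helper names str_to_bool and build_parser are hypothetical, and the use of nargs="+" (one or more values per flag) is an assumption about how such flags are usually handled. It also shows why a textual True/False value needs explicit parsing.

```python
import argparse


def str_to_bool(value):
    """Parse 'True'/'False' text; plain bool() would treat 'False' as true."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser():
    parser = argparse.ArgumentParser(description="Hypothetical training CLI sketch")
    # nargs="+" lets each flag take one or more space-separated values.
    parser.add_argument("--input", nargs="+", required=True)
    parser.add_argument(
        "--descriptors",
        nargs="+",
        required=True,
        choices=["mac", "onehot", "morgan_onehot_mac"],
    )
    parser.add_argument("--training_sizes", nargs="+", type=int, required=True)
    parser.add_argument("--cross_validation", type=str_to_bool, default=False)
    return parser


# Parse the same arguments as the example command in this section.
args = build_parser().parse_args(
    "--input sample_input --descriptors mac "
    "--training_sizes 50 --cross_validation False".split()
)
print(args)
```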

This produces a result directory containing the 5 categories listed below. Each file name follows the format lstm_target_descriptor_training_size.

project_info: Details like training size and durations.
serialized_models: The trained model, saved after training.
test_predictions: Each docking score with the corresponding model prediction.
testing_metrics: Test-set metrics such as R-squared and mean absolute error.
validation_metrics: Metrics from 5-fold cross-validation (only if --cross_validation True).
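
For readers unfamiliar with the two metrics named under testing_metrics, the sketch below computes them from paired docking scores and predictions, the kind of data stored in test_predictions. The function names are hypothetical and the numbers are made up for illustration; they are not results from this project.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between docking scores and predictions."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up numbers for illustration only; not results from this project.
docking_scores = [-7.0, -8.0, -9.0, -10.0]
model_predictions = [-7.5, -8.0, -9.5, -10.0]

mae = mean_absolute_error(docking_scores, model_predictions)
r2 = r_squared(docking_scores, model_predictions)
print(f"MAE = {mae:.3f}, R-squared = {r2:.3f}")
```

A lower mean absolute error and an R-squared closer to 1 both indicate predictions that track the explicit docking scores more closely.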

More examples

Training Using Multiple Descriptors

python main_lstm.py --input sample_input --descriptors mac morgan_onehot_mac --training_sizes 50 --cross_validation False

Training Using Multiple Descriptors and Multiple Training set sizes

python main_lstm.py --input sample_input --descriptors mac morgan_onehot_mac --training_sizes 50 100 --cross_validation False

Training Using Multiple Descriptors, Multiple Training set sizes and Multiple Targets

python main_lstm.py --input sample_input sample_input_2 --descriptors mac morgan_onehot_mac --training_sizes 50 100 --cross_validation False
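
When several targets, descriptors, and training set sizes are passed at once, the number of runs grows multiplicatively. The sketch below enumerates the combinations for the last command and builds names from the documented pattern lstm_target_descriptor_training_size. That the script trains exactly one model per combination is an assumption, and the variable names are illustrative.

```python
from itertools import product

targets = ["sample_input", "sample_input_2"]
descriptors = ["mac", "morgan_onehot_mac"]
training_sizes = [50, 100]

# Assumed behaviour: one training run per (target, descriptor, size) combination.
runs = list(product(targets, descriptors, training_sizes))

# Names built from the documented pattern lstm_target_descriptor_training_size.
run_names = [f"lstm_{t}_{d}_{n}" for t, d, n in runs]

print(len(runs), "runs")
for name in run_names:
    print(name)
```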

Making Predictions with LSTM 🎯

Run

python lstm_inference.py --input_file <YOUR_INPUT_FILE> --output_dir <YOUR_OUTPUT_DIRECTORY> --model_name <YOUR_MODEL_NAME>

Ensure that <YOUR_INPUT_FILE> follows the format of molecules_for_prediction.csv in the 'dataset' folder. Example:

python lstm_inference.py --input_file molecules_for_prediction.csv --output_dir prediction_results --model_name lstm_target_mac_50_model.pt
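
The model name passed to --model_name encodes the training configuration. The sketch below splits such a name back into its parts. It is a hypothetical helper: the layout prefix_target_descriptor_size_model.pt is inferred from the single example above, not from the repository's code, so treat it as an assumption.

```python
DESCRIPTORS = ["morgan_onehot_mac", "onehot", "mac"]  # longest first


def parse_model_name(model_name):
    """Split a name such as 'lstm_target_mac_50_model.pt' into its parts.

    Hypothetical helper: the '<prefix>_<target>_<descriptor>_<size>_model.pt'
    layout is inferred from one documented example, not from the code.
    """
    suffix = "_model.pt"
    if not model_name.endswith(suffix):
        raise ValueError(f"unexpected model file name: {model_name}")
    stem = model_name[: -len(suffix)]
    head, _, size_text = stem.rpartition("_")
    training_size = int(size_text)
    # Descriptors contain underscores themselves, so match them as whole
    # suffixes, trying the longest first ('morgan_onehot_mac' ends in 'mac').
    for descriptor in DESCRIPTORS:
        if head.endswith("_" + descriptor):
            head = head[: -len(descriptor) - 1]
            break
    else:
        raise ValueError(f"no known descriptor in: {model_name}")
    prefix, _, target = head.partition("_")
    return {
        "model": prefix,
        "target": target,
        "descriptor": descriptor,
        "training_size": training_size,
    }


info = parse_model_name("lstm_target_mac_50_model.pt")
print(info)
```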

Training Using other models (from scikit-learn) 🌳

Add your target to the 'dataset' folder. It should match the format of sample_input.csv.
Run the command below to prepare the dataset:

Example

python create_fingerprint_data.py --input sample_input --descriptors mac

Command Format

python create_fingerprint_data.py --input <YOUR_INPUT_FILE> --descriptors <DESCRIPTOR>

More examples for creating the datasets

Create dataset for training using Multiple Descriptors

python create_fingerprint_data.py --input sample_input --descriptors mac morgan_onehot_mac

Run the command below to train the model:

python main_ml.py --input sample_input --descriptors mac --training_sizes 50 --regressor sgreg

Command Format

python main_ml.py --input <YOUR_INPUT_FILE> --descriptors <DESCRIPTOR> --training_sizes <TRAINING_SIZE> --regressor <REGRESSOR>

Note: The available descriptors are mac, morgan_onehot_mac, and onehot. The available regressors are sgreg, xgboost, and decision_tree.
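
One plausible way to connect the regressor names to library classes is a small registry, sketched below. The mapping is an assumption: SGDRegressor, DecisionTreeRegressor, and XGBRegressor are real classes in scikit-learn and XGBoost, but whether main_ml.py uses exactly these, and with which hyperparameters, is not stated here. The helper name make_regressor is hypothetical.

```python
import importlib

# Assumed mapping from the documented --regressor names to library classes.
# The classes exist in scikit-learn and XGBoost, but whether the repository
# uses exactly these (and with which hyperparameters) is not stated.
REGRESSOR_PATHS = {
    "sgreg": ("sklearn.linear_model", "SGDRegressor"),
    "decision_tree": ("sklearn.tree", "DecisionTreeRegressor"),
    "xgboost": ("xgboost", "XGBRegressor"),
}


def make_regressor(name, **kwargs):
    """Instantiate a regressor by its command-line name (hypothetical helper)."""
    if name not in REGRESSOR_PATHS:
        valid = ", ".join(sorted(REGRESSOR_PATHS))
        raise ValueError(f"unknown regressor {name!r}; choose from: {valid}")
    module_name, class_name = REGRESSOR_PATHS[name]
    # Import lazily so only the library that is actually requested is needed.
    module = importlib.import_module(module_name)
    return getattr(module, class_name)(**kwargs)
```

With the libraries installed, make_regressor("decision_tree", max_depth=5) would return a scikit-learn estimator exposing the usual fit and predict methods.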

More examples

Training Using Multiple Descriptors

python main_ml.py --input sample_input --descriptors mac morgan_onehot_mac --training_sizes 50 --regressor sgreg

Training Using Multiple Descriptors and Multiple Training set sizes

python main_ml.py --input sample_input --descriptors mac morgan_onehot_mac --training_sizes 50 100 --regressor sgreg

Training Using Multiple Descriptors, Multiple Training set sizes and Multiple Models

python main_ml.py --input sample_input --descriptors mac morgan_onehot_mac --training_sizes 50 100 --regressor sgreg xgboost

This produces a result directory with categories and file naming similar to those described in the LSTM section.
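
The workflow in this section has two steps that must run in order: prepare the fingerprint data, then train. The sketch below assembles both commands from one set of parameters so the descriptors stay consistent between the steps. The function name build_commands is hypothetical; the script names and flags are the ones documented above.

```python
def build_commands(input_name, descriptors, training_sizes, regressors):
    """Return the two documented commands, in order, as argument lists."""
    prepare = [
        "python", "create_fingerprint_data.py",
        "--input", input_name,
        "--descriptors", *descriptors,
    ]
    train = [
        "python", "main_ml.py",
        "--input", input_name,
        "--descriptors", *descriptors,
        "--training_sizes", *[str(n) for n in training_sizes],
        "--regressor", *regressors,
    ]
    return [prepare, train]


commands = build_commands(
    input_name="sample_input",
    descriptors=["mac", "morgan_onehot_mac"],
    training_sizes=[50, 100],
    regressors=["sgreg", "xgboost"],
)
for command in commands:
    # Print each step as a copy-pasteable line.
    print(" ".join(command))
    # To actually run a step: subprocess.run(command, check=True)
```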

Making Predictions with other Models 🎯

Your input CSV should match the format of molecules_for_prediction.csv in the 'dataset' folder.
Run

python other_models_inference.py --input_file <YOUR_INPUT_FILE> --output_dir <YOUR_OUTPUT_DIRECTORY> --model_name <YOUR_MODEL_NAME>

Ensure that <YOUR_INPUT_FILE> follows the format of molecules_for_prediction.csv in the 'dataset' folder.

Source repository: abdulsalam-bande / swifty (GitHub)
