Giter VIP home page Giter VIP logo

cov_seqs's Introduction

cov_seqs

Code for reproducing the results in Zhang, et. al. Predicting binding affinities of emerging variants of SARS-CoV-2 using spike protein sequencing data: Observations, caveats and recommendations (Briefings in bioinformatics 2022, in press)

Overview

protocol

This repository provides the code to reproduce the results of the following models: Smith-Waterman alignment score -> topological regression; VHSE -> 1D CNN; Bi-LSTM - FC layers or ridge regression; ProtBERT embedding -> FC layers or ridge regression.

Special dependencies:

Model performances are evaluated with module myToolbox. https://github.com/Ribosome25/myToolbox

Navigate to the site-package dir and clone the repo. such as ~/conda/envs/<env name>/lib/python3.8/site-packages$ git clone https://github.com/Ribosome25/myToolbox.git

Argument parser TAP: pip install typed-argument-parser

scikit-bio for Smith-Waterman alignment score calculation pip install scikit-bio. If erros pop up when importing skbio, please try early version scikit-bio==0.5.6

fast-parquet for loading and saving pd.DF to parquet files pip install fastparquet==0.5.0

mpi4py for MPI parallel computing: pip install mpi4py==3.0.3

keras_tuner for random search ANN and CNN parameters.

Improtant library versions:

python == 3.7.10
keras == 2.4.3
tensorflow == 2.3.0
torch == 1.10.0
pandas == 1.2.3
scikit-learn == 1.0.2
numpy == 1.21.6
cuda == 10.2

Data availability:

Some essential processed data can be downloaded via https://texastechuniversity-my.sharepoint.com/:u:/g/personal/ruibo_zhang_ttu_edu/EVlhSLZMzq5Crphdn6bw7qoBAnW1pkRc7bX7wNZxDTw4xg?e=yzx8ML

Other processed data available on request.

Data Preprocessing and Embedding Generation

Distance calculation

Smith-Waterman distances (negative alignment scores) and Levenshtein distances are computed and used in subsequent modeling. For subsampled datasets, distances are computed at runtime. For the whole set, distances are precomputed using MPI parallel computing for time concerns.

Check out scripts/SW_distance.sh and mpi_SW_distance.py for using MPI distance precomputation.

LSTM / BERT embedding generation

LSTM is modified from https://github.com/brianhie/viral-mutation. bin/combinatorial_fitness.py, bin/mutation.py and bin/cov.py are modified for our usage. The modified files are provided in ref_models/science2021.

An example of using the embedding generation: after cloning Brian Hie's repo, replace bin/combinatorial_fitness.py, bin/mutation.py and bin/cov.py with the code provided here and run

python ./viral-mutation/bin/cov.py bilstm --checkpoint ../models/cov.hdf5 --self-embed-fitness

to generate BiLSTM embeddings. Modify the model types and checkpoint path as needed.

BERT embedding is generated with ref_models/ProtBert/ProtBert.py.

VHSE

VHSE encoding is generated with MPI parallelly.

Run ref_models/mpi_VHSE_gen.py with input and output path as arguments. See ref_models/args.py for arguments details. Checkout ref_models/mpi_vhse.py for details.

Experiments and Results

The script for most of the modeling and performance evaluation is provided in scripts/model_CoV_seq.py.

A set of functions is provided for testing each model. Examples of how to use these functions are provided after the if __name__ == "__main__": statement.

Topological regression

Refer to model_CoV_seq.py and model_CoV_seq_clf

Examples:

def SW_distance_holdout(n_anchors=100, mdl=LR(), metric='Levenshtein', knn=10,
is for 0.5% and 5% subsampled dataset.

and

def SW_distance_fullset(n_anchors=200, mdl=LR(), knn=10, rbf_gamma=None):
is for whole dataset (pre-computed distances).

Calling these functions at

# (3) Topological Regression

Topo Reg using CNN as similarity metric: model_CoV_CNN_topo.py

VHSE - CNN

For model output, performance, and embedding generation: model_CoV_VHSE_CNN and model_CoV_VHSE_CNN_clf

Grid search parameter tuning: model_VHSE_CNN_gridsearch_clf

Random search parameter tuning: keras_tuner_CNN_run.py

LSTM and ProtBERT

Embeddings are first generated.

Ridge and linear regression performance are evaluated in model_CoV_seq.py and model_CoV_seq_clf

Example:

def embedding_ref_mdl(ref_mdl = LR(), dataset='ProtBert', pooling='avg'):

  • Pass scikit-learn Estimators to ref_mdl to evaluate different regressors (or clfs)
  • Pass 'sci' to dataset for LSTM embedding, and "ProtBert" for BERT embedding.
  • pooling indicates pooling method of the embeddings. See
    for pl in ['avg', 'last', 'cscs', 'cls']:
  • Comment / uncomment L378 - L382 to choose between the binding dataset or expression dataset (different target values).

FC-layers parameter grid search and performances evaluation are in model_NLP_fc_gridsearch.py and model_NLP_fc_gridsearch_clf.py Random search in keras_tuner_ANN_run.py

KNN models

Evaluated in model_CoV_seq.py and model_CoV_seq_clf

Whole dataset: knn_ref_holdout_model()

def knn_ref_holdout_model():
.

Subsampled dataset: knn_ref_small_holdout_model()

Clustering scatter plot and visualization

See visualization/GisAid1203.ipynb

mds-variant

cov_seqs's People

Contributors

ribosome25 avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.