giancarlok / mhc_experiments Goto Github PK

View Code? Open in Web Editor NEW

7.0 7.0 6.0 8.76 MB

Jupyter Notebook 100.00%

mhc_experiments's People

Stargazers

Watchers

Forkers

hammer wavelets adiamb weizhiting jmrinaldi bs2537

mhc_experiments's Issues

Getting started with MHCflurry

IEDB database:

http://www.iedb.org/

One usually tries to train on IEDB 2009 and test on IEDB 2013

Link for IEDB 2009:

Link for IEDB 2013:

Data loading and preparation

df = pd.read_table("data.txt")

loads data into data the frame df

df.groupby('mhc').size().nlargest(11)

shows you list of the 11 most frequent alleles.

df_h=df[df['mhc']=='HLA-A-0201'][['sequence','meas']]

loads HLA-A-0201 data into a new dataframe df_h (in case you are working on allele-specific predictor)

df_h['log_meas']=1-np.log(df_h['meas'])/math.log(50000)

log-transforms the labels into [0,1] values.

Amino acid encoding

Index encoding

from mhcflurry.amino_acid import common_amino_acids

if you want to have a look at amino_acid.py here is the link: https://github.com/hammerlab/mhcflurry/blob/master/mhcflurry/amino_acid.py

def amino_acid_encoding(s, maxlen):
    a = 1+common_amino_acids.index_encoding([s],len(s)).flatten()
    return np.concatenate([a, np.zeros(maxlen-len(a),dtype=int)])

transforms peptide sequence string into integer vector, where empty is sent to 0, A is sent to 1 etc All vectors are of same length namely the length of the longest peptide sequence in the list aka maxlen, here it is 15.

df_h['encoded_peptides'] = df_h.sequence.apply(lambda seq: amino_acid_encoding(seq, max_len))

applies it to every peptide sequence in the data frame, where

max_len=df_h['sequence'].str.len().max()

is the maximum length of all peptides in the data.

Hotshot encoding
MHC sequence encoding

https://github.com/ANHIG/IMGTHLA/blob/Latest/fasta/A_prot.fasta
not sure though how to properly read this one in

MHC pseudo-sequence encoding
PFR (peptide flanking region) encoding
C- terminal encoding
N- terminal encoding
BLOSUM encoding
Conventional sparse encoding
9 mer reduction encoding
Peptide length encoding

df_h['peptide_length'] = df_h['sequence'].str.len()

creates a new column with respective peptide lengths.

stability (NetMHCstab)
cleavage sites of the human proteasome (NetChop)
Which one is the best encoding method ? Advantages vs Disadvantages

Training / test split and cross-validation

X_train = pd.DataFrame(list(df_h['encoded_peptides'].iloc[train])).values
Y_train = pd.DataFrame(list(df_h['log_meas'].iloc[train])).values
X_test = pd.DataFrame(list(df_h['encoded_peptides'].iloc[test])).values
Y_test = pd.DataFrame(list(df_h['log_meas'].iloc[test])).values

The choice of variables train and test determine what training / test split one wishes to have.

Keras

Useful packages to load

from keras import models, layers, optimisers
from keras.preprocessing import sequence
from keras.models import Model
from keras.layers import Dense, Dropout, Embedding, LSTM, Input, merge, Convolution1D, AveragePooling1D, Activation, Flatten

How to compile a bidirectional LSTM in Keras

sequence = Input( shape= (max_len,), dtype='int32')
embedding_dimension = 100
embedded = Embedding(21, embedding_dimension , mask_zero = True)(sequence)

First we have to embed the sequences we wish to use as the input for our LSTM.
Recall that max_len is defined above as maximal peptide length in the dataset.

Note that mask_zero = True cuts out all the 0s and reduces the input integer vector to the largest zero-free subvector. One wishes to cut out the 0s before they get embedded as otherwise they get treated as amino acids and thus skew the result or performance. Also note that when using input vectors of varying length one cannot use a FFN straight away, as it requires to have vectors of same lengths, thats why the methods in Nielsen et al. use either only 9 mer peptides as inputs or some technique that reduces all non - 9 mer peptides to 9 mer peptides, which of course comes at the cost of a loss of information. Here we don't need to worry about having input data that has same length as we are using an LSTM and LSTMs can handle sequences of varying length. So if one wishes to compare the performance of LSTM and say 3-layer FFN one cannot do it directly, but either (a) reduce all peptides to 9 mers, or (b) compare only on 9 mers, or (c) have a FFN for each peptide length and then combine the output each a FFN in some way.

LSTM_hidden_layer_size = 80
forwards = LSTM(LSTM_hidden_layer_size)(embedded)
backwards = LSTM(LSTM_hidden_layer_size, go_backwards=True)(embedded)
merged = Merge([forwards, backwards], mode='concat', concat_axis=-1)

The 'Merge' layer supports a number of pre-defined modes:

'sum' (default): element-wise sum
'concat': tensor concatenation. You can specify the concatenation axis via the argument 'concat_axis', as we did.
'mul': element-wise multiplication
'ave': tensor average
'dot': dot product. You can specify which axes to reduce along via the argument dot_axes.
'cos': cosine proximity between vectors in 2D tensors.

You can also pass a function as the 'mode' argument, allowing for arbitrary transformations such as:

merged = Merge([left_branch, right_branch], mode=lambda x, y: x - y)

see: https://keras.io/getting-started/sequential-model-guide/

dropout_probability = 0.5
after_dp = Dropout(dropout_probability)(merged)

adds some regularisation to out LSTM via dropout with probability 0.5.

output = Dense(1, activation='hard_sigmoid')(after_dp)
model = Model(input=sequence, output=output)

finalizes the model with a single output layer. The choice of the activation function seems to be quite relevant for the overall performance. There are several choices for activation functions in Keras such as: 'softplus' , 'softmax', 'softsign', 'linear', 'sigmoid', 'tanh', 'relu', and 'hard_sigmoid'. see: https://keras.io/activations/

model.compile(optimizer = 'Adam', loss='mean_squared_error')

finally complies the model. There are several choices for optimizers:
'Adagrad', 'Adamax', 'Adadelta', 'SGD', 'RMSprop', 'Adam' and 'Nadam'
see: https://keras.io/optimizers/

as well as several choices for the loss function:
'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error', 'mean_squared_logarithmic_error', 'squared_hinge', 'hinge', 'binary_crossentropy', 'categorical_crossentropy', 'sparse_categorical_crossentropy', 'kullback_leibler_divergence', 'poisson', 'cosine_proximity'.
see: https://keras.io/objectives/

How to train LSTM on batch

batch_size = 16
n_epochs = 100
epoch = 0
for epoch in range(n_epochs):
    for batch_idx in range(len(X_train) // batch_size):   
        model.train_on_batch(X_train[batch_idx * batch_size:(batch_idx + 1) * batch_size], 
                             Y_train[batch_idx * batch_size:(batch_idx + 1) * batch_size])
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)

Note that by using X_train and X_test, we implicitly use " index encoding " as our amino acid encoding.

Evaluating LSTM performance via AUC

You might be confused why on earth I am talking about AUC, as this is only used for evaluating categorical output, whereas the log-transformed binding is continuous. Well, as it seems, there is a conventional binding affinity cutoff of 500 nM that distinguishes "binders" from "non-binders". Thus the definition of the following function:

def measured_affinity_less_than_500(Y):
    IC50 = 50000**(1-Y)
    return IC50 < 500

Note that the function is defined in terms of Y, a numpy array, which has the advantage of being more memory efficient.

The reason we use AUC is merely because it has been used in previous literature, and thus we can directly compare our results to previous results in the literature, and hopefully outperform them.

At the end of the second for loop in the batch training (cf. previous bulletpoint), one can add the following print statement:

print(epoch, batch_idx, roc_auc_score(measured_affinity_less_than_500(Y_train),train_pred), 
          roc_auc_score(measured_affinity_less_than_500(Y_test),test_pred))

this gives you an overview what happens after each epoch. Note that the reason why used measured_affinity_less_than_500 is because roc_auc_score requires as a first argument a categorical 0-1 value, whereas the second argument can be continuous. Don't forget to include:

from sklearn.metrics import roc_auc_score

How to add convolutional filter to our LSTM
How to add attentional mechanism to out LSTM

netMHC: history & state of the art

General overview:

A general overview can be found here: http://www.cbs.dtu.dk/services/NetMHC/versions.php

There

netMHC-1.0

Sensitive quantitative predictions of peptide-MHC binding by a 'Query by Committee' artificial neural network approach. Buus S, Lauemoller SL, Worning P, Kesmir C, Frimurer T, Corbet S, Fomsgaard A, Hilden J, Holm A, Brunak S. Tissue Antigens., 62:378-84, 2003.

http://www.ncbi.nlm.nih.gov/pubmed/14617044

netMHC-2.0

Reliable prediction of T-cell epitopes using neural networks with novel sequence representations. Nielsen M, Lundegaard C, Worning P, Lauemoller SL, Lamberth K, Buus S, Brunak S, Lund O. Protein Sci., 12:1007-17, 2003

http://www.ncbi.nlm.nih.gov/pubmed/12717023

Improved prediction of MHC class I and class II epitopes using a novel Gibbs sampling approach.
Nielsen M, Lundegaard C, Worning P, Hvid CS, Lamberth K, Buus S, Brunak S, Lund O.
Bioinformatics, 20(9):1388-97, 2004.

http://www.ncbi.nlm.nih.gov/pubmed/14962912

Features:
conventional sparse encoding: 9x20 = 180 features
BLOSUM50 encoding: 9x20 = 180 features
Gibbs sampling: 9 features (optional: for comb II)

Model:
conventional 3-layer FFN with 2-10 hidden units

netMHC-3.0

Accurate approximation method for prediction of class I MHC affinities for peptides of length 8, 10 and 11 using prediction tools trained on 9mers. Lundegaard C, Lund O, Nielsen M. Bioinformatics, 24(11):1397-98, 2008.

http://bioinformatics.oxfordjournals.org/content/24/11/1397

NetMHC-3.0: accurate web accessible predictions of human, mouse and monkey MHC class I affinities for peptides of length 8-11 Lundegaard C, Lamberth K, Harndahl M, Buus S, Lund O, Nielsen M. Nucleic Acids Res. 1;36(Web Server issue):W509-12. 2008

http://www.ncbi.nlm.nih.gov/pubmed/18463140

netMHC-4.0

Gapped sequence alignment using artificial neural networks: application to the MHC class I system. Andreatta M, Nielsen M. Bioinformatics (2015)

http://www.ncbi.nlm.nih.gov/pubmed/26515819

Features:
9 mer reduction & conventional sparse encoding: 9x20 =180 features
9 mer reduction & BLOSUM encoding: 9x20 =180 features
deletion or insertion length of the 9 mer reduction
length & composition of PFR (peptide flanking region)
peptide length L: L \leq 8, L = 9, L = 10, L \geq 11

Model:
conventional 3-layer FFN with 5 hidden units

netMHCpan: history & state of the art

General overview

A general overview can be found here: http://www.cbs.dtu.dk/services/NetMHCpan/versions.php

netMHCpan-1.0

NetMHCpan, a Method for Quantitative Predictions of Peptide Binding to Any HLA-A and -B Locus Protein of Known Sequence. Nielsen M, et al. (2007) PLoS ONE 2(8): e796. doi:10.1371/journal.pone.0000796

http://www.cbs.dtu.dk/services/NetMHCpan-1.0/PONE2_796.pdf

Features:
a) conventional sparse encoding for both: peptide and HLA pseudo-sequence
b) BLOSUM encoding for both: peptide and HLA pseudo-sequence
c) sparse encoding for peptide and BLOSUM encoding for HLA pseudo-sequence

Model:

conventional 3-layer FFN with 22-86 hidden units

netMHCpan-2.0

NetMHCpan - MHC class I binding prediction beyond humans
Ilka Hoof, Bjoern Peters, John Sidney, Lasse Eggers Pedersen, Ole Lund, Soren Buus, and Morten Nielsen. Immunogenetics 61.1 (2009): 1-13

http://www.ncbi.nlm.nih.gov/pubmed/19002680

Features: (same as netMHCpan-1.0: only extended to non-human alleles)
a) conventional sparse encoding for both: peptide and MHC pseudo-sequence
b) BLOSUM encoding for both: peptide and MHC pseudo-sequence
c) sparse encoding for peptide and BLOSUM encoding for MHC pseudo-sequence

Model: (same as netMHCpan-1.0)

conventional 3-layer FFN with 22-86 hidden units.

netMHCpan-3.0

NetMHCpan-3.0: improved prediction of binding to MHC class I molecules integrating information from multiple receptor and peptide length data sets Morten Nielsen and Massimo Andreatta
Genome Medicine (2016): 8:33

https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-016-0288-x

Features:

9 mer reduction of peptides & BLOSUM encoding of both peptides and MHC pseudo-sequence:
deletion or insertion length of the 9 mer reduction
length & composition of PFR (peptide flanking region)
peptide length L: L \leq 8, L = 9, L = 10, L \geq 11

Model:

conventional 3-layer FFN with 56 or 66 hidden units.

netMHC

General overview

A general overview can be found here: http://www.cbs.dtu.dk/services/NetMHC/versions.php

netMHC-1.0

References

http://www.ncbi.nlm.nih.gov/pubmed/14617044

Method

netMHC-2.0

References

http://www.ncbi.nlm.nih.gov/pubmed/12717023

http://www.ncbi.nlm.nih.gov/pubmed/14962912

Method

Features:
conventional sparse encoding: 9x20 = 180 features
BLOSUM50 encoding: 9x20 = 180 features
Gibbs sampling: 9 features (optional: for comb II)

Model:
conventional 3-layer FFN with 2-10 hidden units

netMHC-3.0

References

http://bioinformatics.oxfordjournals.org/content/24/11/1397

http://www.ncbi.nlm.nih.gov/pubmed/18463140

Method

netMHC-4.0

References

Gapped sequence alignment using artificial neural networks: application to the MHC class I system. Andreatta M, Nielsen M. Bioinformatics (2015)

http://www.ncbi.nlm.nih.gov/pubmed/26515819

Method

Model:
conventional 3-layer FFN with 5 hidden units

netMHCpan

General overview

A general overview can be found here: http://www.cbs.dtu.dk/services/NetMHCpan/versions.php

netMHCpan-1.0

References

http://www.cbs.dtu.dk/services/NetMHCpan-1.0/PONE2_796.pdf

Method

Model:

conventional 3-layer FFN with 22-86 hidden units

netMHCpan-2.0

References

NetMHCpan - MHC class I binding prediction beyond humans
Ilka Hoof, Bjoern Peters, John Sidney, Lasse Eggers Pedersen, Ole Lund, Soren Buus, and Morten Nielsen. Immunogenetics 61.1 (2009): 1-13

http://www.ncbi.nlm.nih.gov/pubmed/19002680

Method

Model: (same as netMHCpan-1.0)

conventional 3-layer FFN with 22-86 hidden units.

netMHCpan-3.0

References

https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-016-0288-x

Method

Features:

Model:

conventional 3-layer FFN with 56 or 66 hidden units.

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.