Predicting gene expression levels from DNA sequences and post-transcriptional information with transformers

Built with Keras, TensorFlow and NumPy.

1. Google Drive Organization

Our work took place in the Google Drive environment, so that we could collaborate with each other and keep development fast and easy. In particular, we used Google Colab notebooks: everyone can modify them once they are shared on the drive, and thanks to import-ipynb a notebook can be imported into the workflow of another one, increasing the modularity and reproducibility of our code.

All the notebooks are contained in the Colab subfolder of our drive. The folder /Classes contains the model classes and other utility classes such as the DataManager. These notebooks are used by the various workflow notebooks located in /WORKFLOW_GPU and /WORKFLOW_TPU. The most important notebooks are:

  • DataManager.ipynb: used to manage all the datasets;
  • CNN1D.ipynb: contains a class that manages all our convolutional solutions (DivideEtImpera included) and a reproduction of the original "Xpresso" neural network;
  • BioLSTM.ipynb: contains a class that manages our main LSTM solutions;
  • Transformer.ipynb: contains a class that handles our transformer solutions.

The Workflow notebooks are organized as follows:

  • at the beginning, a call to the DataManager retrieves the required data;
  • then, we choose the model best suited to our research among those available in the projCNN1D, projTransformer, and BioLSTM classes.

The construction of our workflows makes it possible to implement any kind of deep neural network and to change parameters in a matter of seconds. The workflow name indicates the model and the data used:

{
    "P" : "Promoter", 
    "H" : "Half−Life", 
    "T" : "TranscriptionFactors",
    "M" : "MicroRNA"
}

Example: Workflow_CNN1D_PH_GPU uses one or more models defined in CNN1D, takes Promoter and Half-life features, and deploys the models on GPU.
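As a quick illustration of this naming convention, a hypothetical helper (not part of the repository) could decode a workflow name as follows:

FEATURES = {"P": "Promoter", "H": "Half-Life", "T": "TranscriptionFactors", "M": "MicroRNA"}

def decode_workflow_name(name):
    # e.g. "Workflow_CNN1D_PH_GPU" -> model, feature codes, device
    _, model, codes, device = name.split("_")
    return model, [FEATURES[c] for c in codes], device

print(decode_workflow_name("Workflow_CNN1D_PH_GPU"))
# ('CNN1D', ['Promoter', 'Half-Life'], 'GPU')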

DeepLncLoc from scratch

This notebook reimplements from scratch the full pipeline of the DeepLncLoc embedding; it can be found in the ./Colab/varie folder. It can be used to retrain the Word2Vec model and to generate datasets with different parameters.
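As a rough idea of what such an embedding pipeline looks like, the sketch below splits sequences into k-mers, trains a small Word2Vec model and averages the k-mer vectors per subsequence. It assumes gensim >= 4.0, and the values of k, n_subseq and vector_size are illustrative, not the ones used in the notebook.

from gensim.models import Word2Vec
import numpy as np

def to_kmers(seq, k=3):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

sequences = ["ACGTACGTAC", "TTGACCGTAA"]            # toy promoter sequences
corpus = [to_kmers(s) for s in sequences]

# Train a small skip-gram Word2Vec model on the k-mer "sentences"
w2v = Word2Vec(sentences=corpus, vector_size=64, window=5, min_count=1, sg=1)

def embed(seq, k=3, n_subseq=5):
    # Split the k-mer list into n_subseq chunks and average the vectors of each chunk
    kmers = to_kmers(seq, k)
    chunks = np.array_split(kmers, n_subseq)
    return np.stack([np.mean([w2v.wv[m] for m in chunk], axis=0) for chunk in chunks])

print(embed(sequences[0]).shape)                    # (n_subseq, vector_size) = (5, 64)

The resulting (timesteps, features) matrix is the kind of input consumed by the LSTM solution described below (timestep = 210, features = 64 in our case).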

2. Classes

2.1 DataManager class

The DataManager class manages and imports the dataset in various forms, depending on the model that will be used for training.

dm = DataManager(
    datadir         = datadir, 
    transformer     = False, 
    micro           = False, 
    tf              = False, 
    datadir_micro   = "Dataset/microRNA FINALE", 
    datadir_tf      = "Dataset/dataset_aumentati", 
    remove_indicted = False,
    DeepLncLoc      = False,
)

dataset = dm.get_train(
    np_format  = True,  # boolean to transform in numpy
    translated = False, # sequence data translated into categorical (e.g. for transformers)
    micro      = False  # if True, it returns also microRNA
)

| Parameter | Description | Values |
|---|---|---|
| datadir | directory of the dataset | str |
| transformer | translate sequence data into categorical form | boolean |
| micro | also retrieve microRNA data | boolean |
| tf | also retrieve Transcription Factor data | boolean |
| datadir_micro | directory of the microRNA data | str |
| datadir_tf | directory of the Transcription Factor data | str |
| remove_indicted | keep only sequences of length 20,000 | boolean |
| DeepLncLoc | retrieve DeepLncLoc-embedded sequences (e.g. for our LSTM solution) | boolean |

The DataManager reads the different .h5 format datasets from the Dataset/ folder.

The standard data (Xpresso original dataset) are taken from:

- "Dataset/pM10Kb_1KTest/train.h5"
- "Dataset/pM10Kb_1KTest/valid.h5"
- "Dataset/pM10Kb_1KTest/test.h5"

The data integrated with microRNA are taken from:

- "Dataset/microRNA FINALE/train.h5"
- "Dataset/microRNA FINALE/valid.h5"
- "Dataset/microRNA FINALE/test.h5"

The data integrated with Transcription Factors are taken from:

- "Dataset/dataset_aumentati/train_tf.h5"
- "Dataset/dataset_aumentati/validation_tf.h5"
- "Dataset/dataset_aumentati/test_tf.h5"

The standard translated data (for Transformers) are taken from:

- "Dataset/dataset_aumentati/translated_transformers.h5"

The standard translated data (for Transformers) integrated with Transcription Factors are taken from:

- "Dataset/dataset_aumentati/translated_transformers_tf.h5"

2.2 projCNN1D class

The projCNN1D class exposes the models based on 1D convolutional neural networks.

model = projCNN1D(
    checkpoint_dir     = "",
    model_type         = "Xpresso",
    n_epochs           = 10, 
    batch_size         = 32, 
    learning_rate      = 5e-4,
    momentum           = 0.9,
    CNN_input          = (10_500, 4),
    miRNA_input        = (2064,),
    lr_reduction_epoch = None,
    dropout_rate       = 0.4,
    shuffle            = True,
    logdir             = None,
    patience           = 30,
    opt                = "SGD",
    loss               = "mse"
)
| Parameter | Description | Values |
|---|---|---|
| model_type | choose the correct model for each dataset | 'Xpresso', 'Xpresso_nohalf', 'Xpresso_TF', 'Xpresso_micro', 'Xpresso_ort', 'DivideEtImpera', 'DivideEtImpera_TF', 'DivideEtImpera_onlyPromo' |
| n_epochs | number of epochs | int |
| batch_size | size of the batch | int |
| CNN_input | shape of the sequence data | (n, m) |
| dropout_rate | dropout rate of the Dense layers | float |
| logdir | TensorBoard directory; if set to None, the tensorboard_callback is not used by the model | str |
| checkpoint_dir | directory in which the best model is saved | str |
| patience | patience for the early stopping | int |
| learning_rate | learning rate | float |
| lr_reduction_epoch | milestone epoch at which lr_scheduler reduces the learning rate | int |
| momentum | momentum for the SGD optimizer | float |
| shuffle | shuffle the data | boolean |
| opt | choose the optimizer | 'SGD', anything else (= 'Adam') |

| model_type | Description |
|---|---|
| Xpresso | official Xpresso model |
| Xpresso_nohalf | Xpresso with only promoters as input |
| Xpresso_TF | Xpresso with Promoter, Half-life and TF data as input |
| Xpresso_micro | Xpresso with Promoter, Half-life and microRNA data as input |
| Xpresso_ort | equal to Xpresso, but with an orthogonal initialization of the weights |
| DivideEtImpera | DivideEtImpera with only Promoter and Half-life data as input |
| DivideEtImpera_TF | DivideEtImpera model with TF data as additional input |
| DivideEtImpera_onlyPromo | DivideEtImpera with only promoters as input |
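lr_reduction_epoch marks the epoch at which the learning rate is lowered. A minimal sketch of such a milestone scheduler with tf.keras (the reduction factor of 0.1 is an assumption, not necessarily the one used in the class) is:

import tensorflow as tf

def make_milestone_scheduler(lr_reduction_epoch, factor=0.1):
    def schedule(epoch, lr):
        # keep the current lr until the milestone epoch, then scale it once
        return lr * factor if epoch == lr_reduction_epoch else lr
    return tf.keras.callbacks.LearningRateScheduler(schedule)

# e.g. model.fit(..., callbacks=[make_milestone_scheduler(70)])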

2.3 BioLSTM class

The BioLSTM class exposes the models based on LSTM block(s).

model = BioLSTM(
    model_type = 'classic', 
    n_epochs   = 50, 
    batch_size = 128, 
    timestep   = 210, 
    features   = 64,
    datadir    = 'Dataset/embedded_data', 
    opt        = 'adam', 
    lr         = 3e-4
)
| Parameter | Description | Values |
|---|---|---|
| model_type | choose the correct model for each dataset | 'classic', 'tf', 'only promoter' |
| n_epochs | number of epochs | int |
| batch_size | size of the batch | int |
| timestep | number of timesteps in the sequence data | int |
| features | number of features per timestep in the sequence data | int |
| datadir | directory of the embedded dataset | str |
| opt | optimizer used by the model | 'adam', 'sgd' |
| lr | learning rate | float |

| model_type | Description |
|---|---|
| classic | BioLSTM model with promoter and Half-life data as input |
| tf | BioLSTM with Promoter, Half-life and TF data as input |
| only promoter | BioLSTM with only promoters as input |
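A hedged usage sketch, combining the DeepLncLoc-embedded data exposed by the DataManager (see 2.1) with the best hyperparameters of Section 6; whether BioLSTM exposes the same train_model(...) interface as projCNN1D is an assumption:

dm = DataManager(datadir=datadir, DeepLncLoc=True)   # DeepLncLoc-embedded sequences

lstm = BioLSTM(
    model_type = "classic",
    n_epochs   = 140,
    batch_size = 128,
    timestep   = 210,       # subsequences per promoter
    features   = 64,        # embedding depth
    datadir    = "Dataset/embedded_data",
    opt        = "adam",
    lr         = 3e-4,
)
# lstm.train_model(...) is assumed to follow the pattern shown in Section 3.4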

2.4 projTFNet class

The projTFNet class exposes the fully connected neural network used to perform regression on transcription factor data only.

model = projTFNet(
    checkpoint_dir     = "",
    model_type         = "TF",
    n_epochs           = 10, 
    batch_size         = 32, 
    learning_rate      = 5e-4,
    momentum           = 0.9,
    lr_reduction_epoch = None,
    shuffle            = True,
    logdir             = None,
    patience           = 30
)
| Parameter | Description | Values |
|---|---|---|
| model_type | choose the correct model for each dataset | 'TF' |
| n_epochs | number of epochs | int |
| batch_size | size of the batch | int |
| CNN_input | shape of the sequence data | (n, m) |
| dropout_rate | dropout rate of the Dense layers | float |
| logdir | TensorBoard directory | str |
| checkpoint_dir | directory in which the best model is saved | str |
| patience | patience for the early stopping | int |
| learning_rate | learning rate | float |
| lr_reduction_epoch | milestone epoch at which lr_scheduler reduces the learning rate | int |
| momentum | momentum for the SGD optimizer | float |
| shuffle | shuffle the data | boolean |
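A hedged sketch of a TF-only run: the tf=True flag makes the DataManager also return Transcription Factor features (see 2.1), the epoch count below is illustrative, and the training call is assumed to mirror the projCNN1D interface of Section 3:

dm = DataManager(datadir=datadir, tf=True, datadir_tf="Dataset/dataset_aumentati")

tf_net = projTFNet(
    checkpoint_dir = "tf_only/",     # illustrative path
    model_type     = "TF",
    n_epochs       = 100,            # illustrative value
    batch_size     = 32,
    learning_rate  = 5e-4,
    momentum       = 0.9,
    patience       = 30,
)
# tf_net.train_model(...) is assumed to follow the pattern shown in Section 3.4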

2.5 projTransformer class

The projTransformer (Classes/Transformer.ipynb) class exposes the Transformers-like models.

model = projTransformer(
    checkpoint_dir     = "",
    model_type         = "best",
    n_epochs           = 300, 
    batch_size         = 32, 
    learning_rate      = 1e-4,
    momentum           = 0.9,
    maxlen             = 10500,
    embed_dim          = 32,
    num_heads          = 4,
    ff_dim             = 64,
    vocab_size         = 5,
    dense              = 64,
    lr_reduction_epoch = None,
    dropout_rate       = 0.1,
    t_rate             = 0.1,
    patience           = 20,
    optimizer          = "SGD",
    warmup_steps       = 8000,
    shuffle            = True,
    loss               = "mse",
    logdir             = None
)
| Parameter | Description | Values |
|---|---|---|
| model_type | choose the correct model for each dataset | 'DeepLncLoc', 'DeepLncLoc_TF', 'DeepLncLoc_Promoter' |
| n_epochs | number of epochs | int |
| batch_size | size of the batch | int |
| maxlen | length of the sequence data | int |
| dropout_rate | dropout rate of the Dense layers | float |
| logdir | TensorBoard directory | str |
| checkpoint_dir | directory in which the best model is saved | str |
| patience | patience for the early stopping | int |
| learning_rate | learning rate | float |
| lr_reduction_epoch | milestone epoch at which lr_scheduler reduces the learning rate | int |
| momentum | momentum for the SGD optimizer | float |
| shuffle | shuffle the data during training | boolean |
| embed_dim | dimension of the embedding (depth of w2v) | int |
| num_heads | number of attention heads | int |
| ff_dim | number of neurons of the feed-forward network | int |
| vocab_size | vocabulary size used in word2vec | int |
| dense | number of neurons of the Dense layers | int |
| t_rate | dropout rate used in the dense layers of the transformer blocks | float |
| warmup_steps | steps to warm up the learning rate in the original transformer scheduler/optimizer | int |
| loss | loss used | instance of tf.keras losses (or a string name, e.g. "mse") |
| optimizer | optimizer used | 'SGD', 'Adam', 'Adadelta', 'Adamax', 'Original' |

| model_type | Description |
|---|---|
| DeepLncLoc | DeepLncLoc Transformer |
| DeepLncLoc_TF | DeepLncLoc Transformer with promoters, Half-Life and TF data as input |
| DeepLncLoc_onlyPromo | DeepLncLoc Transformer with only promoters as input |
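A hedged sketch of how the transformer models are fed: both the DataManager's transformer flag and get_train's translated flag refer to the categorical encoding of the sequences (see 2.1), the argument order follows Section 3.3, and the train_model wiring is assumed to mirror Section 3.4:

dm = DataManager(datadir=datadir, transformer=True)

X_trainhalflife, X_trainpromoter, y_train, _, _ = dm.get_train(True, True, False)        # translated=True
X_valhalflife,   X_valpromoter,   y_val,   _, _ = dm.get_validation(True, True, False)

trf = projTransformer(
    checkpoint_dir = "transformer_run/",          # illustrative path
    model_type     = "DeepLncLoc",
    maxlen         = X_trainpromoter.shape[1],
    embed_dim      = 64,
    num_heads      = 4,
    ff_dim         = 384,
    dense          = 100,
    optimizer      = "Adam",
)
# trf.train_model([X_trainpromoter, X_trainhalflife], y_train,
#                 [X_valpromoter, X_valhalflife], y_val)    # assumed, as in Section 3.4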

3. Workflow Example (GPU)

3.1 Prerequisites

  1. Connect your Google Drive account and allow importing all the notebooks (e.g. DataManager, CNN1D).
  2. Once you have been added to the drive repository, add a shortcut to the repository in the main directory of your drive so that the imports work properly; otherwise the %cd command will fail and you will not be able to run our notebooks.
!pip install import-ipynb
import import_ipynb

import os
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

%cd "drive/MyDrive/Bionformatics_Project/Colab"

3.2 Classes import:

from Classes.DataManager import DataManager
from Classes.Transformer import projTransformer
from tensorflow import keras
import numpy as np

%load_ext tensorboard

3.3 Retrieve training data from the Data Manager:

dm = DataManager(transformer=False, micro=False)

X_trainhalflife, X_trainpromoter, y_train, _, _                 = dm.get_train(True, False, False)
X_validationhalflife, X_validationpromoter, y_validation, _, _  = dm.get_validation(True, False, False)
X_testhalflife, X_testpromoter, y_test, _, _                    = dm.get_test(True, False, False)
leftpos  = 3_500          
rightpos = 13_500         
maxlen   = rightpos-leftpos
X_trainpromoter      = X_trainpromoter[:, leftpos:rightpos, :]
X_validationpromoter = X_validationpromoter[:, leftpos:rightpos, :]
X_testpromoter       = X_testpromoter[:, leftpos:rightpos, :]

3.4 Train the model:

model_type     = "Xpresso"
checkpoint_dir = "myFirstTrain/"
logdir         = "tensorboardDir/"

net = projCNN1D(
    checkpoint_dir =checkpoint_dir, 
    model_type     = model_type, 
    n_epochs       = 300, 
    batch_size     = 256, 
    learning_rate  = 5e-4, 
    CNN_input      = (maxlen, 4), 
    dropout_rate   = 0.5, 
    logdir         = logdir
)
        
net.train_model(
    [X_trainpromoter, X_trainhalflife],
    y_train, 
    [X_validationpromoter, X_validationhalflife], 
    y_validation
)

3.5 Evaluate the model:

3.5.1 At the last epoch:

net.evaluate([X_testpromoter, X_testhalflife], y_test)

3.5.2 At the epoch with lowest validation loss:

net.evaluate_best([X_testpromoter, X_testhalflife], y_test)

3.6 Visualize the distribution of the predictions with respect to the target:

net.plot_train()
net.plot_r2([X_testpromoter, X_testhalflife], y_test)
net.plot_kde([X_testpromoter, X_testhalflife], y_test)
  • plot_r2 shows a scatter plot of the predictions against the true values;
  • plot_kde compares the kernel density distribution of the predicted values with that of the targets (a standalone sketch of both plots follows below).
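Roughly what the two plots show, as a standalone sketch (not the class's own code), assuming matplotlib, seaborn and scikit-learn are available; net.model is the underlying Keras model:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import r2_score

y_pred = net.model.predict([X_testpromoter, X_testhalflife]).flatten()
y_true = np.asarray(y_test).flatten()

plt.figure()                                   # plot_r2: predictions vs. targets
plt.scatter(y_true, y_pred, s=3)
plt.title(f"R2 = {r2_score(y_true, y_pred):.3f}")

plt.figure()                                   # plot_kde: density overlap
sns.kdeplot(y_true, label="target")
sns.kdeplot(y_pred, label="prediction")
plt.legend()
plt.show()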

4. Workflow Example (TPU)

4.1 Prerequisites:

The TPU workflow configuration is essentially the same as the GPU one; you just have to be aware of defining the model inside the TPU scope, like here:

import tensorflow as tf
tf.keras.backend.clear_session()

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
# This is the TPU initialization code that has to be at the beginning.
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))

strategy = tf.distribute.experimental.TPUStrategy(resolver)


with strategy.scope():
    net = projTransformer(checkpoint_dir="batch128/",model_type=model_type, n_epochs=500, 
                      batch_size=256, learning_rate=5e-4, lr_reduction_epoch=200, 
                      maxlen=maxlen, embed_dim=64, num_heads=4, ff_dim=384, dense=100, 
                      dropout_rate=0.1, optimizer="Adam", warmup_steps=4_000, patience=20)

4.2 Considerations:

  • You do not need to call net.train_model() inside the TPU scope; we suggest calling it outside, for example in the following notebook cell;
  • You may need to increase the batch_size in order to exploit higher parallelism; a rule of thumb is to multiply the batch_size by a factor of 8. In this project we kept the same batch_size used for the GPU models in order to avoid a further hyperparameter search, given that the TPU models give the same results when deployed on GPU if they share the same hyperparameters. Obviously, even with the same batch_size, the model trains faster on TPU.
  • Because the model runs on the TPU cloud, we cannot exploit some callbacks such as ModelCheckpoint. We therefore use a parameter of EarlyStopping that restores the network weights of the epoch with minimum validation loss (see the sketch after this list). As a consequence, once training is completed, net.evaluate() and net.evaluate_best() will return the same value. Moreover, net.model.save() will not work, for the same reason.
  • Tensorboard cannot be used when you run on TPU, for the same reason cited above (TPU models run in the cloud);
  • BioLSTM has no TPU counterpart because it is already fast when deployed on GPU, so implementing one would have been pointless;
  • The remove_indicted option works only for the Xpresso dataset. If you want to train our transformer with the transcription factor dataset, you need to set vocab_size = 5.
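As a sketch, the EarlyStopping workaround mentioned above amounts to setting restore_best_weights=True, which rolls the network back to the weights of the epoch with the lowest validation loss at the end of training, so no ModelCheckpoint file is needed:

import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor              = "val_loss",
    patience             = 20,
    restore_best_weights = True,   # roll back to the best epoch when training stops
)
# model.fit(..., callbacks=[early_stop])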

5. Further Considerations:

When we started this project the latest TensorFlow version was 2.4.2, but after two months it changed to 2.5.x; so, if you have problems loading our Saved_Models, you may need to force the installation of the correct TensorFlow version.

!pip install tensorflow==2.4.2

6. Best HyperParameters:

BioLSTM (BioLSTM in BioLSTM.ipynb):
    model_type = "classic"
    n_epochs = 140
    batch_size = 128
    learning_rate = 3e-4
    patience = 15
    optimizer = "Adam"
    timestep = 210
    features = 64

DivideEtImpera (projCNN1D in CNN1D.ipynb):
    model_type = "DivideEtImpera"
    n_epochs = 300
    batch_size = 256
    learning_rate = 1e-3
    patience = 15
    optimizer = "SGD"
    lr_reduction_epoch = 70
    momentum = 0.9

Transformer DeepLncLoc (projTransformer in Transformer.ipynb):
    model_type = "DeepLncLoc"
    n_epochs = 300
    batch_size = 256
    learning_rate = 5e-4
    patience = 20
    optimizer = "Adam"
    lr_reduction_epoch = None
    embed_dim = 64
    num_heads = 4
    ff_dim = 384
    dense = 100
    dropout_rate = 0.1

N.B:

Parameters not listed above are assumed to take the default values of the respective class.
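For example, the best DeepLncLoc configuration above written as a constructor call (checkpoint_dir is illustrative; all other parameters keep the defaults of Section 2.5):

trf = projTransformer(
    checkpoint_dir     = "best_deeplncloc/",   # illustrative path
    model_type         = "DeepLncLoc",
    n_epochs           = 300,
    batch_size         = 256,
    learning_rate      = 5e-4,
    patience           = 20,
    optimizer          = "Adam",
    lr_reduction_epoch = None,
    embed_dim          = 64,
    num_heads          = 4,
    ff_dim             = 384,
    dense              = 100,
    dropout_rate       = 0.1,
)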

Credits

The data relating to the promoters, half-life features and median gene expression levels are available thanks to the following publication:

The data relating to the transcription factors are available thanks to the following work:

Contributors

vittoriopipoli, geneexpressionpolito
