MPRA-DragoNN: Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays

This project applies convolutional neural networks to predict output from massively parallel reporter assays (MPRAs), with the aim of systematically decoding regulatory sequence patterns and identifying noncoding variants that may affect gene expression.

Data

This project uses the Sharpr-MPRA dataset from Ernst et al. 2016 (https://www.nature.com/articles/nbt.3678). The raw data were minimally processed, as described in the paper and below. The prepared training, validation, and testing datasets in hdf5 format can be downloaded from here.

Technical details:

The raw data for each of the Sharpr-MPRA experiments was downloaded from the Gene Expression Omnibus, accession number GSE71279. These raw files are hosted at the following Dropbox link.

The raw counts from the experiments were processed by (1) computing log2(RNA+1 / DNA+1) for each 145bp sequence in each of the 12 tasks (described below); (2) column-wise z-score normalization of the log fold-changes (i.e., each task's output values had mean 0 and variance 1); (3) including the reverse complement of each sequence as another datapoint with the same activity values.

Each datapoint was then converted to a 145x4 NumPy array corresponding to the one-hot encoding of the sequence's ACGT representation. The label for each sequence was a length-12 array containing the normalized activity values for the 12 tasks. The data was split as follows: sequences on chr8 for validation (~30K), chr18 for testing (~20K), and the remaining chromosomes for training (~900K); the resulting hdf5 files are the ones at the data link.

Task description:

"k562_minp_rep1": K562 cell line, minimal promoter, replicate 1
"k562_minp_rep2": K562 cell line, minimal promoter, replicate 2
"k562_minp_avg": K562 cell line, minimal promoter, average*
"k562_sv40p_rep1": K562 cell line, strong SV40 promoter, replicate 1
"k562_sv40p_rep1": K562 cell line, strong SV40 promoter, replicate 2
"k562_sv40p_avg": K562 cell line, strong SV40 promoter, average*
"hepg2_minp_rep1": HepG2 cell line, minimal promoter, replicate 1
"hepg2_minp_rep2": HepG2 cell line, minimal promoter, replicate 2
"hepg2_minp_avg": HepG2 cell line, minimal promoter, average*
"hepg2_sv40p_rep1": HepG2 cell line, strong SV40 promoter, replicate 1
"hepg2_sv40p_rep1": HepG2 cell line, strong SV40 promoter, replicate 2
"hepg2_sv40p_rep1": HepG2 cell line, strong SV40 promoter, average*

*The "average" tasks are computed by pooling counts between replicates, i.e. computing log2(RNA_Rep1 + RNA_Rep2 + 1) - log(DNA + 1).

Model Training

The inputs to our model are shape (145, 4) NumPy arrays corresponding to one-hot encoded 145 base-pair DNA sequences. The outputs are 12 continuous values corresponding to normalized activity levels of the sequence in different cellular contexts (described above).

The neural network model used for MPRA activity prediction is a fairly standard convolutional architecture for genomics. We use three convolution layers (ReLU activation), each containing 120 filters of width 5, followed by a single fully connected layer (linear activation) to predict the 12 tasks. Our model uses task-wise mean squared error loss and our primary evaluation criteria (for validation/testing) is the Spearman correlation (robust to outliers, unlike Pearson).

The models have been implemented in Keras with a Tensorflow backend. To train the model:

python main.py --data_path /path/to/data

To resume training from an existing checkpoint:

python main --data_path /path/to/data --pretrained_model_checkpoint /path/to/checkpoint/model

During training, the model produces logs in the experiments directory, which can be visualized using tensorboard as:

tensorboard --logdir /path/to/log/dir/in/experiments

To evaluate on test set, pass the --evaluate 1 flag in addition to resuming the model from the checkpoint.

For other inputs, such as hyperparameters, refer

python main.py --help

Prediction and Interpretation

We have provided pretrained models in kipoi/ConvModel and kipoi/DeepFactorizedModel directories. We provide support for prediction and interpretation through Kipoi. The model json and yaml files are also available in the kipoi directory. The models can be loaded as:

import kipoi

model = kipoi.get_model("SNPpet/ConvModel")    # or "SNPpet/DeepFactorizedModel"

Follow the instructions on Kipoi to make predictions on arbitrary sequences and for interpreting the model.

Help

Feel free to direct questions about this project to Rajiv Movva: rmovva at mit dot edu, or open an Issue.

Citation

If you use this code for your research, please cite our paper: Movva R, Greenside P, Marinov GK, Nair S, Shrikumar A, Kundaje A (2019). Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays. PLoS ONE 14(6): e0218073. https://doi.org/10.1371/journal.pone.0218073

zhangfishergit / mpra-dragonn Goto Github PK

mpra-dragonn's Introduction

MPRA-DragoNN: Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays

Data

Model Training

Prediction and Interpretation

Help

Citation

mpra-dragonn's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent