

DeepSTARR

DeepSTARR is a deep learning model built to quantitatively predict the activities of developmental and housekeeping enhancers from DNA sequence in Drosophila melanogaster S2 cells.

For more information, see the DeepSTARR publication:
DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers
Bernardo P. de Almeida, Franziska Reiter, Michaela Pagani, Alexander Stark. Nature Genetics, 2022.
Presentation at ISCB Webinar

This repository contains the code used to process genome-wide and oligo UMI-STARR-seq data and train DeepSTARR.

Genome-wide enhancer activity maps of developmental and housekeeping enhancers

We used UMI-STARR-seq (Arnold et al., 2013; Neumayr et al., 2019) to generate genome-wide high resolution, quantitative activity maps of developmental and housekeeping enhancers, representing the two main transcriptional programs in Drosophila S2 cells (Arnold et al., 2017; Haberle et al., 2019; Zabidi et al., 2015).

The raw sequencing data are available from GEO under accession number GSE183939.
You can find the code to process the data here.

DeepSTARR model

DeepSTARR is a multi-task convolutional neural network that maps 249 bp long DNA sequences to both their developmental and their housekeeping enhancer activities. We adapted the Basset convolutional neural network architecture (Kelley et al., 2016) and designed DeepSTARR with four convolution layers, each followed by a max-pooling layer, and two fully connected layers. The convolution layers identify local sequence features (e.g. TF motifs) and increasingly complex patterns (e.g. TF motif syntax), while the fully connected layers combine these features and patterns to predict enhancer activity separately for each enhancer type.
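The architecture described above can be sketched in Keras. The filter counts, kernel sizes and dense-layer widths below are illustrative placeholders, not the published hyperparameters (the actual trained model is on zenodo); only the overall shape — four conv/max-pool blocks, two fully connected layers, and one linear output head per enhancer type — follows the text:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_deepstarr_like(seq_len=249,
                         n_filters=(256, 60, 60, 120),  # illustrative, not the published values
                         kernel_sizes=(7, 3, 5, 3)):    # illustrative, not the published values
    """Multi-task CNN sketch: 4 conv + max-pool blocks, 2 dense layers,
    and one linear output head per enhancer type (developmental, housekeeping)."""
    inputs = layers.Input(shape=(seq_len, 4))  # one-hot DNA: A/C/G/T channels
    x = inputs
    for nf, ks in zip(n_filters, kernel_sizes):
        x = layers.Conv1D(nf, ks, padding="same", activation="relu")(x)
        x = layers.MaxPooling1D(2)(x)
    x = layers.Flatten()(x)
    for units in (256, 256):  # two fully connected layers
        x = layers.Dense(units, activation="relu")(x)
        x = layers.Dropout(0.4)(x)
    out_dev = layers.Dense(1, name="Dense_Dev")(x)  # developmental activity
    out_hk = layers.Dense(1, name="Dense_Hk")(x)    # housekeeping activity
    return keras.Model(inputs, [out_dev, out_hk])
```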

You can find the code used to train DeepSTARR and compute nucleotide contribution scores here.
Data used to train and evaluate the DeepSTARR model as well as the final trained model are available on zenodo at https://doi.org/10.5281/zenodo.5502060.
DeepSTARR is also deposited in Kipoi.

Tutorial

An end-to-end example that trains DeepSTARR, computes nucleotide contribution scores and derives TF-MoDISco motifs is contained in the following Colab notebook: https://colab.research.google.com/drive/1Xgak40TuxWWLh5P5ARf0-4Xo0BcRn0Gd. You can run this notebook yourself to experiment with DeepSTARR.

Predict developmental and housekeeping enhancer activity of new DNA sequences

To predict the developmental and housekeeping enhancer activity in Drosophila melanogaster S2 cells for new DNA sequences, please run:

# Clone this repository
git clone https://github.com/bernardo-de-almeida/DeepSTARR.git
cd DeepSTARR/DeepSTARR

# download the trained DeepSTARR model from zenodo (https://doi.org/10.5281/zenodo.5502060)

# create 'DeepSTARR' conda environment by running the following:
conda create --name DeepSTARR python=3.7 tensorflow=1.14.0 keras=2.2.4 # or tensorflow-gpu/keras-gpu if you are using a GPU
source activate DeepSTARR
pip install git+https://github.com/AvantiShri/shap.git@master
pip install 'h5py<3.0.0'
pip install deeplift==0.6.13.0

# Run prediction script
python DeepSTARR_pred_new_sequence.py -s Sequences_example.fa -m DeepSTARR.model

Where:

  • -s FASTA file with the input DNA sequences
  • -m name of the trained DeepSTARR model (downloaded from zenodo)
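The script takes plain FASTA; internally, DeepSTARR-style models consume each sequence one-hot encoded as a (length, 4) matrix. A minimal sketch of that encoding — the helper name and the A/C/G/T column order are assumptions for illustration, not the script's actual code:

```python
import numpy as np

# Column order is an assumption (A, C, G, T); unknown bases (e.g. N)
# are encoded as all-zero rows.
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """One-hot encode a DNA string into a (len(seq), 4) float matrix."""
    mat = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASE_INDEX.get(base)
        if j is not None:
            mat[i, j] = 1.0
    return mat
```

A 249 bp input sequence thus becomes a (249, 4) matrix, matching the model's input layer.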

UMI-STARR-seq with designed oligo libraries to test more than 40,000 wildtype and mutant Drosophila and human enhancers

We designed and synthesised (in oligo pools by Twist Bioscience) wildtype and TF motif-mutant sequences of Drosophila and human enhancers. The activity of each sequence in the oligo libraries was assessed experimentally by UMI-STARR-seq in Drosophila melanogaster S2 (both developmental and housekeeping UMI-STARR-seq; see figure below) and human HCT116 cells, respectively.

The raw sequencing data are available from GEO under accession number GSE183939.
You can find the code to analyse Drosophila and human oligo UMI-STARR-seq screens here.

Code for Figures

Scripts to reproduce each main figure can be found here and the respective processed data here.

UCSC Genome Browser tracks

Genome browser tracks showing genome-wide UMI-STARR-seq and DeepSTARR predictions in Drosophila, including nucleotide contribution scores for all enhancer sequences, together with the enhancers used for mutagenesis, mutated motif instances and respective log2 fold-changes in enhancer activity, are available at https://genome.ucsc.edu/s/bernardo.almeida/DeepSTARR_manuscript.
Dynamic sequence tracks and contribution scores are also available as a Reservoir Genome Browser session.

Questions

If you have any questions/requests/comments please contact me at [email protected].

Contributors

bernardo-de-almeida

Issues

Kipoi error

Hi, I'm receiving the following Kipoi error:

ValueError: Error when checking input: expected input_12 to have 3 dimensions, but got array with shape (4, 249, 4, 1)

when using any of the following:

(1)

pred = model.pipeline.predict_example(batch_size=4)

(2)

# Download example dataloader kwargs
dl_kwargs = model.default_dataloader.download_example('example')
# Get the dataloader and instantiate it
dl = model.default_dataloader(**dl_kwargs)
# get a batch iterator
batch_iterator = dl.batch_iter(batch_size=4)
for batch in batch_iterator:
    # predict for a batch
    batch_pred = model.predict_on_batch(batch['inputs'])

(3)

dl_kwargs = {'intervals_file': intervals_file_motif, 'fasta_file': fasta_file_motif}
dl = model.default_dataloader(**dl_kwargs)
batch_iterator = dl.batch_iter(batch_size=4)
pred = model.pipeline.predict(dl_kwargs, batch_size=4)

(1) and (2) are from http://kipoi.org/models/DeepSTARR/, while (3) uses a custom intervals and fasta file, which I can confirm has worked for other kipoi models (e.g., BPNet).

My environment is the same as in your README.md, with the addition of pip install kipoi and pip install kipoiseq -- any help would be greatly appreciated.
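A hedged note on the shape mismatch, not a confirmed fix: the dataloader yields a (batch, 249, 4, 1) array while the model's input layer expects 3 dimensions, so dropping the trailing singleton axis before calling predict_on_batch may reconcile the two. The helper below is hypothetical:

```python
import numpy as np

def to_model_input(batch_inputs):
    """Hypothetical adapter: squeeze a trailing singleton axis so a
    (batch, 249, 4, 1) array becomes the (batch, 249, 4) shape a
    DeepSTARR-style Keras input layer expects."""
    arr = np.asarray(batch_inputs)
    if arr.ndim == 4 and arr.shape[-1] == 1:
        arr = np.squeeze(arr, axis=-1)
    return arr
```

Usage would then be `batch_pred = model.predict_on_batch(to_model_input(batch['inputs']))`, assuming the extra axis is the only incompatibility.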

DeepSTARR running error

Hello,
Thanks for developing DeepSTARR, which is a great tool to predict enhancer activity from DNA sequences. I am trying to run DeepSTARR using the following command:
python DeepSTARR_new_predict.py -s Sequences_example.fa -m DeepSTARR.model
But I got the following error:
Instructions for updating:
Please use rate instead of keep_prob. Rate should be set to rate = 1 - keep_prob.
Traceback (most recent call last):
  File "DeepSTARR_new_predict.py", line 71, in <module>
    keras_model, keras_model_weights, keras_model_json = load_model(model_ID)
  File "DeepSTARR_new_predict.py", line 67, in load_model
    keras_model.load_weights(keras_model_weights)
  File "/nas/longleaf/home/anaconda2/envs/DeepSTARR_gpu/lib/python3.7/site-packages/keras/engine/network.py", line 1166, in load_weights
    f, self.layers, reshape=reshape)
  File "/nas/longleaf/home/anaconda2/envs/DeepSTARR_gpu/lib/python3.7/site-packages/keras/engine/saving.py", line 1004, in load_weights_from_hdf5_group
    original_keras_version = f.attrs['keras_version'].decode('utf8')
AttributeError: 'str' object has no attribute 'decode'

Could you tell me how to deal with this issue?
Best,
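A hedged note on this traceback: it is a known symptom of h5py >= 3, where HDF5 attributes come back as str rather than bytes, so the `.decode('utf8')` call inside Keras 2.2.4 fails; installing h5py < 3 as pinned in the README (`pip install 'h5py<3.0.0'`) usually resolves it. For illustration only, a version-agnostic attribute read would look like this (hypothetical helper, not part of the repository):

```python
def decode_attr(value):
    """Return an HDF5 attribute as str, whether h5py yields
    bytes (h5py < 3) or str (h5py >= 3)."""
    return value.decode("utf8") if isinstance(value, bytes) else value
```

The simpler route remains matching the pinned environment from the README rather than patching Keras internals.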

Converting predictions to bigWig

Hey Bernardo!

Random question. When you predict across the dm3 genome, how do you save the outputs and then convert them to bigWig tracks that nicely match the STARR-seq bigWigs in magnitude? Maybe convert the predictions to bigBed and then use something like bigBedToBigWig.sh?

Adam
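One possible route, offered as a sketch under stated assumptions rather than the authors' actual pipeline: write the per-window predictions to bedGraph and convert with UCSC's `bedGraphToBigWig` utility (which needs a chrom.sizes file for the assembly). The helper below is hypothetical:

```python
def write_bedgraph(path, chrom, starts, values, window=249):
    """Hypothetical sketch: dump tiled model predictions as bedGraph
    (chrom, start, end, value), one fixed-width window per line."""
    with open(path, "w") as fh:
        for start, value in zip(starts, values):
            fh.write(f"{chrom}\t{start}\t{start + window}\t{value:.4f}\n")

# Afterwards, on the shell (assumes dm3.chrom.sizes is available):
#   bedGraphToBigWig predictions.bedGraph dm3.chrom.sizes predictions.bw
```

Overlapping windows would need to be resolved (e.g. averaged) before conversion, since bedGraph intervals must not overlap.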

how to make figure 1A

Hello,
DeepSTARR looks like a great tool to predict enhancer activity from DNA sequences.
Could you tell me whether DeepSTARR can predict motifs in gene promoters? I would also like to plot figure 1A.


Constant y_pred values

Hi,

I was trying to run DeepSTARR_training.ipynb, but the Spearman correlations are always nan, suggesting constant predicted values. Training also stops at epoch 14; please see the log below. Any idea why? Or should I just ignore it and run the downstream analysis?

Tensor("Dense_Dev_5/BiasAdd:0", shape=(None, 1), dtype=float32)
Tensor("Dense_Dev_target_5:0", shape=(None, None), dtype=float32)
Tensor("Dense_Hk_5/BiasAdd:0", shape=(None, 1), dtype=float32)
Tensor("Dense_Hk_target_5:0", shape=(None, None), dtype=float32)
Train on 402296 samples, validate on 40570 samples
Epoch 1/100
402296/402296 [==============================] - 67s 168us/step - loss: 5.0083 - Dense_Dev_loss: 2.3143 - Dense_Hk_loss: 2.6940 - Dense_Dev_Spearman: nan - Dense_Hk_Spearman: nan - val_loss: 5.1719 - val_Dense_Dev_loss: 2.3108 - val_Dense_Hk_loss: 2.8607 - val_Dense_Dev_Spearman: nan - val_Dense_Hk_Spearman: nan
Epoch 2/100
402296/402296 [==============================] - 66s 164us/step - loss: 5.0052 - Dense_Dev_loss: 2.3113 - Dense_Hk_loss: 2.6939 - Dense_Dev_Spearman: nan - Dense_Hk_Spearman: nan - val_loss: 5.1691 - val_Dense_Dev_loss: 2.3082 - val_Dense_Hk_loss: 2.8605 - val_Dense_Dev_Spearman: nan - val_Dense_Hk_Spearman: nan
Epoch 3/100
402296/402296 [==============================] - 68s 168us/step - loss: 5.0052 - Dense_Dev_loss: 2.3113 - Dense_Hk_loss: 2.6939 - Dense_Dev_Spearman: nan - Dense_Hk_Spearman: nan - val_loss: 5.1697 - val_Dense_Dev_loss: 2.3089 - val_Dense_Hk_loss: 2.8604 - val_Dense_Dev_Spearman: nan - val_Dense_Hk_Spearman: nan
Epoch 4/100
402296/402296 [==============================] - 68s 169us/step - loss: 5.0052 - Dense_Dev_loss: 2.3112 - Dense_Hk_loss: 2.6940 - Dense_Dev_Spearman: nan - Dense_Hk_Spearman: nan - val_loss: 5.1678 - val_Dense_Dev_loss: 2.3070 - val_Dense_Hk_loss: 2.8605 - val_Dense_Dev_Spearman: nan - val_Dense_Hk_Spearman: nan
Epoch 5/100
402296/402296 [==============================] - 68s 168us/step - loss: 5.0052 - Dense_Dev_loss: 2.3113 - Dense_Hk_loss: 2.6939 - Dense_Dev_Spearman: nan - Dense_Hk_Spearman: nan - val_loss: 5.1708 - val_Dense_Dev_loss: 2.3099 - val_Dense_Hk_loss: 2.8605 - val_Dense_Dev_Spearman: nan - val_Dense_Hk_Spearman: nan
Epoch 6/100
402296/402296 [==============================] - 67s 168us/step - loss: 5.0052 - Dense_Dev_loss: 2.3113 - Dense_Hk_loss: 2.6939 - Dense_Dev_Spearman: nan - Dense_Hk_Spearman: nan - val_loss: 5.1683 - val_Dense_Dev_loss: 2.3068 - val_Dense_Hk_loss: 2.8611 - val_Dense_Dev_Spearman: nan - val_Dense_Hk_Spearman: nan
Epoch 7/100
402296/402296 [==============================] - 68s 168us/step - loss: 5.0052 - Dense_Dev_loss: 2.3113 - Dense_Hk_loss: 2.6939 - Dense_Dev_Spearman: nan - Dense_Hk_Spearman: nan - val_loss: 5.1713 - val_Dense_Dev_loss: 2.3101 - val_Dense_Hk_loss: 2.8608 - val_Dense_Dev_Spearman: nan - val_Dense_Hk_Spearman: nan
Epoch 8/100
402296/402296 [==============================] - 70s 175us/step - loss: 5.0052 - Dense_Dev_loss: 2.3113 - Dense_Hk_loss: 2.6939 - Dense_Dev_Spearman: nan - Dense_Hk_Spearman: nan - val_loss: 5.1696 - val_Dense_Dev_loss: 2.3086 - val_Dense_Hk_loss: 2.8607 - val_Dense_Dev_Spearman: nan - val_Dense_Hk_Spearman: nan
Epoch 9/100
402296/402296 [==============================] - 69s 172us/step - loss: 5.0052 - Dense_Dev_loss: 2.3113 - Dense_Hk_loss: 2.6939 - Dense_Dev_Spearman: nan - Dense_Hk_Spearman: nan - val_loss: 5.1700 - val_Dense_Dev_loss: 2.3079 - val_Dense_Hk_loss: 2.8617 - val_Dense_Dev_Spearman: nan - val_Dense_Hk_Spearman: nan
Epoch 10/100
402296/402296 [==============================] - 70s 173us/step - loss: 5.0052 - Dense_Dev_loss: 2.3112 - Dense_Hk_loss: 2.6939 - Dense_Dev_Spearman: nan - Dense_Hk_Spearman: nan - val_loss: 5.1680 - val_Dense_Dev_loss: 2.3073 - val_Dense_Hk_loss: 2.8604 - val_Dense_Dev_Spearman: nan - val_Dense_Hk_Spearman: nan
Epoch 11/100
402296/402296 [==============================] - 70s 174us/step - loss: 5.0052 - Dense_Dev_loss: 2.3113 - Dense_Hk_loss: 2.6939 - Dense_Dev_Spearman: nan - Dense_Hk_Spearman: nan - val_loss: 5.1712 - val_Dense_Dev_loss: 2.3104 - val_Dense_Hk_loss: 2.8604 - val_Dense_Dev_Spearman: nan - val_Dense_Hk_Spearman: nan
Epoch 12/100
402296/402296 [==============================] - 70s 173us/step - loss: 5.0052 - Dense_Dev_loss: 2.3113 - Dense_Hk_loss: 2.6939 - Dense_Dev_Spearman: nan - Dense_Hk_Spearman: nan - val_loss: 5.1686 - val_Dense_Dev_loss: 2.3078 - val_Dense_Hk_loss: 2.8604 - val_Dense_Dev_Spearman: nan - val_Dense_Hk_Spearman: nan
Epoch 13/100
402296/402296 [==============================] - 69s 173us/step - loss: 5.0052 - Dense_Dev_loss: 2.3113 - Dense_Hk_loss: 2.6939 - Dense_Dev_Spearman: nan - Dense_Hk_Spearman: nan - val_loss: 5.1706 - val_Dense_Dev_loss: 2.3099 - val_Dense_Hk_loss: 2.8604 - val_Dense_Dev_Spearman: nan - val_Dense_Hk_Spearman: nan
Epoch 14/100
402296/402296 [==============================] - 69s 173us/step - loss: 5.0052 - Dense_Dev_loss: 2.3113 - Dense_Hk_loss: 2.6939 - Dense_Dev_Spearman: nan - Dense_Hk_Spearman: nan - val_loss: 5.1704 - val_Dense_Dev_loss: 2.3092 - val_Dense_Hk_loss: 2.8608 - val_Dense_Dev_Spearman: nan - val_Dense_Hk_Spearman: nan

Thanks,
Long
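A hedged diagnostic note: a rank correlation is undefined when one of its inputs is constant, so nan Spearman values alongside losses frozen at 5.0052 are consistent with the model predicting the same value for every sequence — typically worth investigating (learning rate, input encoding) rather than ignoring. A quick check outside Keras reproduces the nan:

```python
import numpy as np
from scipy.stats import spearmanr

# A constant prediction vector has zero rank variance, so Spearman
# correlation against any target is undefined (nan):
constant_preds = np.ones(100)
targets = np.arange(100, dtype=float)
rho, _ = spearmanr(constant_preds, targets)
print(np.isnan(rho))  # True
```

If predictions saved from the notebook show a single repeated value, that would confirm the collapsed-output diagnosis.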
