arnavmdas / epiphany

License: MIT License

Python 5.65% Shell 0.11% R 0.20% Jupyter Notebook 94.04%

epiphany's Introduction

Epiphany

Epiphany: predicting Hi-C contact maps from 1D epigenomic signals

Epiphany is a neural network that predicts cell-type-specific Hi-C contact maps from widely available epigenomic tracks. It uses bidirectional long short-term memory (Bi-LSTM) layers to capture long-range dependencies and, optionally, a generative adversarial network (GAN) architecture to encourage contact map realism and sharpness. Epiphany shows excellent generalization to held-out chromosomes within and across cell types, yields accurate TAD and interaction calls, and predicts structural changes caused by perturbations of epigenomic signals.

Model Input and Output

Input: any combination of epigenomic tracks for a cell type of interest.
Output: Hi-C contact map of the same cell type.

Epiphany connects 1D epigenomic signals to 3D chromatin structure, which makes it possible to interpret the importance of signals from specific epigenomic tracks in relation to structural changes. Any combination of epigenomic tracks can be used as input. In our ablation analysis, the two-track combination (ATAC + CTCF) alone yields good prediction quality, and combining ATAC or CTCF with other relevant epigenomic tracks significantly improves predictions.

Roadmap

This repo includes scripts and related files for the Epiphany model [preprint].

Resource repo: Zenodo

Sample datasets: GM12878_X.h5 and GM12878_y.pickle for input and target sample datasets for Epiphany training
Pretrained model weights:

pretrained_10kb.pt_model: pretrained weights of 10kb model

pretrained_5kb.pt_model: pretrained weights of 5kb model

Quick start training

Clone Repository

git clone https://github.com/arnavmdas/epiphany.git

Training

Move to training directory

cd epiphany/epiphany

Download the dataset from Google Drive

mkdir ./Epiphany_dataset
cd ./Epiphany_dataset
wget --no-check-certificate https://drive.google.com/drive/u/2/folders/1UJX6cp-4s0Jbud9jovzuaqnBeORg5R8x -O GM12878_X.h5
wget --no-check-certificate https://drive.google.com/drive/u/2/folders/1UJX6cp-4s0Jbud9jovzuaqnBeORg5R8x -O GM12878_y.pickle
cd ..
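Both wget commands above point at a Drive folder URL, so it may be worth confirming that the saved files really are HDF5 and pickle data rather than an HTML page before starting training. The following is a minimal sketch of such a check; `sniff_download` is a hypothetical helper (not part of this repository), and it assumes the HDF5 signature sits at byte 0 and that the pickle was written with protocol 2 or later.

```python
# Hypothetical helper, not part of the Epiphany repository.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # 8-byte signature defined by the HDF5 format


def sniff_download(path):
    """Guess what a downloaded file contains from its first bytes."""
    with open(path, "rb") as fh:
        head = fh.read(512)
    if head.startswith(HDF5_SIGNATURE):
        return "hdf5"
    # Pickle protocol 2+ streams start with the PROTO opcode 0x80 (assumption:
    # the target file was not written with protocol 0 or 1).
    if head[:1] == b"\x80":
        return "pickle"
    # An HTML page usually means the URL returned a web page, not the data.
    if head.lstrip().lower().startswith((b"<!doctype", b"<html")):
        return "html"
    return "unknown"
```

Under these assumptions, `sniff_download("./Epiphany_dataset/GM12878_X.h5")` should return `"hdf5"` and `sniff_download("./Epiphany_dataset/GM12878_y.pickle")` should return `"pickle"`; a result of `"html"` suggests the file needs to be downloaded again by another route.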

Run training script

python3 adversarial.py --wandb

Prediction using pretrained models

Generate a contact map of GM12878 chromosome 3 using the pretrained model at 10kb resolution: Google Colab
Generate a contact map of a region on chromosome 8 of the H1ES cell line [chr8:53167500-55167500] with original and perturbed epigenomic signals using the pretrained model at 5kb resolution: Google Colab
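To get a feel for the size of the example region, the arithmetic can be sketched as below. This only illustrates how many fixed-width bins the interval spans at a given resolution; `bins_in_region` is a hypothetical helper, and how the notebooks actually window the region is not described here.

```python
def bins_in_region(region, resolution_bp):
    """Return (chromosome, number of whole bins) for a 'chrN:start-end' string."""
    chrom, span = region.split(":")
    start, end = (int(x) for x in span.split("-"))
    # Integer division: count only complete bins of width resolution_bp.
    return chrom, (end - start) // resolution_bp


print(bins_in_region("chr8:53167500-55167500", 5000))  # ('chr8', 400)
```

The 2 Mb interval therefore spans 400 bins at 5kb resolution, and would span 200 bins at 10kb.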

Contact

If you have any questions, please feel free to contact Rui Yang ([email protected]) or Arnav Das ([email protected]).

epiphany's People

ruy204 alexbelov3

epiphany's Issues

OSError: Unable to open file (file signature not found)

Hi, thank you for the great work. I'm very interested in this model and have been playing around with it. I downloaded the data and then ran the training script as instructed, but got OSError: Unable to open file (file signature not found). Here's the full error message.

I googled this error, and the results suggested it might be due to file corruption. Would you mind taking a look at it? Thank you!

About prediction of all chromosomes

Dear all,

In the prediction example, you provided the chr3 ground truth data alongside the chr3 predictions, presumably so that the predictions can be compared with the ground truth.

Suppose I need to predict all other chromosomes. Is the ground truth data of other chromosomes available? Or how can I create a file similar to the chr3 ground truth file?

Thanks

Getting error in prediction using pretrained models

Hi all,

Thanks for the great work.

When I try to run the Google Colab commands in my Jupyter notebook for "Generate contact map of GM12878 chromosome 3 using pre-trained model at 10kb resolution",

I get an error in:

results_generation(chrom=chrom, net=net,
                   cell_type="GM12878",
                   bwfile_dir="/content/epiphany/bigWig",
                   submatrix_location="/content/intermediate_matrices.txt",
                   assemble_matrix_location="/content/assembled_chromosome.txt",
                   ground_truth_file="/content/epiphany/ground_truth/chr3_ground_truth.txt",
                   ground_truth_location="/content/ground_truth_corresponding_location.txt",
                   window_size=wsize)  # normcounts, zvalue, zfull

Of course, I adjusted the file paths for my setup.

The error is:

[readRTreeIdx] Mismatch in the magic number!
[bwOpen] bwg->idx is NULL bwg->hdr->dataOffset 0x1919!
[pyBwOpen] bw is NULL!

Another question,

When I tried to view the file chr3_ground_truth.txt, I found that it is not a normal text file. It looks like a binary file despite the .txt extension.

Can you please help with these two questions?

Thanks
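One possible reading of the `[readRTreeIdx] Mismatch in the magic number!` message is that the bigWig header was read but the index block it points to is missing or damaged, as can happen with an incomplete download. The sketch below checks both signatures using only the standard library. It is not the library's own code: `check_bigwig` is a hypothetical helper, it assumes a little-endian file, and the cause suggested here is an assumption rather than a confirmed diagnosis.

```python
import os
import struct

BIGWIG_MAGIC = 0x888FFC26  # signature at the start of a bigWig file
RTREE_MAGIC = 0x2468ACE0   # signature at the start of the R-tree index block


def check_bigwig(path):
    """Hypothetical sanity check: returns 'ok' or a short description of the problem."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        header = fh.read(64)  # the fixed bigWig header is 64 bytes long
        if len(header) < 64:
            return "too short to hold a bigWig header"
        # magic, version, zoom levels, chrom tree offset, data offset, index offset
        magic, _ver, _zoom, _chrom_off, _data_off, index_off = struct.unpack(
            "<IHHQQQ", header[:32]
        )
        if magic != BIGWIG_MAGIC:
            return "header signature is not bigWig"
        if index_off + 4 > size:
            return "index offset lies beyond end of file (truncated download?)"
        fh.seek(index_off)
        (idx_magic,) = struct.unpack("<I", fh.read(4))
        if idx_magic != RTREE_MAGIC:
            return "index signature mismatch"
    return "ok"
```

Running `check_bigwig` on each file in the `bwfile_dir` folder would show whether any of them fails before `results_generation` is called.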

ATAC-seq vs DNaseI-seq

Dear authors,

As ATAC-seq and DNaseI-seq both assess open regions of the genome, would a model trained on ATAC + (the other epigenetic 1D maps mentioned in the Epiphany paper) be better suited than one trained on DNaseI for the prediction of Hi-C contact maps?

I am very interested in using Epiphany to predict contact maps in different subtypes of B lymphocytes (which GM12878 is). However, the high input required by DNaseI-seq (10-50 million cells) is limiting, whereas ATAC-seq needs just 50k cells and can also scale down to scATAC-seq.

I presume the main model presented in the paper would also perform well with ATAC instead of DNase. However, with limited computing resources, I can generate the ATAC-seq data for GM12878 but cannot train a fine-tuned or revised model on it together with the other epigenetic 1D maps.

I was also tempted to simply supply the ATAC-seq bigWig file as a pseudo-DNaseI file for prediction in your Google Colab notebook. Maybe I will go ahead with this and see whether the predictions differ much from the original runs that used DNaseI-seq.

BTW, I was wondering whether you have a model trained on (ATAC, CTCF, H3K27ac, H3K27me3, H3K4me3) that is available for the community to use. Thanks very much for reading through this lengthy post!