epiphany's Introduction

Epiphany

Epiphany: predicting Hi-C contact maps from 1D epigenomic signals

Epiphany is a neural network that predicts cell-type-specific Hi-C contact maps from widely available epigenomic tracks. It uses bidirectional long short-term memory (Bi-LSTM) layers to capture long-range dependencies and, optionally, a generative adversarial network (GAN) architecture to encourage contact map realism and sharpness. Epiphany generalizes well to held-out chromosomes within and across cell types, yields accurate TAD and interaction calls, and predicts structural changes caused by perturbations of epigenomic signals.

Model Input and Output

  • Input: any combination of epigenomic tracks for the cell type of interest.
  • Output: Hi-C contact map of the same cell type.

Epiphany connects 1D epigenomic signals to 3D chromatin structure, which makes it possible to interpret the importance of individual epigenomic tracks for structural changes. Any combination of epigenomic tracks can be used as input. In our ablation analysis, a two-track combination (ATAC + CTCF) alone yields good prediction quality. Including ATAC or CTCF alongside other relevant epigenomic tracks in the input set significantly improves prediction quality.
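To make the "any combination of tracks" idea concrete, here is a minimal sketch of selecting a subset of 1D tracks and stacking them into a tracks-by-bins matrix. The function name `stack_tracks`, the track values, and the matrix layout are illustrative assumptions and are not taken from the Epiphany code, whose actual input format is defined by its data loaders.

```python
from typing import Dict, List, Sequence


def stack_tracks(tracks: Dict[str, Sequence[float]], names: List[str]) -> List[List[float]]:
    """Stack the chosen 1D tracks into an (n_tracks x n_bins) matrix (hypothetical helper)."""
    missing = [n for n in names if n not in tracks]
    if missing:
        raise KeyError(f"missing tracks: {missing}")
    # Every selected track must cover the same genomic bins.
    lengths = {len(tracks[n]) for n in names}
    if len(lengths) != 1:
        raise ValueError("all selected tracks must have the same number of bins")
    return [list(tracks[n]) for n in names]


# Toy signal values for four genomic bins (made up for illustration).
tracks = {
    "ATAC": [0.1, 0.9, 0.3, 0.0],
    "CTCF": [0.0, 1.2, 0.1, 0.4],
    "H3K27ac": [0.5, 0.2, 0.7, 0.1],
}

# The two-track combination mentioned above.
x = stack_tracks(tracks, ["ATAC", "CTCF"])
print(len(x), "tracks x", len(x[0]), "bins")
```

Row order follows the order of the requested names, so a model trained on one track order should be given the same order at prediction time.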

Roadmap

This repo includes scripts and related files for the Epiphany model [preprint].

Resource repo: Zenodo DOI

  • Sample datasets: GM12878_X.h5 (input) and GM12878_y.pickle (target), sample datasets for Epiphany training
  • Pretrained model weights:
      • pretrained_10kb.pt_model: pretrained weights of the 10kb model
      • pretrained_5kb.pt_model: pretrained weights of the 5kb model

Quick start training

Clone Repository

git clone https://github.com/arnavmdas/epiphany.git

Training

Move to training directory

cd epiphany/epiphany

Download the dataset from Google Drive

mkdir ./Epiphany_dataset
cd ./Epiphany_dataset
wget --no-check-certificate https://drive.google.com/drive/u/2/folders/1UJX6cp-4s0Jbud9jovzuaqnBeORg5R8x -O GM12878_X.h5
wget --no-check-certificate https://drive.google.com/drive/u/2/folders/1UJX6cp-4s0Jbud9jovzuaqnBeORg5R8x -O GM12878_y.pickle
cd ..
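Before training, it can be worth confirming that the downloaded input file really is HDF5. h5py reports "OSError: Unable to open file (file signature not found)" when a file does not start with the HDF5 signature, which happens, for example, when a web page was saved in place of the data. The sketch below uses only the standard library; the helper names are illustrative and not part of Epiphany.

```python
from pathlib import Path

# First 8 bytes of an HDF5 file that has no user block.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path):
    """Return True if the file starts with the HDF5 signature."""
    with open(path, "rb") as fh:
        return fh.read(8) == HDF5_SIGNATURE


def looks_like_html(path):
    """Return True if the file appears to be a saved HTML page."""
    with open(path, "rb") as fh:
        head = fh.read(512).lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


# Check the input file from the step above, if it is present.
candidate = Path("Epiphany_dataset") / "GM12878_X.h5"
if candidate.exists():
    if looks_like_hdf5(candidate):
        print(f"{candidate}: HDF5 signature found")
    elif looks_like_html(candidate):
        print(f"{candidate}: this is an HTML page, not the dataset")
    else:
        print(f"{candidate}: unknown format, download may be corrupted")
```

If the check reports an HTML page, downloading the file manually from the shared folder (or from the Zenodo resource repo) and placing it in ./Epiphany_dataset is one way to proceed.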

Run training script

python3 adversarial.py --wandb

Prediction using pretrained models

  • Generate a contact map of GM12878 chromosome 3 using the pretrained model at 10kb resolution: Google Colab
  • Generate a contact map of a region on chromosome 8 of H1ES cells [chr8:53167500-55167500] with original and perturbed epigenomic signals using the pretrained model at 5kb resolution: Google Colab
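As a rough guide to the size of such a prediction, the number of bins along each axis of the contact map follows from the region length and the resolution. The helper below is a simple arithmetic sketch (the name `n_bins` is illustrative); how the notebooks actually window and crop the region may differ.

```python
def n_bins(start: int, end: int, resolution: int) -> int:
    """Number of fixed-size bins needed to cover [start, end), rounding up."""
    span = end - start
    return -(-span // resolution)  # ceiling division


# The H1ES example region from the list above: chr8:53167500-55167500.
region = (53_167_500, 55_167_500)

print(n_bins(*region, 5_000))   # 400 bins at 5kb, i.e. a 400 x 400 map
print(n_bins(*region, 10_000))  # 200 bins at 10kb
```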

Contact

If you have any questions, please feel free to contact Rui Yang ([email protected]) or Arnav Das ([email protected]).

epiphany's People

Contributors

ruy204, arnavmdas

Forkers

ruy204 alexbelov3

epiphany's Issues

OSError: Unable to open file (file signature not found)

Hi, thank you for the great work. I'm very interested in this model and have been playing around with it. I downloaded the data and then ran the training script as instructed, but got OSError: Unable to open file (file signature not found). The full error message was attached as a screenshot.
I googled this error, and the results suggest it might be due to file corruption. Would you mind taking a look at it? Thank you!

About prediction of all chromosomes

Dear all,

In the prediction example, you provided the chr3 ground truth data for predicting chr3 interactions, which makes it possible to compare the predictions against the ground truth.

Suppose I need to predict all other chromosomes. Is the ground truth data of other chromosomes available? Or how can I create a file similar to the chr3 ground truth file?

Thanks

Getting error in prediction using pretrained models

Hi all,

Thanks for the great work.

When I try to run the Google Colab commands in my Jupyter notebook for "Generate contact map of GM12878 chromosome 3 using pre-trained model at 10kb resolution",

I get an error in:

results_generation(chrom = chrom, net = net,
                   cell_type = "GM12878",
                   bwfile_dir = "/content/epiphany/bigWig",
                   submatrix_location = "/content/intermediate_matrices.txt",
                   assemble_matrix_location = "/content/assembled_chromosome.txt",
                   ground_truth_file = "/content/epiphany/ground_truth/chr3_ground_truth.txt",
                   ground_truth_location = "/content/ground_truth_corresponding_location.txt",
                   window_size = wsize) #normcounts, zvalue, zfull

Of course I am adjusting the file paths

The error is:

[readRTreeIdx] Mismatch in the magic number!
[bwOpen] bwg->idx is NULL bwg->hdr->dataOffset 0x1919!
[pyBwOpen] bw is NULL!

Another question,

When I tried to view the file chr3_ground_truth.txt, I found that it is not a normal text file. It looks like a binary file despite the .txt extension.

Can you please help with these two questions?

Thanks
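One way to narrow down the bigWig error is to check the file's signatures directly. The sketch below reads the header of a file that should be a bigWig, then looks at the start of the index block the header points to. The constants and header offsets reflect the bigWig format as commonly documented and are stated here as assumptions; the function `check_bigwig` is illustrative and not part of Epiphany or pyBigWig.

```python
import os
import struct

BIGWIG_MAGIC = 0x888FFC26  # bigWig file signature (assumed from the format spec)
RTREE_MAGIC = 0x2468ACE0   # signature at the start of the R-tree index (assumed)


def check_bigwig(path):
    """Return a short diagnosis string for a file that should be a bigWig."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        header = fh.read(32)
        if len(header) < 32:
            return "too short to hold a bigWig header"
        # The signature may be stored in either byte order.
        for endian in ("<", ">"):
            if struct.unpack(endian + "I", header[:4])[0] == BIGWIG_MAGIC:
                break
        else:
            return "header magic mismatch: not a bigWig file"
        # Bytes 24-31 of the header hold the offset of the full index.
        index_offset = struct.unpack(endian + "Q", header[24:32])[0]
        if index_offset + 4 > size:
            return "index offset past end of file: download probably truncated"
        fh.seek(index_offset)
        if struct.unpack(endian + "I", fh.read(4))[0] != RTREE_MAGIC:
            return "index magic mismatch: file probably corrupted"
    return "ok"
```

Calling `check_bigwig` on each file in the bigWig directory and comparing file sizes against the originals can show whether a file was truncated or replaced by a placeholder during copying.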

ATAC-seq vs DNaseI-seq

Dear authors,

Since ATAC-seq and DNaseI-seq both assess open regions of the genome, would a model trained on ATAC + (the other epigenetic 1D maps mentioned in the Epiphany paper) be better suited than one trained on DNaseI for predicting Hi-C contact maps?

I am very interested in using Epiphany to predict contact maps in different subtypes of B lymphocytes (GM12878 is a B-lymphocyte line). However, the high cell input required by DNaseI-seq, 10-50 million cells, is limiting, whereas ATAC-seq needs just 50k cells and can also scale down to scATAC-seq.

I presume the main model presented in the paper would also perform well with ATAC instead of DNase. However, with limited computing resources, I can generate the ATAC-seq data for GM12878 but cannot train a fine-tuned or revised model on it together with the other epigenetic 1D maps.

I was also tempted to simply supply the ATAC-seq bigWig file as a pseudo-DNaseI file for prediction in your Google Colab Python notebook. Maybe I will go ahead with this and see whether the predictions differ much from the original runs that used DNaseI-seq.

BTW, I was wondering whether you have a model trained on (ATAC, CTCF, H3K27ac, H3K27me3, H3K4me3) that is available for the community to use. Thanks much for reading through this lengthy post!
