
hicgan's Introduction

hicGAN

We propose hicGAN, an open-source framework for inferring high-resolution Hi-C data from low-resolution Hi-C data with generative adversarial networks (GANs).

This work was presented as an oral talk at the ISMB 2019 conference, held July 21-25 in Switzerland.

model

hicGAN consists of two networks that compete with each other: the generator G tries to generate super-resolution samples that are highly similar to real high-resolution samples, while the discriminator D tries to distinguish generated super-resolution samples from real high-resolution Hi-C samples.
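The two-player objective can be sketched as follows. This is a hypothetical illustration in plain NumPy, not the repository's TensorFlow code; `d_real` and `d_fake` stand for D's probability outputs on real and generated samples:

```python
import numpy as np

# Standard GAN losses; D outputs the probability that a sample
# is a real high-resolution Hi-C matrix.
def d_loss(d_real, d_fake):
    # D wants d_real -> 1 and d_fake -> 0
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def g_loss(d_fake):
    # G wants D to score its super-resolution samples as real (d_fake -> 1)
    return -np.mean(np.log(d_fake))
```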

Requirements

  • TensorFlow == 1.13.1
  • TensorLayer == 1.9.1
  • hickle == 2.1.0
  • Java JDK == 1.8.0
  • Juicer Tool

Installation

hicGAN can be downloaded by

git clone https://github.com/kimmo1019/hicGAN

Installation has been tested on Linux and macOS with Python 2.7.

Instructions

We provide detailed step-by-step instructions for running the hicGAN model, both to reproduce the results in the original paper and to infer high-resolution Hi-C data of your own interest.

Step 1: Download raw aligned sequencing reads from Hi-C experiments

We preprocess Hi-C data from aligned sequencing reads (e.g. GSM1551550_HIC001_merged_nodups.txt.gz from Rao et al. 2014). One can directly download raw Hi-C data from the GEO database or refer to our raw_data_download_script.sh script in the preprocess folder. This will generate raw Hi-C data under a PATH-to-hicGAN/data/CELL folder. Please note that the download may take a long time.

Step 2: Generate Hi-C raw contacts for both high-resolution Hi-C data and down-sampled low-resolution Hi-C data at a given resolution

We use the Juicer toolbox to preprocess the raw Hi-C data. Ensure that Java and the Juicer toolbox are installed on your system. One can generate Hi-C raw contacts for both high-resolution Hi-C data and down-sampled low-resolution Hi-C data by running the preprocess.sh script in the preprocess folder. Note that one can speed up the preprocessing with Slurm by modifying one line of preprocess.sh; see the annotation in preprocess.sh.

bash preprocess.sh <CELL> <Resolution> <path/to/juicer_tools.jar>

For example, one can directly run bash preprocess.sh GM12878 10000 path/to/juicer_tools.jar to extract Hi-C raw contacts of GM12878 cell line with resolution 10k.
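The down-sampling itself is done at the read level by preprocess.sh, but conceptually it can be sketched as binomial thinning of a raw contact matrix. This is a hypothetical NumPy illustration (the 1/16 ratio here is just an example), not the repository's actual preprocessing code:

```python
import numpy as np

def downsample(contacts, ratio=1.0 / 16, seed=0):
    """Thin each raw contact count binomially to mimic sequencing at lower depth.
    Hypothetical sketch; the repository down-samples the aligned reads instead."""
    rng = np.random.default_rng(seed)
    return rng.binomial(contacts.astype(np.int64), ratio)
```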

Step 3: Prepare the training and test data

Typically, Hi-C samples from chromosomes 1-17 will be kept for training and chromosomes 18-22 will be kept for testing in each cell type.

python data_split.py <CELL>

For example, one can directly run python data_split.py GM12878 to generate train_data.hkl and test_data.hkl under the data/GM12878 data folder.
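The chromosome-level split can be sketched as follows (a hypothetical illustration of the convention described above; data_split.py may name chromosomes differently):

```python
# chr1-chr17 are kept for training, chr18-chr22 for testing
train_chroms = ['chr%d' % i for i in range(1, 18)]
test_chroms = ['chr%d' % i for i in range(18, 23)]
```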

Step 4: Run hicGAN model

After preparing the training and test data, one can run the following command to run hicGAN:

python run_hicGAN.py <GPU_ID> <checkpoint> <graph> <CELL>

For example, one can run python run_hicGAN.py 0 checkpoint/GM12878 graph/GM12878 GM12878. Note that checkpoint is the folder for saving the model and graph is the folder for visualization with TensorBoard. These folders will be created if they do not exist.

Step 5: Evaluate hicGAN model

After model training, one can evaluate hicGAN by calculating the MSE, PSNR and SSIM metrics; just run the following command:

python hicGAN_evaluate.py <GPU_ID> <MODEL_PATH> <CELL>

For example, one can run python hicGAN_evaluate.py 0 checkpoint GM12878 for model evaluation. The predicted data will be saved in data/<CELL> folder.
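The first two metrics can be sketched as below (hypothetical NumPy versions, not necessarily the exact implementation in hicGAN_evaluate.py; for SSIM one would typically use skimage.metrics.structural_similarity):

```python
import numpy as np

def mse(a, b):
    # Mean squared error between two Hi-C matrices
    return np.mean((a - b) ** 2)

def psnr(a, b, data_range=1.0):
    # Peak signal-to-noise ratio in dB; higher is better
    m = mse(a, b)
    return np.inf if m == 0 else 10.0 * np.log10(data_range ** 2 / m)
```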

Finally, we provide a demo.ipynb notebook that illustrates the above steps with a demo of the hicGAN model.

We also provide a Results_reproduce folder showing how the results in our paper were produced.

Note that we also provide a pre-trained hicGAN model trained on the GM12878 cell line.

Run hicGAN on your own data

We have provided instructions for running the hicGAN model from raw aligned sequencing reads. To run hicGAN directly on custom data, construct the low-resolution data and the corresponding high-resolution data consumed by run_hicGAN.py as follows.

Step 1: Modify one line in run_hicGAN.py

You can find lr_mats_train_full, hr_mats_train_full = hkl.load(...) in run_hicGAN.py. All you need to do is to generate lr_mats_train_full and hr_mats_train_full by yourself.

Note that hr_mats_train_full and lr_mats_train_full are the high-resolution and low-resolution Hi-C training samples, respectively. The shapes of hr_mats_train_full and lr_mats_train_full are both (nb_train, 40, 40, 1).

We extracted training samples from the original Hi-C matrices by cropping non-overlapping 40-by-40 squares (resolution: 10 kb) within a 2 Mb band of the diagonal. See data_split.py for details if necessary.
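The cropping scheme can be sketched as follows (a hypothetical illustration of the description above, assuming 10 kb bins so that the 2 Mb band corresponds to 200 bins; the actual logic lives in data_split.py):

```python
import numpy as np

def crop_patches(mat, size=40, max_dist_bins=200):
    """Crop non-overlapping size x size squares within a band of the diagonal."""
    n = mat.shape[0]
    patches = []
    for i in range(0, n - size + 1, size):
        for j in range(0, n - size + 1, size):
            if abs(i - j) <= max_dist_bins:
                patches.append(mat[i:i + size, j:j + size])
    # Shape (nb, 40, 40, 1), matching the model's expected input
    return np.stack(patches)[..., None]
```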

Step 2: Modify one line in hicGAN_evaluate.py

You can find lr_mats_test, hr_mats_test, _ = hkl.load(...) in hicGAN_evaluate.py. All you need to do is generate lr_mats_test and hr_mats_test by yourself.

Then run the following command:

python hicGAN_evaluate.py <GPU_ID> <MODEL_PATH> <CELL>

The predicted data will be saved in data/<CELL> folder.

We also provide a script, hicGAN_predict.py, for the case where the ground truth of the test data is unknown. One can run the following command:

python hicGAN_predict.py <GPU_ID> <MODEL_PATH> <DATA_PATH> <SAVE_DIR>

  • GPU_ID: GPU ID (e.g. 0)

  • MODEL_PATH: path to the weights file for hicGAN_g (e.g. checkpoint/g_hicgan_best.npz)

  • DATA_PATH: path to the data to be enhanced (e.g. lr_mat_test.npy)

  • SAVE_DIR: directory in which to save the predicted data

You need to generate your own test data in npy format and pass its path as DATA_PATH. The predicted data will be saved in the SAVE_DIR folder.
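For example, your own test data could be prepared like this (a hypothetical sketch; the random values are placeholders for your actual low-resolution patches, and lr_mat_test.npy is the example filename from above):

```python
import numpy as np

# Low-resolution patches with the model's expected shape (nb, 40, 40, 1)
lr_mat_test = np.random.rand(50, 40, 40, 1).astype('float32')
np.save('lr_mat_test.npy', lr_mat_test)  # pass this path as DATA_PATH
```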

Feel free to contact [email protected] if you have any problem in implementing your own hicGAN model.

Citation

Liu Q, Lv H, Jiang R. hicGAN infers super resolution Hi-C data with generative adversarial networks[J]. Bioinformatics, 2019, 35(14): i99-i107.

@article{liu2019hicgan,
  title={hicGAN infers super resolution Hi-C data with generative adversarial networks},
  author={Liu, Qiao and Lv, Hairong and Jiang, Rui},
  journal={Bioinformatics},
  volume={35},
  number={14},
  pages={i99--i107},
  year={2019},
  publisher={Oxford University Press}
}

License

This project is licensed under the MIT License; see the LICENSE.md file for details.

hicgan's People

Contributors

kimmo1019

hicgan's Issues

Load pretrain model error

Hi,
Thanks for the model.

I'm trying to use hicGAN to predict Hi-C matrix (data on GM12878).
I trained the model and stopped it after 7 days of training. Since I'm not familiar with TensorFlow v1, are there any hints for training? It trains quite slowly.
Prediction works fine when loading "g_hicgan_best.npz", but the predictions were bad.

So I want to use the pre-trained model. However, it shows an error when loading "g_hicgan_GM12878_weights.npz":

tl.files.load_and_assign_npz(sess=sess, name=model_name, network=net_g)
File "/rhome/yhu/bigdata/.conda/envs/env_hicgan/lib/python3.6/site-packages/tensorlayer/files/utils.py", line 1712, in load_and_assign_npz
params = load_npz(name=name)
File "/rhome/yhu/bigdata/.conda/envs/env_hicgan/lib/python3.6/site-packages/tensorlayer/files/utils.py", line 1645, in load_npz
return d['params']
File "/rhome/yhu/bigdata/.conda/envs/env_hicgan/lib/python3.6/site-packages/numpy/lib/npyio.py", line 251, in __getitem__
pickle_kwargs=self.pickle_kwargs)
File "/rhome/yhu/bigdata/.conda/envs/env_hicgan/lib/python3.6/site-packages/numpy/lib/format.py", line 663, in read_array
 "to numpy.load" % (err,))

Any suggestion?

Bests.

The shape of the predicted tensor is not correct

Hi kimmo1019,
I am wondering how you recover the matrix.
We used some data to train your model, and the shape of the tensor is (768, 40, 40, 1).
However, the shape of the related index file is (1407, 2).
These are not consistent (768 != 1407). Could you tell me where the index file is created by your program? Thank you very much in advance for your time!

Bad Prediction

Hello there,

I am using your model. Even with pre-trained model parameters, I obtain pretty bad predictions.

I want to combine the subregions into per chromosome matrix and then compare instead of one-to-one subregion comparison. Below is the reverse of your code to combine subregions into the whole matrix:

import numpy as np
import hickle as hkl
import pandas as pd
import math

# we'll need chromosome sizes and indices of submatrices from the original test_data

df = pd.read_csv("../../../chromosome.txt", sep="\t", header=None)
chrsizes = df.values[0:25,1] # sex and mitochondrial chrs included hg19
re_mat = hkl.load("../test_data.hkl")
dists = re_mat[2]
distc = []
for i in range(0,len(dists)):
	distc.append(dists[i][1])
predchr = pd.unique(distc)
sub_inds = np.load("test_allchr_subregion_inds.npy")
thred = 200
size = 40
c = 1
for cname in predchr:
	pr_mat = np.load('%s/sr_mats_pre.npy'%cname)
	remat_ind = sum(sub_inds[:c])
	c +=1
	rematCond = re_mat[2][remat_ind:sum(sub_inds[:c])]
	pp = 0
	cnum = int(cname.split("chr")[1])
	bin = int(math.ceil(chrsizes[cnum-1]/10000.0)) # ceil returns float ! 
	row,col = bin,bin
	sr_mat = -1*np.ones((row,col))

	# recombine the predicted matrix into original dimensions

	for idx1 in range(0,row-size,size):
		for idx2 in range (0,col-size,size):
			my_cond = rematCond[pp][:]==[idx1-idx2,cname]
			if (abs(idx1-idx2)<thred) & (my_cond):
				sr_mat[idx1:idx1+size,idx2:idx2+size] = pr_mat[pp].reshape(40,40)
				pp+=1			
			if pp==pr_mat.shape[0]:
				break;		
		if pp==pr_mat.shape[0]:
			break;

	np.save("./pred_%s_hicGAN.npy"%cname,sr_mat)

I have been working with your model for a while now and I couldn't detect my mistake, if there is any.

So what do you think: could a problem have been introduced when you updated the model?

Thank you,

Required memory for data_split.py on Rao2014 data

Hey there,

Currently, I am trying to reproduce your work with Rao2014 GM12878 data.
However, at the stage where we preprocess and normalize the data and split it using data_split.py, the process gets killed due to excessive memory requirements.
I work on a workstation with 64 GB RAM.
I am wondering what memory specs you used for this analysis.

The article states the used GPU cards but not the required memory.

Thanks in advance!
