PECA is a software for inferring context specific gene regulatory network from paired gene expression and chromatin accessibility data


PECA

Introduction:

PECA is software for inferring context-specific gene regulatory networks from paired gene expression and chromatin accessibility data. Please cite the PECA and PECA2 papers:

Duren, Zhana, et al. "Modeling gene regulation from paired expression and chromatin accessibility data." Proceedings of the National Academy of Sciences 114.25 (2017): E4914-E4923.

Duren, Zhana, et al. "Time course regulatory analysis based on paired expression and chromatin accessibility data." Genome research 30.4 (2020): 622-634.

Quick start:

wget https://github.com/SUwonglab/PECA/archive/master.zip
unzip master.zip
cd PECA-master/
bash install.sh

bash PECA.sh sampleName genome

Install:

bash install.sh

Run PECA:

Run PECA in two steps:

Step 1: Input

Put the three input files in the ./Input folder: ${SampleName}.txt, ${SampleName}.bam, and ${SampleName}.bam.bai.

${SampleName}.txt is the gene expression file. It contains two tab-delimited columns: gene symbol and expression value (FPKM or TPM).
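As a quick sanity check before running PECA, the two-column layout can be verified with awk. This is only a sketch: validate_expression_file is a hypothetical helper that is not part of PECA, and it assumes the file has no header line and that expression values are plain non-negative numbers.

```shell
# validate_expression_file is a hypothetical helper, not part of PECA.
# Assumptions: no header line; the second column is a non-negative number.
# It succeeds only if every line has exactly two tab-separated fields.
validate_expression_file() {
    awk -F '\t' '
        NF != 2 || $2 !~ /^[0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)?$/ {
            printf "line %d is malformed: %s\n", NR, $0 > "/dev/stderr"
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Example: validate_expression_file ./Input/RAd4.txt
```

A file that was saved with spaces instead of tabs, or with Windows line endings, is reported line by line on standard error.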

${SampleName}.bam is the chromatin accessibility data (DNase-seq or ATAC-seq).

${SampleName}.bam.bai is the index file of the BAM file.

Note that all three files must share the same base name ${SampleName} and differ only in the extension: ".txt", ".bam", or ".bam.bai". See the RAd4 example in the ./Input directory.
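The naming rule can be checked before starting a long run. The sketch below is a hypothetical helper (check_peca_inputs is not part of PECA) that only tests whether the three files exist. It does not inspect their contents.

```shell
# check_peca_inputs is a hypothetical helper, not part of PECA.
# It reports every missing input file for one sample and returns
# non-zero if any of the three files is absent.
check_peca_inputs() {
    local sample="$1"
    local input_dir="${2:-./Input}"
    local missing=0
    local ext
    for ext in txt bam bam.bai; do
        if [ ! -f "${input_dir}/${sample}.${ext}" ]; then
            echo "missing: ${input_dir}/${sample}.${ext}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: check_peca_inputs RAd4 ./Input && sh PECA.sh RAd4 mm9
```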

Step 2: Run

sh PECA.sh ${SampleName} ${genome}

Example: sh PECA.sh RAd4 mm9

To make sure the code runs smoothly, please provide at least 64 GB of memory.

The results will be written to ./Results/${SampleName}/. ${SampleName}_network.txt is the tissue-specific network.

TFTG_score.txt is the regulation strength of every TF on every target gene (TG). Each row represents one TF and each column represents one target gene. A higher value indicates a higher likelihood of regulation.

CRB_pval.txt is the chromatin regulator (CR) binding site matrix. Each column represents one CR, each row represents one region, and the values are p-values.

Run PECA without ENCODE data information

The PECA model uses prior information from ENCODE data. You can instead learn this prior from your own data, without ENCODE, if you have more than 5 paired samples.

sh PECA_withoutENCODE.sh FullPath_to_sampleNameFile ${genome}

Example: sh PECA_withoutENCODE.sh /home/user/sampleName.txt hg19

Here /home/user/sampleName.txt is a text file that contains one sample name per line. For example:

ES_day0
ES_day2
ES_day4
ES_day6
ES_day10
ES_day20

Under the Input folder you should have ES_day0.txt, ES_day0.bam, and ES_day0.bam.bai, and likewise for the other samples. The results for ES_day0 will be stored in ./Results__withoutENCODE/ES_day0/.
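The sample name file can be generated from the Input folder instead of being typed by hand. The following is a sketch under stated assumptions: make_sample_list is a hypothetical helper that is not part of PECA, and it treats a sample as usable only when all three input files are present.

```shell
# make_sample_list is a hypothetical helper, not part of PECA.
# It writes one sample name per line for every sample in the input
# folder that has all three files (.txt, .bam, .bam.bai), and returns
# non-zero unless more than 5 samples were found.
make_sample_list() {
    local input_dir="$1"
    local out_file="$2"
    local bam sample
    : > "$out_file"
    for bam in "$input_dir"/*.bam; do
        [ -f "$bam" ] || continue
        sample=$(basename "$bam" .bam)
        if [ -f "$input_dir/$sample.txt" ] && [ -f "$bam.bai" ]; then
            echo "$sample" >> "$out_file"
        fi
    done
    [ "$(grep -c . "$out_file")" -gt 5 ]
}

# Example (PECA_withoutENCODE.sh expects the full path to the list):
#   make_sample_list ./Input "$(pwd)/sampleName.txt" &&
#       sh PECA_withoutENCODE.sh "$(pwd)/sampleName.txt" hg19
```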

Run PECA_compReg:

If you have two conditions (multiple samples in each condition) and want to compare them at the network level, please see the tutorial in comparative_regulatory_analysis.md: https://github.com/SUwonglab/PECA/blob/master/comparative_regulatory_analysis.md.

Run PECA_net_dif:

If you have two samples and want to compare them at the network level, follow these steps:

1, Prepare two networks: run PECA on each of the two samples in turn with "sh PECA.sh ${sampleName} ${genome}"

2, Run: sh PECA_compare_dif.sh ${Sample1} ${Sample2} ${Organism}

Example: sh PECA_compare_dif.sh K562 GM12878 human ; sh PECA_compare_dif.sh mESC RAd4 mouse
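The two steps can be chained. The sketch below prints the commands rather than running them, so the plan can be reviewed first. Both function names are hypothetical and not part of PECA, and the genome-to-organism mapping assumes UCSC-style assembly names (hg* for human, mm* for mouse), as in the hg19 and mm9 examples in this README.

```shell
# organism_for_genome and peca_compare_plan are hypothetical helpers,
# not part of PECA.

# Map a UCSC-style assembly name to the organism argument.
organism_for_genome() {
    case "$1" in
        hg*) echo human ;;
        mm*) echo mouse ;;
        *)   echo "unknown genome: $1" >&2; return 1 ;;
    esac
}

# Print the three commands for a two-sample comparison, in order.
peca_compare_plan() {
    local s1="$1" s2="$2" genome="$3" organism
    organism=$(organism_for_genome "$genome") || return 1
    echo "sh PECA.sh $s1 $genome"
    echo "sh PECA.sh $s2 $genome"
    echo "sh PECA_compare_dif.sh $s1 $s2 $organism"
}

# Example: peca_compare_plan mESC RAd4 mm9
```

Piping the output to sh (peca_compare_plan mESC RAd4 mm9 | sh) would run the plan from the PECA directory.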

The results will be written to ./Results/Compare_${Sample1}_${Sample2}, which contains six files:

specific network of two samples: ${Sample1}_specific_network.txt and ${Sample2}_specific_network.txt

common network of two samples: ${Sample1}_${Sample2}_common_network.txt

specific module of two networks: ${Sample1}_specific_module.txt and ${Sample2}_specific_module.txt

common module of two samples: ${Sample1}_${Sample2}_common_module.txt

The files PooledNetwork.txt and PooledModule.txt can be used to visualize the network in Cytoscape; the node labels are given in the file Node_lable.txt. "1" and "-1" in PooledNetwork.txt or PooledModuole.txt represent "Activation" and "Repression", respectively. "1" and "2" in Node_lable.txt indicate that the gene is Sample1-specific or Sample2-specific, respectively.

Run PECA_net_dif_multiple:

If you have two conditions (multiple samples in each condition) and want to compare them at the network level, follow these steps:

1, Prepare networks: run PECA on every sample from both conditions in turn with "sh PECA.sh ${sampleName} ${genome}"

2, Construct labels: write the sample names of Group1 and Group2 into text files named $Group1 and $Group2, respectively. (E.g., create one text file named "Control" containing the sample names of one condition, and another text file named "Case" containing the sample names of the other condition. Each file contains one sample name per line.)

3, Run: sh PECA_compare_dif_multiple.sh $Group1 $Group2 ${Organism}

Example: sh PECA_compare_dif_multiple.sh Control Case human
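The group files from step 2 can be written with printf, and step 1 can be verified before step 3 is started. This is a sketch: both helpers are hypothetical and not part of PECA, the sample names in the example are made up, and the README does not say where the group files must live, so the helper writes to whatever path it is given. The network path follows the ./Results/${SampleName}/${SampleName}_network.txt layout described under "Step 2: Run".

```shell
# write_group_file and check_group_networks are hypothetical helpers,
# not part of PECA.

# Write the given sample names to a group file, one per line.
write_group_file() {
    local group_file="$1"
    shift
    printf '%s\n' "$@" > "$group_file"
}

# Return non-zero if any sample in the group file has no PECA network yet.
check_group_networks() {
    local group_file="$1"
    local results_dir="${2:-./Results}"
    local sample missing=0
    while IFS= read -r sample; do
        [ -n "$sample" ] || continue
        if [ ! -f "$results_dir/$sample/${sample}_network.txt" ]; then
            echo "no network for $sample; run PECA.sh on it first" >&2
            missing=1
        fi
    done < "$group_file"
    return "$missing"
}

# Example (sample names are placeholders):
#   write_group_file Control ctrl_rep1 ctrl_rep2
#   write_group_file Case case_rep1 case_rep2
#   check_group_networks Control && check_group_networks Case
```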

The results will be written to ./Results/CompareGroup_${Group1}_${Group2}, which contains six files:

specific network of two conditions: ${Group1}_specific_network.txt and ${Group2}_specific_network.txt

common network of two conditions: ${Group1}_${Group2}_common_network.txt

specific module of two conditions: ${Group1}_specific_module.txt and ${Group2}_specific_module.txt

common module of two conditions: ${Group1}_${Group2}_common_module.txt

The files PooledNetwork.txt and PooledModuole.txt can be used to visualize the network in Cytoscape; the node labels are given in the file Node_lable.txt. "1" and "-1" in PooledNetwork.txt or PooledModuole.txt represent "Activation" and "Repression", respectively. "1" and "2" in Node_lable.txt indicate that the gene is Group1-specific or Group2-specific, respectively.

Requirements:

Matlab (Optimization Toolbox)
macs2
homer
samtools
bedtools
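Whether the command-line requirements are on the PATH can be checked up front. The sketch below uses a hypothetical helper (missing_tools is not part of PECA). The executable names passed to it are assumptions: in particular, HOMER is checked through its findMotifsGenome.pl script, which may not be the only HOMER program PECA calls.

```shell
# missing_tools is a hypothetical helper, not part of PECA.
# It prints every tool that is not found on the PATH and returns
# non-zero if at least one is missing.
missing_tools() {
    local tool missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example (executable names are assumptions):
#   missing_tools matlab macs2 samtools bedtools findMotifsGenome.pl
```

This only confirms that the programs can be found. It cannot tell whether the MATLAB Optimization Toolbox is installed, which can be checked inside MATLAB with the ver command.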

Contact:

If you have any issues, please contact Zhana Duren by [email protected]

peca's Issues

Updates on `PECA` and related repositories

Hi there,

I've been exploring a bunch of repositories that either contain the PECA software or have PECA as a major dependency. I've found these repos across a few accounts:

I've previously submitted issues on the SUwonglab/PECA issue tracker and made an attempt to get in touch with @durenzn directly, both via "@" and by email. Unfortunately, those issues are still awaiting responses.

To get a better picture of the current state of PECA and cast a wide net, I'm posting this identical issue across all these repositories. Is PECA archived, or is it still under development or maintenance? If it's still being developed or maintained, where's that mainly happening?

Any information or guidance would be super helpful. Thanks!

Error when running 'install.sh'

Thanks for the software.

I'm currently having issues running the 'install.sh' script. Attempting to download the tar.gz file from your website returns a 403 Forbidden error. It looks like the entire website has access permissions removed completely. Full error below:

--2021-10-06 18:42:01-- http://web.stanford.edu/~zduren/PECA/Thresholding-Based%20SVD_files/Prior.tar.gz
Resolving web.stanford.edu (web.stanford.edu)... 171.67.215.200, 2607:f6d0:0:925a::ab43:d7c8
Connecting to web.stanford.edu (web.stanford.edu)|171.67.215.200|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2021-10-06 18:42:01 ERROR 403: Forbidden.

tar (child): Prior.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now

Thank you!

Request for code clean-up and question about init. of "EM-like" approach

Hi, Thanks for the nice approach and follow ups TimeReg and vPECA.

Gentle request for code clean-up: It would help open source development for some minor clean-up (var names, comments, typo fixes, ...). Then other devs could more easily work on larger refactorings, e.g. the abstraction of PECA/scr/PECA_network_* functions and so on.
Can you describe the initialization strategy for your "EM-like" approach? For instance, I see here you fix eita (eta, $\eta$ ?) to two values:

eita0=-30.4395;
eita1=0.8759;

I'm surprised single values are fixed and fixed to 4 decimal places for this algorithm.

Question: Do you suggest filtering the input gene expression table before running PECA

Hi team,

I read in the manuscript you mentioned building the network considering only genes with FPKM > 10.
So do you suggest removing lowly expressed genes before running PECA?
In addition, I noticed that /Prior/TFTG_corr_mouse.mat has no Trp53 in the mouse TFName list. How did you define these TFs?

Thanks.

Remove assumption MATLAB is loaded via `Lmod`

PECA's shell scripts assume Lmod is installed and that MATLAB is available via Lmod in a module named matlab. See e.g. here.

This info is missing from the documentation, but this dependency/assumption should in any case be removed.
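One way the assumption could be relaxed is sketched below. This is not the project's code: load_matlab is a hypothetical function that uses MATLAB from the PATH when it is already there and falls back to "module load matlab" only when a module command exists.

```shell
# load_matlab is a hypothetical replacement for an unconditional
# "module load matlab"; it is not part of PECA.
load_matlab() {
    # MATLAB is already available: nothing to do.
    if command -v matlab >/dev/null 2>&1; then
        return 0
    fi
    # Fall back to an environment-module system only if one is present.
    if command -v module >/dev/null 2>&1; then
        module load matlab
    fi
    # Succeed only if MATLAB can now be found.
    command -v matlab >/dev/null 2>&1
}

# Example: load_matlab || { echo "MATLAB not found" >&2; exit 1; }
```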

Question: should I use only nucleosome free region as ATACseq input?

Hi, I was wondering whether filtering the ATAC-seq BAM file used as input for PECA2 to retain only nucleosome-free regions is required and/or would increase the accuracy of the analysis.

Does PECA work for uncommon animal species?

Hi,
I read the paper "Modeling gene regulation from paired expression and chromatin accessibility data" and think this software will be very helpful for my research. My research organism is Amphimedon queenslandica (a demosponge). I have ATAC-seq and RNA-seq data, and we have the genome and genome feature format files. How could I use PECA for my analysis? The software seems to ship only with human and mouse information.
Thanks in advance for any useful information.

Best Regards
Huifang

Can we use multi-cores to run the programme?

Thanks for your software.

I have a batch of time-course ATAC-seq and RNA-seq data to run.

Accordingly, I am running PECA first. The motif-finding step is too slow. I tried adding "-p 48" after "-size given", but it does not seem to work.

Lack of CRS part

Hi, I have gone through the PECA pipeline without ENCODE, but I cannot find the results related to the CRS mentioned in the TimeReg article, nor anything about it in the source code. According to the article, a curated network is supposed to be built by filtering on TRS and CRS. However, in the present pipeline the RE-TG pairs appear to be identified simply by overlap with the prior data, which in our case is generated by a linear model.
Am I misunderstanding something?

Error- Couldn't open motif file

Undefined function 'fsolve' for input arguments of type 'function_handle' when running `PECA_compare_dif_multiple.sh`

When running PECA_compare_dif_multiple.sh I got the following error:
Undefined function 'fsolve' for input arguments of type 'function_handle'.

The fix (for me) was to install the MATLAB 'Optimization Toolbox'.

System: MacOS Monterey (12.2), MacBook Pro 2020 (Intel i7)

Using FPKM for PECA/TimeReg

The manual says PECA can use TPM or FPKM as input but FPKM values are not usually used to compare between samples. Is it ok to use FPKM if we intend to use compare_diff or TimeReg, which compare between samples? (Also, for TimeReg, we have to take averages of the RNA replicates, can we do this with FPKMs? Or should we take the averages of the counts first then convert to FPKM?)

non-model organisms

Hi,
Can non-model organisms use this tool to infer networks?
Thanks a lot for your time and effort!

matrix dimensions must agree error in matlab

Hi, I ran the example file and everything seems to work fine.

However, when I tried with my own samples I get an error in Matlab

"matrix dimensions must agree"

the line of code that causes the error is

Score=sqrt(TFExpG').*(2.^abs(R2)).*full(BOH);

Removing .*(2.^abs(R2)) allows the line to execute without errors.

Any ideas as to where the problem might be?

Thanks,
Sky