SPREd

SPREd (Supervised Predictor of Regulatory Edges) is a machine-learning-based computational tool for reconstructing gene regulatory networks (GRNs) from transcriptomic data.

Overview

Installation

Example usage

Data

Contact

Overview

The inputs to the SPREd model comprise expression relationships (correlation, mutual information, precision score) between the target gene and each candidate TF, and between each pair of TFs. The output is a single boolean label (SPREd-SP) or a vector of boolean labels (SPREd-ML) indicating whether each candidate TF regulates the target gene. Training such a neural network model requires a large number of training samples in the form of expression matrices and their underlying GRNs. For this, we use the GRN-based expression data simulator SERGIO, which uses a biophysics-inspired stochastic differential equation to simulate a gene's expression dynamics under linear as well as non-linear influence of its regulator TFs, as prescribed by a GRN. We use SERGIO to generate thousands of training samples for SPREd, covering a variety of GRNs and the corresponding expression profiles, resulting in a trained model that can predict the GRN for any given expression matrix.
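As a rough illustration of the kind of pairwise relationship described above, the following sketch computes Pearson correlation for every pair among one target gene and its candidate TFs. It is not SPREd's preprocessing code: the function names and toy data are hypothetical, only correlation is shown (mutual information and precision score are omitted), and constant expression profiles, which would cause a division by zero, are not handled.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(target, tfs):
    """Correlate every pair among the target gene and its candidate TFs.

    Index 0 is the target gene; indices 1..n_tfs are the TFs.
    (Hypothetical helper, for illustration only.)
    """
    profiles = [target] + list(tfs)
    return {
        (i, j): pearson(profiles[i], profiles[j])
        for i, j in combinations(range(len(profiles)), 2)
    }


# Toy data (made up): one target gene and two TFs over four conditions.
target = [1.0, 2.0, 3.0, 4.0]
tfs = [[2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]
scores = pairwise_correlations(target, tfs)
print(len(scores))  # number of pairs = C(n_tfs + 1, 2)
```

With n_tfs candidate TFs there are n_tfs gene-TF pairs plus C(n_tfs, 2) TF-TF pairs, which together give C(n_tfs + 1, 2) pairs.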

Installation

To download SPREd, clone the repository locally with the following command.

git clone git@github.com:iiiime/SPREd.git
cd SPREd

Conda is recommended for environment setup and dependency installation.

conda env create -f environment.yml
conda activate spred

Example usage

SPREd provides several sample datasets under SPREd/tests.

Quick Start

The easiest way to get started is to evaluate our pretrained SPREd models. The sample input of SPREd-SP is a preprocessed matrix of dimension (n, 5, n_tfs+1), where n is the number of TF-gene pairs and 5 is the number of different relationships between the TF and the target gene computed from the expression matrix. The output is a vector of predicted edges between the TF and the target gene. Similarly, the input of the SPREd-ML model is a (n, 5, C(n_tfs+1, 2)) matrix, and the output is a list of boolean vectors, one per gene, with an entry for every TF.
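To make the two input dimensions concrete, here is a small sketch that evaluates them for a given number of TFs. The helper name and the value of n are hypothetical; only the shape formulas come from the description above.

```python
from math import comb


def spred_input_shapes(n, n_tfs, n_relationships=5):
    """Return the expected (SPREd-SP, SPREd-ML) input shapes.

    Hypothetical helper: it only evaluates the shape formulas and does
    not load or validate any data.
    """
    sp_shape = (n, n_relationships, n_tfs + 1)
    ml_shape = (n, n_relationships, comb(n_tfs + 1, 2))
    return sp_shape, ml_shape


# Example with 100 TFs; n = 10000 is an arbitrary illustrative value.
sp_shape, ml_shape = spred_input_shapes(n=10000, n_tfs=100)
print(sp_shape)  # (10000, 5, 101)
print(ml_shape)  # (10000, 5, 5050)
```

For 100 TFs the last dimension grows from 101 (SPREd-SP) to C(101, 2) = 5050 (SPREd-ML).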

Pretrained models

To begin with, run the test scripts on the provided preprocessed synthetic expression data from 1 GRN of the 3-layer network with 5 master regulators, 100 TFs, and 100 target genes.

SPREd-SP

python ./sp/test.py --saved_weights './tests/model_weights_sp.pth' --data './tests/hist_sp.npy' --label './tests/label_sp.csv' --n_tf 100

SPREd-ML

python ./ml/test.py --saved_weights './tests/model_weights_ml.pth' --data './tests/hist_ml.npy' --label './tests/label_ml.csv'

The pretrained model outputs the predicted GRN, which is then evaluated against the true GRN, i.e., the SERGIO input network in the dataset.
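As an illustration of what comparing a predicted GRN with the true GRN can look like, the sketch below computes precision and recall over boolean edge labels. This is an assumption for illustration only: the metrics actually reported by the test scripts may differ, and the function name and toy labels are hypothetical.

```python
def edge_precision_recall(predicted, truth):
    """Precision and recall of predicted edges against true edges.

    Both arguments are equal-length sequences of 0/1 (or bool) labels,
    one entry per candidate TF-gene pair.
    """
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)
    fn = sum(1 for p, t in zip(predicted, truth) if not p and t)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels (made up) for five candidate TF-gene pairs.
predicted = [1, 0, 1, 1, 0]
truth = [1, 0, 0, 1, 1]
precision, recall = edge_precision_recall(predicted, truth)
print(precision, recall)
```

In this toy case two of the three predicted edges are true and two of the three true edges are recovered.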

Train new models

Note, however, that the current pretrained SPREd model implemented in model.py works only on data with exactly the same number of TFs as the data it was trained on. To train a new model and save its weights, we need to generate a synthetic dataset with the same number of TFs and, if possible, the same number of conditions as the expression data to be analyzed. Simulating a sufficient amount of data, about 30k target genes from 300 different GRNs, might take up to 24 hours. In future updates, we will optimize the SPREd workflow by replacing SERGIO with its 2.0 version, which is 100x faster than v1.0.

Data generation

The synthetic expression data is generated by the SERGIO simulator, which provides thousands of training samples from various GRNs for the SPREd model.

Run

python sim.py

to generate input (GRNs and regulation profiles) and output (expression matrix) of SERGIO.

Then, run

python preprocess_sp.py

python preprocess_ml.py

to configure and save the inputs for the SPREd-SP or SPREd-ML model, respectively.

For example, if we want to generate the training data for predicting the yeast stress network (YEASTRACT.Count3 from MERLIN-P) that has 246 candidate TFs and 173 experimental conditions, we simulate the data by running

python sim.py --n_features 246 --n_cond 173 --n_samples 300 --data_dir ./dataset/data/y3_stress

to generate SERGIO expression matrices from 300 GRNs with the same properties as the yeast dataset, and run

python preprocess_sp.py --n_features 246 --n_cond 173 --n_samples 300 --data_dir ./dataset/data/y3_stress --save ./dataset/y3_stress

to prepare the inputs for the SPREd-SP model.
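The two commands share most of their arguments, so it can be convenient to build them from one set of values. The sketch below does this in Python; the helper name is hypothetical, and the script names and flags are only those shown in the commands above. It prints the commands rather than running them.

```python
import shlex


def yeast_pipeline_commands(
    n_features=246,
    n_cond=173,
    n_samples=300,
    data_dir="./dataset/data/y3_stress",
    save_dir="./dataset/y3_stress",
):
    """Build the simulation and SPREd-SP preprocessing commands.

    Hypothetical helper: it only assembles argument lists from the flags
    documented above and does not execute anything.
    """
    shared = [
        "--n_features", str(n_features),
        "--n_cond", str(n_cond),
        "--n_samples", str(n_samples),
        "--data_dir", data_dir,
    ]
    sim_cmd = ["python", "sim.py", *shared]
    prep_cmd = ["python", "preprocess_sp.py", *shared, "--save", save_dir]
    return sim_cmd, prep_cmd


sim_cmd, prep_cmd = yeast_pipeline_commands()
print(shlex.join(sim_cmd))
print(shlex.join(prep_cmd))
# To actually run a step from the repository root, one could use
# subprocess.run(sim_cmd, check=True) from the standard library.
```

Keeping the shared values in one place avoids a mismatch between the simulated data and the preprocessing step, for example different n_features values.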

Data simulation options

n_cond: int (default: 50). The number of experimental conditions, which is defined as the number of cell types in the SERGIO simulator.
n_mrs: int (default: 5). The number of master regulators (MRs), which are not regulated by other TFs.
n_features: int (default: 100). The number of TFs that are not MRs. The in-degrees of TFs are in the range [3, 8].
n_genes: int (default: 100). The number of target genes in each GRN.
n_samples: int (default: 10). The number of GRNs to simulate.
data_dir: str (default: './dataset/data'). The directory of input files of SERGIO and simulated data files generated by SERGIO.

Data preprocessing options

file_id: int (default: 0). The file ID used to enumerate multiple datasets.
save: str (default: './dataset/'). The directory of preprocessed input data for the SPREd model.
data_dir: str (default: './dataset/data/'). The directory of SERGIO simulated data files.
n_cond: int (default: 50). The number of conditions in the expression matrix.
n_bins: int (default: 6). The number of bins of histograms.
n_samples: int (default: 10). The number of training or testing samples.
n_mrs: int (default: 5). The number of master regulators. Should be the same as in the provided dataset.
n_features: int (default: 100). The number of TFs. Should be the same as in the provided dataset.
n_genes: int (default: 100). The number of target genes. Should be the same as in the provided dataset.

Train and run SPREd model

After generating and formatting the synthetic data for training the SPREd model, run

python ./sp/train.py --data './dataset/hist.npy' --labels './dataset/label.csv'

The training is usually completed within 50 epochs.

Training options

batch_size: int (default: 32). The number of training samples in one pass.
learning_rate: float (default: 2e-4). Learning rate of Adam optimizer.
weight_decay: float (default: 5e-4). Weight decay of Adam optimizer.
hidden_unit: int (default: 128). The number of hidden units in fully connected layer.
pos_weight: int (default: 9). The weight applied to positive samples in BCEWithLogitsLoss.
dropout: float (default: 0.3). Dropout rate of hidden units.
data: str (default: './hist0.npy'). File name of the processed input data.
labels: str (default: './label0.csv'). File name of the labels of the data.
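To show what the pos_weight option does, here is a plain-Python sketch of binary cross-entropy on logits with a positive-class weight, following the formula PyTorch documents for BCEWithLogitsLoss with mean reduction. It is an illustration only, not the training code of SPREd; the function names are hypothetical.

```python
from math import exp, log, log1p


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + log1p(exp(-abs(z)))


def weighted_bce_with_logits(logits, labels, pos_weight=9.0):
    """Mean binary cross-entropy on raw logits with a positive-class weight.

    Per sample: pos_weight * y * -log(sigmoid(x)) + (1 - y) * -log(1 - sigmoid(x)),
    using -log(sigmoid(x)) = softplus(-x) and -log(1 - sigmoid(x)) = softplus(x).
    """
    losses = [
        pos_weight * y * softplus(-x) + (1 - y) * softplus(x)
        for x, y in zip(logits, labels)
    ]
    return sum(losses) / len(losses)


# At logit 0 each class costs log(2) before weighting, so one positive and
# one negative sample average to (9 + 1) / 2 * log(2) with pos_weight = 9.
loss = weighted_bce_with_logits([0.0, 0.0], [1, 0], pos_weight=9.0)
print(loss, 5 * log(2))
```

A weight above 1 makes a missed true edge cost more than a false positive, which is a common choice when true edges are much rarer than non-edges.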

Meanwhile, preprocess the expression matrix from which you would like to infer a GRN by executing python preprocess_sp.py or python preprocess_ml.py with the directories properly specified. See the Data generation section above for details.

Finally, reconstruct the GRN by running:

python predict.py --data './dataset/'

which takes the processed expression matrix and the trained model weights as inputs and, for each gene, outputs a list of edges between each candidate TF and the target gene.

Data sources

SERGIO

We provide several preprocessed datasets simulated by SERGIO with the default settings in the folders SPREd/dataset and SPREd/tests. These datasets can be used for training and testing the SPREd model and for reproducing the main results in our manuscripts.

MERLIN-P

The dataset is accessible from the MERLIN-P inferred networks repository.

DREAM5

The DREAM5 challenge data can be downloaded here.

Contact

Please forward any questions to the author: Zijun Wu ([email protected]) or the senior PI: Dr. Saurabh Sinha ([email protected]).
