Table of Contents
- Brief Description
- Reference to the Publication
- Methodology
- Available Commands
- Installation Instructions
- Execution Of CODC Using BRCA Data
- Explanation of the Relevant Parameters
- Input File Format Specification
- Output File Format Specification
- Explanation and Interpretation of the Output
- Recommended Hyperparameters by the Authors
The CODC CLI tool analyzes gene expression data and computes differential co-expression using a copula-based approach. It is implemented in Python, based on the R implementation of Ray et al., and supports parallel processing for better performance. Besides computing differential co-expression networks, it provides additional commands for downstream analysis and performance measurement. It can be installed via Docker or locally using PDM, a Python package manager. The tool expects input files in TSV format and also writes the co-expression network as a TSV file.
This tool implements the method proposed by Ray, S., Lall, S., & Bandyopadhyay, S. in "CODC: a Copula-based model to identify differential co-expression".
The methodology for computing the copula-based differential co-expression, together with its mathematical explanation, is detailed here.
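As a rough illustration of the idea, and not the tool's actual code, the sketch below scores one gene pair in pure Python. It assumes three steps suggested by the parameters documented in this README: expression values are turned into pseudo-observations (rank divided by n + 1), an unsmoothed empirical copula is evaluated at each observation, and the two-sample Kolmogorov-Smirnov distance between the copula values of the two conditions serves as the score. All function names are hypothetical, and ties receive the maximum rank here for brevity.

```python
from bisect import bisect_right


def pseudo_observations(values):
    """Scale ranks into (0, 1) as rank / (n + 1); ties get the maximum rank."""
    n = len(values)
    return [sum(1 for v in values if v <= x) / (n + 1) for x in values]


def empirical_copula(u, v):
    """Evaluate the unsmoothed empirical copula at every observed point."""
    n = len(u)
    return [
        sum(1 for a, b in zip(u, v) if a <= ui and b <= vi) / n
        for ui, vi in zip(u, v)
    ]


def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    return max(
        abs(bisect_right(xs, t) / len(xs) - bisect_right(ys, t) / len(ys))
        for t in xs + ys
    )


def differential_coexpression(a_cond1, b_cond1, a_cond2, b_cond2):
    """Score how differently genes a and b co-vary in the two conditions."""
    c1 = empirical_copula(pseudo_observations(a_cond1), pseudo_observations(b_cond1))
    c2 = empirical_copula(pseudo_observations(a_cond2), pseudo_observations(b_cond2))
    return ks_statistic(c1, c2)


# Toy data (made up): the pair is positively related in condition 1
# and negatively related in condition 2.
score = differential_coexpression([1, 2, 3, 4], [1, 2, 3, 4],
                                  [1, 2, 3, 4], [4, 3, 2, 1])
print(score)  # 0.75
```

A score of 0 means the copula values are distributed identically in both conditions; larger scores indicate a stronger change in the dependence between the two genes.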
The CLI includes commands for:
- Copula-based differential co-expression calculation (`codc`)
- GO enrichment analysis (`go-enrichment`)
- Performance measurement of the Python script (`python-performance`)
- Performance measurement of the R script (`r-performance`)
This README covers the copula-based differential co-expression calculation (`codc`).
Before installing and running the CLI tool, clone the repository and navigate to the tool's root directory:
git clone git@github.com:bionetslab/grn-benchmark.git && cd grn-benchmark/src/codc-cli-tool
To install with Docker, build the image:
docker build -t codc-tool .
To install locally, first install PDM (a Python package manager) if it is not already installed:
pip install pdm
Then, install the packages using PDM:
pdm install
The commands below write `network.tsv` to the `./data/` directory. The first runs the tool with Docker; the second runs it locally through PDM.
docker run --rm -v ./data:/data codc-tool codc --input_file_1 /data/BRCA_normal.tsv --input_file_2 /data/BRCA_tumor.tsv --output_path /data --batch_size 100
pdm run cli codc --input_file_1 ./data/BRCA_normal.tsv --input_file_2 ./data/BRCA_tumor.tsv --output_path ./data --batch_size 100
`--input_file_1`
- Description: Path to the TSV file containing gene expression data for the first condition.
- Required: Yes
- Example: `--input_file_1 /path/to/condition1.tsv`
`--input_file_2`
- Description: Path to the TSV file containing gene expression data for the second condition.
- Required: Yes
- Example: `--input_file_2 /path/to/condition2.tsv`
`--output_path`
- Description: Directory where the output TSV file will be saved. The file contains the differential co-expression network computed with the copula-based approach.
- Required: Yes
- Example: `--output_path /path/to/output`
- Output Details: The output is a TSV file named `network.tsv`, with columns for the target gene, the regulator gene, the condition, and the weight (the co-expression difference).
`--ties_method`
- Description: Method for handling ties when ranking the data in the pseudo-observation calculation.
- Required: No (default is "average")
- Options:
  - `average`: Average the ranks of tied values.
  - `max`: Use the maximum rank for tied values.
- Example: `--ties_method max`
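To make the difference between the two options concrete, here is a minimal sketch of tie-aware ranking. It is an illustration under assumptions, not the tool's code: the helper name `rank_with_ties` is hypothetical, and the sketch only assumes that pseudo-observations are built from ranks like these.

```python
def rank_with_ties(values, ties_method="average"):
    """Return 1-based ranks; tied values share the average or the maximum rank."""
    if ties_method not in ("average", "max"):
        raise ValueError("ties_method must be 'average' or 'max'")
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        # Extend the run while the next sorted value equals the current one.
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Sorted positions start..end (0-based) hold ranks start+1..end+1.
        if ties_method == "average":
            shared = (start + end) / 2 + 1
        else:
            shared = end + 1
        for k in range(start, end + 1):
            ranks[order[k]] = float(shared)
        start = end + 1
    return ranks


# Made-up expression values containing one tie (the two zeros).
values = [0.0, 0.0, 2.5, 1.0]
print(rank_with_ties(values, "average"))  # [1.5, 1.5, 4.0, 3.0]
print(rank_with_ties(values, "max"))      # [2.0, 2.0, 4.0, 3.0]
```

Ties are common in expression data with many zero counts, so the choice can shift the pseudo-observations for lowly expressed genes.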
`--smoothing`
- Description: Smoothing technique applied when computing the empirical copula.
- Required: No (default is "none")
- Options:
  - `none`: No smoothing is applied.
  - `beta`: Use beta smoothing.
  - `checkerboard`: Use checkerboard smoothing.
- Example: `--smoothing beta`
`--ks_stat_method`
- Description: Method used to compute the Kolmogorov-Smirnov statistic, which quantifies the differential co-expression.
- Required: No (default is "asymp")
- Options:
  - `asymp`: Use the asymptotic properties of the KS statistic.
  - `auto`: Automatically choose the method based on the characteristics of the data.
  - `exact`: Compute an exact KS statistic.
- Example: `--ks_stat_method exact`
`--batch_size`
- Description: Number of gene pairs processed in each batch during parallel execution.
- Required: No (default is 100)
- Example: `--batch_size 100`
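The following sketch shows one way gene pairs could be grouped into batches of a given size. It assumes the tool scores every unordered pair of genes, which this README does not state explicitly; the function name `gene_pair_batches` and the gene labels are hypothetical.

```python
from itertools import combinations, islice


def gene_pair_batches(genes, batch_size=100):
    """Yield lists of up to batch_size unordered gene pairs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    pairs = combinations(genes, 2)  # lazy, so all pairs are never held at once
    while True:
        batch = list(islice(pairs, batch_size))
        if not batch:
            return
        yield batch


# Five placeholder genes give 10 pairs, split here into batches of 4, 4 and 2.
genes = ["g0", "g1", "g2", "g3", "g4"]
for batch in gene_pair_batches(genes, batch_size=4):
    print(len(batch), batch[0])
```

Under this assumption the number of pairs grows quadratically with the number of genes, so larger batches mean fewer, heavier units of work per worker, while smaller batches spread the work more evenly at the cost of more scheduling overhead.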
Input files must be in a tab-separated format with gene names in rows and sample IDs in columns. Example:
Gene | TCGA-A7-A0CE | TCGA-A7-A0CH |
---|---|---|
ACTA1 | 6.872032023 | 4.947203749 |
MYL2 | 0.415445555 | 0.0 |
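For checking a file against this layout before running the tool, a small reader like the one below may help. It is a sketch using only the standard library; the function name `read_expression_tsv` is hypothetical, and it assumes the first column holds gene names and every other cell is numeric.

```python
import csv
import io


def read_expression_tsv(handle):
    """Parse a genes-by-samples TSV into (sample_ids, {gene: [values]})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # first header cell labels the gene column
    expression = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene, cells = row[0], row[1:]
        if len(cells) != len(samples):
            raise ValueError(f"{gene}: expected {len(samples)} values, got {len(cells)}")
        expression[gene] = [float(cell) for cell in cells]
    return samples, expression


# The example table above, written with real tab separators.
example = (
    "Gene\tTCGA-A7-A0CE\tTCGA-A7-A0CH\n"
    "ACTA1\t6.872032023\t4.947203749\n"
    "MYL2\t0.415445555\t0.0\n"
)
samples, expression = read_expression_tsv(io.StringIO(example))
print(samples)
print(expression["MYL2"])
```

For a file on disk, the same function can be called with `open(path, newline="")` in place of the `io.StringIO` object.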
The output `network.tsv` is a tab-separated file that includes:
- Target: Target gene of the edge.
- Regulator: Source gene of the edge.
- Condition: Describes the differential co-expression across conditions.
- Weight: Numerical value giving the co-expression difference between the two conditions.
Example output:
Target | Regulator | Condition | Weight |
---|---|---|---|
MYL2 | ACTA1 | Diff Co-Exp between both Condition | 0.1111 |
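A short sketch for ranking the edges of such a file by weight is given below. It assumes the header row uses exactly the column names shown above (`Target`, `Regulator`, `Condition`, `Weight`); the function name `top_edges` and the second data row are made up for illustration.

```python
import csv
import io


def top_edges(handle, n=10):
    """Return the n edges with the largest weight as (target, regulator, weight)."""
    reader = csv.DictReader(handle, delimiter="\t")
    edges = [
        (row["Target"], row["Regulator"], float(row["Weight"]))
        for row in reader
    ]
    edges.sort(key=lambda edge: edge[2], reverse=True)
    return edges[:n]


# First data row is the example above; the second row is hypothetical.
example = (
    "Target\tRegulator\tCondition\tWeight\n"
    "MYL2\tACTA1\tDiff Co-Exp between both Condition\t0.1111\n"
    "GENE_X\tGENE_Y\tDiff Co-Exp between both Condition\t0.5000\n"
)
for target, regulator, weight in top_edges(io.StringIO(example), n=2):
    print(target, regulator, weight)
```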
The `network.tsv` output file lists gene pairs that are differentially co-expressed between the two conditions, providing insight into how gene interactions change from one condition to the other.
The authors did not recommend any specific hyperparameters. The defaults are based on typical settings derived from the authors' R implementation:
ks_stat_method = asymp
ties_method = average
smoothing = none
Md Badiuzzaman Pranto, Friedrich-Alexander-Universität, Erlangen-Nürnberg.