Cancer-Dependency-Study

This cancer dependency study includes experiments with several machine learning methods and a small Shiny app.

Summary

Overview

Context

Predicting cancer dependencies from molecular data can help stratify patients and identify novel therapeutic targets. However, the predictive power of protein expression data has not been rigorously assessed. The paper therefore set out to evaluate how well protein expression data generated by reverse-phase protein arrays predict cancer dependencies, and to develop a related analytic tool for community use.

Need

Understanding the genotype-phenotype relationships of cancer cells is critical for precision cancer medicine because it can help classify patients into different treatment groups and identify novel therapeutic targets.

Vision

The paper first evaluated the consistency of cancer dependency data between the CRISPR/Cas9 and short hairpin RNA (shRNA) perturbation platforms. It then performed same-gene predictions of cancer dependency using four available expression-related features (copy number alteration, DNA methylation, messenger RNA expression, and protein expression). Three machine learning algorithms (conditional random forest, linear regression, and random forest) were also used to analyze feature importance.

Outcome

For the genes selected from CRISPR/Cas9 and shRNA, the paper found that protein expression data showed significant predictive power for cancer dependencies and were the best predictive feature for the CRISPR/Cas9-based dependency data. The paper thus provides a systematic assessment of how well different expression-related features of a gene predict the cancer dependencies of cell lines. It also shows that protein expression data are a highly valuable information source for understanding tumor vulnerabilities and identifying therapeutic opportunities.

Methodology

Data Sources

This paper makes use of the following sets of data:

Reverse-phase protein array (RPPA)-based protein data from CCLE, which assayed 214 protein markers across 899 cell lines.
Cancer dependency data: CRISPR/Cas9 (DepMap19Q1) and shRNA (DEMETER2).
Copy number alteration (CNA), DNA methylation, and mRNA expression data from CCLE.

Data Model

The response variable is a vector of dependency scores (cell growth change) for each gene across cell lines.

A score of 0 indicates that a gene is not essential.
A score of -1 corresponds to the median value of all common essential genes.
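The scale above can be illustrated with a minimal sketch. It assumes the raw scores are rescaled linearly, with 0 anchored at the median of a reference set of nonessential genes; the text only states the -1 anchor, so the 0 anchor, the function name `rescale_scores`, and the gene names are all hypothetical.

```python
from statistics import median


def rescale_scores(raw, nonessential, common_essential):
    """Linearly rescale raw scores (dict: gene -> score) for one cell line.

    Assumption: 0 is anchored at the median of the nonessential reference
    genes, and -1 at the median of the common essential genes.
    """
    zero_point = median(raw[g] for g in nonessential)
    minus_one_point = median(raw[g] for g in common_essential)
    span = zero_point - minus_one_point
    if span == 0:
        raise ValueError("reference gene sets have identical medians")
    return {gene: (score - zero_point) / span for gene, score in raw.items()}


# Toy values, not real dependency data.
raw = {"GENE_A": 0.5, "GENE_B": 1.5, "GENE_C": -3.5, "GENE_D": -2.5, "GENE_E": -1.0}
scaled = rescale_scores(
    raw,
    nonessential=["GENE_A", "GENE_B"],
    common_essential=["GENE_C", "GENE_D"],
)
print(scaled["GENE_E"])  # -0.5: halfway between "not essential" and "common essential"
```

On this scale, a gene scoring near -1 in a cell line behaves like a typical common essential gene, while a gene scoring near 0 has little effect on growth.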

The explanatory variables (predictors) were the self-features of each gene, that is, the gene's own expression-related features.

Feature engineering was performed to improve the quality of the model output:

Construct a robust cancer dependency set by selecting genes that showed high consistency between shRNA and CRISPR/Cas9.
Intersect that set with the genes and cell lines available in CCLE.
Only consider RPPA, CNA, DNA methylation, and mRNA expression from the same set of cell lines.
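The selection steps above can be sketched roughly as follows. The text does not say how consistency was measured, so this sketch assumes a per-gene Pearson correlation across shared cell lines with an arbitrary cutoff; `consistent_genes`, `restrict_to_profiled`, `min_r`, `min_lines`, and the toy data are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def consistent_genes(crispr, shrna, min_r=0.5, min_lines=3):
    """Keep genes whose CRISPR and shRNA scores agree across shared cell lines.

    crispr, shrna: dict gene -> dict cell_line -> dependency score.
    min_r and min_lines are illustrative thresholds, not values from the paper.
    Returns dict gene -> sorted list of shared cell lines.
    """
    selected = {}
    for gene in crispr.keys() & shrna.keys():
        shared = sorted(crispr[gene].keys() & shrna[gene].keys())
        if len(shared) < min_lines:
            continue
        r = pearson([crispr[gene][c] for c in shared],
                    [shrna[gene][c] for c in shared])
        if r >= min_r:
            selected[gene] = shared
    return selected


def restrict_to_profiled(selected, profiled_genes, profiled_lines):
    """Intersect the selected set with the genes and cell lines that have feature data."""
    return {
        gene: [c for c in lines if c in profiled_lines]
        for gene, lines in selected.items()
        if gene in profiled_genes
    }


# Toy scores, not real DepMap or DEMETER2 values.
crispr = {
    "GENE_A": {"L1": -1.0, "L2": -0.5, "L3": 0.0, "L4": -0.2},
    "GENE_B": {"L1": -0.1, "L2": -0.9, "L3": -0.4},
}
shrna = {
    "GENE_A": {"L1": -0.8, "L2": -0.4, "L3": 0.1},
    "GENE_B": {"L1": -0.7, "L2": -0.1, "L3": -0.5},
}
selected = consistent_genes(crispr, shrna)
final = restrict_to_profiled(selected, {"GENE_A", "GENE_C"}, {"L1", "L3", "L9"})
print(final)
```

In the toy data, GENE_B is dropped because its two platforms disagree, and GENE_A keeps only the cell lines that also have feature data.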

Machine Learning Model

Train-test-split: training set (70% cancer cell lines), testing set (30% cancer cell lines).
Regression methods: linear regression, random forest, conditional random forest.
Baseline model: the averaged dependency score is used as the predicted value, which allows failed predictions to be identified and excluded.
Cross-validation training: 10-fold cross-validation, repeated 10 times, to guard against overfitting.
Evaluation metrics: root-mean-square error (RMSE) and R².
Feature importance analysis (varImp function in the R caret package):
- Linear regression: the absolute value of the t-statistic for each model parameter is used.
- Random forest: the MSE is computed on the out-of-bag data for each tree, and then computed again after permuting a variable; the differences are averaged over all trees and normalized by their standard error.
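The evaluation protocol listed above can be sketched in plain Python. This is only an illustration under stated assumptions, not the paper's R/caret pipeline: it uses a single-feature linear regression, synthetic data, and treats a prediction as failed when it does not beat the mean baseline on RMSE, which is one possible reading of the baseline step. All function names are hypothetical, and R² is computed as 1 - SS_res/SS_tot (other definitions, such as the squared correlation, exist).

```python
import random
from math import sqrt


def split_cell_lines(cell_lines, train_fraction=0.7, seed=0):
    """Shuffle cell lines and split them into training and testing sets."""
    shuffled = list(cell_lines)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_simple_linear(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return slope, my - slope * mx


def rmse(y_true, y_pred):
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot


def repeated_kfold(items, k=10, repeats=10, seed=0):
    """Yield (training, validation) lists for k-fold CV repeated several times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = list(items)
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        for i in range(k):
            training = [x for j, fold in enumerate(folds) if j != i for x in fold]
            yield training, folds[i]


# Synthetic data: dependency decreases linearly with protein level, plus noise.
lines = [f"LINE_{i:02d}" for i in range(20)]
noise = random.Random(1)
protein = {c: i / 10 for i, c in enumerate(lines)}
dependency = {c: -0.5 * protein[c] - 0.1 + noise.uniform(-0.02, 0.02) for c in lines}

train, test = split_cell_lines(lines)
slope, intercept = fit_simple_linear([protein[c] for c in train],
                                     [dependency[c] for c in train])
y_test = [dependency[c] for c in test]
model_pred = [slope * protein[c] + intercept for c in test]

# Baseline: predict the mean training dependency score for every test cell line.
baseline_value = sum(dependency[c] for c in train) / len(train)
baseline_pred = [baseline_value] * len(test)

model_rmse, baseline_rmse = rmse(y_test, model_pred), rmse(y_test, baseline_pred)
keep_model = model_rmse < baseline_rmse  # assumed criterion for a non-failed prediction
cv_splits = list(repeated_kfold(train))  # 10 folds x 10 repeats on the training set
print(model_rmse, baseline_rmse, r_squared(y_test, model_pred), keep_model)
```

In a real workflow the cross-validation splits would be used to tune and compare models on the training set only, with the held-out 30% reserved for the final RMSE and R² figures.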

Tools

Paper

R and Python

Myself

Python: scikit-learn (Lasso regression), statsmodels (Lasso regression, linear regression), CatBoost
R: Shiny app
