This cancer dependency study explores several machine learning methods and includes a small Shiny app.
Predicting cancer dependencies from molecular data can help stratify patients and identify novel therapeutic targets. However, the predictive power of protein expression data has not been rigorously measured. The paper therefore set out to evaluate the predictive power of protein expression data generated by reverse-phase protein arrays (RPPA) in detecting cancer dependencies, and to develop a related analytic tool for community use.
Understanding the genotype-phenotype relationships of cancer cells is critical for precision cancer medicine because it could help classify patients into different treatment groups and identify novel therapeutic targets.
The paper first evaluated the consistency of cancer dependency data between the CRISPR/Cas9 and short hairpin RNA (shRNA) perturbation platforms. It then performed same-gene predictions of cancer dependency using four available expression-related features (copy number alteration, DNA methylation, messenger RNA expression, and protein expression). Three machine learning algorithms (linear regression, random forest, and conditional random forest) were used to fit the models and analyze feature importance.
For the genes selected from the CRISPR/Cas9 and shRNA data, the paper found that protein expression data showed significant predictive power for cancer dependencies and were the best predictive feature for the CRISPR/Cas9-based dependency data. The paper thus provides a systematic assessment of how well different expression-related features of a gene predict the cancer dependencies of cell lines. It also shows that protein expression data are a highly valuable information source for understanding tumor vulnerabilities and identifying therapeutic opportunities.
This paper makes use of the following sets of data:
- Reverse-phase protein array (RPPA)-based protein data from CCLE, which assayed 214 protein markers across 899 cell lines.
- Cancer dependency data: CRISPR/Cas9 (DepMap19Q1) and shRNA (DEMETER2).
- Copy number alteration (CNA), DNA methylation, and mRNA expression data from CCLE.
The response variable is a vector of dependency scores (change in cell growth) for each gene across cell lines.
- A score of 0 indicates that a gene is not essential.
- A score of -1 corresponds to the median value of all common essential genes.
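To make the scale concrete, here is a minimal sketch of one way raw scores for a single cell line could be rescaled so that 0 and -1 carry the meanings above. The source does not describe the exact procedure, so the function name `scale_dependency_scores`, the use of a nonessential reference gene set to anchor 0, and the toy numbers are all assumptions for illustration.

```python
import numpy as np


def scale_dependency_scores(raw, nonessential_idx, essential_idx):
    """Rescale one cell line's raw scores (hypothetical helper).

    After scaling, the median of the assumed nonessential reference genes
    is 0 and the median of the common essential genes is -1.
    """
    raw = np.asarray(raw, dtype=float)
    med_non = np.median(raw[nonessential_idx])
    med_ess = np.median(raw[essential_idx])
    # Shift so the nonessential median is 0, then divide by the gap
    # between the two medians so the essential median lands on -1.
    return (raw - med_non) / (med_non - med_ess)


# Toy raw scores: genes 0-2 are assumed nonessential, genes 3-4 are
# assumed common essential, gene 5 is the gene of interest.
raw_scores = [0.1, 0.3, 0.2, -1.9, -2.1, -1.0]
scaled = scale_dependency_scores(raw_scores, [0, 1, 2], [3, 4])
print(np.round(scaled, 3))
```

On this scale, a gene scoring about halfway between 0 and -1 reduces growth roughly half as much as a typical common essential gene.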
The explanatory variables (predictors) were each gene's own expression-related features (self-features).
Feature engineering was performed to improve the quality of the model outcome:
- Construct a robust cancer dependency set by selecting genes that showed high consistency between shRNA and CRISPR/Cas9.
- Intersect that set with the genes and cell lines available in CCLE.
- Consider only RPPA, CNA, DNA methylation, and mRNA expression data from the same set of cell lines.
- Train-test split: training set (70% of cancer cell lines), testing set (30% of cancer cell lines).
- Regression methods: linear regression, random forest, conditional random forest.
- Baseline model: the averaged dependency score is used as the predicted value, which allows failed predictions to be excluded.
- Cross-validation training: 10-fold cross-validation, repeated 10 times, to avoid model overfitting.
- Evaluation metrics: root-mean-square error (RMSE) and R².
- Feature importance analysis (varImp function from the caret package in R):
  - Linear regression: the absolute value of the t-statistic for each model parameter is used.
  - Random forest: the MSE is computed on the out-of-bag data for each tree, and then computed again after permuting a variable; the differences are averaged and normalized by the standard error.
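
The data curation and modeling steps listed above can be sketched in Python with scikit-learn. This is an illustration under stated assumptions, not the paper's implementation (which used R and caret): the helper names, the column names, the correlation cutoff `min_corr`, and the synthetic data are all hypothetical. Conditional random forest is omitted because scikit-learn does not provide one, and feature importance is approximated with permutation importance on the held-out set rather than caret's out-of-bag measure or the t-statistic.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import RepeatedKFold, cross_val_score, train_test_split


def select_consistent_genes(crispr, shrna, min_corr=0.5):
    """Keep genes whose CRISPR and shRNA scores agree across shared cell lines.

    Both inputs are cell-line x gene DataFrames. Pearson correlation and the
    min_corr cutoff are assumed stand-ins for "high consistency".
    """
    lines = crispr.index.intersection(shrna.index)
    genes = crispr.columns.intersection(shrna.columns)
    corr = crispr.loc[lines, genes].corrwith(shrna.loc[lines, genes])
    return corr[corr >= min_corr].index.tolist()


def evaluate_gene(X, y, n_repeats=10, seed=0):
    """Fit baseline, linear regression, and random forest for one gene."""
    # 70% of cell lines for training, 30% for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    models = {
        "baseline": DummyRegressor(strategy="mean"),  # predicts the mean score
        "linear": LinearRegression(),
        "random_forest": RandomForestRegressor(n_estimators=100, random_state=seed),
    }
    # 10-fold cross-validation, repeated n_repeats times, on the training set.
    cv = RepeatedKFold(n_splits=10, n_repeats=n_repeats, random_state=seed)
    results = {}
    for name, model in models.items():
        cv_mse = -cross_val_score(
            model, X_train, y_train, cv=cv, scoring="neg_mean_squared_error"
        )
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results[name] = {
            "cv_rmse": float(np.sqrt(cv_mse).mean()),
            "test_rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
            "test_r2": float(r2_score(y_test, pred)),
        }
    # Permutation importance on the held-out cell lines (not out-of-bag data).
    imp = permutation_importance(
        models["random_forest"], X_test, y_test, n_repeats=20, random_state=seed
    )
    importance = pd.Series(imp.importances_mean, index=X.columns)
    return results, importance


# Synthetic data for one gene: 120 cell lines, four self-features.
rng = np.random.default_rng(0)
features = ["cna", "methylation", "mrna", "rppa"]  # hypothetical column names
X_demo = pd.DataFrame(rng.normal(size=(120, 4)), columns=features)
y_demo = -0.8 * X_demo["rppa"] + 0.2 * X_demo["mrna"] + rng.normal(scale=0.3, size=120)

scores, importance = evaluate_gene(X_demo, y_demo, n_repeats=2)
print(pd.DataFrame(scores).T.round(3))
print(importance.sort_values(ascending=False).round(3))
```

Comparing each model's metrics against the `baseline` row gives a simple check for whether a gene's prediction adds anything over the mean dependency score.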
R and Python
- Python: scikit-learn (Lasso regression), statsmodels (Lasso regression, linear regression), CatBoost
- R: Shiny app