Lawrence Chillrud [email protected]
Christian Duffee [email protected]
Charilaos Apostolidis [email protected]
Directory for EE475 final project.
Radiogenomic machine learning (ML) for EGFR mutation status prediction from Computed Tomography (CT) scans of patients with Non-Small Cell Lung Cancer (NSCLC).
LINK TO THIS REPOSITORY (please view online): https://github.com/lawrence-chillrud/ece475
LINK TO PROJECT YOUTUBE VIDEO: https://youtu.be/L6dXzu2up7k
Table of contents:
- Overview
- Requirements
- The data
- Downloading the data
- Converting the data
- Feature extraction
- Data exploration
- Feature selection
- Classification results
- References
We are using a radiogenomic machine learning pipeline to predict EGFR mutation status from Computed Tomography (CT) scans of patients with Non-Small Cell Lung Cancer (NSCLC).
EGFR mutation status, a binary characteristic in NSCLC {Mutant
, Wilde Type
}, is an important biomarker for NSCLC patients that correlates highly with prognosis (the EGFR mutated variant confers better prognosis). As such, it is crucial to know the EGFR mutation status of a tumor in order to determine the best treatment options available to the patient.
In our project, we take the following approach to the EGFR mutation status prediction task:
This pipeline then yields 15 models total (each of the 5 feature selectors crossed with each of the 3 classification models).
We provide the environment.yml
YAML file for easy cloning of the exact conda virtual environment used to run all project scripts. With conda installed, simply run the following line in ece475
directory:
conda env create --file environment.yml
Then activate the newly created rgml
environment with:
conda activate rgml
You should now be able to run all scripts in the code
folder without issue.
The dataset we are using is The Cancer Imaging Archive's (TCIA)1 publicly available NSCLC Radiogenomics dataset, made from a cohort of 211 NSCLC patients. A brief description from the dataset's TCIA page is reproduced below:
"The dataset comprises Computed Tomography (CT) images, ... and segmentation maps of tumors in the CT scans. Imaging data are also paired with results of gene mutation analyses... from samples of surgically excised tumor tissue, and clinical data, including survival outcomes. This dataset was created to facilitate the discovery of the underlying relationship between tumor molecular and medical image features, as well as the development and evaluation of prognostic medical image biomarkers."
We refer interested readers to Bakr et al.2 for further details regarding the dataset.
- Download the .tcia manifest file (88 KB) that will be needed to later retrieve the images and their segmentations from the NCBI Data Retriever (see steps 3 and 4). We include this
.tcia
file here in the data/TCIA/ folder incase the download link changes in the future. - Download the .csv file (67 KB) providing the clinical data obtained for each patient (including each patient's EGFR mutation status). Again, we include this
.csv
here in the data/TCIA/manifest/ folder (with the infixDATA_LABELS
) should the download link change in the future. (Note: this clinical data.csv
file is different from themetadata.csv
file we also include in the same folder. Themetadata.csv
file can be obtained from scratch by following step 4 below.) - Follow these instructions to install the NCBI Data Retriever application.
- Now open the
.tcia
manifest file downloaded in step 1 (this will open the NCBI app installed in step 3). Follow the app's prompts to download all the images and their segmentations (97.6 GB total). They will be downloaded in.dcm
DICOM format. Detailed instructions can be found here.
Note: Because we provide all the .csv
files needed throughout the main ML analysis in the data/generated/ directory, the interested reader need only follow the above steps to download the raw DICOM files if they want to perform the conversion (section 5 below) and feature extraction (section 6 below) themselves. Otherwise, sections 4-6 in this README need not be followed computationally, and just serve as documentation for how the analysis data was prepared. The relevant files for the main analysis are NSCLC_labels.csv (output by section 5) and NSCLC_features.csv (output by section 6).
In order to make the data smaller and easier to work with, we must convert all downloaded CT scans and their accompanying segmentations from their raw DICOM .dcm
format to a compressed NIfTI .nii.gz
format. Interested readers can study this useful DICOM and NIfTI Primer post for the differences between the two file types.
There are many tools for converting .dcm
-> .nii.gz
; the software we choose to employ is dcm2niix
3. Documentation for dcm2niix
can be found here.
We convert all CT scans and segmentations to .nii.gz
files by running the 1_convert_dcms.py script, which utilizes dcm2niix
under the hood. This script also outputs the NSCLC_labels.csv file.
Now that we have downloaded the data and converted it into the appropriate format, we can extract features from the CT scans and their segmentations using the pyradiomics
python package4. Documentation for pyradiomics
is extensive and can be found here.
We extract radiomic features from the .nii.gz
files by running the 2_extract_features.py script. This script outputs the NSCLC_features.csv, which (together with NSCLC_labels.csv) can be used for downstream ML tasks.
We split the dataset into training and testing data based on EGFR mutation status as follows:
-1 | 69 |
---|---|
1 | 17 |
-1 | 23 |
---|---|
1 | 6 |
See the code/3_visualize_features.py file for code used to generate these plots. Visuallzing some of the features we get
Not all of them are normally distributed, so we decide to normalize them rather than standardize (see the prep_data()
function in code/utils.py).
Five feature selection algorithms were used to find the most descriptive features using a grid search over possible hyperparameters. These algorithms are distance correlation, lasso, random forest, xgboost, and gradient boosted decision trees. Only the first four were run due to time constraints with re: exhaustive gridsearches. Furthermore, due to the nature of the algorithms, xgboost, random forest, and lasso allowed an additional search to determine how many of the top features should be included in the model.
Final classification results on all 15 models (validation and test):
Validation | Test | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Classifier | Feature Selector | AUC | F1 | MCC | Precision | Recall | AUC | F1 | MCC | Precision | Recall | |
Logistic Regression | Distance Correlation | 0.680 | 0.455 | 0.312 | 0.384 | 0.592 | 0.601 | 0.364 | 0.218 | 0.400 | 0.333 | |
Lasso | 0.780 | 0.570 | 0.472 | 0.459 | 0.792 | 0.453 | 0.154 | -0.089 | 0.143 | 0.167 | ||
Xgboost | 0.694 | 0.470 | 0.324 | 0.367 | 0.688 | 0.663 | 0.462 | 0.309 | 0.429 | 0.500 | ||
Random Forest | 0.738 | 0.531 | 0.407 | 0.433 | 0.708 | 0.558 | 0.308 | 0.110 | 0.286 | 0.333 | ||
SVM | Distance Correlation | 0.543 | 0.262 | 0.099 | 0.300 | 0.258 | 0.598 | 0.375 | 0.167 | 0.300 | 0.500 | |
Lasso | 0.694 | 0.485 | 0.382 | 0.498 | 0.513 | 0.435 | 0.000 | -0.173 | 0.000 | 0.000 | ||
Xgboost | 0.574 | 0.309 | 0.130 | 0.279 | 0.379 | 0.703 | 0.500 | 0.346 | 0.400 | 0.667 | ||
Random Forest | 0.644 | 0.433 | 0.326 | 0.526 | 0.404 | 0.540 | 0.222 | 0.106 | 0.333 | 0.167 | ||
LDA | Distance Correlation | 0.546 | 0.256 | 0.084 | 0.257 | 0.283 | 0.598 | 0.375 | 0.167 | 0.300 | 0.500 | |
Lasso | 0.630 | 0.374 | 0.239 | 0.364 | 0.438 | 0.457 | 0.000 | -0.139 | 0.000 | 0.000 | ||
Xgboost | 0.487 | 0.191 | -0.033 | 0.156 | 0.258 | 0.601 | 0.364 | 0.218 | 0.400 | 0.333 | ||
Random Forest | 0.514 | 0.140 | 0.046 | 0.250 | 0.100 | 0.500 | 0.000 | 0.000 | 0.000 | 0.000 |
This table shows various metrics describing the accuracites achieved for each classifier and feature selection algorithm pairs. The best validation result was achieved by usign a logistic regression with a lasso feature selector. The best testing result was achieved by using an SVM with an XGBoost feature selector.
Footnotes
-
Clark K, Vendt B, Smith K, Freymann J, Kirby J, Koppel P, Moore S, Phillips S, Maffitt D, Pringle M, Tarbox L, Prior F. The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, Volume 26, Number 6, December, 2013, pp 1045-1057. ↩
-
Bakr S, Gevaert O, Echegaray S, Ayers K, Zhou M, Shafiq M, Zheng H, Benson JA, Zhang W, Leung ANC, Kadoch M, Hoang CD, Shrager J, Quon A, Rubin DL, Plevritis SK, Napel S. A radiogenomic dataset of non-small cell lung cancer. Sci Data. 2018 Oct 16;5:180202. ↩
-
Li X, Morgan PS, Ashburner J, Smith J, Rorden C. The first step for neuroimaging data analysis: DICOM to NIfTI conversion. J Neurosci Methods. 2016;264:47-56. ↩
-
Van Griethuysen JJ, Fedorov A, Parmar C, Hosny A, Aucoin N, Narayan V, Beets-Tan RG, Fillion-Robin JC, Pieper S and Aerts HJ. Computational radiomics system to decode the radiographic phenotype. Cancer research, 2017, 77(21), pp.e104-e107. ↩