This project classifies different types of brain tumors using MRI data from the TCGA and REMBRANDT datasets. The process involves data merging, cleaning, encoding, feature extraction using PyRadiomics, and training machine learning models to predict the cancer type.
- Project Overview
- Data Preprocessing Steps
- Overview of the Models Used
- Training Results
- Testing on Unseen Data
- Conclusion
This data science project aims to classify different types of brain tumors (Astrocytoma, GBM, Oligodendroglioma) using radiomic features combined with clinical data. The steps include data merging, cleaning, encoding, feature extraction, and training separate models for each disease type. The results include accuracy metrics and detailed classification reports for each model.
- Pandas: For data manipulation and analysis.
- NumPy: For numerical operations on arrays.
- SimpleITK: To handle medical images in conjunction with radiomics.
- PyRadiomics: For the extraction of radiomics features from medical images.
- Joblib: To serialize Python objects.
- Imbalanced-Learn: For handling imbalanced datasets.
- XGBoost: For model training and prediction.
- Clinical data is loaded from an Excel file into a pandas DataFrame.
- Radiomics data is loaded from a CSV file into another DataFrame.
- The radiomics and clinical data are merged on the patient ID column.
- The `Row.names` column is dropped after the merge.
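The loading and merging steps above can be sketched as follows. This is a minimal illustration, not the project's actual script: the ID column name `Patient_ID`, the inner join, and the toy values are assumptions, and small in-memory frames stand in for the Excel and CSV files.

```python
import pandas as pd

# Toy stand-ins for the clinical (Excel) and radiomics (CSV) files.
# In practice these would come from pd.read_excel(...) and pd.read_csv(...).
clinical = pd.DataFrame({
    "Patient_ID": ["P001", "P002", "P003"],  # hypothetical ID column name
    "Gender": ["MALE", "FEMALE", "MALE"],
    "Disease_Type": ["GBM", "Astrocytoma", "Oligodendroglioma"],
})
radiomics = pd.DataFrame({
    "Patient_ID": ["P001", "P002", "P004"],
    "Row.names": [1, 2, 3],  # leftover index column to be dropped
    "original_shape_Sphericity": [0.71, 0.64, 0.80],
})

# Assumption: an inner join, so only patients present in both tables are kept.
# validate= raises an error if a patient ID is duplicated in either table.
merged = radiomics.merge(
    clinical, on="Patient_ID", how="inner", validate="one_to_one"
)
merged = merged.drop(columns=["Row.names"])

print(merged)
```

Passing `validate="one_to_one"` is optional, but it surfaces duplicated patient IDs early instead of silently multiplying rows.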
- The `Gender` column is encoded into integers, with 'MALE' as 0 and 'FEMALE' as 1.
- The `Race` column is mapped to integers for different races, with 'UNKNOWN' as 4 and missing values filled with 0.
- Any remaining NaNs in the DataFrame are filled with 0.
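A small sketch of the encoding and NaN handling described above. Only the 'MALE', 'FEMALE', and 'UNKNOWN' codes come from this README; the other race labels and their integer codes, as well as the `Age` column, are hypothetical placeholders.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["MALE", "FEMALE", "MALE"],
    "Race": ["WHITE", "UNKNOWN", None],
    "Age": [54.0, None, 61.0],  # hypothetical extra column with a gap
})

gender_map = {"MALE": 0, "FEMALE": 1}
# Only 'UNKNOWN' -> 4 is taken from the README; the other codes are assumed.
race_map = {"WHITE": 1, "BLACK": 2, "ASIAN": 3, "UNKNOWN": 4}

df["Gender"] = df["Gender"].map(gender_map)
# Missing (or unmapped) races become NaN after map() and are then set to 0.
df["Race"] = df["Race"].map(race_map).fillna(0).astype(int)

# Fill any remaining NaNs in the frame with 0.
df = df.fillna(0)

print(df)
```

One consequence of this scheme is worth keeping in mind: because remaining NaNs are filled with 0, a missing `Gender` value would be indistinguishable from 'MALE'.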
- Descriptive statistics for select columns in the DataFrame are calculated and displayed.
- The script counts and displays the number of instances for each unique value in the `Disease_Type` column.
- The 'Disease_Type' column is the target, classifying three categories of brain tumors: 'GBM', 'Astrocytoma', and 'Oligodendroglioma'.
- The disease type column is one-hot encoded, producing one binary target column per tumor type (for example, Disease_Type_GBM).
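The class counts and the one-hot encoding of the target could look roughly like this. The sketch assumes `pandas.get_dummies` with a `Disease_Type` prefix, which yields column names matching those used in the results (such as Disease_Type_GBM); the project's actual encoder may differ.

```python
import pandas as pd

df = pd.DataFrame({
    "Disease_Type": ["GBM", "Astrocytoma", "GBM", "Oligodendroglioma"],
})

# Number of instances per class, useful for spotting imbalance.
print(df["Disease_Type"].value_counts())

# One binary column per tumor type; each can serve as a separate target.
targets = pd.get_dummies(df["Disease_Type"], prefix="Disease_Type", dtype=int)
print(targets)
```

Each resulting column can then be used as the label for its own binary classifier.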
- Initial SVM models are used, followed by enhanced versions using SMOTE to address data imbalance.
- Hyperparameters are tuned with GridSearchCV and RandomizedSearchCV.
- AdaBoostClassifier with SVM as the base classifier is employed to improve model performance, especially for handling imbalanced classes.
- Implemented for anomaly detection in 'Astrocytoma', distinguishing outliers crucial for identifying rare or atypical cases.
- A stacking ensemble combines SVM and AdaBoost under a Logistic Regression final estimator, leveraging the strengths of each model to enhance predictions.
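One way to combine SVM and AdaBoost under a Logistic Regression final estimator is scikit-learn's `StackingClassifier`. The sketch below uses synthetic data and, for brevity, AdaBoost's default base learner rather than an SVM base classifier; all hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

base_models = [
    ("svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
    # Assumption: default base learner here, not the SVM-based AdaBoost.
    ("ada", AdaBoostClassifier(n_estimators=25, random_state=0)),
]

# The final estimator is trained on cross-validated predictions of the base models.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=3,
)
stack.fit(X_train, y_train)

preds = stack.predict(X_test)
acc = accuracy_score(y_test, preds)
print(round(acc, 3))
```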
- Utilizes XGBoost for robust classification across each disease type, focusing on managing various data dimensions and complexities.
- Accuracy, confusion matrices, and classification reports are the primary metrics used to evaluate model performance.
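These three metrics are available in scikit-learn. The example below uses made-up labels for a single binary target purely to show the calls; the numbers are not project results.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up ground truth and predictions for one binary target (1 = True class).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print(f"accuracy = {acc:.2f}")
print(cm)
print(classification_report(y_true, y_pred, zero_division=0))
```

With imbalanced targets, the per-class precision and recall in the classification report are often more informative than accuracy alone, since a model can score well overall while missing most of the minority class.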
- The multilabel approach achieved better results than the multiclass framing.
- Disease_Type_GBM: Consistently strong performance with accuracy rates up to 80%.
- Disease_Type_Oligodendroglioma: High performance with accuracy up to 90% in some models.
- Initial setup and enhanced versions using SMOTE and hyperparameter optimization.
- Post Grid Search Optimization and Random Search Optimization showed improved accuracy.
- Improved performance by aggregating the predictive power of multiple weak classifiers.
- Achieved high accuracy and robust performance across various datasets.
- Effective for the majority class but struggled with the minority class (True) for Astrocytoma.
- Strong performance, particularly for GBM, with the highest accuracy among models.
- Highly accurate for Disease_Type_Astrocytoma but faced challenges in predicting the True class across all disease types.
- Key features identified include diagnostics_Mask-original_VolumeNum, original_glcm_Idmn, original_shape_Sphericity, and others.
- Features from the new dataset were loaded, and unnecessary columns were removed.
- Clinical ground truth data was loaded, and the `Disease_Type` column was one-hot encoded.
- Both SVM and XGBoost models were applied to the unseen data.
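Applying a previously trained model to new data typically involves restoring it (Joblib is listed among the dependencies) and making sure the unseen features have the same columns, in the same order, as at training time. The sketch below assumes this workflow; the file name, the `Image` column, and the toy data are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.svm import SVC

rng = np.random.default_rng(0)
feature_cols = ["original_glcm_Idmn", "original_shape_Sphericity"]

# Toy training data and label standing in for one disease-type target.
train = pd.DataFrame(rng.normal(size=(40, 2)), columns=feature_cols)
y_train = (train["original_shape_Sphericity"] > 0).astype(int)
model = SVC().fit(train[feature_cols], y_train)

# Persist the model and load it back, as one would before testing on new data.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "svm_Disease_Type_GBM.joblib")  # hypothetical name
    joblib.dump(model, path)
    loaded = joblib.load(path)

# Unseen features with an extra, unnecessary column (hypothetical "Image").
unseen = pd.DataFrame(
    rng.normal(size=(10, 3)), columns=["Image", *feature_cols]
)
# Drop the extra column and enforce the training column order.
unseen_X = unseen.drop(columns=["Image"])[feature_cols]

preds = loaded.predict(unseen_X)
print(preds)
```

The predictions can then be compared against the one-hot encoded ground truth column for the same disease type to obtain the per-type accuracy.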
- The SVM model showed varying performance, with the highest accuracy for Oligodendroglioma (0.734375).
- XGBoost also performed best for Oligodendroglioma (0.796875) but had lower accuracy for GBM and Astrocytoma.
SVM and XGBoost models demonstrated varying success across disease types. Both models struggled with GBM but performed moderately well for Astrocytoma and showed strong predictive power for Oligodendroglioma. Further model refinement and additional data could enhance performance, especially for underrepresented classes.