Our team: Roi Peleg, Lilach Mor, Omer Rugi
8200Bio_Data_Challenge 3rd event in collaboration with DermaDetect!
Given a small batch of tabular medical data and a decision tree model, the task was to improve the model's accuracy while keeping the features readable, so that the result can still be presented as a "tree" that a doctor can make sense of.
Label encoding – applied to all the discrete data.
OneHot encoding – applied to the labels (after label encoding them).
MinMaxScaler – applied to the numeric data.
Zeros & Ones – Boolean values replaced with 0 and 1.
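The four encoding steps above can be sketched roughly as follows. This is only an illustration, assuming pandas and scikit-learn; the column names, label names, and toy values are hypothetical and are not the challenge data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, OneHotEncoder

# Toy data: column names and values are hypothetical, not the challenge data.
df = pd.DataFrame({
    "skin_type": ["dry", "oily", "dry", "normal"],  # discrete feature
    "age": [20.0, 35.0, 50.0, 80.0],                # numeric feature
    "itching": [True, False, True, False],          # Boolean feature
})
labels = np.array(["eczema", "acne", "eczema", "psoriasis"])

DISCRETE, NUMERIC, BOOLEAN = ["skin_type"], ["age"], ["itching"]

def preprocess(frame):
    out = frame.copy()
    # Label encoding: one integer code per category.
    for col in DISCRETE:
        out[col] = LabelEncoder().fit_transform(out[col])
    # Min-max scaling: numeric columns are mapped to [0, 1].
    out[NUMERIC] = MinMaxScaler().fit_transform(out[NUMERIC])
    # Booleans become 0/1.
    for col in BOOLEAN:
        out[col] = out[col].astype(int)
    return out

def encode_labels(y):
    # Label-encode first, then one-hot encode the integer codes.
    codes = LabelEncoder().fit_transform(y)
    return OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()

X = preprocess(df)
Y = encode_labels(labels)
print(X)
print(Y)
```

In a real pipeline the encoders and the scaler would be fitted on the train set only and then reused on the test set; the sketch fits them in place to stay short.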
Data Completion – The features ‘location_covrage’ and ‘pain.pain_type’ had missing data. We tried both a RandomForest and a KNN imputer to fill it in: the samples that contained the value served as the train set, and the samples with the missing value served as the test set. We selected the RandomForest because it performed better.
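A minimal sketch of the RandomForest-based completion, assuming the feature to fill is already label encoded and all other columns are complete and numeric. The helper name `impute_with_forest`, the predictor columns `f1`/`f2`, and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def impute_with_forest(frame, target, random_state=0):
    """Fill missing values of `target` by predicting them from the other columns."""
    out = frame.copy()
    missing = out[target].isna()
    if not missing.any():
        return out
    predictors = [c for c in out.columns if c != target]
    model = RandomForestClassifier(n_estimators=100, random_state=random_state)
    # Rows that contain the value act as the train set ...
    model.fit(out.loc[~missing, predictors], out.loc[~missing, target].astype(int))
    # ... and rows with the value missing act as the test set.
    out.loc[missing, target] = model.predict(out.loc[missing, predictors])
    return out

# Toy data: 'f1' and 'f2' are hypothetical predictors.
df = pd.DataFrame({
    "f1": [0, 0, 0, 0, 1, 1, 1, 1, 0, 1],
    "f2": [1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
    "pain.pain_type": [0, 0, 0, 0, 1, 1, 1, 1, np.nan, np.nan],
})
filled = impute_with_forest(df, "pain.pain_type")
print(filled["pain.pain_type"].tolist())
```

The KNN alternative could be tried in the same place with scikit-learn's `KNNImputer`, which fills each missing value from the nearest complete rows.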
Feature/Dimension reduction – Using a Decision Tree model, we found the 50 most important features and kept them in a list (rf_selected_features). From then on, every call to the preprocessing dropped the features that are not in rf_selected_features (both in train and in test).
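One way to build such a list is to rank the features by the tree's impurity-based importances and keep the top k. The sketch below assumes scikit-learn's `DecisionTreeClassifier`; the helper `select_top_features`, the synthetic data, and k=5 (instead of 50) are illustrative choices only.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def select_top_features(X, y, k=50, random_state=0):
    """Return the names of the k features with the highest tree importance."""
    tree = DecisionTreeClassifier(random_state=random_state).fit(X, y)
    order = np.argsort(tree.feature_importances_)[::-1][:k]
    return [X.columns[i] for i in order]

# Synthetic stand-in for the challenge data (hypothetical feature names).
data, target = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)
columns = [f"feat_{i}" for i in range(10)]
X_train = pd.DataFrame(data[:150], columns=columns)
X_test = pd.DataFrame(data[150:], columns=columns)
y_train = target[:150]

# Fit the selection on train only, then apply the same list to train and test.
rf_selected_features = select_top_features(X_train, y_train, k=5)
X_train_small = X_train[rf_selected_features]
X_test_small = X_test[rf_selected_features]
print(rf_selected_features)
```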
Data Generator – creates new data based on the statistics and distribution of the given samples.
Train/test split – done evenly across the classes, so the train set will “see” all the possible labels.
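One reading of this step is a stratified split, where each class keeps the same proportion in train and in test. A minimal sketch with scikit-learn's `train_test_split`, on hypothetical toy labels (the split ratio here is an assumption, not the one we used):

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: three imbalanced classes (hypothetical).
y = np.array([0] * 20 + [1] * 12 + [2] * 8)
X = np.arange(len(y) * 2).reshape(len(y), 2)

# stratify=y keeps the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(Counter(y_train.tolist()), Counter(y_test.tolist()))
```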
Generating data – used statistical sampling of the train data to generate more samples, so the model has more data to train on.
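The exact sampling scheme is not spelled out here, so the following is only one possible sketch: for each class, draw every feature from a normal distribution with that class's mean and standard deviation, then clip to the observed range. The helper `generate_samples` and the toy data are hypothetical.

```python
import numpy as np

def generate_samples(X, y, n_per_class, seed=0):
    """Draw synthetic rows per class from per-feature normal distributions."""
    rng = np.random.default_rng(seed)
    new_X, new_y = [], []
    for label in np.unique(y):
        rows = X[y == label]
        mean, std = rows.mean(axis=0), rows.std(axis=0)
        synth = rng.normal(mean, std, size=(n_per_class, X.shape[1]))
        # Keep synthetic values inside the range seen for this class.
        synth = np.clip(synth, rows.min(axis=0), rows.max(axis=0))
        new_X.append(synth)
        new_y.append(np.full(n_per_class, label))
    return np.vstack(new_X), np.concatenate(new_y)

# Toy train set: 3 classes, 10 rows each, 4 scaled features (hypothetical).
rng = np.random.default_rng(42)
X_train = rng.random((30, 4))
y_train = np.repeat([0, 1, 2], 10)

X_new, y_new = generate_samples(X_train, y_train, n_per_class=50)
print(X_new.shape, np.bincount(y_new))
```

Sampling each feature independently ignores correlations between features, and label-encoded or 0/1 features would need rounding back to valid codes; both are simplifications of this sketch.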
OneHot Encoding – applied to the labels when training the model.
** Note: we also tried –
- Encoding all the discrete data as OneHot vectors, but it didn’t work.
- Using GANs to generate data, but it didn’t give good results.