Kaggle Competition available here.
The one-page/ folder contains one-page versions of the whole project, in HTML format (with HiPlot interaction enabled) and in PDF format (with HiPlot interaction NOT enabled).
The phenomenon of superconductivity (Wikipedia) was discovered by Heike Kamerlingh Onnes in 1911.
Superconductivity is a property of certain substances and materials whose electrical resistance drops to zero when the temperature falls below a certain value, called the critical temperature.
Many properties of superconductivity are poorly understood; in particular, it is not well understood whether the critical temperature can be predicted from the chemical and physical properties of the material.
- Develop ML algorithms that can correctly predict the critical temperature, given the chemical structure and physical properties of a substance
- Find which features are the most relevant in the estimation
The dataset comes from a database of superconducting materials compiled by Japan's National Institute of Materials Science (NIMS).
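As a quick illustration of the exploration step, the sketch below computes summary statistics and the correlation of each feature with the target. It uses a small synthetic stand-in DataFrame; the real column names (other than `critical_temp`-style target and `range_ThermalConductivity`) and file layout are assumptions, not taken from the actual dataset files.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (column names are illustrative
# assumptions; in the project, the data would be loaded from CSV files).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "mean_atomic_mass": rng.uniform(20, 120, 200),
    "range_ThermalConductivity": rng.uniform(0, 400, 200),
    "critical_temp": rng.uniform(0, 150, 200),
})

print(df.describe())                # summary statistics per column
print(df.corr()["critical_temp"])  # correlation of each feature with the target
```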
See 0_Data_Exploration notebook.
Different models are trained:
- Linear Regression
- Random Forest
- XGBoost
- KNN
- SVM
Each model is trained using several preprocessing configurations and combinations:
- Removing highly correlated features
- StandardScaler, MinMaxScaler
- Normalizer L1, L2, Max
- PCA
- Training only on the Properties or the Formula dataset
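A minimal sketch of how such preprocessing/model combinations can be assembled with scikit-learn pipelines (synthetic data stands in for the real dataset; the pipeline names and hyperparameters are illustrative assumptions, not the project's exact configuration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic regression data standing in for the superconductivity features
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One pipeline per preprocessing/model combination; the other combinations
# (Normalizer variants, correlated-feature removal, ...) are built the same way
pipelines = {
    "linreg_std_pca": Pipeline([
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=5)),
        ("model", LinearRegression()),
    ]),
    "rf_minmax": Pipeline([
        ("scale", MinMaxScaler()),
        ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
    ]),
}

for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    print(name, mean_squared_error(y_test, pred), r2_score(y_test, pred))
```

Wrapping scaler, PCA, and model in a single `Pipeline` ensures the preprocessing is fit only on the training split, avoiding leakage into the test metrics.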
See 1_Training notebook.
To investigate the relationship between the critical temperature and the other features, the following indicators were considered:
- the coefficients of the Linear Regression model
- the feature importance of the Random Forest and XGBoost models, based on mean decrease in impurity (MDI)
- the feature importance of the Random Forest and XGBoost models, based on feature permutation
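The two tree-based indicators above can be sketched as follows with scikit-learn, using a Random Forest on synthetic data (the real notebooks may use different estimators and settings; this only illustrates the two importance measures):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data where only the first two features drive the target,
# so they should rank highest under both importance measures
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Mean decrease in impurity: computed during training, fast, but can be
# biased toward high-cardinality features
print("MDI:", rf.feature_importances_)

# Permutation importance: model-agnostic, measured on held-out data by
# shuffling one feature at a time and recording the score drop
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print("Permutation:", perm.importances_mean)
```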
See 2_Features_Importance notebook.
| Best Model | XGBoost |
| --- | --- |
| Preprocessing | None |
| MSE | 78.09 |
| R² | 0.931 |
Looking mainly at the feature permutation importance of the XGBoost model, the most "important" features are: Cu, Ca, Ba, O, range_ThermalConductivity, Valence.
See 3_Results notebook.