Project 1 of Pattern Classification and Machine Learning - 2017/2018

Discovery of the Higgs boson using CERN's public dataset. Predictions are submitted to the Kaggle competition platform.

Brief Overview

The main scripts for this project are implementations.py (six ML methods) and run.py.
In addition, we provide a Python notebook, implementations.ipynb, that documents step by step how to use the functions in implementations.py. It also documents how we process the features in the dataset and apply our classification methods.

Note that all functions and helpers for implementations.py and run.py are stored in the scripts/ directory. For more details, see the README in scripts/.

More on Technical Overview

Data preparation and Features Removal

We divide the input data into 8 subsets based on the value of PRI_JET_NUM (feature 22) and on outliers in DER_MASS_MMC (feature 1). We found that several features are tightly coupled with the value of PRI_JET_NUM. Since PRI_JET_NUM takes integer values from 0 to 3 inclusive, we first divide the input data into 4 subgroups based on its value. After splitting the data, we remove features from each subgroup based on their strong correlation with the value of PRI_JET_NUM. The details are as follows:

For PRI_JET_NUM = 0, remove features: [4, 5, 6, 11, 12, 15, 18, 20, 22, 23, 24, 25, 26, 27, 28, 29].
For PRI_JET_NUM = 1, remove features: [4, 5, 6, 11, 12, 15, 18, 20, 22, 26, 27, 28].
For PRI_JET_NUM = 2, remove features: [11, 15, 18, 20, 22, 28].
For PRI_JET_NUM = 3, remove features: [11, 15, 18, 20, 22, 28].

We then divide each of these 4 subgroups into two subsets based on outliers in DER_MASS_MMC. In the end, we have 8 subsets of data and fit one model per subset.

This step is defined directly in both implementations.ipynb and run.py. In run.py, it is implemented in the create_subsets() and remove_features() functions.
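The splitting and removal described above can be sketched roughly as follows. This is a minimal illustration, not the project's create_subsets() or remove_features(): the function name split_and_prune, the zero-based column positions jet_col=22 and mass_col=0, and the outlier marker -999.0 are all assumptions made for this sketch.

```python
import numpy as np

# Feature indices to drop per PRI_JET_NUM value, copied from the lists above.
FEATURES_TO_REMOVE = {
    0: [4, 5, 6, 11, 12, 15, 18, 20, 22, 23, 24, 25, 26, 27, 28, 29],
    1: [4, 5, 6, 11, 12, 15, 18, 20, 22, 26, 27, 28],
    2: [11, 15, 18, 20, 22, 28],
    3: [11, 15, 18, 20, 22, 28],
}


def split_and_prune(x, jet_col=22, mass_col=0, outlier_value=-999.0):
    """Return {(jet, is_outlier): (row_indices, pruned_features)} for 8 subsets.

    jet_col, mass_col and outlier_value are assumptions of this sketch.
    """
    mass_is_outlier = x[:, mass_col] == outlier_value
    subsets = {}
    for jet, columns in FEATURES_TO_REMOVE.items():
        in_jet_group = x[:, jet_col] == jet
        for is_outlier in (False, True):
            rows = np.flatnonzero(in_jet_group & (mass_is_outlier == is_outlier))
            # Keep the row indices so per-subset predictions can be merged back.
            subsets[(jet, is_outlier)] = (rows, np.delete(x[rows], columns, axis=1))
    return subsets
```

Returning the row indices together with the pruned features is a design choice of the sketch: it lets the predictions of the 8 models be written back into a single output vector in the original row order.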

Features Processing and Generation

For each subset of input x, we standardize the features using the standard score (z-score) and then expand them using logarithmic, polynomial, cross-term, and square-root basis functions.

This step is implemented in scripts/preprocess.py.
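As a rough illustration of this step, the sketch below standardizes the features and then stacks the four kinds of basis functions. It is not the code in scripts/preprocess.py: the function names, the default polynomial degree, and the use of log(1 + |z|) and sqrt(|z|) to keep the transforms defined for negative standardized values are assumptions of this sketch.

```python
from itertools import combinations

import numpy as np


def standardize(x):
    """Z-score each column; constant columns are left centered at zero."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (x - mean) / std


def expand_features(x, degree=2):
    """Stack bias, polynomial, log, square-root and cross-term features."""
    z = standardize(x)
    blocks = [np.ones((z.shape[0], 1))]               # bias column
    blocks += [z ** d for d in range(1, degree + 1)]  # polynomial terms
    blocks.append(np.log1p(np.abs(z)))                # log(1 + |z|), an assumption
    blocks.append(np.sqrt(np.abs(z)))                 # sqrt(|z|), an assumption
    cross = [z[:, i] * z[:, j] for i, j in combinations(range(z.shape[1]), 2)]
    if cross:
        blocks.append(np.column_stack(cross))         # pairwise products
    return np.hstack(blocks)
```

For an input with d columns, the expanded matrix has 1 + d * degree + 2 * d + d * (d - 1) / 2 columns, so the cross terms dominate the size once d grows.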

Cross-validation

We validate our models using cross-validation to check for underfitting and overfitting. For this we provide two files, implementations_cross_validation.py (Python script) and implementations_cross_validation.ipynb (Python notebook), which show that our models neither underfit nor overfit. Both are copies of implementations.py and implementations.ipynb, except that they only return the accuracies of the same methods under cross-validation (splitting the data into train and test sets).

For ease of use, please check implementations_cross_validation.ipynb.
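The idea of comparing train and test accuracy can be sketched with a small k-fold loop. This is only an illustration under assumptions: the helper names, the choice of k-fold splitting, the least-squares classifier used as the model, and the labels in {-1, 1} are hypothetical and may differ from what the cross-validation scripts do.

```python
import numpy as np


def k_fold_indices(n, k, seed=1):
    """Shuffle the row indices 0..n-1 and cut them into k folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)


def cross_validate(y, tx, k, fit, predict):
    """Return mean train and test accuracy over k folds."""
    folds = k_fold_indices(len(y), k)
    train_acc, test_acc = [], []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        w = fit(y[train_idx], tx[train_idx])
        train_acc.append(np.mean(predict(tx[train_idx], w) == y[train_idx]))
        test_acc.append(np.mean(predict(tx[test_idx], w) == y[test_idx]))
    return float(np.mean(train_acc)), float(np.mean(test_acc))


def fit_least_squares(y, tx):
    """Hypothetical model for the demo: ordinary least squares on the labels."""
    return np.linalg.lstsq(tx, y, rcond=None)[0]


def predict_sign(tx, w):
    """Map the linear score to labels in {-1, 1} (an assumed encoding)."""
    return np.where(tx @ w >= 0, 1, -1)
```

A train accuracy far above the test accuracy points to overfitting, while two low and similar accuracies point to underfitting.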

Important Notes for the Datasets

Place the two datasets (train.csv and test.csv) in the higgs-data/ directory.

Project Structure

higgs-data: CERN's public Higgs boson discovery datasets.
report: the report in LaTeX.
scripts: all main ML functions and helpers.

Minimum Dependencies

To execute implementations.py and run.py, you need at least numpy:

$ pip install numpy

implementations.py - 6 Mandatory ML Methods

We implement 6 ML methods as follows:

least_squares_GD : (Linear Regression using Gradient Descent)
least_squares_SGD : (Linear Regression using Stochastic Gradient Descent)
least_squares : (Linear Regression using Normal Equations)
ridge_regression : (Ridge Regression using Normal Equations)
logistic_regression : (Logistic Regression using Gradient Descent)
reg_logistic_regression : (Regularized Logistic Regression using Gradient Descent)
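For the two normal-equation methods in the list above, a minimal sketch might look like the following. The function names carry a _sketch suffix to make clear that this is not the code in implementations.py; the ridge signature, the mean-squared-error loss with a 1/(2N) factor, and the 2 * N * lambda_ penalty scaling are assumptions of this sketch.

```python
import numpy as np


def mse_loss(y, tx, w):
    """Mean squared error with a 1/(2N) factor (assumed convention)."""
    error = y - tx @ w
    return error @ error / (2 * len(y))


def least_squares_sketch(y, tx):
    """Solve the normal equations (X^T X) w = X^T y."""
    w = np.linalg.solve(tx.T @ tx, tx.T @ y)
    return w, mse_loss(y, tx, w)


def ridge_regression_sketch(y, tx, lambda_):
    """Solve (X^T X + 2 N lambda I) w = X^T y; the scaling is an assumption."""
    n, d = tx.shape
    a = tx.T @ tx + 2 * n * lambda_ * np.eye(d)
    w = np.linalg.solve(a, tx.T @ y)
    return w, mse_loss(y, tx, w)
```

Using np.linalg.solve rather than an explicit matrix inverse is the usual choice here because it is cheaper and numerically more stable.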

run.py - Creating Final Submission File

Our final result can be produced by executing the run.py script.
In run.py, we use logistic regression to produce our final submission.

Public leaderboard
- 82.842% accuracy.
Private leaderboard
- 82.753% accuracy.
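The logistic regression that run.py relies on can be illustrated with a short gradient-descent sketch. This is not the project's code: the function names, the labels encoded as 0/1, the averaged negative log-likelihood loss, and the 0.5 decision threshold are assumptions made for this example.

```python
import numpy as np


def sigmoid(t):
    """Logistic function applied element-wise."""
    return 1.0 / (1.0 + np.exp(-t))


def logistic_loss(y, tx, w):
    """Mean negative log-likelihood for labels y in {0, 1}."""
    t = tx @ w
    return np.mean(np.logaddexp(0, t) - y * t)


def logistic_regression_sketch(y, tx, initial_w, max_iters, gamma):
    """Plain gradient descent on the logistic loss."""
    w = np.asarray(initial_w, dtype=float).copy()
    for _ in range(max_iters):
        gradient = tx.T @ (sigmoid(tx @ w) - y) / len(y)
        w = w - gamma * gradient
    return w, logistic_loss(y, tx, w)


def predict_labels(tx, w):
    """Predict 1 where the estimated probability is at least 0.5, else 0."""
    return (sigmoid(tx @ w) >= 0.5).astype(int)
```

In a setup with 8 subsets, one such model would be trained per subset and the predictions merged back in the original row order.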

How to use implementations.py

Ensure that you have Python 3 installed on your machine.
Clone this repository.
To reuse any of the six ML method implementations, import the functions from implementations.py and pass the required parameters:

from implementations import least_squares_GD  # or any of the other five functions
# Example: run linear regression using gradient descent
weights, loss = least_squares_GD(y, tx, initial_w, max_iters, gamma)

Or import all methods at once:

from implementations import *
# Example to run least squares with normal equations
weights, loss = least_squares(y, tx)
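To make the expected inputs concrete, here is a self-contained sketch of a function with the same signature as the least_squares_GD call shown above, followed by a call on synthetic data. The _sketch suffix marks it as an illustration rather than the code in implementations.py; the 1/(2N) loss scaling and the synthetic data are assumptions.

```python
import numpy as np


def least_squares_GD_sketch(y, tx, initial_w, max_iters, gamma):
    """Gradient descent on the mean squared error; returns (weights, loss)."""
    w = np.asarray(initial_w, dtype=float).copy()
    n = len(y)
    for _ in range(max_iters):
        error = y - tx @ w
        gradient = -tx.T @ error / n
        w = w - gamma * gradient
    error = y - tx @ w
    return w, error @ error / (2 * n)


# Synthetic example: tx has a bias column plus two features.
rng = np.random.default_rng(0)
tx = np.c_[np.ones(100), rng.normal(size=(100, 2))]
y = tx @ np.array([1.0, 2.0, -3.0])
weights, loss = least_squares_GD_sketch(y, tx, np.zeros(3), max_iters=500, gamma=0.1)
```

Here y is a vector of N targets, tx an N x D feature matrix, initial_w a vector of D starting weights, max_iters the number of gradient steps, and gamma the step size.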

How to use run.py

Ensure that you have Python 3 installed on your machine.
Clone this repository.
Run run.py to create our final submission file for Kaggle:

$ cd IST_ML_PROJECT1/
$ python run.py

How to use implementations.ipynb - Jupyter Notebook

Install Anaconda.
Clone this repository.
Run Jupyter Notebook (included with Anaconda by default) from a terminal:

$ cd IST_ML_PROJECT1/
$ jupyter-notebook implementations.ipynb

Follow the steps in each cell to produce and display the results in HTML format.

Team - "Lovely Guys"

Project Repository Page

Cheng-Chun Lee (@wlo2398219)
Haziq Razali (@haziqrazali)
Sanadhi Sutandi (@sanadhis)
