Giter VIP home page Giter VIP logo

machine_learning_predictivemodeling_exoplanets's Introduction

machine-learning_challenge

Evaluation of several categorization algorithms on NASA's Exoplanet Kepler Dataset

This project evaluates several categorization techniques including:

  • Logistic Regression
  • KNN
  • Random Forest
  • Grid Search

The initial step was eliminating co-dependent variables/measures from the original dataset. This was done through a study of the data dictionary on the website where the dataset resided.

The next step was the exploration of the data. For each of the three labels: Confirmed, Candidate, and False-Positive there is a boxplot per selected variable. This analysis is included in the jupyter notebook called boxplots and ExploreData. The boxplots were used to visually understand the relations between each of the measures and the three labels.

The following are the take-aways from this analysis. The list shows the measure and what label appears to be more sensitive to it.

koi_period -> Candidate

koi_time0bk -> Confirmed koi_impact -> Confirmed koi_tec_plnt_num -> Confirmed koi_depth -> Confirmed

koi_duration -> False-Positive
koi_prad -> False-Positive koi_tec -> False-Positive koi_snr -> False-Positive koi_srad -> False-Positive

koi_insol -> NONE koi_steff -> NONE koi_slagg -> NONE kep_mag -> NONE va -> NONE dec -> NONE

The potential implication of the analysis above is that any categorization algorithm is more likely to be better at identifying False-Positive label than any of the other two labels. This was verified in all the algorithms (models) used.

The worst performing algorithm was Logistic Regression with an overall effectiveness of 60%. However, when a more detailed view of the prediction vs actual results was made. Logistic Regression had an effectiveness of 80% in identifying False Positives, a 60% of Confirmed, and only a 3% effectiveness in identifying Candidates.

The remaining algorithms were comparably effective (92% - 96%) across the board, although accurary (recall) of Candidates was the lowest for all of them.

There was a large computational cost with the Grid Search. The process had to be re-started several times to ensure processing of all the folds.

It was important to eliminate irrelevant or non-significant attributes of measure to ensure the models had a stron base to train the model and effectively do testing and then predictions.

machine_learning_predictivemodeling_exoplanets's People

Contributors

fbarillas08 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.