titanic_kaggle's Introduction

Data Processing on Titanic Dataset

About this project:

One purpose of this project is to practice data processing. The other is to see how PCA (Principal Component Analysis) affects the performance of machine learning models. I trained four kinds of models on two versions of the dataset (with PCA and without PCA).

The dataset:

Titanic dataset: https://www.kaggle.com/c/titanic/data

  • train.csv: Rows: 891, Cols: 11, (memory usage: 83.5+ KB)
  • test.csv: Rows: 418, Cols: 10, (memory usage: 35.9+ KB)
  • gender_submission.csv: Rows: 418, Cols: 1, (memory usage: 6.5 KB)

Useful website:

Titanica: https://www.encyclopedia-titanica.org/titanic-deckplans/profile.html

Variables in the dataset

| Variable | Definition | Data type | Key |
|----------|------------|-----------|-----|
| Survived | Survived | int | 0 = No, 1 = Yes |
| Pclass | Ticket class | int | 1 = 1st, 2 = 2nd, 3 = 3rd |
| Name | Name | str | |
| Sex | Sex | str | |
| Age | Age in years | float | |
| SibSp | # of siblings/spouses aboard the Titanic | int | |
| Parch | # of parents/children aboard the Titanic | int | |
| Ticket | Ticket number | str | |
| Fare | Passenger fare | float | |
| Cabin | Cabin number | str | |
| Embarked | Port of Embarkation | str | C = Cherbourg, Q = Queenstown, S = Southampton |

What problems we have in the dataset

  • Missing Values
  • Outliers
  • Non-numerical Data
  • Multiple Value Ranges

Missing Values

Columns that have missing values

There are three columns which have missing values.

  • Age
  • Cabin
  • Embarked
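A quick way to find such columns is to count the missing entries per column. The sketch below uses a small made-up DataFrame as a stand-in for train.csv; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv (made-up values, not real passengers).
df = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Age": [38.0, np.nan, 27.0, np.nan],
    "Cabin": ["C85", np.nan, np.nan, np.nan],
    "Embarked": ["C", "S", np.nan, "S"],
})

# Number of missing entries in each column.
missing_counts = df.isna().sum()

# Keep only the columns that actually have missing values.
cols_with_missing = missing_counts[missing_counts > 0]
print(cols_with_missing)
```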

Solution for Age column

Pclass has the largest absolute correlation with Age, so the solution is to take the mean Age of each Pclass and fill each blank with the mean of that passenger's class.
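A minimal sketch of this per-class fill, assuming pandas and a toy DataFrame with made-up ages (this may differ from the notebook's actual code):

```python
import numpy as np
import pandas as pd

# Toy data: one missing Age in class 1 and one in class 3.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, 30.0, np.nan],
})

# Mean Age within each Pclass, broadcast back to every row.
class_mean_age = df.groupby("Pclass")["Age"].transform("mean")

# Fill only the missing entries with the mean of the passenger's class.
df["Age"] = df["Age"].fillna(class_mean_age)
print(df)
```

Here the blank in class 1 becomes 45.0 (the mean of 40 and 50) and the blank in class 3 becomes 25.0.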

Solution for Cabin column
Drop this column, because:

  • There does not seem to be a correlation between Cabin and Survived, which is the target variable.
  • Cabin is missing 77.1% of its values, so it is hard to fill.
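The missing share can be checked before dropping the column. This is a sketch on a toy DataFrame, so the printed percentage is that of the toy data, not the 77.1% of the real file.

```python
import numpy as np
import pandas as pd

# Toy data: three of four Cabin values are missing.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Fraction of missing values in Cabin (mean of a boolean mask).
missing_ratio = df["Cabin"].isna().mean()
print(f"Cabin missing: {missing_ratio:.1%}")

# Drop the column entirely.
df = df.drop(columns=["Cabin"])
```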

Solution for Embarked column
The column is missing only two values, so I fill both blanks with S (Southampton), the port where most passengers boarded.
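As a sketch, the most frequent port can be computed instead of hard-coded. The toy values below are made up so that S is the most common port, as in the real data.

```python
import pandas as pd

# Toy data with two blanks in Embarked.
df = pd.DataFrame({"Embarked": ["S", "C", "S", None, "Q", "S", None]})

# mode() ignores missing values and returns the most frequent entry first.
most_common = df["Embarked"].mode()[0]

# Fill the blanks with the most common port.
df["Embarked"] = df["Embarked"].fillna(most_common)
print(df["Embarked"].tolist())
```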

Outliers

The minimum and maximum of Fare look wrong.

Solution for Fare
Using the DataFrame and Titanica (the website listed above), try to find the fare of rooms whose size and shape are similar to those of the rooms whose Fare is missing. If no such room can be found, use the mean Fare of the passenger's Pclass, because Fare and Pclass are correlated.
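The fallback step (the Pclass mean) could look like the sketch below. It does not cover the manual lookup on Titanica. Treating a fare of 0 as unknown is my assumption for this example, not something stated above, and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy data: one zero fare in class 1 and one missing fare in class 3.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Fare": [80.0, 100.0, 0.0, 8.0, 10.0, np.nan],
})

# Inspect the extremes first.
print(df["Fare"].describe()[["min", "max"]])

# Assumption: a fare of 0 is treated as unknown, just like a blank.
suspect = df["Fare"].isna() | (df["Fare"] == 0)
fare = df["Fare"].mask(suspect)  # suspect entries become NaN

# Mean of the remaining fares within each Pclass, aligned to every row.
class_mean_fare = fare.groupby(df["Pclass"]).transform("mean")

# Fall back to the class mean for the suspect entries.
df["Fare"] = fare.fillna(class_mean_fare)
```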

Non-Numerical Data

In the dataset I use for training, two columns have non-numerical values.

  • Sex: Sex is not usually treated as an ordinal variable, but I treat it as one here because women were given priority in the rescue, as the previous plot shows. {male: 0, female: 1}
  • Embarked: This is not an ordinal variable, so I use one-hot encoding.
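Both encodings can be sketched with pandas as follows. The toy rows are made up, and `pd.get_dummies` is one possible way to one-hot encode, not necessarily the one used in this project.

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Sex as an ordinal variable: {male: 0, female: 1}.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# Embarked as one-hot columns: Embarked_C, Embarked_Q, Embarked_S.
df = pd.get_dummies(df, columns=["Embarked"], prefix="Embarked", dtype=int)
print(df)
```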

Multiple Value Ranges

To bring every column into the range between 0 and 1, use Min-Max Normalization.

$$ y = \frac{x - x_{min}}{x_{max} - x_{min}} $$



Where

  • $y$ is the normalized value of $x$
  • $x$ is a value in a column
  • $x_{min}$ is the minimum in the column
  • $x_{max}$ is the maximum in the column
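The formula applied column by column, as a minimal sketch with pandas. The helper name `min_max_normalize` and the toy values are made up for illustration.

```python
import pandas as pd

def min_max_normalize(column: pd.Series) -> pd.Series:
    """Scale a numeric column to [0, 1] with y = (x - x_min) / (x_max - x_min)."""
    col_min, col_max = column.min(), column.max()
    return (column - col_min) / (col_max - col_min)

df = pd.DataFrame({
    "Age": [10.0, 20.0, 40.0],
    "Fare": [0.0, 50.0, 200.0],
})

# Apply the formula to every column independently.
normalized = df.apply(min_max_normalize)
print(normalized)
```

For example, a Fare of 50 with a column minimum of 0 and maximum of 200 becomes (50 - 0) / (200 - 0) = 0.25. This sketch does not handle a constant column, where the denominator would be zero.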

Others

I also use PCA to summarize the dataset and reduce the number of feature dimensions. I trained four models (Decision Tree, Random Forest, KNN, and NN) on the dataset both without and with PCA, and compared them by actually submitting the results to Kaggle.
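A rough sketch of such a comparison, assuming scikit-learn. It is not the project's actual code: the data is synthetic, the number of PCA components and all model settings are arbitrary choices for illustration, and local cross-validation stands in for the Kaggle submission used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the processed Titanic features (not the real data).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Hypothetical settings; the project's real hyperparameters are not given.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "NN": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
}

scores = {}
for name, model in models.items():
    # Mean 5-fold accuracy on the original features.
    scores[(name, "without PCA")] = cross_val_score(model, X, y, cv=5).mean()
    # Same model, but PCA (8 -> 4 dimensions) is fitted inside each fold.
    with_pca = make_pipeline(PCA(n_components=4), model)
    scores[(name, "with PCA")] = cross_val_score(with_pca, X, y, cv=5).mean()

for (name, variant), score in scores.items():
    print(f"{name:14s} {variant:12s} accuracy = {score:.3f}")
```

Putting PCA inside the pipeline means it is refitted on the training part of each fold, so the held-out part does not influence the projection.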
