One purpose of this project is to practice data processing. The other is to see how PCA (Principal Component Analysis) affects the performance of machine learning models. I trained four kinds of models on two versions of the dataset (with PCA and without PCA).
Titanic dataset: https://www.kaggle.com/c/titanic/data
- train.csv: Rows: 891, Cols: 11, (memory usage: 83.5+ KB)
- test.csv: Rows: 418, Cols: 10, (memory usage: 35.9+ KB)
- gender_submission.csv: Rows: 418, Cols: 1, (memory usage: 6.5 KB)
Titanica: https://www.encyclopedia-titanica.org/titanic-deckplans/profile.html
Variables | Definition | Data type | Key |
---|---|---|---|
Survived | Survived | int | 0 = No, 1 = Yes |
Pclass | Ticket class | int | 1 = 1st, 2 = 2nd, 3 = 3rd |
Name | Name | str | |
Sex | Sex | str | |
Age | Age in years | float | |
Sibsp | # of siblings/spouses aboard the Titanic | int | |
Parch | # of parents/children aboard the Titanic | int | |
Ticket | Ticket number | str | |
Fare | Passenger fare | float | |
Cabin | Cabin number | str | |
Embarked | Port of Embarkation | str | C = Cherbourg, Q = Queenstown, S = Southampton |
- Missing Values
- Outliers
- Non-numerical Data
- Multiple Value Ranges
Three columns have missing values.
- Age
- Cabin
- Embarked
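The columns with missing values can be found by counting them per column. This is a minimal sketch on a tiny made-up DataFrame that stands in for train.csv; the values are illustrative, not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for train.csv; the values are made up for illustration.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Count missing values per column and keep only the columns that have any.
missing = df.isna().sum()
missing = missing[missing > 0]
print(missing)
```

On the real data, the same two lines applied to the DataFrame loaded from train.csv should list the three columns above.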
`Pclass` has the largest absolute correlation with `Age`, so the solution is to take the mean `Age` within each `Pclass` and fill each missing `Age` with the mean of that passenger's class.
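Filling by class mean can be sketched with a pandas group-wise transform. The DataFrame below is made up for illustration, and this is one possible way to write the step, not necessarily the exact code used in the project.

```python
import numpy as np
import pandas as pd

# Made-up rows: two classes, each with one missing Age.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# Mean Age within each Pclass, broadcast back to one value per row.
class_mean_age = df.groupby("Pclass")["Age"].transform("mean")

# Fill only the missing entries; observed ages are left unchanged.
df["Age"] = df["Age"].fillna(class_mean_age)
print(df)
```

Using `transform` keeps the result aligned with the original rows, so no merge is needed after the group-by.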
Solution for `Cabin` column
Drop this column.
- There does not seem to be a correlation between `Cabin` and `Survived`, the target variable.
- `Cabin` is missing 77.1% of its values, so it is hard to fill.
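The missing ratio and the drop itself can be sketched as follows. The data is made up, so the ratio printed here is not the 77.1% of the real column.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for train.csv.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Fraction of missing values: the mean of a boolean mask.
missing_ratio = df["Cabin"].isna().mean()
print(f"Cabin missing ratio: {missing_ratio:.1%}")

# Drop the column, returning a new DataFrame without it.
df = df.drop(columns=["Cabin"])
```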
Solution for `Embarked` column
The column is missing only two values, so I fill both with `S` (Southampton), the port where most passengers boarded.
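Rather than hard-coding `S`, the most frequent port can be looked up from the data. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up column with two missing entries.
df = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", np.nan]})

# mode() ignores missing values by default; take the first (most frequent) entry.
most_common_port = df["Embarked"].mode()[0]

# Fill the blanks with the most frequent port.
df["Embarked"] = df["Embarked"].fillna(most_common_port)
print(most_common_port)
```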
The minimum and maximum of `Fare` look suspicious.
Solution for `Fare`
Using the DataFrame and Titanica, a useful reference site, try to find the fare of rooms whose size and shape are similar to those of the rooms whose `Fare` is missing.
If no such rooms can be found, use the mean `Fare` of each `Pclass`, because `Fare` and `Pclass` are correlated.
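The fallback (class mean) can be sketched as below. Which fares count as suspicious is an assumption here: the sketch treats a fare of 0 as suspect, which is only an illustrative rule, and the data is made up.

```python
import pandas as pd

# Made-up rows; a fare of 0 marks the suspect entries in this sketch.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Fare": [80.0, 0.0, 100.0, 8.0, 10.0, 0.0],
})

# Assumption: Fare == 0 is treated as suspect (illustrative rule only).
suspect = df["Fare"] == 0

# Turn suspect fares into NaN so they do not distort the class means.
fare = df["Fare"].mask(suspect)

# Mean of the remaining fares within each Pclass, one value per row.
class_mean_fare = fare.groupby(df["Pclass"]).transform("mean")

# Replace only the suspect entries with their class mean.
df["Fare"] = fare.fillna(class_mean_fare)
print(df)
```

Masking before averaging matters: otherwise the suspect values would pull the class means toward themselves.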
In the dataset I use for training, two columns have non-numerical values.
- `Sex`: Sex is not usually treated as an ordinal variable, but I treat it as one here because, as the previous plot shows, women were given priority in the rescue. The mapping is {male: 0, female: 1}.
- `Embarked`: This is not an ordinal variable, so I use one-hot encoding.
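Both encodings can be sketched with pandas. The rows are made up, and the `Embarked_*` column names are simply what `pd.get_dummies` generates, not names taken from the project.

```python
import pandas as pd

# Made-up rows with the two non-numerical columns.
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Ordinal-style mapping for Sex: {male: 0, female: 1}.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# One-hot encode Embarked into one 0/1 column per port.
df = pd.get_dummies(df, columns=["Embarked"], dtype=int)
print(df)
```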
To bring every column into the range between 0 and 1, use Min-Max Normalization:
$$y = \frac{x - x_{min}}{x_{max} - x_{min}}$$
where
- $y$ is the normalized value of $x$
- $x$ is a value in a column
- $x_{min}$ is the minimum of that column
- $x_{max}$ is the maximum of that column
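Min-Max Normalization subtracts the column minimum from each value and divides by the column range (maximum minus minimum). A minimal sketch on made-up values, applied column by column:

```python
import pandas as pd

# Made-up numeric columns with different value ranges.
df = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0],
    "Fare": [0.0, 50.0, 200.0],
})

# Per-column minimum and maximum.
col_min = df.min()
col_max = df.max()

# (x - min) / (max - min), broadcast across each column.
normalized = (df - col_min) / (col_max - col_min)
print(normalized)
```

A constant column would give a zero range and a division by zero, so such columns need separate handling. scikit-learn's `MinMaxScaler` performs the same scaling and may be more convenient when the same minimum and maximum must be reused on test.csv.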
I also used PCA to summarize the dataset and reduce the number of feature dimensions. I then trained four models (Decision Tree, Random Forest, KNN, and NN) on both versions of the dataset, without PCA and with PCA, and compared them by submitting the predictions to Kaggle.
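The comparison can be sketched with scikit-learn as below. Several things here are assumptions made for illustration: the data is random rather than the Titanic features, the number of PCA components (4) and all hyperparameters are arbitrary, `MLPClassifier` stands in for the NN, and a hold-out validation split replaces the Kaggle submission as the scoring step.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Random stand-in data: 200 rows, 8 features already scaled to [0, 1].
rng = np.random.default_rng(0)
X = rng.random((200, 8))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit PCA on the training part only, then project both parts.
pca = PCA(n_components=4).fit(X_train)
datasets = {
    "without PCA": (X_train, X_val),
    "with PCA": (pca.transform(X_train), pca.transform(X_val)),
}


def make_models():
    """Return fresh, unfitted instances of the four model kinds."""
    return {
        "Decision Tree": DecisionTreeClassifier(random_state=0),
        "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "NN": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    }


# Validation accuracy for every (dataset, model) pair.
scores = {}
for data_name, (train_part, val_part) in datasets.items():
    for model_name, model in make_models().items():
        model.fit(train_part, y_train)
        scores[(data_name, model_name)] = model.score(val_part, y_val)

for key, accuracy in scores.items():
    print(key, round(accuracy, 3))
```

Fitting PCA on the training split alone keeps information from the validation rows out of the projection, which makes the two sets of scores a fairer comparison.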