Brainstroke Prediction

Kaggle dataset link : https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

Libraries used for dataset processing

Numpy
Pandas

Libraries used for graphical representation

Matplotlib
Seaborn

Libraries used for Scaling and Oversampling

Sklearn.preprocessing
Imblearn

PREPROCESSING

Removed the id column – decreasing the dimension – did not add to insights in the data analysis.

df = df.drop(['id'],axis=1)

Count for NULL values are checked among the attributes of the dataset

print(df.isna().sum())

Only BMI-Attribute had NULL values
Plotted BMI's value distribution - looked skewed - therefore imputed the missing values using the median.
Didn’t eliminate the records due to dataset being highly skewed on the target attribute – stroke and a good portion of the missing BMI values had accounted for positive stroke

The dataset was skewed because there were only few records which had a positive value for stroke-target attribute
In the gender attribute, there were 3 types - Male, Female and Other. There was only 1 record of the type "other", Hence it was converted to the majority type – decrease the dimension
Most of the attributes in the dataset were binary values – converting the numeric bin values into string bin values for dummy encoding.
- Dummy encoding similar to one-hot encoding – Values in the binary ecoded columns are 1/0 – Additional attributes/columns created.
Random oversampling done on the dataset to balance the skew in the target attributes.
- Boosting the number of records in the minority class – records

EDA - Exploratory Data Analysis

Plotted plots of each attribute - Analyse trends if any – plots: pie, histogram.
Plotted relation of target attribute to other attributes to find any correlation.
Plotted the heatmap – correlation plot between the attributes.
- Heatmap showed very less correlation between the attribute values.

MODEL BUILDING

Creating a train and test split of the oversampled dataset. (80-20)

Applied various Machine learning models for predictive analysis

Decision tree
KNN
XG-Boost
Random forest
Logistic regression

Analysed the results generated using confusion matrix - accuracy, precision, recall and plotting the ROC plot and generating the AUC scores.

Accuracies calculated:

Decision tree : 97.89%
KNN : 97.22%
XG-Boost : 97.48%
Random forest : 99.48%
Logistic regression : 76.34%

Chosen model - RANDOM FOREST

Results were validated using the k fold (20 splits) validation for overfitting

Accuracy: 95.01
For Random Forest

emilbluemax / brainstroke Goto Github PK

brainstroke's Introduction

Brainstroke Prediction

PREPROCESSING

EDA - Exploratory Data Analysis

MODEL BUILDING

brainstroke's People

Contributors

Stargazers

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent