Brainstroke Prediction

Kaggle dataset link : https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset


Libraries used for dataset processing

  • NumPy
  • Pandas

Libraries used for graphical representation

  • Matplotlib
  • Seaborn

Libraries used for Scaling and Oversampling

  • sklearn.preprocessing (scikit-learn)
  • imblearn (imbalanced-learn)

PREPROCESSING


  • Removed the id column to reduce the dimension; it did not add any insight to the data analysis.
df = df.drop(['id'],axis=1)
  • Checked the count of NULL values for each attribute of the dataset.
print(df.isna().sum())
  • Only the BMI attribute had NULL values.
  • Plotted the distribution of the BMI values; it looked skewed, so the missing values were imputed using the median.
  • Did not drop the records with missing BMI: the dataset is highly skewed on the target attribute (stroke), and a good portion of the records with missing BMI were positive stroke cases.
  • The dataset is skewed because only a few records have a positive value for the stroke target attribute.
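The null check and median imputation described above can be sketched as follows. This is a minimal illustration on a made-up five-row frame, not the project's actual code; the column names `bmi` and `stroke` are assumed to follow the Kaggle CSV.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kaggle data; the values are invented for illustration.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "bmi": [22.0, np.nan, 31.5, 27.0, np.nan],
    "stroke": [0, 1, 0, 0, 1],
})

# Drop the identifier column, as in the preprocessing step above.
df = df.drop(columns=["id"])

# Count missing values per column.
null_counts = df.isna().sum()
print(null_counts)

# Impute missing BMI with the median of the observed values;
# the median is less affected by a skewed distribution than the mean.
bmi_median = df["bmi"].median()
df["bmi"] = df["bmi"].fillna(bmi_median)
print(df)
```

Imputing instead of dropping keeps all five rows, including the two positive stroke records whose BMI was missing.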

  • In the gender attribute there were 3 types: Male, Female and Other. Only 1 record had the type "Other", so it was converted to the majority type to reduce the dimension.
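One way to fold the single "Other" record into the majority type is shown below. The frame is a toy example and the use of `mode()` to find the majority is an assumption; the source does not say how the replacement was coded.

```python
import pandas as pd

# Toy gender column; the real dataset has a single "Other" record.
df = pd.DataFrame(
    {"gender": ["Female", "Male", "Female", "Other", "Female", "Male"]}
)

# The most frequent category is treated as the majority type.
majority_gender = df["gender"].mode()[0]

# Replace the rare category so only two categories remain.
df["gender"] = df["gender"].replace("Other", majority_gender)
print(df["gender"].value_counts())
```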

  • Most of the attributes in the dataset were binary; the numeric binary values were converted into string values for dummy encoding.

    • Dummy encoding is similar to one-hot encoding: values in the encoded columns are 1/0, and additional attributes/columns are created.
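The conversion and encoding step might look like the sketch below. The "No"/"Yes" labels are hypothetical, and `drop_first=True` is an assumption: it is what usually distinguishes dummy encoding (k-1 columns per attribute) from plain one-hot encoding, but the source does not state which option was used.

```python
import pandas as pd

# Toy frame with column names assumed from the Kaggle CSV.
df = pd.DataFrame({
    "hypertension": [0, 1, 0, 1],
    "ever_married": ["Yes", "No", "Yes", "Yes"],
    "age": [67.0, 45.0, 80.0, 52.0],
})

# Turn the numeric 0/1 flag into string labels so it is handled
# as a categorical attribute (labels are illustrative).
df["hypertension"] = df["hypertension"].map({0: "No", 1: "Yes"})

# Dummy-encode the categorical columns into 1/0 indicator columns.
# drop_first=True keeps k-1 indicators per attribute (assumption).
encoded = pd.get_dummies(
    df,
    columns=["hypertension", "ever_married"],
    drop_first=True,
    dtype=int,
)
print(encoded)
```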
  • Random oversampling was applied to the dataset to balance the skew in the target attribute.

    • This boosts the number of records in the minority class.
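Random oversampling amounts to drawing minority-class rows with replacement until both classes are the same size. The sketch below does this with plain pandas on a toy frame to stay dependency-light; it is not the project's code. The library listed above offers the same operation as `RandomOverSampler(...).fit_resample(X, y)` in imblearn.

```python
import pandas as pd

# Toy imbalanced data: 6 negative records, 2 positive records.
df = pd.DataFrame({
    "age": [67, 45, 80, 52, 33, 71, 58, 49],
    "stroke": [1, 0, 0, 0, 0, 1, 0, 0],
})

counts = df["stroke"].value_counts()
majority_label = counts.idxmax()
minority_label = counts.idxmin()

# Number of extra minority rows needed to match the majority class.
n_extra = int(counts[majority_label] - counts[minority_label])

# Sample minority rows with replacement and append them.
minority = df[df["stroke"] == minority_label]
extra = minority.sample(n=n_extra, replace=True, random_state=42)
balanced = pd.concat([df, extra], ignore_index=True)

print(balanced["stroke"].value_counts())
```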

EDA - Exploratory Data Analysis


  • Plotted each attribute (pie charts and histograms) to analyse trends, if any.
  • Plotted the relation of the target attribute to the other attributes to look for any correlation.
  • Plotted a heatmap of the correlations between the attributes.
    • The heatmap showed very little correlation between the attribute values.
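The heatmap is a rendering of the pairwise correlation matrix, which can be computed as sketched here. The data is randomly generated for illustration only (so the correlations are near zero by construction), and the column names are assumed from the Kaggle CSV.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; not the real stroke dataset.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.uniform(20, 80, n),
    "avg_glucose_level": rng.normal(100, 25, n),
    "bmi": rng.normal(28, 5, n),
    "stroke": rng.integers(0, 2, n),
})

# Pearson correlation between every pair of numeric attributes.
corr = df.corr()
print(corr.round(2))
```

Passing `corr` to `seaborn.heatmap(corr, annot=True)` would draw it as a heatmap.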

MODEL BUILDING


  • Created a train and test split (80-20) of the oversampled dataset.
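An 80-20 split can be sketched with scikit-learn as follows. The arrays are placeholders, and `stratify=y` and `random_state=42` are assumptions added for a reproducible, class-balanced split; the source does not say whether they were used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and a balanced binary target (100 samples).
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# test_size=0.2 gives the 80-20 split described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

Because the split here follows oversampling, as the source describes, copies of the same minority record can end up in both the training and the test set.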

Applied several machine learning models for predictive analysis:

  1. Decision tree
  2. KNN
  3. XGBoost
  4. Random forest
  5. Logistic regression
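Fitting and comparing the listed models could look like the sketch below. It runs on synthetic data from `make_classification`, so the printed accuracies are not the project's results, and all hyperparameters are assumptions. XGBoost is left out because it lives in the separate `xgboost` package; if installed, `xgboost.XGBClassifier()` could be added to the dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data as a stand-in.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameters are illustrative defaults, not the source's settings.
models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Logistic regression": LogisticRegression(max_iter=1000),
}

# Fit each model and record its accuracy on the held-out test set.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)
    print(f"{name}: {accuracies[name]:.2%}")
```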

Analysed the results using the confusion matrix (accuracy, precision, recall), ROC plots and AUC scores.
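How these metrics follow from the confusion matrix can be seen in a small worked example. The labels and scores below are made up; only the scikit-learn metric functions are real.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Invented ground truth, hard predictions and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
y_score = [0.1, 0.2, 0.3, 0.6, 0.9, 0.8, 0.7, 0.4]

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
auc = roc_auc_score(y_true, y_score)         # area under the ROC curve

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} auc={auc:.4f}")
```

Here 3 of the 4 positives and 3 of the 4 negatives are classified correctly, so accuracy, precision and recall are all 0.75, while the AUC is 0.9375 because 15 of the 16 positive-negative pairs are ranked correctly by score.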

Accuracies calculated:

  1. Decision tree : 97.89%
  2. KNN : 97.22%
  3. XGBoost : 97.48%
  4. Random forest : 99.48%
  5. Logistic regression : 76.34%

Chosen model - RANDOM FOREST

The results were validated for overfitting using k-fold cross-validation (20 splits).

  • Accuracy for Random Forest: 95.01%
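A 20-split k-fold check of a random forest can be sketched as below. The data is synthetic, and the shuffling, `random_state` and `n_estimators` values are assumptions; the source only states that 20 splits were used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data; not the real stroke dataset.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# 20 folds: each fold serves once as the validation set.
cv = KFold(n_splits=20, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

# One accuracy score per fold.
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"Mean accuracy over {len(scores)} folds: {scores.mean():.4f}")
```

A mean cross-validated accuracy well below the single-split accuracy is the usual sign that the single-split figure was optimistic.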

Contributors

emilbluemax
