nataliepham6720 / health-survey-risk-factors


Our class project evaluating a health survey dataset from Kaggle. We perform EDA and feature selection, train and evaluate different ML models, and visualize the results. Dataset: Behavioral Risk Factor Surveillance System Analysis

Jupyter Notebook 100.00%

health-survey-risk-factors's Introduction

Health-Survey-Risk-Factors

Main Idea: The Behavioral Risk Factor Surveillance System (BRFSS) is a system of health-related telephone surveys that collects data from U.S. residents on their health-related risk behaviours, chronic health conditions, and use of preventive services.

Goal: To investigate how behaviours and health conditions affect people's General Health Status, with the aim of identifying populations at increased risk of chronic health conditions.

Dataset: Our dataset contains 491,775 records and 330 features. The variables are of mixed types: some are categorical (e.g., age, marital status), others are numerical (e.g., total fruits consumed per day, minutes of first activity).
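
As a rough illustration of handling such mixed variable types, the sketch below separates categorical from numerical columns and measures the share of missing values per column with pandas. The small table and its column names are made up for the example and are not the actual BRFSS variable names.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the survey table.
# Column names are hypothetical, chosen only to mirror the examples above.
df = pd.DataFrame({
    "age_group": pd.Categorical(["18-24", "25-34", "65+", "25-34"]),
    "marital_status": pd.Categorical(["single", "married", "widowed", "married"]),
    "fruits_per_day": [1.5, np.nan, 2.0, 0.5],
    "activity_minutes": [30.0, 45.0, np.nan, np.nan],
})

# Split the columns by data type so each group can be explored appropriately.
categorical_cols = df.select_dtypes(include="category").columns.tolist()
numerical_cols = df.select_dtypes(include="number").columns.tolist()

# Fraction of missing values per column, largest first.
missing_fraction = df.isna().mean().sort_values(ascending=False)

print("categorical:", categorical_cols)
print("numerical:", numerical_cols)
print(missing_fraction)
```

With 330 features, a summary like `missing_fraction` is one possible way to decide which columns to drop or impute before any modeling.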

Project Plan: Explore the relationship between each feature and General Health Status, as well as the relationships among the features themselves.

Data exploration: We will conduct EDA to identify missing values and outliers, and carry out feature engineering and correlation analysis to identify the most important features to focus on.
Hypothesis testing: We will apply hypothesis tests and statistical analysis to address our research question.
Modeling: We will fit linear regression and more complex ML models to predict General Health Status.
Evaluation: We will split the data into training and testing sets. After training, we will test/validate the models and compare how accurately each predicts General Health Status.
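
The modeling and evaluation steps could be sketched as follows. This is a minimal example under stated assumptions, not the notebook's actual code: the data is synthetic, the decision tree stands in for the "more complex" models, and RMSE on the held-out test set is an assumed choice of metric.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: three numeric features and a numeric target.
# In the project, the target would be General Health Status.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 3.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=500)

# Hold out 20% of the rows for testing (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate models: linear regression plus one more flexible model.
models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=4, random_state=0),
}

# Fit each model on the training set and score it on the test set.
test_rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    test_rmse[name] = float(np.sqrt(mean_squared_error(y_test, predictions)))

# The model with the lowest test error is the preferred one.
best_model = min(test_rmse, key=test_rmse.get)
print(test_rmse, best_model)
```

Because General Health Status is typically recorded as an ordered rating, classification models and metrics such as accuracy may be an equally reasonable choice; the regression setup here simply follows the plan above.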
