Our class project evaluates a health survey dataset from Kaggle. We'll perform EDA and feature selection, train and evaluate different ML models, and visualize the results. Dataset: Behavioral Risk Factor Surveillance System Analysis (https://www.kaggle.com/datasets/cdc/behavioral-risk-factor-surveillance-system)
Main Idea: The Behavioral Risk Factor Surveillance System (BRFSS) is a series of health-related telephone surveys that collect data from U.S. residents on their health-related risk behaviours, chronic health conditions, and use of preventive services.
Goal: To investigate how behaviours and health conditions might affect people's General Health Status, with the aim of identifying populations at increased risk of chronic health conditions.
Dataset: Our dataset contains 491,775 records and 330 features. The variables are of mixed types: some are categorical (e.g., age, marital status), others are numerical (e.g., total fruits consumed per day, minutes of first activity).
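With this many mixed-type columns, a per-column summary is a useful first look. The snippet below is a minimal sketch on a toy table, not the real survey file: the column names (`age_group`, `marital_status`, `fruits_per_day`) and the helper `summarize_columns` are hypothetical, chosen only to mirror the kinds of variables described above.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return the dtype, missing-value count, and distinct-value count per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(dropna=True),
    })


# Toy stand-in for the survey data; all column names are hypothetical.
toy = pd.DataFrame({
    "age_group": ["18-24", "25-34", None, "65+"],
    "marital_status": ["married", "single", "single", "widowed"],
    "fruits_per_day": [1.5, 0.0, 2.0, None],
})

summary = summarize_columns(toy)
print(summary)
```

On the full dataset, the same function could be applied to the loaded DataFrame to spot columns that are mostly missing or that have only a handful of distinct values (likely categorical codes).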
Project Plan: Explore the relationship between each feature and General Health Status, and also the relationship between different features.
- Data exploration: we will conduct EDA to find missing values and outliers, then use feature engineering and correlation analysis to identify the most important features to focus on.
- Hypothesis testing: we will use hypothesis tests and statistical analysis to address our research question.
- Modeling: we will fit linear regression and more complex ML models to predict General Health Status.
- Evaluation: we will split the dataset into two groups, training and testing. After training, we will evaluate each model on the test set and compare how accurately the models predict General Health Status.
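
The correlation, modeling, and evaluation steps of the plan can be sketched end to end as follows. This is an illustrative sketch under stated assumptions, not our final pipeline: the data is synthetic, every column name is hypothetical, the target is treated as a numeric score, and the 80/20 split and RMSE metric are placeholder choices. It uses pandas and scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data; every column name here is hypothetical.
df = pd.DataFrame({
    "exercise_minutes": rng.uniform(0, 120, n),
    "fruits_per_day": rng.uniform(0, 5, n),
    "noise_feature": rng.normal(0, 1, n),
})
# Made-up numeric target for illustration only (not a claim about the real data).
df["general_health"] = (
    4.0
    - 0.02 * df["exercise_minutes"]
    - 0.2 * df["fruits_per_day"]
    + rng.normal(0, 0.3, n)
)

# Correlation analysis: rank features by absolute correlation with the target.
corr = df.corr()["general_health"].drop("general_health")
ranked = corr.abs().sort_values(ascending=False)

# Evaluation setup: hold out 20% of the rows for testing.
X = df[ranked.index.tolist()]
y = df["general_health"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Modeling: fit a linear regression baseline and score it on the held-out set.
model = LinearRegression().fit(X_train, y_train)
rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))

print(ranked)
print(f"test RMSE: {rmse:.3f}")
```

More complex models could be swapped in for `LinearRegression` and compared on the same held-out split, so that differences in test error reflect the model rather than the data partition.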