Giter VIP home page Giter VIP logo

eco-395-project-somin-lee-chenxin-zhu-tianyu-wang's Introduction

ECO-395-Project-Somin-Lee-Chenxin-zhu-Tianyu-Wang

The project, co-authored by Somin Lee, Chenxin zhu and Tianyu Wang, aims to explore issues related to low fertility in South Korea. We analyzed the datasets and built models for comparison, and finally obtained a model with a good fit degree and analyzed the contribution of different variables to the model fit. Finally, the variables with high contribution degree are analyzed according to their practical significance, and their practical logic is found.

See analysis report on Project_0429.pdf

Reproduction

We used two different languages to complete the project: R and python.

For R part, please follow the Project_0429.Rmd file in repository and use the dataset Data.csv in data folder. The packages used are as follow:

library(readxl)
library(dplyr)
library(ggplot2)
library(caret)
library(randomForest)
library(e1071)
library(factoextra)
library(readr)
library(rpart)
library(gbm)
library(readr)
library(osmdata)
library(tidyverse)
library(tidyr)
library(rsample) 
library(ggplot2)
library(pROC)

For python part, please run machinelearning_final.ipynb using the projectdata_new.csv in data folder. We have induced neural network to analyze and fit the data, and finally selected the top variables to combine with the variables selected by the traditional model for the final fit. We used the Pytorch framework, CUDA version 12.4.1, and the package is as follows:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from torch.utils.data import DataLoader, TensorDataset
import torch
import chardet
import torch.nn as nn
import torch.optim as optim
import shap

After importing packages, we cleaned the dataset. The NA values are populated as Missing and -999 for categorical value and numerical value seperately.

NUMERICAL_FILL = -999
CATEGORICAL_FILL = "Missing"

# Impute missing values with different constants based on data type
for col in df.columns:
    if df[col].dtype in ['int64', 'float64']:  # For numerical columns
        df[col].fillna(NUMERICAL_FILL, inplace=True)
    elif df[col].dtype == 'object':  # For categorical columns
        df[col].fillna(CATEGORICAL_FILL, inplace=True)

Then we dropped some variables in order to reduce distractions and build the neural network framework then build evaluation function to test the model.

# Split features and target
features = df.drop(['n_under_age10', 'n_fam_members', 'region', 'GENDER_RESP', 'AGE_RESP', 'HOUSE_TYPE', 'HOUSE_TYPE2', 'working_status', 'reason_not_working', 'working_field', 'unique_identifier', 'REASON_TIRED2', ], axis=1)
target = df['n_under_age10']

When we started testing the neural network, we found that the results were too random because the dataset was too small.

# Initialize SHAP explainer with the model prediction function and a sample of the transformed data
explainer = shap.KernelExplainer(f, X_transformed)

# Calculate SHAP values for the transformed data
shap_values = explainer.shap_values(X_transformed)

assert shap_values.shape[1] == X_transformed.shape[1], "Mismatch in number of SHAP values and features"
expected_value = explainer.expected_value
if isinstance(expected_value, np.ndarray):
    expected_value = expected_value[0] 
shap.decision_plot(expected_value, shap_values, X_transformed[:100], feature_names=feature_names)

So we decided to calculate the average contribution of multiple runs.

Ultimately, we used SHAP value and decision plot to display the average value after running 100 times neural network and their contributions to final output.

WARNING: The average SHAP value and SHAP decision plot code is time-consuming. Under discrete graphics card RTX 3060, the total running time is almost 2 hours.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.