gabors-data-analysis / da_case_studies


Codes for case studies for the Bekes-Kezdi Data Analysis textbook

License: MIT License

R 3.25% Stata 1.50% Jupyter Notebook 95.18% Python 0.07%


da_case_studies's Issues

ch3-vienna minor

label fix: https://app.reviewnb.com/gabors-data-analysis/da_case_studies/blob/master/ch03-hotels-vienna-explore%2Fch03-hotels-vienna-explore.ipynb/file/#comment-nb-e1db3f0d

ch18 tsibble: something is wrong [urgent]

Something is wrong with date handling under R 4.0.2.

List of 5 best deals

df <- data.frame(hotel_id = hotels$hotel_id, price= hotels$price, lnprice_resid=hotels$lnprice_resid, distanc=hotels$distance, stars=hotels$stars, rating=hotels$rating)
df[order(df$lnprice_resid)[1:5], ]
#TODO

print out nice table
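
For the Python notebooks, the same listing could be sketched as below. This is a minimal illustration, not the repository's code: `best_deals` and `format_table` are hypothetical helper names, and the rows are made-up placeholders rather than values from the hotels data.

```python
def best_deals(hotels, n=5):
    """Return the n hotels with the lowest log-price residual."""
    return sorted(hotels, key=lambda h: h["lnprice_resid"])[:n]


def format_table(rows, columns):
    """Render rows as a fixed-width text table with a header line."""
    lines = [" ".join(f"{c:>14}" for c in columns)]
    for row in rows:
        cells = []
        for c in columns:
            value = row[c]
            # Floats get three decimals; everything else is right-aligned as is.
            cells.append(f"{value:>14.3f}" if isinstance(value, float) else f"{value:>14}")
        lines.append(" ".join(cells))
    return "\n".join(lines)


# Placeholder rows for illustration only (not real data).
hotels = [
    {"hotel_id": 1, "price": 80, "lnprice_resid": -0.30, "distance": 1.2, "stars": 3.0, "rating": 4.1},
    {"hotel_id": 2, "price": 120, "lnprice_resid": 0.10, "distance": 0.5, "stars": 4.0, "rating": 4.3},
    {"hotel_id": 3, "price": 60, "lnprice_resid": -0.55, "distance": 1.0, "stars": 3.5, "rating": 4.4},
    {"hotel_id": 4, "price": 95, "lnprice_resid": 0.02, "distance": 1.8, "stars": 3.0, "rating": 3.9},
    {"hotel_id": 5, "price": 85, "lnprice_resid": -0.12, "distance": 0.9, "stars": 4.0, "rating": 4.0},
    {"hotel_id": 6, "price": 70, "lnprice_resid": -0.41, "distance": 0.7, "stars": 3.5, "rating": 4.2},
]

columns = ["hotel_id", "price", "lnprice_resid", "distance", "stars", "rating"]
print(format_table(best_deals(hotels), columns))
```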

ch 22 market definition

Single AA or US markets are actually untreated; this is a typo in the book.
--> comments must be added in all languages

Python ch19-24 ideas

Check esp ch 22-24

https://github.com/jmbejara/comp-econ-sp19/blob/master/lectures/4-23_Panel_Data/Fixed-and-Random-Effects-Rosetta-Stone.ipynb

https://github.com/dmsul/econtools

Ch17 RF options comment

Add a bit more comments to R code on options

ch23 python R2

R2 calculation in FE models corrected in R. Within R2 should be printed. --> check Python.

Added a bit (not in book) that prints R2 for unweighted model --> add Python
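
As a reference point for the Python check, here is a minimal sketch of a within R2 for a fixed-effects model with a single regressor. It assumes the within R2 is defined on group-demeaned data; `demean_by_group` and `within_r2` are hypothetical names, not functions from the repository, and the toy numbers are illustrative only.

```python
from collections import defaultdict


def demean_by_group(values, groups):
    """Subtract each group's mean from its members (the within transformation)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def within_r2(y, x, groups):
    """Within R-squared of a one-regressor fixed-effects model."""
    y_w = demean_by_group(y, groups)
    x_w = demean_by_group(x, groups)
    # OLS slope on the demeaned data (no intercept needed after demeaning).
    beta = sum(a * b for a, b in zip(x_w, y_w)) / sum(a * a for a in x_w)
    ss_res = sum((b - beta * a) ** 2 for a, b in zip(x_w, y_w))
    ss_tot = sum(b * b for b in y_w)
    return 1.0 - ss_res / ss_tot


# Toy panel: two groups, three periods each (illustrative values).
groups = ["A", "A", "A", "B", "B", "B"]
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y = [2.0, 4.0, 7.0, 10.0, 13.0, 14.0]
r2_within = within_r2(y, x, groups)
```

The overall R2 of a dummy-variable regression would be higher here, because the group dummies explain the large level difference between A and B; the within R2 only measures fit after that difference is removed.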

groupby.apply() too slow (8min)

Here: https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch22-airline-merger-prices/ch22-airlines-01-dataprep.ipynb

and here: https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch22-airline-merger-prices/ch22-airlines-02-analysis.ipynb
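
A common remedy is to replace a per-group Python function with built-in aggregations. The sketch below is generic and assumes the slow step computes a per-group statistic such as a passenger-weighted average price; the column names and data are hypothetical and not taken from the notebooks.

```python
import pandas as pd

# Hypothetical toy data: one row per observation within a market.
df = pd.DataFrame({
    "market": ["A", "A", "B", "B", "B"],
    "passengers": [10, 30, 5, 5, 10],
    "price": [100.0, 200.0, 50.0, 70.0, 90.0],
})

# Slow pattern: a Python function is called once per group.
slow = df.groupby("market")[["price", "passengers"]].apply(
    lambda g: (g["price"] * g["passengers"]).sum() / g["passengers"].sum()
)

# Faster pattern: compute the row-level product once, then use built-in sums.
df["revenue"] = df["price"] * df["passengers"]
sums = df.groupby("market")[["revenue", "passengers"]].sum()
fast = sums["revenue"] / sums["passengers"]
```

Both versions give the same result; the second avoids calling Python code once per group, which is usually where the time goes when there are many groups.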

[urgent] swim log model

Model M6 RMSE is wrong in R.
The issue is the log correction.
The code wants to use the model-generated residuals, but those do not exist for the holdout set.
So it falls back on the training set residuals.
Yet the holdout set "residuals" are easy to obtain: they are (y - yhat).
The corrected RMSE is very different.

This is what we have in R
#had to cheat and use train error on full train set because could not obtain CV fold train errors
corrb <- mean((reg6$finalModel$residuals)^2)
rmse_CV["reg6"] <- reg6$pred %>%
mutate(pred = exp(pred + corrb/2)) %>%
group_by(Resample) %>%
summarise(rmse = RMSE(pred, exp(obs))) %>%
as.data.frame() %>%
summarise(mean(rmse)) %>%
as.numeric()
rmse_CV["reg6"]
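
The fold-by-fold logic can be sketched in Python as follows. This is a simplified illustration, assuming a log model with a single trend regressor instead of the full M6 specification; `fit_line` and `cv_rmse_log_model` are hypothetical names. The log correction uses the residual variance of each fold's own training sample, and the squared errors are computed in levels on the held-out year.

```python
import math


def fit_line(xs, ys):
    """OLS intercept and slope for a single regressor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope


def cv_rmse_log_model(rows):
    """Leave-one-year-out CV RMSE, in levels, for ln(quantity) regressed on t.

    rows: list of (year, t, quantity) tuples.
    """
    fold_mse = []
    for test_year in sorted({year for year, _, _ in rows}):
        train = [(t, math.log(q)) for year, t, q in rows if year != test_year]
        test = [(t, q) for year, t, q in rows if year == test_year]
        a, b = fit_line([t for t, _ in train], [lnq for _, lnq in train])
        # Residual variance from this fold's training sample only
        # (degrees-of-freedom adjusted for the two estimated coefficients).
        sigma2 = sum((lnq - a - b * t) ** 2 for t, lnq in train) / (len(train) - 2)
        # Log correction, then squared errors on the held-out year in levels.
        errors = [(q - math.exp(a + b * t + sigma2 / 2)) ** 2 for t, q in test]
        fold_mse.append(sum(errors) / len(errors))
    # Average the fold MSEs, then take the square root.
    return math.sqrt(sum(fold_mse) / len(fold_mse))


# Illustrative data: three years, four observations each, exactly log-linear in t.
rows = [(2010 + i // 4, i, math.exp(1 + 0.1 * i)) for i in range(12)]
cv_rmse = cv_rmse_log_model(rows)
```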

Below is new Stata code that does it:

*** 5 MODELS WITH QUANTITY AS TARGET VARIABLE (M1-M5)
local M1 t i.month
local M2 t i.month i.dayofweek
local M3 t i.month i.dayofweek i.natholiday
local M4 t i.month i.school_off##i.dayofweek
local M5 t i.month i.school_off##i.dayofweek i.month##i.dayofweek

forvalue i=1/5 {
    forvalue y=2010/2015 {
        qui use "$data_out/swim-daily-workfile.dta", clear
        dis ""
        dis "*********************************************"
        dis "Model M`i'"
        dis "test year: `y'"
        dis "training years:"
        tab year if year!=`y'
        reg quantity `M`i'' if year!=`y'
        qui predict yhat
        qui gen sq_error = (quantity - yhat)^2
        qui sum sq_error if year==`y'
        local mse_`y' = r(mean)
    }
    gen cv_mse_M`i' = (`mse_2010'+`mse_2011'+`mse_2012'+`mse_2013'+`mse_2014'+`mse_2015')/6
    gen cv_rmse_M`i' = sqrt(cv_mse_M`i')
    keep if t==1 /* one-observation file with the forecast statistics */
    keep t cv*
    cap merge 1:1 t using "$data_out/swim-daily-forecasts.dta", nogen
    save "$data_out/swim-daily-forecasts.dta", replace
}

*** +1 MODEL (M6) WITH LN(QUANTITY) AS TARGET VARIABLE
local M6 t i.month i.school_off##i.dayofweek
forvalue y=2010/2015 {
    use "$data_out/swim-daily-workfile.dta", clear
    cap gen lnq = ln(quantity)
    keep if year>=2010 & year<=2015
    qui reg lnq `M6' if year!=`y'
    qui predict yhat
    local sig = e(rmse)
    replace yhat = exp(yhat) * exp(`sig'^2/2)
    qui gen sq_error = (quantity - yhat)^2
    qui sum sq_error if year==`y'
    local mse_`y' = r(mean)
}
gen cv_mse_M6 = (`mse_2010'+`mse_2011'+`mse_2012'+`mse_2013'+`mse_2014'+`mse_2015')/6
gen cv_rmse_M6 = sqrt(cv_mse_M6)
cap merge 1:1 t using "$data_out/swim-daily-forecasts.dta", nogen
aorder
order t
save "$data_out/swim-daily-forecasts.dta", replace

tabstat cv_rmse_M*, col(s) format(%4.2f)

ch 21 wms --- ate

matchit gives the ATET; we will also need the ATE.

ch 23 asia ip regression missing

The regressions are missing from the asia-ip R code....

(By the way, I just found a note of mine from the summer saying it would be worth trying this with fixest, but lm_robust can of course stay too.)

Add legend to lowess plot in ch9-gender

Add legend to plot: https://app.reviewnb.com/gabors-data-analysis/da_case_studies/blob/master/ch09-gender-age-earnings%2Fch09-earnings-inference.ipynb/file/#comment-nb-f9a84a9c

just trying whether reviewnb.com works..

requirements.txt looks skinny

It's great that exact versions are provided, but are you sure we have everything there? I am missing jupyter for instance.

missing Dockerfile

I can't find the Dockerfile. Is it only me?

world bank data cleaner missing

da_data_repo\worldbank-immunization\clean

only stata code, R is missing

--> as a result, ch23 R code reads in a dta

ch23-immunization-life.R todo

do all stuff in R code

ch3 football minor

label fixes: https://app.reviewnb.com/gabors-data-analysis/da_case_studies/blob/master/ch03-football-home-advantage%2Fch03-football-home-advantage-describe.ipynb/file/#comment-nb-4299b0b6

ch22-24 fixed effects in R (not urgent)

check for FE
R: https://github.com/lrberge/fixest

ch21-ownership-management-quality/ch21-wms.do matsize too large for Stata IC

Stata IC (available, e.g., in the CEU virtual lab) allows a maximum matsize of 800, but the last psmatch2 call needs more.

ch8 life exp label

correct x axis ticks: https://app.reviewnb.com/gabors-data-analysis/da_case_studies/blob/master/ch08-life-expectancy-income%2Fch08-life-expectancy-income.ipynb/discussion/#comment-146f957f

ch-21-wms-analysis.R matching on propensity score

It seems to me that the treatment effects from propensity score matching in lines 186-187 and 215-216 are average treatment effects on the treated (ATET) and not average treatment effects (ATE) as suggested by the variable names in the code (see documentation of matchit function in R)
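
To make the distinction concrete, here is a minimal Python sketch of one-to-one nearest-neighbor matching on a propensity score. It is an illustration under simplifying assumptions (matching with replacement, scores already estimated, toy data), not a reimplementation of what matchit does; `nearest` and `matching_effects` are hypothetical names. The ATET averages effects over treated units only, while the ATE also imputes the missing treated outcome for each control unit and averages over everyone.

```python
def nearest(unit, pool):
    """Unit in pool whose propensity score is closest to unit's score."""
    return min(pool, key=lambda other: abs(other["score"] - unit["score"]))


def matching_effects(units):
    """Return (ATET, ATE) from one-to-one nearest-neighbor matching with replacement."""
    treated = [u for u in units if u["treated"]]
    controls = [u for u in units if not u["treated"]]
    # Treated units: observed outcome minus the matched control's outcome.
    effects_treated = [u["y"] - nearest(u, controls)["y"] for u in treated]
    # Control units: matched treated outcome minus the observed outcome.
    effects_controls = [nearest(u, treated)["y"] - u["y"] for u in controls]
    atet = sum(effects_treated) / len(effects_treated)
    ate = sum(effects_treated + effects_controls) / len(units)
    return atet, ate


# Toy data for illustration only.
units = [
    {"score": 0.9, "treated": True, "y": 10.0},
    {"score": 0.6, "treated": True, "y": 8.0},
    {"score": 0.7, "treated": False, "y": 5.0},
    {"score": 0.2, "treated": False, "y": 1.0},
]
atet, ate = matching_effects(units)
```

In this toy example the two estimands differ (4.0 versus 4.5), which is why the variable names in the code should say which one is being reported.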

ch09 hotel externalvalid

This needs refactoring to tidyverse.

ch16 RF -- PDP todo

#########################################################################################
# Partial Dependence Plots -------------------------------------------------------
#########################################################################################

# TODO: somehow adding scale screws things up. Ideally both graphs should have y between 70 and 130;
# n:accom should be 1,7 by=1

# FIXME: should be on holdout, right?
# pred.grid = distinct_(data_train, "), --> pred.grid = distinct_(data_holdout, )
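
For the Python side, the holdout-based version of the computation can be sketched by hand. This is a minimal illustration, assuming the model exposes a row-level prediction function; `partial_dependence`, the stand-in `predict`, and the column names `n_accommodates` and `n_beds` are all hypothetical.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over rows with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        modified = [{**row, feature: value} for row in rows]
        preds = [predict(row) for row in modified]
        curve.append((value, sum(preds) / len(preds)))
    return curve


# Hypothetical stand-in for a fitted model's prediction function.
def predict(row):
    return 50 + 10 * row["n_accommodates"] + 5 * row["n_beds"]


# Hypothetical holdout rows; the grid uses the distinct values observed there.
data_holdout = [
    {"n_accommodates": 2, "n_beds": 1},
    {"n_accommodates": 4, "n_beds": 2},
    {"n_accommodates": 2, "n_beds": 3},
]
grid = sorted({row["n_accommodates"] for row in data_holdout})
curve = partial_dependence(predict, data_holdout, "n_accommodates", grid)
```

Building both the grid and the averaging sample from the holdout set keeps the plot consistent with the other holdout-based diagnostics.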

ch11 smoking plots

Add legends

ch13 caret rmse

ch13-used-cars-reg.R

caret train: which RMSE does it calculate?
TO CHECK
@janosbiro

ch3-simulation minor

labelfix: https://app.reviewnb.com/gabors-data-analysis/da_case_studies/blob/master/ch03-simulations%2Fch03-distributions.ipynb/file/#comment-nb-9543625e

ch 18 case shiller R refactor

Tighten code as suggested by @kezdi
print appropriate tables
don't print what we don't have in book
rename models as in stata (book)

ch4-management firmsize alpha coloring issue

Plotnine is somehow not able to do alpha scaling with custom colors

eg. https://app.reviewnb.com/gabors-data-analysis/da_case_studies/blob/master/ch04-management-firm-size%2Fch04-wms-management-size.ipynb/file/#comment-nb-5ed15a8e

ch3-distribution height

Add second y axis with percentages: https://app.reviewnb.com/gabors-data-analysis/da_case_studies/blob/master/ch03-distributions-height-income%2Fch03-height-income.ipynb/file/#comment-nb-39d50d07

https://app.reviewnb.com/gabors-data-analysis/da_case_studies/blob/master/ch03-distributions-height-income%2Fch03-height-income.ipynb/file/#comment-nb-29782d60

Ch2-managers figure 2.1

Last 3 obs missing: https://app.reviewnb.com/gabors-data-analysis/da_case_studies/blob/master/ch02-football-manager-success%2Fch02-football-manager-success.ipynb/file/#comment-nb-5ee9947f

ch07 simple reg

ch07 hotel simple reg has 2 FIXMEs, both small

R library freeze

We need to add renv, or another solution, to pin library versions.
switchr was another idea: it offers a separate library environment for the book, so people can keep using other versions for other projects.

Something we started discussing with @zholler in the summer; not done.

check stata, R: cut absolute paths

sometimes one of us left their name...

ch09 earnings inference tables

try using modelsummary() to create regression tables

I added a small one.

https://vincentarelbundock.github.io/modelsummary/articles/modelsummary.html

[urgent] ch21 wms

Something is off.
R 4.0.2, dplyr 1.0.2

Ownership: define founder/family owned and drop ownership that's missing or not relevant

Ownership

data %>%
  dplyr::group_by(ownership) %>%
  dplyr::summarise(Freq = n()) %>%
  mutate(Percent = Freq / sum(Freq)*100, Cum = cumsum(Percent))
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "function"

ch 12 huxreg

ch 12 arizona huxreg: something is not OK

ch14 airbnb logs

We are purging logs from the code
and moving them to separate code that will not be shared.

I started it; not done.

Plus some TODOs in the log part.

Python type hints

Have you thought of providing Python type hints for functions? Wouldn't be a huge effort to append them IMO.

It's becoming the standard (at least I'd like to think so) and definitely helps instructors understand the codebase.
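
As an illustration of what this would look like, here is a small helper with type hints added. The function itself is a hypothetical example, not one taken from the codebase.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))
```

The annotations do not change runtime behavior; they document the expected inputs and let editors and static checkers flag mismatched calls.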

Ch.20 missing variable in dplyr data selection part

Line 70, need to include "phonecalls1" to variable selection, the rest needs this variable.
Line 128, varlabel(data, var.name = c("quitjob", "phonecalls1")) is not necessary and gives error.

world bank files ch02, ch23

The R and Python World Bank immunization files read in a .dta file;
they should read in the cleaned csv from OSF.

Also add cleaners to OSF that save the cleaned csv, to be read in by the code on GitHub.

ch7-simplereg minor

labelfix: https://app.reviewnb.com/gabors-data-analysis/da_case_studies/blob/master/ch07-hotels-simple-reg%2Fch07-hotels-simple-reg.ipynb/file/#comment-nb-e9855ad6

Ch18 case shiller T18.3 VAR in R

Need to add monthly seasonality to VAR

ch19 ch19_food-health-maker

ch19_food-health-maker.do --> R
not a big file, mostly data wrangling, labeling,

reformat R code folders as part of package

-change folder management at all R scripts.
-create packages.txt for all code and save as txt

Example: Like ch07. ch07-hotel-simple-reg
Zsuzsi to help Adam get started.

geom

Why do I get this message for every geom_smooth call?
E.g. ch07 hotels simple

geom_smooth() using formula 'y ~ x'

filter

Ch04 wms

filter also complains in many places

Sample selection

df <- df %>%

filter(country=="Mexico" & wave==2013 & emp_firm>=100 & emp_firm<=5000)
Error in UseMethod("filter_") :
no applicable method for 'filter_' applied to an object of class "function"
In addition: Warning message:
filter_() is deprecated as of dplyr 0.7.0.
Please use filter() instead.
See vignette('programming') for more help

ch3 hotels compare minor label

label fixes: https://app.reviewnb.com/gabors-data-analysis/da_case_studies/blob/master/ch03-hotels-europe-compare%2Fch03-hotels-europe-compare.ipynb/file/#comment-nb-36d73c3c

funs

ch04 wms

df %>%

dplyr::select(management, emp_firm) %>%
summarise_all(funs(min, max, mean, median, sd, n()))

A tibble: 1 x 12

management_min emp_firm_min management_max emp_firm_max management_mean emp_firm_mean management_medi~ emp_firm_median

1 1.28 100 4.61 5000 2.94 761. 2.94 353

... with 4 more variables: management_sd , emp_firm_sd , management_n , emp_firm_n

Warning message:
funs() is deprecated as of dplyr 0.8.0.
Please use a list of either functions or lambdas:

Simple named list:

list(mean = mean, median = median)

Auto named with `tibble::lst()`:

tibble::lst(mean, median)

Using lambdas

list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once every 8 hours.
Call lifecycle::last_warnings() to see where this warning was generated.

ch6 online offline minor

labelfix: https://app.reviewnb.com/gabors-data-analysis/da_case_studies/blob/master/ch06-online-offline-price-test%2Fch06-online-offline-price-test.ipynb/file/#comment-nb-da2544fd

calibration curve as fn

ch11 and ch17 both have calibration curves. We should have a shared calibration curve function.
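
One possible shape for such a function in Python is sketched below. It assumes equal-width bins on [0, 1] and a binary outcome; the name `calibration_curve` and its signature are a suggestion, not existing code from ch11 or ch17.

```python
from typing import List, Sequence, Tuple


def calibration_curve(
    actual: Sequence[int], predicted: Sequence[float], n_bins: int = 10
) -> List[Tuple[float, float, int]]:
    """Bin predicted probabilities into equal-width bins on [0, 1].

    Returns one (mean predicted, mean actual, count) tuple per non-empty bin.
    """
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for y, p in zip(actual, predicted):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    curve = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            mean_y = sum(y for _, y in members) / len(members)
            curve.append((mean_p, mean_y, len(members)))
    return curve


# Toy example with two bins (illustrative values only).
curve = calibration_curve([0, 0, 1, 1, 1], [0.1, 0.2, 0.6, 0.7, 1.0], n_bins=2)
```

Plotting mean actual against mean predicted per bin, with the 45-degree line as reference, then gives the calibration plot in either chapter.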