gabors-data-analysis / da_case_studies
Codes for case studies for the Bekes-Kezdi Data Analysis textbook
License: MIT License
Something is wrong with date handling.
Under R 4.0.2.
df <- data.frame(hotel_id = hotels$hotel_id, price= hotels$price, lnprice_resid=hotels$lnprice_resid, distanc=hotels$distance, stars=hotels$stars, rating=hotels$rating)
df[order(df$lnprice_resid)[1:5], ]
#TODO
Single AA or US markets are actually untreated. This is a typo in the book.
--> must add comments in all languages
Add a bit more comments to R code on options
R2 calculation in FE models corrected in R. Within R2 should be printed. --> check Python.
Added a bit (not in book) that prints R2 for unweighted model --> add Python
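For the Python check, a minimal sketch of what "within R2" means, assuming a single regressor and one set of group fixed effects. The function name `within_r2` and the toy data are hypothetical and not taken from the repository; the idea is only that both variables are demeaned within each group before R2 is computed.

```python
from collections import defaultdict


def within_r2(groups, x, y):
    """Within R-squared for y ~ x with group fixed effects (one regressor)."""
    # Accumulate per-group sums of x and y and the group size.
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for g, xi, yi in zip(groups, x, y):
        s = sums[g]
        s[0] += xi
        s[1] += yi
        s[2] += 1
    # Demean x and y within each group (the "within" transformation).
    xd = [xi - sums[g][0] / sums[g][2] for g, xi in zip(groups, x)]
    yd = [yi - sums[g][1] / sums[g][2] for g, yi in zip(groups, y)]
    # OLS slope on the demeaned data, then R-squared on the demeaned data.
    sxx = sum(v * v for v in xd)
    sxy = sum(a * b for a, b in zip(xd, yd))
    syy = sum(v * v for v in yd)
    beta = sxy / sxx
    ssr = sum((b - beta * a) ** 2 for a, b in zip(xd, yd))
    return 1.0 - ssr / syy


# Toy data: two groups with very different levels but the same slope.
groups = ["a", "a", "a", "b", "b", "b"]
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y = [1.0, 2.1, 2.9, 11.0, 12.1, 12.9]
r2_within = within_r2(groups, x, y)
```

The within R2 ignores the variation explained by the group means themselves, so it can differ a lot from the overall R2 of a model with group dummies.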
Model M6 RMSE is wrong in R.
The issue is the log correction.
The code tries to use the model-generated residuals, which do not exist for the holdout set.
As a workaround, it uses the training set residuals.
Yet the holdout set "residuals" are easy to obtain: they are (y - yhat).
The corrected RMSE is very different.
This is what we have in R
#had to cheat and use train error on full train set because could not obtain CV fold train errors
corrb <- mean((reg6$finalModel$residuals)^2)
rmse_CV["reg6"] <- reg6$pred %>%
mutate(pred = exp(pred + corrb/2)) %>%
group_by(Resample) %>%
summarise(rmse = RMSE(pred, exp(obs))) %>%
as.data.frame() %>%
summarise(mean(rmse)) %>%
as.numeric()
rmse_CV["reg6"]
Below is new Stata code that does it.
*** 5 MODELS WITH QUANTITY AS TARGET VARIABLE (M1-M5)
local M1 t i.month
local M2 t i.month i.dayofweek
local M3 t i.month i.dayofweek i.natholiday
local M4 t i.month i.school_off##i.dayofweek
local M5 t i.month i.school_off##i.dayofweek i.month##i.dayofweek
forvalue i=1/5 {
    forvalue y=2010/2015 {
        qui use "$data_out/swim-daily-workfile.dta", replace
        dis ""
        dis "*********************************************"
        dis "Model M`i'"
        dis "test year: `y'"
        dis "training years:"
        tab year if year!=`y'
        reg quantity `M`i'' if year!=`y'
        qui predict yhat
        qui gen sq_error = (quantity - yhat)^2
        qui sum sq_error if year==`y'
        local mse_`y' = r(mean)
    }
    gen cv_mse_M`i' = (`mse_2010'+`mse_2011'+`mse_2012'+`mse_2013'+`mse_2014'+`mse_2015')/6
    gen cv_rmse_M`i' = sqrt(cv_mse_M`i')
    keep if t==1 /* one-observation file with the forecast statistics */
    keep t cv*
    cap merge 1:1 t using "$data_out/swim-daily-forecasts.dta", nogen
    save "$data_out/swim-daily-forecasts.dta", replace
}
*** +1 MODEL (M6) WITH LN(QUANTITY) AS TARGET VARIABLE
local M6 t i.month i.school_off##i.dayofweek
forvalue y=2010/2015 {
    use "$data_out/swim-daily-workfile.dta", replace
    cap gen lnq = ln(quantity)
    keep if year>=2010 & year<=2015
    qui reg lnq `M6' if year!=`y'
    qui predict yhat
    local sig = e(rmse)
    replace yhat = exp(yhat) * exp(`sig'^2/2)
    qui gen sq_error = (quantity - yhat)^2
    qui sum sq_error if year==`y'
    local mse_`y' = r(mean)
}
gen cv_mse_M6 = (`mse_2010'+`mse_2011'+`mse_2012'+`mse_2013'+`mse_2014'+`mse_2015')/6
gen cv_rmse_M6 = sqrt(cv_mse_M6)
cap merge 1:1 t using "$data_out/swim-daily-forecasts.dta", nogen
aorder
order t
save "$data_out/swim-daily-forecasts.dta", replace
tabstat cv_rmse_M*, col(s) format(%4.2f)
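The same logic for the log model can be sketched in Python. This is a minimal illustration under stated assumptions, not the repository's code: it uses only a linear trend as regressor, and the names `fit_line` and `cv_rmse_log_model` and the toy data are hypothetical. The log correction comes from the training-set residual variance, while the RMSE is computed from holdout errors (y - yhat) in levels, one test year at a time.

```python
import math


def fit_line(x, y):
    """OLS intercept and slope for y = a + b*x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def cv_rmse_log_model(years, t, quantity):
    """Leave-one-year-out CV RMSE in levels for ln(quantity) ~ t."""
    yearly_mse = []
    for test_year in sorted(set(years)):
        # Training set: all years except the test year.
        train_t = [ti for yr, ti in zip(years, t) if yr != test_year]
        train_lnq = [math.log(q) for yr, q in zip(years, quantity) if yr != test_year]
        a, b = fit_line(train_t, train_lnq)
        # Residual variance from the TRAINING set (two estimated coefficients).
        ssr = sum((lq - (a + b * ti)) ** 2 for ti, lq in zip(train_t, train_lnq))
        sigma2 = ssr / (len(train_t) - 2)
        # Holdout errors in levels, with the log correction exp(sigma2 / 2).
        sq_errors = [
            (q - math.exp(a + b * ti + sigma2 / 2)) ** 2
            for yr, ti, q in zip(years, t, quantity)
            if yr == test_year
        ]
        yearly_mse.append(sum(sq_errors) / len(sq_errors))
    # Average the yearly MSEs first, then take the square root.
    return math.sqrt(sum(yearly_mse) / len(yearly_mse))


# Toy data: three "years" of four observations each.
years = [2010] * 4 + [2011] * 4 + [2012] * 4
t = list(range(1, 13))
exact = [math.exp(1.0 + 0.01 * ti) for ti in t]
noise = [1.1, 0.9, 1.05, 0.95] * 3
noisy = [q * f for q, f in zip(exact, noise)]
rmse_exact = cv_rmse_log_model(years, t, exact)
rmse_noisy = cv_rmse_log_model(years, t, noisy)
```

Note the design choice: this sketch takes the square root of the mean yearly MSE, as the Stata code does, whereas the R snippet averages fold-level RMSEs. The two conventions generally give different numbers.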
the matchit gives atet, we'll also need ate.
The regressions are missing from the asia-ip R code.
(By the way, I just found a note of mine from the summer saying it would be worth trying this with fixest, but lm_robust can of course stay too.)
Add legend to plot: https://app.reviewnb.com/gabors-data-analysis/da_case_studies/blob/master/ch09-gender-age-earnings%2Fch09-earnings-inference.ipynb/file/#comment-nb-f9a84a9c
just trying whether reviewnb.com works..
It's great that exact versions are provided, but are you sure we have everything there? I am missing jupyter for instance.
I can't find the Dockerfile. Is it only me?
da_data_repo\worldbank-immunization\clean
only stata code, R is missing
--> as a result, ch23 R code reads in a dta
do all stuff in R code
check for FE
R: https://github.com/lrberge/fixest
Stata IC (which is available, for example, in the CEU virtual lab) allows a maximum matsize of 800, but the last psmatch2 call needs more.
It seems to me that the treatment effects from propensity score matching in lines 186-187 and 215-216 are average treatment effects on the treated (ATET) and not average treatment effects (ATE) as suggested by the variable names in the code (see documentation of matchit function in R)
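To make the ATE versus ATET distinction concrete, here is a minimal sketch that uses exact matching on one discrete covariate instead of propensity score matching. It does not reproduce what matchit does; the function name `stratified_effects` and the toy data are hypothetical. The only difference between the two estimands is the weighting: ATE weights each stratum by all of its observations, ATET by its treated observations only.

```python
from collections import defaultdict


def stratified_effects(stratum, treated, outcome):
    """Return (ATE, ATET) from exact matching on a discrete stratum."""
    cells = defaultdict(lambda: {0: [], 1: []})
    for s, d, y in zip(stratum, treated, outcome):
        cells[s][d].append(y)

    ate_sum = atet_sum = 0.0
    n_all = n_treated = 0
    for cell in cells.values():
        if not cell[0] or not cell[1]:
            continue  # no overlap in this stratum, so no comparison possible
        diff = sum(cell[1]) / len(cell[1]) - sum(cell[0]) / len(cell[0])
        n_cell = len(cell[0]) + len(cell[1])
        ate_sum += diff * n_cell        # weight by all observations
        n_all += n_cell
        atet_sum += diff * len(cell[1])  # weight by treated observations
        n_treated += len(cell[1])
    return ate_sum / n_all, atet_sum / n_treated


# Toy data: the effect is 1 in "low" (mostly untreated) and 3 in "high" (mostly treated).
stratum = ["low"] * 4 + ["high"] * 4
treated = [1, 0, 0, 0, 1, 1, 1, 0]
outcome = [2.0, 1.0, 1.0, 1.0, 6.0, 6.0, 6.0, 3.0]
ate, atet = stratified_effects(stratum, treated, outcome)
```

In this toy example the treated units are concentrated where the effect is larger, so ATET exceeds ATE. The same reasoning explains why labeling an ATET estimate as ATE can mislead.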
This needs refactoring to tidyverse.
#########################################################################################
#########################################################################################
Add legends
ch13-used-cars-reg.R
caret train: which RMSE does it calculate?
TO CHECK
@janosbiro
Tighten code as suggested by @kezdi
print appropriate tables
don't print what we don't have in book
rename models as in stata (book)
Plotnine is somehow not able to do alpha scaling with custom colors
ch07 hotel simple reg has 2 FIXMEs, both small
Need to add renv, or another solution, to offer library options.
SwitchR was another idea. It offers a separate library setup for the book, so people can keep using other versions for other projects.
Something we started discussing w @zholler in the summer, not done.
sometimes one of us left their name...
try using modelsummary() to create regression tables
I added a small one.
Something is off here.
R 4.0.2, dplyr 1.02
Ownership
data %>%
ch12 arizona: something is not OK with huxreg
We are purging logs from code
and move to separate code which will not be shared.
I started it, not done.
+some todos in log bit.
Have you thought of providing Python type hints for functions? Wouldn't be a huge effort to append them IMO.
It's becoming the standard (at least I'd like to think so) and definitely helps instructors understand the codebase.
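As an illustration of the suggestion, a minimal sketch of a type-hinted helper. The function `rmse` below is hypothetical and not an existing function in the repository; it only shows how little is needed to annotate arguments and return values.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root mean squared error between predictions and observed values."""
    if len(predicted) != len(observed) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


example_rmse = rmse([1.0, 2.0], [1.0, 4.0])
```

With annotations like these, editors and static checkers such as mypy can flag a call that passes, say, a single float where a sequence is expected.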
Line 70, need to include "phonecalls1" to variable selection, the rest needs this variable.
Line 128, varlabel(data, var.name = c("quitjob", "phonecalls1")) is not necessary and gives error.
R, Python world bank immunization files read in .dta
it should read in cleaned csv from osf.
+add cleaners to osf that save cleaned csv--- to be read in by code in github
Need to add monthly seasonality to VAR
ch19_food-health-maker.do --> R
not a big file, mostly data wrangling, labeling,
-change folder management at all R scripts.
-create packages.txt for all code and save as txt
Example: Like ch07. ch07-hotel-simple-reg
Zsuzsi to help Adam get started.
Why do I get this for every geom_smooth call?
E.g. ch07 hotels simple
geom_smooth() using formula 'y ~ x'
Ch04 wms
filter also complains in many places
Sample selection
df <- df %>%
filter_() is deprecated as of dplyr 0.7.0. Please use filter() instead.
ch04 wms
df %>%
management_min emp_firm_min management_max emp_firm_max management_mean emp_firm_mean management_medi~ emp_firm_median
1 1.28 100 4.61 5000 2.94 761. 2.94 353
Warning message:
funs() is deprecated as of dplyr 0.8.0.
Please use a list of either functions or lambdas:
list(mean = mean, median = median)
Auto named with tibble::lst(): tibble::lst(mean, median)
list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once every 8 hours.
Call lifecycle::last_warnings() to see where this warning was generated.
ch11 and ch17 both have calibration. We should have a shared calibration curve function.
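A minimal sketch of what such a shared function could look like, assuming equal-width probability bins. The name `calibration_curve`, its signature, and the toy data are hypothetical; the chapters may bin differently.

```python
def calibration_curve(y_true, y_prob, n_bins=10):
    """Return (mean predicted, mean actual, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for actual, prob in zip(y_true, y_prob):
        # Equal-width bins on [0, 1]; a probability of exactly 1 goes to the last bin.
        idx = min(int(prob * n_bins), n_bins - 1)
        bins[idx].append((prob, actual))
    curve = []
    for members in bins:
        if not members:
            continue  # skip empty bins
        mean_pred = sum(p for p, _ in members) / len(members)
        mean_actual = sum(a for _, a in members) / len(members)
        curve.append((mean_pred, mean_actual, len(members)))
    return curve


# Toy data: four low-probability cases (one event) and two high-probability cases.
y_prob = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9]
y_true = [0, 0, 0, 1, 1, 1]
curve = calibration_curve(y_true, y_prob, n_bins=2)
```

Plotting mean actual against mean predicted for each bin, together with the 45-degree line, gives the calibration plot. A well-calibrated model lies close to that line.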