moderndive / moderndive_book

Statistical Inference via Data Science: A ModernDive into R and the Tidyverse

Home Page: https://www.moderndive.com/

License: Other

CSS 0.01% HTML 97.34% Shell 0.01% TeX 1.76% R 0.88%
bootstrap-method confidence-intervals data-science data-visualization data-wrangling dplyr ggplot2 hypothesis-testing infer moderndive permutation-test r regression regression-models rstats rstudio statistical-inference tidy tidyverse

moderndive_book's Introduction

moderndive R Package


Overview

The moderndive R package consists of datasets and functions for tidyverse-friendly introductory linear regression. These tools leverage the well-developed tidyverse and broom packages to facilitate:

  1. Working with regression tables that include confidence intervals
  2. Accessing regression outputs on an observation level (e.g. fitted/predicted values and residuals)
  3. Inspecting scalar summaries of regression fit (e.g. R-squared, R-squared adjusted, and mean squared error)
  4. Visualizing parallel slopes regression models using ggplot2-like syntax.

This R package is designed to supplement the book “Statistical Inference via Data Science: A ModernDive into R and the Tidyverse” available at ModernDive.com. For more background, read our Journal of Open Source Education paper.

Installation

Get the released version from CRAN:

install.packages("moderndive")

Or the development version from GitHub:

# If you haven't installed remotes yet, do so:
# install.packages("remotes")
remotes::install_github("moderndive/moderndive")

Basic usage

library(moderndive)
score_model <- lm(score ~ age, data = evals)
  1. Get a tidy regression table with confidence intervals:

    get_regression_table(score_model)
    ## # A tibble: 2 × 7
    ##   term      estimate std_error statistic p_value lower_ci upper_ci
    ##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
    ## 1 intercept    4.46      0.127     35.2    0        4.21     4.71 
    ## 2 age         -0.006     0.003     -2.31   0.021   -0.011   -0.001
    
  2. Get information on each point/observation in your regression, including fitted/predicted values and residuals, in a single data frame:

    get_regression_points(score_model)
    ## # A tibble: 463 × 5
    ##       ID score   age score_hat residual
    ##    <int> <dbl> <int>     <dbl>    <dbl>
    ##  1     1   4.7    36      4.25    0.452
    ##  2     2   4.1    36      4.25   -0.148
    ##  3     3   3.9    36      4.25   -0.348
    ##  4     4   4.8    36      4.25    0.552
    ##  5     5   4.6    59      4.11    0.488
    ##  6     6   4.3    59      4.11    0.188
    ##  7     7   2.8    59      4.11   -1.31 
    ##  8     8   4.1    51      4.16   -0.059
    ##  9     9   3.4    51      4.16   -0.759
    ## 10    10   4.5    40      4.22    0.276
    ## # … with 453 more rows
    
  3. Get scalar summaries of a regression fit, including R-squared and adjusted R-squared as well as the (root) mean squared error:

    get_regression_summaries(score_model)
    ## # A tibble: 1 × 9
    ##   r_squared adj_r_squared   mse  rmse sigma statistic p_value    df  nobs
    ##       <dbl>         <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl> <dbl> <dbl>
    ## 1     0.011         0.009 0.292 0.540 0.541      5.34   0.021     1   463
    
  4. Visualize parallel slopes models using the geom_parallel_slopes() custom ggplot2 geometry:

    library(ggplot2)
    ggplot(evals, aes(x = age, y = score, color = ethnicity)) +
      geom_point() +
      geom_parallel_slopes(se = FALSE) +
      labs(x = "Age", y = "Teaching score", color = "Ethnicity")

Statement of Need

Linear regression has long been a staple of introductory statistics courses. While the curricula of introductory statistics courses have evolved considerably of late, the overall importance of regression remains the same. Furthermore, while the use of the R statistical programming language for statistical analysis is not new, recent developments such as the tidyverse suite of packages have made statistical computation with R accessible to a broader audience. We go one step further by leveraging the tidyverse and broom packages to make linear regression accessible to students taking an introductory statistics course. Such students are likely to be new to statistical computation with R; we designed moderndive with these students in mind.

Contributor code of conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.


Six features

Why should you use the moderndive package for introductory linear regression? Here are six features:

  1. Focus less on p-value stars, more confidence intervals
  2. Outputs as tibbles
  3. Produce residual analysis plots from scratch using ggplot2
  4. A quick-and-easy Kaggle predictive modeling competition submission!
  5. Visual model selection: plot parallel slopes & interaction regression models
  6. Produce metrics on the quality of regression model fits

Data background

We first discuss the model and data background. The data consists of end of semester student evaluations for a sample of 463 courses taught by 94 professors from the University of Texas at Austin. This data is included in the evals data frame from the moderndive package.

In the following table, we present a subset of 9 of the 14 variables included for a random sample of 5 courses[1]:

  1. ID uniquely identifies the course whereas prof_ID identifies the professor who taught this course. This distinction is important since many professors taught more than one course.
  2. score is the outcome variable of interest: average professor evaluation score out of 5 as given by the students in this course.
  3. The remaining variables are demographic variables describing that course’s instructor, including bty_avg (average “beauty” score) for that professor as given by a panel of 6 students.[2]
 ID  prof_ID  score  age  bty_avg  gender  ethnicity     language  rank
355       71    4.9   50    3.333  male    minority      english   teaching
262       49    4.3   52    3.167  male    not minority  english   tenured
441       89    3.7   35    7.833  female  minority      english   tenure track
 51       10    4.3   47    5.500  male    not minority  english   teaching
 49        9    4.5   33    4.667  female  not minority  english   tenure track

1. Focus less on p-value stars, more confidence intervals

We argue that the summary.lm() output is deficient in an introductory statistics setting because:

  1. The Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 only encourage p-hacking. In case you have not yet been convinced of the perniciousness of p-hacking, perhaps comedian John Oliver can convince you.
  2. While not a silver bullet for eliminating misinterpretations of statistical inference, confidence intervals present students with a sense of the associated effect sizes of any explanatory variables. Thus, practical as well as statistical significance is emphasized. These are not included by default in the output of summary.lm().

Instead of summary(), let’s use the get_regression_table() function:

get_regression_table(score_model)
## # A tibble: 2 × 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    4.46      0.127     35.2    0        4.21     4.71 
## 2 age         -0.006     0.003     -2.31   0.021   -0.011   -0.001

Observe how the p-value stars are omitted and confidence intervals for the point estimates of all regression parameters are included by default. By including them in the output, we can then emphasize to students that they “surround” the point estimates in the estimate column. Note the confidence level defaults to 95%; this default can be changed using the conf.level argument:

get_regression_table(score_model, conf.level = 0.99)
## # A tibble: 2 × 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    4.46      0.127     35.2    0        4.13     4.79 
## 2 age         -0.006     0.003     -2.31   0.021   -0.013    0.001

2. Outputs as tibbles

While one might argue that extracting the intercept and slope coefficients can be “simply” done using coefficients(score_model), what about the standard errors? For example, a Google query of “how do I extract standard errors from lm in R” yielded results from the R mailing list and from Cross Validated suggesting we run:

sqrt(diag(vcov(score_model)))
## (Intercept)         age 
## 0.126778499 0.002569157

We argue that this task shouldn’t be this hard, especially in an introductory statistics setting. To rectify this, the three get_regression_* functions all return data frames in the tidyverse-style tibble (tidy table) format. Therefore you can extract columns using the pull() function from the dplyr package:

library(dplyr)
get_regression_table(score_model) %>%
  pull(std_error)
## [1] 0.127 0.003

or equivalently you can use the $ sign operator from base R:

get_regression_table(score_model)$std_error
## [1] 0.127 0.003

Furthermore, by piping the above get_regression_table(score_model) output into the kable() function from the knitr package, you can obtain aesthetically pleasing regression tables in R Markdown documents, instead of tables written in jarring computer output font:

library(knitr)
get_regression_table(score_model) %>%
  kable()
term       estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept     4.462      0.127     35.195    0.000     4.213     4.711
age          -0.006      0.003     -2.311    0.021    -0.011    -0.001

3. Produce residual analysis plots from scratch using ggplot2

How can we extract point-by-point information from a regression model, such as the fitted/predicted values and the residuals? (Note we only display the first 10 out of 463 of such values for brevity’s sake.)

fitted(score_model)
##        1        2        3        4        5        6        7        8 
## 4.248156 4.248156 4.248156 4.248156 4.111577 4.111577 4.111577 4.159083 
##        9       10 
## 4.159083 4.224403
residuals(score_model)
##           1           2           3           4           5           6 
##  0.45184376 -0.14815624 -0.34815624  0.55184376  0.48842294  0.18842294 
##           7           8           9          10 
## -1.31157706 -0.05908286 -0.75908286  0.27559666

But why have the original explanatory/predictor variable age and outcome variable score in evals, the fitted/predicted values score_hat, and the residuals floating around in separate vectors? Since each observation relates to the same course, we argue it makes more sense to organize them together in the same data frame using get_regression_points():

score_model_points <- get_regression_points(score_model)
score_model_points
## # A tibble: 463 × 5
##       ID score   age score_hat residual
##    <int> <dbl> <int>     <dbl>    <dbl>
##  1     1   4.7    36      4.25    0.452
##  2     2   4.1    36      4.25   -0.148
##  3     3   3.9    36      4.25   -0.348
##  4     4   4.8    36      4.25    0.552
##  5     5   4.6    59      4.11    0.488
##  6     6   4.3    59      4.11    0.188
##  7     7   2.8    59      4.11   -1.31 
##  8     8   4.1    51      4.16   -0.059
##  9     9   3.4    51      4.16   -0.759
## 10    10   4.5    40      4.22    0.276
## # … with 453 more rows

Observe that the original outcome variable score and explanatory/predictor variable age are now supplemented with the fitted/predicted values score_hat and residual columns. By putting the fitted values, predicted values, and residuals next to the original data, we argue that the computation of these values is less opaque. For example, instructors can emphasize how all values in the first row of output are computed.

Furthermore, recall that since all outputs in the moderndive package are tibble data frames, custom residual analysis plots can be created instead of relying on the default plots yielded by plot.lm(). For example, we can check for the normality of residuals using the histogram of residuals shown below.

# Code to visualize distribution of residuals:
ggplot(score_model_points, aes(x = residual)) +
  geom_histogram(bins = 20) +
  labs(x = "Residual", y = "Count")

Histogram visualizing distribution of residuals.

As another example, we can investigate potential relationships between the residuals and all explanatory/predictor variables, and check for the presence of heteroskedasticity, using partial residual plots like the partial residual plot over age shown below. If the term “heteroskedasticity” is new to you, it corresponds to the variability of one variable being unequal across the range of values of another variable. The presence of heteroskedasticity violates one of the assumptions of inference for linear regression.

# Code to visualize partial residual plot over age:
ggplot(score_model_points, aes(x = age, y = residual)) +
  geom_point() +
  labs(x = "Age", y = "Residual")

Partial residual plot over age.

4. A quick-and-easy Kaggle predictive modeling competition submission!

With the fields of machine learning and artificial intelligence gaining prominence, the importance of predictive modeling cannot be overstated. Therefore, we’ve designed the get_regression_points() function to allow for a newdata argument to quickly apply a previously fitted model to new observations.

Let’s create an artificial “new” dataset consisting of two instructors of age 39 and 42 and save it in a tibble data frame called new_prof. We then use the newdata argument of get_regression_points() to apply our previously fitted model score_model to this new data; the score_hat column holds the corresponding fitted/predicted values.

new_prof <- tibble(age = c(39, 42))
get_regression_points(score_model, newdata = new_prof)
## # A tibble: 2 × 3
##      ID   age score_hat
##   <int> <dbl>     <dbl>
## 1     1    39      4.23
## 2     2    42      4.21

Let’s do another example, this time using the Kaggle House Prices: Advanced Regression Techniques practice competition (its homepage is shown below).

House prices Kaggle competition homepage.

This Kaggle competition requires you to fit/train a model to the provided train.csv training set to make predictions of house prices in the provided test.csv test set. We present an application of the get_regression_points() function allowing students to participate in this Kaggle competition. It will:

  1. Read in the training and test data.
  2. Fit a naive model of house sale price as a function of year sold to the training data.
  3. Use get_regression_points() to make predictions on the test data and write them to a submission.csv file that can be submitted to Kaggle. Note the use of the ID argument so that the Id variable in test identifies the rows (a requirement of Kaggle competition submissions).
library(readr)
library(dplyr)
library(moderndive)

# Load in training and test set
train <- read_csv("https://moderndive.com/data/train.csv")
test <- read_csv("https://moderndive.com/data/test.csv")

# Fit model:
house_model <- lm(SalePrice ~ YrSold, data = train)

# Make predictions and save in appropriate data frame format:
submission <- house_model %>%
  get_regression_points(newdata = test, ID = "Id") %>%
  select(Id, SalePrice = SalePrice_hat)

# Write predictions to csv:
write_csv(submission, "submission.csv")

After submitting submission.csv to the leaderboard for this Kaggle competition, we obtain a “root mean squared logarithmic error” (RMSLE) score of 0.42918, as shown below.

Resulting Kaggle RMSLE score.

5. Visual model selection: plot parallel slopes & interaction regression models

For example, recall the earlier visualizations of the interaction and parallel slopes models for teaching score as a function of age and ethnicity. Let’s present both visualizations side by side below.
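For reference, here is a sketch of code that could produce these two plots (assuming ggplot2 and moderndive are loaded as in the earlier examples); the interaction model uses geom_smooth() while the parallel slopes model uses geom_parallel_slopes():

# Interaction model: a separate slope for each ethnicity
ggplot(evals, aes(x = age, y = score, color = ethnicity)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Age", y = "Teaching score", color = "Ethnicity")

# Parallel slopes model: a common slope with separate intercepts
ggplot(evals, aes(x = age, y = score, color = ethnicity)) +
  geom_point() +
  geom_parallel_slopes(se = FALSE) +
  labs(x = "Age", y = "Teaching score", color = "Ethnicity")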

Interaction (left) and parallel slopes (right) models.

Students might be wondering “Why would you use the parallel slopes model on the right when the data clearly form an ‘X’ pattern as seen in the interaction model on the left?” This is an excellent opportunity to gently introduce the notion of model selection and Occam’s Razor: an interaction model should only be used over a parallel slopes model if the additional complexity of the interaction model is warranted. Here, we define model “complexity/simplicity” in terms of the number of parameters in the corresponding regression tables:

# Regression table for interaction model:
interaction_evals <- lm(score ~ age * ethnicity, data = evals)
get_regression_table(interaction_evals)
## # A tibble: 4 × 7
##   term                    estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                      <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept                  2.61      0.518      5.04   0        1.59     3.63 
## 2 age                        0.032     0.011      2.84   0.005    0.01     0.054
## 3 ethnicity: not minority    2.00      0.534      3.74   0        0.945    3.04 
## 4 age:ethnicitynot minor…   -0.04      0.012     -3.51   0       -0.063   -0.018
# Regression table for parallel slopes model:
parallel_slopes_evals <- lm(score ~ age + ethnicity, data = evals)
get_regression_table(parallel_slopes_evals)
## # A tibble: 3 × 7
##   term                    estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                      <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept                  4.37      0.136     32.1    0        4.1      4.63 
## 2 age                       -0.006     0.003     -2.5    0.013   -0.012   -0.001
## 3 ethnicity: not minority    0.138     0.073      1.89   0.059   -0.005    0.282

The interaction model is “more complex” as evidenced by its regression table involving 4 rows of parameter estimates, whereas the parallel slopes model is “simpler” as evidenced by its regression table involving only 3 parameter estimates. It can be argued, however, that this additional complexity is warranted given the clearly different slopes in the left-hand plot above.

We now present a contrasting example, this time from Subsection 6.3.1 of the online version of ModernDive, involving Massachusetts USA public high schools.[3] Let’s plot both the interaction and parallel slopes models below.

# Code to plot interaction and parallel slopes models for MA_schools
ggplot(
  MA_schools,
  aes(x = perc_disadvan, y = average_sat_math, color = size)
) +
  geom_point(alpha = 0.25) +
  labs(
    x = "% economically disadvantaged",
    y = "Math SAT Score",
    color = "School size"
  ) +
  geom_smooth(method = "lm", se = FALSE)

ggplot(
  MA_schools,
  aes(x = perc_disadvan, y = average_sat_math, color = size)
) +
  geom_point(alpha = 0.25) +
  labs(
    x = "% economically disadvantaged",
    y = "Math SAT Score",
    color = "School size"
  ) +
  geom_parallel_slopes(se = FALSE)

Interaction (left) and parallel slopes (right) models.

In terms of the corresponding regression tables, observe that the table for the parallel slopes model has 4 rows as opposed to 6 for the interaction model, reflecting its greater “model simplicity.”

# Regression table for interaction model:
interaction_MA <-
  lm(average_sat_math ~ perc_disadvan * size, data = MA_schools)
get_regression_table(interaction_MA)
## # A tibble: 6 × 7
##   term                    estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                      <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept                594.       13.3      44.7     0      568.     620.   
## 2 perc_disadvan             -2.93      0.294    -9.96    0       -3.51    -2.35 
## 3 size: medium             -17.8      15.8      -1.12    0.263  -48.9     13.4  
## 4 size: large              -13.3      13.8      -0.962   0.337  -40.5     13.9  
## 5 perc_disadvan:sizemedi…    0.146     0.371     0.393   0.694   -0.585    0.877
## 6 perc_disadvan:sizelarge    0.189     0.323     0.586   0.559   -0.446    0.824
# Regression table for parallel slopes model:
parallel_slopes_MA <-
  lm(average_sat_math ~ perc_disadvan + size, data = MA_schools)
get_regression_table(parallel_slopes_MA)
## # A tibble: 4 × 7
##   term          estimate std_error statistic p_value lower_ci upper_ci
##   <chr>            <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept       588.       7.61     77.3     0       573.     603.  
## 2 perc_disadvan    -2.78     0.106   -26.1     0        -2.99    -2.57
## 3 size: medium    -11.9      7.54     -1.58    0.115   -26.7      2.91
## 4 size: large      -6.36     6.92     -0.919   0.359   -20.0      7.26

Unlike our earlier comparison of interaction and parallel slopes models, in this case it could be argued that the additional complexity of the interaction model is not warranted since the three regression lines in the left-hand interaction plot are already somewhat parallel. Therefore the simpler parallel slopes model should be favored.

Going one step further, notice how the three regression lines in the visualization of the parallel slopes model in the right-hand plot have similar intercepts. It can thus be argued that the additional model complexity induced by introducing the categorical variable school size is not warranted. Therefore, a simple linear regression model using only perc_disadvan (the percentage of the student body that is economically disadvantaged) should be favored.

While many students will inevitably find these results depressing, in our opinion, it is important to additionally emphasize that such regression analyses can be used as an empowering tool to bring to light inequities in access to education and inform policy decisions.

6. Produce metrics on the quality of regression model fits

Recall the output of the standard summary.lm() from earlier:

## 
## Call:
## lm(formula = score ~ age, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9185 -0.3531  0.1172  0.4172  0.8825 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.461932   0.126778  35.195   <2e-16 ***
## age         -0.005938   0.002569  -2.311   0.0213 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5413 on 461 degrees of freedom
## Multiple R-squared:  0.01146,    Adjusted R-squared:  0.009311 
## F-statistic: 5.342 on 1 and 461 DF,  p-value: 0.02125

Say we wanted to extract the scalar model summaries at the bottom of this output, such as R-squared, R-squared adjusted, the F-statistic, and the degrees of freedom df. We can do so using the get_regression_summaries() function.

get_regression_summaries(score_model)
## # A tibble: 1 × 9
##   r_squared adj_r_squared   mse  rmse sigma statistic p_value    df  nobs
##       <dbl>         <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl> <dbl> <dbl>
## 1     0.011         0.009 0.292 0.540 0.541      5.34   0.021     1   463

We’ve supplemented the standard scalar summaries output yielded by summary() with the mean squared error mse and root mean squared error rmse given their popularity in machine learning settings.

The inner workings

How does this all work? Let’s open the hood of the moderndive package.

Three wrappers to broom functions

As we mentioned earlier, the three get_regression_* functions are wrappers around functions from the broom package for converting statistical analysis objects into tidy tibbles, with a few added tweaks designed with the introductory statistics student in mind (a simplified sketch of one such wrapper follows the list below):

  1. get_regression_table() is a wrapper for broom::tidy().
  2. get_regression_points() is a wrapper for broom::augment().
  3. get_regression_summaries() is a wrapper for broom::glance().
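For intuition, here is a hypothetical, simplified sketch of what such a wrapper could look like; the actual moderndive source does considerably more cleanup (rounding, renaming of terms, handling of edge cases):

library(dplyr)

# Hypothetical, simplified wrapper in the spirit of get_regression_table();
# not the actual moderndive implementation.
get_regression_table_sketch <- function(model, conf.level = 0.95) {
  broom::tidy(model, conf.int = TRUE, conf.level = conf.level) %>%
    rename(
      std_error = std.error,
      p_value   = p.value,
      lower_ci  = conf.low,
      upper_ci  = conf.high
    )
}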

Why did we take this approach to address the initial 5 common student questions at the outset of the article?

  1. By writing wrappers around pre-existing functions, instead of creating new custom functions, we minimize re-inventing the wheel.
  2. In our experience, novice R users had a hard time understanding the broom package function names tidy(), augment(), and glance(). To make them more user-friendly, the moderndive wrappers have the more intuitive names get_regression_table(), get_regression_points(), and get_regression_summaries().
  3. The variables included in the outputs of the above three broom functions are not all applicable to introductory statistics students, and of those that are, the names are not intuitive for new R users. We therefore cut some variables from the output and renamed some of the remaining ones. For example, compare the outputs of the get_regression_points() wrapper function and the parent broom::augment() function.
# Streamlined moderndive output:
get_regression_points(score_model)
# Full output of the broom function it wraps:
broom::augment(score_model)

The source code for these three get_regression_* functions can be found on GitHub.

Custom geometries

geom_parallel_slopes() is a custom-built geom extension to the ggplot2 package; the ggplot2 extensions webpage gives instructions on how to create such extensions. The source code for geom_parallel_slopes(), written by Evgeni Chasnovski, can be found on GitHub.

  1. For details on the remaining 5 variables, see the help file by running ?evals.

  2. Note that gender was collected as a binary variable at the time of the study (2005).

  3. For more details on this dataset, see the help file by running ?MA_schools.

moderndive_book's People

Contributors

data-becki, dsolito, erinann, gungormetehan, ismayc, jacobbien, javi-s, joelostblom, kbodwin, kismay, kmkinnaird, mariumtapal, mitsuoxv, mrlinuxfish, nataliegnelson, oscci, pursuitofdatascience, rudeboybert, rutendom, s00singla, simplyjin, smetzer180, starryz, torockel, traviscibot, watanabesmith, whoishomer, wjhopper, xiaochi-liu, yihui


moderndive_book's Issues

kindle and/or pdf versions?

Are there Kindle and/or PDF versions of Modern Dive available for download? If not, there should be, as there is for OpenIntro Statistics. I realize that I could (probably!) make such a resource using tools provided via bookdown, but that is certainly too difficult for most users.

Figure out how to get skimr::kable() to work

The line across the output here is weird. Maybe we can ask the {skimr} authors to tweak this? skimr::kable() doesn't seem to work for me when I try it in bookdown either?

[screenshot: skimr output with a stray horizontal line across it]

4.9.1 Resources

The coggle mind map for data visualization link is broken. Third paragraph in section 4.9.1.

Weary wariness

In section 2.3 you have the sentence: "By the end of this book, you’ll be able to better understand whether these claims should be trusted or whether we should be weary."

While I'm sure studying statistics makes people weary on occasion, I believe the word you want here is "wary".

Cheers!

Clean up index.Rmd -> include_image() function

To use "open in new tab" markdown functionality. Ex:

[Map](https://deskarati.com/2012/03/31/london-undergrounds-real-map/)
[Map](https://deskarati.com/2012/03/31/london-undergrounds-real-map/){target="_blank"}

Suggested DataCamp chapters vs. entire courses

I've been assigning my students the suggested DataCamp chapters at the beginning of each ModernDive chapter to accompany MD. The experience went well with MD chapter 3, which suggested chapters 2 and 4 of David Robinson's "Introduction to the Tidyverse" DC course. Even though the suggestions had students skip chapters 1 and 3, most were fine and could understand what was going on without those chapters. Plus, I had them finish those DC chapters when they got to MD chapter 5 (since they're suggested there).

When students got to MD chapter 4 on tidy data, though, most panicked and struggled a ton (and many just gave up). The suggested DC resource is chapter 3 of @apreshill's "Working with data in the Tidyverse" course. Unlike David Robinson's course, which students were somehow able to skip around in with no problem, Alison's DC course is pedagogically designed to build on previous chapters, so students needed to have completed chapters 1 and 2 to work on 3 (at least that's what the few that went back and did 1–2 told me—I'm not 100% certain of that because I haven't actually taken her DC course 😬).

It might be helpful to not suggest specific chapters from DC courses, but whole courses instead. MD does this later on—chapter 10 suggests the whole courses on "Inference for Numerical Data" and "Inference for Categorical Data", for instance, instead of chapters.

Clarify t-test formula

Hi,

After teaching this past quarter, I realized the formula given for the t-test needs clarification. (I know you guys obviously know the difference, but I found myself wishing for more narrative around the formula provided in the text.)

The formula given is:

$$T =\dfrac{ (\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{ \sqrt{\dfrac{{s_1}^2}{n_1} + \dfrac{{s_2}^2}{n_2}} }$$

[screenshot of the formula above]

But what is not said is that this is the formula for Welch's t-test, where var.equal = FALSE:

[screenshot: Welch's two-sample t-test formula, with var.equal = FALSE]

The modified degrees of freedom would need to be clarified here as you can't use the formula given and actually do the t-test by hand without that too (and when teaching, helps students understand why all of a sudden the df can have decimals):

[screenshot: Welch–Satterthwaite degrees of freedom formula]
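Since the screenshot is not reproduced here, the standard Welch–Satterthwaite approximation for the degrees of freedom is:

$$\nu \approx \dfrac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$$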

The t-test formula for var.equal = TRUE (in all its pooled SE glory) is here:

[screenshot: pooled two-sample t-test formula]
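Again, since the screenshot is not reproduced here, the standard pooled two-sample t-statistic is:

$$T = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}, \qquad s_p = \sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$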

If you want an easy copy/paste, the formulas are in this raw Rmd.

Again thanks for the awesome text- I really enjoyed teaching with it this past quarter!

Alison

An error occurred when knitting

When knitting, an error occurred with the following message:

Error reading bibliography .\bib/packages.bib (line 94, column 1):
unexpected "}"
expecting space or "="
Error running filter pandoc-citeproc:
Filter returned error status 1

SessionInfo:

R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936  LC_CTYPE=Chinese (Simplified)_China.936   
[3] LC_MONETARY=Chinese (Simplified)_China.936 LC_NUMERIC=C                              
[5] LC_TIME=Chinese (Simplified)_China.936    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RevoUtils_10.0.7     RevoUtilsMath_10.0.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.15    bookdown_0.7    digest_0.6.15   rprojroot_1.3-2 backports_1.1.2 magrittr_1.5   
 [7] evaluate_0.10.1 stringi_1.1.6   rmarkdown_1.9   tools_3.4.3     stringr_1.3.0   xfun_0.1       
[13] yaml_2.1.18     compiler_3.4.3  htmltools_0.3.6 knitr_1.20 

Suggestion: Include CI for get_correlation function

For some reason, perhaps historical, it is unusual for people to include 95% CI with correlations. Instead they just discuss whether 'significant' or not.
If the get_correlation function automatically gave you the 95% CI, it would help educate people to stop interpreting correlation coefficients as precise estimates, particularly with small N.
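
In the meantime, here is a sketch of how such an interval can be computed with existing tools, using score and bty_avg from the evals data as an example:

library(moderndive)
library(dplyr)
library(broom)

# Correlation with its 95% CI via base R's cor.test() and broom::tidy();
# a workaround until/unless get_correlation() gains this feature.
cor.test(~ score + bty_avg, data = evals) %>%
  tidy() %>%
  select(estimate, conf.low, conf.high)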

Add docs/ to .gitignore

@ismayc Is there any reason to keep docs/ synced on GitHub, since travis-ci renders the book to the gh-pages branch? I feel like we should only sync changes to .Rmd and .yml files and images/data files, otherwise it increases the chances of merge conflicts and makes for harder commit/branch comparisons.

"10.9.8 Simulated data" needs some clarification

First of all thank you so much for your book. It made sampling and hypothesis testing so much more clear for me. Something I never really understood in med school.

One thing I tripped over a little bit is the section "10.9.8 Simulated data" in the chapter on hypothesis testing the difference of means. Especially the "tactile experiment" wasn't clear the first time I read it. I needed like 5-6 reads to fully understand the idea behind the experiment:

The next step is to put the two stacks of index cards together, creating a new set of 68 cards. If we assume that the two population means are equal, we are saying that there is no association between ratings and genre (romance vs action). We can use the index cards to create two new stacks for romance and action movies. First, we must shuffle all the cards thoroughly. After doing so, in this case with equal values of sample sizes, we split the deck in half.

The thing that wasn't clear to me was that you build two new stacks for "action" and "romance" movies that do not consist entirely of one genre but are a mixture of both. Since the H0 hypothesis is that there is no difference between the ratings of the two genres, that's a valid way to simulate the H0 hypothesis. But that wasn't clear to me. Maybe a small sketch of the experiment or something would make this much more clear. Even a sentence like "Note that the new stacks for action and romance movies do not consist entirely of the specific genre" or something along those lines would be awesome.

I hope my point is clear. I would love to assist clarifying this section

Thanks again for your awesome content 👍

chapter 8 typos

"They were listed on the contexts of the box that the bowl came in"

Should be "contents", I think.

Give a summary of the `get_` functions

I think students may struggle a bit to understand when to use each of the three get_ functions since they are similar. Would be useful to add a closing summary to Chapter 7 discussing this. (Not pressing, but helpful to include on the next big release.)

Add summary table of all models at end of Ch7

For the five models (a sketch of the corresponding lm() calls follows the list):

  • 1 numerical x: intercept + slope
  • 1 categorical x: baseline + offsets
  • interaction: baseline, slopes, offset to baseline, offset to slopes
  • parallel slopes: intercept + offsets + single slope
  • 2 numerical: intercept + slopes
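
A hypothetical sketch of the corresponding lm() calls (y, x1, x2, g, and dat are placeholder names, not variables from the book):

lm(y ~ x1,      data = dat)  # 1 numerical x: intercept + slope
lm(y ~ g,       data = dat)  # 1 categorical x: baseline intercept + offsets
lm(y ~ x1 * g,  data = dat)  # interaction: baseline intercept & slope + offsets to both
lm(y ~ x1 + g,  data = dat)  # parallel slopes: intercept + offsets + single slope
lm(y ~ x1 + x2, data = dat)  # 2 numerical x: intercept + two slopes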

Address all warning messages in Ch9 due to infer version bump to v0.4.0

Example 1

bootstrap_distribution %>% 
  visualize(obs_stat = x_bar)

yields

Warning message:
`visualize()` shouldn't be used to plot p-value. Arguments `obs_stat`, `obs_stat_color`, `pvalue_fill`, and `direction` are deprecated. Use `shade_p_value()` instead. 

Example 2

bootstrap_distribution %>% 
  visualize(endpoints = percentile_ci, direction = "between")

yields

Warning message:
`visualize()` shouldn't be used to plot confidence interval. Arguments `endpoints`, `endpoints_color`, and `ci_fill` are deprecated. Use `shade_confidence_interval()` instead.
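
Presumably the replacements suggested by these warnings look something like the following (the direction value for Example 1 is an assumption for illustration):

# Example 1, updated:
visualize(bootstrap_distribution) +
  shade_p_value(obs_stat = x_bar, direction = "both")

# Example 2, updated:
visualize(bootstrap_distribution) +
  shade_confidence_interval(endpoints = percentile_ci)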

Error 1

The following

sampling_distribution %>%
  visualize(fill = "salmon")

yields

Error in vapply(theory_types, function(x) { : values must be length 1,
 but FUN(X[[1]]) result is length 0

and was replaced with

ggplot(sampling_distribution, aes(x = stat)) +
  geom_histogram(bins = 10, fill = "salmon", color = "white")

Add discussion of power into Chapter 10

This book is a phenomenal resource! I'm using parts of it for a workshop I'm giving to graduate students in microbiology. The students have had no formal prior instruction in statistics (or have forgotten everything they learned in a stat 1000 class years ago).

Chapter 10 is an excellent introduction to the basics of hypothesis testing. For my purposes, the next logical thing to discuss after alpha and beta is power. I've pulled in relevant material from here: http://www.statisticsteacher.org/2017/09/15/what-is-power/ for the first iteration of my workshop, and suggest that a similar discussion would be at home in your book.

I tend to think that if I find something helpful others will too, so thought I would pass these thoughts along to you.

Again, thank you thank you thank you for developing this amazing resource.

Beef up Chapter 11: Wrapping Up

In Chapter 11, include links to all code/data referred to in current Section 1.1.1 What you will learn from this book.

What do we mean by data stories? We mean any analysis involving data that engages the reader in answering questions with careful visuals and thoughtful discussion, such as How strong is the relationship between per capita income and crime in Chicago neighborhoods? and How many f**ks does Quentin Tarantino give (as measured by the amount of swearing in his films)?. Further discussions on data stories can be found in this Think With Google article.

For other examples of data stories constructed by students like yourselves, look at the final projects for two courses that have previously used ModernDive:

Normalization or standardization? Chapter 10

Maybe consider using standardization in place of normalization.

In my experience, it is common for students (and researchers) to mistakenly believe that normalization implies they are making the variable/data normal when in fact they are simply transforming their variable/data to a standard scale.

Just as the standard error is a special type of standard deviation, standardization can be thought of as a special type of normalization. It is such an important type that it gets its own name.

Section 10.8.1:

What is commonly done in statistics is the process of normalization. What this entails is calculating the mean and standard deviation of a variable. Then you subtract the mean from each value of your variable and divide by the standard deviation. The most common normalization is known as the z-score

Release steps

Copied from corresponding Google Doc (only the first part):

1. Final edits to .Rmd files

a) Formatting

  • Borders of all histograms should be white, making the binning structure easier to read.
  • Use computer font for all computing concepts: function(), data_frame, variable_names, package_name.
  • Remove all &, %, and _ in fig.cap for R chunk options since they break PDF build
  • Ensure no code exceeds 80 characters. While HTML code block outputs tolerate >80, PDF code block outputs do not.
  • Ensure skimr::skim() code is not actually run, but all calls and outputs are hard-coded (with hist preview removed and --- output cut down to 80 characters), since this will break all knitr::kable() code for rest of book.
  • Apply {styler} code to all internal code (i.e. code the reader doesn't see) using styler's RStudio Add-In.

b) References

  • Search for all broken references (search for @ref and ?? in html output)
  • Maintain Chapter/Section/Subsection naming consistency
  • Ensure all Figures & Tables are referenced.
  • If possible, ensure index is up-to-date by adding \index{<term>} tags

c) Dataset management

  • Remove all load() calls
  • Ensure all CSV’s are loaded from moderndive.com and not other sites. See index.Rmd, set-options R chunk, copy all needed csv files to docs/
  • Create bit.ly links for all linked Google Docs

d) Spell check

  • Do it in RStudio.
  • Scan over all changed content to make sure grammar is correct.

e) R Scripts: Make sure docs/scripts folder contains the appropriate scripts

  • Development version should link to moderndive.netlify.app/scripts since code may be updated in development causing problems if someone on moderndive.com is trying to look over all the code for what is there
  • If possible, go over all code blocks and ensure that purl = FALSE is set for all code chunks we don't want shared.

f) ETC

  • Ensure only Shutterstock licensed photos/images are used.
  • If possible, delete contents of rds/ and rebuild all .rds files in case any got stale.

2. Final sanity checks

  • Update all R packages before one final build
  • Build PDF version
  • Ensure that .travis.yml does not have any GitHub development .9000 versions of packages.

3. Switch from dev to release version on GitHub:

a) Create new GitHub branch

  • Create update-release branch to be used for pull request.

b) Edit index.Rmd

At the top

  • YAML: Change r format(Sys.time(), '%B %d, %Y') to release date of form “January 1, 1970”

set-options R chunk

  • Current version information:
    • Remove .9000 and bump version number
    • Set date to release date: replace format(Sys.time(), '%B %d, %Y') with date of form “January 1, 1970”
  • Latest release information:
    • Update to release values
  • CRAN packages needed:
    • Ensure only CRAN versions used
    • Periodically: Do a search for use of all packages and remove those no longer used

Lower in the file

  • Comment out development version warning block
  • Flip dev_version boolean to FALSE

c) Edit preface.Rmd

Section 1.4 “About this book”

  • Latest published version: ensure info is correct
  • Previous versions: Add previous version info
  • Add previous version to previous_version/ folder. Steps:
    • Go to previous_versions/ and add new subfolder for soon-to-be previous version
    • Go to GitHub releases pages and download .zip of source code of soon-to-be previous version
    • In index.Rmd
      • Uncomment notice about this being an out-of-date version and hard-code version number
      • Ensure comment is surrounded by *** to highlight this note
    • Build book and delete redundant nested docs/previous_versions
    • Copy contents of docs/ folder to the new subfolder in previous_versions/

d) Edit other files

  • NEWS.md: Update with all significant changes from this TODO
  • .gitignore: Remove bib/packages.bib and docs/

e) Final step

  • Clean and rebuild book

4. Publish release

  • Merge dev-to-release-vX.X.X branch into master
  • Merge master into release via PR (trick to remember: right into left). Or consider doing this?
  • Rebuild release/docs and commit. You might need to remove docs/ from .gitignore to be able to commit to release branch
  • Tag release on GitHub
  • Add link to Previous versions around line 474 of index.Rmd of most recent previous version.
  • Send email to MailChimp
  • Make sure all links in left-hand navbar index work
  • Relatedly, ensure moderndive package on CRAN is updated around same time as ModernDive book version bump.

5. Revert back to dev version on GitHub:

a) Create new GitHub branch

  • Create release-to-dev branch to be used for pull request. See example commit on GitHub to revert back to devel version by undoing all changes to:

b) Edit index.Rmd

At the top

  • YAML: Change release date back to r format(Sys.time(), '%B %d, %Y')
  • Return “development branch” version warning block and flip dev_version flag

set-options R chunk

  • Current version information:
    • Add .9000 to version number
    • Set date to r format(Sys.time(), '%B %d, %Y')

c) Edit other files:

  • NEWS.md: Add section for new dev version
  • Add bib/packages.bib to .gitignore
  • Revert .travis.yml so that infer and moderndive packages are github dev versions

Broken Image

Hello,

Just wanted to give you a heads up that figure 6.2 isn't rendering. I just see a big white box.

GitHub/Publishing

New desiderata:

  • Transfer ownership of repo from https://github.com/ismayc/moderndiver-book to ModernDive GitHub organization https://github.com/moderndive/
  • From there have
    • main branch be source code for latest released version
    • devel branch for source code for the development version
  • Deploy/host the different versions of ModernDive via netlify as follows:
    • Have release version at moderndive.com deployed via netlify instead of HostGator as per this tweet
    • Development version at http://moderndive.netlify.com
    • Past versions output/source code at moderndive.github.io/<old_version>
  • Fix bookdown.org ModernDive cover issue due to https on HostGator

Originally stated action items: Use travis deployment instead of manual builds for both ModernDive.com and

Load moderndive::evals

Replace all load(url("http://www.openintro.org/stat/data/evals.RData")) with library(moderndive)
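
The replacement would look roughly like this:

# Before: download the data from OpenIntro
load(url("http://www.openintro.org/stat/data/evals.RData"))

# After: evals ships with the moderndive package
library(moderndive)
data(evals)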

3.2 Markdown structure error

Following the code in the "ModernDive" markdown book located on bookdown.org, I encountered the following errors in the .md html output after running data(flights). Per Ismay's book "Getting Used to R, RStudio, and R Markdown" there was a comment about chunks: calling functions listed above will result in an error if not called in the current chunk, or something to that effect. For the remainder of chapter 3, I had to use the eval option to bypass the .md error in the console, otherwise all processes halted. Any thoughts or instruction on where I went wrong?

slope_obs object not found

Having just started recently, I am working my way through this very interesting project.
The knitting hangs at this piece of code contained in 11-inference-for-regression.Rmd:

null_slope_distn %>%
visualize(obs_stat = slope_obs, direction = "greater")

It says slope_obs object not found. Just by looking with my untrained eyes, it seems that that data frame is defined at a later stage.
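
One hypothetical fix, assuming the score ~ bty_avg model used in that chapter, is to compute the observed slope with infer before the visualize() call:

library(moderndive)
library(dplyr)
library(infer)

# Observed slope of the fitted regression line (assumed model: score ~ bty_avg)
slope_obs <- evals %>%
  specify(score ~ bty_avg) %>%
  calculate(stat = "slope")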

Add correspondence to DataCamp for flipped classrooms

This is likely to change with time,
and will need to be regularly updated, but

Roughly thinking:

Pre-course/ initial lab: Installing R and R Studio
based on Getting used to R, RStudio, and R Markdown
Read: R Packages: A Beginner's Guide online

MD 1& 2 = Working with the RStudio IDE parts 1/2,
Introduction to R,
DataCamp Cheatsheets link,
Intermediate R
MD 3 = Introduction to the Tidyverse,
Data Vis w/ggplot part 1
MD4 = Introduction to Data,
Data Cleaning in R
MD5 = Data Manipulation in R with Dplyr,
Joining Data in R with Dplyr
MD6 = Exploratory Data Analysis,
Intro to Stats: Correlation and Linear Regression,
Correlation and Regression?? (seems like pick one of b or c)
MD7 = Intro to Stats: Mult Regression,
Multiple &Logistic Regression
MD8 = Foundations of Inference
MD9 = Inference for Numerical Data
MD10 = ??
MD 11= Inference for Linear Regression
MD12 = Communicating with the Tidyverse,
Reporting with R Markdown,
Building Web Applications in R with Shiny

Also, Introduction to Data (MD4?), Correlation and Regression (MD6?) might fit in somewhere
Maybe more case studies?

Introducing forcats

Re this section:
You can manually specify which continent to use as baseline instead of the default choice of whichever comes first alphabetically, but we leave that to a more advanced course. (The forcats package is particularly nice for doing this and we encourage you to explore using it.)
It is great to learn about forcats: I seem to remember stumbling on it after a very frustrating experience of trying to reorder levels of a factor. While I can see that you don't want to digress too much from explaining about regression, I wonder if it would not be worth saying a bit more about R's default behaviour with factor levels, as I think many people get stuck on this when they come to analyse their own data.

C.2 Interactive graphics

When I run the commands of topic C.2.1 the graph remains empty!
Could you help me, or point me to the right place where I can ask questions?
...
dyRangeSelector(dygraph(flights_summarized))

If I use ts {stats}, the graph appears but with an empty X scale

Best Regards!

section 9.4

Suggestion: when comparing histograms for bootstrap and sampling distributions, it would be better if one could be superimposed on the other, or at least have them shown side by side. Currently you have to scroll up and down to compare them.
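
A sketch of what superimposing could look like (the data frame and column names here are assumptions based on the surrounding chapter):

library(dplyr)
library(ggplot2)

# Stack the two simulated distributions, labelled by their source,
# then overlay semi-transparent histograms on shared axes.
both <- bind_rows(
  bootstrap_distribution %>% mutate(distribution = "bootstrap"),
  sampling_distribution %>% mutate(distribution = "sampling")
)
ggplot(both, aes(x = stat, fill = distribution)) +
  geom_histogram(position = "identity", alpha = 0.4, bins = 30, color = "white")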

Error encountered rendering book

Hi,

I am trying to render the book in html_book using the following command

bookdown::render_book("index.Rmd", "bookdown::html_book")

I encountered the following error:
Error in split_chapters(output, page_builder, number_sections, split_by, :
The document must start with a first (#) or second level (##) heading
In addition: There were 50 or more warnings (use warnings() to see the first 50)

What am I doing wrong?

Herman

Table links and automatic numbering not working in HTML

While figure linking within the text is working, the same links for tables are not. They all render as "Table ??" where the "??" links to a non-existent header (I think). See image:

[screenshot: table cross-references rendering as "Table ??"]

Issue was present throughout chapters 8 and 9, haven't investigated further. From browsing Rmd it seems like the references are correctly spelling the names of code chunks.

Consider elevating the statistical background appendix to a short chapter

Filed per https://twitter.com/ModernDive/status/1073340286091386881. May I suggest elevating the statistical background section to a short chapter.

Currently there are no tidy, bayesian, R-based introduction to statistics books that I am aware of. As such, initial statistics must be taught from a more traditional book (e.g. https://www.amazon.com/Introductory-Statistics-R-Computing/dp/0387790535/). It is unlikely an instructor would start with one such book for the basic first section and then transition to a more modern approach for future topics (confidence, visualization, modeling, hypothesis testing, etc).

Modern Dive could help solve this by adding an introductory chapter that expands the Statistical Background appendix. It can probably be less than the equivalent chapters in other stats books since, as was said in the tweet, the concepts will hopefully be interspersed within other sections to allow learning as doing. (I would recommend reviewing the other sections to ensure they do cover using the mean, median, mode, quantiles, SD, variance, and several common distributions such as normal/gaussian, bernoulli, beta, binomial, uniform, geometric, poisson, gamma, log normal, exponential, and general power-law distributions).

I would suggest the goal is not to teach students when and how to use these concepts (as hopefully the rest of the book takes care of that), but provide context so that when they see them in use they understand how they fit into statistics as a whole. (For example, https://blog.cloudera.com/blog/2015/12/common-probability-distributions-the-data-scientists-crib-sheet/ gives an interesting quick explanation of basic distributions and their relationships.)

To that end, it may even be beneficial to mention common traditional statistics (p-value, t-test, etc) in this section and then point to the Appendix where they are explained, not necessarily to give students an alternative to the primary approaches taught, but simply so they understand where these things they will hear commonly sit in the context of what they have learned.

And thank you for what is ultimately the go-to reference for a tidy approach to statistics. I think it's sorely needed and an excellent book with or without modification. I look forward to buying a hard copy as soon as they come off the presses!

Small changes before Saturday

This will serve as future release checklist as well.

  • Just this time: get package list consistent
  • Check for broken @ and ?? references
  • Ensure Chapter/Section/Subsection consistency. Ex: 3/3.1/3.1.2
  • Hide all Learning Check solutions
  • Take banner off of master branch docs deploy after we create a dev branch.
  • Make changes in index.Rmd to update versions of development and released.
  • Tag release on GitHub for new released version.
  • Update past versions list in Section 1.4. See #23 and #31
  • Change travis to look for changes on the dev branch and then publish to gh-pages (for Albert's class on Monday).
  • Create a referral page from https://github.com/ismayc/moderndiver-book to https://github.com/moderndive/moderndive_book (Contents in the old repo should be removed prior to the next major release--likely this summer.)
  • Change links to script files to be to https://moderndive.com/scripts/ for the released version of the book. Not that big of a deal but it would be confusing if we changed some of the code in the development version of the book and then someone was trying to run the ModernDive.com code instead. We should probably set a toggle for this. (Delayed until next release.)
