moderndive / moderndive_book

Statistical Inference via Data Science: A ModernDive into R and the Tidyverse

Home Page: https://www.moderndive.com/

License: Other

CSS 0.01% HTML 97.34% Shell 0.01% TeX 1.76% R 0.88%
bootstrap-method confidence-intervals data-science data-visualization data-wrangling dplyr ggplot2 hypothesis-testing infer moderndive permutation-test r regression regression-models rstats rstudio statistical-inference tidy tidyverse

moderndive_book's Introduction

moderndive R Package


Overview

The moderndive R package consists of datasets and functions for tidyverse-friendly introductory linear regression. These tools leverage the well-developed tidyverse and broom packages to facilitate:

  1. Working with regression tables that include confidence intervals
  2. Accessing regression outputs on an observation level (e.g. fitted/predicted values and residuals)
  3. Inspecting scalar summaries of regression fit (e.g. R-squared, R-squared adjusted, and mean squared error)
  4. Visualizing parallel slopes regression models using ggplot2-like syntax.

This R package is designed to supplement the book “Statistical Inference via Data Science: A ModernDive into R and the Tidyverse” available at ModernDive.com. For more background, read our Journal of Open Source Education paper.

Installation

Get the released version from CRAN:

install.packages("moderndive")

Or the development version from GitHub:

# If you haven't installed remotes yet, do so:
# install.packages("remotes")
remotes::install_github("moderndive/moderndive")

Basic usage

library(moderndive)
score_model <- lm(score ~ age, data = evals)
  1. Get a tidy regression table with confidence intervals:

    get_regression_table(score_model)
    ## # A tibble: 2 × 7
    ##   term      estimate std_error statistic p_value lower_ci upper_ci
    ##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
    ## 1 intercept    4.46      0.127     35.2    0        4.21     4.71 
    ## 2 age         -0.006     0.003     -2.31   0.021   -0.011   -0.001
    
  2. Get information on each point/observation in your regression, including fitted/predicted values and residuals, in a single data frame:

    get_regression_points(score_model)
    ## # A tibble: 463 × 5
    ##       ID score   age score_hat residual
    ##    <int> <dbl> <int>     <dbl>    <dbl>
    ##  1     1   4.7    36      4.25    0.452
    ##  2     2   4.1    36      4.25   -0.148
    ##  3     3   3.9    36      4.25   -0.348
    ##  4     4   4.8    36      4.25    0.552
    ##  5     5   4.6    59      4.11    0.488
    ##  6     6   4.3    59      4.11    0.188
    ##  7     7   2.8    59      4.11   -1.31 
    ##  8     8   4.1    51      4.16   -0.059
    ##  9     9   3.4    51      4.16   -0.759
    ## 10    10   4.5    40      4.22    0.276
    ## # … with 453 more rows
    
  3. Get scalar summaries of a regression fit, including R-squared and adjusted R-squared as well as the (root) mean squared error:

    get_regression_summaries(score_model)
    ## # A tibble: 1 × 9
    ##   r_squared adj_r_squared   mse  rmse sigma statistic p_value    df  nobs
    ##       <dbl>         <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl> <dbl> <dbl>
    ## 1     0.011         0.009 0.292 0.540 0.541      5.34   0.021     1   463
    
  4. Visualize parallel slopes models using the geom_parallel_slopes() custom ggplot2 geometry:

    library(ggplot2)
    ggplot(evals, aes(x = age, y = score, color = ethnicity)) +
      geom_point() +
      geom_parallel_slopes(se = FALSE) +
      labs(x = "Age", y = "Teaching score", color = "Ethnicity")

Statement of Need

Linear regression has long been a staple of introductory statistics courses. While the curricula of introductory statistics courses have evolved considerably of late, the overall importance of regression remains the same. Furthermore, while the use of the R statistical programming language for statistical analysis is not new, recent developments such as the tidyverse suite of packages have made statistical computation with R accessible to a broader audience. We go one step further by leveraging the tidyverse and broom packages to make linear regression accessible to students taking an introductory statistics course. Such students are likely to be new to statistical computation with R; we designed moderndive with these students in mind.

Contributor code of conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.


Six features

Why should you use the moderndive package for introductory linear regression? Here are six features:

  1. Focus less on p-value stars, more confidence intervals
  2. Outputs as tibbles
  3. Produce residual analysis plots from scratch using ggplot2
  4. A quick-and-easy Kaggle predictive modeling competition submission!
  5. Visual model selection: plot parallel slopes & interaction regression models
  6. Produce metrics on the quality of regression model fits

Data background

We first discuss the model and data background. The data consists of end of semester student evaluations for a sample of 463 courses taught by 94 professors from the University of Texas at Austin. This data is included in the evals data frame from the moderndive package.

In the following table, we present a subset of 9 of the 14 variables included for a random sample of 5 courses[1]:

  1. ID uniquely identifies the course whereas prof_ID identifies the professor who taught this course. This distinction is important since many professors taught more than one course.
  2. score is the outcome variable of interest: average professor evaluation score out of 5 as given by the students in this course.
  3. The remaining variables are demographic variables describing that course’s instructor, including bty_avg (average “beauty” score) for that professor as given by a panel of 6 students.[2]
 ID  prof_ID  score  age  bty_avg  gender  ethnicity     language  rank
355       71    4.9   50    3.333  male    minority      english   teaching
262       49    4.3   52    3.167  male    not minority  english   tenured
441       89    3.7   35    7.833  female  minority      english   tenure track
 51       10    4.3   47    5.500  male    not minority  english   teaching
 49        9    4.5   33    4.667  female  not minority  english   tenure track

1. Focus less on p-value stars, more confidence intervals

We argue that the summary.lm() output is deficient in an introductory statistics setting because:

  1. The Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 only encourage p-hacking. In case you have not yet been convinced of the perniciousness of p-hacking, perhaps comedian John Oliver can convince you.
  2. While not a silver bullet for eliminating misinterpretations of statistical inference, confidence intervals present students with a sense of the associated effect sizes of any explanatory variables. Thus, practical as well as statistical significance is emphasized. These are not included by default in the output of summary.lm().

Instead of summary(), let’s use the get_regression_table() function:

get_regression_table(score_model)
## # A tibble: 2 × 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    4.46      0.127     35.2    0        4.21     4.71 
## 2 age         -0.006     0.003     -2.31   0.021   -0.011   -0.001

Observe how the p-value stars are omitted and confidence intervals for the point estimates of all regression parameters are included by default. By including them in the output, we can then emphasize to students that they “surround” the point estimates in the estimate column. Note the confidence level defaults to 95%; this default can be changed using the conf.level argument:

get_regression_table(score_model, conf.level = 0.99)
## # A tibble: 2 × 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    4.46      0.127     35.2    0        4.13     4.79 
## 2 age         -0.006     0.003     -2.31   0.021   -0.013    0.001

2. Outputs as tibbles

While one might argue that extracting the intercept and slope coefficients can be “simply” done using coefficients(score_model), what about the standard errors? For example, a Google query of “how do I extract standard errors from lm in R” yielded results from the R mailing list and from Cross Validated suggesting we run:

sqrt(diag(vcov(score_model)))
## (Intercept)         age 
## 0.126778499 0.002569157

We argue that this task shouldn’t be this hard, especially in an introductory statistics setting. To rectify this, the three get_regression_* functions all return data frames in the tidyverse-style tibble (tidy table) format. Therefore you can extract columns using the pull() function from the dplyr package:

library(dplyr)
get_regression_table(score_model) %>%
  pull(std_error)
## [1] 0.127 0.003

or equivalently you can use the $ sign operator from base R:

get_regression_table(score_model)$std_error
## [1] 0.127 0.003

Furthermore, by piping the above get_regression_table(score_model) output into the kable() function from the knitr package, you can obtain aesthetically pleasing regression tables in R Markdown documents, instead of tables written in jarring computer output font:

library(knitr)
get_regression_table(score_model) %>%
  kable()
term       estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept     4.462      0.127     35.195    0.000     4.213     4.711
age          -0.006      0.003     -2.311    0.021    -0.011    -0.001

3. Produce residual analysis plots from scratch using ggplot2

How can we extract point-by-point information from a regression model, such as the fitted/predicted values and the residuals? (Note we only display the first 10 out of 463 of such values for brevity’s sake.)

fitted(score_model)
##        1        2        3        4        5        6        7        8 
## 4.248156 4.248156 4.248156 4.248156 4.111577 4.111577 4.111577 4.159083 
##        9       10 
## 4.159083 4.224403
residuals(score_model)
##           1           2           3           4           5           6 
##  0.45184376 -0.14815624 -0.34815624  0.55184376  0.48842294  0.18842294 
##           7           8           9          10 
## -1.31157706 -0.05908286 -0.75908286  0.27559666

But why have the original explanatory/predictor variable age and outcome variable score in evals, the fitted/predicted values score_hat, and the residuals floating around in separate vectors? Since each observation relates to the same course, we argue it makes more sense to organize them together in the same data frame using get_regression_points():

score_model_points <- get_regression_points(score_model)
score_model_points
## # A tibble: 463 × 5
##       ID score   age score_hat residual
##    <int> <dbl> <int>     <dbl>    <dbl>
##  1     1   4.7    36      4.25    0.452
##  2     2   4.1    36      4.25   -0.148
##  3     3   3.9    36      4.25   -0.348
##  4     4   4.8    36      4.25    0.552
##  5     5   4.6    59      4.11    0.488
##  6     6   4.3    59      4.11    0.188
##  7     7   2.8    59      4.11   -1.31 
##  8     8   4.1    51      4.16   -0.059
##  9     9   3.4    51      4.16   -0.759
## 10    10   4.5    40      4.22    0.276
## # … with 453 more rows

Observe that the original outcome variable score and explanatory/predictor variable age are now supplemented with the fitted/predicted values score_hat and residual columns. By putting the fitted values, predicted values, and residuals next to the original data, we argue that the computation of these values is less opaque. For example, instructors can emphasize how all values in the first row of output are computed.

Furthermore, recall that since all outputs in the moderndive package are tibble data frames, custom residual analysis plots can be created instead of relying on the default plots yielded by plot.lm(). For example, we can check for the normality of residuals using the histogram of residuals shown below.

# Code to visualize distribution of residuals:
ggplot(score_model_points, aes(x = residual)) +
  geom_histogram(bins = 20) +
  labs(x = "Residual", y = "Count")

Histogram visualizing distribution of residuals.

As another example, we can investigate potential relationships between the residuals and all explanatory/predictor variables, and check for the presence of heteroskedasticity, using partial residual plots like the partial residual plot over age shown below. If the term “heteroskedasticity” is new to you, it corresponds to the variability of one variable being unequal across the range of values of another variable. The presence of heteroskedasticity violates one of the assumptions of inference for linear regression.

# Code to visualize partial residual plot over age:
ggplot(score_model_points, aes(x = age, y = residual)) +
  geom_point() +
  labs(x = "Age", y = "Residual")

Partial residual plot over age.

4. A quick-and-easy Kaggle predictive modeling competition submission!

With the fields of machine learning and artificial intelligence gaining prominence, the importance of predictive modeling cannot be overstated. Therefore, we’ve designed the get_regression_points() function to allow for a newdata argument to quickly apply a previously fitted model to new observations.

Let’s create an artificial “new” dataset consisting of two instructors of age 39 and 42 and save it in a tibble data frame called new_prof. We then use the newdata argument of get_regression_points() to apply our previously fitted model score_model to this new data; the score_hat column holds the corresponding fitted/predicted values.

new_prof <- tibble(age = c(39, 42))
get_regression_points(score_model, newdata = new_prof)
## # A tibble: 2 × 3
##      ID   age score_hat
##   <int> <dbl>     <dbl>
## 1     1    39      4.23
## 2     2    42      4.21

Let’s do another example, this time using the Kaggle House Prices: Advanced Regression Techniques practice competition (its homepage is shown below).

House prices Kaggle competition homepage.

This Kaggle competition requires you to fit/train a model to the provided train.csv training set to make predictions of house prices in the provided test.csv test set. We present an application of the get_regression_points() function allowing students to participate in this Kaggle competition. It will:

  1. Read in the training and test data.
  2. Fit a naive model of house sale price as a function of year sold to the training data.
  3. Use get_regression_points() to make predictions on the test data and write them to a submission.csv file that can be submitted to Kaggle. Note the use of the ID argument so that the Id variable in test identifies the rows (a requirement of Kaggle competition submissions).
library(readr)
library(dplyr)
library(moderndive)

# Load in training and test set
train <- read_csv("https://moderndive.com/data/train.csv")
test <- read_csv("https://moderndive.com/data/test.csv")

# Fit model:
house_model <- lm(SalePrice ~ YrSold, data = train)

# Make predictions and save in appropriate data frame format:
submission <- house_model %>%
  get_regression_points(newdata = test, ID = "Id") %>%
  select(Id, SalePrice = SalePrice_hat)

# Write predictions to csv:
write_csv(submission, "submission.csv")

After submitting submission.csv to the leaderboard for this Kaggle competition, we obtain a “root mean squared logarithmic error” (RMSLE) score of 0.42918, as shown below.

Resulting Kaggle RMSLE score.

5. Visual model selection: plot parallel slopes & interaction regression models

For example, recall the earlier visualizations of the interaction and parallel slopes models for teaching score as a function of age and ethnicity. Let’s present both visualizations side by side below.
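For reference, here is a sketch of code that could produce these two plots (assuming ggplot2 and moderndive are loaded as in the earlier examples); the interaction model uses geom_smooth() while the parallel slopes model uses geom_parallel_slopes():

# Interaction model: a separate slope for each ethnicity
ggplot(evals, aes(x = age, y = score, color = ethnicity)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Age", y = "Teaching score", color = "Ethnicity")

# Parallel slopes model: a common slope with separate intercepts
ggplot(evals, aes(x = age, y = score, color = ethnicity)) +
  geom_point() +
  geom_parallel_slopes(se = FALSE) +
  labs(x = "Age", y = "Teaching score", color = "Ethnicity")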

Interaction (left) and parallel slopes (right) models.

Students might be wondering “Why would you use the parallel slopes model on the right when the data clearly form an ‘X’ pattern as seen in the interaction model on the left?” This is an excellent opportunity to gently introduce the notion of model selection and Occam’s Razor: an interaction model should only be used over a parallel slopes model if the additional complexity of the interaction model is warranted. Here, we define model “complexity/simplicity” in terms of the number of parameters in the corresponding regression tables:

# Regression table for interaction model:
interaction_evals <- lm(score ~ age * ethnicity, data = evals)
get_regression_table(interaction_evals)
## # A tibble: 4 × 7
##   term                    estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                      <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept                  2.61      0.518      5.04   0        1.59     3.63 
## 2 age                        0.032     0.011      2.84   0.005    0.01     0.054
## 3 ethnicity: not minority    2.00      0.534      3.74   0        0.945    3.04 
## 4 age:ethnicitynot minor…   -0.04      0.012     -3.51   0       -0.063   -0.018
# Regression table for parallel slopes model:
parallel_slopes_evals <- lm(score ~ age + ethnicity, data = evals)
get_regression_table(parallel_slopes_evals)
## # A tibble: 3 × 7
##   term                    estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                      <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept                  4.37      0.136     32.1    0        4.1      4.63 
## 2 age                       -0.006     0.003     -2.5    0.013   -0.012   -0.001
## 3 ethnicity: not minority    0.138     0.073      1.89   0.059   -0.005    0.282

The interaction model is “more complex” as evidenced by its regression table involving 4 rows of parameter estimates, whereas the parallel slopes model is “simpler” as evidenced by its regression table involving only 3 parameter estimates. It can be argued, however, that this additional complexity is warranted given the clearly different slopes in the left-hand plot above.

We now present a contrasting example, this time from Subsection 6.3.1 of the online version of ModernDive, involving Massachusetts USA public high schools.[3] Let’s plot both the interaction and parallel slopes models below.

# Code to plot interaction and parallel slopes models for MA_schools
ggplot(
  MA_schools,
  aes(x = perc_disadvan, y = average_sat_math, color = size)
) +
  geom_point(alpha = 0.25) +
  labs(
    x = "% economically disadvantaged",
    y = "Math SAT Score",
    color = "School size"
  ) +
  geom_smooth(method = "lm", se = FALSE)

ggplot(
  MA_schools,
  aes(x = perc_disadvan, y = average_sat_math, color = size)
) +
  geom_point(alpha = 0.25) +
  labs(
    x = "% economically disadvantaged",
    y = "Math SAT Score",
    color = "School size"
  ) +
  geom_parallel_slopes(se = FALSE)

Interaction (left) and parallel slopes (right) models.

In terms of the corresponding regression tables, observe that the table for the parallel slopes model has 4 rows as opposed to 6 for the interaction model, reflecting its greater “model simplicity.”

# Regression table for interaction model:
interaction_MA <-
  lm(average_sat_math ~ perc_disadvan * size, data = MA_schools)
get_regression_table(interaction_MA)
## # A tibble: 6 × 7
##   term                    estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                      <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept                594.       13.3      44.7     0      568.     620.   
## 2 perc_disadvan             -2.93      0.294    -9.96    0       -3.51    -2.35 
## 3 size: medium             -17.8      15.8      -1.12    0.263  -48.9     13.4  
## 4 size: large              -13.3      13.8      -0.962   0.337  -40.5     13.9  
## 5 perc_disadvan:sizemedi…    0.146     0.371     0.393   0.694   -0.585    0.877
## 6 perc_disadvan:sizelarge    0.189     0.323     0.586   0.559   -0.446    0.824
# Regression table for parallel slopes model:
parallel_slopes_MA <-
  lm(average_sat_math ~ perc_disadvan + size, data = MA_schools)
get_regression_table(parallel_slopes_MA)
## # A tibble: 4 × 7
##   term          estimate std_error statistic p_value lower_ci upper_ci
##   <chr>            <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept       588.       7.61     77.3     0       573.     603.  
## 2 perc_disadvan    -2.78     0.106   -26.1     0        -2.99    -2.57
## 3 size: medium    -11.9      7.54     -1.58    0.115   -26.7      2.91
## 4 size: large      -6.36     6.92     -0.919   0.359   -20.0      7.26

Unlike our earlier comparison of interaction and parallel slopes models, in this case it could be argued that the additional complexity of the interaction model is not warranted since the three regression lines in the left-hand interaction plot are already somewhat parallel. Therefore the simpler parallel slopes model should be favored.

Going one step further, notice how the three regression lines in the visualization of the parallel slopes model in the right-hand plot have similar intercepts. It can thus be argued that the additional model complexity induced by introducing the categorical variable school size is not warranted. Therefore, a simple linear regression model using only perc_disadvan (the percentage of the student body that is economically disadvantaged) should be favored.

While many students will inevitably find these results depressing, in our opinion, it is important to additionally emphasize that such regression analyses can be used as an empowering tool to bring to light inequities in access to education and inform policy decisions.

6. Produce metrics on the quality of regression model fits

Recall the output of the standard summary.lm() from earlier:

## 
## Call:
## lm(formula = score ~ age, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9185 -0.3531  0.1172  0.4172  0.8825 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.461932   0.126778  35.195   <2e-16 ***
## age         -0.005938   0.002569  -2.311   0.0213 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5413 on 461 degrees of freedom
## Multiple R-squared:  0.01146,    Adjusted R-squared:  0.009311 
## F-statistic: 5.342 on 1 and 461 DF,  p-value: 0.02125

Say we wanted to extract the scalar model summaries at the bottom of this output, such as R-squared, R-squared adjusted, the F-statistic, and the degrees of freedom df. We can do so using the get_regression_summaries() function.

get_regression_summaries(score_model)
## # A tibble: 1 × 9
##   r_squared adj_r_squared   mse  rmse sigma statistic p_value    df  nobs
##       <dbl>         <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl> <dbl> <dbl>
## 1     0.011         0.009 0.292 0.540 0.541      5.34   0.021     1   463

We’ve supplemented the standard scalar summaries output yielded by summary() with the mean squared error mse and root mean squared error rmse given their popularity in machine learning settings.

The inner workings

How does this all work? Let’s open the hood of the moderndive package.

Three wrappers to broom functions

As we mentioned earlier, the three get_regression_* functions are wrappers around functions from the broom package for converting statistical analysis objects into tidy tibbles, with a few added tweaks designed with the introductory statistics student in mind (a simplified sketch of one such wrapper follows the list below):

  1. get_regression_table() is a wrapper for broom::tidy().
  2. get_regression_points() is a wrapper for broom::augment().
  3. get_regression_summaries() is a wrapper for broom::glance().
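For intuition, here is a hypothetical, simplified sketch of what such a wrapper could look like; the actual moderndive source does considerably more cleanup (rounding, renaming of terms, handling of edge cases):

library(dplyr)

# Hypothetical, simplified wrapper in the spirit of get_regression_table();
# not the actual moderndive implementation.
get_regression_table_sketch <- function(model, conf.level = 0.95) {
  broom::tidy(model, conf.int = TRUE, conf.level = conf.level) %>%
    rename(
      std_error = std.error,
      p_value   = p.value,
      lower_ci  = conf.low,
      upper_ci  = conf.high
    )
}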

Why did we take this approach to address the initial 5 common student questions at the outset of the article?

  1. By writing wrappers around pre-existing functions, instead of creating new custom functions, we minimize re-inventing the wheel.
  2. In our experience, novice R users had a hard time understanding the broom package function names tidy(), augment(), and glance(). To make them more user-friendly, the moderndive wrappers have the more intuitive names get_regression_table(), get_regression_points(), and get_regression_summaries().
  3. The variables included in the outputs of the above three broom functions are not all applicable to introductory statistics students, and of those that are, the names are not intuitive for new R users. We therefore cut some variables from the output and renamed some of the remaining ones. For example, compare the outputs of the get_regression_points() wrapper function and the parent broom::augment() function.
# Streamlined moderndive output:
get_regression_points(score_model)
# Full output of the broom function it wraps:
broom::augment(score_model)

The source code for these three get_regression_* functions can be found on GitHub.

Custom geometries

geom_parallel_slopes() is a custom-built geom extension to the ggplot2 package; the ggplot2 extensions webpage gives instructions on how to create such extensions. The source code for geom_parallel_slopes(), written by Evgeni Chasnovski, can be found on GitHub.

  1. For details on the remaining 5 variables, see the help file by running ?evals.

  2. Note that gender was collected as a binary variable at the time of the study (2005).

  3. For more details on this dataset, see the help file by running ?MA_schools.

moderndive_book's People

Contributors

data-becki, dsolito, erinann, gungormetehan, ismayc, jacobbien, javi-s, joelostblom, kbodwin, kismay, kmkinnaird, mariumtapal, mitsuoxv, mrlinuxfish, nataliegnelson, oscci, pursuitofdatascience, rudeboybert, rutendom, s00singla, simplyjin, smetzer180, starryz, torockel, traviscibot, watanabesmith, whoishomer, wjhopper, xiaochi-liu, yihui


moderndive_book's Issues

kindle and/or pdf versions?

Are there Kindle and/or PDF versions of Modern Dive available for download? If not, there should be, as there is for OpenIntro Statistics. I realize that I could (probably!) make such a resource using tools provided via bookdown, but that is certainly too difficult for most users.

Figure out how to get skimr::kable() to work

The line across the output here is weird. Maybe we can ask the {skimr} authors to tweak this? skimr::kable() doesn't seem to work for me when I try it in bookdown either?

[screenshot: skimr output with a stray horizontal line across it]

4.9.1 Resources

The coggle mind map for data visualization link is broken. Third paragraph in section 4.9.1.

Weary wariness

In section 2.3 you have the sentence: "By the end of this book, you’ll be able to better understand whether these claims should be trusted or whether we should be weary."

While I'm sure studying statistics makes people weary on occasion, I believe the word you want here is "wary".

Cheers!

Clean up index.Rmd -> include_image() function

To use "open in new tab" markdown functionality. Ex:

[Map](https://deskarati.com/2012/03/31/london-undergrounds-real-map/)
[Map](https://deskarati.com/2012/03/31/london-undergrounds-real-map/){target="_blank"}

Suggested DataCamp chapters vs. entire courses

I've been assigning my students the suggested DataCamp chapters at the beginning of each ModernDive chapter to accompany MD. The experience went well with MD chapter 3, which suggested chapters 2 and 4 of David Robinson's "Introduction to the Tidyverse" DC course. Even though the suggestions had students skip chapters 1 and 3, most were fine and could understand what was going on without those chapters. Plus, I had them finish those DC chapters when they got to MD chapter 5 (since they're suggested there).

When students got to MD chapter 4 on tidy data, though, most panicked and struggled a ton (and many just gave up). The suggested DC resource is chapter 3 of @apreshill's "Working with data in the Tidyverse" course. Unlike David Robinson's course, which students were somehow able to skip around in with no problem, Alison's DC course is pedagogically designed to build on previous chapters, so students needed to have completed chapters 1 and 2 to work on 3 (at least that's what the few that went back and did 1–2 told me—I'm not 100% certain of that because I haven't actually taken her DC course 😬).

It might be helpful to not suggest specific chapters from DC courses, but whole courses instead. MD does this later on—chapter 10 suggests the whole courses on "Inference for Numerical Data" and "Inference for Categorical Data", for instance, instead of chapters.

Clarify t-test formula

Hi,

After teaching this past quarter, I realized the formula given for the t-test needs clarification. (I know you guys obviously know the difference, but I found myself wishing for more narrative around the formula provided in the text.)

The formula given is:

$$T =\dfrac{ (\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{ \sqrt{\dfrac{{s_1}^2}{n_1} + \dfrac{{s_2}^2}{n_2}} }$$

[screenshot of the formula above]

But what is not said is that this is the formula for Welch's t-test, where var.equal = FALSE:

[screenshot: Welch's two-sample t-test formula, with var.equal = FALSE]

The modified degrees of freedom would need to be clarified here as you can't use the formula given and actually do the t-test by hand without that too (and when teaching, helps students understand why all of a sudden the df can have decimals):

[screenshot: Welch–Satterthwaite degrees of freedom formula]
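Since the screenshot is not reproduced here, the standard Welch–Satterthwaite approximation for the degrees of freedom is:

$$\nu \approx \dfrac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$$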

The t-test formula for var.equal = TRUE (in all its pooled SE glory) is here:

[screenshot: pooled two-sample t-test formula]
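Again, since the screenshot is not reproduced here, the standard pooled two-sample t-statistic is:

$$T = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}, \qquad s_p = \sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$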

If you want an easy copy/paste, the formulas are in this raw Rmd.

Again thanks for the awesome text- I really enjoyed teaching with it this past quarter!

Alison

An error occurred when knitting

When knitting, an error occurred with the following message:

Error reading bibliography .\bib/packages.bib (line 94, column 1):
unexpected "}"
expecting space or "="
Error running filter pandoc-citeproc:
Filter returned error status 1

SessionInfo:

R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936  LC_CTYPE=Chinese (Simplified)_China.936   
[3] LC_MONETARY=Chinese (Simplified)_China.936 LC_NUMERIC=C                              
[5] LC_TIME=Chinese (Simplified)_China.936    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RevoUtils_10.0.7     RevoUtilsMath_10.0.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.15    bookdown_0.7    digest_0.6.15   rprojroot_1.3-2 backports_1.1.2 magrittr_1.5   
 [7] evaluate_0.10.1 stringi_1.1.6   rmarkdown_1.9   tools_3.4.3     stringr_1.3.0   xfun_0.1       
[13] yaml_2.1.18     compiler_3.4.3  htmltools_0.3.6 knitr_1.20 

Suggestion: Include CI for get_correlation function

For some reason, perhaps historical, it is unusual for people to include 95% CI with correlations. Instead they just discuss whether 'significant' or not.
If the get_correlation function automatically gave you the 95% CI, it would help educate people to stop interpreting correlation coefficients as precise estimates, particularly with small N.
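
In the meantime, here is a sketch of how such an interval can be computed with existing tools, using score and bty_avg from the evals data as an example:

library(moderndive)
library(dplyr)
library(broom)

# Correlation with its 95% CI via base R's cor.test() and broom::tidy();
# a workaround until/unless get_correlation() gains this feature.
cor.test(~ score + bty_avg, data = evals) %>%
  tidy() %>%
  select(estimate, conf.low, conf.high)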

Add docs/ to .gitignore

@ismayc Is there any reason to keep docs/ synced on GitHub, since travis-ci renders the book to the gh-pages branch? I feel like we should only sync changes to .Rmd and .yml files and images/data files, otherwise it increases the chances of merge conflicts and makes for harder commit/branch comparisons.

"10.9.8 Simulated data" needs some clarification

First of all thank you so much for your book. It made sampling and hypothesis testing so much more clear for me. Something I never really understood in med school.

One thing I tripped over a little bit is the section "10.9.8 Simulated data" in the chapter on hypothesis testing the difference of means. Especially the "tactile experiment" wasn't clear the first time I read it. I needed like 5-6 reads to fully understand the idea behind the experiment:

The next step is to put the two stacks of index cards together, creating a new set of 68 cards. If we assume that the two population means are equal, we are saying that there is no association between ratings and genre (romance vs action). We can use the index cards to create two new stacks for romance and action movies. First, we must shuffle all the cards thoroughly. After doing so, in this case with equal values of sample sizes, we split the deck in half.

The thing that wasn't clear to me was that you build two new stacks for "action" and "romance" movies that do not consist entirely of one genre but are a mixture of both. Since the H0 hypothesis is that there is no difference between the ratings of the two genres, that's a valid way to simulate the H0 hypothesis. But that wasn't clear to me. Maybe a small sketch of the experiment or something would make this much more clear. Even a sentence like "Note that the new stacks for action and romance movies do not consist entirely of the specific genre" or something along those lines would be awesome.

I hope my point is clear. I would love to assist clarifying this section

Thanks again for your awesome content 👍

chapter 8 typos

"They were listed on the contexts of the box that the bowl came in"

Should be "contents", I think.

Give a summary of the `get_` functions

I think students may struggle a bit to understand when to use each of the three get_ functions since they are similar. Would be useful to add a closing summary to Chapter 7 discussing this. (Not pressing, but helpful to include on the next big release.)

Add summary table of all models at end of Ch7

For the five models (a sketch of the corresponding lm() calls follows the list):

  • 1 numerical x: intercept + slope
  • 1 categorical x: baseline + offsets
  • interaction: baseline, slopes, offset to baseline, offset to slopes
  • parallel slopes: intercept + offsets + single slope
  • 2 numerical: intercept + slopes
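
A hypothetical sketch of the corresponding lm() calls (y, x1, x2, g, and dat are placeholder names, not variables from the book):

lm(y ~ x1,      data = dat)  # 1 numerical x: intercept + slope
lm(y ~ g,       data = dat)  # 1 categorical x: baseline intercept + offsets
lm(y ~ x1 * g,  data = dat)  # interaction: baseline intercept & slope + offsets to both
lm(y ~ x1 + g,  data = dat)  # parallel slopes: intercept + offsets + single slope
lm(y ~ x1 + x2, data = dat)  # 2 numerical x: intercept + two slopes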

Address all warning messages in Ch9 due to infer version bump to v0.4.0

Example 1

bootstrap_distribution %>% 
  visualize(obs_stat = x_bar)

yields

Warning message:
`visualize()` shouldn't be used to plot p-value. Arguments `obs_stat`, `obs_stat_color`, `pvalue_fill`, and `direction` are deprecated. Use `shade_p_value()` instead. 

Example 2

bootstrap_distribution %>% 
  visualize(endpoints = percentile_ci, direction = "between")

yields

Warning message:
`visualize()` shouldn't be used to plot confidence interval. Arguments `endpoints`, `endpoints_color`, and `ci_fill` are deprecated. Use `shade_confidence_interval()` instead.
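
Presumably the replacements suggested by these warnings look something like the following (the direction value for Example 1 is an assumption for illustration):

# Example 1, updated:
visualize(bootstrap_distribution) +
  shade_p_value(obs_stat = x_bar, direction = "both")

# Example 2, updated:
visualize(bootstrap_distribution) +
  shade_confidence_interval(endpoints = percentile_ci)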

Error 1

The following

sampling_distribution %>%
  visualize(fill = "salmon")

yields

Error in vapply(theory_types, function(x) { : values must be length 1,
 but FUN(X[[1]]) result is length 0

and was replaced with

ggplot(sampling_distribution, aes(x = stat)) +
  geom_histogram(bins = 10, fill = "salmon", color = "white")

Add discussion of power into Chapter 10

This book is a phenomenal resource! I'm using parts of it for a workshop I'm giving to graduate students in microbiology. The students have had no formal prior instruction in statistics (or have forgotten everything they learned in a stat 1000 class years ago).

Chapter 10 is an excellent introduction to the basics of hypothesis testing. For my purposes, the next logical thing to discuss after alpha and beta is power. I've pulled in relevant material from here: http://www.statisticsteacher.org/2017/09/15/what-is-power/ for the first iteration of my workshop, and suggest that a similar discussion would be at home in your book.

I tend to think that if I find something helpful others will too, so thought I would pass these thoughts along to you.

Again, thank you thank you thank you for developing this amazing resource.

Beef up Chapter 11: Wrapping Up

In Chapter 11, include links to all code/data referred to in current Section 1.1.1 What you will learn from this book.

What do we mean by data stories? We mean any analysis involving data that engages the reader in answering questions with careful visuals and thoughtful discussion, such as How strong is the relationship between per capita income and crime in Chicago neighborhoods? and How many f**ks does Quentin Tarantino give (as measured by the amount of swearing in his films)?. Further discussions on data stories can be found in this Think With Google article.

For other examples of data stories constructed by students like yourselves, look at the final projects for two courses that have previously used ModernDive:

Normalization or standardization? Chapter 10

Maybe consider using standardization in place of normalization.

In my experience, it is common for students (and researchers) to mistakenly believe that normalization implies they are making the variable/data normal when in fact they are simply transforming their variable/data to a standard scale.

Just as the standard error is a special type of standard deviation, standardization can be thought of as a special type of normalization. It is such an important type that it gets its own name.

Section 10.8.1:

What is commonly done in statistics is the process of normalization. What this entails is calculating the mean and standard deviation of a variable. Then you subtract the mean from each value of your variable and divide by the standard deviation. The most common normalization is known as the z-score

Release steps

Copied from corresponding Google Doc (only the first part):

1. Final edits to .Rmd files

a) Formatting

  • Borders of all histograms should be white, making the binning structure easier to read.
  • Use computer font for all computing concepts: function(), data_frame, variable_names, package_name.
  • Remove all &, %, and _ in fig.cap for R chunk options since they break PDF build
  • Ensure no code exceeds 80 characters. While HTML code block outputs tolerate >80, PDF code block outputs do not.
  • Ensure skimr::skim() code is not actually run, but all calls and outputs are hard-coded (with hist preview removed and --- output cut down to 80 characters), since this will break all knitr::kable() code for rest of book.
  • Apply {styler} code to all internal code (i.e. code the reader doesn't see) using styler's RStudio Add-In.

b) References

  • Search for all broken references (search for @ref and ?? in html output)
  • Maintain Chapter/Section/Subsection naming consistency
  • Ensure all Figures & Tables are referenced.
  • If possible, ensure index is up-to-date by adding \index{<term>} tags

c) Dataset management

  • Remove all load() calls
  • Ensure all CSV’s are loaded from moderndive.com and not other sites. See index.Rmd, set-options R chunk, copy all needed csv files to docs/
  • Create bit.ly links for all linked Google Docs

d) Spell check

  • Do it in RStudio.
  • Scan over all changed content to make sure grammar is correct.

e) R Scripts: Make sure docs/scripts folder contains the appropriate scripts

  • Development version should link to moderndive.netlify.app/scripts since code may be updated in development causing problems if someone on moderndive.com is trying to look over all the code for what is there
  • If possible, go over all code blocks and ensure that purl = FALSE is set for all code chunks we don't want shared.

f) ETC

  • Ensure only Shutterstock licensed photos/images are used.
  • If possible, delete contents of rds/ and rebuild all .rds files in case any got stale.

2. Final sanity checks

  • Update all R packages before one final build
  • Build PDF version
  • Ensure that .travis.yml does not have any GitHub development .9000 versions of packages.

3. Switch from dev to release version on GitHub:

a) Create new GitHub branch

  • Create update-release branch to be used for pull request.

b) Edit index.Rmd

At the top

  • YAML: Change r format(Sys.time(), '%B %d, %Y') to release date of form “January 1, 1970”

set-options R chunk

  • Current version information:
    • Remove .9000 and bump version number
    • Set date to release date: replace format(Sys.time(), '%B %d, %Y') with date of form “January 1, 1970”
  • Latest release information:
    • Update to release values
  • CRAN packages needed:
    • Ensure only CRAN versions used
    • Periodically: Do a search for use of all packages and remove those no longer used

Lower in the file

  • Comment out development version warning block
  • Flip dev_version boolean to FALSE

c) Edit preface.Rmd

Section 1.4 “About this book”

  • Latest published version: ensure info is correct
  • Previous versions: Add previous version info
  • Add previous version to previous_version/ folder. Steps:
    • Go to previous_versions/ and add new subfolder for soon-to-be previous version
    • Go to GitHub releases pages and download .zip of source code of soon-to-be previous version
    • In index.Rmd
      • Uncomment notice about this being an out-of-date version and hard-code version number
      • Ensure comment is surrounded by *** to highlight this note
    • Build book and delete redundant nested docs/previous_versions
    • Copy contents of docs/ folder to the new subfolder in previous_versions/

d) Edit other files

  • NEWS.md: Update with all significant changes from this TODO
  • .gitignore: Remove bib/packages.bib and docs/

e) Final step

  • Clean and rebuild book

4. Publish release

  • Merge dev-to-release-vX.X.X branch into master
  • Merge master into release via PR (trick to remember: right into left). Or consider doing this?
  • Rebuild release/docs and commit. You might need to remove docs/ from .gitignore to be able to commit to release branch
  • Tag release on GitHub
  • Add link to Previous versions around line 474 of index.Rmd of most recent previous version.
  • Send email to MailChimp
  • Make sure all links in left-hand navbar index work
  • Relatedly, ensure moderndive package on CRAN is updated around same time as ModernDive book version bump.

5. Revert back to dev version on GitHub:

a) Create new GitHub branch

  • Create release-to-dev branch to be used for pull request. See example commit on GitHub to revert back to devel version by undoing all changes to:

b) Edit index.Rmd

At the top

  • YAML: Change release date back to r format(Sys.time(), '%B %d, %Y')
  • Return “development branch” version warning block and flip dev_version flag

set-options R chunk

  • Current version information:
    • Add .9000 to version number
    • Set date to r format(Sys.time(), '%B %d, %Y')

c) Edit other files:

  • NEWS.md: Add section for new dev version
  • Add bib/packages.bib to .gitignore
  • Revert .travis.yml so that infer and moderndive packages are github dev versions

Broken Image

Hello,

Just wanted to give you a heads up that figure 6.2 isn't rendering. I just see a big white box.

GitHub/Publishing

New desiderata:

  • Transfer ownership of repo from https://github.com/ismayc/moderndiver-book to ModernDive GitHub organization https://github.com/moderndive/
  • From there have
    • main branch be source code for latest released version
    • devel branch for source code for the development version
  • Deploy/host the different versions of ModernDive via netlify as follows:
    • Have release version at moderndive.com deployed via netlify instead of HostGator as per this tweet
    • Development version at http://moderndive.netlify.com
    • Past versions output/source code at moderndive.github.io/<old_version>
  • Fix bookdown.org ModernDive cover issue due to https on HostGator

Originally stated action items: Use travis deployment instead of manual builds for both ModernDive.com and

Load moderndive::evals

Replace all load(url("http://www.openintro.org/stat/data/evals.RData")) with library(moderndive)
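
The replacement would look roughly like this:

# Before: download the data from OpenIntro
load(url("http://www.openintro.org/stat/data/evals.RData"))

# After: evals ships with the moderndive package
library(moderndive)
data(evals)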

3.2 Markdown structure error

Following the code in the "ModernDive" markdown book located on bookdown.org, I encountered the following errors in the .md html output after running data(flights). Per Ismay's book "Getting Used to R, RStudio, and R Markdown" there was a comment about chunks: calling functions listed above will result in an error if not called in the current chunk, or something to that effect. For the remainder of chapter 3, I had to use the eval option to bypass the .md error in the console, otherwise all processes halted. Any thoughts or instruction on where I went wrong?

slope_obs object not found

Having just started recently, I am working my way through this very interesting project.
The knitting hangs at this piece of code contained in 11-inference-for-regression.Rmd:

null_slope_distn %>%
visualize(obs_stat = slope_obs, direction = "greater")

It says slope_obs object not found. Just by looking with my untrained eyes, it seems that that data frame is defined at a later stage.
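
One hypothetical fix, assuming the score ~ bty_avg model used in that chapter, is to compute the observed slope with infer before the visualize() call:

library(moderndive)
library(dplyr)
library(infer)

# Observed slope of the fitted regression line (assumed model: score ~ bty_avg)
slope_obs <- evals %>%
  specify(score ~ bty_avg) %>%
  calculate(stat = "slope")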

Add correspondence to DataCamp for flipped classrooms

This is likely to change with time,
and will need to be regularly updated, but

Roughly thinking:

Pre-course/ initial lab: Installing R and R Studio
based on Getting used to R, RStudio, and R Markdown
Read: R Packages: A Beginner's Guide online

MD 1& 2 = Working with the RStudio IDE parts 1/2,
Introduction to R,
DataCamp Cheatsheets link,
Intermediate R
MD 3 = Introduction to the Tidyverse,
Data Vis w/ggplot part 1
MD4 = Introduction to Data,
Data Cleaning in R
MD5 = Data Manipulation in R with Dplyr,
Joining Data in R with Dplyr
MD6 = Exploratory Data Analysis,
Intro to Stats: Correlation and Linear Regression,
Correlation and Regression?? (seems like pick one of b or c)
MD7 = Intro to Stats: Mult Regression,
Multiple &Logistic Regression
MD8 = Foundations of Inference
MD9 = Inference for Numerical Data
MD10 = ??
MD 11= Inference for Linear Regression
MD12 = Communicating with the Tidyverse,
Reporting with R Markdown,
Building Web Applications in R with Shiny

Also, Introduction to Data (MD4?), Correlation and Regression (MD6?) might fit in somewhere
Maybe more case studies?

Introducing forcats

Re this section:
You can manually specify which continent to use as baseline instead of the default choice of whichever comes first alphabetically, but we leave that to a more advanced course. (The forcats package is particularly nice for doing this and we encourage you to explore using it.)
It is great to learn about forcats: I seem to remember stumbling on it after a very frustrating experience of trying to reorder levels of a factor. While I can see that you don't want to digress too much from explaining about regression, I wonder if it would not be worth saying a bit more about R's default behaviour with factor levels, as I think many people get stuck on this when they come to analyse their own data.

C.2 Interactive graphics

When I run the commands of topic C.2.1 the graph remains empty!
Could you help me, or point me to the right place where I can ask questions?
...
dyRangeSelector(dygraph(flights_summarized))

If I use ts {stats}, the graph appears but with an empty X scale

Best Regards!

section 9.4

Suggestion: when comparing histograms for bootstrap and sampling distributions, it would be better if one could be superimposed on the other, or at least have them shown side by side. Currently you have to scroll up and down to compare them.
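
A sketch of what superimposing could look like (the data frame and column names here are assumptions based on the surrounding chapter):

library(dplyr)
library(ggplot2)

# Stack the two simulated distributions, labelled by their source,
# then overlay semi-transparent histograms on shared axes.
both <- bind_rows(
  bootstrap_distribution %>% mutate(distribution = "bootstrap"),
  sampling_distribution %>% mutate(distribution = "sampling")
)
ggplot(both, aes(x = stat, fill = distribution)) +
  geom_histogram(position = "identity", alpha = 0.4, bins = 30, color = "white")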

Error encountered rendering book

Hi,

I am trying to render the book in html_book using the following command

bookdown::render_book("index.Rmd", "bookdown::html_book")

I encountered the following error:
Error in split_chapters(output, page_builder, number_sections, split_by, :
The document must start with a first (#) or second level (##) heading
In addition: There were 50 or more warnings (use warnings() to see the first 50)

What am I doing wrong?

Herman

Table links and automatic numbering not working in HTML

While figure linking within the text is working, the same links for tables are not. They all render as "Table ??" where the "??" links to a non-existent header (I think). See image:

[screenshot: table cross-references rendering as "Table ??"]

Issue was present throughout chapters 8 and 9, haven't investigated further. From browsing Rmd it seems like the references are correctly spelling the names of code chunks.

Consider elevating the statistical background appendix to a short chapter

Filed per https://twitter.com/ModernDive/status/1073340286091386881. May I suggest elevating the statistical background section to a short chapter.

Currently there are no tidy, bayesian, R-based introduction to statistics books that I am aware of. As such, initial statistics must be taught from a more traditional book (e.g. https://www.amazon.com/Introductory-Statistics-R-Computing/dp/0387790535/). It is unlikely an instructor would start with one such book for the basic first section and then transition to a more modern approach for future topics (confidence, visualization, modeling, hypothesis testing, etc).

Modern Dive could help solve this by adding an introductory chapter that expands the Statistical Background appendix. It can probably be less than the equivalent chapters in other stats books since, as was said in the tweet, the concepts will hopefully be interspersed within other sections to allow learning as doing. (I would recommend reviewing the other sections to ensure they do cover using the mean, median, mode, quantiles, SD, variance, and several common distributions such as normal/gaussian, bernoulli, beta, binomial, uniform, geometric, poisson, gamma, log normal, exponential, and general power-law distributions).

I would suggest the goal is not to teach students when and how to use these concepts (as hopefully the rest of the book takes care of that), but provide context so that when they see them in use they understand how they fit into statistics as a whole. (For example, https://blog.cloudera.com/blog/2015/12/common-probability-distributions-the-data-scientists-crib-sheet/ gives an interesting quick explanation of basic distributions and their relationships.)

To that end, it may even be beneficial to mention common traditional statistics (p-value, t-test, etc) in this section and then point to the Appendix where they are explained, not necessarily to give students an alternative to the primary approaches taught, but simply so they understand where these things they will hear commonly sit in the context of what they have learned.

And thank you for what is ultimately the go-to reference for a tidy approach to statistics. I think it's sorely needed and an excellent book with or without modification. I look forward to buying a hard copy as soon as they come off the presses!

Small changes before Saturday

This will serve as future release checklist as well.

  • Just this time: get package list consistent
  • Check for broken @ and ?? references
  • Ensure Chapter/Section/Subsection consistency. Ex: 3/3.1/3.1.2
  • Hide all Learning Check solutions
  • Take banner off of master branch docs deploy after we create a dev branch.
  • Make changes in index.Rmd to update versions of development and released.
  • Tag release on GitHub for new released version.
  • Update past versions list in Section 1.4. See #23 and #31
  • Change travis to look for changes on the dev branch and then publish to gh-pages (for Albert's class on Monday).
  • Create a referral page from https://github.com/ismayc/moderndiver-book to https://github.com/moderndive/moderndive_book (Contents in the old repo should be removed prior to the next major release--likely this summer.)
  • Change links to script files to be to https://moderndive.com/scripts/ for the released version of the book. Not that big of a deal but it would be confusing if we changed some of the code in the development version of the book and then someone was trying to run the ModernDive.com code instead. We should probably set a toggle for this. (Delayed until next release.)
