rafalab / dsbook

Repository for data science book

License: Other

R 22.50% TeX 77.31% CSS 0.19%

dsbook's Introduction

Hi there 👋

I am a Professor and Chair of the Department of Data Science at Dana-Farber Cancer Institute.
I am also a Professor of Applied Statistics at Harvard.
My research focuses on Genomics, but I am interested in applications in general.
I teach several Data Science courses.
I blog at Simply Statistics.
The best way to keep up with the latest news is to follow @rafalab.

Learn more about me at http://rafalab.dfci.harvard.edu/

dsbook's People

ituco xie186 frfy michaschwab ilirsheraj desautm bertst sergiotrans allenzone83 snowdj mrmelchi kapilbhatt5796 almouthnna nickyfoto sternshein hervwan sanchitt feigeliudan01 it-gro gkovaig ola-pettersson moloneb louisk123 ag229 changrbc alabarga alvarolarreategui rdparker29 dherrell installingrattle nguyenchithien emechebe allensmile sarangkapse fw1121 fozyurt dennism1025 jakevc prasadbc xkudsraw anhnguyendepocen glengemann wangdata tratschonkel nachoag76 alanponce stewartli betagamer25 learneriq joebo2014 bdivet fanhe020 pappuks bribeir0 techcolony ntodata gkud jonjon2018 sven56 hieuqtran ss1git jelliclecats thom-j-h roberttheachiever jruss1321 san-git schenx technocrat pavelelk agaiha meglantz gwierzchowski ayesha-jamil18 shadrack4292 ytsiamis jlh2018 srshfo jiyeon7 aejb22122 vallecillos chakradhar27 mgchild nicolashefti tdnguyen2020 lunaweng fnavarro94 rakiucm emmrou cbuwana atzakas arunkumarramanan changrong1023 pandulis davidchang49 datavizi lasvegascoder rodrigochaves73 ddp01 millacurafa benschepperle

dsbook's Issues

74.1 Clustering: Broken figure

There is an issue with the figure at the very end of section 74.1. It is not properly displayed and shows up as a broken image symbol.

Here's the figure:

dsbook/ml/clustering.Rmd

Line 77 in bbfe610

```{r dendrogram-2, , out.width="100%", fig.height=4}
```

And here's the view in the textbook (I linked to section 74.2 since it's at the very end of 74.1):

https://rafalab.github.io/dsbook/clustering.html#k-means

Thanks!

Misplaced filter() call in ch. 33 user effect plot

I think the summarize() and filter() calls in the below code (from 33.7.6) are out of order. The intent is to filter for users with more than 100 ratings, but filtering after summarizing does nothing because summarize automatically ungroups the dataframe, so n() is equal to the number of rows.

train_set %>% 
  group_by(userId) %>% 
  summarize(b_u = mean(rating)) %>% 
  filter(n()  >=100) %>%
  ggplot(aes(b_u)) + 
  geom_histogram(bins = 30, color = "black")

Filtering between the group_by and summarize achieves the intended purpose:

train_set %>% 
  group_by(userId) %>% 
  filter(n() >=100) %>%
  summarize(b_u = mean(rating)) %>% 
  ggplot(aes(b_u)) + 
  geom_histogram(bins = 30, color = "black")

One other thing: the text says to filter for users with over 100 ratings, but the code uses >= 100.
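
The difference between the two orderings can be checked on a toy data set. This is a minimal sketch, assuming dplyr is installed; the `ratings` table and the threshold of 2 are made up for illustration and are not from the book.

```r
library(dplyr)

# Toy data (made up): user 1 has 3 ratings, user 2 has 1 rating.
ratings <- tibble(
  userId = c(1, 1, 1, 2),
  rating = c(4, 5, 3, 2)
)

# filter() after summarize(): the summary is no longer grouped, so n() is
# the number of summary rows (2 here) and is the same for every user.
after <- ratings %>%
  group_by(userId) %>%
  summarize(b_u = mean(rating)) %>%
  filter(n() >= 2)

# filter() before summarize(): n() is evaluated within each user.
before <- ratings %>%
  group_by(userId) %>%
  filter(n() >= 2) %>%
  summarize(b_u = mean(rating))

nrow(after)   # both users survive, although user 2 has a single rating
nrow(before)  # only user 1 survives
```

With a single grouping variable, `summarize()` returns an ungrouped table, which is why the first `filter()` sees one group containing every user.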

recommendation-systems.Rmd

Line 39

This file creates a table that is described as "Let’s show the matrix for seven users and four movies." However, the code creates a table of only six users (verified by running and in the pdf of the book).

66.3 plot_cond_prob function

It is in the .Rmd source code, but not available to the reader of the output. I suggest including it in dslabs:

plot_cond_prob <- function(p_hat=NULL){
  tmp <- mnist_27$true_p
  if(!is.null(p_hat)){
    tmp <- mutate(tmp, p=p_hat)
  }
  tmp %>% ggplot(aes(x_1, x_2, z=p, fill=p)) +
    geom_raster(show.legend = FALSE) +
    scale_fill_gradientn(colors=c("#F8766D","white","#00BFC4")) +
    stat_contour(breaks=c(0.5),color="black")
}

MNIST dataset (Section 27.3)

The MNIST data in dslabs_0.7.3 is stored under the object mnist_27. Section 27.3 in the book gets this wrong, when it asks the reader to use mnist. Furthermore, the name of the outcome vector is now y, not labels. The data I get for y[5] and y[6] in the example further down also seems to be different (the numerals 7 and 2 respectively).

Header typo

Typo line 700 in recommendation-systems.Rmd (missing a 'c')

Connetion to SVD and PCA -> Connection to SVD and PCA

trump_tweets

Hello,

I tried to compile the code in RStudio to generate the book, but I get the same error each time:
Quitting from lines 13757-13758 (book.Rmd)
Error in eval(expr, envir, enclos) : object 'trump_tweets' not found
Calls: ... handle -> withCallingHandlers -> withVisible -> eval -> eval
In addition: There were 18 warnings (use warnings() to see them)

Arnaud

Mistake in 33.9.2 formulas?

In ml/recommendation-systems.Rmd lines 430 or 434 (section 33.9.2), are there Ns missing?

I believe line 430 should be $$ \frac{1}{N}\sum_{u,i} \left(y_{u,i} - \mu - b_i\right)^2 + \frac{\lambda}{N} \sum_{i} b_i^2 $$ or line 434 should be $$ \frac{1}{N\lambda + n_i} \sum_{u=1}^{n_i} \left(Y_{u,i} - \hat{\mu}\right) $$.
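
The two proposed corrections can be related by a short derivation. This is a sketch that assumes the objective currently in the book has a penalty $\lambda \sum_i b_i^2$ with no $1/N$ factor, which is what the report implies; the source line itself is not quoted here. Setting the derivative with respect to a single $b_i$ to zero, where the inner sum runs over the $n_i$ ratings of movie $i$:

$$ -\frac{2}{N}\sum_{u=1}^{n_i} \left(y_{u,i} - \mu - b_i\right) + 2\lambda b_i = 0 \quad\Longrightarrow\quad \hat{b}_i(\lambda) = \frac{1}{N\lambda + n_i} \sum_{u=1}^{n_i} \left(y_{u,i} - \mu\right) $$

If the penalty is instead $\frac{\lambda}{N} \sum_i b_i^2$, the factor $1/N$ cancels from both terms:

$$ -\frac{2}{N}\sum_{u=1}^{n_i} \left(y_{u,i} - \mu - b_i\right) + \frac{2\lambda}{N} b_i = 0 \quad\Longrightarrow\quad \hat{b}_i(\lambda) = \frac{1}{\lambda + n_i} \sum_{u=1}^{n_i} \left(y_{u,i} - \mu\right) $$

So either change alone makes the objective and the estimate consistent; applying both would not.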

Typos in chapter 10

Section 10.1 2nd paragraph talks about comparing four percentages, but there are five in each pie chart.

Paragraph right after the first bar charts talks about following a horizontal line to the x-axis, I think that should be y-axis instead.

fiftystater package no longer supported

dsbook/R/motivation.Rmd

Line 31 in ea9c29f

library(fiftystater)

The "fiftystater" package was removed from the CRAN repository as of 2018-10-10.
https://cran.r-project.org/web/packages/fiftystater/index.html

On POSIX filenames are case-sensitive

I am trying to compile the book into a PDF and I'm getting the following error:

Quitting from lines 258-259 (book.Rmd)
Error in knitr::include_graphics(file.path(img_path, "RStudio.png")) :
Cannot find the file(s): "R/img/RStudio.png"

That file, in fact, is called "rstudio.png" and not "RStudio.png"
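
A mismatch like this can be located with a case-insensitive lookup. The following is a minimal base R sketch; `find_ignoring_case` is a hypothetical helper written for illustration (it is not part of knitr or the book), and the demo directory is created in a temporary location rather than in the repository.

```r
# Hypothetical helper: return the files in `dir` whose names equal `name`
# once case is ignored.
find_ignoring_case <- function(dir, name) {
  files <- list.files(dir)
  files[tolower(files) == tolower(name)]
}

# Demo in a temporary directory (not the real R/img folder).
img_dir <- file.path(tempdir(), "img_demo")
dir.create(img_dir, showWarnings = FALSE)
file.create(file.path(img_dir, "rstudio.png"))

# The name used in the code differs from the name on disk only by case.
find_ignoring_case(img_dir, "RStudio.png")
```

On a case-sensitive file system the exact name returned here is the one that has to appear in the `include_graphics()` call, or the file has to be renamed to match.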

A doubt in Section 27.4.5

I am a little confused by this section, which deals with the confusion matrix and the F1 score. The value found in best_cutoff is 66, but the value written in the text is 65. This would not be an issue, except that the code that follows uses best_cutoff as computed. It would be helpful to know which one is correct.

missing 3 small parts of code in section 19.2.1

Dear Rafael,

In section 19.2.1, 'Understanding confounding through stratification', of the online version of the book, three small pieces of R code appear to be missing:

dat <- Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(BB_strata = round(BB/G, 1),
         HR_per_game = HR/G,
         R_per_game = R/G) %>%
  filter(BB_strata >= 2.8 & BB_strata <= 3.9)

dat %>%
  ggplot(aes(HR_per_game, R_per_game)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  facet_wrap(~BB_strata)

dat %>% group_by(BB_strata) %>%
  summarize(slope = cor(HR_per_game, R_per_game)*sd(R_per_game)/sd(HR_per_game))

These three pieces of code draw the 3 x 4 grid of scatter plots (HR_per_game versus R_per_game) and compute the slope within each stratum.
They belong before the following passage in the online version:

... In this case, the slopes do not change much from the original:

#> # A tibble: 12 x 2
#> BB_strata slope
#>
#> 1 2.8 1.53
#> 2 2.9 1.57
#> 3 3 1.52
#> 4 3.1 1.49
#> 5 3.2 1.58
#> 6 3.3 1.56
#> # ... with 6 more rows
...

These three pieces of code can be found in the file 'linear-models.Rmd' at
https://github.com/rafalab/dsbook/blob/master/regression/linear-models.Rmd

Please add these three pieces of R code back to the online version of the book. The PDF of the whole book may also be affected.

Please check the online version and, if I am correct, restore the code in the next update.

Thanks in advance.

Best regards,
Jimmy

Smoother in second plot 28.3.2 not shown

The smoother in the second plot in section 28.3.2 "Beware of default smoothing parameters" is not visible.

Running the code to reproduce the figure

polls_2008 %>% ggplot(aes(day, margin)) + geom_point() + geom_smooth()

resulted in this error:

geom_smooth() using formula 'y ~ x'
Warning message:
Computation failed in stat_smooth():
'what' must be a function or character string

Ch 14.11 - typo?

"We say that a random quantity is normally distributed with average m and standard deviation s if its probability distribution is defined by:"

F(a) = rnorm(a, m, s)

====
Should it be pnorm?
Propose:

F(a) = pnorm(a, m, s)
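
The distinction can be checked directly in base R. This is a small sketch; the values of `m`, `s`, and `a` are arbitrary choices for illustration.

```r
m <- 0     # average (arbitrary example value)
s <- 1     # standard deviation (arbitrary example value)
a <- 1.96

# pnorm() evaluates the normal CDF, F(a) = Pr(X <= a).
pnorm(a, m, s)

# rnorm() generates random draws. Its first argument is the number of
# draws, so it cannot define a probability distribution function F(a).
set.seed(1)
draws <- rnorm(5, m, s)
length(draws)
```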

Incorrect intermediate value

dsbook/dataviz/gapminder.Rmd

Line 314 in d549886

 then, to determine `x`, we need to compute $10^{1.5}$, which is not easy to do in our heads. The advantage of using logged scales is that we see the original values on the axes. However, the advantage of showing logged scales is that the original values are displayed in the plot, which are easier to interpret. For example, we would see "32 dollars a day" instead of "5 log base 2 dollars a day". 

$10^{1.5}$ should read $10^{0.5}$

Typo Chapter 4, Section 4.5

In the subsection "4.5 Variable names", second sentence of the first paragraph:

"Some basic rules in R is that they have to start with a letter, can’t contain spaces and should be variables that are predefined in R."

I guess it should be:

"Some basic rules in R is that they have to start with a letter, can’t contain spaces and should not be variables that are predefined in R."

Section 8.15 Exercises: Question 15 should cite to question 10 (?) instead of question 3

In option c and d of question 15:
As seen in question 3, ...

Question 3 in the book is about a plot image ("Study the following boxplots showing population sizes by country:"),
while question 10 is related to the normal distribution vs. real data:

Notice that the approximation calculated in question two is very close to the exact calculation in the first question. Now perform the same task for more extreme values. Compare the exact calculation and the normal approximation for the interval (79,81]. How many times bigger is the actual proportion than the approximation?

getting-started.Rmd screenshot error

There is a typo in the screenshot just below this sentence:

"You can also use the key binding: Ctrl+Shift+Enter on Windows or command+shift+return on the Mac."

The code in the screenshot says geom_points instead of geom_point. It confused some people who were trying to code along.

A smaller note for this chapter - some people are trying to code along and we haven't yet explained that you need to install packages like tidyverse and dslabs before using them. Might help to add a passing reference to the following section, https://rafalab.github.io/dsbook/getting-started.html#installing-r-packages.

Put eval=TRUE in install-libraries.Rmd

Can we put eval=TRUE everywhere in install-libraries.Rmd?

In this way it is possible to install all the required libraries and compile the book with a simple:

source(purl("install-libraries.Rmd"))

Discrepancies between .Rmd and .html, rules for contribution

There are discrepancies in the .Rmd and .html bookdown pages. The ones I have found relate to typos that look like they still need to be corrected in the book but have been fixed in the code.

Compare "accountfor" at the end of the second paragraph of this section: https://rafalab.github.io/dsbook/linear-models.html

and corresponding R markdown:

dsbook/regression/linear-models.Rmd

Line 154 in ea9c29f

 When we are not able to randomly assign each individual to a treatment or control group, confounding is particularly prevalent. For example, consider estimating the effect of eating fast foods on life expectancy using data collected from a random sample of people in a jurisdiction. Fast food consumers are more likely to be smokers, drinkers, and have lower incomes. Therefore, a naive regression model may lead to an overestimate of a negative health effect. So how do we do account for confounding in practice? 

I will try to open a pull request to resolve this by running knit on edited files and putting the knit files in the directory currently holding the html files. However, I have never built something in bookdown, learned Git mostly through this course, and there is no contributing guide for the project. Any guidance would be appreciated.

`sentiments` in tidytext has substantially changed, raising license issues and code errors

Section 27.3 of the textbook (wrangling/text-mining.Rmd section on sentiment analysis) performs sentiment analysis using the tidytext package. The bulk of the analysis relies on the nrc lexicon. However, this lexicon is no longer part of the sentiments object: the new version of tidytext has removed nrc because it has a license that states it cannot be redistributed. See this tidytext issue for more information.

All sentiment lexicons that use non-numeric sentiments have either been removed due to license issues (nrc) or now explicitly trigger an alert that they require a license for commercial use (loughran). I assume this qualifies as commercial use. Therefore we need to replace or remove the nrc and loughran lexicons.

Equation correction

dsbook/inference/confidence-intervals-p-values.Rmd

Line 101 in d549886

 is `0.995 - 0.005 = 0.99`. We can use this approach for any proportion $p$: we set `z = qnorm(1 - (1 - p)/2)` because $1 - (1 - p)/2 + (1 - p)/2 = p$. 

The formula should read $1 - (1 - p)/2 - (1 - p)/2 = p$, to correctly illustrate that $p$ is the area remaining between the two tails.
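
A quick numerical check of the corrected identity, as a sketch in base R with `p = 0.99` taken from the quoted passage: the area between `-z` and `z` under the standard normal curve should equal `p`.

```r
p <- 0.99
z <- qnorm(1 - (1 - p) / 2)

# Area between -z and z: upper cutoff 1 - (1 - p)/2 minus lower tail (1 - p)/2.
coverage <- pnorm(z) - pnorm(-z)

all.equal(coverage, p)
```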

Chapter 68: Test from $train or $test ?

I'm not sure if this is on purpose or a typo.
At the very beginning of Chapter 68 there is this code:

set.seed(123)
index <- sample(nrow(mnist$train$images), 10000)
x <- mnist$train$images[index,]
y <- factor(mnist$train$labels[index])

index <- sample(nrow(mnist$train$images), 1000)
x_test <- mnist$train$images[index,]
y_test <- factor(mnist$train$labels[index])

Should the second group of instructions use mnist$test instead?

Typo in Chapter 16.8: Power

Hello,

There appears to be a small typo in the calculation for the confidence intervals for the spread.

I believe it should read as follows:

N <- 25
x_hat <- 0.48
(2 * x_hat - 1) + c(-1.96, 1.96) * 2 * sqrt(x_hat * (1 - x_hat) / N)
[1] -0.4316863 0.3516863

Excellent introductory book to R; thank you!

Regards,
Mohamed

Typo in Ch 2 (2.4.1)

Version 2019-03-17: Description on page 30 of RStudio panes and tabs says "On the right, the top pane includes three tabs: Extensions, History and Connections..." The image at top of the next page shows these as "Environment," "History," and "Connections." "Environment" appears in my recently downloaded Windows version of RStudio as well. "Extension" should be changed to "Environment."

Section 11.10 typos (chapter 2)

In the second paragraph, starting with "First let’s define the theoretical quantiles....", the last sentence has 2 typos: "thsi" instead of "this" and "rguments" instead of "arguments".

Code error: Section 72.2: Wrong model's RMSE used

Hi,

In the last part of section 72.2, we are trying to see if the regularized movie effect yielded better results than the other models. Unfortunately, in the code, we are re-using the RMSE from model 2 instead of using the RMSE from model 3. We then come to the conclusion that the improvement is substantial, but it is not quite that substantial in reality.

RMSE with Movie Effect: 0.986
RMSE with Regularized Movie Effect: 0.9649
RMSE quoted in the book: 0.885

Here's the faulty code:

predicted_ratings <- test_set %>%
  left_join(movie_reg_avgs, by='movieId') %>%
  mutate(pred = mu + b_i) %>%
  .$pred

model_3_rmse <- RMSE(predicted_ratings, test_set$rating)
rmse_results <- bind_rows(rmse_results,
                          data_frame(method="Regularized Movie Effect Model",
                                     RMSE = model_2_rmse ))

The last argument, RMSE = model_2_rmse, should read RMSE = model_3_rmse.

Chapter 2.4.6 Lists example

dsbook/R/R-basics.Rmd

Lines 531 to 536 in 6ec8eaf

```{r, echo=FALSE}
record <- list(name = "John Doe",
               student_id = 1234,
               grades = c(95, 82, 91, 97, 93),
               final_grade = "A")
```

I was curious, should echo=FALSE be removed from this block? Currently it hides the definition of the example list object record, which makes the next lines (inspecting the object, accessing elements of it, etc.) not work.

Chapter 70: `lim` variable undefined

The code from ch 70.1:

rafalib::mypar()
plot(z, xlim=lim, ylim = lim - mean(lim))

cannot be run because the variable lim is not defined.
Adding the line:

lim <- c(min(z[,1]) - 1, max(z[,1]) + 1)

does the trick.

§66.6 typo and reference to a numbered figure when figures are unnumbered

Too minor for a pull request

The actuary is:

should be

The Accuracy is:

and in Exercise 5 reference is made to "figure 3," but figures are unnumbered.

BTW: As the series has gotten progressively into new territory for me, I cloned this repository to be able to cut and paste the code examples (and a few hidden in chunks) to run the code. That way, when a snippet appears in the lectures, I can pause and run it to see in better detail what you're conveying. Students who are not familiar with github would benefit by having a facility to download the code snippets with any required libraries in a file per video segment.

Cannot compile: R asks for user input

I am trying to compile a PDF book by running:

bookdown::render_book("index.Rmd", "bookdown::pdf_book")

But the process gets stuck. In particular, at unnamed-chunk-976, R prints "Selection" and asks for user input. Whatever the user input, the "Selection" prompt is displayed again:

label: unnamed-chunk-976
Selection: sadasdasd

31.2.1 section issue

Hey, I noticed that in the last part of section 31.2.1 the code shown to compare the estimates uses the object tmp, which is never created, so I'm unable to recreate the plot. Here's the part:

The resulting predictions are similar. This is because the two estimates of p(x) are larger than 1/2 in about the same region of x:

data.frame(x = seq(min(tmp$x), max(tmp$x))) %>%
  mutate(logistic = plogis(glm_fit$coef[1] + glm_fit$coef[2]*x),
         regression = lm_fit$coef[1] + lm_fit$coef[2]*x) %>%
  gather(method, p_x, -x) %>%
  ggplot(aes(x, p_x, color = method)) + 
  geom_line() +
  geom_hline(yintercept = 0.5, lty = 5)

```
> data.frame(x = seq(min(tmp$x), max(tmp$x))) %>%
    mutate(logistic = plogis(glm_fit$coef[1] + glm_fit$coef[2]*x),
           regression = lm_fit$coef[1] + lm_fit$coef[2]*x) %>%
    gather(method, p_x, -x) %>%
    ggplot(aes(x, p_x, color = method)) +
    geom_line() +
    geom_hline(yintercept = 0.5, lty = 5)
Error in seq(min(tmp$x), max(tmp$x)) : object 'tmp' not found
```

Could you upload a PDF version to study or print?:)

Hello!

I would love to have a PDF version of the program so it would be much easier to study! Please let me know if that is possible:)

Regards,
Fernando.

Build says Latex could not compile file but it doesn't appear to be true

Running

bookdown::render_book("index.Rmd", "bookdown::pdf_book", output_dir="./PDF")

ends with:

Error: LaTeX failed to compile book.tex. See https://yihui.org/tinytex/r/#debugging for debugging tips. See book.log for more info.

The last lines of book.log are as follows:

Here is how much of TeX's memory you used:
 21646 strings out of 481677
 421366 string characters out of 5932691
 820033 words of memory out of 5000000
 38244 multiletter control sequences out of 15000+600000
 543574 words of font info for 106 fonts, out of 8000000 for 9000
 14 hyphenation exceptions out of 8191
 60i,8n,118p,1460b,576s stack positions out of 5000i,500n,10000p,200000b,80000s

Output written on book.pdf (269 pages).

And the book.pdf file is present.

So it appears that the first error message is wrong?

Chapter 2.4.5. Factors: slight confusion around default order of levels

In the introduction to Factors (2.4.5), it says that "The default is for the levels to follow alphabetical order." In the levels(murders$region) example above however, the levels are clearly not in alphabetical order. To avoid confusion, the text should explain that the data set murders as shown in the example has already been ordered in a different way.
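
The default behaviour, and how an explicit order overrides it, can be seen in a short base R sketch. The `region` vector below is made up for illustration; it is not the `murders$region` column, and how that column's levels were set is not shown here.

```r
region <- c("South", "West", "Northeast", "South")

# Default: levels are the sorted unique values.
default_levels <- levels(factor(region))
default_levels

# Explicit order: the levels argument overrides the default sorting.
custom_levels <- levels(factor(region, levels = c("West", "South", "Northeast")))
custom_levels
```

A factor stored in a data set can therefore carry a non-alphabetical order if its levels were set explicitly when it was created.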

mathematical expression is not correct

Typo Chapter 5

Chapter 5, Exercise 1
I believe the formula should be n(n+1)/2, rather than as shown below.

What is the sum of the first 100 positive integers? There is a formula that tells us the sum of integers 1 through n. It is n(n−1)/2. Define n=100 and then use R to compute the sum of 1 through 100 using the formula. What is the sum?
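
The proposed formula can be compared with a direct sum in R. This is a small check, not the book's solution.

```r
n <- 100

formula_proposed <- n * (n + 1) / 2   # formula suggested in this report
formula_as_shown <- n * (n - 1) / 2   # formula as quoted from the exercise
direct_sum <- sum(1:n)

c(formula_proposed, formula_as_shown, direct_sum)
```

Only `n * (n + 1) / 2` agrees with `sum(1:n)`.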

Chapter 7, Exercise 11
I believe you mean "to force an integer" here.

The class of class(a<-1) is numeric not integer. R defaults to numeric and to force a number, you need to add the letter L. Confirm that the class of 1L is integer.
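
The behaviour the exercise describes can be confirmed in base R:

```r
a <- 1
class(a)    # "numeric": R stores plain numbers as doubles by default

b <- 1L
class(b)    # "integer": the L suffix forces an integer
```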

14.5.1 Multiplication rule

In the study material, the likelihood of an Ace followed by a face card is given as
1/13×12/52≈0.02

The video mentions that 10s can also be included and that, because the ace has already been removed from the deck, the second fraction should have a denominator of 51, which gives
1/13×16/51≈0.02

I think the guide should be corrected?
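
Both expressions can be evaluated to see how much they differ before rounding. This is a plain arithmetic sketch of the two formulas quoted above.

```r
p_text  <- 1/13 * 12/52   # expression from the study material
p_video <- 1/13 * 16/51   # expression described in the video

round(c(p_text, p_video), 4)
round(c(p_text, p_video), 2)
```

The two values differ at the third decimal place, but both round to 0.02, which is why the printed approximation looks the same in either version.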

"results" object not defined in inference/t-distribution.Rmd

"results" object is not defined in this file, so this file will not run independently.

dsbook/inference/t-distribution.Rmd

Line 71 in 1512dc0

results %>% filter(state == "Wisconsin")

Thanks!

Mistake in 33.5.1 formula?

In ml/dimension-reduction.Rmd line 70 (section 33.5.1), I believe there may be a mistake in the formula for average distance, as X_{i,j}-X_{i,j} will be 0.

Building instructions

I have been trying to build the pdf locally and have been completely unsuccessful.
Are there any available instructions?

If not, I can provide details of the errors where I am stumped.

Typo in Section 67.2

In the dsbook/ml/trees.Rmd file, at line 103, I think "linolenic" should be "linoleic". Please consider this change.

Wrong license on the homepage

Hello, the homepage says that the license is Creative Commons Attribution 3.0 Unported (CC-BY), but I believe it was updated to Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0). Could you update the homepage as well?

PS - We're translating this material into Japanese 🇯🇵 for non-commercial purposes (will publish using the same license) and we appreciate this permissive license that allows us to do so.

Typo in chapter 15.10

Question 14 in the exercises asks "What is the standard error of S?" The previous two questions refer to Y and this question has already been asked in question 7. I think this should read "What is the standard error of Y?".

Code error: CH72.exercises

Code error in the book exercises for regularization chapter (setup):

code reads: schools <- schools %>% mutate(score = sapply(avg_score, mean))

should read: schools <- schools %>% mutate(avg_score = sapply(scores, mean))

Small errors in Preface

Typo in "annonucements" (should be "announcements") at bottom of preface. Second line and last line should also have periods after the links.

I only see these files as HTML and am not sure where to fix them myself - sorry Rafa. Thought you would like to know anyway as it's the first thing people see.

Missing image: Section 55.3 ml/intro-ml.Rmd

Hello! There is an improperly embedded image in Section 55.3 regarding zip code sorting.

The image from ml/intro-ml.Rmd line 63 does not properly render into the html version of the textbook.

Thanks!

Version of PDF on Leanpub is out of date (2019-05-02)

Hi @rafalab, really excited to be joining your course on edX 👍
I just downloaded the PDF of the book from Leanpub: https://leanpub.com/datasciencebook
The PDF is dated 2019-05-01,

and the site says that it was LAST UPDATED ON 2019-05-02.

Meanwhile, the HTML version https://rafalab.github.io/dsbook was last updated 2019-10-25.

It appears that new versions merged into master on GitHub
are not being automatically published to the PDF version on Leanpub.

Bootstrap object missing

Hello!

I was going through the Bootstrap lecture and noticed that the code to define "income" object is missing, so I can't replicate the code. Would you add that part:)?

	```{r, echo=FALSE}
	record <- list(name = "John Doe",
	student_id = 1234,
	grades = c(95, 82, 91, 97, 93),
	final_grade = "A")
	```
