
thinkstats2's Introduction

ThinkStats2

Order Think Stats from Amazon.com.

Download this book in PDF.

Read this book online.

Read the related blog, Probably Overthinking It.

Think Stats is an introduction to Statistics and Data Science for Python programmers. If you have basic skills in Python, you can use them to learn concepts in probability and statistics and practical skills for working with data.

  • This book emphasizes simple techniques you can use to explore real data sets and answer interesting questions.

  • It includes case studies using datasets from the National Institutes of Health and other sources.

  • Many of the exercises use short programs to run experiments and help readers develop understanding.

This book is available under a Creative Commons license, which means that you are free to copy, distribute, and modify it, as long as you attribute the source and don’t use it for commercial purposes.

Working with the code

The easiest way to work with this code is to run it on Colab, which is a free service that runs Jupyter notebooks in a web browser. For every chapter, I provide two notebooks: one contains the code from the chapter and the exercises; the other also contains the solutions.

Chapter 1: Examples from the chapter; Solutions to exercises
Chapter 2: Examples from the chapter; Solutions to exercises
Chapter 3: Examples from the chapter; Solutions to exercises
Chapter 4: Examples from the chapter; Solutions to exercises
Chapter 5: Examples from the chapter; Solutions to exercises
Chapter 6: Examples from the chapter; Solutions to exercises
Chapter 7: Examples from the chapter; Solutions to exercises
Chapter 8: Examples from the chapter; Solutions to exercises
Chapter 9: Examples from the chapter; Solutions to exercises
Chapter 10: Examples from the chapter; Solutions to exercises
Chapter 11: Examples from the chapter; Solutions to exercises
Chapter 12: Examples from the chapter; Solutions to exercises
Chapter 13: Examples from the chapter; Solutions to exercises
Chapter 14: Examples from the chapter; Solutions to exercises

If you want to run these notebooks on your own computer, you can download them individually from GitHub or download the entire repository in a Zip file.

I developed this book using Anaconda, which is a free Python distribution that includes all the packages you'll need to run the code (and lots more). I found Anaconda easy to install. By default it does a user-level installation, so you don't need administrative privileges. You can download it here.

thinkstats2's People

Contributors

733amir, abhayana24, allendowney, anasghrab, bgrant, burmecia, charleswhchan, claytoncook12, dahmian, devinshanahan, djoume, dlorch, dnouri, felgru, gbremer, ilillii, imbolc, jnk22, kmiddleton, laga, mo0nmokey, mrsampson, nirs, punchagan, raghothams, robin-norwood, steven-chau, westurner, wwliao, yongduek


thinkstats2's Issues

Typo in Chapter 2 Exercises

From the paragraph immediately above Exercise 2.3:

"In the repository you downloaded, you should find a file named chap01ex.py; you can use this file as a starting place for the following exercises. My solution is in chap02soln.py."

I think this should be chap02ex.py

Nitpick with solution for Chapter 3, Exercise 3

The solution offered in chap03soln actually computes the length difference between the 1st birth and the 2nd, as opposed to the 1st vs. all the others.

The code below computes the difference between the 1st birth's prglngth and the mean of the other births' prglngth. The results are extremely similar, but still.

hist = thinkstats2.Hist()

livePreg = preg[preg.outcome == 1]
livePregMap = nsfg.MakePregMap(livePreg)

for caseid, indices in livePregMap.items():
    if len(indices) >= 2:
        ownLivePregs = livePreg.loc[indices]
        
        firstValue = ownLivePregs.iloc[0].prglngth # assumption: ownLivePregs' default sort matches sorting ascending by birthord
        others = ownLivePregs.iloc[1:]
        othersMean = others.prglngth.mean()
        
        diff = othersMean - firstValue # same as np.diff([firstValue, othersMean])[0]

        hist[int(diff)] += 1

Thanks for the book and the code examples! Very enjoyable so far :)

Error in chap03ex.ipynb

Hi, I noticed that in the 7th cell, where we plot the pmf, the pmf instance should be updated in the 6th cell, not the hist instance.

n = hist.Total()
pmf = hist.Copy()
for x, freq in hist.Items():
    pmf[x] = freq / n

6.1 - NormalPdf class

Your book has been awesome thus far! I came across a potential issue in your NormalPdf class, specifically when working with the example on adult female heights from the BRFSS dataset (pages 76-77).

Per your code, the Density function looks like it simply returns scipy.stats.norm.pdf(). What I found is that running pdf.Density(mean + std) does not yield the same result as calling scipy.stats.norm.pdf() directly.

thinkstats2 Module
mean, var = 163, 52.8
std = math.sqrt(var)
pdf = thinkstats2.NormalPdf(mean, std)
pdf.Density(mean + std)
yields 0.0333001

scipy.stats
scipy.stats.norm.pdf(mean + std)
yields 0.0

Am I mistaken that both approaches are the same at the core?

Thanks!
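
A likely explanation, sketched below: scipy.stats.norm.pdf defaults to loc=0 and scale=1, so the bare call evaluates a standard normal at x ≈ 170, which is effectively 0. Passing the distribution's parameters explicitly should reproduce the NormalPdf result.

import math
import scipy.stats

mean, var = 163, 52.8
std = math.sqrt(var)

print(scipy.stats.norm.pdf(mean + std))                       # standard normal: ~0.0
print(scipy.stats.norm.pdf(mean + std, loc=mean, scale=std))  # ~0.0333, matches NormalPdf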

SurvivalFunction class instantiation missing ss parameter in survival notebook

In cell 3 of the survival notebook, the instantiation of a survival.SurvivalFunction class fails because it is missing the required second parameter, ss.
The code in the notebook is --

cdf = thinkstats2.Cdf(complete, label='cdf')
sf = survival.SurvivalFunction(cdf, label='survival')
thinkplot.Plot(sf)
thinkplot.Config(xlabel='duration (weeks)', ylabel='survival function')
#thinkplot.Save(root='survival_talk1', formats=['png'])

The exception is --

08/29/2017 09:47:44 PM INFO: Running cell:
variables = GoMining(join)

08/29/2017 09:47:45 PM INFO: Cell raised uncaught exception: 
---------------------------------------------------------------------------
PatsyError                                Traceback (most recent call last)
<ipython-input-14-cb99e4bff205> in GoMining(df)
     16 
---> 17             model = smf.ols(formula, data=df)
     18             if model.nobs < len(df)/2:

~/anaconda/envs/thinkstats2-36/lib/python3.6/site-packages/statsmodels/base/model.py in from_formula(cls, formula, data, subset, drop_cols, *args, **kwargs)
    154         tmp = handle_formula_data(data, None, formula, depth=eval_env,
--> 155                                   missing=missing)
    156         ((endog, exog), missing_idx, design_info) = tmp

~/anaconda/envs/thinkstats2-36/lib/python3.6/site-packages/statsmodels/formula/formulatools.py in handle_formula_data(Y, X, formula, depth, missing)
     64             result = dmatrices(formula, Y, depth, return_type='dataframe',
---> 65                                NA_action=na_action)
     66         else:

~/anaconda/envs/thinkstats2-36/lib/python3.6/site-packages/patsy/highlevel.py in dmatrices(formula_like, data, eval_env, NA_action, return_type)
    311     if lhs.shape[1] == 0:
--> 312         raise PatsyError("model is missing required outcome variables")
    313     return (lhs, rhs)

PatsyError: model is missing required outcome variables

During handling of the above exception, another exception occurred:

NameError                                 Traceback (most recent call last)
<ipython-input-15-ba2f7ffb2f05> in <module>()

The class code snippet is --

class SurvivalFunction(object):
    """Represents a survival function."""

    def __init__(self, ts, ss, label=''):
        self.ts = ts
        self.ss = ss
        self.label = label

This is failing with a new install of Anaconda Python 2.7, and I would expect this to fail in Python 3 as well.
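
One way to make the cell run with the signature quoted above is to build ts and ss from the Cdf by hand. This is only a sketch, and it assumes the Cdf object exposes xs and ps arrays as in thinkstats2:

import numpy as np

# Sketch: construct ts and ss explicitly, per the __init__ quoted above.
# Assumes cdf.xs holds the values and cdf.ps the cumulative probabilities.
cdf = thinkstats2.Cdf(complete, label='cdf')
ts = cdf.xs
ss = 1 - np.asarray(cdf.ps)   # survival function is the complement of the CDF
sf = survival.SurvivalFunction(ts, ss, label='survival')
thinkplot.Plot(sf)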

bug in 8.7 Glossary - bias (of an estimator):

The current (2.0.27) definition of bias (of an estimator):
The tendency of an estimator to be above or below the actual value of the parameter, when averaged over repeated experiments.
should be read as
The tendency of estimates by an estimator to be above or below the actual value of the parameter, when averaged over repeated experiments.
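
To make the distinction concrete, here is a small simulation sketch (not from the book): a single estimate can land anywhere, but averaging many estimates over repeated experiments reveals the estimator's bias.

import numpy as np

rng = np.random.default_rng(17)
true_var = 4.0
estimates = []
for _ in range(10000):                         # repeated experiments
    sample = rng.normal(0, np.sqrt(true_var), size=6)
    estimates.append(np.var(sample))           # biased estimator: divides by n
print(np.mean(estimates))                      # about 3.33, below the true value 4.0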

ImportError: No module named gensvd

Hi,

I run a python script for my radar image processing but I had the following error.

Traceback (most recent call last):
  File "TimefnInvert.py", line 17, in <module>
    import solver.tikh as tikh
  File "/Volumes/chung/analysis/TSX_cost_flow/solver/tikh.py", line 14, in <module>
    import solver.gsvd.cgsvd as gsvd
  File "/Volumes/chung/analysis/TSX_cost_flow/solver/gsvd/cgsvd.py", line 1, in <module>
    import gensvd
ImportError: No module named gensvd

I don't know how I can install the gensvd module for Python. I tried 'pip install gensvd', but no gensvd could be found to install. Any ideas?
Thank you,
Chung

QUESTION: Why not use panda's histogram function?

(First - is filing an issue like this an OK way to ask a question? If not I'll happily retract/delete it and/or please close it :) )

Second - this book is excellent - thank you for sharing it!!

I'm working my way through your book and I've got a question: I notice that you've created your own ThinkStats2.Hist class. I dug around a bit and it looks like pandas has its own hist method (http://pandas.pydata.org/pandas-docs/stable/visualization.html#histograms). I haven't tried it (yet), but I was wondering if there was a particular reason why you used your own class instead of the one built into pandas?

p.120 3rd paragraph 9.8 First babies again

The second sentence reads, "But after 1000 iterations, the largest test statistic generated under the null hypothesis is 32."
However, executing hypothesis.py, the pregnancy length chi-squared test gives actual = 101.5014... and max ts = 27.7315...
So the second sentence should read, "But after 1000 iterations, the largest test statistic generated under the null hypothesis is 28."

error in Percentile2

From Gary Foreman

I believe I have found an error in your Percentile2 function, which you include in Section 4.2 of the second edition. Specifically, the Percentile function and the Percentile2 function do not produce consistent results.

In the context of the example from the book, where we have a list of scores [55, 66, 77, 88, 99], let's consider the output of the two functions when we want to find the score with percentile rank 81. The Percentile function searches for the lowest score whose percentile rank is at least 81. Score 88 has percentile rank 80, so the output will be 99, i.e. the next score up.

The Percentile2 function determines index as
index = percentile_rank * (len(scores) - 1) // 100
For the case of percentile_rank = 81 and scores = [55, 66, 77, 88, 99],
index = 81 * (5 - 1) // 100 = 324 // 100 = 3
sorted(scores)[3] = 88.

I've coded this up and put it on gist, which can be found here https://gist.github.com/garyForeman/89ab99cd83ac47acd900

I've also created an alternative function called MyPercentile2, which behaves consistently with Percentile (at least for the test cases I have explored). Specifically, I assign index as
index = int(math.ceil(percentile_rank * len(scores) / 100.)) - 1

Please feel free to contact me if you have any questions or comments about the code on gist. And again, I really appreciate all the work you've done to create this valuable resource.
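
A short sketch that reproduces the discrepancy. The function names here are mine: percentile follows the description above (lowest score whose percentile rank reaches the target), percentile2 uses the index formula quoted from the book, and my_percentile2 uses Gary's proposed index.

import math

def percentile_rank(scores, your_score):
    # percentage of scores less than or equal to your_score
    count = sum(1 for s in scores if s <= your_score)
    return 100.0 * count / len(scores)

def percentile(scores, rank):
    # lowest score whose percentile rank is at least rank
    for score in sorted(scores):
        if percentile_rank(scores, score) >= rank:
            return score

def percentile2(scores, rank):
    # the index formula from the book, as quoted above
    index = rank * (len(scores) - 1) // 100
    return sorted(scores)[index]

def my_percentile2(scores, rank):
    # Gary's proposed index
    index = int(math.ceil(rank * len(scores) / 100.0)) - 1
    return sorted(scores)[index]

scores = [55, 66, 77, 88, 99]
print(percentile(scores, 81), percentile2(scores, 81), my_percentile2(scores, 81))
# prints: 99 88 99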

Add seaborn and lifelines to the list of module dependencies

The seaborn library is referenced in the pandas_examples.ipynb notebook, and the lifelines library is referenced in the survival.ipynb notebook. Both should be called out as dependencies for the end users and developers in the README and CONTRIBUTING docs.

_DictWrapper class invokes pandas Series method not present in Py27 when updating data

In the chap03ex notebook, cell 3 fails when the _DictWrapper class updates data from a pandas Series under Python 2.7. The class uses the items() method, which is not present in the 0.20.3 Python 2.7 release of pandas.

The code in cell 3 --

hist = thinkstats2.Hist(live.birthwgt_lb, label='birthwgt_lb')
thinkplot.Hist(hist)
thinkplot.Config(xlabel='Birth weight (pounds)', ylabel='Count')

The exception --

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-1500ad553f53> in <module>()
----> 1 hist = thinkstats2.Hist(live.birthwgt_lb, label='birthwgt_lb')
      2 thinkplot.Hist(hist)
      3 thinkplot.Config(xlabel='Birth weight (pounds)', ylabel='Count')

/Users/gbremer/projects/ThinkStats2/code/thinkstats2.pyc in __init__(self, obj, label)
    153             self.d.update(obj.Items())
    154         elif isinstance(obj, pandas.Series):
--> 155             self.d.update(obj.value_counts().items())
    156         else:
    157             # finally, treat it like a list

/Users/gbremer/anaconda/envs/thinkstats2-27/lib/python2.7/site-packages/pandas/core/generic.pyc in __getattr__(self, name)
   3075         if (name in self._internal_names_set or name in self._metadata or
   3076                 name in self._accessors):
-> 3077             return object.__getattribute__(self, name)
   3078         else:
   3079             if name in self._info_axis:

AttributeError: 'Series' object has no attribute 'items'

The object returned by Series.value_counts(), itself a Series, does have an iteritems() method, which works.

In Python 3.6 the code works with either items() or iteritems(); both are implemented.

Looking at the pandas docs, items() and iteritems() are equivalent --

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html

items() | Lazily iterate over (index, value) tuples
iteritems() | Lazily iterate over (index, value) tuples
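
One version-agnostic workaround (a sketch, not the fix that was applied): skip items()/iteritems() entirely and zip the index and values of the Series returned by value_counts().

import pandas as pd

series = pd.Series([8, 7, 7, 6, 8, 8])
counts = series.value_counts()
d = {}
d.update(zip(counts.index, counts.values))   # avoids items()/iteritems()
print(d)                                     # {8: 3, 7: 2, 6: 1}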

Typo for slope value

At the top of page 132, the sentence says

The slope is 0.175 pounds per year.

The decimal point is off by one. It should say

The slope is 0.0175 pounds per year.

since the value of slope is

>>> slope
0.017453851471802836

The other values around it (such as slope * diff_age) match what I see.

Typo in Dungeons.py

Hallo Allen,

I love your books! :-) I think here (line 106) should read 'Max' instead of 'Sum':

thinkplot.Save(root='dungeons2',
xlabel='Max of three d6',
ylabel='Probability',
axis=[2, 19, 0, 0.23],
formats=FORMATS,
legend=False)

Bye
Fabio

prglength seems to be missing

On page 7 of Think Stats2 (v 2.0.21), section 1.5 "Variables," there is a list of variables used in the book. All variables seem to exist in the dataframe, except 'prglength.'

I pasted the results of df.columns.tolist() into a text document and searched for prglength (even length) but didn't find anything.

I also attempted to search the PDF to see where else this variable might be used but came across another error. Selecting the text 'prglength' and pasting it shows non-alphanumeric characters, rather than the word "prglength." Searching for 'prglength' (when I type it myself in the find box) returns zero results! In other words, the document doesn't seem very searchable.

I'm on OSX, viewing the book in Chrome's PDF viewer. I did run nsfg and saw confirmation of all tests passing.

I'm trying to work through the book by executing each line of code as I see it and came across this error. Thought I'd point it out.

How to install thinkstats2 in anaconda python

Hi, this is probably a silly question, but how can I install the thinkstats2 package? I am going through your videos on safaribooks and trying to replicate the notebooks, but I can't seem to install the package. Trying:

conda install thinkstats2

I get:

Package not found: Package missing in current win-64 channels

I couldn't install it with pip either.

Thank you

No module named 'thinkstats2'

Hi! When I copy and paste the provided nsfg.py script into Atom or Spyder, I get the following error:

File "preg.py", line 11, in
import thinkstats2
ImportError: No module named 'thinkstats2'

Why is there no module named thinkstats2? Thank you!
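
thinkstats2.py ships in the code directory of this repository; one common workaround (a sketch, with a hypothetical path you should adjust to your clone) is to run your script from that directory, or to add it to sys.path:

import sys

# Hypothetical path: point this at the code/ directory of your ThinkStats2 clone.
sys.path.append('/path/to/ThinkStats2/code')

import thinkstats2   # should now resolve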

A simpler solution for Exercise 2.3

def Mode(hist):
    p = max(set(hist), key=hist.Freq)
    return p

In all my test cases this performs similarly to the solution given in chap02soln.py, but it seems to me to be a little bit less conceptually complicated.
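
A quick sanity check of this version (assuming, as in thinkstats2, that iterating a Hist yields its values and that Freq returns their counts):

import thinkstats2

def Mode(hist):
    return max(set(hist), key=hist.Freq)

hist = thinkstats2.Hist([1, 2, 2, 3, 5])
print(Mode(hist))   # expected: 2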

Follow common naming conventions/distinguish classes and functions

I get extremely confused when reading your book as it's difficult to tell what's a function and what's a class. PEP-8 generally recommends that classes are named with CapWords and functions with lower_case ("words separated by underscores as necessary to improve readability"). The other common guide, Google's Python Style Guide, also recommends CapWords for classes and lower_case_with_underscores for functions.

I'm not advocating that one or the other must be rigidly followed, but at the very least I'd prefer that classes and functions/methods are named differently.

Proposed feature -- Travis CI build for tests, style checks, pdf generation

Here's something to consider for the project -- a CI build to run the tests nightly, run style checks, check your pdf build. The magic ingredient is a .travis.yml file in the root of the repository that scripts up an environment setup and build commands. It may be useful, it may not.

https://travis-ci.org/gbremer/ThinkStats2

The tests are failing because they want to display matplotlib plots -- that can be accounted for or fixed if this is considered useful.

Use hyperref package for pdf metadata (TOC, Title, Author)

It would be nice to have a "document outline" (I would call it a metadata TOC) for using the navigation tree in pdf viewers (usually you can toggle between thumbnail view and document outline). This can easily be added by using the hyperref package -- by default it generates the document outline and you can specify more metadata. For example:

\usepackage{hyperref}

\hypersetup{
    pdftitle={Think Stats},
    pdfauthor={Allen B. Downey}
}

It will also automatically linkify all your references! From the link above:
"This will automatically turn all your internal references into hyperlinks. It won't affect the way to write your documents: just keep on using the standard \label-\ref system (discussed in the chapter on Labels and Cross-referencing); with hyperref those "connections" will become links and you will be able to click on them to be redirected to the right page. Moreover the table of contents, list of figures/tables and index will be made of hyperlinks, too."

thinkstats2.EvalLognormalCdf TypeError

When running thinkstats2.EvalLognormalCdf(df.income, mu=0, sigma=1), Python fails with

thinkstats2.EvalLognormalCdf(df.income, mu=0, sigma=1)
  File "/path/ThinkStats2/code/thinkstats2.py", line 1895, in EvalLognormalCdf
    return stats.lognorm.cdf(x, loc=mu, scale=sigma)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/stats/distributions.py", line 1329, in cdf
    args, loc, scale = self._parse_args(*args, **kwds)
TypeError: _parse_args() takes at least 2 arguments (3 given)

If I change EvalLognormalCdf to not use named args, it seems to work, i.e. return stats.lognorm.cdf(x, mu, sigma).
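
For reference, a hedged note: scipy parameterizes the lognormal with a required shape parameter s, and the usual mapping from the underlying normal's parameters is s = sigma and scale = exp(mu), with loc left at 0. A sketch of that call:

import numpy as np
from scipy import stats

mu, sigma = 0, 1
x = np.array([0.5, 1.0, 2.0])

# the shape parameter s plays the role of sigma; scale = exp(mu)
print(stats.lognorm.cdf(x, s=sigma, scale=np.exp(mu)))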

TypeError: main() takes exactly 1 argument

Hi Allen,

Thanks for the book & code. I just started working on the code & got this error message when I run the nsfg.py code:

TypeError                                 Traceback (most recent call last)
in <module>()
    165
    166 if __name__ == '__main__':
--> 167     main(*sys.argv)

TypeError: main() takes exactly 1 argument (3 given)

Can you please explain the issue here and how to fix it? I'm new to Python and spent some time researching this issue but couldn't sort it out.

Thanks
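
The traceback suggests that main is defined to take only the script name, while sys.argv held three entries; that happens when extra arguments are passed on the command line or added by the environment the script runs in. One hedged workaround is to accept and ignore the extras:

import sys

def main(script, *args):
    # accept extra command-line arguments and ignore them
    print('running', script)

if __name__ == '__main__':
    main(*sys.argv)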

Issue with code in Section 11.4

I'm getting an error when running the code in this section. Here's the shell session:

In[5]: live, firsts, others = first.MakeFrames()
In[6]: live = live[live.prglngth > 30]
In[7]: import chap01soln
In[8]: resp = chap01soln.ReadFemResp()
In[9]: resp.index = resp.caseid
In[10]: join = live.join(resp, on='caseid', rsuffix='_r')
...
In[15]: def find_vars(data):
...         t = []
...         for name in join.columns:
...             try:
...                 if join[name].var() < 1e-7:
...                     continue
...                 formula = 'totalwgt_lb ~ agepreg + ' + name
...                 model = smf.ols(formula, data=join)
...                 if model.nobs < len(join) / 2:
...                     continue
...                 results = model.fit()
...             except (ValueError, TypeError):
...                 continue
...             t.append((results.rsquared, name))
...         return t
In[16]: t = find_vars(join)
Traceback (most recent call last):
  File "/Users/lucian/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 3035, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-16-d2e6ddbced30>", line 1, in <module>
    t = find_vars(join)
  File "<ipython-input-15-7173795bc6ef>", line 8, in find_vars
    model = smf.ols(formula, data=join)
  File "/Users/lucian/anaconda/lib/python2.7/site-packages/statsmodels/base/model.py", line 147, in from_formula
    missing=missing)
  File "/Users/lucian/anaconda/lib/python2.7/site-packages/statsmodels/formula/formulatools.py", line 65, in handle_formula_data
    NA_action=na_action)
  File "/Users/lucian/anaconda/lib/python2.7/site-packages/patsy/highlevel.py", line 299, in dmatrices
    raise PatsyError("model is missing required outcome variables")
PatsyError: model is missing required outcome variables

Suggestion to use shallow git clone to download repository

The repository with its full history is quite big, so cloning it locally as you suggest in chapter 0.2 might transfer more data than needed. Someone who just wants to go through the exercises might be better advised to shallow clone your repository:

$ git clone --depth 1 https://github.com/AllenDowney/ThinkStats2.git
Cloning into 'ThinkStats2'...
remote: Counting objects: 420, done.
remote: Compressing objects: 100% (275/275), done.
remote: Total 420 (delta 150), reused 368 (delta 145), pack-reused 0
Receiving objects: 100% (420/420), 133.64 MiB | 1.50 MiB/s, done.
Resolving deltas: 100% (150/150), done.
Checking connectivity... done.

In fact, downloading the ZIP might be the best option altogether.

Files for chapter 1 and 2 exercises

I didn't check the other chapters, but is there any reason you don't mention the fact that there are files for each chapter's exercises? Instead, you instruct the reader to create these files.

Import Error: version `CXXABI_1.3.9' not found

Hi. I've just forked the repo and tried to run the first script, nsfg.py. I get an error related to matplotlib.

My intuition tells me that this may already have a solution, but I have found none. The full trace of the error is the following:

diego@ubuntu:~/Documents/dev/ThinkStats2/code$ python nsfg.py 
Traceback (most recent call last):
  File "nsfg.py", line 12, in <module>
    import thinkstats2
  File "/home/diego/Documents/dev/ThinkStats2/code/thinkstats2.py", line 34, in <module>
    import thinkplot
  File "/home/diego/Documents/dev/ThinkStats2/code/thinkplot.py", line 12, in <module>
    import matplotlib.pyplot as plt
  File "/home/diego/anaconda3/lib/python3.5/site-packages/matplotlib/pyplot.py", line 29, in <module>
    import matplotlib.colorbar
  File "/home/diego/anaconda3/lib/python3.5/site-packages/matplotlib/colorbar.py", line 32, in <module>
    import matplotlib.artist as martist
  File "/home/diego/anaconda3/lib/python3.5/site-packages/matplotlib/artist.py", line 15, in <module>
    from .transforms import (Bbox, IdentityTransform, TransformedBbox,
  File "/home/diego/anaconda3/lib/python3.5/site-packages/matplotlib/transforms.py", line 39, in <module>
    from matplotlib._path import (affine_transform, count_bboxes_overlapping_bbox,
ImportError: /home/diego/anaconda3/lib/python3.5/site-packages/matplotlib/../../../libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /home/diego/anaconda3/lib/python3.5/site-packages/matplotlib/_path.cpython-35m-x86_64-linux-gnu.so)

If it means something, I'm using
matplotlib (2.0.2)
Python 3.5.4 :: Anaconda, Inc.
conda 4.3.30
gcc (GCC) 4.8.5

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.3 LTS
Release: 16.04
Codename: xenial

Thanks

Typo in chap02ex

(screenshot of the typo in chap02ex)

Hey, I recently git cloned this repo and I noticed the above typo in chap02ex near the second exercise.

It should be "we can select", not "we can selection".

Thanks for the great work!

code in book does not seem to work

Hi Allen,

Thanks for your book, it's great.
I executed the code in section 4.2, which reads as follows in your source TeX code:

\begin{verbatim}
def Percentile2(scores, percentile_rank):
    scores.sort()
    index = percentile_rank * (len(scores)-1) / 100
    return scores[index]
\end{verbatim}

When I executed this

scores = [55, 66, 77, 88, 99]
Percentile2(scores, 50.)

I get an error due to the fact that index is not an integer, but a floating point value.
I suggest using a cast to int as in

def Percentile2(scores, percentile_rank):
    scores.sort()
    index = int(percentile_rank * (len(scores)-1) / 100)
    return scores[index]

I guess this solution still needs checking for appropriate rounding...

Incorrect example code

Chapter 1 (location 415 in kindle version)

df.birthwgt_lb.value_counts(sort=False)
Traceback (most recent call last):
File "", line 1, in
TypeError: value_counts() got an unexpected keyword argument 'sort'

Cannot Run Code

I suppose that it's something I'm doing wrong, but anytime I try to run the code from my Mac OSX terminal I'm getting the following error:

./nsfg2.py: line 16: syntax error near unexpected token `('
./nsfg2.py: line 16: `def MakeFrames():'

This happens right from the start, when I try to run ./nsfg2.py. I did install Anaconda and tried to run my code through there; no dice. I could not figure out how to actually run a .py file from Anaconda.

So then I did some pip installs for the packages called out in the book: I installed pandas, numpy, scipy, statsmodels, and matplotlib. I'm still getting this error message.

Initial Test Breaks if not executed from code directory

This is a minor suggestion and not really necessary.

After cloning the repo I executed the command

python code/nsfg.py

It failed to find the data files and therefore threw some errors. I suggest adding a line in the book that mentions that we should be in the code directory. This can also be resolved by adding a couple of lines to the file and I am adding a PR for that.

Thanks.
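
A sketch of that kind of change (the helper name is mine, not from the PR): resolve data files relative to the script's own location so it works from any working directory.

import os

# directory containing this script, e.g. the repository's code/ directory
BASE_DIR = os.path.dirname(os.path.abspath(__file__))

def data_path(filename):
    """Return an absolute path to a data file that sits next to this script."""
    return os.path.join(BASE_DIR, filename)

# e.g. dct = thinkstats2.ReadStataDct(data_path('2002FemPreg.dct'))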

chap01ex.ipynb

I've entered the correct answer for the first exercise.
Question: Select the birthord column, print the value counts, and compare to results published in the codebook
Answer: preg.outcome.value_counts().sort_index()

However, when I ran the code, it showed a NameError. I assumed that preg had been pre-defined. Did I miss any steps? Thanks!


NameError Traceback (most recent call last)
in <module>()
----> 1 preg.birthord.value_counts().sort_index()

NameError: name 'preg' is not defined

Provide support for Python 3

The ThinkStats2 code base is not Python 3 compliant, borne out by a quick grep on the code and looking for print statements. A quick improvement and win would be to run futurize on the Python to knock out the low hanging fruit, and after that review the notebooks for similar sorts of code. There is at least one issue that refers to a Python 3 incompatibility in notebook code.

Since Python 2 is not yet end-of-life, the fixes should not port the code to Python 3 but should ensure the code is compatible with both Python 2 and Python 3 for the next few years.

Out of place sentence in chapter 6

On page 92, there is the following code:

sample = [random.gauss(mean, std) for i in range(500)]
sample_pdf = thinkstats2.EstimatedPdf(sample)
thinkplot.Pdf(sample_pdf, label='sample KDE')

Below, there is a sentence starting with

pmf is a Pmf object that ...

However, pmf doesn't appear anywhere on the page.

birthwgt_lb cleanup example shows non-existent output

book.tex line 1209 mentions a 51 pound baby:

1209 51 1

and explains that the value was cleaned up:

df.birthwgt_lb[df.birthwgt_lb > 20] = np.nan

But when the reader executes the example code, the 51 pound baby is not there, because it was already cleaned by CleanFemPreg.

I think the best way to resolve this would be to remove the cleanup function from ReadFemPreg(), and instruct the user to call it. This can be exercise 1.
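
A sketch of the proposed change; the signature and file names follow my reading of nsfg.py and should be treated as assumptions:

import thinkstats2

def ReadFemPreg(dct_file='2002FemPreg.dct', dat_file='2002FemPreg.dat.gz'):
    """Read the NSFG pregnancy file, leaving the cleanup to the reader."""
    dct = thinkstats2.ReadStataDct(dct_file)
    df = dct.ReadFixedWidth(dat_file, compression='gzip')
    # CleanFemPreg(df)  # removed here; calling it becomes Exercise 1
    return df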

(Chapter 1.7 Validation) Result of value_counts slightly different than one in book

Hi,
I just ran the code provided in Chapter 1.7. While df.outcome.value_counts(sort=False) yields exactly what is shown in the book's example, the result of df.birthwgt_lb.value_counts(sort=False) gives me slightly different output (see below) than what is shown in the book. Other than that little issue, everything is fine - all weight values for the birthwgt_lb column are summarized correctly.

EDIT
Oh, and apparently I'm missing the 51 pound baby :)
EDIT
I just ran through the source code of nsfg.py; it makes sense now that the 51 pound baby is removed by the call to CleanFemPreg.

Result of running df.birthwgt_lb.value_counts(sort=False):

8.0     1889
7.0     3049
6.0     2223
4.0      229
5.0      697
10.0     132
12.0      10
14.0       3
3.0       98
1.0       40
2.0       53
0.0        8
9.0      623
11.0      26
13.0       3
15.0       1
Name: birthwgt_lb, dtype: int64

Can someone please let me know what might be causing this?

I'm running my code using Python(3.5.2) in virtualenv(15.1.0) with installed dependencies:

cycler (0.10.0)
matplotlib (2.0.2)
numpy (1.13.0)
pandas (0.20.2)
patsy (0.4.1)
pip (9.0.1)
pyparsing (2.2.0)
python-dateutil (2.6.0)
pytz (2017.2)
scipy (0.19.1)
setuptools (18.0.1)
six (1.10.0)
statsmodels (0.8.0)
wheel (0.24.0)

cannot open chap01ex.ipynb

Cannot open the chapter 1 ipython notebook exercises, both the problem and the solution files. I get the following error message
"Unreadable Notebook: /Users/Home/git/ThinkStats2/code/chap01ex.ipynb NotJSONError('Notebook does not appear to be JSON: u'\n\n\n\n\n<html lang="e...',)"

I upgraded my IPython to version 4.2.1 and it didn't resolve the problem.

Iteration of dict keys and values fails in panda_demos notebook when running in Python 3

In cell 24, an iteration over a dictionary's keys and values fails with an AttributeError: iteritems() is not available.

The code is --

import numpy

for sex, weights in d.iteritems():
    print(sex, numpy.log(weights).mean(), numpy.log(weights).std())

The exception is --

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-24-31c478b07415> in <module>()
      1 import numpy
      2 
----> 3 for sex, weights in d.iteritems():
      4     print(sex, numpy.log(weights).mean(), numpy.log(weights).std())

AttributeError: 'dict' object has no attribute 'iteritems'

This fails with a new install of Anaconda Python 3.6, and succeeds in a new install of Python 2.7.
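
The minimal fix (a sketch) is to call items(), which exists in both Python 2 and Python 3; in Python 2 it returns a list instead of a lazy iterator, which is harmless here:

import numpy

# d is the dict of weights by sex built earlier in the notebook
for sex, weights in d.items():   # works in Python 2 and Python 3
    print(sex, numpy.log(weights).mean(), numpy.log(weights).std())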

bug in thinkplot.Pmf

From John Miller:

It looks like the way you're calculating bars is broken.

Here's the Hist plot:

And here's the Pmf plot:

Here's the Pmf data. You can see that, for example, there is only one value near 0.05 between 0.86 and 0.88, but two bars are drawn. The same goes for the others.

{0.86728: 0.05221414408460013,
0.87273: 0.00033046926635822867,
0.87543: 0.0016523463317911435,
0.88343: 0.031229345670852608,
0.89127: 0.00066093853271645734,
0.89384: 0.00033046926635822867,
0.89893: 0.036516853932584269,
0.90394: 0.00016523463317911433,
0.90641: 0.0019828155981493718,
0.90885: 0.00016523463317911433,
0.91368: 0.036021150033046928,
0.91841: 0.00016523463317911433,
0.92074: 0.0029742233972240581,
0.92305: 0.00016523463317911433,
0.92758: 0.038995373430270985,
0.932: 0.00049570389953734295,
0.93417: 0.00066093853271645734,
0.93632: 0.00033046926635822867,
0.94052: 0.041969596827495043,
0.9466: 0.0014871116986120291,
0.94857: 0.00033046926635822867,
0.95241: 0.056345009914077988,
0.95793: 0.0009914077990746859,
0.96315: 0.048248512888301384,
0.96646: 0.00016523463317911433,
0.96806: 0.0013218770654329147,
0.96963: 0.00033046926635822867,
0.97266: 0.085756774619960341,
0.97693: 0.0009914077990746859,
0.97827: 0.00016523463317911433,
0.98085: 0.055188367481824187,
0.98443: 0.00049570389953734295,
0.98766: 0.081956378056840709,
0.99053: 0.0014871116986120291,
0.99302: 0.09517514871116986,
0.99514: 0.0018175809649702576,
0.99689: 0.2027428949107733,
0.99825: 0.00033046926635822867,
0.99922: 0.11698612029081294,
0.99965: 0.00016523463317911433,
0.99981: 0.00049570389953734295}

Here's what I would expect it to look like:

Modular latex source

A useful way to structure latex documents is to use the \input{} or \include{} commands to make a modular document. This would be a nice way to separate chapters in the latex source.

chap01ex - quibble with the prglngth and agepreg exercises

For the exercises:

"Print value counts for prglngth and compare to results published in the codebook" and "Print value counts for agepreg and compare to results published in the codebook".

I didn't like that you can't directly compare the results of the code with the data given in those links. My solution was:

idxes = ((0, 14), (14, 27), (27, -1))
for i in idxes:
    print(df['prglngth'].value_counts().sort_index()[i[0]:i[1]].sum())
print(df['prglngth'].value_counts().sum())

With the results:

3522
793
9276
13593

Although there may be more elegant/numpy-thonic ways of doing that.

Error in nsfg.py at line 98 for item()

First of all, I would like to thank you for such a lovely tutorial.
I found an error at line 98 in nsfg.py: the item() function is not there. Do I need to use Python 3, as your git repo says?
It may sound silly, but I am a newbie at this.
