amueller / introduction_to_ml_with_python

Notebooks and code for the book "Introduction to Machine Learning with Python"

introduction_to_ml_with_python's Introduction

Introduction to Machine Learning with Python

This repository holds the code for the forthcoming book "Introduction to Machine Learning with Python" by Andreas Mueller and Sarah Guido. You can find details about the book on the O'Reilly website.

The book requires the current stable version of scikit-learn, that is 0.20.0. Most of the book can also be used with previous versions of scikit-learn, though you need to adjust the import for everything from the model_selection module, mostly cross_val_score, train_test_split and GridSearchCV.
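
The import adjustment mentioned above can be sketched as follows. This is a minimal illustration, assuming that older scikit-learn releases expose train_test_split and cross_val_score under sklearn.cross_validation and GridSearchCV under sklearn.grid_search. The helper name import_first is hypothetical and not part of the book or of mglearn.

```python
import importlib


def import_first(*module_names):
    """Return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("none of these modules could be imported: "
                      + ", ".join(module_names))


# Typical use (left as comments so the sketch runs without scikit-learn):
# splitting = import_first("sklearn.model_selection", "sklearn.cross_validation")
# search = import_first("sklearn.model_selection", "sklearn.grid_search")
# train_test_split = splitting.train_test_split
# cross_val_score = splitting.cross_val_score
# GridSearchCV = search.GridSearchCV
```

The fallback order tries the current module first, so the same notebook cell works unchanged on a recent installation.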

This repository provides the notebooks from which the book is created, together with the mglearn library of helper functions to create figures and datasets.

For the curious ones, the cover depicts a hellbender.

All datasets are included in the repository, with the exception of the aclImdb dataset, which you can download from the page of Andrew Maas. See the book for details.

If you get ImportError: No module named mglearn you can try to install mglearn into your python environment using the command pip install mglearn in your terminal or !pip install mglearn in Jupyter Notebook.

Errata

Please note that the first print of the book is missing the following line when listing the assumed imports:

from IPython.display import display

Please add this line if you see an error involving display.

The first print of the book used a function called plot_group_kfold. This has been renamed to plot_label_kfold because of a rename in scikit-learn.

Setup

To run the code, you need the packages numpy, scipy, scikit-learn, matplotlib, pandas and pillow. Some of the visualizations of decision trees and neural network structures also require graphviz. The chapter on text processing also requires nltk and spacy.

The easiest way to set up an environment is by installing Anaconda.

Installing packages with conda:

If you already have a Python environment set up, and you are using the conda package manager, you can get all packages by running

conda install numpy scipy scikit-learn matplotlib pandas pillow graphviz python-graphviz

For the chapter on text processing you also need to install nltk and spacy:

conda install nltk spacy

Installing packages with pip

If you already have a Python environment and are using pip to install packages, you need to run

pip install numpy scipy scikit-learn matplotlib pandas pillow graphviz

You also need to install the graphviz C library, which is easiest using a package manager. If you are using OS X and Homebrew, you can brew install graphviz. If you are on Ubuntu or Debian, you can apt-get install graphviz. Installing graphviz on Windows can be tricky, and using conda / Anaconda is recommended.

For the chapter on text processing you also need to install nltk and spacy:

pip install nltk spacy

Downloading English language model

For the text processing chapter, you need to download the English language model for spacy using

python -m spacy download en

Submitting Errata

If you have errata for the (e-)book, please submit them via the O'Reilly Website. You can submit fixes to the code as pull-requests here, but I'd appreciate it if you would also submit them there, as this repository doesn't hold the "master notebooks".

introduction_to_ml_with_python's Issues

Error on preamble import

I am getting the following error:

from preamble import *

%matplotlib inline

ImportError                               Traceback (most recent call last)
in <module>()
----> 1 from preamble import *
      2 get_ipython().magic(u'matplotlib inline')

/Users/ssen/Box Sync/projects/introduction_to_ml_with_python/preamble.py in <module>()
      2 import numpy as np
      3 import matplotlib.pyplot as plt
----> 4 import mglearn
      5
      6 set_matplotlib_formats('pdf', 'png')

/Users/ssen/Box Sync/projects/introduction_to_ml_with_python/mglearn/__init__.pyc in <module>()
----> 1 from . import plots
      2 from . import tools
      3 from .plots import cm3, cm2
      4
      5 __all__ = ['tools', 'plots', 'cm3', 'cm2']

/Users/ssen/Box Sync/projects/introduction_to_ml_with_python/mglearn/plots.py in <module>()
      9 plot_single_hidden_layer_graph,
     10 plot_two_hidden_layer_graph)
---> 11 from .plot_linear_regression import plot_linear_regression_wave
     12 from .plot_tree_nonmonotonous import plot_tree_not_monotone
     13 from .plot_scaling import plot_scaling

/Users/ssen/Box Sync/projects/introduction_to_ml_with_python/mglearn/plot_linear_regression.py in <module>()
      3
      4 from sklearn.linear_model import LinearRegression
----> 5 from sklearn.model_selection import train_test_split
      6 from .datasets import make_wave
      7

ImportError: No module named model_selection

Is this covered in your book?

I ordered, but have not yet received your book, so I wanted to ask a basic question to make sure your book covers basic problems like this one.

While diving in, I was thinking of looking at linear regression for a set of error data. These data have two columns: (1) "ErrorDate" -> week number (of year), and (2) "ErrorCount" (how many errors did the system have in that week).

I would imagine these data are pretty noisy (random), but who knows?

Anyway, I tried to load this data and do a basic LinearRegression fit test but got an error. (My pandas book has not arrived from the third-party seller either).

"ValueError: Expected 2D array, got 1D array instead:"

--
The code seems so simple, like it should work:

Read CSV data into dataframe

thedf = pd.read_csv("Errors.csv", sep=",")

X_train, X_test, y_train, y_test = train_test_split(
thedf['ErrorCount'], thedf['ErrorDate'], random_state=0)

print (ussdf.head())

>>>> Prints:
ErrorDate ErrorCount
0 1 80
1 2 118
2 3 249
3 4 397
4 5 159

So far, so good..

But, the shape is apparently wrong:

print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

>>>> Prints:
X_test shape: (13,)
y_test shape: (13,)

So, I see the shape is the problem, but I'm wondering if your book covers this basic Pandas set up or if I need to wait for my Pandas book to arrive and hope it covers how to "resize" (I think) the data frame?

Or, is this a case where I need that more introductory book you mentioned as anyone using your book would have this basic pandas knowledge?

It can be frustrating climbing that learning curve...

Thanks,
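
The shape error reported above can be illustrated with a short sketch. It assumes the week number is the feature and the error count is the target (the roles are an assumption, since the report passes them the other way round), and it uses only the five rows printed in the report.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# First rows from the report: week number and error count
weeks = np.array([1, 2, 3, 4, 5])
error_counts = np.array([80, 118, 249, 397, 159])

# scikit-learn expects X with shape (n_samples, n_features)
X = weeks.reshape(-1, 1)   # (5,) -> (5, 1)
y = error_counts           # the target stays one-dimensional

model = LinearRegression().fit(X, y)
print("X shape:", X.shape)
print("slope: {:.2f}".format(model.coef_[0]))
```

With a pandas DataFrame, selecting the column with double brackets, as in thedf[['ErrorDate']], gives a two-dimensional object directly, so no reshape is needed.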

about: plt.legend(loc=4) [In[61], page-295, Ch-5]

Respected Dr. Muller,
Just for your information: when I use loc=4 in plt.legend, I get the following error:
UserWarning: Unrecognized location "4". Falling back on "best";

By the way, I am able to see the output figure, so the issue is not a problem. However, I just wanted to let you know about this.

matplotlib.__version__
Out: '2.1.0'

Thank you,
Sincerely,
Nikhilesh

small plots when rerunning the jupyter notebooks

If the Jupyter notebooks are executed again, the resulting plots are a lot smaller than before. I suggest setting the figure.figsize option in preamble.py to something like:

plt.rcParams['figure.figsize'] = 15, 10

Question about make_blobs(n_samples) [In[51] (page-288, Ch-5), ISBN# 978-1-449-36941-5]

Respected Dr. Muller,
I have been having a hard time figuring out how to properly use cross-validation (especially cross_val_predict) in place of the train_test_split() method used in In[51] (page-288, Ch-5); once I can properly formulate the question, I will ask you about it. Meanwhile, I have another question about the same code (In[51], page-288, Ch-5). Its first two lines are:
from mglearn.datasets import make_blobs
X, y = make_blobs(n_samples=(400, 50), centers=2, cluster_std=[7.0, 2], random_state=22)

My question concerns n_samples=(400, 50), where 400 is the number of negative-class points and 50 is the number of positive-class points (as stated in the book).
However, I could not find any reference to a per-class distribution for n_samples in the file where you define it (https://github.com/amueller/mglearn/blob/master/mglearn/make_blobs.py).
[I am a newbie in terms of Python (whatever Python I have learned is by going through your codes from the book, one line at a time) and also programming in general. So, please forgive me if I am getting confused and asking you a stupid question]

Everywhere I have looked, n_samples is given as a single number, for example n_samples=100, and not as a tuple such as (100, 2).
If you can point me to the general syntax for n_samples, I would greatly appreciate it.

Thank you so much for all your help,
Sincerely,
Nikhilesh
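
As a hedged illustration of the per-class syntax asked about above: this sketch assumes scikit-learn 0.20 or later, where sklearn.datasets.make_blobs itself accepts a sequence for n_samples. It is not the mglearn helper from the book, and centers is left out so that the number of clusters is taken from the sequence.

```python
import numpy as np
from sklearn.datasets import make_blobs

# One entry per cluster: 400 points for class 0 and 50 points for class 1.
# `centers` is left unset so that the number of clusters follows from
# the length of n_samples.
X, y = make_blobs(n_samples=[400, 50], cluster_std=[7.0, 2],
                  random_state=22)

print(X.shape)
print(np.bincount(y))
```

np.bincount counts how many points carry each label, which makes the imbalance between the two classes visible.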

Change font size in mglearn.tools.visualize_coefficients

Hi Andreas,

I really like the book and use it for my own project. Now, one quick question is about how to change the font size in this code
mglearn.tools.visualize_coefficients(coef, feature_names, n_top_features=10)

The picture is beautiful, but the font size is often too small for a presentation slide.

Thanks.

Where is the aclImdb corpus

tree -L 2 data/aclImdb

data/aclImdb [error opening dir]

0 directories, 0 files

Hi, Amueller, how to get this data? Thank you.

MLPClassifier __init__() got an unexpected keyword argument 'algorithm'

I got the above error trying to run the neural network code in chapter 2. I have the latest versions of sklearn and iPython, and am running Python 3.4.3. The notebook code also gives the same result.
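
A possible workaround, assuming the error comes from the parameter having been renamed from algorithm to solver in released versions of scikit-learn: pass solver instead. The dataset and settings below are illustrative choices, not necessarily the ones used in the book.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Small two-class toy dataset (illustrative parameters)
X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# `solver` replaces the older `algorithm` keyword
mlp = MLPClassifier(solver='lbfgs', random_state=0).fit(X_train, y_train)
print("Test set accuracy: {:.2f}".format(mlp.score(X_test, y_test)))
```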

Introduction should include setup

You should add a section to chapter 1 that either gives instructions for getting everything you need set up or points to a wiki page or something similar that provides the instructions.

Currently you only have a list of some of the libraries. For example, you say we need scipy but neglect to mention that we also need to install Pillow, which is not included with scipy but is required for the imread function you are using in your plotting library. You should also provide links and instructions for setting up and using Jupyter Notebook so that people can follow along and see the charts inline as shown in the book.

'mglearn.datasets' has no attribute 'DATA_FOLDER'

Hi,

I have just cloned your repository. When I ran the notebook 02-supervised-learning and reached the cell that contains the following code:

import os
ram_prices = pd.read_csv(os.path.join(mglearn.datasets.DATA_FOLDER, "ram_price.csv"))

plt.semilogy(ram_prices.date, ram_prices.price)
plt.xlabel("Year")
plt.ylabel("Price in $/Mbyte")

I get this error.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-121-e843eddf09a9> in <module>()
      1 import os
----> 2 ram_prices = pd.read_csv(os.path.join(mglearn.datasets.DATA_FOLDER, "ram_price.csv"))
      3 
      4 plt.semilogy(ram_prices.date, ram_prices.price)
      5 plt.xlabel("Year")

AttributeError: module 'mglearn.datasets' has no attribute 'DATA_FOLDER'

In the source file of datasets.py, I do not see any variable called DATA_FOLDER. Could this be another variable that needs to be updated?

Thank you.

About: mglearn.plots.plot_cross_val_selection() [Ch-5, In[21], page-266]

Respected Dr. Muller,
Just for your information - when I run the following code from your book:
import mglearn
mglearn.plots.plot_cross_val_selection()

I do get the proper figure output, but I also get a warning which might be of interest to you:
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:122: FutureWarning: You are accessing a training score ('mean_train_score'), which will not be available by default any more in 0.21. If you need training scores, please set return_train_score=True
warnings.warn(*warn_args, **warn_kwargs)

My sklearn and mglearn versions are as follows:
sklearn.__version__
Out: '0.19.1'

mglearn.__version__
Out: '0.1.6'

Thank you,
Sincerely,
Nikhilesh
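
The warning text itself names the remedy: request training scores explicitly. A minimal sketch with GridSearchCV follows; the estimator, grid and dataset are illustrative assumptions, not the code inside mglearn.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = load_iris()
param_grid = {"C": [0.1, 1, 10]}

# return_train_score=True keeps 'mean_train_score' in cv_results_
grid = GridSearchCV(SVC(), param_grid, cv=5, return_train_score=True)
grid.fit(iris.data, iris.target)

print(grid.cv_results_["mean_train_score"])
```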

display(panda DataFrame) throws a TypeError: 'module' object is not callable

The following code snippet from 01-introduction.ipynb:

from IPython import display

# create a simple dataset of people
data = {'Name': ["John", "Anna", "Peter", "Linda"],
        'Location' : ["New York", "Paris", "Berlin", "London"],
        'Age' : [24, 13, 53, 33]
       }

data_pandas = pd.DataFrame(data)
# IPython.display allows "pretty printing" of dataframes in the Jupyter notebook
display(data_pandas)

throws an error:

TypeError                                 Traceback (most recent call last)
<ipython-input-33-b1f2029f7e30> in <module>()
     12 # IPython.display allows "pretty printing" of dataframes
     13 # in the Jupyter notebook
---> 14 display(data_pandas)

TypeError: 'module' object is not callable

I've downloaded and installed the current Anaconda3-4.4.0-Linux-x86_64.sh version from scratch and tried an updated environment - both with the same result:

$ conda info

Current conda install:

               platform : linux-64
          conda version : 4.3.21
       conda is private : False
      conda-env version : 4.3.21
    conda-build version : not installed
         python version : 3.6.1.final.0
       requests version : 2.14.2
       root environment : ~/anaconda3  (writable)
    default environment : ~/anaconda3
       envs directories : ~/anaconda3/envs
                          ~/.conda/envs
          package cache : ~/anaconda3/pkgs
                          ~/.conda/pkgs
           channel URLs : https://repo.continuum.io/pkgs/free/linux-64
                          https://repo.continuum.io/pkgs/free/noarch
                          https://repo.continuum.io/pkgs/r/linux-64
                          https://repo.continuum.io/pkgs/r/noarch
                          https://repo.continuum.io/pkgs/pro/linux-64
                          https://repo.continuum.io/pkgs/pro/noarch
            config file : None
             netrc file : None
           offline mode : False
             user-agent : conda/4.3.21 requests/2.14.2 CPython/3.6.1 Linux/4.4.0-79-generic debian/stretch/sid glibc/2.23    
                UID:GID : 1000:1000

Just using data_pandas without the display function call prints the pandas DataFrame and works:

import pandas as pd
from IPython import display

# create a simple dataset of people
data = {'Name': ["John", "Anna", "Peter", "Linda"],
        'Location' : ["New York", "Paris", "Berlin", "London"],
        'Age' : [24, 13, 53, 33]
       }

data_pandas = pd.DataFrame(data)
data_pandas

Age	Location	Name
0	24	New York	John
1	13	Paris	Anna
2	53	Berlin	Peter
3	33	London	Linda
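
The error above is consistent with the import line: from IPython import display binds the module IPython.display, which cannot be called. The errata section lists the intended import. A small sketch, with a print fallback added here only so that it also runs outside IPython:

```python
try:
    # Import the function, not the module of the same name
    from IPython.display import display
except ImportError:
    # Fallback for a plain Python session without IPython
    display = print

data = {'Name': ["John", "Anna"], 'Age': [24, 13]}
display(data)
```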

How to install mglearn using anaconda

Sorry for the beginner's question; I have already spent a few days searching for instructions...

I have bought your book, and am trying to follow the examples and setup.
I have installed anaconda on OS X.
I have used github to download mglearn.

Now, I am looking for how to install mglearn in anaconda.

Many thanks

coefficients checking fails

Line 26 in tools.py.
if len(coefficients) != len(feature_names):
But coefficients is a 2D array, so it should be
if len(coefficients[0]) != len(feature_names):
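
Why the two checks differ can be seen in a short NumPy sketch. The array values and feature names are made up for illustration, and flattening with ravel() is shown as one alternative to indexing the first row.

```python
import numpy as np

# Shape (1, n_features), as stored by linear models for binary classification
coefficients = np.array([[0.5, -1.2, 0.3]])
feature_names = ["f0", "f1", "f2"]

print(len(coefficients))          # length of the first axis only
print(len(coefficients.ravel()))  # total number of coefficients

# Comparing the flattened array works for both 1D and (1, n) inputs
flat = coefficients.ravel()
if len(flat) != len(feature_names):
    raise ValueError("Number of coefficients {} doesn't match number of "
                     "feature names {}".format(len(flat), len(feature_names)))
```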

Notation of Linear Regression Model

The first photo shows Andrew Ng's notation for linear regression, and the second one is from the book. What does 'b' correspond to in Andrew's notation? My understanding is that theta0 * x0 is equivalent to b (b * 1, where x0 is 1). If that is the case, is 'b' still necessary in the formula? Thank you.
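
Since the photos are not reproduced here, the following assumes the two formulas are the usual ones: the book's form with an explicit intercept b, and the form with a constant feature x_0 = 1. Under that assumption the two notations line up term by term:

```latex
% Book notation: p + 1 features x[0], ..., x[p] and an explicit intercept b
\hat{y} = w[0]\,x[0] + w[1]\,x[1] + \dots + w[p]\,x[p] + b

% Notation with a constant feature x_0 = 1
h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n,
\qquad x_0 = 1

% Correspondence (with n = p + 1)
b = \theta_0, \qquad w[i] = \theta_{i+1}, \qquad x[i] = x_{i+1}
```

So b plays the role of theta_0. It can only be dropped from the formula if a constant feature equal to 1 is added to the data, which is exactly what the x_0 = 1 convention does.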

07-working-with-text-data.ipynb: ModuleNotFoundError: No module named 'spacy'

On section 07-working-with-text-data.ipynb#Advanced-tokenization,-stemming-and-lemmatization I get the following error:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-42-2aa020095bf1> in <module>()
----> 1 import spacy
      2 import nltk
      3 
      4 # load spacy's English-language models
      5 en_nlp = spacy.load('en')

ModuleNotFoundError: No module named 'spacy'

once I installed spacy:

$ conda install spacy
Fetching package metadata .........
Solving package specifications: .

Package plan for installation in environment /home/blinkeye/anaconda3:

The following NEW packages will be INSTALLED:

    cymem:      1.31.2-py36_0 
    murmurhash: 0.26.4-py36_0 
    plac:       0.9.6-py36_0  
    preshed:    0.46.4-py36_0 
    semver:     2.7.7-py36_0  
    spacy:      0.101.0-py36_0
    sputnik:    0.9.3-py36_0  
    thinc:      5.0.8-py36_0  
    ujson:      1.35-py36_0   

The following packages will be UPDATED:

    conda:      4.3.21-py36_0  --> 4.3.22-py36_0

Proceed ([y]/n)? y

I got further errors:

---------------------------------------------------------------------------
PackageNotFoundException                  Traceback (most recent call last)
/home/blinkeye/anaconda3/lib/python3.6/site-packages/spacy/util.py in get_package_by_name(name, via)
     43         return sputnik.package(about.__title__, about.__version__,
---> 44             name, data_path=via)
     45     except PackageNotFoundException as e:

/home/blinkeye/anaconda3/lib/python3.6/site-packages/sputnik/__init__.py in package(app_name, app_version, package_string, data_path)
    159     pool = Pool(app_name, app_version, expand_path(data_path))
--> 160     return pool.get(package_string)
    161 

/home/blinkeye/anaconda3/lib/python3.6/site-packages/sputnik/package_list.py in get(self, package_string)
     56         if not candidates:
---> 57             raise PackageNotFoundException(package_string)
     58 

PackageNotFoundException: en

Then I followed conda-forge / packages / spacy 1.8.2

$ conda install -c conda-forge spacy=1.8.2
Fetching package metadata ...........
Solving package specifications: .

Package plan for installation in environment /home/blinkeye/anaconda3:

The following NEW packages will be INSTALLED:

    ftfy:      4.4.2-py36_0      conda-forge
    regex:     2017.04.05-py36_0 conda-forge
    termcolor: 1.1.0-py36_1      conda-forge
    tqdm:      4.14.0-py36_0     conda-forge

The following packages will be UPDATED:

    preshed:   0.46.4-py36_0                 --> 1.0.0-py36_0      conda-forge
    spacy:     0.101.0-py36_0                --> 1.8.2-np112py36_0 conda-forge
    thinc:     5.0.8-py36_0                  --> 6.5.2-np112py36_0 conda-forge

The following packages will be SUPERSEDED by a higher-priority channel:

    conda:     4.3.22-py36_0                 --> 4.3.21-py36_1     conda-forge
    conda-env: 2.6.0-0                       --> 2.6.0-0           conda-forge

Proceed ([y]/n)? y

which seems to work (I still see a warning):

    Warning: no model found for 'en'

    Only loading the 'en' tokenizer.

"given enough data, ridge and linear regression will have the same performance"

Is this strictly true (pp 54-55)? I am a beginner, but ridge has an artificial constraint imposed on the coefficients while linear regression does not. Therefore I would have expected linear regression performance to eventually exceed that of ridge, since regularization puts a cap on how good ridge can get. Just curious if this is wrong.
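
One way to see why the claim can hold, sketched under the assumption that the regularization strength alpha is kept fixed while the number of samples n grows: ridge adds a penalty term to the least-squares objective rather than a hard cap on the coefficients, and dividing the objective by n shows the penalty's weight shrinking.

```latex
\min_{w,b}\ \sum_{i=1}^{n} \bigl(y_i - w^\top x_i - b\bigr)^2
          + \alpha \lVert w \rVert_2^2
\quad\Longleftrightarrow\quad
\min_{w,b}\ \frac{1}{n}\sum_{i=1}^{n} \bigl(y_i - w^\top x_i - b\bigr)^2
          + \frac{\alpha}{n} \lVert w \rVert_2^2
```

As n grows, alpha / n tends to 0, so the ridge solution moves toward the ordinary least-squares solution and the two models behave more and more alike.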

Error with mglearn.tools.visualize_coefficients()

Hi,

When I try to inspect the important trigram features of my model with this code:
mask = np.array([len(feature.split(" ")) for feature in feature_names]) == 3
mglearn.tools.visualize_coefficients(coef.ravel()[mask], feature_names[mask], n_top_features=40)

I get this error and I don't understand why:
ValueError: incompatible sizes: argument 'height' must be length 80 or scalar

Do you have any idea?
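
One possible cause, stated as an assumption about the helper's internals rather than a confirmed diagnosis: if the helper picks the n_top_features smallest and largest coefficients and draws 2 * n_top_features bars, then a mask that leaves fewer than n_top_features entries yields too few values. The sketch below reproduces only that counting argument with made-up numbers.

```python
import numpy as np

n_top_features = 40
# Hypothetical case: only 30 trigram features survive the mask
coef = np.linspace(-1, 1, 30)

# Select the n smallest and n largest entries (assumed to mirror the helper)
order = np.argsort(coef)
selected = np.hstack([order[:n_top_features], order[-n_top_features:]])
print(len(selected), "values for", 2 * n_top_features, "bar positions")

# Cap the request so that 2 * n never exceeds the number of coefficients
safe_n = min(n_top_features, len(coef) // 2)
print("safe n_top_features:", safe_n)
```

Checking mask.sum() before plotting would show whether this applies. Note also that indexing feature_names with a boolean mask requires a NumPy array, not a plain list.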

01-introduction - ImportError: cannot import name plots

from preamble import *

I get the following error message:

-----------------------------------------------------
ImportError         Traceback (most recent call last)
<ipython-input-51-4b76e7eb44e5> in <module>()
----> 1 from preamble import *

/Users/Aanandh/Desktop/Data Science/introduction_to_ml_with_python/preamble.py in <module>()
      3 import numpy as np
      4 import matplotlib.pyplot as plt
----> 5 import mglearn
      6 from cycler import cycler
      7 

/Users/Aanandh/Desktop/Data Science/introduction_to_ml_with_python/mglearn/__init__.py in <module>()
----> 1 from . import plots
      2 from . import tools
      3 from .plots import cm3, cm2
      4 from .tools import discrete_scatter
      5 from .plot_helpers import ReBl

ImportError: cannot import name plots

Matplotlib warning regarding: mglearn.plots.plot_cross_validation() [Ch-5, In[2], p-254]

Respected Dr. Muller,
When I run this command, 'mglearn.plots.plot_cross_validation()', I get the following warning:
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\pyplot.py:2648: MatplotlibDeprecationWarning: The bottom kwarg to barh is deprecated use y instead. Support for bottom will be removed in Matplotlib 3.0
ret = ax.barh(*args, **kwargs)

Furthermore, the output is not clear i.e. all the 'training-data boxes' (the non-dark ones) after the first row do not appear.
The same is the case when I run the command ['mglearn.plots.plot_cross_validation()'] that appears in the next page (In[7], page-257).
I just thought to let you know about this.

Sincerely,
Nikhilesh

07-working-with-text-data.ipynb: missing data/aclImdb/train directory

The notebook starts with a tree command and a cleanup command, both of which fail on an up-to-date Ubuntu 16.04 Xenial:

Types of data represented as strings
Example application: Sentiment analysis of movie reviews
In [2]:

!tree -dL 2 data/aclImdb
/bin/sh: 1: tree: not found
In [3]:

!rm -r data/aclImdb/train/unsup
rm: cannot remove 'data/aclImdb/train/unsup': No such file or directory

The missing tree package problem is fixed by:
sudo apt install tree

The real problem follows a bit later on the In[4] code snippet:

from sklearn.datasets import load_files

reviews_train = load_files("data/aclImdb/train/")
# load_files returns a bunch, containing training texts and training labels
text_train, y_train = reviews_train.data, reviews_train.target
print("type of text_train: {}".format(type(text_train)))
print("length of text_train: {}".format(len(text_train)))
print("text_train[6]:\n{}".format(text_train[6]))

which throws:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-4-a724a25f6c25> in <module>()
      1 from sklearn.datasets import load_files
      2 
----> 3 reviews_train = load_files("data/aclImdb/train/")
      4 # load_files returns a bunch, containing training texts and training labels
      5 text_train, y_train = reviews_train.data, reviews_train.target

/home/blinkeye/anaconda3/lib/python3.6/site-packages/sklearn/datasets/base.py in load_files(container_path, description, categories, load_content, shuffle, encoding, decode_error, random_state)
    199     filenames = []
    200 
--> 201     folders = [f for f in sorted(listdir(container_path))
    202                if isdir(join(container_path, f))]
    203 

FileNotFoundError: [Errno 2] No such file or directory: 'data/aclImdb/train/'
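
The README notes that the aclImdb dataset is not included in the repository and has to be downloaded separately. A small check like the following can make the failure easier to read. The function name missing_aclimdb_dirs is hypothetical, and the folder layout (train and test, each with pos and neg) is assumed from the unpacked dataset.

```python
from pathlib import Path


def missing_aclimdb_dirs(root="data/aclImdb"):
    """Return the expected review folders that do not exist under root."""
    expected = [Path(root) / split / label
                for split in ("train", "test")
                for label in ("pos", "neg")]
    return [str(p) for p in expected if not p.is_dir()]


missing = missing_aclimdb_dirs()
if missing:
    print("Download and unpack the aclImdb dataset first; missing:")
    for path in missing:
        print("  ", path)
```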

07-working-with-text-data.ipynb: missing multiple imports

Trying to execute:

# Technicality: we want to use the regexp based tokenizer
# that is used by CountVectorizer  and only use the lemmatization
# from SpaCy. To this end, we replace en_nlp.tokenizer (the SpaCy tokenizer)
# with the regexp based tokenization
import re
# regexp used in CountVectorizer:
regexp = re.compile('(?u)\\b\\w\\w+\\b')

# load spacy language model and save old tokenizer
en_nlp = spacy.load('en')
old_tokenizer = en_nlp.tokenizer
# replace the tokenizer with the preceding regexp
en_nlp.tokenizer = lambda string: old_tokenizer.tokens_from_list(
    regexp.findall(string))

# create a custom tokenizer using the SpaCy document processing pipeline
# (now using our own tokenizer)
def custom_tokenizer(document):
    doc_spacy = en_nlp(document, entity=False, parse=False)
    return [token.lemma_ for token in doc_spacy]

# define a count vectorizer with the custom tokenizer
lemma_vect = CountVectorizer(tokenizer=custom_tokenizer, min_df=5)

results in the following error:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-4-29dc777cc8f9> in <module>()
     23 
     24 # define a count vectorizer with the custom tokenizer
---> 25 lemma_vect = CountVectorizer(tokenizer=custom_tokenizer, min_df=5)

NameError: name 'CountVectorizer' is not defined
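
The NameError goes away once the class is imported in the same session. A minimal sketch with two made-up sentences, independent of the spacy-based tokenizer above:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The fool doth think he is wise,",
        "but the wise man knows himself to be a fool"]

vect = CountVectorizer()
vect.fit(docs)
print("Vocabulary size: {}".format(len(vect.vocabulary_)))
```

The snippet in the report also calls spacy.load, so a cell run on its own needs import spacy as well.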

Broken mglearn figures for crossvalidation

Chapter 5 shows 2 CV procedures that are broken in mglearn:

mglearn.plots.plot_shuffle_split() doesn't return a valid figure

mglearn.plots.plot_group_kfold() doesn't exist. There is a plot_label_kfold(), but it looks like it got mangled in a PR. If you change 'labels' to groups on line 22 it works.

photos in chapter 3 coming out as green

Do you know why the photos in Chapter 3 come out green when you run it in the notebook? In the PDF there are some green photos, but not all. When running the notebook myself the photos are all green.

About the value of vmax in chapter 3 (For instance: code In[74], page 198)

Respected Dr. Muller,

When I ran the code of In[74] (page 198), I got yellow boxes without any image!
Then I found that changing vmax from 1 to ~200 produces proper images as shown in the book in page 199. (Changing vmin had no drastic effect).
I think vmax corresponds to contrast while vmin corresponds to brightness (am I right?)
(I am using Spyder 3.2.4 with Python 3.5.4, 64 bits, on Windows 10.)
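
In imshow, vmin and vmax set the data values that map to the two ends of the colormap, rather than brightness and contrast separately. The sketch below imitates that linear mapping with NumPy on made-up 8-bit values. The helper to_unit_range is hypothetical, and matplotlib's own handling of out-of-range values goes through the colormap rather than this exact function.

```python
import numpy as np


def to_unit_range(pixels, vmin, vmax):
    """Map values linearly so vmin -> 0 and vmax -> 1, clipping the rest."""
    pixels = np.asarray(pixels, dtype=float)
    return np.clip((pixels - vmin) / (vmax - vmin), 0.0, 1.0)


pixels = np.array([0, 64, 128, 255])   # hypothetical 8-bit gray values
print(to_unit_range(pixels, 0, 1))      # nearly everything saturates
print(to_unit_range(pixels, 0, 255))    # the full range is preserved
```

If the image data is stored as values between 0 and 255, a vmax of 1 saturates almost every pixel, which would match the uniformly colored boxes described above.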

have to repeatedly run %matplotlib notebook to get plots

I'm guessing this isn't an issue with the book per se, but I'm getting it going through the examples in Chapter 2 (pasting them in to Jupyter). I find that I need to rerun my imports frequently to get the plots to show. In the first cell I have:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import mglearn
from IPython.display import display
%matplotlib notebook

and I basically have to go back and rerun that before each cell with a plot for some reason. Any idea what could be wrong?

docker files to create a docker image with everything preconfigured

Hi Andreas

what would you think about providing a docker file for a reproducible environment in this repo? I've created the necessary docker files based on the official continuumio/anaconda image(s). Both a Python2.7 and a Python3 based image can be created including the additional packages we need to run your notebooks.

A pull request will follow shortly (tested on Ubuntu 16.04).

This allows for us (readers) to easily run your notebooks in a pre-tested and properly configured environment. The initial docker build command takes a while but once one has all the necessary images downloaded re-creating your image takes a few seconds (if recreating the image is ever needed that is).

I'm currently binding your whole repo and files into the docker image, which allows for git pull outside the container without re-creating the docker image. This means you may rename, change, delete and add new files without the need to adapt the docker files. All we (readers) need to do is pull from this repo.

Optional (next step):
If you intend to actually provide a pre-configured docker image (not just the Dockerfile) yourself the following things might change:

you probably would want to package (e.g. snapshot/release) your notebooks from this repo into the docker image (hence having full control of the environment and the notebook versions)
you probably would want to pre-download the additional /data files in the image, see #41

What do you think?

Error in: mglearn.plots.plot_label_kfold() (Ch-5, In [16], page-262)

Respected Dr. Muller,
When I run the following command:
import mglearn
mglearn.plots.plot_label_kfold()

I get the following error:
AttributeError: module 'mglearn.plots' has no attribute 'plot_label_kfold'

My matplotlib, mglearn versions are:
matplotlib.__version__
Out: '2.1.0'
mglearn.__version__
Out: '0.1.6'

Thank you,
Sincerely,
Nikhilesh

Type Error issue on discrete_scatter()

Hi Andreas,

I'm working through your repo to learn how to use scikit-learn. In the 4th row of the 2nd cell of the notebook "02-supervised-learning.ipynb", discrete_scatter() is called to plot the blob data.

mglearn.discrete_scatter(X[:, 0], X[:, 1], y)

However I was trapped by an error which says, "TypeError: 'Cycler' object is not callable". It seems that current_cycler() which is called in the 80th line in the file "plot_helpers.py" is wrong.

current_cycler = mpl.rcParams['axes.prop_cycle']

for i, (yy, cycle) in enumerate(zip(unique_y, current_cycler())):

I guess current_cycler is not callable since it is taken from rcParams in the previous line, so the parentheses '()' should be removed. It worked when I removed them and ran the program on my laptop.

I hope my point makes sense.

Takeshi
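
A version-independent alternative to calling the cycler, sketched with a made-up three-color cycle standing in for mpl.rcParams['axes.prop_cycle']. To my understanding, calling a Cycler object is only supported in newer releases of the cycler package, while plain iteration works either way.

```python
from itertools import cycle

from cycler import cycler

# Stand-in for mpl.rcParams['axes.prop_cycle'] (hypothetical colors)
current_cycler = cycler(color=["r", "g", "b"])
unique_y = [0, 1, 2, 3]

# Iterating a Cycler yields one style dict per entry; itertools.cycle
# repeats them so that more classes than styles still works.
for yy, style in zip(unique_y, cycle(current_cycler)):
    print(yy, style["color"])
```

Removing the parentheses, as suggested above, also works as long as there are no more classes than entries in the cycle.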

Jupyter Notebook Cannot Find Pandas

Hey, I am having some difficulty getting the examples to run in the Jupyter Notebook.

When I have it set to be a Python 3 notebook, it cannot find pandas or many other libraries.

Yet when I run the '!pip list' command, they show up fine. Any ideas?

Also, I cannot get 'display(data_pandas)' on page 10 to work. It says display is an unrecognized command.
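
A common reason for this kind of mismatch is that the pip on the shell path belongs to a different Python installation than the notebook kernel. This is offered as a diagnostic sketch, not a confirmed explanation for this report.

```python
import importlib.util
import sys

# The interpreter that the current kernel (or script) is actually running
print("interpreter:", sys.executable)

# find_spec returns None when a package cannot be imported from here
for name in ("pandas", "numpy"):
    found = importlib.util.find_spec(name) is not None
    print("{}: {}".format(name, "found" if found else "missing"))
```

In a notebook cell, !{sys.executable} -m pip install pandas installs into the interpreter the kernel is using.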

Where is the data file for categorical feature example?

See this code:

data = pd.read_csv("/home/andy/datasets/adult.data", header=None, index_col=False,
                  names=['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'gender', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'])

Where should I get the 'adult.data' file? Thank you.

images/ directory is missing

which causes

a minor issue in 01-introduction.ipynb where the static image should be displayed:
### A First Application: Classifying iris species ![sepal_petal](images/iris_petal_sepal.png)
a minor issue for 02-supervised-learning.ipynb where the static image should be displayed:
![model_complexity](images/overfitting_underfitting_cartoon.png)
a bigger issue in 03-unsupervised-learning.ipynb where an image should be written to:
plt.savefig("images/03-face_decomposition.png")
plt.close()
and the Notebook execution stops if the directory images/ does not exist.

mkdir images obviously solves the problem for 03-unsupervised-learning.ipynb, but I think the directory should be in the git repo in the first place (along with a few static images used in the notebooks).
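
Until the directory is part of the repository, the notebook could create it before saving. A minimal sketch follows. The helper name ensure_dir is hypothetical, and exist_ok requires Python 3.2 or later.

```python
import os


def ensure_dir(path):
    """Create path if it does not exist yet and return it."""
    # exist_ok=True makes repeated notebook runs safe
    os.makedirs(path, exist_ok=True)
    return path


# Typical use before writing a figure:
# ensure_dir("images")
# plt.savefig("images/03-face_decomposition.png")
```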

Error trying to import mglearn

In [14]:

import mglearn

ImportError Traceback (most recent call last)
in ()
----> 1 import mglearn

C:\Anaconda3\lib\site-packages\mglearn_init_.py in ()
----> 1 from . import plots
2 from . import tools
3 from .plots import cm3, cm2
4 from .tools import discrete_scatter
5 from .plot_helpers import ReBl

C:\Anaconda3\lib\site-packages\mglearn\plots.py in ()
9 plot_single_hidden_layer_graph,
10 plot_two_hidden_layer_graph)
---> 11 from .plot_linear_regression import plot_linear_regression_wave
12 from .plot_tree_nonmonotonous import plot_tree_not_monotone
13 from .plot_scaling import plot_scaling

C:\Anaconda3\lib\site-packages\mglearn\plot_linear_regression.py in ()
3
4 from sklearn.linear_model import LinearRegression
----> 5 from sklearn.model_selection import train_test_split
6 from .datasets import make_wave
7 from .plot_helpers import cm2

C:\Anaconda3\lib\site-packages\sklearn\model_selection\__init__.py in ()
15 from ._split import check_cv
16
---> 17 from ._validation import cross_val_score
18 from ._validation import cross_val_predict
19 from ._validation import learning_curve

C:\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in ()
24 from ..utils.fixes import astype
25 from ..utils.validation import _is_arraylike, _num_samples
---> 26 from ..utils.metaestimators import _safe_split
27 from ..externals.joblib import Parallel, delayed, logger
28 from ..metrics.scorer import check_scoring

ImportError: cannot import name '_safe_split'

What have I done wrong?

Update: it works now, after restarting Jupyter. Sorry, my bad.

Unclear how/where you populate some variables

It's not clear from the snippet notebook approach (I'm using a Python IDE) how you populate variables along the way in your examples.

For instance, with your Ridge Regression example:

from sklearn.linear_model import Ridge

ridge = Ridge().fit(X_train, y_train)
print("Training set score: {:.2f}".format(ridge.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge.score(X_test, y_test)))

How did you populate X_train, y_train?

Will this example run as is?

My goal with any book like this is to be able to copy and paste the code and have it run. Having said that, I'm new to Python, so I may not understand this whole "notebook" thing that seems so prevalent.

I was hoping for individual, standalone, working programs (.py files).

Can you please explain how to get all the examples to work in your notebook where it doesn't "seem" obvious how you are populating the variables?

If I can figure that out, I'll definitely buy your book.

Thanks much.
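
In a notebook, variables defined in earlier cells stay available in later ones, so X_train and y_train come from a train_test_split call made before the Ridge cell. A standalone sketch of the same flow, assuming a synthetic dataset from make_regression as a stand-in (the book loads its own data in an earlier cell):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Stand-in data (an assumption); the book loads its own dataset in an earlier cell.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# This earlier step is what populates X_train, X_test, y_train and y_test.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ridge = Ridge().fit(X_train, y_train)
print("Training set score: {:.2f}".format(ridge.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge.score(X_test, y_test)))
```

Saved as a .py file, this runs on its own; the scores will differ from the book's because the data is different.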

mglearn.plots does not work

mglearn.plots.plot_knn_classification(n_neighbors=3) opens and instantly closes the graph

TypeError: Cannot compare type 'Timestamp' with type 'float' [for Fig.4-12, In[50], page-245]

Respected Dr. Muller, please help me with this error.
Running your code as stated exactly in the book in Chapter 4, In[50] (p-245) gives me the following error:
TypeError: Cannot compare type 'Timestamp' with type 'float'

The code is as follows:
import matplotlib.pyplot as plt
import pandas as pd
import mglearn

citibike = mglearn.datasets.load_citibike()

plt.figure(figsize=(10, 3))
xticks = pd.date_range(start=citibike.index.min(), end=citibike.index.max(),
                       freq='D')
plt.xticks(xticks, xticks.strftime("%a %m-%d"), rotation=90, ha="left")
plt.plot(citibike, linewidth=1)
plt.xlabel("Date")
plt.ylabel("Rentals")

I am running this code in Spyder 3.2.4 (Anaconda) with pandas 0.21.0.

Thank you,
Sincerely,
Nikhilesh
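
One possible cause, given the pandas version reported above: pandas 0.21.0 stopped registering its date converters with matplotlib automatically. A hedged sketch of registering them explicitly, using `pd.plotting.register_matplotlib_converters()` (available in later pandas releases) and a synthetic series in place of the citibike data, which is an assumption of this example:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Explicitly register pandas' date converters with matplotlib.
pd.plotting.register_matplotlib_converters()

# Synthetic stand-in for the citibike series (assumed data).
index = pd.date_range("2015-08-01", periods=31, freq="D")
rentals = pd.Series(np.arange(31) % 7, index=index)

fig, ax = plt.subplots(figsize=(10, 3))
ax.plot(rentals.index, rentals.values, linewidth=1)

# Weekly ticks labelled with weekday and date, as in the book's figure.
xticks = pd.date_range(start=index.min(), end=index.max(), freq="7D")
ax.set_xticks(xticks)
ax.set_xticklabels(list(xticks.strftime("%a %m-%d")), rotation=90, ha="left")
ax.set_xlabel("Date")
ax.set_ylabel("Rentals")
```

Plotting before setting the ticks lets matplotlib pick the date converter from the data, so the tick positions are interpreted on the same axis units.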

where is the plot for figure 1-3

Hi there,

I am new to machine learning and am working through the book to get a general idea.

But I did not find the plot code snippet for figure 1-3. The book writes:
"Figure 1-3 is a pair plot of the features in the training set. The data points are colored according to the species the iris belongs to. To create the plot, we first convert the NumPy array into a pandas DataFrame. pandas has a function to create pair plots called scatter_matrix. The diagonal of this matrix is filled with histograms of each feature:"

This is followed by an input cell:

In[24]:
# create dataframe from data in X_train
# label the columns using the strings in iris_dataset.feature_names
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
# create a scatter matrix from the dataframe, color by y_train
grr = pd.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
                            hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)

Then Figure 1-3 appears. Would you please tell me which command we use to generate or plot Figure 1-3?

Thank you so much!
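
The scatter_matrix call itself draws the figure; no separate plot command is needed. A self-contained sketch, assuming a recent pandas where the function lives at `pd.plotting.scatter_matrix`, and leaving out the book's `cmap=mglearn.cm3` so that mglearn is not required:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0)

# Label the columns using the strings in iris_dataset.feature_names.
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)

# The call itself draws the figure; the return value is a grid of Axes.
grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15),
                                 marker='o', hist_kwds={'bins': 20}, s=60, alpha=.8)
print(grr.shape)
```

In a notebook with inline plotting enabled the figure shows up under the cell; in a plain script, drop the Agg line and call `plt.show()` from matplotlib.pyplot afterwards.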

About [2,3]- fold cross validation scores (Ch-5, page 257, paragraph below Out[6])

Respected Dr. Muller,
Sorry for asking so many questions - I hope you won't mind.
This question pertains to the paragraph right below Out[6] on page 257. There you state (and it makes sense) that for k=3, the accuracy on the IRIS dataset is 0.
However, when I do it, I do not get 0. Not only that, I get the score which is almost equal to that for k=5!

That is, running the following command:
from sklearn.datasets import load_iris
iris = load_iris()

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

from sklearn.model_selection import cross_val_score

for cv in [2, 3, 5, 10, 15, 20]:
    scores = cross_val_score(logreg, iris.data, iris.target, cv=cv)
    print('\n{}-fold Cross-validation scores: {}'.format(cv, scores))
    print('AVERAGE {}-fold Cross-validation scores: {:.2f}\n'.format(cv, scores.mean()))
    print('---------------------------------------------')

I get the scores that are almost equal to each other. Why?
If my question has validity, and if you have time, then please help me with this.

Thank you,
Sincerely,
Nikhilesh
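
A likely explanation is that passing an integer as cv to cross_val_score uses stratified folds for classifiers, so every fold contains all three classes. The zero accuracy only appears with plain, unshuffled k-fold splitting. A short sketch comparing the two; `max_iter=1000` is my addition to avoid convergence warnings:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

iris = load_iris()
logreg = LogisticRegression(max_iter=1000)

# An integer cv on a classifier uses stratified folds: every fold contains all classes.
stratified_scores = cross_val_score(logreg, iris.data, iris.target, cv=3)

# Plain KFold without shuffling cuts the sorted iris labels into one class per fold.
kfold_scores = cross_val_score(logreg, iris.data, iris.target, cv=KFold(n_splits=3))

print("Stratified 3-fold scores:", stratified_scores)
print("Unshuffled KFold(3) scores:", kfold_scores)
```

With unshuffled KFold, each test fold holds a class the model never saw during training, which is why every prediction is wrong.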

tkinter module can't be imported

I am using Python 3.5.2 on CentOS 7. I ran "yum install python3-tkinter", but the package seems to work only with Python 3.3, as shown by the message below:

Package python3-tkinter-3.3.2-12.el7.nux.x86_64 already installed and latest version

How can I make it work with Python 3.5?

I'm using the Machine Learning with Python book. Everything imports correctly but I can't get the plots to display in Jupyter notebook. The following code just prints the shape of X, but no graph.

mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
plt.legend(["Class 0", "Class 1"], loc=4)
plt.xlabel("First feature")
plt.ylabel("Second feature")
print("X.shape : {}".format(X.shape))

question about warning on page 290

The book says: "For simplicity, we changed the threshold value based on test set
results in the code above. In practice, you need to use a hold-out
validation set, not the test set. As with any other parameter, setting
a decision threshold on the test set is likely to yield overly optimistic
results. Use a validation set or cross-validation instead."

I'm a little confused by this. Are you really saying we should experiment with different threshold values using our holdout set? I thought that the holdout was supposed to be set aside until you're completely done tweaking hyperparameters (of which threshold is one, if I'm not mistaken), and then you run it once on the holdout and that's it.

Or to put it another way, isn't the test set actually where we would want to experiment with parameters?

Also a little confused by the final sentence. There's lots of different terminology, but my impression is you either use a train/test split, or you use something more sophisticated like cross-validation. Setting the threshold in CV would be equivalent to setting the threshold in test. I don't see why CV would necessarily be a "better place" for avoiding overfitting, as this implies.

Then again, I'm a noob so I could be missing something :)
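
One way to read the book's warning is as a three-way split: train, validation (for choosing the threshold) and test (used once at the end). A minimal sketch under that reading, with synthetic data, LogisticRegression and an F1 criterion all chosen for illustration rather than taken from the book:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumed); the book uses a different dataset and model.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# First set the test set aside, then split the rest into training and validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, stratify=y, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_trainval, y_trainval, stratify=y_trainval, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Choose the threshold using validation data only.
candidates = np.linspace(0.1, 0.9, 17)
valid_proba = model.predict_proba(X_valid)[:, 1]
valid_f1 = [f1_score(y_valid, (valid_proba > t).astype(int), zero_division=0)
            for t in candidates]
best_threshold = candidates[int(np.argmax(valid_f1))]

# Touch the test set once, with the threshold already fixed.
test_pred = (model.predict_proba(X_test)[:, 1] > best_threshold).astype(int)
test_f1 = f1_score(y_test, test_pred, zero_division=0)
print("Chosen threshold: {:.2f}, test F1: {:.2f}".format(best_threshold, test_f1))
```

Because the test set played no part in picking the threshold, its score remains an honest estimate of performance on new data.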

scikit-learn version

The notebooks were written using version 0.18.dev of scikit-learn and don't work with the latest (non-dev) version of the package (0.17.1). The problem is that 0.18.dev does not seem to be available for Windows, which I'm unfortunately using right now.
I have edited some scripts in order to make the notebooks work and I intend to make a pull request for these changes later on.

Errata: Potential typo in text on page 59

I didn't find the text online or I would've opened a PR, and I'm unsure if you are even interested in collecting errata, but I believe there is a typo on pg 59 (Chapter 2: Supervised Learning):

Let's analyze LinearLogistic in more detail on the Breast Cancer dataset:

The section is discussing linear SVM and logistic regression, and I think they were just combined accidentally. I believe the text should be corrected to:

Let's analyze LogisticRegression in more detail on the Breast Cancer dataset:

01-introduction.ipynb

"from IPython import display" produces a TypeError in Python 3.6.0 in the IPython notebook. However, as indicated in the book, the commands work when the import is performed as "from IPython.display import display".

ImportError: No module named 'preamble'

Hi,
I am executing the lines of code below, but I am getting a No module named 'preamble' error.

%matplotlib inline
from preamble import *

Just FYI, after executing the following line I get 0.18.

import sklearn; print(sklearn.__version__)

Wrong returned object in mglearn.datasets.load_citibike()

When I run

citibike = mglearn.datasets.load_citibike()
print("Citibike data:\n{}".format(citibike.head()))

I have the exception:


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-eb013c64f388> in <module>()
----> 1 print("Citibike data:\n{}".format(citibike.head()))

AttributeError: 'numpy.float64' object has no attribute 'head'

I guess that the returned object should be a pandas.DataFrame

many support vectors in figure 2-42

I hope this is a reasonable question. Figure 2-42 (Out[86] in the notebook for chapter 2, page 102 in the PDF) is odd, in that 100% of the points are support vectors in the top row and rightmost column. From my limited understanding I would have expected the number of support vectors to be lowest in the top row, and increase in the second and third rows as complexity is added.

(I would not have expected the number of support vectors to vary with gamma, since it seems like gamma is considered after SV's are already determined)

Just wanted to make sure this is expected, and perhaps ask you for an explanation. Thank you.
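
One way to check this empirically is to count the support vectors for each parameter setting. Since gamma changes the kernel that the optimization is solved with, the set of support vectors can change with gamma as well as with C. A small sketch on synthetic data (an assumption; the book's figure uses a handcrafted dataset), with the same C and gamma grid as I understand the figure to use:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data (assumed); the book's figure uses a handcrafted dataset.
X, y = make_blobs(n_samples=60, centers=2, cluster_std=2.0, random_state=0)

counts = {}
for C in [0.1, 1, 1000]:
    for gamma in [0.1, 1, 10]:
        svm = SVC(kernel="rbf", C=C, gamma=gamma).fit(X, y)
        # support_ holds the indices of the training points kept as support vectors.
        counts[(C, gamma)] = len(svm.support_)
        print("C={:<6} gamma={:<4} support vectors: {} of {}".format(
            C, gamma, counts[(C, gamma)], len(X)))
```

The exact counts depend on the data, so this is meant for exploring the trend rather than reproducing the figure.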

where is the "display" function?

display(mglearn.plots.plot_logisitic_regression_graph())

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-31-3a4bc5be286e> in <module>()
----> 1 display(mglearn.plots.plot_logisitic_regression_graph())

NameError: name 'display' is not defined

What should be imported to use the display?
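
According to the README's errata, the first print omits this import from the list of assumed imports. A minimal sketch; the fallback to print for sessions without IPython is my own addition:

```python
try:
    # Provided by IPython, which Jupyter installs.
    from IPython.display import display
except ImportError:
    # Fallback for plain Python sessions (an assumption of this sketch).
    display = print

display({"Name": ["John", "Anna"], "Age": [24, 13]})
```

Note that `from IPython import display` imports the module rather than the function, so calling it fails.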

reshape is no longer supported (page 246 / Chapter 4)

Cell 51 on page 246 (Chapter 4) will exit with an error for anybody running Pandas versions less than ~11 months old. This is because reshape was deprecated on Index objects for some reason.

Here is the Pandas change: pandas-dev/pandas@084ceae

The actual error is: "NotImplementedError: reshaping is not supported for Index objects"
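
A possible workaround is to convert the Index to a NumPy array before reshaping. A sketch on a synthetic DatetimeIndex standing in for citibike.index (assumed data); the seconds-since-epoch computation shown here is one option and not necessarily what the book's cell uses:

```python
import pandas as pd

# Synthetic stand-in for citibike.index (assumed data).
index = pd.date_range("2015-08-01", periods=5, freq="D")

# Seconds since the Unix epoch, computed on the Index itself.
seconds = (index - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")

# Convert to a NumPy array first, then reshape; Index objects reject reshape.
X = seconds.to_numpy().reshape(-1, 1)
print(X.shape)
```

The resulting X is a column vector of integers and can be passed to scikit-learn estimators as a single feature.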

README.md contains wrong Conda graphviz installation instructions

Installing packages with conda:

you specify:

$ conda install numpy scipy scikit-learn matplotlib pandas pillow graphviz graphviz-python

but graphviz-python does not exist and should be python-graphviz:

$ conda install numpy scipy scikit-learn matplotlib pandas pillow graphviz python-graphviz
