
python-machine-learning-book's Introduction

Python Machine Learning book code repository

Google Group


IMPORTANT NOTE (09/21/2017):

This GitHub repository contains the code examples of the 1st Edition of Python Machine Learning book. If you are looking for the code examples of the 2nd Edition, please refer to this repository instead.


What you can expect are 400 pages rich in useful material covering just about everything you need to know to get started with machine learning ... from theory to the actual code that you can directly put into action! This is not just another "this is how scikit-learn works" book. I aim to explain all the underlying concepts, tell you everything you need to know in terms of best practices and caveats, and we will put those concepts into action mainly using NumPy, scikit-learn, and Theano.

Not sure if this book is for you? Please check out the excerpts from the Foreword and Preface, or take a look at the FAQ section for further information.


1st edition, published September 23rd 2015
Paperback: 454 pages
Publisher: Packt Publishing
Language: English
ISBN-10: 1783555130
ISBN-13: 978-1783555130
Kindle ASIN: B00YSILNL0



German ISBN-13: 978-3958454224
Japanese ISBN-13: 978-4844380603
Italian ISBN-13: 978-8850333974
Chinese (traditional) ISBN-13: 978-9864341405
Chinese (mainland) ISBN-13: 978-7111558804
Korean ISBN-13: 979-1187497035
Russian ISBN-13: 978-5970604090

Table of Contents and Code Notebooks

Simply click on the ipynb/nbviewer links next to the chapter headlines to view the code examples (currently, the internal document links are only supported by the NbViewer version). Please note that these are just the code examples accompanying the book, which I uploaded for your convenience; be aware that these notebooks may not be useful without the formulae and descriptive text.


  1. Machine Learning - Giving Computers the Ability to Learn from Data [dir] [ipynb] [nbviewer]
  2. Training Machine Learning Algorithms for Classification [dir] [ipynb] [nbviewer]
  3. A Tour of Machine Learning Classifiers Using Scikit-Learn [dir] [ipynb] [nbviewer]
  4. Building Good Training Sets – Data Pre-Processing [dir] [ipynb] [nbviewer]
  5. Compressing Data via Dimensionality Reduction [dir] [ipynb] [nbviewer]
  6. Learning Best Practices for Model Evaluation and Hyperparameter Optimization [dir] [ipynb] [nbviewer]
  7. Combining Different Models for Ensemble Learning [dir] [ipynb] [nbviewer]
  8. Applying Machine Learning to Sentiment Analysis [dir] [ipynb] [nbviewer]
  9. Embedding a Machine Learning Model into a Web Application [dir] [ipynb] [nbviewer]
  10. Predicting Continuous Target Variables with Regression Analysis [dir] [ipynb] [nbviewer]
  11. Working with Unlabeled Data – Clustering Analysis [dir] [ipynb] [nbviewer]
  12. Training Artificial Neural Networks for Image Recognition [dir] [ipynb] [nbviewer]
  13. Parallelizing Neural Network Training via Theano [dir] [ipynb] [nbviewer]

Equation Reference

[PDF] [TEX]

Slides for Teaching

A big thanks to Dmitriy Dligach for sharing his slides from his machine learning course that is currently offered at Loyola University Chicago.

Additional Math and NumPy Resources

Some readers have asked about math and NumPy primers, since these were not included due to length limitations. I recently put together such resources for another book, and I have made these chapters freely available online in the hope that they also serve as helpful background material for this book:


Citing this Book

You are very welcome to re-use the code snippets or other content from this book in scientific publications and other works; in this case, I would appreciate citations to the original source:

BibTeX:

@Book{raschka2015python,
 author = {Raschka, Sebastian},
 title = {Python Machine Learning},
 publisher = {Packt Publishing},
 year = {2015},
 address = {Birmingham, UK},
 isbn = {1783555130}
 }

MLA:

Raschka, Sebastian. Python machine learning. Birmingham, UK: Packt Publishing, 2015. Print.



Sebastian Raschka’s new book, Python Machine Learning, has just been released. I got a chance to read a review copy and it’s just as I expected - really great! It’s well organized, super easy to follow, and it not only offers a good foundation for smart non-experts; practitioners will get some ideas and learn new tricks here as well.
– Lon Riesberg at Data Elixir

Superb job! Thus far, for me it seems to have hit the right balance of theory and practice…math and code!
– Brian Thomas

I've read (virtually) every Machine Learning title based around Scikit-learn and this is hands-down the best one out there.
– Jason Wolosonovich

The best book I've seen to come out of PACKT Publishing. This is a very well written introduction to machine learning with Python. As others have noted, a perfect mixture of theory and application.
– Josh D.

A book with a blend of qualities that is hard to come by: combines the needed mathematics to control the theory with the applied coding in Python. Also great to see it doesn't waste paper in giving a primer on Python as many other books do just to appeal to the greater audience. You can tell it's been written by knowledgeable writers and not just DIY geeks.
– Amazon Customer

Sebastian Raschka created an amazing machine learning tutorial which combines theory with practice. The book explains machine learning from a theoretical perspective and has tons of coded examples to show how you would actually use the machine learning technique. It can be read by a beginner or advanced programmer.

Longer reviews

If you need help to decide whether this book is for you, check out some of the "longer" reviews linked below. (If you wrote a review, please let me know, and I'd be happy to add it to the list).


Links

Translations



Bonus Notebooks (not in the book)


"Related Content" (not in the book)


SciPy 2016

We had such a great time at SciPy 2016 in Austin! It was a real pleasure to meet and chat with so many readers of my book. Thanks so much for all the nice words and feedback! And in case you missed it, Andreas Mueller and I gave an Introduction to Machine Learning with Scikit-learn; if you are interested, the video recordings of Part I and Part II are now online!

PyData Chicago 2016

I attempted the rather challenging task of introducing scikit-learn & machine learning in just 90 minutes at PyData Chicago 2016. The slides and tutorial material are available at "Learning scikit-learn -- An Introduction to Machine Learning in Python."


Note

I have set up a separate library, mlxtend, containing additional implementations of machine learning (and general "data science") algorithms. I also added implementations from this book (for example, the decision region plot, the artificial neural network, and sequential feature selection algorithms) with additional functionality.
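For readers who want to try it, a minimal sketch of mlxtend's decision-region plot (assuming a recent mlxtend, where the function lives under mlxtend.plotting):

from mlxtend.plotting import plot_decision_regions
from sklearn.datasets import load_iris
from sklearn.svm import SVC
import matplotlib.pyplot as plt

# fit any scikit-learn classifier on two features, then draw its regions
X, y = load_iris(return_X_y=True)
X = X[:, [0, 2]]                      # sepal length and petal length
clf = SVC(gamma='auto').fit(X, y)

plot_decision_regions(X, y, clf=clf, legend=2)
plt.xlabel('sepal length [cm]')
plt.ylabel('petal length [cm]')
plt.show()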




Dear readers,
first of all, I want to thank all of you for the great support! I am really happy about all the positive feedback you have sent me so far, and I am glad that the book has been so useful to a broad audience.

Over the last couple of months, I have received hundreds of emails, and I have tried to answer as many as possible in the time available. To make them useful to other readers as well, I collected many of my answers in the FAQ section (below).

In addition, some of you asked me about a platform for readers to discuss the contents of the book. I hope this provides an opportunity for you to discuss and share your knowledge with other readers:

(And I will try my best to answer questions myself if time allows! :))

The only thing to do with good advice is to pass it on. It is never of any use to oneself.
— Oscar Wilde


Examples and Applications by Readers

Once again, I have to say (big!) THANKS for all the nice feedback about the book. I've received many emails from readers who put the concepts and examples from this book out into the real world and made good use of them in their projects. In this section, I am starting to gather some of these great applications, and I'd be more than happy to add your project to this list -- just shoot me a quick mail!

FAQ

General Questions

Questions about the Machine Learning Field

Questions about ML Concepts and Statistics

Cost Functions and Optimization
Regression Analysis
Tree models
Model evaluation
Logistic Regression
Neural Networks and Deep Learning
Other Algorithms for Supervised Learning
Unsupervised Learning
Semi-Supervised Learning
Ensemble Methods
Preprocessing, Feature Selection and Extraction
Naive Bayes
Other
Programming Languages and Libraries for Data Science and Machine Learning

Questions about the Book

Contact

I am happy to answer questions! Just write me an email or consider asking the question on the Google Groups Email List.

If you are interested in keeping in touch, I have quite a lively Twitter stream (@rasbt) all about data science and machine learning. I also maintain a blog where I post all of the things I am particularly excited about.

python-machine-learning-book's People

Contributors

alexanderkunkel, bachmann1234, bikashdaga09, emptyr1, erikr, lxj616, mtietze, naereen, neerajsarwan, nipunsadvilkar, rasbt, shashankgroovy, sourcesoft, sumitbando, timgates42, timmartin19


python-machine-learning-book's Issues

Reinforcement Learning - Where Art Thou?

Wonderful book; learning a ton! Question: in the first chapter, you explain the three types of learning (supervised, unsupervised, and reinforcement). Usually the third is not listed. So I searched your text for other material on RL but found none. Future chapter in the next edition? Future book? Among your other resources, are there links about RL in a scikit-learn style? Love Karpathy's blog on "Pong from Pixels".

Python crashes when running Theano code

Hi

I am trying to run the following code from the book in a Jupyter notebook with everything updated. However, every time, Python crashes and the kernel restarts. Everything is fine before this point. Any thoughts?
P.S. Using 32-bit and GPU; tried dmatrix, no luck.
chapter 13

import numpy as np
import theano
import theano.tensor as T

x = T.dmatrix(name='x')
x_sum = T.sum(x, axis=0)
calc_sum = theano.function(inputs=[x], outputs=x_sum)
ary = [[1, 2, 3], [1, 2, 3]]
print('column sum:', calc_sum(ary))
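A hedged side note for anyone hitting the same crash: on a GPU setup, Theano is usually configured with floatX=float32, and mixing that with 64-bit dmatrix inputs is a common source of trouble. A minimal float32 variant of the same snippet (the configuration assumption is mine, not confirmed by the poster):

import numpy as np
import theano
import theano.tensor as T

# fmatrix and float32 data match a floatX=float32 (GPU) configuration
x = T.fmatrix(name='x')
x_sum = T.sum(x, axis=0)
calc_sum = theano.function(inputs=[x], outputs=x_sum)
ary = np.array([[1, 2, 3], [1, 2, 3]], dtype=np.float32)
print('column sum:', calc_sum(ary))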

the Breast Cancer Wisconsin dataset is not available

In chapter 6, the Breast Cancer Wisconsin dataset is not available now.
Maybe it is a broken link.

currently

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)

should be

df = pd.read_csv('http://mlr.cs.umass.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)

I'm sorry if I'm wrong.

Numpy Future Warning when using plot_decision_regions function

Sebastian,

I've been collecting my own data and have applied the plot_decision_regions function several times to my data but I am running into a problem with this new data. The problem is occurring here:

#plot class samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=cmap(idx),
                    marker=markers[idx], label=cl)

My enumerated object is: [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
So five classes, one-hot encoded.

From what I understand, this loop passes over my X_train_pca data five times and uses the boolean comparison y == cl to plot all my data points with five different colors as it passes through the markers and colormap.

Upon running, I get the warning:

FutureWarning: in the future, boolean array-likes will be handled as a boolean array index
plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],

The really weird part is the values in the array: X[y==cl, 0]
They now look like: [-0.4277726 -0.4277726 -0.44362509 ..., -0.4277726 -0.4277726 -0.4277726 ]
With shape (9784,) which is the original length of my X_train_pca data. (I believe it should be closer to about a fifth since most of my data is similar in length and I checked np.shape after the loop ran.)

To give a visual, my data looks like this:

[screenshot]

when it should be separated into colors with a spread looking like this:

[screenshot]

I can't really think through the problem anymore probably due to a misunderstanding of what this future warning is trying to tell me. I am wondering if you have any ideas as to what might cause this behavior.
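A hedged suggestion for warnings like this: the message usually means the mask passed to X[...] is not a proper 1-D boolean NumPy array, e.g. because y is a plain list, an object array, or one-hot encoded. Normalizing y before the loop is a sketch of a fix (names as in the snippet above):

import numpy as np

# Make sure y is a flat integer array so that (y == cl) yields a boolean
# mask of shape (n_samples,) that selects whole rows of X
y = np.asarray(y)
if y.ndim == 2:            # one-hot encoded labels
    y = y.argmax(axis=1)
y = y.ravel().astype(int)
mask = (y == cl)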

Kernel PCA [projecting new data points]

So you have this:

X, y = make_moons(n_samples=100, random_state=123)
alphas, lambdas =rbf_kernel_pca(X, gamma=15, n_components=1)

Then you take a sample from X:
x_new = X[25]

And then find the projection for the new sample from:

x_reproj = project_x(x_new, X, gamma=15, alphas=alphas, lambdas=lambdas)

But x_new was already a part of alphas and lambdas created using X. In other words, X already had x_new when the rbf_kernel_pca was applied. So should I be surprised that the projected value of x_new coincides exactly in the plots? I would have thought it might have been better to exclude x_new to derive alpha and lambda values and then apply project_x. Thoughts?
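For what it's worth, the poster's suggestion is easy to try. A minimal sketch, reusing rbf_kernel_pca and project_x from the chapter 5 code: fit on X without row 25, then project the held-out point:

import numpy as np

X_fit = np.delete(X, 25, axis=0)          # leave x_new out of the fit
alphas, lambdas = rbf_kernel_pca(X_fit, gamma=15, n_components=1)
x_proj = project_x(X[25], X_fit, gamma=15, alphas=alphas, lambdas=lambdas)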

Missing code in chapter 11

Hi, the code below is from the book but is missing from cell In [17] of ch11.ipynb.

row_clusters = linkage(df.values, method='complete', metric='euclidean')

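For context, a sketch of the complete cell as it appears in the book (assuming df as defined earlier in the chapter):

from scipy.cluster.hierarchy import linkage
import pandas as pd

row_clusters = linkage(df.values, method='complete', metric='euclidean')
pd.DataFrame(row_clusters,
             columns=['row label 1', 'row label 2',
                      'distance', 'no. of items in clust.'],
             index=['cluster %d' % (i + 1)
                    for i in range(row_clusters.shape[0])])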

Can't import using pickle (ch. 9)

When I try to read back the classifier on page 254, I get the following error. I have followed the book the whole way, and things worked fine until now. Any idea what has gone wrong?

I'm using IPython 4.2.0.

AttributeError                            Traceback (most recent call last)
<ipython-input-4-f050da95a5cf> in <module>()
----> 1 import codecs, os;__pyfile = codecs.open('''/var/folders/yh/mm1bdmx9073_b15lw69b2qmh0000gn/T/py71220g7y''', encoding='''utf-8''');__code = __pyfile.read().encode('''utf-8''');__pyfile.close();os.remove('''/var/folders/yh/mm1bdmx9073_b15lw69b2qmh0000gn/T/py71220g7y''');exec(compile(__code, '''/Users/henke/Documents/code/python/python-ml/movieclassifier/main.py''', 'exec'));

/Users/henke/Documents/code/python/python-ml/movieclassifier/main.py in <module>()
      4 from vectorizer import vect
      5 
----> 6 clf = pickle.load(open(os.path.join('pkl_objects', 'classifier.pkl'), 'rb'))
      7 
      8 import numpy as np

AttributeError: Can't get attribute 'tokenizer' on <module '__main__'>
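A hedged note on this error: pickle stored a reference to __main__.tokenizer, so the script that loads the classifier must have a function of exactly that name in its namespace before calling pickle.load. A minimal sketch (in the book's chapter 9 layout, the function lives in vectorizer.py):

import os
import pickle

# Re-import (or redefine) tokenizer so that __main__.tokenizer resolves;
# the body must match the one used when the classifier was pickled
from vectorizer import tokenizer

clf = pickle.load(open(os.path.join('pkl_objects', 'classifier.pkl'), 'rb'))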

typo in chapter 12

In this file, the code loads the names as

labels_path = os.path.join(path, 
                           '%s-labels-idx1-ubyte' % kind)
images_path = os.path.join(path, 
                           '%s-images-idx3-ubyte' % kind)

However the linked .gz file has the names with a period, not a hyphen. It should be

labels_path = os.path.join(path, 
                           '%s-labels.idx1-ubyte' % kind)
images_path = os.path.join(path, 
                           '%s-images.idx3-ubyte' % kind)

https://www.reddit.com/r/learnpython/comments/6qc9t1/path_to_existing_file_in_root_folder_not_found_on/

Notebook error

Opening the first chapter file ch01.ipynb results in the following error:

"Unreadable Notebook: /home/antonio/libro-machine-learning/ch01.ipynb NotJSONError("Notebook does not appear to be JSON: '\n\n\n\n\n\n\n<html lang...")"

Python version: 3.7 from the Anaconda distribution.

TypeError: can't multiply sequence by non-int of type 'float'

I am getting this error at the np.dot call for the Iris dataset. Can you explain the solution?

Following is the traceback:
Traceback (most recent call last):
File "Perceptron.py", line 61, in
ppn.train(x, y)
File "Perceptron.py", line 24, in train
update = self.eta * (target - self.predict(xi))
File "Perceptron.py", line 35, in predict
return np.where(self.net_input(X) >= 0.0, 1, -1)
File "Perceptron.py", line 32, in net_input
return np.dot(X, self.w_[1:]) + self.w_[0]

Confusion in chapter 2

In chapter 2 you have some code for a simple perceptron model.

On page 27, you describe the code.

the net_input method simply calculates the vector product w^T x

However, there is more than a simple vector product in the code:

def net_input(self, X):
    """Calculate net input"""
    return np.dot(X, self.w_[1:]) + self.w_[0]

In addition to the dot product, there is an addition. The text does not mention what this + self.w_[0] is.

Can you (or anyone) explain why that's there?

thanks,
-trevor
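For other readers: self.w_[0] is the bias (threshold) unit, so net_input computes z = w^T x + b rather than the bare dot product. A tiny numeric sketch of the equivalence with an explicit bias feature:

import numpy as np

w = np.array([0.5, 2.0, -1.0])   # w[0] is the bias, w[1:] the weights
x = np.array([3.0, 4.0])

z1 = np.dot(x, w[1:]) + w[0]     # the book's formulation: w^T x + bias
z2 = np.dot(np.r_[1.0, x], w)    # same value with a constant-1 feature prepended
print(z1, z2)                    # 2.5 2.5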

pip install SomePackage

Hello,

I am currently facing the attached issue.

Is this package only available to Linux users? I'm on Windows.

[screenshot: package installation error]

How to Use the code files

Hi, I am extremely new to Python, and I understand how to write basic commands and stuff.
I got the code files for the book, but I am not able to understand how to use them for learning.
All of them seem to be in text format.
How can I use them as code, i.e., make a new file that contains just the code instead of all the text?
I just want to see how the code runs, but I am unable to understand what this format is and how to extract the parts I want without having to remove all the quotation marks, \n, and other formatting elements.
Thanks.
[screenshot]
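A hedged pointer for questions like this one: the files are Jupyter notebooks (JSON documents), which is why they look like text with quotes and \n characters. nbconvert can strip one down to a plain Python script; a minimal sketch, assuming nbconvert is installed and a notebook named ch02.ipynb (the command-line equivalent is jupyter nbconvert --to script ch02.ipynb):

from nbconvert import PythonExporter

body, _ = PythonExporter().from_filename('ch02.ipynb')
with open('ch02.py', 'w') as f:
    f.write(body)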

Incorrect results printed for MLPGradientCheck

I was running MLPGradientCheck, but the results are different from the example.
When I tried to locate the problem, I found that a3 in MLPGradientCheck.fit is always "nan".
Is that normal? How can I fix it?

[screenshot]

Broken link in FAQ

In

General Questions section of FAQ

How do Data Scientists perform model selection? Is it different from Kaggle?

The web link is broken.

Thank you for the beautiful book.

Windows 10, ImportError: cannot import name 'plot_decision_regions'

Hello,

I was trying to execute the code:

%matplotlib inline
from sklearn.linear_model import LogisticRegression 
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
from mlxtend.evaluate import plot_decision_regions

iris = load_iris()
y, X = iris.target, iris.data[:, [0, 2]]  # only use 2 features
lr = LogisticRegression(C=100.0, 
                        class_weight=None, 
                        dual=False, 
                        fit_intercept=True,
                        intercept_scaling=1, 
                        max_iter=100, 
                        multi_class='multinomial', 
                        n_jobs=1,
                        penalty='l2', 
                        random_state=1, 
                        solver='newton-cg', 
                        tol=0.0001,
                        verbose=0, 
                        warm_start=False)
lr.fit(X, y)
plot_decision_regions(X=X, y=y, clf=lr, legend=2)
plt.xlabel('sepal length')
plt.ylabel('petal length')
plt.show()

but it returned following error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-2-9b78ac9a656a> in <module>()
      3 from sklearn.datasets import load_iris
      4 import matplotlib.pyplot as plt
----> 5 from mlxtend.evaluate import plot_decision_regions
      6 
      7 iris = load_iris()

ImportError: cannot import name 'plot_decision_regions'

I installed mlxtend package. What am I doing wrong? Could You help me? Thanks in advance!
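For readers hitting the same ImportError: in newer mlxtend releases (roughly 0.5 and later) the function moved out of mlxtend.evaluate, so updating the import is usually all that's needed:

# newer mlxtend versions expose the function under mlxtend.plotting
from mlxtend.plotting import plot_decision_regions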

Plotting iris data in Ch02 assumes that the data is in a particular order

In chapter 2, where the iris data is plotted on a scatterplot,

# extract sepal length and petal length
X = df.iloc[0:100, [0, 2]].values

# plot data
plt.scatter(X[:50, 0], X[:50, 1],
            color='red', marker='o', label='setosa')
plt.scatter(X[50:100, 0], X[50:100, 1],
            color='blue', marker='x', label='versicolor')

it is simply assumed that the first 50 rows belong to the label setosa, and the next 50 to versicolor. The scatterplot should be generated using the labels (which are in the 5th column of the dataset).
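A hedged sketch of a label-based version, assuming the UCI file's species strings ('Iris-setosa', 'Iris-versicolor') in column 4:

import matplotlib.pyplot as plt

y = df.iloc[0:100, 4].values
X = df.iloc[0:100, [0, 2]].values

# select rows by their actual label instead of relying on row order
for species, color, marker in [('Iris-setosa', 'red', 'o'),
                               ('Iris-versicolor', 'blue', 'x')]:
    mask = (y == species)
    plt.scatter(X[mask, 0], X[mask, 1],
                color=color, marker=marker, label=species)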

Errata update question

Regarding your remark:

[2015-10-20] Good news! I just heard back from the publisher; all the typos and errors which are listed below will be fixed by next week.

I bought the ebook yesterday (O'Reilly, not Packt) and found some errors. Up to now, they are in the errata (v2) but not yet fixed in my fresh copy. Can you say something about the current state? Are the updates for direct Packt customers only?

edit:
Interestingly, my copy passes the test on page viii (so I have "Classifiers" there), but not, for example, the one regarding the inverted 'y' variants (with and without caret) on page 22; the errors on the following page (p. 23) are also still present.

Couldn't get desired output from Adaptive linear neuron implementation

Hi,
I was trying out one of the examples in Chapter 2, under the title "Implementing an adaptive linear neuron in Python" (link to notebook).
The problem is that when I plot the decision boundaries, the whole area is shown red.
[screenshot]

When I change output = self.activation(X) to output = self.predict(X) inside the fit function, the problem seems to be gone.
[screenshot]

Is there an issue with the code or the code is correct and I made some other mistake while implementing?

Thanks
Sohaib

Chapter 8: Shuffling the DataFrame in newer versions of pandas

Just a note in case it's helpful to anyone else - I seemed to be getting 100% accuracy with the online sentiment analysis classifier (pages 246-246), but it turned out to be because the code used to shuffle the dataset before exporting it to CSV on page 235 hadn't worked.

In the version of pandas I'm using (0.23.4), it looks like df.index.values is needed in order to get the indexes of a DataFrame as a list. So, this:

df = df.reindex(np.random.permutation(df.index))

now needs to be this:

df = df.reindex(np.random.permutation(df.index.values))

Hope that helps someone!
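A version-agnostic alternative, for what it's worth, is to let pandas do the shuffling directly (df.sample has been available since pandas 0.16):

# shuffle all rows, then rebuild a clean 0..n-1 index
df = df.sample(frac=1, random_state=0).reset_index(drop=True)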

Chapter 2: confusion between perceptron code and SGD code

In the perceptron part of the code, I see:

for xi, target in zip(X, y):
  update = self.eta * (target - self.predict(xi))
  self.w_[1:] += update * xi
  self.w_[0] += update

In the SGD part, I see something similar, except that every time a new gradient point is calculated, the data is shuffled:

X, y = self._shuffle(X, y)
for xi, target in zip(X, y):
  cost.append(self._update_weights(xi, target))

def _update_weights(self, xi, target):
  """Apply Adaline learning rule to update the weights"""
  output = self.net_input(xi)
  error = (target - output)
  self.w_[1:] += self.eta * xi.dot(error)
  self.w_[0] += self.eta * error

I do not see any difference between the two except for the shuffling part and the fact that one uses a binary value and the other a real value (SGD). Did I misunderstand how the weights are fundamentally calculated for SGD vs. the simple perceptron model? Of course, if there were a mini-batch implementation, the code would have looked a lot more like the adaptive linear neuron's. But since you are taking sample by sample, they are implemented similarly?
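For other readers wondering the same thing: the difference hides in what is subtracted from the target. Written side by side (notation as in chapter 2):

Perceptron:   delta_w = eta * (y - y_hat) * x,   where y_hat = sign(w^T x) is quantized to {-1, +1}
Adaline SGD:  delta_w = eta * (y - w^T x) * x,   using the continuous net input

So the perceptron's error term can only take the values 0 and +-2, while Adaline's is a real number; that is why Adaline minimizes a smooth (sum-of-squared-errors) cost and the perceptron does not.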

IndexError: too many indices for array in CH-5 PCA Plot Code

I am having an issue executing the code for generating the PCA graph in Chapter 5 of the book Python Machine Learning. I tried debugging, but I am not able to understand what the problem in the code is.
[screenshot]
Kindly provide support for this issue.

ValueError: operands could not be broadcast together with shapes (400,2) (400,)

Dear concerned: I am extracting features from WAV files using PLP (Python 3.6, Anaconda Spyder). After executing, I am facing an error at this line:

File "C:\ProgramData\Anaconda3\lib\site-packages\sidekit\frontend\features.py", line 399, in power_spectrum
ahan = framed[start:stop, :] * window

ValueError: operands could not be broadcast together with shapes (400,2) (400,)

#!/usr/bin/python
import numpy.matlib
import scipy
import wave  # needed for wave.open() below
from scipy.fftpack.realtransforms import dct
from sidekit.frontend.vad import pre_emphasis
from sidekit.frontend.io import *
from sidekit.frontend.normfeat import *
from sidekit.frontend.features import *
import scipy.io.wavfile as wav
import numpy as np



def readWavFile(wav):
        #given a path from the keyboard to read a .wav file
        #wav = raw_input('Give me the path of the .wav file you want to read: ')
        inputWav = 'C:/Speech_Processing/2-Speech_Signal_Processing_and_Classification-master/feature_extraction_techniques'+wav
        return inputWav
#reading the .wav file (signal file) and extract the information we need
def initialize(inputWav):
        rate , signal  = wav.read(readWavFile(inputWav)) # returns a wave_read object , rate: sampling frequency
        sig = wave.open(readWavFile(inputWav))
        # signal is the numpy 2D array with the date of the .wav file
        # len(signal) number of samples
        sampwidth = sig.getsampwidth()
        print ('The sample rate of the audio is: ',rate)
        print ('Sampwidth: ',sampwidth)
        return signal ,  rate
def PLP():
        folder = input('Give the name of the folder that you want to read data: ')
        amount = input('Give the number of samples in the specific folder: ')
        for x in range(1,int(amount)+1):
                wav = '/'+folder+'/'+str(x)+'.wav'
                print (wav)
                #inputWav = readWavFile(wav)
                signal,rate = initialize(wav)
                #returns PLP coefficients for every frame
                plp_features = plp(signal,rasta=True)
                meanFeatures(plp_features[0])
#compute the mean features for one .wav file (take the features for every frame and make a mean for the sample)
def meanFeatures(plp_features):
        #make a numpy array with length the number of plp features
        mean_features=np.zeros(len(plp_features[0]))
        #for one input take the sum of all frames in a specific feature and divide them with the number of frames
        for x in range(len(plp_features)):
                for y in range(len(plp_features[x])):
                        mean_features[y]+=plp_features[x][y]
        mean_features = (mean_features / len(plp_features))
        print (mean_features)

def main():
        PLP()

main()
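A hedged guess at the cause: scipy.io.wavfile returns a 2-D array of shape (n_samples, 2) for stereo files, which cannot broadcast against sidekit's 1-D analysis window. Collapsing to mono right after reading (e.g. inside initialize()) is a sketch of a fix:

# stereo WAV -> (n_samples, 2); average the channels to get a 1-D signal
if signal.ndim == 2:
    signal = signal.mean(axis=1)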

SoftmaxRegression - zero_init_weight missing

Hello,
I think the function zero_init_weight is missing.
I searched the GitHub site but did not find it.
Maybe this is from another version of the softmax regressor, and it is missing here?

Best regards, Thomas

Typo/Clarification on Ch02.ipynb

There is an Additional Note (1) section where it says: " If all the weights are initialized to 0, only the scale of the weight vector, not the direction."

Seems there is some missing meaning in that sentence. Was wondering if you could correct it please. Thank you very much!

AttributeError: 'SGDClassifier' object has no attribute 'max_iter'

Hello! Thank you for this amazing gift to everyone!

My issue is with Chapter 9's movie_classifier_with_update via python app.py. I am able to enter my sample review and get predicted class label and probability. The issue arises when I click "Correct"/"Incorrect" for the classification.

It is almost assuredly due to the issue of versions of Python (3.5 needed) and Sklearn (0.19 needed) as indicated here: https://www.pythonanywhere.com/forums/topic/11716/

It'd be nice to keep this current though and I will send a PR if I ever figure out how to update it for Python 3.6 and Sklearn 0.20!


Error in chapter 8 code

def tokenizer(text):
    return text.split()

from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

from nltk.corpus import stopwords
stop = stopwords.words('english')

X_train = df.loc[:25000,'review'].values
y_train = df.loc[:25000,'sentiment'].values
X_test = df.loc[25000:,'review'].values
y_test = df.loc[25000:,'sentiment'].values
from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
if Version(sklearn_version) < '0.18':
    from sklearn.grid_search import GridSearchCV
else:
    from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)
gs_lr_tfidf.fit(X_train,y_train)

Hi,
I get an error saying "can't get attribute tokenizer_porter".
What do you think the problem is?
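A hedged note on this error: with n_jobs=-1, joblib must ship tokenizer and tokenizer_porter to worker processes, and functions defined interactively (e.g. in a notebook cell) may not be resolvable there, especially on Windows. Two common workarounds, sketched under that assumption:

# Option 1: avoid multiprocessing so nothing has to be pickled
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy', cv=5,
                           verbose=1, n_jobs=1)

# Option 2: move the tokenizers into an importable module, e.g. a
# hypothetical mytokenizers.py, and import them by name:
# from mytokenizers import tokenizer, tokenizer_porter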

Chapter 15: Padding modes figure

On page 500 (second edition: September 2017) there is a figure illustrating Full, Same and Valid padding and how the pixel patches map to the feature maps.

The feature map of the valid padding example is only 2x2. It specifies a 5x5 pixel input, 3x3 filter and a stride of 1. The feature map should be of size 3x3.
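For reference, the usual output-size formula for a convolution with input width n, filter width m, padding p, and stride s confirms this:

o = floor((n + 2p - m) / s) + 1 = floor((5 + 0 - 3) / 1) + 1 = 3

so valid padding (p = 0) on a 5x5 input with a 3x3 filter and stride 1 indeed yields a 3x3 feature map.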

Implementation of AdalineSGD

Hi,

First of all, thanks for your nice book, Python Machine Learning.

I have just started reading it, and I am wondering about one thing in the implementation of AdalineSGD mentioned in the book:

    def fit(self, X, y):
        
        self._initialize_weights(X.shape[1])
        self.cost_ = []
        for i in range(self.n_iter):
            if self.shuffle:
                X, y = self._shuffle(X, y)
            cost = []
            for xi, target in zip(X, y):
                cost.append(self._update_weights(xi, target))
            avg_cost = sum(cost) / len(y)
            self.cost_.append(avg_cost)
        return self

    def _update_weights(self, xi, target):
        """Apply Adaline learning rule to update the weights"""
        output = self.net_input(xi)
        error = (target - output)
        self.w_[1:] += self.eta * xi.dot(error)
        self.w_[0] += self.eta * error
        cost = 0.5 * error**2
        return cost        

I think the way self.w_[1:] is updated in AdalineSGD is in fact the same as in the implementation of the batch AdalineGD, just written differently:

            output = self.activation(X)
            errors = (y - output)
            self.w_[1:] += self.eta * X.T.dot(errors)

IMO, self.eta * X.T.dot(errors) operates on the entire matrix X in AdalineGD, whereas AdalineSGD operates row by row via the for loop (for xi, target in zip(X, y)) over the same X. It doesn't seem to reflect the essential difference between AdalineGD and AdalineSGD that you describe in the book.
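For other readers: the loop does make the two different. Batch GD computes one gradient from all n samples with the weights held fixed and applies a single update per epoch:

AdalineGD:   w <- w + eta * sum_i (y_i - w^T x_i) * x_i    (w fixed inside the sum)

whereas SGD updates after every sample, so each subsequent error term is evaluated at an already-updated weight vector:

AdalineSGD:  w_(i+1) = w_i + eta * (y_i - w_i^T x_i) * x_i,   i = 1, ..., n

The per-sample results therefore differ from the batch gradient except in the limit of a vanishing learning rate.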

Chapter: Combining weak to strong learners via random forests [sample size]

Via the sample size n of the bootstrap sample, we control the bias-variance tradeoff of the random forest. By choosing a larger value for n, we decrease the randomness and thus the forest is more likely to overfit. On the other hand, we can reduce the degree of overfitting by choosing smaller values for n at the expense of the model performance.

To me this implies that I should choose sample size n, that is smaller than N (original training set size).

In most implementations, including the RandomForestClassifier implementation in scikit-learn, the sample size of the bootstrap sample is chosen to be equal to the number of samples in the original training set, which usually provides a good bias-variance tradeoff

But the above got me confused: If we choose n = N, then aren't we overfitting unless the algorithm is bootstrapping aggressively - repeating the values many times over?
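A quick way to see that randomness survives n = N: each original point is left out of a given bootstrap sample with probability (1 - 1/N)^N, which approaches 1/e ≈ 0.368 for large N. A small simulation sketch:

import numpy as np

# draw a bootstrap sample of size N and count the unique points it contains
rng = np.random.RandomState(0)
N = 100000
sample = rng.randint(0, N, size=N)
print(1.0 - len(np.unique(sample)) / N)   # ~0.368 left out on average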

Chapter 2 (Rosenblatt Perceptron): "misclassifications per epoch" on p. 30 are misleading

First things first: I absolutely like how you motivate, introduce and implement the relevant concepts in your book.

I think there is a problem with the Rosenblatt perceptron learning description (evaluation) as presented in the figure on page 30 of the book. The errors counted in the variable errors are the number of updates performed in one epoch. However, this number does not represent the number of misclassifications after each epoch. For instance, if you use your standard options but train for only one iteration, there will be two updates ("2 errors" according to your terminology); however, all items will be classified as -1 (Setosas). Therefore, there are 50 misclassifications, and this classifier's error rate is actually 50%.

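A hedged sketch of the distinction the poster is drawing, using the chapter 2 names: errors_ counts weight updates per epoch, while the actual misclassification count requires evaluating predict() against the labels:

# number of points the trained perceptron currently gets wrong
misclassified = (ppn.predict(X) != y).sum()
print('misclassified: %d of %d' % (misclassified, len(y)))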

ValueError: operands could not be broadcast together with shapes (200,) (30000,)

I am working on a finite element code in Python. It is originally for the diffusion equation, but I want to modify it for the wave equation and include a Ricker source term. I tried adding the source term, and it produces an error. Below are the code and the error.

from IPython import display
from matplotlib.tri import Triangulation, LinearTriInterpolator

deltat = 0.001
numIterations = 30
mass = numpy.zeros((NPOINTS,NPOINTS))
stiffness = numpy.zeros((NPOINTS,NPOINTS))
phi = numpy.zeros((NPOINTS,))
phi_old = numpy.zeros((NPOINTS,))

f0= 5 # Center frequency Ricker-wavelet
q0= 100 # Maximum amplitude Ricker-Wavelet
t=np.arange(0,numIterations,deltat) # Time vector

tau = np.pi * f0 * (t - 1.5 / f0)
q = q0 * (1.0 - 2.0 * tau**2.0) * np.exp(-tau**2)

xi = np.linspace(0, L, 200)
yi = np.linspace(0, H, 200)
Xi, Yi = np.meshgrid(xi, yi)

updateMatrix(mass,stiffness,phi)
mat = mass/deltat + stiffness
triang = Triangulation(points[:,0], points[:,1])

for iteration in range(1, numIterations+1):
    phi_old = phi

    rhs = numpy.dot(mass/deltat, phi_old)
    rhs = rhs + q
    phi = numpy.linalg.solve(mat, rhs)

    interpolator = LinearTriInterpolator(triang, phi)
    zi = interpolator(Xi, Yi)
    fig1 = pylab.figure(1)
    pylab.imshow(zi)

fig2 = pylab.figure(2)
xanal, yanal = analytical(numIterations*deltat)
pylab.plot(xanal, yanal, "-")
pylab.plot(Xi[100,:], zi[100,:])
fig2.savefig("comparison.png", format="PNG")


ValueError                                Traceback (most recent call last)
<ipython-input> in <module>
     29
     30     rhs = numpy.dot(mass/deltat, phi_old)
---> 31     rhs = rhs + q
     32     phi = numpy.linalg.solve(mat,rhs)
     33     interpolator = LinearTriInterpolator(triang, phi)

ValueError: operands could not be broadcast together with shapes (200,) (30000,)
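A hedged reading of the error: q was built over the entire time vector (numIterations/deltat = 30000 entries), while rhs has one entry per mesh node (200). If the wavelet is meant to act as a point source, one fix is to evaluate it as a scalar at the current time step and add it to the forcing node(s); source_node below is an assumption about the mesh, not part of the original code:

import numpy as np

def ricker(t, f0=5.0, q0=100.0):
    """Ricker wavelet amplitude at a single time t."""
    tau = np.pi * f0 * (t - 1.5 / f0)
    return q0 * (1.0 - 2.0 * tau**2) * np.exp(-tau**2)

# inside the time loop:
#   rhs = numpy.dot(mass / deltat, phi_old)
#   rhs[source_node] += ricker(iteration * deltat)
#   phi = numpy.linalg.solve(mat, rhs)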

Question about In [29]

Dear sir,
I am trying to study machine learning through your book, Python Machine Learning, and it's a very nice book!
I can't understand how to set up param_grid.
I tried to get information from sklearn, but it just says "dict or list of dictionaries"; even the sample just writes param_grid=....
So, about param_grid: how do I set it up?
I am sorry, my English is a little weak!
I hope I have managed to convey my question, and thank you very much!
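For readers with the same question, a minimal sketch in the style of chapter 6: param_grid is a dict (or a list of dicts) that maps '<pipeline step name>__<parameter>' to the candidate values, and GridSearchCV tries every combination within each dict:

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([('scl', StandardScaler()), ('clf', SVC())])

# two independent grids: a linear-kernel grid and an RBF grid with gamma
param_grid = [{'clf__C': [0.1, 1.0, 10.0], 'clf__kernel': ['linear']},
              {'clf__C': [0.1, 1.0, 10.0],
               'clf__gamma': [0.01, 0.1, 1.0], 'clf__kernel': ['rbf']}]

gs = GridSearchCV(pipe, param_grid, scoring='accuracy', cv=5)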

ch.06(Tuning hyperparameters via grid search)

I'm now learning machine learning using the Japanese translation of this book, and when I run this program, I always get stuck on the part using sklearn.svm.

When the program runs the part gs = gs.fit(X_train, y_train), it keeps showing the previous two graphs over and over. I don't know the reason, so please tell me what may be the cause.

My PC's specs:
Windows 10, Python 3.6.5, scikit-learn 0.19.1
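A hedged guess at the cause: on Windows, n_jobs > 1 makes joblib spawn new Python processes that re-import the running script, and without a main guard the whole script, plots included, re-executes in every worker. Wrapping the entry point usually stops the endless re-plotting:

if __name__ == '__main__':
    # keep all top-level work, including the grid search, in here
    gs = gs.fit(X_train, y_train)

Alternatively, setting n_jobs=1 avoids multiprocessing altogether.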

Performing hierarchical clustering on a distance matrix

I understood the concept of complete linkage; however, in the example you provided, I did not understand the values in the table with columns 'row label 1', 'row label 2', etc.

  1. For example, what do the numbers (0-7) under the first two columns ('row label _') represent?
  2. In the first step of creating clusters, when you just have points, how do you go about it? You attempted to explain that via the example, but if you could expand on it, I would really appreciate it.
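A hedged answer in SciPy's terms, since the book's table is just the linkage matrix: each row records one merge, and the first two columns are cluster indices. Indices below n (the number of original points) refer to original rows, while index n + i refers to the cluster formed in merge step i, which is why numbers larger than the point count appear:

from scipy.cluster.hierarchy import linkage
import numpy as np

# 5 points -> 4 merge steps; indices 0-4 are original points, and each
# merge i creates a new cluster with index 5 + i for later rows to reference
X = np.random.RandomState(123).random_sample((5, 3))
row_clusters = linkage(X, method='complete', metric='euclidean')
print(row_clusters)   # columns: [cluster a, cluster b, distance, n items merged]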

Add Code Examples

Just bought this book, and I can't find the source code for the examples in the book. I bought it on Amazon and went to the Packtpub page as suggested in the book, but even the zip I download from them is only a mirror of this repository. Just images, no code for the examples in the book. It's really annoying to have to type every single example by hand.
