
python-machine-learning-book's Introduction

Python Machine Learning book code repository

Google Group


IMPORTANT NOTE (09/21/2017):

This GitHub repository contains the code examples of the 1st Edition of Python Machine Learning book. If you are looking for the code examples of the 2nd Edition, please refer to this repository instead.


What you can expect are 400 pages rich in useful material covering just about everything you need to know to get started with machine learning ... from theory to the actual code that you can directly put into action! This is not just another "this is how scikit-learn works" book. I aim to explain all the underlying concepts, tell you everything you need to know in terms of best practices and caveats, and we will put those concepts into action mainly using NumPy, scikit-learn, and Theano.

Not sure if this book is for you? Please check out the excerpts from the Foreword and Preface, or take a look at the FAQ section for further information.


1st edition, published September 23rd 2015
Paperback: 454 pages
Publisher: Packt Publishing
Language: English
ISBN-10: 1783555130
ISBN-13: 978-1783555130
Kindle ASIN: B00YSILNL0



German ISBN-13: 978-3958454224
Japanese ISBN-13: 978-4844380603
Italian ISBN-13: 978-8850333974
Chinese (traditional) ISBN-13: 978-9864341405
Chinese (mainland) ISBN-13: 978-7111558804
Korean ISBN-13: 979-1187497035
Russian ISBN-13: 978-5970604090

Table of Contents and Code Notebooks

Simply click on the ipynb/nbviewer links next to the chapter headlines to view the code examples (currently, the internal document links are only supported by the NbViewer version). Please note that these are just the code examples accompanying the book, which I uploaded for your convenience; be aware that these notebooks may not be useful without the formulae and descriptive text.


  1. Machine Learning - Giving Computers the Ability to Learn from Data [dir] [ipynb] [nbviewer]
  2. Training Machine Learning Algorithms for Classification [dir] [ipynb] [nbviewer]
  3. A Tour of Machine Learning Classifiers Using Scikit-Learn [dir] [ipynb] [nbviewer]
  4. Building Good Training Sets – Data Pre-Processing [dir] [ipynb] [nbviewer]
  5. Compressing Data via Dimensionality Reduction [dir] [ipynb] [nbviewer]
  6. Learning Best Practices for Model Evaluation and Hyperparameter Optimization [dir] [ipynb] [nbviewer]
  7. Combining Different Models for Ensemble Learning [dir] [ipynb] [nbviewer]
  8. Applying Machine Learning to Sentiment Analysis [dir] [ipynb] [nbviewer]
  9. Embedding a Machine Learning Model into a Web Application [dir] [ipynb] [nbviewer]
  10. Predicting Continuous Target Variables with Regression Analysis [dir] [ipynb] [nbviewer]
  11. Working with Unlabeled Data – Clustering Analysis [dir] [ipynb] [nbviewer]
  12. Training Artificial Neural Networks for Image Recognition [dir] [ipynb] [nbviewer]
  13. Parallelizing Neural Network Training via Theano [dir] [ipynb] [nbviewer]

Equation Reference

[PDF] [TEX]

Slides for Teaching

A big thanks to Dmitriy Dligach for sharing his slides from his machine learning course that is currently offered at Loyola University Chicago.

Additional Math and NumPy Resources

Some readers have asked about math and NumPy primers, since these were not included due to length limitations. I recently put together such resources for another book, and I have made these chapters freely available online in the hope that they also serve as helpful background material for this book:


Citing this Book

You are very welcome to re-use the code snippets or other content from this book in scientific publications and other works; in this case, I would appreciate citations to the original source:

BibTeX:

@Book{raschka2015python,
 author = {Raschka, Sebastian},
 title = {Python Machine Learning},
 publisher = {Packt Publishing},
 year = {2015},
 address = {Birmingham, UK},
 isbn = {1783555130}
 }

MLA:

Raschka, Sebastian. Python machine learning. Birmingham, UK: Packt Publishing, 2015. Print.



Sebastian Raschka’s new book, Python Machine Learning, has just been released. I got a chance to read a review copy and it’s just as I expected - really great! It’s well organized, super easy to follow, and it not only offers a good foundation for smart non-experts; practitioners will get some ideas and learn new tricks here as well.
– Lon Riesberg at Data Elixir

Superb job! Thus far, for me it seems to have hit the right balance of theory and practice…math and code!
– Brian Thomas

I've read (virtually) every Machine Learning title based around Scikit-learn and this is hands-down the best one out there.
– Jason Wolosonovich

The best book I've seen to come out of PACKT Publishing. This is a very well written introduction to machine learning with Python. As others have noted, a perfect mixture of theory and application.
– Josh D.

A book with a blend of qualities that is hard to come by: combines the needed mathematics to control the theory with the applied coding in Python. Also great to see it doesn't waste paper in giving a primer on Python as many other books do just to appeal to the greater audience. You can tell it's been written by knowledgeable writers and not just DIY geeks.
– Amazon Customer

Sebastian Raschka created an amazing machine learning tutorial which combines theory with practice. The book explains machine learning from a theoretical perspective and has tons of coded examples to show how you would actually use the machine learning technique. It can be read by a beginner or advanced programmer.

Longer reviews

If you need help to decide whether this book is for you, check out some of the "longer" reviews linked below. (If you wrote a review, please let me know, and I'd be happy to add it to the list).


Links

Translations



Bonus Notebooks (not in the book)


"Related Content" (not in the book)


SciPy 2016

We had such a great time at SciPy 2016 in Austin! It was a real pleasure to meet and chat with so many readers of my book. Thanks so much for all the nice words and feedback! And in case you missed it, Andreas Mueller and I gave an Introduction to Machine Learning with Scikit-learn; if you are interested, the video recordings of Part I and Part II are now online!

PyData Chicago 2016

I attempted the rather challenging task of introducing scikit-learn & machine learning in just 90 minutes at PyData Chicago 2016. The slides and tutorial material are available at "Learning scikit-learn -- An Introduction to Machine Learning in Python."


Note

I have set up a separate library, mlxtend, containing additional implementations of machine learning (and general "data science") algorithms. I also added implementations from this book (for example, the decision region plot, the artificial neural network, and sequential feature selection algorithms) with additional functionality.
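For readers who want to try it, a minimal sketch of mlxtend's decision-region plot (assuming a recent mlxtend, where the function lives under mlxtend.plotting):

from mlxtend.plotting import plot_decision_regions
from sklearn.datasets import load_iris
from sklearn.svm import SVC
import matplotlib.pyplot as plt

# fit any scikit-learn classifier on two features, then draw its regions
X, y = load_iris(return_X_y=True)
X = X[:, [0, 2]]                      # sepal length and petal length
clf = SVC(gamma='auto').fit(X, y)

plot_decision_regions(X, y, clf=clf, legend=2)
plt.xlabel('sepal length [cm]')
plt.ylabel('petal length [cm]')
plt.show()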




Dear readers,
first of all, I want to thank all of you for the great support! I am really happy about all the positive feedback you have sent me so far, and I am glad that the book has been so useful to a broad audience.

Over the last couple of months, I have received hundreds of emails, and I have tried to answer as many as possible in the time available. To make them useful to other readers as well, I collected many of my answers in the FAQ section (below).

In addition, some of you asked me about a platform for readers to discuss the contents of the book. I hope this provides an opportunity for you to discuss and share your knowledge with other readers:

(And I will try my best to answer questions myself if time allows! :))

The only thing to do with good advice is to pass it on. It is never of any use to oneself.
— Oscar Wilde


Examples and Applications by Readers

Once again, I have to say (big!) THANKS for all the nice feedback about the book. I've received many emails from readers who put the concepts and examples from this book out into the real world and made good use of them in their projects. In this section, I am starting to gather some of these great applications, and I'd be more than happy to add your project to this list -- just shoot me a quick mail!

FAQ

General Questions

Questions about the Machine Learning Field

Questions about ML Concepts and Statistics

Cost Functions and Optimization
Regression Analysis
Tree models
Model evaluation
Logistic Regression
Neural Networks and Deep Learning
Other Algorithms for Supervised Learning
Unsupervised Learning
Semi-Supervised Learning
Ensemble Methods
Preprocessing, Feature Selection and Extraction
Naive Bayes
Other
Programming Languages and Libraries for Data Science and Machine Learning

Questions about the Book

Contact

I am happy to answer questions! Just write me an email or consider asking the question on the Google Groups Email List.

If you are interested in keeping in touch, I have quite a lively Twitter stream (@rasbt) all about data science and machine learning. I also maintain a blog where I post all of the things I am particularly excited about.

python-machine-learning-book's People

Contributors

alexanderkunkel, bachmann1234, bikashdaga09, emptyr1, erikr, lxj616, mtietze, naereen, neerajsarwan, nipunsadvilkar, rasbt, shashankgroovy, sourcesoft, sumitbando, timgates42, timmartin19


python-machine-learning-book's Issues

Reinforcement Learning - Where Art Thou?

Wonderful book; learning a ton! Question: in the first chapter, you explain the three types of learning (supervised, unsupervised, and reinforcement). Usually the third is not listed. So I searched your text for other material on RL but found none. Future chapter in the next edition? Future book? Among your other resources, are there links about RL in a scikit-learn style? Love Karpathy's blog on "Pong from Pixels".

Python crashes when running Theano code

Hi

I am trying to run the following code from the book in a Jupyter notebook with everything updated. However, every time, Python crashes and the kernel restarts. Everything is fine before this point. Any thoughts?
P.S. Using 32-bit and GPU; tried dmatrix, no luck.
chapter 13

import numpy as np
import theano
import theano.tensor as T

x = T.dmatrix(name='x')
x_sum = T.sum(x, axis=0)
calc_sum = theano.function(inputs=[x], outputs=x_sum)
ary = [[1, 2, 3], [1, 2, 3]]
print('column sum:', calc_sum(ary))
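A hedged side note for anyone hitting the same crash: on a GPU setup, Theano is usually configured with floatX=float32, and mixing that with 64-bit dmatrix inputs is a common source of trouble. A minimal float32 variant of the same snippet (the configuration assumption is mine, not confirmed by the poster):

import numpy as np
import theano
import theano.tensor as T

# fmatrix and float32 data match a floatX=float32 (GPU) configuration
x = T.fmatrix(name='x')
x_sum = T.sum(x, axis=0)
calc_sum = theano.function(inputs=[x], outputs=x_sum)
ary = np.array([[1, 2, 3], [1, 2, 3]], dtype=np.float32)
print('column sum:', calc_sum(ary))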

the Breast Cancer Wisconsin dataset is not available

In chapter 6, the Breast Cancer Wisconsin dataset is not available now.
Maybe it is a broken link.

currently

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)

should be

df = pd.read_csv('http://mlr.cs.umass.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)

I'm sorry if I'm wrong.

Numpy Future Warning when using plot_decision_regions function

Sebastian,

I've been collecting my own data and have applied the plot_decision_regions function several times to my data but I am running into a problem with this new data. The problem is occurring here:

#plot class samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=cmap(idx),
                    marker=markers[idx], label=cl)

My enumerated object is: [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
So five classes, one-hot encoded.

From what I understand, this loop passes over my X_train_pca data five times and uses the boolean comparison y == cl to plot all my data points with five different colors as it passes through the markers and colormap.

Upon running, I get the warning:

FutureWarning: in the future, boolean array-likes will be handled as a boolean array index
plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],

The really weird part is the values in the array: X[y==cl, 0]
They now look like: [-0.4277726 -0.4277726 -0.44362509 ..., -0.4277726 -0.4277726 -0.4277726 ]
With shape (9784,) which is the original length of my X_train_pca data. (I believe it should be closer to about a fifth since most of my data is similar in length and I checked np.shape after the loop ran.)

To give a visual, my data looks like this:

[screenshot]

when it should be separated into colors with a spread looking like this:

[screenshot]

I can't really think through the problem anymore probably due to a misunderstanding of what this future warning is trying to tell me. I am wondering if you have any ideas as to what might cause this behavior.
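A hedged suggestion for warnings like this: the message usually means the mask passed to X[...] is not a proper 1-D boolean NumPy array, e.g. because y is a plain list, an object array, or one-hot encoded. Normalizing y before the loop is a sketch of a fix (names as in the snippet above):

import numpy as np

# Make sure y is a flat integer array so that (y == cl) yields a boolean
# mask of shape (n_samples,) that selects whole rows of X
y = np.asarray(y)
if y.ndim == 2:            # one-hot encoded labels
    y = y.argmax(axis=1)
y = y.ravel().astype(int)
mask = (y == cl)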

Kernel PCA [projecting new data points]

So you have this:

X, y = make_moons(n_samples=100, random_state=123)
alphas, lambdas =rbf_kernel_pca(X, gamma=15, n_components=1)

Then you take a sample from X:
x_new = X[25]

And then find the projection for the new sample from:

x_reproj = project_x(x_new, X, gamma=15, alphas=alphas, lambdas=lambdas)

But x_new was already a part of alphas and lambdas created using X. In other words, X already had x_new when the rbf_kernel_pca was applied. So should I be surprised that the projected value of x_new coincides exactly in the plots? I would have thought it might have been better to exclude x_new to derive alpha and lambda values and then apply project_x. Thoughts?
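For what it's worth, the poster's suggestion is easy to try. A minimal sketch, reusing rbf_kernel_pca and project_x from the chapter 5 code: fit on X without row 25, then project the held-out point:

import numpy as np

X_fit = np.delete(X, 25, axis=0)          # leave x_new out of the fit
alphas, lambdas = rbf_kernel_pca(X_fit, gamma=15, n_components=1)
x_proj = project_x(X[25], X_fit, gamma=15, alphas=alphas, lambdas=lambdas)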

Missing code in chapter 11

Hi, the code below is from the book but is missing from cell In [17] of ch11.ipynb.

row_clusters = linkage(df.values, method='complete', metric='euclidean')

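For context, a sketch of the complete cell as it appears in the book (assuming df as defined earlier in the chapter):

from scipy.cluster.hierarchy import linkage
import pandas as pd

row_clusters = linkage(df.values, method='complete', metric='euclidean')
pd.DataFrame(row_clusters,
             columns=['row label 1', 'row label 2',
                      'distance', 'no. of items in clust.'],
             index=['cluster %d' % (i + 1)
                    for i in range(row_clusters.shape[0])])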

Can't import using pickle (ch. 9)

When I try to read back the classifier on page 254, I get the following error. I have followed the book the whole way, and things worked fine until now. Any idea what has gone wrong?

I'm using IPython 4.2.0.

AttributeError                            Traceback (most recent call last)
<ipython-input-4-f050da95a5cf> in <module>()
----> 1 import codecs, os;__pyfile = codecs.open('''/var/folders/yh/mm1bdmx9073_b15lw69b2qmh0000gn/T/py71220g7y''', encoding='''utf-8''');__code = __pyfile.read().encode('''utf-8''');__pyfile.close();os.remove('''/var/folders/yh/mm1bdmx9073_b15lw69b2qmh0000gn/T/py71220g7y''');exec(compile(__code, '''/Users/henke/Documents/code/python/python-ml/movieclassifier/main.py''', 'exec'));

/Users/henke/Documents/code/python/python-ml/movieclassifier/main.py in <module>()
      4 from vectorizer import vect
      5 
----> 6 clf = pickle.load(open(os.path.join('pkl_objects', 'classifier.pkl'), 'rb'))
      7 
      8 import numpy as np

AttributeError: Can't get attribute 'tokenizer' on <module '__main__'>
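A hedged note on this error: pickle stored a reference to __main__.tokenizer, so the script that loads the classifier must have a function of exactly that name in its namespace before calling pickle.load. A minimal sketch (in the book's chapter 9 layout, the function lives in vectorizer.py):

import os
import pickle

# Re-import (or redefine) tokenizer so that __main__.tokenizer resolves;
# the body must match the one used when the classifier was pickled
from vectorizer import tokenizer

clf = pickle.load(open(os.path.join('pkl_objects', 'classifier.pkl'), 'rb'))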

typo in chapter 12

In this file, the code loads the names as

labels_path = os.path.join(path, 
                           '%s-labels-idx1-ubyte' % kind)
images_path = os.path.join(path, 
                           '%s-images-idx3-ubyte' % kind)

However the linked .gz file has the names with a period, not a hyphen. It should be

labels_path = os.path.join(path, 
                           '%s-labels.idx1-ubyte' % kind)
images_path = os.path.join(path, 
                           '%s-images.idx3-ubyte' % kind)

https://www.reddit.com/r/learnpython/comments/6qc9t1/path_to_existing_file_in_root_folder_not_found_on/

Notebook error

Opening the first chapter file ch01.ipynb results in the following error:

"Unreadable Notebook: /home/antonio/libro-machine-learning/ch01.ipynb NotJSONError("Notebook does not appear to be JSON: '\n\n\n\n\n\n\n<html lang...")"

Python version: 3.7 from the Anaconda distribution.

TypeError: can't multiply sequence by non-int of type 'float'

I am getting this error at the np.dot call for the Iris dataset. Can you explain the solution?

Following is the traceback:
Traceback (most recent call last):
File "Perceptron.py", line 61, in
ppn.train(x, y)
File "Perceptron.py", line 24, in train
update = self.eta * (target - self.predict(xi))
File "Perceptron.py", line 35, in predict
return np.where(self.net_input(X) >= 0.0, 1, -1)
File "Perceptron.py", line 32, in net_input
return np.dot(X, self.w_[1:]) + self.w_[0]

Confusion in chapter 2

In chapter 2 you have some code for a simple perceptron model.

On page 27, you describe the code.

the net_input method simply calculates the vector product w^T x

However, there is more than a simple vector product in the code:

def net_input(self, X):
    """Calculate net input"""
    return np.dot(X, self.w_[1:]) + self.w_[0]

In addition to the dot product, there is an addition. The text does not mention what this + self.w_[0] is.

Can you (or anyone) explain why that's there?

thanks,
-trevor
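For other readers: self.w_[0] is the bias (threshold) unit, so net_input computes z = w^T x + b rather than the bare dot product. A tiny numeric sketch of the equivalence with an explicit bias feature:

import numpy as np

w = np.array([0.5, 2.0, -1.0])   # w[0] is the bias, w[1:] the weights
x = np.array([3.0, 4.0])

z1 = np.dot(x, w[1:]) + w[0]     # the book's formulation: w^T x + bias
z2 = np.dot(np.r_[1.0, x], w)    # same value with a constant-1 feature prepended
print(z1, z2)                    # 2.5 2.5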

pip install SomePackage

Hello,

I am currently facing the attached issue.

Is this package only available to Linux users? I'm on Windows.

[screenshot: package installation error]

How to Use the code files

Hi, I am extremely new to Python, and I understand how to write basic commands and stuff.
I got the code files for the book, but I am not able to understand how to use them for learning.
All of them seem to be in text format.
How can I use them as code, i.e., make a new file that contains just the code instead of all the text?
I just want to see how the code runs, but I am unable to understand what this format is and how to extract the parts I want without having to remove all the quotation marks, \n, and other formatting elements.
Thanks.
[screenshot]
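A hedged pointer for questions like this one: the files are Jupyter notebooks (JSON documents), which is why they look like text with quotes and \n characters. nbconvert can strip one down to a plain Python script; a minimal sketch, assuming nbconvert is installed and a notebook named ch02.ipynb (the command-line equivalent is jupyter nbconvert --to script ch02.ipynb):

from nbconvert import PythonExporter

body, _ = PythonExporter().from_filename('ch02.ipynb')
with open('ch02.py', 'w') as f:
    f.write(body)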

Incorrect results printed for MLPGradientCheck

I was running MLPGradientCheck, but the results are different from the example.
When I tried to locate the problem, I found that a3 in MLPGradientCheck.fit is always "nan".
Is that normal? How can I fix it?

[screenshot]

Broken link in FAQ

In

General Questions section of FAQ

How do Data Scientists perform model selection? Is it different from Kaggle?

The web link is broken.

Thank you for the beautiful book.

Windows 10, ImportError: cannot import name 'plot_decision_regions'

Hello,

I was trying to execute the code:

%matplotlib inline
from sklearn.linear_model import LogisticRegression 
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
from mlxtend.evaluate import plot_decision_regions

iris = load_iris()
y, X = iris.target, iris.data[:, [0, 2]]  # only use 2 features
lr = LogisticRegression(C=100.0, 
                        class_weight=None, 
                        dual=False, 
                        fit_intercept=True,
                        intercept_scaling=1, 
                        max_iter=100, 
                        multi_class='multinomial', 
                        n_jobs=1,
                        penalty='l2', 
                        random_state=1, 
                        solver='newton-cg', 
                        tol=0.0001,
                        verbose=0, 
                        warm_start=False)
lr.fit(X, y)
plot_decision_regions(X=X, y=y, clf=lr, legend=2)
plt.xlabel('sepal length')
plt.ylabel('petal length')
plt.show()

but it returned following error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-2-9b78ac9a656a> in <module>()
      3 from sklearn.datasets import load_iris
      4 import matplotlib.pyplot as plt
----> 5 from mlxtend.evaluate import plot_decision_regions
      6 
      7 iris = load_iris()

ImportError: cannot import name 'plot_decision_regions'

I installed mlxtend package. What am I doing wrong? Could You help me? Thanks in advance!
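For readers hitting the same ImportError: in newer mlxtend releases (roughly 0.5 and later) the function moved out of mlxtend.evaluate, so updating the import is usually all that's needed:

# newer mlxtend versions expose the function under mlxtend.plotting
from mlxtend.plotting import plot_decision_regions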

Plotting iris data in Ch02 assumes that the data is in a particular order

In chapter 2, where the iris data is plotted on a scatterplot,

# extract sepal length and petal length
X = df.iloc[0:100, [0, 2]].values

# plot data
plt.scatter(X[:50, 0], X[:50, 1],
            color='red', marker='o', label='setosa')
plt.scatter(X[50:100, 0], X[50:100, 1],
            color='blue', marker='x', label='versicolor')

it is simply assumed that the first 50 rows belong to the label setosa, and the next 50 to versicolor. The scatterplot should be generated using the labels (which are in the 5th column of the dataset).
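A hedged sketch of a label-based version, assuming the UCI file's species strings ('Iris-setosa', 'Iris-versicolor') in column 4:

import matplotlib.pyplot as plt

y = df.iloc[0:100, 4].values
X = df.iloc[0:100, [0, 2]].values

# select rows by their actual label instead of relying on row order
for species, color, marker in [('Iris-setosa', 'red', 'o'),
                               ('Iris-versicolor', 'blue', 'x')]:
    mask = (y == species)
    plt.scatter(X[mask, 0], X[mask, 1],
                color=color, marker=marker, label=species)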

Errata update question

Regarding your remark:

[2015-10-20] Good news! I just heard back from the publisher; all the typos and errors which are listed below will be fixed by next week.

I bought the ebook yesterday (O'Reilly, not Packt) and found some errors. Up to now, they are in the errata (v2) but not yet fixed in my fresh copy. Can you say something about the current state? Are the updates for direct Packt customers only?

edit:
Interestingly, my copy passes the test on page viii (so I have "Classifiers" there), but not, for example, the one regarding the inverted 'y' variants (with and without caret) on page 22; the errors on the following page (p. 23) are also still present.

Couldn't get desired output from Adaptive linear neuron implementation

Hi,
I was trying out one of the examples in Chapter 2, under the title "Implementing an adaptive linear neuron in Python" (link to notebook).
The problem is that when I plot the decision boundaries, the whole area is shown red.
[screenshot]

When I change output = self.activation(X) to output = self.predict(X) inside the fit function, the problem seems to be gone.
[screenshot]

Is there an issue with the code or the code is correct and I made some other mistake while implementing?

Thanks
Sohaib

Chapter 8: Shuffling the DataFrame in newer versions of pandas

Just a note in case it's helpful to anyone else - I seemed to be getting 100% accuracy with the online sentiment analysis classifier (pages 246-246), but it turned out to be because the code used to shuffle the dataset before exporting it to CSV on page 235 hadn't worked.

In the version of pandas I'm using (0.23.4), it looks like df.index.values is needed in order to get the indexes of a DataFrame as a list. So, this:

df = df.reindex(np.random.permutation(df.index))

now needs to be this:

df = df.reindex(np.random.permutation(df.index.values))

Hope that helps someone!
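A version-agnostic alternative, for what it's worth, is to let pandas do the shuffling directly (df.sample has been available since pandas 0.16):

# shuffle all rows, then rebuild a clean 0..n-1 index
df = df.sample(frac=1, random_state=0).reset_index(drop=True)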

Chapter 2: confusion between perceptron code and SGD code

In the perceptron part of the code, I see:

for xi, target in zip(X, y):
  update = self.eta * (target - self.predict(xi))
  self.w_[1:] += update * xi
  self.w_[0] += update

In the SGD part, I see something similar, except that every time a new gradient point is calculated, the data is shuffled:

X, y = self._shuffle(X, y)
for xi, target in zip(X, y):
  cost.append(self._update_weights(xi, target))

def _update_weights(self, xi, target):
  """Apply Adaline learning rule to update the weights"""
  output = self.net_input(xi)
  error = (target - output)
  self.w_[1:] += self.eta * xi.dot(error)
  self.w_[0] += self.eta * error

I do not see any difference between the two except for the shuffling part and the fact that one uses a binary value and the other a real value (SGD). Did I misunderstand how the weights are fundamentally calculated for SGD vs. the simple perceptron model? Of course, if there were a mini-batch implementation, the code would have looked a lot more like the adaptive linear neuron's. But since you are taking sample by sample, they are implemented similarly?
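For other readers wondering the same thing: the difference hides in what is subtracted from the target. Written side by side (notation as in chapter 2):

Perceptron:   delta_w = eta * (y - y_hat) * x,   where y_hat = sign(w^T x) is quantized to {-1, +1}
Adaline SGD:  delta_w = eta * (y - w^T x) * x,   using the continuous net input

So the perceptron's error term can only take the values 0 and +-2, while Adaline's is a real number; that is why Adaline minimizes a smooth (sum-of-squared-errors) cost and the perceptron does not.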

IndexError: too many indices for array in CH-5 PCA Plot Code

I am having an issue executing the code for generating the PCA graph in Chapter 5 of the book Python Machine Learning. I tried debugging, but I am not able to understand what the problem in the code is.
[screenshot]
Kindly provide support for this issue.

ValueError: operands could not be broadcast together with shapes (400,2) (400,)

Dear concerned: I am extracting features from WAV files using PLP (Python 3.6, Anaconda Spyder). After executing, I am facing an error at this line:

File "C:\ProgramData\Anaconda3\lib\site-packages\sidekit\frontend\features.py", line 399, in power_spectrum
ahan = framed[start:stop, :] * window

ValueError: operands could not be broadcast together with shapes (400,2) (400,)

#!/usr/bin/python
import numpy.matlib
import scipy
import wave  # needed for wave.open() below
from scipy.fftpack.realtransforms import dct
from sidekit.frontend.vad import pre_emphasis
from sidekit.frontend.io import *
from sidekit.frontend.normfeat import *
from sidekit.frontend.features import *
import scipy.io.wavfile as wav
import numpy as np



def readWavFile(wav):
        #given a path from the keyboard to read a .wav file
        #wav = raw_input('Give me the path of the .wav file you want to read: ')
        inputWav = 'C:/Speech_Processing/2-Speech_Signal_Processing_and_Classification-master/feature_extraction_techniques'+wav
        return inputWav
#reading the .wav file (signal file) and extract the information we need
def initialize(inputWav):
        rate , signal  = wav.read(readWavFile(inputWav)) # returns a wave_read object , rate: sampling frequency
        sig = wave.open(readWavFile(inputWav))
        # signal is the numpy 2D array with the date of the .wav file
        # len(signal) number of samples
        sampwidth = sig.getsampwidth()
        print ('The sample rate of the audio is: ',rate)
        print ('Sampwidth: ',sampwidth)
        return signal ,  rate
def PLP():
        folder = input('Give the name of the folder that you want to read data: ')
        amount = input('Give the number of samples in the specific folder: ')
        for x in range(1,int(amount)+1):
                wav = '/'+folder+'/'+str(x)+'.wav'
                print (wav)
                #inputWav = readWavFile(wav)
                signal,rate = initialize(wav)
                #returns PLP coefficients for every frame
                plp_features = plp(signal,rasta=True)
                meanFeatures(plp_features[0])
#compute the mean features for one .wav file (take the features for every frame and make a mean for the sample)
def meanFeatures(plp_features):
        #make a numpy array with length the number of plp features
        mean_features=np.zeros(len(plp_features[0]))
        #for one input take the sum of all frames in a specific feature and divide them with the number of frames
        for x in range(len(plp_features)):
                for y in range(len(plp_features[x])):
                        mean_features[y]+=plp_features[x][y]
        mean_features = (mean_features / len(plp_features))
        print (mean_features)

def main():
        PLP()

main()
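A hedged guess at the cause: scipy.io.wavfile returns a 2-D array of shape (n_samples, 2) for stereo files, which cannot broadcast against sidekit's 1-D analysis window. Collapsing to mono right after reading (e.g. inside initialize()) is a sketch of a fix:

# stereo WAV -> (n_samples, 2); average the channels to get a 1-D signal
if signal.ndim == 2:
    signal = signal.mean(axis=1)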

SoftmaxRegression - zero_init_weight missing

Hello,
I think the function zero_init_weight is missing.
I searched the GitHub site but did not find it.
Maybe this is from another version of the softmax regressor, and it is missing here?

Best regards, Thomas

Typo/Clarification on Ch02.ipynb

There is an Additional Note (1) section where it says: " If all the weights are initialized to 0, only the scale of the weight vector, not the direction."

Seems there is some missing meaning in that sentence. Was wondering if you could correct it please. Thank you very much!

AttributeError: 'SGDClassifier' object has no attribute 'max_iter'

Hello! Thank you for this amazing gift to everyone!

My issue is with Chapter 9's movie_classifier_with_update via python app.py. I am able to enter my sample review and get predicted class label and probability. The issue arises when I click "Correct"/"Incorrect" for the classification.

It is almost assuredly due to the issue of versions of Python (3.5 needed) and Sklearn (0.19 needed) as indicated here: https://www.pythonanywhere.com/forums/topic/11716/

It'd be nice to keep this current though and I will send a PR if I ever figure out how to update it for Python 3.6 and Sklearn 0.20!


Error in chapter 8 code

def tokenizer(text):
    return text.split()

from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

from nltk.corpus import stopwords
stop = stopwords.words('english')

X_train = df.loc[:25000,'review'].values
y_train = df.loc[:25000,'sentiment'].values
X_test = df.loc[25000:,'review'].values
y_test = df.loc[25000:,'sentiment'].values
from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
if Version(sklearn_version) < '0.18':
    from sklearn.grid_search import GridSearchCV
else:
    from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)
gs_lr_tfidf.fit(X_train,y_train)

Hi,
I get an error saying "can't get attribute tokenizer_porter".
What do you think the problem is?
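A hedged note on this error: with n_jobs=-1, joblib must ship tokenizer and tokenizer_porter to worker processes, and functions defined interactively (e.g. in a notebook cell) may not be resolvable there, especially on Windows. Two common workarounds, sketched under that assumption:

# Option 1: avoid multiprocessing so nothing has to be pickled
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy', cv=5,
                           verbose=1, n_jobs=1)

# Option 2: move the tokenizers into an importable module, e.g. a
# hypothetical mytokenizers.py, and import them by name:
# from mytokenizers import tokenizer, tokenizer_porter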

Chapter 15: Padding modes figure

On page 500 (second edition: September 2017) there is a figure illustrating Full, Same and Valid padding and how the pixel patches map to the feature maps.

The feature map of the valid padding example is only 2x2. It specifies a 5x5 pixel input, 3x3 filter and a stride of 1. The feature map should be of size 3x3.
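For reference, the usual output-size formula for a convolution with input width n, filter width m, padding p, and stride s confirms this:

o = floor((n + 2p - m) / s) + 1 = floor((5 + 0 - 3) / 1) + 1 = 3

so valid padding (p = 0) on a 5x5 input with a 3x3 filter and stride 1 indeed yields a 3x3 feature map.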

Implementation of AdalineSGD

Hi,

First of all, thanks for your nice book, Python Machine Learning.

I have just started reading it, and I am wondering about one thing in the implementation of AdalineSGD mentioned in the book:

    def fit(self, X, y):
        
        self._initialize_weights(X.shape[1])
        self.cost_ = []
        for i in range(self.n_iter):
            if self.shuffle:
                X, y = self._shuffle(X, y)
            cost = []
            for xi, target in zip(X, y):
                cost.append(self._update_weights(xi, target))
            avg_cost = sum(cost) / len(y)
            self.cost_.append(avg_cost)
        return self

    def _update_weights(self, xi, target):
        """Apply Adaline learning rule to update the weights"""
        output = self.net_input(xi)
        error = (target - output)
        self.w_[1:] += self.eta * xi.dot(error)
        self.w_[0] += self.eta * error
        cost = 0.5 * error**2
        return cost        

I think the way self.w_[1:] is updated in AdalineSGD is in fact the same as in the implementation of the batch AdalineGD, just written differently:

            output = self.activation(X)
            errors = (y - output)
            self.w_[1:] += self.eta * X.T.dot(errors)

IMO, self.eta * X.T.dot(errors) operates on the entire matrix X in AdalineGD, whereas AdalineSGD operates row by row via the for loop (for xi, target in zip(X, y)) over the same X. It doesn't seem to reflect the essential difference between AdalineGD and AdalineSGD that you describe in the book.
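For other readers: the loop does make the two different. Batch GD computes one gradient from all n samples with the weights held fixed and applies a single update per epoch:

AdalineGD:   w <- w + eta * sum_i (y_i - w^T x_i) * x_i    (w fixed inside the sum)

whereas SGD updates after every sample, so each subsequent error term is evaluated at an already-updated weight vector:

AdalineSGD:  w_(i+1) = w_i + eta * (y_i - w_i^T x_i) * x_i,   i = 1, ..., n

The per-sample results therefore differ from the batch gradient except in the limit of a vanishing learning rate.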

Chapter: Combining weak to strong learners via random forests [sample size]

Via the sample size n of the bootstrap sample, we control the bias-variance tradeoff of the random forest. By choosing a larger value for n, we decrease the randomness and thus the forest is more likely to overfit. On the other hand, we can reduce the degree of overfitting by choosing smaller values for n at the expense of the model performance.

To me this implies that I should choose sample size n, that is smaller than N (original training set size).

In most implementations, including the RandomForestClassifier implementation in scikit-learn, the sample size of the bootstrap sample is chosen to be equal to the number of samples in the original training set, which usually provides a good bias-variance tradeoff

But the above got me confused: If we choose n = N, then aren't we overfitting unless the algorithm is bootstrapping aggressively - repeating the values many times over?
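A quick way to see that randomness survives n = N: each original point is left out of a given bootstrap sample with probability (1 - 1/N)^N, which approaches 1/e ≈ 0.368 for large N. A small simulation sketch:

import numpy as np

# draw a bootstrap sample of size N and count the unique points it contains
rng = np.random.RandomState(0)
N = 100000
sample = rng.randint(0, N, size=N)
print(1.0 - len(np.unique(sample)) / N)   # ~0.368 left out on average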

Chapter 2 (Rosenblatt Perceptron): "misclassifications per epoch" on p. 30 are misleading

First things first: I absolutely like how you motivate, introduce and implement the relevant concepts in your book.

I think there is a problem with the Rosenblatt perceptron learning description (evaluation) as presented in the figure on page 30 of the book. The errors counted in the variable errors are the number of updates performed in one epoch. However, this number does not represent the number of misclassifications after each epoch. For instance, if you use your standard options but train for only one iteration, there will be two updates ("2 errors" according to your terminology); however, all items will be classified as -1 (Setosas). Therefore, there are 50 misclassifications, and this classifier's error rate is actually 50%.

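A hedged sketch of the distinction the poster is drawing, using the chapter 2 names: errors_ counts weight updates per epoch, while the actual misclassification count requires evaluating predict() against the labels:

# number of points the trained perceptron currently gets wrong
misclassified = (ppn.predict(X) != y).sum()
print('misclassified: %d of %d' % (misclassified, len(y)))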

ValueError: operands could not be broadcast together with shapes (200,) (30000,)

I am working on a finite element code in Python. It is originally for the diffusion equation, but I want to modify it for the wave equation and include a Ricker source term. I tried adding the source term, and it produces an error. Below are the code and the error.

from IPython import display
from matplotlib.tri import Triangulation, LinearTriInterpolator

deltat = 0.001
numIterations = 30
mass = numpy.zeros((NPOINTS,NPOINTS))
stiffness = numpy.zeros((NPOINTS,NPOINTS))
phi = numpy.zeros((NPOINTS,))
phi_old = numpy.zeros((NPOINTS,))

f0= 5 # Center frequency Ricker-wavelet
q0= 100 # Maximum amplitude Ricker-Wavelet
t=np.arange(0,numIterations,deltat) # Time vector

tau = np.pi * f0 * (t - 1.5 / f0)
q = q0 * (1.0 - 2.0 * tau**2.0) * np.exp(-tau**2)

xi = np.linspace(0, L, 200)
yi = np.linspace(0, H, 200)
Xi, Yi = np.meshgrid(xi, yi)

updateMatrix(mass,stiffness,phi)
mat = mass/deltat + stiffness
triang = Triangulation(points[:,0], points[:,1])

for iteration in range(1, numIterations+1):
    phi_old = phi

    rhs = numpy.dot(mass/deltat, phi_old)
    rhs = rhs + q
    phi = numpy.linalg.solve(mat, rhs)

    interpolator = LinearTriInterpolator(triang, phi)
    zi = interpolator(Xi, Yi)
    fig1 = pylab.figure(1)
    pylab.imshow(zi)

fig2 = pylab.figure(2)
xanal, yanal = analytical(numIterations*deltat)
pylab.plot(xanal, yanal, "-")
pylab.plot(Xi[100,:], zi[100,:])
fig2.savefig("comparison.png", format="PNG")


ValueError                                Traceback (most recent call last)
<ipython-input> in <module>
     29
     30     rhs = numpy.dot(mass/deltat, phi_old)
---> 31     rhs = rhs + q
     32     phi = numpy.linalg.solve(mat,rhs)
     33     interpolator = LinearTriInterpolator(triang, phi)

ValueError: operands could not be broadcast together with shapes (200,) (30000,)
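A hedged reading of the error: q was built over the entire time vector (numIterations/deltat = 30000 entries), while rhs has one entry per mesh node (200). If the wavelet is meant to act as a point source, one fix is to evaluate it as a scalar at the current time step and add it to the forcing node(s); source_node below is an assumption about the mesh, not part of the original code:

import numpy as np

def ricker(t, f0=5.0, q0=100.0):
    """Ricker wavelet amplitude at a single time t."""
    tau = np.pi * f0 * (t - 1.5 / f0)
    return q0 * (1.0 - 2.0 * tau**2) * np.exp(-tau**2)

# inside the time loop:
#   rhs = numpy.dot(mass / deltat, phi_old)
#   rhs[source_node] += ricker(iteration * deltat)
#   phi = numpy.linalg.solve(mat, rhs)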

Question about In [29]

Dear sir,
I am trying to study machine learning through your book, Python Machine Learning, and it's a very nice book!
I can't understand how to set up param_grid.
I tried to get information from sklearn, but it just says "dict or list of dictionaries"; even the sample just writes param_grid=....
So, about param_grid: how do I set it up?
I am sorry, my English is a little weak!
I hope I have managed to convey my question, and thank you very much!
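For readers with the same question, a minimal sketch in the style of chapter 6: param_grid is a dict (or a list of dicts) that maps '<pipeline step name>__<parameter>' to the candidate values, and GridSearchCV tries every combination within each dict:

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([('scl', StandardScaler()), ('clf', SVC())])

# two independent grids: a linear-kernel grid and an RBF grid with gamma
param_grid = [{'clf__C': [0.1, 1.0, 10.0], 'clf__kernel': ['linear']},
              {'clf__C': [0.1, 1.0, 10.0],
               'clf__gamma': [0.01, 0.1, 1.0], 'clf__kernel': ['rbf']}]

gs = GridSearchCV(pipe, param_grid, scoring='accuracy', cv=5)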

ch.06(Tuning hyperparameters via grid search)

I'm now learning machine learning using the Japanese translation of this book, and when I run this program, I always get stuck on the part using sklearn.svm.

When the program runs the part gs = gs.fit(X_train, y_train), it keeps showing the previous two graphs over and over. I don't know the reason, so please tell me what may be the cause.

My PC's specs:
Windows 10, Python 3.6.5, scikit-learn 0.19.1
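A hedged guess at the cause: on Windows, n_jobs > 1 makes joblib spawn new Python processes that re-import the running script, and without a main guard the whole script, plots included, re-executes in every worker. Wrapping the entry point usually stops the endless re-plotting:

if __name__ == '__main__':
    # keep all top-level work, including the grid search, in here
    gs = gs.fit(X_train, y_train)

Alternatively, setting n_jobs=1 avoids multiprocessing altogether.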

Performing hierarchical clustering on a distance matrix

I understood the concept of complete linkage; however, in the example you provided, I did not understand the values in the table with columns 'row label 1', 'row label 2', etc.

  1. For example, what do the numbers (0-7) under the first two columns ('row label _') represent?
  2. In the first step of creating clusters, when you just have points, how do you go about it? You attempted to explain that via the example, but if you could expand on it, I would really appreciate it.
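A hedged answer in SciPy's terms, since the book's table is just the linkage matrix: each row records one merge, and the first two columns are cluster indices. Indices below n (the number of original points) refer to original rows, while index n + i refers to the cluster formed in merge step i, which is why numbers larger than the point count appear:

from scipy.cluster.hierarchy import linkage
import numpy as np

# 5 points -> 4 merge steps; indices 0-4 are original points, and each
# merge i creates a new cluster with index 5 + i for later rows to reference
X = np.random.RandomState(123).random_sample((5, 3))
row_clusters = linkage(X, method='complete', metric='euclidean')
print(row_clusters)   # columns: [cluster a, cluster b, distance, n items merged]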

Add Code Examples

Just bought this book, and I can't find the source code for the examples in the book. I bought it on Amazon and went to the Packtpub page as suggested in the book, but even the zip I download from them is only a mirror of this repository. Just images, no code for the examples in the book. It's really annoying to have to type every single example by hand.
