probml / pml-book
"Probabilistic Machine Learning" - a book series by Kevin Murphy
License: MIT License
Page 161
Equation 6.42
The definition for the forwards KL divergence in equation 6.42 on page 161 shows the reverse KL divergence (which is shown in equation 6.43).
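For reference, the two divergences in generic notation (a sketch; the book's own symbols may differ):

```latex
% Forward KL: expectation taken under p (the "true" distribution)
D_{\mathrm{KL}}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}
% Reverse KL: expectation taken under q (the approximation)
D_{\mathrm{KL}}(q \| p) = \sum_x q(x) \log \frac{q(x)}{p(x)}
```

The report above says eq 6.42 (labeled forward KL) actually shows the second form.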
Thanks for releasing the updated book :)
Text links them together. In eq 1.16 there is w5x1x2 term but in figure there is not. This is on page 9 of pdf.
page 11, section 1.3, 3rd paragraph, the text says:
"need to collect large labeled datasets for training, which can often be time consuming and expensive".
From my experience, a third, under-appreciated factor is the accuracy of the labels. This is especially true if managers cut corners by rushing the labeling and/or using cheap, inexperienced labor. Labeling inaccuracy is often not easily estimated. It is too easy to be blind-sided by label inaccuracy and end up with a poor model without realizing it until it's too late.
This problem can be magnified if some classes are vastly underrepresented.
Print Page 589, PDF Page 619: "A natural approach to transfer learning meta-learning"
Print Page 590, PDF Page 620: "N-way K-shot classification, in which the system is expected to learn to classify KN classes using just NK training examples of each class."
Based on Figure 19.7 and the example that follows, I think it should be rewritten as suggested (presumably: classify N classes using just K training examples of each class).
On Page 713, with section "Graph encoder network:"
"R^(N×N) × R^(N×D) → R^(D×L)" should be "R^(N×N) × R^(N×D) → R^(N×L)" or I did not get it fully.
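A quick shape check supports the R^(N×L) reading; a minimal sketch with arbitrary dimensions (the weight matrix W is hypothetical, introduced only to realize the map):

```python
import numpy as np

N, D, L = 5, 3, 4          # nodes, input features, output features (arbitrary)
A = np.random.rand(N, N)   # adjacency-like matrix, R^(N x N)
X = np.random.rand(N, D)   # node features, R^(N x D)
W = np.random.rand(D, L)   # hypothetical weight matrix, R^(D x L)

Z = A @ X @ W              # aggregate over neighbors, then project
print(Z.shape)             # (5, 4), i.e. R^(N x L): one embedding per node
```

So the output naturally lives in R^(N×L), one row per node, not R^(D×L).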
Page 334, Table 11.1 has "Poserior" instead of "Posterior".
On page 3, last sentence of first paragraph, this:
certain kinds of function
should be:
certain kinds of functions
Draft: December 31, 2020
Page 14, section 1.4.1
The problem of local maxima isn't mentioned.
E.g., in chess, a piece sacrifice could be the right first step in a winning combination, but if the reward function is weighted too heavily toward material advantage, the sacrifice path might never get explored. On the other hand, if the reward function is weighted too lightly toward material advantage, the RL agent would waste too much time exploring blunders.
Not mentioned is why logistic regression is better than linear regression for binary classification.
(This question does come up in interview questions, so I think it's worth mentioning....)
It can be argued that Gauss was the first person to do Machine Learning, since he developed Least Squares to predict an asteroid's orbit:
https://en.wikipedia.org/wiki/Least_squares#The_method
I think it would be nice to add a historical foot note to this achievement.
It may be useful to mention that LS has the advantage of having a closed-form solution (because the quadratic is differentiable), whereas L1 does not, which is why LS has been held in favor for so long despite having trouble with outliers.
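The closed-form point can be illustrated via the normal equations; a sketch with synthetic data (not the book's code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)

# Least squares has a closed-form solution: w = (X^T X)^{-1} X^T y.
# No iterative optimization is needed, unlike L1 (absolute error) loss.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # close to w_true
```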
release: December 31, 2020
print page number: 790
chapter: C.1.3.1
there is a minor typo on the second line:
instead of "or does not old"
should be "or does not hold"
Dec 28, 2020
PDF Pg 39, text Pg 9
eqn 1.15 should not contain \mathbf{w}^T
PDF Pg 135, text Pg 105
missing space: "theparameter space"
p559 (589 of PDF) 2020-01-03 Draft
3rd paragraph:
(For regression, this is a scalar; for classification, it can be the logits or
class probabilities.)
It could be clearer as to what "this" and "it" are. (I presume theta?)
Hi Kevin, I am sending my comments for Chapter 7 Bayesian statistics [Part 1]
cerno_comments_on_pmlv2_c7_part1.pdf
Thanks!
Cheers,
Peter
Hi, there is a minor typo on page 753 (line 4) in the release as of December 31, 2020:
A basis
Should be, of course, linearly :)
Thank you for uploading the book! It's incredible. How often are you going to reupload the corrected version of the PDF?
Page 9, section 1.2.2.2
"This is a simple example of feature preprocessing". I think it's worth mentioning that this is also called "feature engineering", which isn't mentioned anywhere in the book.
Also add the phrase "feature engineering" to the index.
Hi Kevin, I am sending my comments for Chapter 5 Optimization algorithms
cerno_comments_on_pmlv2_c5.pdf
Thanks!
Cheers,
Peter
The following class by Alfredo Canziani and Yann Le Cun has lots of good pytorch demos:
https://github.com/Atcold/pytorch-Deep-Learning
It would be useful to have JAX versions of these (using flax or haiku for the DNN DSL).
I know that NLL is defined on page 84, but I think it needs to be explained again in D.2.2 since appendices are supposed to be somewhat standalone.
A few points about Bayes stuff section 2.2 and 2.2.1, pp 21-23
I think you should mention the Frequentist approach to probability. One aspect is that Bayes will hazard an estimate for an event that has no prior examples, where a Frequentist wouldn't dare. This would be something like election results.
Sometimes the terms in the Bayes formulas are estimates, so there can be a lot of uncertainty in the answer. E.g., how do you get an accurate value for P(H=1)? This can be hard.
It would be nice to show the numeric values substituted in for the equations 2.6 and 2.9.
The thing that trips people up about Bayes is that differences in the prior probability can radically change the result. It would be good to recalculate the COVID example in 2.2.1 for different values of P(H=1). For instance, early in the COVID pandemic, P(H=1) might be 0.01 or 0.001. Then the probability that a positive test is a false positive becomes very large. This unintuitive result catches a lot of people by surprise. [added] I do see you show this in Exercise 2.1.
I calculate for p(H=1)=0.01 that the probability of infection drops to 26%. If p(H=1)=0.001 (the disease is very rare), then probability of infection drops to 3%.
This is true even though the test is highly accurate.
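The figures above can be reproduced with Bayes' rule; a minimal sketch, assuming the sensitivity (0.875) and specificity (0.975) from the book's example (these values reproduce the 26% and 3% quoted above):

```python
def p_infected_given_positive(prior, sens=0.875, spec=0.975):
    """Posterior P(H=1 | Y=1) via Bayes' rule."""
    num = sens * prior                    # P(Y=1|H=1) P(H=1)
    den = num + (1 - spec) * (1 - prior)  # + P(Y=1|H=0) P(H=0)
    return num / den

for prior in [0.1, 0.01, 0.001]:
    print(prior, round(p_infected_given_positive(prior), 3))
# prior 0.01  -> about 0.26
# prior 0.001 -> about 0.03
```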
In section 2.4.6, page 40
There is another reason as to why Bayes is still relevant.
I encountered this in my professional career: if the amount of data is huge (and the features aren't very predictive), Bayes can beat other forms of ML because it's much quicker to calculate the probabilities during evaluation. (This surprised me!)
This is also true for Bayes vs. Deep Learning which can be painfully slow, even with GPUs.
Here is a list of the 74 scripts used in vol 1 that need converting:
https://docs.google.com/spreadsheets/d/1y0dmDcQnPFXZi05RFo4-MhfXYs5yQIZZy0b9wPexhig/edit#gid=0
Here is a list of TF notebooks that need translation to JAX (using flax, or Haiku, or whatever)
https://docs.google.com/spreadsheets/d/1KBVhgiS6CtWdqNVvkXc3Fo7LhFP_0b0bxuYS_VSlHNE/edit?usp=sharing
"A second problem is that the magnitude of the fc’s scores are not calibrated with each other (see ??), so it is hard to compare them."
print page 392 pdf 422 "convolutional neural networks (CNN), which are designed to work with variable-sized and images;"
print 393 pdf 423 "However, suppose we replacing replace the Heaviside function"
print 400 pdf 430 "it can be used to model financial data, we as well as the global temperature of the earth"
p84, 114 of PDF, 2020-01-03 Draft, last sentence of the page:
minimizes the KL divergence
should have a reference to 8.1.6.1, p 241 (271 of PDF)
Also add the ref to KL divergence on p84 to the index
Your book does seem to use standard notation and symbols, so I'm not having trouble.
However, utter newbies will find it harder going without a table of the notations and symbols used in your book.
Edition: Dec 31, 2020
Print pages: 9-10
PDF pages: 39-40
§1.2.2.2 says:
In Fig. 1.7(b), we see that using D = 2 results in a much better fit.
However, the Fig. 1.7(b) image is a polynomial of degree 14, or at least that's what the title above it says. Assuming the figure images are arranged as intended, I think the section text meant to refer to 1.7(a), which is of degree 2, and is not currently referenced by the text.
Also, the caption under Fig. 1.7 reads, in part:
(a-b) Polynomial of degrees 1 and 14"
The polynomial in 1.7(a) is of degree 2 (and is labeled as such above the graph). I assume the caption should instead read:
(a-b) Polynomial of degrees 2 and 14
release: December 31, 2020
print page number: 794
chapter: C.2.2.3
there is typo on the third line:
instead of "such that p"
should be "such that Pr"
Hi Kevin, I am sending my comments for Chapter 6 Information theory
cerno_comments_on_pmlv2_c6.pdf
Thanks!
Cheers,
Peter
AI Ethics is mentioned in section 1.5.
One thing that is not mentioned, is the danger of (unconsciously) picking a biased training set (which is not mentioned in Section 1.2).
That is exemplified by Google's image recognition, which misidentified pictures of Black people as apes.
See:
https://www.theguardian.com/technology/2018/jan/12/google-racism-ban-gorilla-black-people
eg figure 1.4 refers to https://github.com/probml/pyprobml/blob/master/notebooks/iris_dtree.ipynb
but should instead refer to https://github.com/probml/pyprobml/blob/master/book1/trees/iris_dtree.ipynb
Book Name: Probabilistic Machine Learning: An Introduction
Book date stamp: 2020-12-28
pdf page number: 749
print page number: 719
In section A.2.1, first sentence, the last word should be "X" rather than "Y":
"""
A.2.1 Functions
A function f: X->Y ... for each x \in Y.
"""
Add RKHS section to ch 17
Not sure what version, but downloaded and printed around 1st Feb. Caveat: I didn't know much about this topic, so some suggestions are more to do with my understanding than anything being wrong. Some are really pedantic as well. Change or ignore as you see fit.
Section 5.5 boolean --> Boolean
Section 5.5.1.1 Here, you are minimizing \theta_1^2+\theta_2^2-1, but in the figure just \theta_1^2+\theta_2^2. I know the solution is the same, but it's nonetheless inconsistent.
Section 5.5.3 The description of how to convert to standard form could use some extra work. It's not made explicit where \mathbf{A} comes from (presumably an aggregation of equations 5.106 and 5.107, but it wouldn't do any harm to say that). Plus you do not explain how to make
Section 5.5.3.1 In the worse case --> In the worst-case scenario
Section 5.5.3.1 "There are various..." This sentence is a bit of a non-sequitur and should probably be connected to the previous sentence to identify that you are saying that there do exist methods that are more efficient than the Simplex method.
Section 5.5.4 "From the geometry of the problem..." Well, this is true, but what about when we are in 100-dimensional space and the geometry is not obvious?
Section 5.5.6 It took me a while to figure out that NLL meant negative log likelihood having just dropped into the book here.
Section 5.6 It wasn't clear to me why we scale by \eta
Equation 5.120 The function f() is not defined. Possibly you mean \mathcal{L}?
Figure 5.15 I might be mistaken but I think you are maximizing the function in this figure vs. minimizing in Figure 5.14 which is a bit confusing.
Equation 5.131 I did not understand why there is a factor of 1/2 on the RHS.
Equation 5.132 I think the first case should be \theta-\lambda. Something wrong anyway as this is not symmetric.
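For reference, the soft-thresholding operator is usually written symmetrically (a generic sketch, which may differ from the book's eq 5.132):

```latex
\mathrm{SoftThreshold}(\theta, \lambda) =
\begin{cases}
\theta + \lambda & \text{if } \theta < -\lambda \\
0                & \text{if } |\theta| \le \lambda \\
\theta - \lambda & \text{if } \theta > \lambda
\end{cases}
```

If the book's first case reads differently, the reviewer's symmetry objection above would indeed indicate a typo.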
Section 5.6.3 I thought the discussion of the straight-through estimator seemed a little out of place and broke the flow. Consider moving to end of section, shortening or dropping completely.
Realistically, I probably can't read the whole document with this level of detail, but if you know there are sections that are not well read (perhaps near the end of the book or new parts that you have added) then send me a message and I'll try to find time to focus on these.
Section 1.3, pp 11-13, does not discuss the problem of guessing how many clusters there are.
"then we might want to split the top right into (at least) two subclusters." (p 12, section 1.3.1).
This is (obviously) a hard problem. Maybe it's mentioned somewhere else in the book (if so, there should be a reference to it), but I haven't finished reading this very fine book yet!
12/30/2020 version.
Hello,
(Thankyou for the great resource)
I would like to point out the statement about the posterior distribution on page 22, print version (just before the COVID example). Since Bayes and the posterior are very important concepts for the whole book, this might be confusing for some people (like me),
It says
Multiplying the prior by the likelihood for each value of H, and then normalizing so the result
sums to one, gives the posterior distribution p(H = h|Y = y); this represents our new belief
state about the possible values of H
I may be completely wrong, but I am not sure how this is true. We multiply the likelihood by the prior for each value of H to get the marginal likelihood (the denominator, P(Y=y)), and then divide by this marginal likelihood to get the posterior p(H = h|Y = y). We don't "multiply by each value of H in the numerator (likelihood x prior) and then normalize":
Multiplying the prior by the "likelihood for each value of H", and then normalizing so the result
sums to one, gives us
I mean, for some specific h, to get the posterior for this h, i.e. p(H = h|Y = y), we do not need to multiply over all values of H before normalizing, am I right?
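The two descriptions are in fact equivalent; a sketch with hypothetical numbers may help. Computing the unnormalized product prior x likelihood for every h and then normalizing is the same as dividing each product by the marginal likelihood:

```python
import numpy as np

prior = np.array([0.9, 0.1])       # hypothetical p(H=h) for h in {0, 1}
likelihood = np.array([0.2, 0.8])  # hypothetical p(Y=y | H=h)

unnorm = prior * likelihood        # one product per value of h
marginal = unnorm.sum()            # p(Y=y), the normalizing constant
posterior = unnorm / marginal      # p(H=h | Y=y), sums to one

print(posterior)                   # [0.6923..., 0.3076...]
```

So "normalizing so the result sums to one" and "dividing by the marginal likelihood p(Y=y)" are the same operation.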
Section 6.1, 153 (183 of the PDF) Draft 2020-01-03
says:
The entropy of a probability distribution can be interpreted as a measure of uncertainty, or lack
of predictability, associated with a random variable drawn from a given distribution, as we discuss
below.
For myself, I am always amused by the irony that this lack of predictability (entropy) is also related to information. That is, something that is completely predictable has no information.
Since the chapter is entitled "Information Theory", I think how entropy is related to information deserves an explanation.
In fact, even though the chapter is about Information Theory, it doesn't describe much what information is. This will puzzle the newbie, who would go by the popular definition of "information" and not the very precise one you intend.
(perhaps it's better described in your second volume?)
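The irony can be made concrete; a sketch (not the book's code) comparing a deterministic and a uniform distribution over four outcomes:

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits, with 0 log 0 taken as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

print(entropy_bits([1.0, 0.0, 0.0, 0.0]))     # 0.0 bits: fully predictable
print(entropy_bits([0.25, 0.25, 0.25, 0.25])) # 2.0 bits: maximally uncertain
```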
P47, section 3.1
In the last sentence between equations 3.16 and 3.17
log(p=p/1 - p)
is a bit unclear.
It's obviously p/(1-p), but it did give me pause. (An easy fix in LaTeX.)
Fig 6.2a p154, pdf p184 Draft 2020-01-03
People who don't know much about DNA may not realize the bases are sequential horizontally and that the columns are to compare similarity. To clarify, I suggest changing (1st paragraph, 2nd sentence)
(e.g., from different species)
to
(e.g., each row is a sequence from a different species)
In addition, it would help if you had the sequence numbers at the bottom of the figure (but I'm not sure how to squeeze in 10, 11...)
a t a g c c g g t a c g g c a
t t a g c t g c a a c c g c a
t c a g c c a c t a g a g c a
a t a a c c g c g a c c g c a
t t a g c c g c t a a g g t a
t a a g c c t c g t a c g t a
t t a g c c g t t a c g g c c
a t a t c c g g t a c a g t a
a t a g c a g g t a c c g a a
a c a t c c g t g a c g g a a
1 2 3 4 5 6 7 8 9
You might even want to color the letters in 6.2a to match the colors in 6.2b
I confess I was confused by the text for Fig 6.2b:
The overall
vertical axis represents the information content of that location measured in bits (i.e., using log base 2).
Deterministic distributions (with an entropy of 0) have height 2, and uniform distributions (with an entropy
of 2) have height 0
I thought that an entropy of 0 had no information (completely deterministic), but you have the bits (height) being 2 (e.g., position 3 is always A). On the other hand, I do understand that a position being highly conserved (i.e., it doesn't change, so that the function of the motif is preserved through evolution) provides "information" about the importance of that position in the sequence. E.g., positions 3, 5, and 13 are critical (no variation), and positions 10 and 15 are highly important (mostly the same base). I do see that this is explained a bit more on the next page (155), first paragraph, where you mention that the height of the bar is 2-H_t. Maybe the confusion arises because information content isn't really defined in this chapter (as far as I can see).
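The height convention can be sanity-checked numerically; a sketch assuming a 4-letter DNA alphabet, where the bar height at position t is 2 - H_t:

```python
import numpy as np

def column_height(counts):
    """Information content (bits) of one alignment column: 2 - H_t."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()          # empirical base frequencies at this position
    p = p[p > 0]             # 0 log 0 taken as 0
    H = -(p * np.log2(p)).sum()
    return 2.0 - H

print(column_height([10, 0, 0, 0]))  # fully conserved column: H=0, height 2.0
print(column_height([5, 5, 0, 0]))   # two equally likely bases: H=1, height 1.0
```

So a deterministic column has zero entropy but maximal height (information content), which resolves the apparent contradiction.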
Finally, I note that Fig 6.2b is not color-blind friendly. Prof. Jerome Friedman redid the figures in later printings of Elements of Statistical Learning to account for that.
On page 7, last paragraph, there is a quote from Kant. It would be better to also mention which work it came from. It seems that various sources (articles on Poker) claim that this quote is from:
Critique of Pure Reason
But it doesn't appear to be here: https://www.gutenberg.org/files/4280/4280-h/4280-h.htm
So, I don't know where it really comes from or if it is indeed something that Kant said. I've searched these works (from their online PDFs) and couldn't find it:
From December 31, 2020 edition.
p.113, Fig 5.5: mistakes about the matrix
b = [-14, -6] (in your formalism you have + b.theta)
k(A) = 30.234 for A = [20 5; 5 2], and k(A) = 1.8541 for A = [20 5; 5 16]
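The corrected condition numbers can be checked numerically (a quick sketch, reading the 2x2 matrices row-wise as above; k here is the 2-norm condition number):

```python
import numpy as np

A1 = np.array([[20.0, 5.0], [5.0, 2.0]])
A2 = np.array([[20.0, 5.0], [5.0, 16.0]])

print(round(np.linalg.cond(A1), 3))  # about 30.234
print(round(np.linalg.cond(A2), 3))  # about 1.854
```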
on page 575, it says: "XGBoost assumes the user has preprocessed them into one-hot vectors"
This may change by the time you go to print. Categorical variables are being experimented with:
https://github.com/dmlc/xgboost/releases/tag/v1.3.0
Experimental support for direct splits with categorical features
Currently, XGBoost requires users to one-hot-encode categorical variables. This has adverse performance implications, as the creation of many dummy variables results into higher memory consumption and may require fitting deeper trees to achieve equivalent model accuracy.
The 1.3.0 release of XGBoost contains an experimental support for direct handling of categorical variables in test nodes. Each test node will have the condition of form feature_value \in match_set, where the match_set on the right hand side contains one or more matching categories. The matching categories in match_set represent the condition for traversing to the right child node. Currently, XGBoost will only generate categorical splits with only a single matching category ("one-vs-rest split"). In a future release, we plan to remove this restriction and produce splits with multiple matching categories in match_set.
The categorical split requires the use of JSON model serialization. The legacy binary serialization method cannot be used to save (persist) models with categorical splits.
Note. This feature is currently highly experimental. Use it at your own risk. See the detailed list of limitations at #5949.
In addition, the user doesn't have to explicitly use one-hot encoding for XGBoost (at least with the H2O.ai version). It gets converted behind the scenes:
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/xgboost.html
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/categorical_encoding.html
categorical_encoding: Specify one of the following encoding schemes for handling categorical features:
auto or AUTO: Allow the algorithm to decide. In XGBoost, the algorithm will automatically perform one_hot_internal encoding. (default)
one_hot_internal or OneHotInternal: On the fly N+1 new cols for categorical features with N levels
one_hot_explicit or OneHotExplicit: N+1 new columns for categorical features with N levels
binary or Binary: No more than 32 columns per categorical feature
label_encoder or LabelEncoder: Convert every enum into the integer of its index (for example, level 0 -> 0, level 1 -> 1, etc.)
sort_by_response or SortByResponse: Reorders the levels by the mean response (for example, the level with lowest response -> 0, the level with second-lowest response -> 1, etc.). This is useful, for example, when you have more levels than nbins_cats, and where the top level splits now have a chance at separating the data with a split.
enum_limited or EnumLimited: Automatically reduce categorical levels to the most prevalent ones during training and only keep the T (10) most frequent levels, and then internally do one hot encoding in the case of XGBoost.
I haven't gotten all the way through your book, so I'm not sure if you mention that one of the problems with one-hot encoding is that it causes an explosion of features, which can effectively dilute the data: the fact that the one-hot encoded values are mutually exclusive is not "seen" by many (all?) algorithms.
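The feature-explosion point can be illustrated with pandas (a sketch; the column names and levels are made up):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "red"],
                   "size":  ["S", "M", "L", "M"]})

encoded = pd.get_dummies(df)          # one new column per category level
print(df.shape, "->", encoded.shape)  # (4, 2) -> (4, 6): 2 features become 6
```

With high-cardinality categoricals the blow-up is much worse, and no algorithm is told that exactly one dummy per original feature is active.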
This file https://github.com/probml/pyprobml/blob/master/scripts/svm_classifer_2d.py
(which goes with Fig 17.19) is missing.