iamtrask / grokking-deep-learning
this repository accompanies the book "Grokking Deep Learning"
In chap8, the code of Batch gradient descent is confusing.
for j in range(iterations):
error, correct_cnt = (0.0, 0)
for i in range(int(len(images) / batch_size)):
batch_start, batch_end = ((i * batch_size),((i+1)*batch_size))
#...
for k in range(batch_size):
# ...
weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)
In short, I think the code should update the weights only x times, where x is the number of batches in each iteration, rather than n times, where n is the number of training samples.
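A minimal sketch of the update pattern this issue asks for, assuming a single linear layer and small random arrays as hypothetical stand-ins for the book's MNIST `images` and `labels`. The point it illustrates is only that the weight update sits in the batch loop, not in a per-sample loop:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the book's data (shapes chosen for illustration only).
images = rng.random((20, 4))
labels = rng.random((20, 3))
weights = 0.1 * rng.standard_normal((4, 3))
alpha, batch_size = 0.1, 5

n_batches = len(images) // batch_size
update_count = 0
for i in range(n_batches):
    batch_start, batch_end = i * batch_size, (i + 1) * batch_size
    layer_0 = images[batch_start:batch_end]
    layer_1 = layer_0.dot(weights)
    # Average the error signal over the samples in this batch.
    layer_1_delta = (labels[batch_start:batch_end] - layer_1) / batch_size
    # Exactly one weight update per batch, outside any per-sample loop.
    weights += alpha * layer_0.T.dot(layer_1_delta)
    update_count += 1

print(update_count)  # equals n_batches, not len(images)
```

Any per-sample work, such as counting correct predictions, can stay in an inner loop as long as the `weights +=` line is not inside it.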
In Chapter 8, the value of alpha in the dropout example is 0.005. In the batch gradient descent example, the text says that alpha is 20 times larger than before. 0.005 * 20 = 0.1
However, the value of alpha in the code example is 0.001.
I think it should be 0.1.
There seems to be a small mistake in the Predicting Movie Reviews code. Here is the code:
x,y = (input_dataset[i],target_dataset[i])
layer_1 = sigmoid(np.sum(weights_0_1[x],axis=0)) #embed + sigmoid
layer_2 = sigmoid(np.dot(layer_1,weights_1_2)) # linear + softmax
layer_2_delta = layer_2 - y # compare pred with truth
layer_1_delta = layer_2_delta.dot(weights_1_2.T) #backprop
weights_0_1[x] -= layer_1_delta * alpha
weights_1_2 -= np.outer(layer_1,layer_2_delta) * alpha
Error:
In the forward pass, the code applies the sigmoid activation function.
Therefore, when we calculate layer_1_delta, should we not multiply by the derivative of sigmoid?
My understanding is that either we should not apply the sigmoid function to layer_1 or, if we do apply it, we should multiply by its derivative in backprop.
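The question can be checked numerically. The sketch below is not the book's code: it assumes a single sigmoid output trained with binary cross-entropy (the issue does not state the loss), uses random values for the hidden pre-activation and `weights_1_2`, and compares the analytic gradient with a finite-difference estimate. `hidden_in`, `loss`, and `hidden_in_grad` are names introduced here for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
hidden_in = rng.standard_normal(4)    # summed embeddings, before the sigmoid
weights_1_2 = rng.standard_normal(4)  # hidden layer -> single output
y = 1.0                               # target label

def loss(h):
    layer_1 = sigmoid(h)
    layer_2 = sigmoid(layer_1.dot(weights_1_2))
    # Binary cross-entropy: an assumed loss, not stated in the issue.
    return -(y * np.log(layer_2) + (1 - y) * np.log(1 - layer_2))

# Analytic backprop.
layer_1 = sigmoid(hidden_in)
layer_2 = sigmoid(layer_1.dot(weights_1_2))
layer_2_delta = layer_2 - y                    # sigmoid + cross-entropy reduce to this
layer_1_delta = layer_2_delta * weights_1_2    # gradient w.r.t. layer_1 (after the sigmoid)
hidden_in_grad = layer_1_delta * layer_1 * (1 - layer_1)  # includes the sigmoid derivative

# Central finite differences w.r.t. the hidden pre-activation.
eps = 1e-6
numeric = np.array([
    (loss(hidden_in + eps * e) - loss(hidden_in - eps * e)) / (2 * eps)
    for e in np.eye(4)
])

print(np.allclose(hidden_in_grad, numeric, atol=1e-6))
print(np.allclose(layer_1_delta, numeric, atol=1e-6))
```

Under these assumptions, only the version that multiplies by `layer_1 * (1 - layer_1)` matches the numerical gradient with respect to the values that feed the hidden sigmoid.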
weight_deltas are calculated in this way:
[ [input[0] * delta[0], input[0] * delta[1], input[0] * delta[2]],
[input[1] * delta[0], input[1] * delta[1], input[1] * delta[2]],
[input[2] * delta[0], input[2] * delta[1], input[2] * delta[2]] ]
but should be transposed:
[ [input[0] * delta[0], input[1] * delta[0], input[2] * delta[0]],
[input[0] * delta[1], input[1] * delta[1], input[2] * delta[1]],
[input[0] * delta[2], input[1] * delta[2], input[2] * delta[2]] ]
otherwise weights are updated incorrectly.
Current code:
import numpy as np
def outer_prod(a, b):
# just a matrix of zeros
out = np.zeros((len(a), len(b)))
for i in range(len(a)):
for j in range(len(b)):
out[i][j] = a[i] * b[j]
return out
weight_deltas = outer_prod(input,delta)
PR should fix the issue: #22
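With three inputs and three outputs both layouts have the same shape, which makes the orientation easy to miss. A small sketch with two inputs and three outputs (hypothetical values, and assuming the weight matrix stores one row per output) makes the difference visible:

```python
import numpy as np

inputs = np.array([8.5, 0.65])        # 2 inputs (hypothetical values)
delta = np.array([0.1, -0.2, 0.3])    # 3 outputs (hypothetical values)
weights = np.zeros((3, 2))            # assumed layout: one row of weights per output

per_input = np.outer(inputs, delta)   # shape (2, 3): [i][j] = inputs[i] * delta[j]
per_output = np.outer(delta, inputs)  # shape (3, 2): [i][j] = delta[i] * inputs[j]

print(per_input.shape, per_output.shape, weights.shape)
```

Only `per_output` lines up element by element with a one-row-per-output weight matrix; `per_input` is its transpose.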
In the first cell, the variable weight_deltas is not defined.
The call happens here
for i in range(len(weights)):
weights[i] -= alpha * weight_deltas[i]
Hey, as far as I can see there are actually many mistakes in the book. I found several parts where I just got confused and had to guess or rely on trial and error. Please, Andrew, as you yourself are into ethical ML and AI: where is the place to report these things? Here there is seemingly no response. The attempt to write a very intuitive book about DL is wonderful, but it's not fair to spend money and time on something that someone else hasn't spent as much time on.
There is a reference to a variable that was never created in the code under the embedding layer.
Specifically
if(self.creation_op == "index_select"):
new_grad = np.zeros_like(self.creators[0].data)
indices_ = self.index_select_indices.data.flatten()
grad_ = grad.data.reshape(len(indices_), -1)
for i in range(len(indices_)):
new_grad[indices_[i]] += grad_[i]
self.creators[0].backward(Tensor(new_grad))
Nowhere else in the code is an index_select_indices variable created, to my knowledge, unless I am missing something.
Related code shows a reference to new, as shown below.
def index_select(self, indices):
if(self.autograd):
new = Tensor(self.data[indices.data],
autograd=True,
creators=[self],
creation_op="index_select")
new.index_select_indices = indices
return new
return Tensor(self.data[indices.data])
In illustration step number 5 - COMPARE + LEARN
if(e_up < e_up):
weight += lr
should be:
if(e_up < e_dn):
weight += lr
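To see the corrected comparison in context, here is a minimal hot-and-cold learning loop. It is a sketch rather than the book's listing: the starting weight, input, goal, and step size are assumed values, and `input_value` is used instead of `input` to avoid shadowing the built-in.

```python
def neural_network(input_value, weight):
    return input_value * weight

# Assumed values for illustration.
weight, input_value, goal = 0.5, 0.5, 0.8
lr = 0.001

for _ in range(2000):
    error = (neural_network(input_value, weight) - goal) ** 2
    e_up = (neural_network(input_value, weight + lr) - goal) ** 2
    e_dn = (neural_network(input_value, weight - lr) - goal) ** 2

    # Move in whichever direction lowers the error; compare e_up with e_dn.
    if e_up < e_dn and e_up < error:
        weight += lr
    elif e_dn < e_up and e_dn < error:
        weight -= lr

prediction = neural_network(input_value, weight)
print(prediction)
```

With `e_up < e_up` the upward branch can never fire, so a weight that starts below its target would never increase.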
The code for the CNN is pretty nifty, but with the hidden_size there seems to be a problem
hidden_size = ((input_rows - kernel_rows) *
(inpute_cols - kernel_cols)) * num_kernels
I think the code should change to
hidden_size = ((input_rows - kernel_rows) *
(inpute_cols - kernel_cols)) * num_labels
num_labels would signify the symmetry of the weights, as the hidden layer would not really care about the number of kernels, since that is handled with the n linear structs already.
See if any of these links make any sense for you to expand your coursework offerings a little bit:
huggingface/tokenizers#69
https://github.com/Kayzaks/HackingNeuralNetworks
I might be reading it incorrectly, but it looks like you don't apply the activation function to the final output layer? (should that be applied, in this context?)
On pg 183 (also chapter10 - Intro to convolutional Neural Networks - learning edges and corners.ipynb), I believe that
kernels -= alpha * k_update
should read
kernels += alpha * k_update
With an 8x8 image and a 3x3 kernel, we get a 6x6 output, since 8-3+1=6.
When using for row_start in range(layer_0.shape[1]-kernel_rows), it discards the last pixel in the row.
What do you think?
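The off-by-one can be checked in isolation. This sketch uses a hypothetical helper, `window_starts`, that lists the starting offsets a kernel visits along one axis, with and without the `+ 1`:

```python
def window_starts(size, kernel, include_last=True):
    """Starting offsets for sliding a kernel of width `kernel` over `size` pixels."""
    stop = size - kernel + 1 if include_last else size - kernel
    return list(range(stop))

with_plus_one = window_starts(8, 3)                         # range(8 - 3 + 1)
without_plus_one = window_starts(8, 3, include_last=False)  # range(8 - 3)

print(len(with_plus_one), len(without_plus_one))
# The last window in the first case covers pixels 5, 6 and 7.
print(with_plus_one[-1], with_plus_one[-1] + 3)
```

Without the `+ 1`, the final window never reaches the last row or column of the image.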
Code formatting in Chapter 11, section "Predicting Movie Reviews", is broken. I was not able to reproduce the results in the book, p. 196, and assume that there is some error in my loop structure. Could you update the formatting of the example?
Notably in chapter 8, the backpropagation through the activation function appears off: if you want the derivative of an activation function at a given input, σ'(x), shouldn't you use that input for the gradient instead of the output y = σ(x)?
Example: if you calculate
layer_1 = relu(np.dot(layer_0,weights_0_1))
in the forward direction, then propagating backward would require
layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(np.dot(layer_0,weights_0_1))
i.e. the input at the activation function, and not as suggested
layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
After all, applying relu2deriv(relu(x)) would yield (x>=0), the identity function, and not actually change anything.
The effects on training are not too big, but it does impact overfitting, the amount of loss and in fact some of the narrative.
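Whether it matters depends on how `relu2deriv` treats zero. The sketch below compares two assumed definitions, a strict one (`> 0`) and a loose one (`>= 0`), applied to the activation's output and to its input; neither is claimed to be the exact definition used in every chapter.

```python
import numpy as np

def relu(x):
    return (x > 0) * x

def relu2deriv_strict(v):
    return v > 0     # assumed variant: 1 only where the value is positive

def relu2deriv_loose(v):
    return v >= 0    # assumed variant: 1 where the value is zero or positive

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # pre-activation values

strict_on_output = relu2deriv_strict(relu(x))
strict_on_input = relu2deriv_strict(x)
loose_on_output = relu2deriv_loose(relu(x))
loose_on_input = relu2deriv_loose(x)

print(strict_on_output, strict_on_input)
print(loose_on_output, loose_on_input)
```

With the strict variant, the output and the input give the same mask. With the loose variant, every ReLU output satisfies `>= 0`, so the mask is all ones and the delta passes through unchanged, which is the effect described above.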
I tried this. My input size is 1, 1000 and my output size is 1, 100
import random as r
x = np.array(range(1000))
y = x * 12 + 15
y = y + np.random.randn(*y.shape)
x = x.reshape(-1 , 1)
y = y.reshape(-1 , 1)
data = Tensor(x, autograd=True)
target = Tensor(y, autograd=True)
w = list()
w.append(Tensor(np.random.rand(1,1), autograd=True))
w.append(Tensor(np.random.rand(1), autograd=True))
for i in range(10):
pred = data.mm(w[0]) + w[1].expand(0,1000)
loss = ((pred - target)*(pred - target)).sum(0)
loss.backward(Tensor(np.ones_like(loss.data)))
for w_ in w:
w_.data -= w_.grad.data * 0.1
w_.grad.data *= 0
print(loss)
OUTPUT
[4.20028134e+10]
[1.86120338e+26]
[8.24726275e+41]
[3.65448202e+57]
[1.61935411e+73]
[7.17559347e+88]
[3.17960978e+104]
[1.40893132e+120]
[6.2431795e+135]
[2.7664436e+151]
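One plausible reading of these numbers is that the step size is far too large for a summed loss over inputs that range up to 999. The sketch below reproduces that setting in plain NumPy, without the book's `Tensor` class, and compares it with a run that rescales the inputs and averages the loss. The helper `fit`, the starting values, and the rescaling are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1000, dtype=float)
y = x * 12 + 15 + rng.standard_normal(1000)

def fit(x, y, alpha, steps, reduce):
    """Gradient descent on w*x + b with squared error, reduced by `reduce`."""
    w, b = 0.5, 0.5
    losses = []
    for _ in range(steps):
        err = w * x + b - y
        losses.append(float(reduce(err ** 2)))
        grad_w = reduce(2 * err * x)
        grad_b = reduce(2 * err)
        w -= alpha * grad_w
        b -= alpha * grad_b
    return losses

# Same setting as the report: raw inputs, summed loss, alpha = 0.1.
summed = fit(x, y, alpha=0.1, steps=10, reduce=np.sum)

# Assumed remedy: inputs scaled to [0, 1] and the loss averaged over samples.
scaled = fit(x / x.max(), y, alpha=0.1, steps=200, reduce=np.mean)

print(summed[0], summed[-1])
print(scaled[0], scaled[-1])
```

In the first run the loss grows at every step, as in the output above; in the second it shrinks. A much smaller `alpha` on the raw data would be another way to keep the updates stable.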
Hey @iamtrask
In chapter 10, we were told to reuse weights, but I cannot comprehend how the network reused them.
The network looks the same as chapter 9's network. In what way are the weights being reused?
Is it because all the subregions of every image share the same kernel?
Hi, curious if there are any plans to release a 2nd edition?
There is an error in Gradient Descent Learning with Multiple Inputs: the code is not using weight_deltas=ele_mul(delta,input).
I am using:
for i in range(len(weights)):
weight_deltas=ele_mul(delta,input)
weights[i] -= alpha * weight_deltas[i]
This is from v12.
lr = 0.1
p_up = neural_network(input, weight + lr)
should be
lr = 0.01
p_up = neural_network(input, weight + lr)
in the third division of the example code.
In Chapter 12, why are we giving a previous layer's delta to the embedding layer?
embed_idx = sent[layer_idx]
embed[embed_idx] -= layers[layer_idx]['hidden_delta'] * alpha / float(len(sent))
shouldn't it be
embed[embed_idx] -= layer['hidden_delta'] * alpha / float(len(sent))
Also what is the need of updating embeddings of last word in the sentence which is being predicted by the network ?
I have the size len(vocab)=62. I think it is small for 512.
Running the first code snippet from master produces this error:
'weight_deltas' is not defined
Also, the function ele_mul is defined but never used.
It looks like in Chapter 15, section "Secure Aggregation", a deepcopy of the model is made both in the receiving function train_and_encrypt and by the caller:
def train_and_encrypt(model, input, target, pubkey):
new_model = train(copy.deepcopy(model), input, target, iterations=1)
Caller:
bob_encrypted_model = train_and_encrypt(copy.deepcopy(model),
bob[0], bob[1], public_key)
No big deal, but I thought I would mention it anyway.
I'm a 77-year-old beginner. I know Excel VBA, but not Python. On page 7 it instructs me to install Jupyter (jupyter.org) and NumPy (numpy.org). I tried doing this, but it gives me the instruction "pip install jupyterlab" and then "jupyter lab" to launch JupyterLab, then "pip install notebook" and "jupyter notebook" to launch the notebook. I'm assuming all of this is done in Python and I must first install Python. Any help will be appreciated.
bs is not defined and should probably be batch_size
def train(model, input_data, target_data, batch_size=500, iterations=5):
criterion = MSELoss()
optim = SGD(parameters=model.get_parameters(), alpha=0.01)
n_batches = int(len(input_data) / batch_size)
for iter in range(iterations):
iter_loss = 0
for b_i in range(n_batches):
# padding token should stay at 0
model.weight.data[w2i['<unk>']] *= 0
input = Tensor(input_data[b_i*bs:(b_i+1)*bs], autograd=True)
target = Tensor(target_data[b_i*bs:(b_i+1)*bs], autograd=True)
pred = model.forward(input).sum(1).sigmoid()
loss = criterion.forward(pred,target)
loss.backward()
optim.step()
iter_loss += loss.data[0] / bs
sys.stdout.write("\r\tLoss:" + str(iter_loss / (b_i+1)))
print()
return model
Hi,
Can you please add a LICENSE file to the repo?
Thank you
If I understand correctly, delta on page 81 in this code:
input = [toes[0],wlrec[0],nfans[0]]
pred = neural_network(input,weight)
error = (pred - true) ** 2
delta = pred - true # assuming this is the partial of L w.r.t. pred
weight_deltas = ele_mul(delta,input)
seems to represent the partial of the loss function w.r.t. pred. However, doing this out on paper reveals the actual partial is 2(pred-true). Am I misunderstanding what's happening?
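The factor of 2 can be confirmed numerically. This sketch uses assumed values for `pred`, `true`, `alpha`, and one input, and also shows that dropping the 2 is equivalent to using a learning rate twice as large:

```python
# Assumed values for illustration.
pred, true = 0.86, 1.0
alpha, input_value = 0.01, 8.5

def error_fn(p):
    return (p - true) ** 2

# Central finite difference of the error w.r.t. pred.
eps = 1e-6
numeric_partial = (error_fn(pred + eps) - error_fn(pred - eps)) / (2 * eps)

delta = pred - true  # what the quoted code calls delta

# Update step using delta, versus the full gradient with half the learning rate.
step_with_delta = alpha * delta * input_value
step_with_full_gradient = (alpha / 2) * (2 * delta) * input_value

print(numeric_partial, 2 * delta)
print(step_with_delta, step_with_full_gradient)
```

So the partial is indeed 2(pred - true); using `pred - true` alone only rescales the step, which the choice of `alpha` can absorb.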
It is not clear to me why we are passing an all-ones tensor when calling backward() on a tensor.
Section "Backpropagation in Code":
There is:
layer_2_delta = (walk_vs_stop[i:i+1] - layer_2)
I believe it should be:
layer_2_delta = (layer_2 - walk_vs_stop[i:i+1])
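Whether the sign of the delta is a bug depends on the sign of the weight update that follows it. A small sketch with hypothetical random values shows that `target - prediction` paired with `+=` gives the same weights as `prediction - target` paired with `-=`:

```python
import numpy as np

rng = np.random.default_rng(0)
layer_1 = rng.random((1, 4))             # hypothetical hidden activations
target = np.array([[1.0]])
weights_a = rng.standard_normal((4, 1))
weights_b = weights_a.copy()
alpha = 0.2

layer_2 = layer_1.dot(weights_a)         # same prediction for both conventions

# Convention A: delta = target - prediction, weights updated with +=
layer_2_delta_a = target - layer_2
weights_a += alpha * layer_1.T.dot(layer_2_delta_a)

# Convention B: delta = prediction - target, weights updated with -=
layer_2_delta_b = layer_2 - target
weights_b -= alpha * layer_1.T.dot(layer_2_delta_b)

print(np.allclose(weights_a, weights_b))
```

The two conventions only disagree when they are mixed, for example a `target - prediction` delta followed by `-=`.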
In chapter 8, after adding batch gradient descent, you missed a variable "images".
Hello, Andrew!
Thank you very much for your book! It helps me a lot on the way to change my profession.
I am having difficulty understanding the code in Chapter 6, in the "Putting it all Together" section. It was difficult for me to understand arrays within arrays and the many matrix transposition operations.
I've reproduced the code myself several times and tried using regular vectors. I only needed the transpose operation once, for the last calculation of weights_1.
I will provide the code below. I got the same result. The code seems easier to read to me. Tell me please, am I on the right track?
Or have I misunderstood something important, and is this bad code style?
I used constant values for the weights so that I could better understand the operation of the algorithm and so that I could duplicate the weights in the original algorithm and compare the results.
Sorry for my English.
Thank you in advance.
weights_1 = np.array([ [ -0.16595599, 0.44064899, -0.99977125, -0.39533485 ],
[ -0.70648822, -0.81532281, -0.62747958, -0.30887855 ],
[ -0.20646505, 0.07763347, -0.16161097, 0.370439 ] ] )
weights_2 = np.array([ -0.5910955, 0.75623487, -0.94522481, 0.34093502 ])
street_lights = np.array( [ [ 1, 0, 1 ],
[ 0, 1, 1 ],
[ 0, 0, 1 ],
[ 1, 1, 1 ]])
walk_vs_stop = np.array([1, 1, 0, 0])
for iteration in range(60):
sum_error = 0
for i in range(len(street_lights)):
layer_0 = street_lights[i]
layer_1 = relu(np.dot(layer_0, weights_1))
layer_2 = np.dot(layer_1, weights_2)
sum_error += (layer_2 - walk_vs_stop[i]) ** 2
delta_2 = layer_2 - walk_vs_stop[i]
delta_1 = np.dot(delta_2, weights_2) * relu2deriv(layer_1)
weights_2 -= alpha * layer_1.dot(delta_2)
weights_1 -= alpha * np.dot(np.array([layer_0]).T, np.array([delta_1]))
print(sum_error)
Would it be ok if I did a re-implementation in another programming language of the code in the book ?
I don't see a licence, so i'm not quite sure if that would be ok. I would obviously reference and link to this repo and the book itself.
Thank you for your time !
Hello,
I have finished the Chapter 4. But I have a question regarding weight_delta. It has a value of delta * input.
In the book it says that weight_delta is the derivative of the error, right ?
On page 60 for example, the error is error = ((0.5*weight ) - 0.8) ** 2
When I give this error function to Wolfram Alpha, it gives me the derivative of 0.5 * x - 0.8 (where x = weight).
So, in general, the derivative of error should be input * weight - goal_pred.
So, why do they use delta * input for weight_delta if weight_delta is the derivative?
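Writing the derivative out with the chain rule may help here. The derivation below uses only the quantities from the question (input, weight $w$, goal) and treats delta as input times weight minus goal:

```latex
\begin{align*}
\text{error}(w) &= (\text{input}\cdot w - \text{goal})^2 \\
\frac{d\,\text{error}}{dw}
  &= 2\,(\text{input}\cdot w - \text{goal})\cdot\text{input}
   = 2\cdot\text{delta}\cdot\text{input} \\
\text{with } \text{input}=0.5,\ \text{goal}=0.8:\quad
\frac{d\,\text{error}}{dw}
  &= 2\,(0.5\,w - 0.8)\cdot 0.5 = 0.5\,w - 0.8
\end{align*}
```

The Wolfram Alpha result is this special case: with input = 0.5 the factor 2 times input happens to equal 1, so the derivative looks like delta alone. In general the input factor remains, and delta * input differs from the full derivative only by the constant 2, which the learning rate can absorb.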
In Chapter 3, in section Predicting with Multiple Inputs & Outputs code,
def vect_mat_mul(vect,matrix):
assert(len(vect) == len(matrix))
the aim is to check that the number of inputs is equal to the number of weights for every output. But in fact len(matrix) corresponds to the number of outputs, not the number of weights. To check the number of weights, we should write something like:
def vect_mat_mul(vect,matrix):
assert(len(vect) == len(matrix[0]))
for example
for i in range(len(vect)):
output[i] = w_sum(vect,matrix[i])
it should be:
for i in range(len(matrix)):
output[i] = w_sum(vect,matrix[i])
because we perform the weighted sum for each output, not for each input.
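A non-square example makes both points concrete. The sketch below is a possible corrected `vect_mat_mul`, assuming one row of weights per output; the test values are hypothetical.

```python
def w_sum(a, b):
    assert len(a) == len(b)
    return sum(x * y for x, y in zip(a, b))

def vect_mat_mul(vect, matrix):
    # Each row holds the weights for one output, so compare against the row length.
    assert len(vect) == len(matrix[0])
    output = [0] * len(matrix)
    # One weighted sum per output (per row), not per input.
    for i in range(len(matrix)):
        output[i] = w_sum(vect, matrix[i])
    return output

vect = [1, 2, 3]            # 3 inputs
matrix = [[1, 0, 0],        # weights for output 0
          [0, 1, 1]]        # weights for output 1
result = vect_mat_mul(vect, matrix)
print(result)
```

With three inputs and two outputs, the original `len(vect) == len(matrix)` assertion would fail even though the shapes are compatible, and looping over `len(vect)` would index past the last row.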
Hi, I don't understand the following code in Chapter 8. Why does it turn the original labels, an array of shape (1000,), into a (1000,10) two-dimensional array? What is the purpose of doing that?
Can someone shed some light on it? Thanks a lot.
images, labels = (x_train[0:1000].reshape(1000,28*28) / 255, y_train[0:1000])
one_hot_labels = np.zeros((len(labels),10))
for i,l in enumerate(labels):
one_hot_labels[i][l] = 1
labels = one_hot_labels
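The loop is building a one-hot encoding: each digit label becomes a row of ten values with a single 1 at the label's index, so it has the same shape as a ten-unit output layer and can be subtracted from it. A small sketch with three hypothetical labels, plus an equivalent NumPy one-liner:

```python
import numpy as np

labels = np.array([3, 0, 9])  # hypothetical digit labels

# Same loop as in the quoted code.
one_hot_labels = np.zeros((len(labels), 10))
for i, l in enumerate(labels):
    one_hot_labels[i][l] = 1

# Equivalent: pick rows of the 10x10 identity matrix.
one_hot_via_eye = np.eye(10)[labels]

print(one_hot_labels.shape)
print(one_hot_labels[0])
# argmax along each row recovers the original labels.
print(one_hot_labels.argmax(axis=1))
```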
I've been reading the book and strictly following the code examples. But I think there's something wrong with the code in chapter 10, when training a model using CNN to recognize the MNIST images. In the last part of the code when updating the weights:
layer_2_delta = (labels[batch_start:batch_end]-layer_2) / (batch_size*layer_2.shape[0])
layer_1_delta = layer_2_delta.dot(weights_1_2.T)*tanh2deriv(layer_1)
layer_1_delta*=dropout_mask
weights_1_2 += alpha*layer_1.T.dot(layer_2_delta)
l1d_reshape = layer_1_delta.reshape(kernel_output.shape)
k_update = flattened_input.T.dot(l1d_reshape)
kernels -= alpha*k_update
I'm a bit surprised because, according to what I learned earlier in the book, the layer_x_deltas should be the negative derivatives of the loss function, so I think the last line should be
kernels += alpha*k_update
After modifying this, I try it on my own computer. The output:
I:0 Train-Acc: 0.132
I:1 Train-Acc: 0.174
I:2 Train-Acc: 0.191
I:3 Train-Acc: 0.215
I:4 Train-Acc: 0.241
I:5 Train-Acc: 0.249
I:6 Train-Acc: 0.296
I:7 Train-Acc: 0.31
I:8 Train-Acc: 0.37
I:9 Train-Acc: 0.358
I:10 Train-Acc: 0.408
I:11 Train-Acc: 0.438
I:12 Train-Acc: 0.465
I:13 Train-Acc: 0.479
I:14 Train-Acc: 0.528
I:15 Train-Acc: 0.548
I:16 Train-Acc: 0.533
I:17 Train-Acc: 0.569
I:18 Train-Acc: 0.574
I:19 Train-Acc: 0.605
I:20 Train-Acc: 0.605
...
But with the original code, I get:
I:0 Train-Acc: 0.055
I:1 Train-Acc: 0.037
I:2 Train-Acc: 0.037
I:3 Train-Acc: 0.04
I:4 Train-Acc: 0.046
I:5 Train-Acc: 0.068
I:6 Train-Acc: 0.083
I:7 Train-Acc: 0.096
I:8 Train-Acc: 0.127
I:9 Train-Acc: 0.148
I:10 Train-Acc: 0.181
I:11 Train-Acc: 0.209
I:12 Train-Acc: 0.238
I:13 Train-Acc: 0.286
I:14 Train-Acc: 0.274
I:15 Train-Acc: 0.257
I:16 Train-Acc: 0.243
I:17 Train-Acc: 0.112
I:18 Train-Acc: 0.035
I:19 Train-Acc: 0.026
I:20 Train-Acc: 0.022
After the modification, the training accuracy increases much more rapidly than with the original "-=". However, it puzzles me that after 300 iterations both models reach an accuracy of about 86%. So what's the difference? Does the code have a typo, or have I simply misunderstood it?
I posted a question about this on Stack Overflow. I have not typed the code incorrectly. So what's wrong?
This is from v12.
if(error > e_dn || error > e_up):
if (e_dn < e_up):
weight -= lr
if (e_up < e_up):
weight += lr
should be
if(error > e_dn || error > e_up):
if (e_dn < e_up):
weight -= lr
if (e_up < e_dn):
weight += lr
in the final division of the example code.
I believe layer_1_delta should be calculated below
layer_1_input = np.dot(layer_0,weights_0_1)
layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1_input)
instead of
layer_1 = relu(np.dot(layer_0,weights_0_1))
layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
even though they return the same value.
Where does .14 come from?
layer_2_delta=(layer_2-walk_stop[0:1])
walk_stop = np.array([[ 1, 1, 0, 0]]).T
Wouldn't this just be -.02 - 1 === -1.02?
In the Homomorphically encrypted federated learning section
The provided code is as follows:
1. model = Embedding(vocab_size=len(vocab), dim=1)
2. model.weight.data *= 0
3.
4. # note that in production the n_length should be at least 1024
5. public_key, private_key = phe.generate_paillier_keypair(n_length=128)
6.
7. def train_and_encrypt(model, input, target, pubkey):
8. new_model = train(copy.deepcopy(model), input, target, iterations=1)
9.
10. encrypted_weights = list()
11. for val in new_model.weight.data[:,0]:
12. encrypted_weights.append(public_key.encrypt(val))
13. ew = np.array(encrypted_weights).reshape(new_model.weight.data.shape)
14.
15. return ew
16.
17. for i in range(3):
18. print("\nStarting Training Round...")
19. print("\tStep 1: send the model to Bob")
20. bob_encrypted_model = train_and_encrypt(copy.deepcopy(model),
21. bob[0], bob[1], public_key)
22.
23. print("\n\tStep 2: send the model to Alice")
24. alice_encrypted_model=train_and_encrypt(copy.deepcopy(model),
25. alice[0],alice[1],public_key)
26.
27. print("\n\tStep 3: Send the model to Sue")
28. sue_encrypted_model = train_and_encrypt(copy.deepcopy(model),
29. sue[0], sue[1], public_key)
30.
31. print("\n\tStep 4: Bob, Alice, and Sue send their")
32. print("\tencrypted models to each other.")
33. aggregated_model = bob_encrypted_model + \
34. alice_encrypted_model + \
35. sue_encrypted_model
36.
37. print("\n\tStep 5: only the aggregated model")
38. print("\tis sent back to the model owner who")
39. print("\t can decrypt it.")
40. raw_values = list()
41. for val in sue_encrypted_model.flatten():
42. raw_values.append(private_key.decrypt(val))
43. new = np.array(raw_values).reshape(model.weight.data.shape)/3
44. model.weight.data = new
45.
46. print("\t% Correct on Test Set: " + \
47. str(test(model, test_data, test_target)*100))
And I think the sue_encrypted_model on line 41 should be aggregated_model?
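The effect of decrypting the wrong variable can be seen with plain numbers. This sketch leaves out the Paillier encryption entirely and uses small hypothetical weight vectors for the three participants:

```python
import numpy as np

# Hypothetical per-participant weights after one round of local training.
bob = np.array([0.3, -0.6])
alice = np.array([0.9, 0.0])
sue = np.array([-0.3, 0.6])

aggregated = bob + alice + sue

averaged = aggregated / 3  # what decrypting aggregated_model and dividing by 3 gives
sue_only = sue / 3         # what decrypting sue_encrypted_model and dividing by 3 gives

print(averaged)
print(sue_only)
```

Only the first result is the average of the three models; the second ignores Bob's and Alice's updates and shrinks Sue's weights by a factor of three.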
for k in range(batch_size):
correct_cnt += int(np.argmax(layer_2[k:k+1]) == np.argmax(labels[batch_start+k:batch_start+k+1]))
layer_2_delta = (labels[batch_start:batch_end]-layer_2)/batch_size
layer_1_delta = layer_2_delta.dot(weights_1_2.T)* relu2deriv(layer_1)
layer_1_delta *= dropout_mask
weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)
In the above code, why are we computing the values of layer_1_delta and layer_2_delta again and again? Shouldn't one iteration suffice? What is the purpose? This is the code in the regularization chapter for MNIST digit classification with mini-batched SGD. I changed some of it:
layer_2_delta = (labels[batch_start:batch_end]-layer_2)/batch_size
layer_1_delta = layer_2_delta.dot(weights_1_2.T)* relu2deriv(layer_1)
weights_1_2 += (batch_size-1)*alpha * layer_1.T.dot(layer_2_delta)
weights_0_1 += (batch_size-1)*alpha * layer_0.T.dot(layer_1_delta)
layer_1_delta *= dropout_mask
for k in range(batch_size):
correct_cnt += int(np.argmax(layer_2[k:k+1]) == np.argmax(labels[batch_start+k:batch_start+k+1]))
This seems much faster and reaches the same benchmarks.
Hi,
Excellent stuff. I just wanted to point out that you used "input" as a variable name. It's actually the name of a built-in function that reads input from the user.
I know that in this case you're not likely to use it elsewhere, and you can override built-in names, but it's not advised.
You can usually tell that a name is built in because it changes color; in this color scheme it turns green.
In chapter 6, in the "Creating a matrix or two in Python" topic, there is a typo.
The mistake:
error = (goal_prediction - prediction) ** 2
what it should be:
error = (prediction - goal_prediction) ** 2
We can't see any difference in the output because the error is squared.
The "Learning the whole Dataset!" topic also has the same mistake.
Why was it necessary to define "new" and then use it only once after defining it?
I think the Tensor data in this case is multiplied by "other". But why is it different from "add"?
Hello
I cloned the Jupyter files this evening and when running through them I came across an error in the "Gradient Descent Learning with Multiple Inputs" script in Chapter5.ipynb
https://github.com/iamtrask/Grokking-Deep-Learning/blob/master/Chapter5.ipynb
NameError: name 'weight_deltas' is not defined
I believe the following line is currently missing from the script:
weight_deltas = ele_mul(delta, input)
With this line added I got a result:
Weights:[0.1119, 0.20091, -0.09832]
Weight Deltas:[-1.189999999999999, -0.09099999999999994, -0.16799999999999987]
Thanks.
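For reference, here is a self-contained sketch of the cell with the missing line restored. The starting weights and the first data point (8.5, 0.65, 1.2 with a true value of 1) are assumptions chosen to be consistent with the weight deltas reported above, and `inputs` is used instead of `input` to avoid shadowing the built-in.

```python
def w_sum(a, b):
    assert len(a) == len(b)
    return sum(a[i] * b[i] for i in range(len(a)))

def ele_mul(number, vector):
    return [number * v for v in vector]

def neural_network(inputs, weights):
    return w_sum(inputs, weights)

# Assumed starting weights and first data point.
weights = [0.1, 0.2, -0.1]
inputs = [8.5, 0.65, 1.2]
true = 1
alpha = 0.01

pred = neural_network(inputs, weights)
error = (pred - true) ** 2
delta = pred - true

# The line reported as missing from the notebook cell.
weight_deltas = ele_mul(delta, inputs)

for i in range(len(weights)):
    weights[i] -= alpha * weight_deltas[i]

print("Weights:" + str(weights))
print("Weight Deltas:" + str(weight_deltas))
```

Under these assumptions the printed values agree with the ones reported above, up to floating-point rounding.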
Please see https://github.com/suyash/grokking-deep-learning-rs
I have implemented all the exercises in Rust. Rust is a new programming language with a focus on safety, speed and concurrency, while having a minimal runtime and zero garbage collection.
In the Rust community, it is common to dual license code with Apache 2 and MIT. I have done the same. Hopefully that's okay.