iamtrask / grokking-deep-learning

This repository accompanies the book "Grokking Deep Learning".

Jupyter Notebook 100.00%

grokking-deep-learning's Issues

Chap8: the way of mini-batch gradient descent updating weights

In Chapter 8, the batch gradient descent code is confusing.

for j in range(iterations):
    error, correct_cnt = (0.0, 0)
    for i in range(int(len(images) / batch_size)):
        batch_start, batch_end = ((i * batch_size),((i+1)*batch_size))
        #...
        for k in range(batch_size):
            # ...
            weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
            weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

In short, I think the code should update the weights only x times per iteration, where x is the number of batches, rather than n times, where n is the number of training samples.
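A minimal sketch of the once-per-batch update this issue argues for. It uses a hypothetical single-layer linear model on random data (the data, shapes, and names below are assumptions, not the chapter's MNIST code). The delta is computed for the whole batch and the weights change once per batch, so each pass over the data performs len(inputs) / batch_size updates.

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(100, 4))        # hypothetical data: 100 samples, 4 features
targets = inputs @ np.array([[1.0], [-2.0], [0.5], [3.0]])

weights = np.zeros((4, 1))
alpha, batch_size = 0.1, 10
updates = 0

for epoch in range(50):
    for i in range(len(inputs) // batch_size):
        batch_start, batch_end = i * batch_size, (i + 1) * batch_size
        layer_0 = inputs[batch_start:batch_end]
        layer_1 = layer_0 @ weights
        # average the delta over the batch, then update once per batch
        layer_1_delta = (targets[batch_start:batch_end] - layer_1) / batch_size
        weights += alpha * layer_0.T @ layer_1_delta
        updates += 1

final_error = float(np.mean((inputs @ weights - targets) ** 2))
print(updates, final_error)
```

With 100 samples and a batch size of 10, 50 passes give 500 updates rather than 5000.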

Chapter 8: Batch gradient descent. Wrong alpha value

In Chapter 8, the value of alpha in the dropout example is 0.005. In the batch gradient descent example, the text says that alpha is 20 times larger than before. 0.005 * 20 = 0.1

However, the value of alpha in the code example is 0.001.

I think it should be 0.1.

Ch11 Predicting Movie Reviews - error in back propagation code

There seems to be a small mistake in the Predicting Movie Reviews code. Here is the code:

        x,y = (input_dataset[i],target_dataset[i])
        layer_1 = sigmoid(np.sum(weights_0_1[x],axis=0)) #embed + sigmoid
        layer_2 = sigmoid(np.dot(layer_1,weights_1_2)) # linear + softmax
          
        layer_2_delta = layer_2 - y # compare pred with truth
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) #backprop

        weights_0_1[x] -= layer_1_delta * alpha
        weights_1_2 -= np.outer(layer_1,layer_2_delta) * alpha

Error
In the forward pass, the code applies the sigmoid activation function to layer_1.
Therefore, when we calculate layer_1_delta, should we not multiply by the derivative of sigmoid?
My understanding was that either we should not apply the sigmoid function to layer_1, or, if we do apply it, we should multiply by its derivative in backprop.
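A minimal sketch of the hidden-layer delta with the sigmoid derivative included, checked against a numerical slope. The shapes, random weights, and the cross-entropy loss are assumptions made for the check (with a sigmoid output and cross-entropy, layer_2 - y is the gradient at the output's pre-activation); this is not the notebook's code.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
weights_0_1 = rng.normal(scale=0.1, size=(5, 3))   # hypothetical: 5 words, 3 hidden units
weights_1_2 = rng.normal(scale=0.1, size=(3, 1))
x, y = [0, 2, 4], 1.0                              # word indices of one review, and its label

layer_1 = sigmoid(np.sum(weights_0_1[x], axis=0))
layer_2 = sigmoid(np.dot(layer_1, weights_1_2))

layer_2_delta = layer_2 - y
# chain rule through the hidden sigmoid: multiply by layer_1 * (1 - layer_1)
layer_1_delta = layer_2_delta.dot(weights_1_2.T) * layer_1 * (1 - layer_1)

def loss(w01):
    # assumed loss: binary cross-entropy on the sigmoid output
    h = sigmoid(np.sum(w01[x], axis=0))
    p = sigmoid(np.dot(h, weights_1_2))
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p))[0])

# numerical slope of the loss with respect to one embedding weight
eps = 1e-6
bumped = weights_0_1.copy()
bumped[0, 0] += eps
numeric = (loss(bumped) - loss(weights_0_1)) / eps
print(numeric, layer_1_delta[0])
```

The two printed values agree only when the layer_1 * (1 - layer_1) factor is present.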

Chapter 5: weight_deltas calculation in case of multiple inputs and multiple outputs

weight_deltas are calculated in this way:

[ [input[0] * delta[0], input[0] * delta[1], input[0] * delta[2]],
  [input[1] * delta[0], input[1] * delta[1], input[1] * delta[2]],
  [input[2] * delta[0], input[2] * delta[1], input[2] * delta[2]] ]

but should be transposed:

[ [input[0] * delta[0], input[1] * delta[0], input[2] * delta[0]],
  [input[0] * delta[1], input[1] * delta[1], input[2] * delta[1]],
  [input[0] * delta[2], input[1] * delta[2], input[2] * delta[2]] ]

otherwise weights are updated incorrectly.

Current code:

import numpy as np
def outer_prod(a, b):
    
    # just a matrix of zeros
    out = np.zeros((len(a), len(b)))

    for i in range(len(a)):
        for j in range(len(b)):
            out[i][j] = a[i] * b[j]
    return out

weight_deltas = outer_prod(input,delta)

PR should fix the issue: #22
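A small shape check of the two orientations, assuming the weights are stored with one row per output (the values below are hypothetical). Using 3 inputs and 2 outputs instead of a square matrix makes the mismatch visible.

```python
import numpy as np

inputs = np.array([8.5, 0.65, 1.2])        # 3 inputs (hypothetical values)
weights = np.array([[0.1, 0.1, -0.3],      # row i holds the weights feeding output i
                    [0.1, 0.2, 0.0]])      # 2 outputs
true = np.array([0.1, 1.0])

pred = weights.dot(inputs)
delta = pred - true

as_written = np.outer(inputs, delta)       # shape (3, 2): indexed [input][output]
transposed = np.outer(delta, inputs)       # shape (2, 3): same layout as weights

print(as_written.shape, transposed.shape, weights.shape)
```

With a square weight matrix both versions have a compatible shape, so the wrong orientation would run without error.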

NameError in Chapter 4 notebook

In the first cell, the variable weight_deltas is not defined.
The call happens here

for i in range(len(weights)):
    weights[i] -= alpha * weight_deltas[i]

lots of mistakes and illogical ordering? is this a pre-script?

Hey, as I see it there are actually many mistakes in the book. I found several parts where I just got confused and had to guess or rely on trial and error. Please, Andrew, as you yourself are into ethical ML and AI: where is the place to report these things? There is seemingly no response here. The attempt to write a very intuitive book about DL is wonderful, but it's not fair to spend money and time on something that someone else hasn't spent as much time on.

Chapter13 - Intro to Automatic Differentiation - index_select_indices variable was never created previous to use

There is a reference to a variable that was never created in the code under the embedding layer.

Specifically

            if(self.creation_op == "index_select"):
                new_grad = np.zeros_like(self.creators[0].data)
                indices_ = self.index_select_indices.data.flatten()
                grad_ = grad.data.reshape(len(indices_), -1)
                for i in range(len(indices_)):
                    new_grad[indices_[i]] += grad_[i]
                self.creators[0].backward(Tensor(new_grad))

Nowhere else in the code is an index_select_indices variable created, to my knowledge, unless I am missing something.

The related code shows a reference to new, as shown below.

def index_select(self, indices):

    if(self.autograd):
        new = Tensor(self.data[indices.data],
                     autograd=True,
                     creators=[self],
                     creation_op="index_select")
        new.index_select_indices = indices
        return new
    return Tensor(self.data[indices.data])
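A minimal sketch of how the attribute can come into existence. Python lets code attach a new attribute to an instance after construction, so the assignment inside index_select is what creates index_select_indices on the returned tensor. TinyTensor below is a hypothetical stand-in, not the book's Tensor class.

```python
import numpy as np

class TinyTensor:
    # hypothetical, stripped-down stand-in for the book's Tensor
    def __init__(self, data, creators=None, creation_op=None):
        self.data = np.array(data)
        self.creators = creators
        self.creation_op = creation_op

    def index_select(self, indices):
        new = TinyTensor(self.data[indices.data],
                         creators=[self],
                         creation_op="index_select")
        new.index_select_indices = indices   # the attribute is created here
        return new

emb = TinyTensor(np.eye(3))
idx = TinyTensor([0, 2, 2])
out = emb.index_select(idx)

print(hasattr(emb, "index_select_indices"))  # the source tensor does not have it
print(hasattr(out, "index_select_indices"))  # the selected tensor does

# the backward step then scatters one gradient row back per selected index
grad = np.ones((3, 3))
new_grad = np.zeros_like(emb.data)
indices_ = out.index_select_indices.data.flatten()
for i in range(len(indices_)):
    new_grad[indices_[i]] += grad[i]
```

Because row 2 was selected twice, it accumulates two gradient rows, while row 1 stays at zero.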

Chapter-4-page-53

In illustration step number 5 - COMPARE + LEARN

if(e_up < e_up):
 weight += lr

should be:

if(e_up < e_dn):
 weight += lr

Chapter 10: CNN

The code for the CNN is pretty nifty, but with the hidden_size there seems to be a problem

hidden_size = ((input_rows - kernel_rows) *
               (input_cols - kernel_cols)) * num_kernels

I think the code should change to

hidden_size = ((input_rows - kernel_rows) *
               (input_cols - kernel_cols)) * num_labels

num_labels would signify the symmetry of the weights as the hidden layer would not really care about the number of kernels as that is handled with the n linear structs already.
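For reference, a small shape check under assumed values (28x28 inputs, 3x3 kernels, 16 kernels, and a tiny batch; all of these numbers are assumptions here). Each kernel produces one value per kernel position, so the flattened hidden layer in this sketch has positions * num_kernels entries, independent of the number of labels.

```python
import numpy as np

input_rows, input_cols = 28, 28      # assumed image size
kernel_rows, kernel_cols = 3, 3      # assumed kernel size
num_kernels = 16                     # assumed kernel count
batch = 2                            # hypothetical tiny batch

images = np.zeros((batch, input_rows, input_cols))
kernels = np.zeros((kernel_rows * kernel_cols, num_kernels))

# collect every kernel-sized patch, using the same loop bounds as the formula
sections = []
for row_start in range(input_rows - kernel_rows):
    for col_start in range(input_cols - kernel_cols):
        patch = images[:, row_start:row_start + kernel_rows,
                       col_start:col_start + kernel_cols]
        sections.append(patch.reshape(batch, 1, -1))

expanded = np.concatenate(sections, axis=1)     # (batch, positions, 9)
kernel_output = expanded.reshape(-1, kernel_rows * kernel_cols).dot(kernels)
layer_1 = kernel_output.reshape(batch, -1)      # (batch, positions * num_kernels)

hidden_size = (input_rows - kernel_rows) * (input_cols - kernel_cols) * num_kernels
print(layer_1.shape, hidden_size)
```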

Ending Notes Suggestions

See if any of these links make any sense for you to expand your coursework offerings a little bit:

huggingface/tokenizers#69
https://github.com/Kayzaks/HackingNeuralNetworks

Activate layer 2 output using Relu()?

I might be reading it incorrectly, but it looks like you don't apply the activation function to the final output layer? (should that be applied, in this context?)

ch 10 kernel weight update typo

On pg 183 (also chapter10 - Intro to convolutional Neural Networks - learning edges and corners.ipynb), I believe that

kernels -= alpha * k_update
should read
kernels += alpha * k_update

chapter 10, is the `for row_start in range(layer_0.shape[1]-kernel_rows)` correct?

With an 8x8 image and a 3x3 kernel, we get a 6x6 output, since 8-3+1=6.

When using

for row_start in range(layer_0.shape[1]-kernel_rows)

it discards the last pixel in the row.

What do you think?
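A quick way to check the arithmetic above in one dimension. The helper name covered is hypothetical; it lists which pixels are touched by at least one kernel position for a given loop bound.

```python
def covered(size, kernel, stop):
    """Return the set of pixel indices touched when start runs over range(stop)."""
    pixels = set()
    for start in range(stop):
        pixels.update(range(start, start + kernel))
    return pixels

size, kernel = 8, 3

without_plus_one = covered(size, kernel, size - kernel)       # range(5): 5 positions
with_plus_one = covered(size, kernel, size - kernel + 1)      # range(6): 6 positions

print(sorted(without_plus_one))
print(sorted(with_plus_one))
```

With range(size - kernel) the last pixel index (7) is never read; with the + 1 every pixel is covered and there are 6 positions.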

Chapter 11: code formatting is broken

Code formatting in Chapter 11, section "Predicting Movie Reviews", is broken. I was not able to reproduce the results in the book, p. 196, and assume that there is some error in my loop structure. Could you update the formatting of the example?

Inactive Activation gradients

Notably in Chapter 8, backpropagation through the activation function gradients appears off: if you want the derivative of an activation function at a given input, σ'(x), shouldn't you evaluate the gradient at that input x instead of at the output y = σ(x)?
Example: if you calculate
layer_1 = relu(np.dot(layer_0,weights_0_1))
in the forward direction, then propagating backward would require
layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(np.dot(layer_0,weights_0_1))
i.e. the input of the activation function, and not, as suggested,
layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
After all, applying relu2deriv(relu(x)) would yield (x>=0) evaluated on a value that is never negative, i.e. the identity, and would not actually change anything.
The effects on training are not too big, but it does affect overfitting, the amount of loss, and in fact some of the narrative.
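A small check of the point above, assuming relu and relu2deriv are defined with >= as the issue describes (the definitions and sample values below are assumptions, not copied from the notebook). It also shows a strict > variant for comparison.

```python
import numpy as np

relu = lambda x: (x >= 0) * x
relu2deriv_ge = lambda x: x >= 0     # comparison described in the issue
relu2deriv_gt = lambda x: x > 0      # strict variant, shown for comparison

pre_activation = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
layer_1 = relu(pre_activation)

mask_ge_on_output = relu2deriv_ge(layer_1)         # evaluated on the relu output
mask_ge_on_input = relu2deriv_ge(pre_activation)   # evaluated on the relu input
mask_gt_on_output = relu2deriv_gt(layer_1)
mask_gt_on_input = relu2deriv_gt(pre_activation)

print(mask_ge_on_output, mask_ge_on_input)
print(mask_gt_on_output, mask_gt_on_input)
```

With >=, the mask computed from the output is all True and blocks nothing. With >, the mask computed from the output equals the mask computed from the input.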

Chapter 13 - I have a problem trying to implement the autograd in a simple linear regression

I tried this; my input size is 1, 1000 and my output size is 1, 100

import numpy as np

x = np.array(range(1000))
y = x * 12 + 15
y = y + np.random.randn(*y.shape)

x = x.reshape(-1 , 1)
y = y.reshape(-1 , 1)

data = Tensor(x, autograd=True)
target = Tensor(y, autograd=True)

w = list()
w.append(Tensor(np.random.rand(1,1), autograd=True))
w.append(Tensor(np.random.rand(1), autograd=True))
for i in range(10):

  pred = data.mm(w[0]) + w[1].expand(0,1000)
  loss = ((pred - target)*(pred - target)).sum(0)

  loss.backward(Tensor(np.ones_like(loss.data)))
  for w_ in w:
    w_.data -= w_.grad.data * 0.1
    w_.grad.data *= 0
  print(loss)

OUTPUT

[4.20028134e+10]
[1.86120338e+26]
[8.24726275e+41]
[3.65448202e+57]
[1.61935411e+73]
[7.17559347e+88]
[3.17960978e+104]
[1.40893132e+120]
[6.2431795e+135]
[2.7664436e+151]
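One possible explanation for the growing loss, sketched in plain NumPy without the book's Tensor class: with inputs up to 999, a summed loss, and a step size of 0.1, every update overshoots. The helper fit, the noise-free data, the input scaling, and the step size 0.5 are all assumptions made for this sketch.

```python
import numpy as np

def fit(x, y, alpha, steps, reduce):
    """Gradient descent on y ~ w * x + b; reduce is np.sum or np.mean."""
    w, b, losses = 0.0, 0.0, []
    for _ in range(steps):
        diff = w * x + b - y
        losses.append(float(reduce(diff ** 2)))
        # gradient of reduce(diff ** 2) with respect to w and b
        w -= alpha * 2 * float(reduce(diff * x))
        b -= alpha * 2 * float(reduce(diff))
    return w, b, losses

x = np.arange(1000, dtype=float)
y = 12 * x + 15                      # noise left out to keep the sketch deterministic

# raw inputs, summed loss, alpha = 0.1: the loss grows at every step
_, _, raw_losses = fit(x, y, alpha=0.1, steps=3, reduce=np.sum)

# inputs scaled to [0, 1], mean loss: the same loop settles down
scale = x.max()
w, b, scaled_losses = fit(x / scale, y, alpha=0.5, steps=2000, reduce=np.mean)
w_original_units = w / scale         # undo the input scaling

print(raw_losses)
print(w_original_units, b)
```

Scaling the inputs and averaging the loss are two common ways to keep the step size in a stable range; a much smaller alpha on the raw data is another.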

Chapter 10

Hey @iamtrask
In Chapter 10, we are told that weights are reused, but I cannot see how the network reuses them.
The network looks the same as Chapter 9's network; in what way are the weights being reused?
Is it because all the sub-regions of every image share the same kernel?

2nd edition?

Hi, curious if there are any plans to release a 2nd edition?

chapter 5-bugfix

There is an error in "Gradient Descent Learning with Multiple Inputs": the line weight_deltas=ele_mul(delta,input) is not used.

Use:

for i in range(len(weights)):
    weight_deltas=ele_mul(delta,input)
    weights[i] -= alpha * weight_deltas[i]

Error: Learning rate value (Section 4.6)

This is from v12.

lr = 0.1

p_up = neural_network(input, weight + lr)

should be

lr = 0.01

p_up = neural_network(input, weight + lr)

in the third division of the example code.

Weight updates in rnn

In Chapter 12, why are we giving the previous layer's delta to the embedding layer?

embed_idx = sent[layer_idx]

embed[embed_idx] -= layers[layer_idx]['hidden_delta'] * alpha / float(len(sent))

Shouldn't it be

embed[embed_idx] -= layer['hidden_delta'] * alpha / float(len(sent))

Also, why update the embedding of the last word in the sentence, which is the word being predicted by the network?

small mistakes in Chapter 3

Hey, trask. There is something wrong in Chapter 3, In[13]: it should be 5 rows & 6 columns, right?

what is the size of the dictionary in chapter 14

I have the size len(vocab)=62. I think it is small for 512.

Chapter 5: Gradient Descent Learning with Multiple Inputs

Running the first code snippet from master produces this error:

'weight_deltas' is not defined

Also, the function ele_mul is defined but never used.

Chapter 15, section "Secure Aggregation" -- extra deepcopy?

It looks like in Chapter 15, section "Secure Aggregation", a deepcopy of the model is made both in the receiving function train_and_encrypt and by the caller:

def train_and_encrypt(model, input, target, pubkey):
    new_model = train(copy.deepcopy(model), input, target, iterations=1)

Caller:

bob_encrypted_model = train_and_encrypt(copy.deepcopy(model), 
                                            bob[0], bob[1], public_key)

No big deal, but I thought I would mention it anyway.

Help Please

I'm a 77-year-old beginner. I know Excel VBA, but not Python. On page 7 the book instructs me to install Jupyter (jupyter.org) and NumPy (numpy.org). I tried doing this, but the site gives me the instruction "pip install jupyterlab" and then "jupyter lab" to launch JupyterLab, then "pip install notebook" and "jupyter notebook" to launch the notebook. I'm assuming all of this is done in Python and I must first install Python. Any help will be appreciated.

Chapter 15 small issue in train

bs is not defined and should probably be batch_size

def train(model, input_data, target_data, batch_size=500, iterations=5):
    
    criterion = MSELoss()
    optim = SGD(parameters=model.get_parameters(), alpha=0.01)
    
    n_batches = int(len(input_data) / batch_size)
    for iter in range(iterations):
        iter_loss = 0
        for b_i in range(n_batches):

            # padding token should stay at 0
            model.weight.data[w2i['<unk>']] *= 0 
            input = Tensor(input_data[b_i*bs:(b_i+1)*bs], autograd=True)
            target = Tensor(target_data[b_i*bs:(b_i+1)*bs], autograd=True)

            pred = model.forward(input).sum(1).sigmoid()
            loss = criterion.forward(pred,target)
            loss.backward()
            optim.step()

            iter_loss += loss.data[0] / bs

            sys.stdout.write("\r\tLoss:" + str(iter_loss / (b_i+1)))
        print()
    return model

License

Hi,

Can you please add a LICENSE file to the repo?

Thank you

Question about delta in chapter 5

If I understand correctly delta on page 81 in this code:

input = [toes[0],wlrec[0],nfans[0]]
pred = neural_network(input,weight)
error = (pred - true) ** 2
delta = pred - true # assuming this is the partial of L w.r.t. pred
weight_deltas = ele_mul(delta,input)

This seems to represent the partial of the loss function w.r.t. pred. However, working this out on paper shows that the actual partial is 2(pred-true). Am I misunderstanding what's happening?
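One common way to reconcile the two, sketched on a hypothetical single-weight network (all values below are assumptions): dropping the factor of 2 only rescales the step, so an update with delta and step size alpha equals an update with the exact gradient and step size alpha / 2.

```python
weight, input_value, true = 0.5, 2.0, 0.8    # hypothetical single-weight network
alpha = 0.1

pred = input_value * weight
delta = pred - true                          # what the quoted code calls delta
exact_grad = 2 * delta * input_value         # d/d(weight) of (pred - true) ** 2

step_without_factor = weight - alpha * (delta * input_value)
step_with_factor = weight - (alpha / 2) * exact_grad

print(step_without_factor, step_with_factor)
```

Both updates land on the same weight, so the constant can be treated as folded into alpha.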

Chapter 13 - Passing all ones tensors for backprop

It is not clear to me why we are passing all ones tensors when calling backward() on a tensor.
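One common reading, not confirmed by the author in this thread: backward() needs a starting gradient, and the derivative of the output with respect to itself is 1 for every element. The sketch below illustrates this with a hand-written backward step for an elementwise square; the function name and values are hypothetical and it does not use the book's Tensor class.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = x * x                                # forward pass: elementwise square

def backward_through_square(seed):
    # chain rule: incoming gradient times the local derivative 2 * x
    return seed * 2 * x

grad_with_ones = backward_through_square(np.ones_like(y))
grad_scaled = backward_through_square(np.full_like(y, 0.5))

print(grad_with_ones)
print(grad_scaled)
```

A seed of ones returns the plain derivative 2 * x; any other seed scales the result, which is what happens when the tensor is an intermediate value rather than the final output.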

Chapter 6: layer_2_delta calculation

Section "Backpropagation in Code":
There is:
layer_2_delta = (walk_vs_stop[i:i+1] - layer_2)
I believe it should be:
layer_2_delta = (layer_2 - walk_vs_stop[i:i+1])
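A small sketch of the two sign conventions, on hypothetical values: a delta of target minus prediction paired with +=, and a delta of prediction minus target paired with -=. Either is consistent as long as the update uses the matching sign.

```python
import numpy as np

layer_1 = np.array([[0.2, 0.5, 0.1]])           # hypothetical hidden activations
weights_1_2 = np.array([[0.3], [-0.1], [0.4]])
target = np.array([[1.0]])
alpha = 0.2

layer_2 = layer_1.dot(weights_1_2)

# convention A: delta = target - prediction, weights move with +=
delta_a = target - layer_2
weights_a = weights_1_2 + alpha * layer_1.T.dot(delta_a)

# convention B: delta = prediction - target, weights move with -=
delta_b = layer_2 - target
weights_b = weights_1_2 - alpha * layer_1.T.dot(delta_b)

error_before = float(((layer_2 - target) ** 2).sum())
error_after = float(((layer_1.dot(weights_a) - target) ** 2).sum())
print(np.allclose(weights_a, weights_b), error_before, error_after)
```

Mixing the conventions (for example delta_a with -=) would move the weights away from the target instead.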

Missed variable

In Chapter 8, after adding batch gradient descent, the variable "images" is missing.

Chapter 6 - Array in array vs use vector

Hello, Andrew!

Thank you very much for your book! It helps me a lot on the way to change my profession.

I am having difficulty understanding the code in Chapter 6, in the Putting it all Together section. It was difficult for me to understand the arrays in arrays and the many matrix transposition operations.

I've reproduced the code myself several times and tried using regular vectors. I only needed the transpose operation once, for the last calculation of weights_1.

I will provide the code below. I got the same result. The code seems easier to read to me. Tell me please, am I on the right track?
Or have I misunderstood something important, and is this bad code style?

I used constant values of the weights so that I could better understand the operation of the algorithm and in order to be able to duplicate the weights in the original algorithm and compare the results.

Sorry for my English.

Thank you in advance.

weights_1 = np.array([ [ -0.16595599,  0.44064899, -0.99977125, -0.39533485 ],
                       [ -0.70648822, -0.81532281, -0.62747958, -0.30887855 ],
                       [ -0.20646505, 0.07763347, -0.16161097,  0.370439 ] ] )

weights_2 = np.array([ -0.5910955, 0.75623487, -0.94522481, 0.34093502 ])

street_lights = np.array( [ [ 1, 0, 1 ],
                            [ 0, 1, 1 ],
                            [ 0, 0, 1 ],
                            [ 1, 1, 1 ]])

walk_vs_stop = np.array([1, 1, 0, 0])

for iteration in range(60):
    sum_error = 0
    for i in range(len(street_lights)):        
        layer_0 = street_lights[i]        
        layer_1 = relu(np.dot(layer_0, weights_1))
        layer_2 = np.dot(layer_1, weights_2)
        
                     
        sum_error += (layer_2 - walk_vs_stop[i]) ** 2        
        
        delta_2 = layer_2  - walk_vs_stop[i]
        delta_1 = np.dot(delta_2, weights_2) * relu2deriv(layer_1)
        
        
        weights_2 -= alpha * layer_1.dot(delta_2)        
        weights_1 -= alpha * np.dot(np.array([layer_0]).T, np.array([delta_1]))
       
    print(sum_error)

Authorization to re-implement in another language

Would it be ok if I did a re-implementation in another programming language of the code in the book ?

I don't see a licence, so I'm not quite sure if that would be ok. I would obviously reference and link to this repo and the book itself.

Thank you for your time !

chapter8 relu2deriv

Running relu2deriv() may have no effect; or is it just a demo for this case?

I think it does not help during backpropagation.

Any ideas or suggestions? Thanks.

Where is the book code? I am an official Manning Tech Proof reviewer

Chapter 4 - How is weight_delta computed ?

Hello,

I have finished Chapter 4, but I have a question regarding weight_delta. It has a value of delta * input.
In the book it says that weight_delta is the derivative of the error, right?
On page 60, for example, the error is error = ((0.5*weight) - 0.8) ** 2

When I give this error function to Wolfram Alpha, it gives me the derivative as 0.5 * x - 0.8 (where x = weight).
So, in general, the derivative of the error should be input * weight - goal_pred.

So why do they use delta * input for weight_delta if weight_delta is the derivative?
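A numerical check of the slope, using the input 0.5 and goal 0.8 from the error function above and a hypothetical weight of 0.5. It compares a finite-difference slope of the error with delta * input.

```python
input_value, goal_pred = 0.5, 0.8    # from the error function quoted above
weight = 0.5                         # hypothetical value at which to check the slope

def error(w):
    return (input_value * w - goal_pred) ** 2

delta = input_value * weight - goal_pred
weight_delta = delta * input_value

# central finite difference of the error with respect to the weight
eps = 1e-6
numeric_slope = (error(weight + eps) - error(weight - eps)) / (2 * eps)

print(numeric_slope, weight_delta, 2 * weight_delta)
```

The measured slope matches 2 * delta * input, so delta * input is the derivative up to a constant factor of 2, which only rescales the step taken with alpha.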

Mistake in Ch3. code

In Chapter 3, in the code for the section Predicting with Multiple Inputs & Outputs,

def vect_mat_mul(vect,matrix):
    assert(len(vect) == len(matrix))

the aim is to check that the number of inputs is equal to the number of weights for every output. But in fact len(matrix) corresponds to the number of outputs, not the number of weights. To check the number of weights, we should write something like:

def vect_mat_mul(vect,matrix):
    assert(len(vect) == len(matrix[0]))

for example

Also in the part of the code:

for i in range(len(vect)):
    output[i] = w_sum(vect,matrix[i])

it should be:

for i in range(len(matrix)):
    output[i] = w_sum(vect,matrix[i])

because we perform the weighted sum for each output, not for each input.
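Putting both suggested changes together, a runnable sketch with 3 inputs and 2 outputs, so that the number of inputs and outputs differ. The weight and input values are hypothetical, and w_sum is written out here so the block runs on its own.

```python
def w_sum(a, b):
    assert len(a) == len(b)
    return sum(a[i] * b[i] for i in range(len(a)))

def vect_mat_mul(vect, matrix):
    # each row of matrix holds the weights for one output
    assert len(vect) == len(matrix[0])
    output = [0] * len(matrix)
    for i in range(len(matrix)):
        output[i] = w_sum(vect, matrix[i])
    return output

weights = [[0.1, 0.1, -0.3],     # hypothetical: 3 inputs, 2 outputs
           [0.1, 0.2, 0.0]]
inputs = [8.5, 0.65, 1.2]

pred = vect_mat_mul(inputs, weights)
print(pred)
```

With the original len(vect) == len(matrix) check, this non-square case would fail the assertion even though the shapes are valid.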

Chapter 8 examples, why do they all turn label into one_hot_labels

Hi, I don't understand the following code in Chapter 8. Why does it turn the original labels, a (1000,) array, into a (1000,10) two-dimensional array? What is the purpose of doing that?

Can someone shed some light on it? Thanks a lot

images, labels = (x_train[0:1000].reshape(1000,28*28) / 255, y_train[0:1000])

one_hot_labels = np.zeros((len(labels),10))
for i,l in enumerate(labels):
    one_hot_labels[i][l] = 1
labels = one_hot_labels
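A small sketch of what the conversion produces, on three hypothetical labels. A likely reason for it, assuming the network has 10 outputs (one per digit), is that each label row then has the same shape as a prediction row and can be subtracted from it directly.

```python
import numpy as np

labels = np.array([3, 0, 9])                  # hypothetical digit labels

one_hot_labels = np.zeros((len(labels), 10))
for i, l in enumerate(labels):
    one_hot_labels[i][l] = 1

# equivalent vectorized form
vectorized = np.eye(10)[labels]

# a label row lines up with a 10-value prediction, one slot per digit
layer_2 = np.full((1, 10), 0.1)               # hypothetical prediction for one image
layer_2_delta = one_hot_labels[0:1] - layer_2

print(one_hot_labels[0])
print(layer_2_delta.shape)
```

np.argmax on a one-hot row recovers the original integer label.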

Something wrong with the code in chapter 10

I've been reading the book and strictly following the code examples. But I think there's something wrong with the code in chapter 10, when training a model using CNN to recognize the MNIST images. In the last part of the code when updating the weights:

layer_2_delta = (labels[batch_start:batch_end]-layer_2) / (batch_size*layer_2.shape[0])
layer_1_delta = layer_2_delta.dot(weights_1_2.T)*tanh2deriv(layer_1)
layer_1_delta*=dropout_mask
weights_1_2 += alpha*layer_1.T.dot(layer_2_delta)
l1d_reshape = layer_1_delta.reshape(kernel_output.shape)
k_update = flattened_input.T.dot(l1d_reshape)
kernels -= alpha*k_update

I'm a little surprised because, according to what I have learned earlier in the book, the layer_x_deltas should be the negative derivatives of the loss function, so I think the last line should be

kernels += alpha*k_update

After modifying this, I tried it on my own computer. The output:

I:0 Train-Acc: 0.132
I:1 Train-Acc: 0.174
I:2 Train-Acc: 0.191
I:3 Train-Acc: 0.215
I:4 Train-Acc: 0.241
I:5 Train-Acc: 0.249
I:6 Train-Acc: 0.296
I:7 Train-Acc: 0.31
I:8 Train-Acc: 0.37
I:9 Train-Acc: 0.358
I:10 Train-Acc: 0.408
I:11 Train-Acc: 0.438
I:12 Train-Acc: 0.465
I:13 Train-Acc: 0.479
I:14 Train-Acc: 0.528
I:15 Train-Acc: 0.548
I:16 Train-Acc: 0.533
I:17 Train-Acc: 0.569
I:18 Train-Acc: 0.574
I:19 Train-Acc: 0.605
I:20 Train-Acc: 0.605
...

But with the original code, I get:

I:0 Train-Acc: 0.055
I:1 Train-Acc: 0.037
I:2 Train-Acc: 0.037
I:3 Train-Acc: 0.04
I:4 Train-Acc: 0.046
I:5 Train-Acc: 0.068
I:6 Train-Acc: 0.083
I:7 Train-Acc: 0.096
I:8 Train-Acc: 0.127
I:9 Train-Acc: 0.148
I:10 Train-Acc: 0.181
I:11 Train-Acc: 0.209
I:12 Train-Acc: 0.238
I:13 Train-Acc: 0.286
I:14 Train-Acc: 0.274
I:15 Train-Acc: 0.257
I:16 Train-Acc: 0.243
I:17 Train-Acc: 0.112
I:18 Train-Acc: 0.035
I:19 Train-Acc: 0.026
I:20 Train-Acc: 0.022

After the modification, the training accuracy increases much more rapidly than with the original "-=". However, it puzzles me that after 300 iterations, both models reach an accuracy of about 86%. So what's the difference? Does the code have a typo, or have I simply misunderstood it?
I posted a question about this on Stack Overflow. I have not typed the code wrongly. So what's wrong?

Error: Hot and cold learning conditional (Section 4.6)

This is from v12.

if(error > e_dn || error > e_up):
  if (e_dn < e_up):
    weight -= lr
  if (e_up < e_up):
    weight += lr

should be

if(error > e_dn || error > e_up):
  if (e_dn < e_up):
    weight -= lr
  if (e_up < e_dn):
    weight += lr

in the final division of the example code.
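A runnable Python sketch of the corrected conditional, with `or` in place of `||` (which is not valid Python). The starting weight, input, goal, step size, and iteration count are assumptions chosen for illustration.

```python
weight, input_value, goal_pred = 0.5, 0.5, 0.8   # hypothetical starting values
lr = 0.001                                       # hypothetical step size

for _ in range(1500):
    pred = input_value * weight
    error = (pred - goal_pred) ** 2

    # error if the weight were nudged up or down by one step
    e_up = (input_value * (weight + lr) - goal_pred) ** 2
    e_dn = (input_value * (weight - lr) - goal_pred) ** 2

    if error > e_dn or error > e_up:
        if e_dn < e_up:
            weight -= lr
        if e_up < e_dn:
            weight += lr

final_pred = input_value * weight
print(weight, final_pred)
```

With the uncorrected comparison e_up < e_up, the second branch can never run, so the weight could only ever decrease.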

Chapter 6 - layer_1_delta

I believe layer_1_delta should be calculated as below

layer_1_input = np.dot(layer_0,weights_0_1)
layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1_input)

instead of

layer_1 = relu(np.dot(layer_0,weights_0_1))
layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)

even though they return the same value.

Page 128, Ch. 6

Where does .14 come from?

layer_2_delta=(layer_2-walk_stop[0:1])

walk_stop = np.array([[ 1, 1, 0, 0]]).T

Wouldn't this just be -.02 - 1 === -1.02?

small mistake in Chapter 15

In the section on homomorphically encrypted federated learning,
the provided code is as follows:


1. model = Embedding(vocab_size=len(vocab), dim=1)
2. model.weight.data *= 0
3. 
4. # note that in production the n_length should be at least 1024
5. public_key, private_key = phe.generate_paillier_keypair(n_length=128)
6. 
7. def train_and_encrypt(model, input, target, pubkey):
8.     new_model = train(copy.deepcopy(model), input, target, iterations=1)
9. 
10.     encrypted_weights = list()
11.     for val in new_model.weight.data[:,0]:
12.         encrypted_weights.append(public_key.encrypt(val))
13.     ew = np.array(encrypted_weights).reshape(new_model.weight.data.shape)
14. 
15.     return ew
16. 
17. for i in range(3):
18.     print("\nStarting Training Round...")
19.     print("\tStep 1: send the model to Bob")
20.     bob_encrypted_model = train_and_encrypt(copy.deepcopy(model),
21.                                             bob[0], bob[1], public_key)
22. 
23.     print("\n\tStep 2: send the model to Alice")
24.     alice_encrypted_model=train_and_encrypt(copy.deepcopy(model),
25.                                             alice[0],alice[1],public_key)
26. 
27.     print("\n\tStep 3: Send the model to Sue")
28.     sue_encrypted_model = train_and_encrypt(copy.deepcopy(model),
29.                                             sue[0], sue[1], public_key)
30. 
31.     print("\n\tStep 4: Bob, Alice, and Sue send their")
32.     print("\tencrypted models to each other.")
33.     aggregated_model = bob_encrypted_model + \
34.                        alice_encrypted_model + \
35.                        sue_encrypted_model
36. 
37.     print("\n\tStep 5: only the aggregated model")
38.     print("\tis sent back to the model owner who")
39.     print("\t can decrypt it.")
40.     raw_values = list()
41.     for val in sue_encrypted_model.flatten():
42.         raw_values.append(private_key.decrypt(val))
43.     new = np.array(raw_values).reshape(model.weight.data.shape)/3
44.     model.weight.data = new
45. 
46.     print("\t% Correct on Test Set: " + \
47.               str(test(model, test_data, test_target)*100))

I think sue_encrypted_model on line 41 should be aggregated_model.
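A plaintext sketch of the difference this makes. Encryption is left out so that only the averaging logic is visible, and the three weight vectors are hypothetical stand-ins for the locally trained models.

```python
import numpy as np

# plaintext stand-ins for the three locally trained weight vectors (hypothetical values)
bob = np.array([0.3, -0.6])
alice = np.array([0.9, 0.0])
sue = np.array([0.6, 0.3])

aggregated_model = bob + alice + sue

average_of_all = aggregated_model / 3    # what step 5 describes
only_sue = sue / 3                       # what iterating over sue's model alone yields

print(average_of_all)
print(only_sue)
```

Dividing only Sue's weights by 3 discards Bob's and Alice's updates and shrinks Sue's, rather than averaging the three models.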

Repeated iterations

for k in range(batch_size):
    correct_cnt += int(np.argmax(layer_2[k:k+1]) == np.argmax(labels[batch_start+k:batch_start+k+1]))
    layer_2_delta = (labels[batch_start:batch_end]-layer_2)/batch_size
    layer_1_delta = layer_2_delta.dot(weights_1_2.T)* relu2deriv(layer_1)
    layer_1_delta *= dropout_mask
    weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
    weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

In the above code, why are we computing the values of layer_1_delta and layer_2_delta again and again? Shouldn't one iteration suffice? What is the purpose? This is the code in the regularization chapter for MNIST digit classification with mini-batched SGD. I changed some of it:

layer_2_delta = (labels[batch_start:batch_end]-layer_2)/batch_size
layer_1_delta = layer_2_delta.dot(weights_1_2.T)* relu2deriv(layer_1)
weights_1_2 += (batch_size-1)*alpha * layer_1.T.dot(layer_2_delta)
weights_0_1 += (batch_size-1)*alpha * layer_0.T.dot(layer_1_delta)
layer_1_delta *= dropout_mask
for k in range(batch_size):
    correct_cnt += int(np.argmax(layer_2[k:k+1]) == np.argmax(labels[batch_start+k:batch_start+k+1]))

This seems much faster and reaches the same benchmarks.

Just a suggestion

Hi,
Excellent stuff. I just wanted to point out that you used "input" as a variable name. It is actually a built-in name that performs an action: it lets you get input from a user.
I know in this case you're not likely to use it elsewhere, and you can override built-in names, but it's not advised.
You can usually tell a name is built in because it changes color; in this color scheme it turns green.

chapter 6

In Chapter 6, in the topic Creating a matrix or two in python, there is a typo.
The mistake:
error = (goal_prediction - prediction) ** 2
What it should be:
error = (prediction - goal_prediction) ** 2
We can't see any difference in the output because the error is squared.

Learning the whole Dataset! topic also has the same mistake

Chapter 13. P.244: Why the backprop is different for "mul"?

Why was it necessary to define "new" and call it once after defining it?
I think the Tensor data in this case is multiplied by "other". But why is this different from "add"?

NameError: name 'weight_deltas' is not defined in Chapter5.ipynb "Gradient Descent Learning with Multiple Inputs"

Hello

I cloned the Jupyter files this evening and when running through them I came across an error in the "Gradient Descent Learning with Multiple Inputs" script in Chapter5.ipynb
https://github.com/iamtrask/Grokking-Deep-Learning/blob/master/Chapter5.ipynb

NameError: name 'weight_deltas' is not defined

I believe the following line is currently missing from the script:

weight_deltas = ele_mul(delta, input)

With this line added I got a result:

Weights:[0.1119, 0.20091, -0.09832]
Weight Deltas:[-1.189999999999999, -0.09099999999999994, -0.16799999999999987]
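A self-contained sketch of the snippet with the missing line in place. The starting weights, input values, target, and alpha are assumptions chosen because they reproduce the output quoted above; the helper functions are written out so the block runs on its own.

```python
def w_sum(a, b):
    assert len(a) == len(b)
    return sum(a[i] * b[i] for i in range(len(a)))

def ele_mul(number, vector):
    return [number * v for v in vector]

def neural_network(input, weights):
    return w_sum(input, weights)

# starting values are assumptions chosen to reproduce the quoted output
weights = [0.1, 0.2, -0.1]
input = [8.5, 0.65, 1.2]
true = 1
alpha = 0.01

pred = neural_network(input, weights)
delta = pred - true
weight_deltas = ele_mul(delta, input)    # the line reported as missing

for i in range(len(weights)):
    weights[i] -= alpha * weight_deltas[i]

print("Weights:" + str(weights))
print("Weight Deltas:" + str(weight_deltas))
```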

Thanks.

Please add a LICENSE + A Rust Implementation

Please see https://github.com/suyash/grokking-deep-learning-rs

I have implemented all the exercises in Rust. Rust is a new programming language with a focus on safety, speed and concurrency, while having a minimal runtime and zero garbage collection.

In the Rust community, it is common to dual license code with Apache 2 and MIT. I have done the same. Hopefully that's okay.