I'm Anh, from 🇻🇳 and I'm a 👨💻 who likes building systems with DDD.
Currently interested in System Design 👷 🏗️ and Databases.
W[l], b[l] are the parameters for layer (l).
A[0]: the training examples are stacked column by column. The same goes for Z[2] (z[2](1), z[2](2), ...).
Ref: https://www.youtube.com/watch?v=a8i2eJin0lY&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=37
With Python broadcasting, (n[1], 1) is expanded to (n[1], m).
- m: number of training examples.
Ref: https://www.youtube.com/watch?v=yslMo3hSbqE&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=38
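A minimal numpy sketch of the two notes above (the layer sizes n0, n1 and all values are made up for illustration): stacking the m examples as columns of A0 lets one matrix product compute every z[1](i) at once, and adding b1 of shape (n1, 1) broadcasts it to (n1, m).

import numpy as np

n0, n1, m = 3, 4, 5              # made-up sizes: input units, hidden units, examples
A0 = np.random.randn(n0, m)      # training examples stacked column by column
W1 = np.random.randn(n1, n0)
b1 = np.zeros((n1, 1))           # shape (n1, 1); broadcasting expands it to (n1, m)

Z1 = W1 @ A0 + b1                # one matrix product handles all m examples
print(Z1.shape)                  # (4, 5) == (n1, m)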
The keep_prob value should depend on the size of the weight matrix, so we use different keep_probs for different layers.
Dropout is commonly used in Computer Vision because the input size is big.
Remember that "Dropout" is a regularization technique to prevent over-fitting.
A downside of Dropout is that the cost function J is no longer well defined.
Ref: https://www.youtube.com/watch?v=ARq74QuavAo&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=7
Stack all the examples horizontally. When we add a real number to a vector, Python will expand the real number into a vector automatically (broadcasting).
Ref: https://www.youtube.com/watch?v=okpqeEUdEkY&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=19
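A tiny sketch of that scalar broadcasting (the values are made up):

import numpy as np

v = np.array([1.0, 2.0, 3.0])
print(v + 100)        # broadcasting expands 100 into [100, 100, 100] -> [101. 102. 103.]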
n[0] input units, n[1] hidden units, n[2] output units.
keepdims=True in np.sum keeps the summed dimension, so the result stays a proper column vector (n, 1) instead of a rank-1 array (n,).
Ref: https://www.youtube.com/watch?v=7bLEWDZng_M&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=33
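A small sketch of the keepdims behaviour (shapes are made up): without keepdims=True the sum over axis=1 collapses to a rank-1 array; with it, the result stays a column vector, which is the shape backprop expects for db.

import numpy as np

dZ = np.random.randn(4, 5)                        # shape (n, m), made up
db_bad  = np.sum(dZ, axis=1) / 5                  # shape (4,)  -- rank-1 array
db_good = np.sum(dZ, axis=1, keepdims=True) / 5   # shape (4, 1) -- column vector
print(db_bad.shape, db_good.shape)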
If we initialize the parameters with 0, the hidden units will all make the same contribution to the output layer: the hidden units stay symmetric.
Why do we choose 0.01 instead of 100 or 1000? Because z = Wx + b: if W is large, z will be large. If we use tanh as the activation function, the output a will land on the flat part of the tanh curve, so gradient descent (learning) will be slow.
Ref: https://www.youtube.com/watch?v=yXcQ4B-YSjQ&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=35
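A minimal sketch of the initializations the note compares (layer sizes are made up): zeros keep the hidden units symmetric, while small random values scaled by 0.01 break symmetry and keep z small so tanh does not saturate.

import numpy as np

n0, n1 = 3, 4                                  # made-up layer sizes
W1_zero  = np.zeros((n1, n0))                  # symmetric: every hidden unit computes the same thing
W1_small = np.random.randn(n1, n0) * 0.01      # breaks symmetry, keeps z = Wx + b small
W1_big   = np.random.randn(n1, n0) * 100       # large W -> large z -> tanh saturates, slow learning
b1 = np.zeros((n1, 1))                         # b can safely start at zero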
Use Gradient Descent to find the pair (w, b) for Logistic Regression that makes the value of the Cost function J(w, b) smallest.
J(w, b) is a convex function.
We first initialize (w, b) to some value, say (w0, b0). Using gradient descent, we step down gradually towards the minimum of J(w, b).
Alpha is the learning rate: it controls the step size of gradient descent.
dJ(w, b)/dw: the derivative; it drives the update to w and is also the slope of the tangent to the graph of J(w, b).
Ref: https://www.youtube.com/watch?v=uJryes5Vk1o&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=10
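A minimal sketch of that gradient-descent loop for logistic regression (the toy data and the number of iterations are made up): each step moves (w, b) downhill on the convex cost J(w, b) by alpha times the derivatives.

import numpy as np

X = np.array([[0.5, 1.0, 1.5, 2.0],
              [1.0, 0.0, 1.0, 0.0]])          # made-up data, shape (n, m)
Y = np.array([[0, 0, 1, 1]])                  # labels, shape (1, m)
w = np.zeros((2, 1)); b = 0.0
alpha = 0.1                                   # learning rate: step size of gradient descent
m = X.shape[1]

for _ in range(1000):
    A = 1 / (1 + np.exp(-(w.T @ X + b)))      # forward: y^ = sigmoid(wTx + b)
    dZ = A - Y                                # derivative of the cost w.r.t. z
    dw = X @ dZ.T / m                         # dJ/dw, same shape as w
    db = np.sum(dZ) / m                       # dJ/db
    w -= alpha * dw                           # step down the convex cost surface
    b -= alpha * db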
wTx + b may be larger than 1 or smaller than 0, so we use the sigmoid function to squash the output into the range [0, 1].
Ref: https://www.youtube.com/watch?v=hjrYrynGWGA&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=8
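A tiny sketch of that squashing (the z values are made up): wTx + b is unbounded, and sigmoid maps it into (0, 1) so it can be read as a probability.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([-10.0, 0.0, 3.7, 50.0])   # wTx + b can be far outside [0, 1]
print(sigmoid(z))                        # every output lies in (0, 1)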
Using the tanh function makes training easier because tanh has mean 0 (you can center your data around 0 instead of 0.5). But for Binary Classification you can use the sigmoid function for the output layer (to calculate y^ = {0, 1}).
However, the slopes of both tanh and sigmoid become small when the input is too large or too small.
We also have the ReLU function: a = max(0, z).
Conclusion: ReLU is the usual default for hidden layers; sigmoid is mainly kept for the binary output layer.
Ref: https://www.youtube.com/watch?v=Xvg00QnyaIY&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=30
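A short sketch of the activation functions compared above, together with their standard derivatives; the small-slope problem of sigmoid and tanh for large |z| is visible in the derivative formulas.

import numpy as np

def sigmoid(z):               # output layer for binary classification
    return 1 / (1 + np.exp(-z))

def tanh(z):                  # zero-centred, usually better than sigmoid for hidden layers
    return np.tanh(z)

def relu(z):                  # a = max(0, z): slope does not vanish for large positive z
    return np.maximum(0, z)

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)        # small when |z| is large -> slow learning

def tanh_prime(z):
    return 1 - np.tanh(z) ** 2   # also small when |z| is large

def relu_prime(z):
    return (z > 0).astype(float)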
g'(z): the derivative of the activation function g(z).
Ref: https://www.youtube.com/watch?v=y1xoI7mBtOc&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=12
A quantity and its gradient (e.g. w and dw) always have the same dimensions.
Ref: https://www.youtube.com/watch?v=yXcQ4B-YSjQ&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=34
n: number of input features. With the ReLU activation function, use 2/n as the variance of the weights. Initializing W neither much larger nor much smaller than 1 helps prevent the gradients from exploding or vanishing too quickly.
Ref: https://www.youtube.com/watch?v=s2coXdufOzE&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=11
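A minimal sketch of that initialization for a ReLU layer (the layer sizes are made up): scale the random weights by sqrt(2/n) so their variance is 2/n.

import numpy as np

n_prev, n_curr = 256, 128                                     # made-up layer sizes
W = np.random.randn(n_curr, n_prev) * np.sqrt(2.0 / n_prev)   # variance 2/n for ReLU layers
b = np.zeros((n_curr, 1))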
If dθapprox is far from dθ, check which components i have dθapprox[i] far from dθ[i]; for example, you may find that the computation of db[i] is wrong.
Ref: https://www.youtube.com/watch?v=4Ct3Yujl1dk&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=14
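A rough sketch of that gradient check for a single scalar parameter; `cost` and the analytic gradient `dtheta` are placeholders you would take from your own cost function and backprop.

import numpy as np

def grad_check(cost, theta, dtheta, eps=1e-7):
    """Compare the analytic gradient dtheta with a two-sided numerical estimate."""
    dtheta_approx = (cost(theta + eps) - cost(theta - eps)) / (2 * eps)
    diff = abs(dtheta_approx - dtheta) / (abs(dtheta_approx) + abs(dtheta))
    return dtheta_approx, diff        # diff around 1e-7 is fine; 1e-3 or worse suggests a bug

# toy example: J(theta) = theta**2, so dJ/dtheta = 2*theta
approx, diff = grad_check(lambda t: t ** 2, theta=3.0, dtheta=6.0)
print(approx, diff)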
Ex: in face recognition, the first layer does feature detection / edge detection by looking at very small groups of pixels in the picture. In the second layer we group the edges together to get the parts of the face (nose, chin, eyes, ...). In the third layer, by grouping the parts together, we get the complete face.
Ref: https://www.youtube.com/watch?v=5dWp1mw_XNk&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=39
Lambda: Regularization parameter.
Ref: https://www.youtube.com/watch?v=6g0t3Phly2M&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=4
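A minimal sketch of how lambda enters the cost (the helper name and the numbers are made up): the L2 penalty (lambda / (2m)) * sum of the squared norms of the W[l] matrices is added to the cross-entropy cost, so a larger lambda pushes the weights towards 0.

import numpy as np

def l2_regularized_cost(cross_entropy_cost, weights, lambd, m):
    """Add the L2 penalty (lambda / (2m)) * sum(||W[l]||^2) to the unregularized cost."""
    l2 = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost + l2

# made-up example: two weight matrices, unregularized cost 0.3
W1, W2 = np.random.randn(4, 3), np.random.randn(1, 4)
print(l2_regularized_cost(0.3, [W1, W2], lambd=0.7, m=100))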
If beta is larger, the weight given to θ(t) is smaller, so the average adapts slowly to changes in the temperature. With a smaller beta we average over a smaller window: there is a lot more noise, but it adapts quickly to temperature changes.
Ref: youtube.com/watch?v=lAq96T8FkTw&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=17
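A minimal sketch of the exponentially weighted average on a list of daily temperatures (the data is made up, and bias correction is left out): a larger beta averages over a wider window and adapts more slowly.

def ewma(thetas, beta):
    """Exponentially weighted average: v_t = beta * v_{t-1} + (1 - beta) * theta_t."""
    v = 0.0
    out = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return out

temps = [10, 12, 11, 14, 20, 21, 19, 23]   # made-up daily temperatures
print(ewma(temps, beta=0.9))               # smooth, adapts slowly
print(ewma(temps, beta=0.5))               # noisier, adapts quickly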
da[L]: the derivative of the Cost function with respect to a[L].
Ref: https://www.youtube.com/watch?v=qzPQ8cEsVK8&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=41
With Dropout we go through each layer and set a probability of eliminating each node in the NN. For example, with a coin toss, the probability of keeping / removing a node is 0.5 / 0.5.
d3: dropout vector for layer 3
Shutting off 10% of the nodes corresponds to keep_prob = 0.9. With nodes shut off, a[3] is reduced, so to keep z[4] unchanged in expectation we must divide a[3] by keep_prob. This inverted-dropout technique makes sure the expected value of a[3] does not change, and it makes test time easier because it avoids a scaling problem.
At test time, we don't use dropout.
Ref: https://www.youtube.com/watch?v=D8PJAL-MZv8&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=6
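A minimal sketch of inverted dropout for layer 3 as described above (the shapes are made up): dividing by keep_prob keeps the expected value of a3, and therefore z4, unchanged; at test time no mask or scaling is applied.

import numpy as np

keep_prob = 0.9                              # keep ~90% of the units in layer 3 (10% shut off)
a3 = np.random.randn(5, 10)                  # made-up activations for layer 3, shape (n3, m)
d3 = np.random.rand(*a3.shape) < keep_prob   # dropout mask: True with probability keep_prob
a3 = a3 * d3                                 # shut off ~10% of the units
a3 = a3 / keep_prob                          # inverted dropout: E[a3] stays unchanged
# at test time: use a3 directly, no dropout mask, no scaling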
If you stack up the inputs as columns, the outputs will be stacked up as columns too.
Ref: https://www.youtube.com/watch?v=kkWRbIb42Ms&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=29
Computation graph organizes computation from left to right (the blue arrows).
Ref: https://www.youtube.com/watch?v=hCP1vGoCdYU&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=13
If x1 ranges over 1...1000 and x2 over 0...1, W1 and W2 will have very different values. After normalizing the inputs, the cost surface is more symmetric, so gradient descent optimizes faster and does not need an overly small learning rate.
Ref: https://www.youtube.com/watch?v=FDCfw-YqWTE&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=9
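A minimal sketch of normalizing the inputs (subtract the mean, divide by the standard deviation), using made-up feature ranges like the x1/x2 example above; the same mu and sigma must be reused on the test set.

import numpy as np

X = np.random.rand(2, 100)
X[0] *= 1000                               # x1 roughly in 0...1000, x2 in 0...1 (made up)

mu = np.mean(X, axis=1, keepdims=True)     # per-feature mean, shape (2, 1)
sigma = np.std(X, axis=1, keepdims=True)   # per-feature std, shape (2, 1)
X_norm = (X - mu) / sigma                  # both features now on a comparable scale
# reuse the same mu and sigma to normalize the test set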
Don't count the input layer when counting the layers of a Neural Network.
Ref: https://www.youtube.com/watch?v=CcRkHl75Z-Y&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=26
Batch gradient descent: process the entire training set all at the same time.
Use vectorization to calculate the gradients for all of the examples in each mini-batch.
Ref: https://www.youtube.com/watch?v=4qJaSmvhxi8&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=15
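A rough sketch of the mini-batch loop; the `forward_backward` and `update` callables are placeholders for your own vectorized forward/backward pass and parameter update.

import numpy as np

def minibatch_epoch(X, Y, batch_size, forward_backward, update):
    """One epoch of mini-batch gradient descent; X is (n, m) with examples as columns."""
    m = X.shape[1]
    perm = np.random.permutation(m)               # shuffle the examples each epoch
    X, Y = X[:, perm], Y[:, perm]
    for t in range(0, m, batch_size):
        X_t = X[:, t:t + batch_size]              # mini-batch X{t}
        Y_t = Y[:, t:t + batch_size]              # mini-batch Y{t}
        grads, cost = forward_backward(X_t, Y_t)  # vectorized over the whole mini-batch
        update(grads)
    return cost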
[i] !== (i): [i] is the i-th layer in the neural network, while (i), as in x(i), is the i-th training example.
da, dz are used for the backward calculation.
Ref: https://www.youtube.com/watch?v=fXOsFF95ifk&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=25
Sum of array elements along a given axis:
np.sum([[0, 1], [0, 5]], axis=0) # axis = 0: vertically --> [0, 6]
np.sum([[0, 1], [0, 5]], axis=1) # axis = 1: horizontally --> [1, 5]
np.sum([[0, 1], [0, 5]]) # Not specify axis: sum of all elements in array. --> 6
Image Ref: https://qiita.com/Phoeboooo/items/b464b7df3c64a33caf94
If lambda is set too large, w[l] ~ 0 and our NN becomes closer to logistic regression (we move from high variance towards high bias).
With the tanh function, if Z is small enough, the graph is roughly linear.
When debugging gradient descent: plot the graph of cost function J against the number of iterations.
Ref: https://www.youtube.com/watch?v=NyG-7nRpsW8&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=5
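A tiny sketch of that debugging plot, assuming you record one cost value per iteration in a list (the numbers below are made up): J should decrease on every iteration of batch gradient descent.

import matplotlib.pyplot as plt

costs = [0.69, 0.52, 0.41, 0.35, 0.31, 0.29]   # made-up J values, one per iteration
plt.plot(costs)
plt.xlabel("iteration")
plt.ylabel("cost J")
plt.show()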
[x(i), y(i)]: the i-th (example, label) pair.
The value of y^ lies in [0, 1].
The Loss function is used for a single example. The Cost function evaluates over the whole set of examples. We try to find the pair (w, b) that minimizes the cost function.
Ref: https://www.youtube.com/watch?v=SHEPb1JHw5o&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=9
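A minimal sketch of that distinction using the standard cross-entropy formulas for logistic regression (the predictions and labels are made up): the loss is computed per example, the cost averages the loss over all m examples.

import numpy as np

def loss(y_hat, y):
    """Loss for a single example: -(y*log(y_hat) + (1-y)*log(1-y_hat))."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(Y_hat, Y):
    """Cost over the whole training set: average of the per-example losses."""
    m = Y.shape[1]
    return np.sum(loss(Y_hat, Y)) / m

Y_hat = np.array([[0.9, 0.2, 0.7]])   # made-up predictions in (0, 1)
Y     = np.array([[1,   0,   1  ]])
print(cost(Y_hat, Y))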
High bias: simple classifier.
High variance: complicated classifier.
Look at the training set error to check for a "high bias" problem. Look at the dev set error to check for a "high variance" problem. Sometimes the Bayes error may itself be high (blurry images, ...).
The purple line is both "high bias" and "high variance": the "linear part" is under-fitting (high bias), while the curvy part is over-fitting (high variance).
Ref: https://www.youtube.com/watch?v=SjQyLhQIXSM&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=2
You can use a linear activation function in the case of REGRESSION (e.g. predicting house prices).
Ref: https://www.youtube.com/watch?v=NkOv_k7r6no&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=31
When training for a long time, w will get closer to 0 if initialized with np.random.seed(0).
Orthogonalization: one task at a time (optimize w, b to minimize J, or avoid over-fitting).
Ref: https://www.youtube.com/watch?v=BOCLq2gpcGU&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=8
Instead of using two for loops, we use just one for loop.
Ref: https://www.youtube.com/watch?v=qsIrQi0fzbY&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=17
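A rough sketch of what that looks like for the logistic-regression gradients (the sizes and data are made up): the inner loop over the n features is replaced by vector operations, leaving only one loop over the m examples.

import numpy as np

n, m = 2, 4                                   # made-up sizes: features, examples
X = np.random.randn(n, m)
Y = np.array([[0, 1, 1, 0]])
w = np.zeros((n, 1)); b = 0.0

dw = np.zeros((n, 1)); db = 0.0
for i in range(m):                            # the single remaining loop, over the examples
    z_i = (w.T @ X[:, i:i+1]).item() + b
    a_i = 1 / (1 + np.exp(-z_i))
    dz_i = a_i - Y[0, i]
    dw += X[:, i:i+1] * dz_i                  # vectorized over the n features: no inner loop
    db += dz_i
dw /= m; db /= m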
The reason we see some noise in the graph of cost J with mini-batch gradient descent is that, for example, X{1}, Y{1} may be an easy mini-batch while X{2}, Y{2} is harder (e.g. some missing or wrong labels in Y{2} make the cost higher).
Stochastic gradient descent converges slowly (and loses the speed-up from vectorization).
A mini-batch size from 64 to 512 (a power of 2) is most common.
Ref: https://www.youtube.com/watch?v=-_4Zi8fCZO4&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=16
axis = 0 means you want to calculate vertically, axis = 1 means horizontally.
Ref: https://www.youtube.com/watch?v=tKcLaGdvabM&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=21
We now have tools that help us reduce "bias" without hurting "variance", and vice versa.
Ref: https://www.youtube.com/watch?v=C1N_PDHuJ6Q&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=3
We modify the parameters w1, w2, b to reduce L(a, y): the loss value.
Ref: https://www.youtube.com/watch?v=z_xiwjEdAC4&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=15
It's impossible to guess the best hyper-parameters the first time you train your model.
Use the "dev" set to test algorithms on and see which algorithm works better. With a small dataset a ratio like 60/20/20 is good, but with a dataset of millions of records the ratio 98/1/1 is better.
Pictures uploaded from the app have worse resolution than pictures from the internet.
Ref: https://www.youtube.com/watch?v=1waHlpKiNyY&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc
Hyper-parameters control the parameters.
Ref: https://www.youtube.com/watch?v=VTE2KlfoO3Q&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=42
In a deep NN, L is large, so 1.5^L will be very large too. If we change the factor to 0.5, then with large L, 0.5^L will be close to 0.
I: identity matrix. Vanishing/exploding gradients make it take a long time for gradient descent to learn anything, so training becomes very difficult.
Ref: https://www.youtube.com/watch?v=FDCfw-YqWTE&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=10
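A tiny numeric sketch of the point above, with a made-up depth of L = 50 layers: a per-layer factor of 1.5 explodes while 0.5 vanishes.

L = 50                      # made-up network depth
print(1.5 ** L)             # ~6.4e8: activations/gradients explode
print(0.5 ** L)             # ~8.9e-16: activations/gradients vanish towards 0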