deeplearning.ai's Issues
Parameters vs Hyperparameters
Hyperparameters (learning rate alpha, number of iterations, number of hidden layers and hidden units, choice of activation function, ...) control the values the parameters W and b end up with.
Ref: https://www.youtube.com/watch?v=VTE2KlfoO3Q&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=42
Understanding Mini-Batch Gradient Descent
The cost J is noisy in mini-batch gradient descent because every mini-batch is different: X{1}, Y{1} may happen to be an easy mini-batch, while X{2}, Y{2} is harder (for example it contains mislabeled examples), so its cost is higher.
Stochastic gradient descent (mini-batch size 1) never quite converges; it keeps oscillating around the minimum, and you also lose the speed-up from vectorization.
Typical mini-batch sizes are powers of 2 from 64 to 512.
Ref: https://www.youtube.com/watch?v=-_4Zi8fCZO4&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=16
Why Deep Representations?
Ex: in face recognition, the first layer does simple feature detection, such as edge detection, by looking at very small groups of pixels in the picture. The second layer groups edges together into parts of a face (nose, chin, eye, ...). The third layer groups those parts together to recognize complete faces.
Ref: https://www.youtube.com/watch?v=5dWp1mw_XNk&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=39
Gradient Descent For Neural Networks
n[0] input units, n[1] hidden units, n[2] output units.
keepdims=True in np.sum keeps the result as an (n, 1) column vector by convention, NOT a rank-1 array of shape (n,).
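A minimal numpy sketch of the difference (the shapes (4, 10) are illustrative, not from the lecture):
import numpy as np
dZ = np.random.randn(4, 10)                        # illustrative dZ of shape (n, m) = (4, 10)
db_bad = (1 / 10) * np.sum(dZ, axis=1)             # shape (4,)  -> rank-1 array
db = (1 / 10) * np.sum(dZ, axis=1, keepdims=True)  # shape (4, 1) -> proper column vector
print(db_bad.shape, db.shape)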
Ref: https://www.youtube.com/watch?v=7bLEWDZng_M&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=33
Why Regularization Reduces Overfitting
If lambda is set too large, w[l] is pushed toward 0 and the NN behaves closer to logistic regression (we move from high variance toward high bias).
With the tanh activation, if z stays small enough the function is roughly linear, so each layer computes something close to a linear function and cannot fit very complicated boundaries.
When debugging gradient descent, plot the cost function J (including the regularization term) against the number of iterations.
Ref: https://www.youtube.com/watch?v=NyG-7nRpsW8&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=5
Deep L-Layer Neural Network
Vectorizing Logistic Regression's Gradient Computation
Gradient Descent on m Examples
Derivatives Of Activation Functions
Backpropagation Intuition
A quantity and its derivative, e.g. w and dw, always have the same dimensions.
Ref: https://www.youtube.com/watch?v=yXcQ4B-YSjQ&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=34
Derivatives with a Computation Graph
Regularization
Lambda: the regularization parameter (in code use lambd, since lambda is a reserved keyword in Python).
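A minimal numpy sketch of the L2-regularized cost (the function name and the list-of-weights argument are assumptions for illustration):
import numpy as np
def l2_cost(A_last, Y, weights, lambd, m):
    # cross-entropy part of the cost
    cross_entropy = -np.sum(Y * np.log(A_last) + (1 - Y) * np.log(1 - A_last)) / m
    # L2 penalty: (lambd / 2m) * sum of squared Frobenius norms of every W[l]
    l2_penalty = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy + l2_penalty
# in backprop every dW[l] then picks up an extra (lambd / m) * W[l] term ("weight decay")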
Ref: https://www.youtube.com/watch?v=6g0t3Phly2M&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=4
Logistic Regression
wTx + b may be larger than 1 or smaller than 0, so we apply the sigmoid function to squash the output into [0, 1]: y^ = sigmoid(wTx + b).
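A minimal numpy sketch (the input x and parameters w, b are illustrative):
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
x = np.random.randn(3, 1)            # illustrative input with 3 features
w = np.random.randn(3, 1)
b = 0.0
y_hat = sigmoid(np.dot(w.T, x) + b)  # always lands in (0, 1)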
Ref: https://www.youtube.com/watch?v=hjrYrynGWGA&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=8
Other Regularization Methods
Early stopping: with a small random initialization w starts close to 0 and grows the longer we train, so stopping early leaves a mid-size ||w||, which has a regularizing effect similar to L2.
Orthogonalization: work on one task at a time (either optimize w, b to minimize J, or prevent over-fitting); early stopping couples these two tasks, which is its main downside.
Ref: https://www.youtube.com/watch?v=BOCLq2gpcGU&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=8
Logistic Regression Cost Function
(x(i), y(i)): the i-th example/label pair.
The value of y^ lies in [0, 1].
The loss function is defined on a single example; the cost function evaluates the parameters over the whole training set (the average of the losses). We try to find the pair (w, b) that minimizes the cost function.
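A minimal numpy sketch of the cross-entropy loss and the cost (A and Y of shape (1, m) are assumptions for illustration):
import numpy as np
def loss(y_hat, y):
    # loss for a single example
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
def cost(A, Y):
    # cost = average of the loss over all m examples
    m = Y.shape[1]
    return np.sum(loss(A, Y)) / m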
Ref: https://www.youtube.com/watch?v=SHEPb1JHw5o&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=9
Getting Matrix Dimensions Right
With Python broadcasting, b[1] of shape (n[1], 1) is expanded to (n[1], m) when it is added to W[1]A[0].
- m: number of the training examples.
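A minimal numpy sketch of the dimension rules (the layer sizes n = [2, 3, 1] and m = 5 are illustrative):
import numpy as np
n = [2, 3, 1]                      # n[0] input features, n[1] hidden units, n[2] output units
m = 5                              # number of training examples
X = np.random.randn(n[0], m)
W1 = np.random.randn(n[1], n[0])   # W[l]: (n[l], n[l-1])
b1 = np.zeros((n[1], 1))           # b[l]: (n[l], 1)
Z1 = np.dot(W1, X) + b1            # b1 broadcasts from (n[1], 1) to (n[1], m)
assert Z1.shape == (n[1], m)       # Z[l], A[l]: (n[l], m)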
Ref: https://www.youtube.com/watch?v=yslMo3hSbqE&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=38
Gradient Descent
Use gradient descent to find the pair (w, b) for logistic regression that makes the cost function J(w, b) as small as possible.
J(w, b) is a convex function.
We start by initializing (w, b), say to (w0, b0). Using gradient descent, we step downhill toward the minimum of J(w, b).
Alpha is the learning rate: it controls how big each gradient-descent step is.
dJ(w, b)/dw: the derivative, which tells us how to update w; it is also the slope of the tangent to the graph of J(w, b).
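A minimal sketch of the update rule (the parameter shapes and the gradients dw, db are illustrative, assumed computed elsewhere):
import numpy as np
alpha = 0.01                           # learning rate (assumed value)
w, b = np.zeros((3, 1)), 0.0           # illustrative parameters
dw, db = np.random.randn(3, 1), 0.1    # gradients of J, assumed computed by backprop
w = w - alpha * dw                     # w := w - alpha * dJ(w, b)/dw
b = b - alpha * db                     # b := b - alpha * dJ(w, b)/db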
Ref: https://www.youtube.com/watch?v=uJryes5Vk1o&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=10
Gradient Checking Implementation Notes
If dθapprox is far from dθ, check which components i have dθapprox[i] far from dθ[i]; for example, you may find the mismatch is concentrated in the db[i] components, which tells you the db computation has a bug.
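A minimal sketch of the check itself (dtheta_approx and dtheta are assumed to be the gradients flattened into vectors):
import numpy as np
def grad_check_distance(dtheta_approx, dtheta):
    # relative distance between the two gradient vectors;
    # around 1e-7 is great, around 1e-3 should make you worry
    num = np.linalg.norm(dtheta_approx - dtheta)
    return num / (np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta))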
Ref: https://www.youtube.com/watch?v=4Ct3Yujl1dk&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=14
Building Blocks of a Deep Neural Network
Forward Propagation in a Deep Network
W[l], b[l] are the parameters of layer l.
A[0] = X: the training examples are stacked column by column. The same holds for Z[l] and A[l], whose columns are z[l](1), z[l](2), ... for each example.
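A minimal sketch of one forward step (layer sizes are illustrative; ReLU as the hidden activation is an assumption):
import numpy as np
def relu(z):
    return np.maximum(0, z)
A0 = np.random.randn(3, 5)   # X: 3 features, 5 examples stacked column by column
W1 = np.random.randn(4, 3)
b1 = np.zeros((4, 1))
Z1 = np.dot(W1, A0) + b1     # Z[1] = W[1] A[0] + b[1], one column z[1](i) per example
A1 = relu(Z1)                # A[1] = g[1](Z[1])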
Ref: https://www.youtube.com/watch?v=a8i2eJin0lY&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=37
Random Initialization
If we initialize the parameters W with 0, all hidden units compute the same function and have the same influence on the output layer; they stay symmetric after every update.
Why do we multiply by 0.01 instead of 100 or 1000? Because z = Wx + b: if W is large, z will be large, and with tanh (or sigmoid) as the activation function the output a lands on the flat part of the curve, so gradient descent, and therefore learning, will be slow.
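A minimal sketch for one hidden layer (layer sizes are illustrative):
import numpy as np
n_x, n_h = 2, 4                        # illustrative layer sizes
W1 = np.random.randn(n_h, n_x) * 0.01  # small random values break the symmetry
b1 = np.zeros((n_h, 1))                # b can start at 0; it does not cause symmetry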
Ref: https://www.youtube.com/watch?v=yXcQ4B-YSjQ&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=35
Numerical Approximations of Gradients
g(θ): the approximation of the derivative f'(θ). The two-sided difference (f(θ + ε) − f(θ − ε)) / (2ε) approximates the derivative with an error of order ε², much more accurately than the one-sided difference.
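A minimal sketch with f(θ) = θ³, whose true derivative is 3θ² (the values θ = 1, ε = 0.01 follow the lecture example):
f = lambda theta: theta ** 3
theta, eps = 1.0, 0.01
approx = (f(theta + eps) - f(theta - eps)) / (2 * eps)  # two-sided difference
print(approx, 3 * theta ** 2)                           # ~3.0001 vs the true derivative 3.0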
Ref: https://www.youtube.com/watch?v=y1xoI7mBtOc&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=12
Weight Initialization in a Deep Network
n: the number of input features to the layer. With the ReLU activation function, set Var(w) = 2/n (multiply np.random.randn by np.sqrt(2/n)). Initializing W neither much larger nor much smaller than 1 helps prevent the gradients from vanishing or exploding too quickly.
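A minimal sketch (He initialization for a ReLU layer; the layer sizes are illustrative; for tanh, np.sqrt(1/n_prev) is the common choice):
import numpy as np
n_prev, n_l = 100, 50                                    # illustrative layer sizes
W = np.random.randn(n_l, n_prev) * np.sqrt(2 / n_prev)   # Var(w) = 2/n for ReLU layers
b = np.zeros((n_l, 1))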
Ref: https://www.youtube.com/watch?v=s2coXdufOzE&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=11
Bias/Variance
High bias: simple classifier.
High variance: complicated classifier.
Look at the training set error to check for a "high bias" problem and at the dev set error to check for a "high variance" problem. Sometimes the Bayes error itself is high (blurry images, ...), so compare against it.
The line in purple has both "high bias" and "high variance": the mostly linear part is under-fitting (high bias) while the curved part is over-fitting a few examples (high variance).
Ref: https://www.youtube.com/watch?v=SjQyLhQIXSM&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=2
Vectorization
Instead of using two explicit for loops (over the examples and over the features), we keep at most one for loop and let vectorized numpy operations do the rest.
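A minimal sketch of the speed difference between the vectorized and looped versions (the array size is illustrative):
import time
import numpy as np
a, b = np.random.rand(1000000), np.random.rand(1000000)
tic = time.time()
c = np.dot(a, b)                   # vectorized version
print("vectorized:", time.time() - tic)
tic = time.time()
c = 0
for i in range(1000000):           # explicit for-loop version, orders of magnitude slower
    c += a[i] * b[i]
print("for loop:", time.time() - tic)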
Ref: https://www.youtube.com/watch?v=qsIrQi0fzbY&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=17
Basic Recipe for Machine Learning
We now have tools that reduce "bias" without hurting "variance" (a bigger network) and tools that reduce "variance" without hurting "bias" (more data, regularization), so the trade-off is less strict than it used to be.
Ref: https://www.youtube.com/watch?v=C1N_PDHuJ6Q&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=3
Train/Dev/Test Sets
It's almost impossible to guess the best hyperparameters the first time you train your model.
Use the "dev" set to test algorithms on and see which one works better. With a small dataset a ratio like 60/20/20 is good, but with a dataset of millions of records a 98/1/1 split is better.
Pictures uploaded from the app may have lower resolution than pictures from the internet, so make sure the dev and test sets come from the same distribution.
Ref: https://www.youtube.com/watch?v=1waHlpKiNyY&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc
Why Non-linear Activation Functions
You can use a linear activation function in the output layer in the case of REGRESSION (predicting the house price); in the hidden layers it would make the whole network equivalent to a linear model.
Ref: https://www.youtube.com/watch?v=NkOv_k7r6no&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=31
Forward and Backward Propagation
da[L]: the derivative of the cost function with respect to the final activation; for the cross-entropy loss, da[L] = -(y/a) + (1 - y)/(1 - a).
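A minimal sketch of the linear part of one backward step (the function name and arguments are assumptions; dZ is assumed already computed from dA and g'(Z)):
import numpy as np
def linear_backward(dZ, A_prev, W):
    # given dZ[l], return the gradients of layer l and dA[l-1] to pass backwards
    m = A_prev.shape[1]
    dW = np.dot(dZ, A_prev.T) / m               # dW[l] = (1/m) dZ[l] A[l-1].T
    db = np.sum(dZ, axis=1, keepdims=True) / m  # db[l]
    dA_prev = np.dot(W.T, dZ)                   # dA[l-1] = W[l].T dZ[l]
    return dW, db, dA_prev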
Ref: https://www.youtube.com/watch?v=qzPQ8cEsVK8&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=41
A primer in machine learning
Activation Functions
Using the tanh function usually makes training easier because tanh has mean 0 (it centers the data around 0 instead of 0.5). For binary classification, though, use the sigmoid function in the output layer, so that y^ can be read as a probability in [0, 1].
A drawback of both tanh and sigmoid is that their slope becomes very small when the input is very large or very small.
We also have the ReLU function: a = max(0, z).
Conclusion:
- Binary Classification ===> SIGMOID function for OUTPUT LAYER.
- ReLU (derivative = 0 when z < 0), Tanh ===> use in HIDDEN LAYERS. Using ReLU usually makes your model learn faster than tanh or sigmoid.
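A minimal sketch of the three activations discussed above:
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # output in (0, 1); output layer for binary classification
def tanh(z):
    return np.tanh(z)             # output in (-1, 1), mean around 0
def relu(z):
    return np.maximum(0, z)       # a = max(0, z); derivative 0 when z < 0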
Ref: https://www.youtube.com/watch?v=Xvg00QnyaIY&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=30
Broadcasting in Python
axis=0 means you want to compute vertically (down each column); axis=1 means horizontally (across each row).
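A minimal sketch of broadcasting; the matrix loosely follows the lecture's food-calorie example (exact values are illustrative):
import numpy as np
A = np.array([[56.0,   0.0,  4.4, 68.0],
              [ 1.2, 104.0, 52.0,  8.0],
              [ 1.8, 135.0, 99.0,  0.9]])   # calories from carbs/protein/fat in 4 foods
cal = A.sum(axis=0)                          # axis=0: sum each column (vertically), shape (4,)
percentage = 100 * A / cal.reshape(1, 4)     # (3, 4) / (1, 4): the row is broadcast to every row of A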
Ref: https://www.youtube.com/watch?v=tKcLaGdvabM&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=21
Exponentially Weighted Averages
If beta is larger, the weight on the current theta(t) is smaller, so the average adapts slowly to changes in the temperature. With a smaller beta we average over a smaller window, so we get more noise, but the average adapts quickly to temperature changes.
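A minimal sketch of the exponentially weighted average (the temperature data is illustrative):
import numpy as np
temps = np.random.randn(365) * 5 + 10   # illustrative daily temperatures theta(t)
beta = 0.9                              # roughly averages over 1 / (1 - beta) = 10 days
v = 0.0
averaged = []
for theta in temps:
    v = beta * v + (1 - beta) * theta   # v(t) = beta * v(t-1) + (1 - beta) * theta(t)
    averaged.append(v)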
Ref: https://www.youtube.com/watch?v=lAq96T8FkTw&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=17
Neural Network Representations
Don't count the input layer when counting the layers of a neural network (a network with one hidden layer and an output layer is a 2-layer NN).
Ref: https://www.youtube.com/watch?v=CcRkHl75Z-Y&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=26
Normalizing Inputs
- Subtract the mean.
- Normalize the variance.
If x1 ranges over 1...1000 and x2 over 0...1, W1 and W2 take on very different values and the cost surface becomes elongated, forcing a small learning rate. After normalizing, the contours are more symmetric, so we can use a larger learning rate and optimization is faster.
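A minimal sketch of input normalization (the feature ranges are illustrative):
import numpy as np
X = np.random.rand(2, 100) * np.array([[1000.0], [1.0]])   # x1 in 0..1000, x2 in 0..1
mu = X.mean(axis=1, keepdims=True)                          # subtract the mean
sigma2 = ((X - mu) ** 2).mean(axis=1, keepdims=True)        # normalize the variance
X_norm = (X - mu) / np.sqrt(sigma2)
# use the same mu and sigma2 to normalize the test set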
Ref: https://www.youtube.com/watch?v=FDCfw-YqWTE&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=9
Computing Neural Network Output
Computation Graph
A computation graph organizes the computation from left to right (the blue arrows, the forward pass); derivatives are then computed from right to left.
Ref: https://www.youtube.com/watch?v=hCP1vGoCdYU&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=13
Vanishing/Exploding Gradients
In a deep NN, L is large, so weights slightly larger than 1 (e.g. 1.5) make the activations/gradients grow like 1.5^L and explode; if we change that to 0.5, then 0.5^L gets closer and closer to 0 and the gradients vanish.
I: the identity matrix (the example uses W[l] slightly larger or smaller than I). Vanishing gradients make gradient descent take a very long time to learn anything ==> training is very difficult.
Ref: https://www.youtube.com/watch?v=FDCfw-YqWTE&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=10
What does this have to do with the brain?
Logistic Regression Gradient Descent
We modify the parameters w1, w2, b to reduce the loss L(a, y).
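A minimal sketch of the gradients for a single example (the example values are illustrative):
import numpy as np
x1, x2, y = 1.0, 2.0, 1.0             # one training example (illustrative values)
w1, w2, b = 0.1, -0.2, 0.0
z = w1 * x1 + w2 * x2 + b
a = 1 / (1 + np.exp(-z))              # prediction
dz = a - y                            # dL/dz for the cross-entropy loss
dw1, dw2, db = x1 * dz, x2 * dz, dz   # gradients used to update w1, w2, b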
Ref: https://www.youtube.com/watch?v=z_xiwjEdAC4&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=15
Neural Network Overview
[i] is not the same as (i): [i] denotes the i-th layer of the neural network, while (i), as in x(i), denotes the i-th training example.
da, dz are used for the backward calculation.
Ref: https://www.youtube.com/watch?v=fXOsFF95ifk&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=25
Dropout Regularization
With dropout we go through each layer of the network and set a probability of eliminating each node. For example, with a coin toss the probability of keeping/removing a node is 0.5/0.5.
d3: dropout vector for layer 3
keep_prob is the probability of keeping a unit (e.g. keep_prob = 0.9 means about 10% of the units are shut off). With units shut off, a[3] is reduced, so to keep the expected value of z[4] unchanged we divide a[3] by keep_prob. This "inverted dropout" technique makes sure the expected value of a[3] does not change, and it makes test time easier because there is no scaling problem.
At the test time, we don't use dropout.
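A minimal sketch of inverted dropout for layer 3 (the activation values are illustrative):
import numpy as np
keep_prob = 0.8
a3 = np.random.rand(50, 1)                  # illustrative activations of layer 3
d3 = np.random.rand(*a3.shape) < keep_prob  # dropout vector: True with probability keep_prob
a3 = a3 * d3                                # shut off ~20% of the units
a3 = a3 / keep_prob                         # inverted dropout: keep the expected value of a3 unchanged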
Ref: https://www.youtube.com/watch?v=D8PJAL-MZv8&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=6
Vectorizing Across Multiple Examples
np.sum
Sum of array elements over a given axis.
np.sum([[0, 1], [0, 5]], axis=0) # axis = 0: vertically --> [0, 6]
np.sum([[0, 1], [0, 5]], axis=1) # axis = 1: horizontally --> [1, 5]
np.sum([[0, 1], [0, 5]]) # Not specify axis: sum of all elements in array. --> 6
Image Ref: https://qiita.com/Phoeboooo/items/b464b7df3c64a33caf94
Vectorizing Logistic Regression
Stack all of the examples horizontally as the columns of X. When we add a real number to a vector, Python automatically expands the real number into a vector of the same shape (broadcasting).
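A minimal sketch of the vectorized forward pass (the shapes are illustrative):
import numpy as np
sigmoid = lambda z: 1 / (1 + np.exp(-z))
X = np.random.randn(3, 100)       # 100 examples stacked horizontally as columns
w, b = np.random.randn(3, 1), 0.0
Z = np.dot(w.T, X) + b            # the real number b is broadcast across all 100 columns
A = sigmoid(Z)                    # predictions for every example at once, shape (1, 100)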
Ref: https://www.youtube.com/watch?v=okpqeEUdEkY&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=19
Gradient Checking
Derivatives
Explanation For Vectorized Implementation
If you stack up the inputs as columns, the outputs come out stacked up as columns too.
Ref: https://www.youtube.com/watch?v=kkWRbIb42Ms&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=29
Understanding Dropout
The keep_prob value should depend on the size of the weight matrix: layers with big weight matrices (more prone to over-fitting) get a smaller keep_prob. So we have "different keep_probs for different layers".
Dropout is used a lot in computer vision because the input size is big and there is rarely enough data, so over-fitting is common.
Remember that "Dropout" is a regularization technique to prevent over-fitting.
A downside of dropout is that the cost function J is no longer well defined, so first run without dropout to check that J decreases monotonically.
Ref: https://www.youtube.com/watch?v=ARq74QuavAo&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=7
Mini Batch Gradient Descent
Batch gradient descent: process the entire training set all at the same time.
Use vectorization to compute the gradient over all of the examples in each mini-batch.
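A minimal sketch of splitting the training set into mini-batches (data and sizes are illustrative):
import numpy as np
m, batch_size = 1000, 64                  # batch size a power of 2 (64 here)
X = np.random.randn(5, m)
Y = np.random.randint(0, 2, (1, m))
perm = np.random.permutation(m)           # shuffle before splitting
X, Y = X[:, perm], Y[:, perm]
for t in range(0, m, batch_size):
    X_t = X[:, t:t + batch_size]          # mini-batch X{t}
    Y_t = Y[:, t:t + batch_size]          # mini-batch Y{t}
    # forward prop on X_t, compute the cost, backprop, update the parameters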
Ref: https://www.youtube.com/watch?v=4qJaSmvhxi8&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=15