I'm Anh, from 🇻🇳 and I'm a 👨💻 who likes building systems with DDD.
Currently interested in System Design 👷 🏗️ and Databases.
W[l], b[l] are the parameters for layer (l).
A[0]: the training examples are stacked column by column. The same goes for Z[2] (z[2](1), z[2](2), ...).
Ref: https://www.youtube.com/watch?v=a8i2eJin0lY&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=37
With Python broadcasting, (n[1], 1) is expanded to (n[1], m).
- m: number of training examples.
Ref: https://www.youtube.com/watch?v=yslMo3hSbqE&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=38
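A minimal numpy sketch of the two notes above (the layer sizes n0, n1 and all values are made up for illustration): stacking the m examples as columns of A0 lets one matrix product compute every z[1](i) at once, and adding b1 of shape (n1, 1) broadcasts it to (n1, m).

import numpy as np

n0, n1, m = 3, 4, 5              # made-up sizes: input units, hidden units, examples
A0 = np.random.randn(n0, m)      # training examples stacked column by column
W1 = np.random.randn(n1, n0)
b1 = np.zeros((n1, 1))           # shape (n1, 1); broadcasting expands it to (n1, m)

Z1 = W1 @ A0 + b1                # one matrix product handles all m examples
print(Z1.shape)                  # (4, 5) == (n1, m)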
The keep_prob value should depend on the size of the weight matrix, so we use different keep_probs for different layers.
Dropout is commonly used in Computer Vision because the input size is big.
Remember that "Dropout" is a regularization technique to prevent over-fitting.
A downside of Dropout is that the cost function J is no longer well defined.
Ref: https://www.youtube.com/watch?v=ARq74QuavAo&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=7
Stack all the examples horizontally. When we add a real number to a vector, Python will expand the real number into a vector automatically (broadcasting).
Ref: https://www.youtube.com/watch?v=okpqeEUdEkY&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=19
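A tiny sketch of that scalar broadcasting (the values are made up):

import numpy as np

v = np.array([1.0, 2.0, 3.0])
print(v + 100)        # broadcasting expands 100 into [100, 100, 100] -> [101. 102. 103.]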
n[0] input units, n[1] hidden units, n[2] output units.
keepdims=True in np.sum keeps the summed dimension, so the result stays a proper column vector (n, 1) instead of a rank-1 array (n,).
Ref: https://www.youtube.com/watch?v=7bLEWDZng_M&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=33
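A small sketch of the keepdims behaviour (shapes are made up): without keepdims=True the sum over axis=1 collapses to a rank-1 array; with it, the result stays a column vector, which is the shape backprop expects for db.

import numpy as np

dZ = np.random.randn(4, 5)                        # shape (n, m), made up
db_bad  = np.sum(dZ, axis=1) / 5                  # shape (4,)  -- rank-1 array
db_good = np.sum(dZ, axis=1, keepdims=True) / 5   # shape (4, 1) -- column vector
print(db_bad.shape, db_good.shape)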
If we initialize the parameters with 0, the hidden units will all make the same contribution to the output layer: the hidden units stay symmetric.
Why do we choose 0.01 instead of 100 or 1000? Because z = Wx + b: if W is large, z will be large. If we use tanh as the activation function, the output a will land on the flat part of the tanh curve, so gradient descent (learning) will be slow.
Ref: https://www.youtube.com/watch?v=yXcQ4B-YSjQ&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=35
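A minimal sketch of the initializations the note compares (layer sizes are made up): zeros keep the hidden units symmetric, while small random values scaled by 0.01 break symmetry and keep z small so tanh does not saturate.

import numpy as np

n0, n1 = 3, 4                                  # made-up layer sizes
W1_zero  = np.zeros((n1, n0))                  # symmetric: every hidden unit computes the same thing
W1_small = np.random.randn(n1, n0) * 0.01      # breaks symmetry, keeps z = Wx + b small
W1_big   = np.random.randn(n1, n0) * 100       # large W -> large z -> tanh saturates, slow learning
b1 = np.zeros((n1, 1))                         # b can safely start at zero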
Use Gradient Descent to find the pair (w, b) for Logistic Regression that makes the value of the Cost function J(w, b) smallest.
J(w, b) is a convex function.
We first initialize (w, b) to some value, say (w0, b0). Using gradient descent, we step down gradually towards the minimum of J(w, b).
Alpha is the learning rate: it controls the step size of gradient descent.
dJ(w, b)/dw: the derivative; it drives the update to w and is also the slope of the tangent to the graph of J(w, b).
Ref: https://www.youtube.com/watch?v=uJryes5Vk1o&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=10
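A minimal sketch of that gradient-descent loop for logistic regression (the toy data and the number of iterations are made up): each step moves (w, b) downhill on the convex cost J(w, b) by alpha times the derivatives.

import numpy as np

X = np.array([[0.5, 1.0, 1.5, 2.0],
              [1.0, 0.0, 1.0, 0.0]])          # made-up data, shape (n, m)
Y = np.array([[0, 0, 1, 1]])                  # labels, shape (1, m)
w = np.zeros((2, 1)); b = 0.0
alpha = 0.1                                   # learning rate: step size of gradient descent
m = X.shape[1]

for _ in range(1000):
    A = 1 / (1 + np.exp(-(w.T @ X + b)))      # forward: y^ = sigmoid(wTx + b)
    dZ = A - Y                                # derivative of the cost w.r.t. z
    dw = X @ dZ.T / m                         # dJ/dw, same shape as w
    db = np.sum(dZ) / m                       # dJ/db
    w -= alpha * dw                           # step down the convex cost surface
    b -= alpha * db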
wTx + b may be larger than 1 or smaller than 0, so we use the sigmoid function to squash the output into the range [0, 1].
Ref: https://www.youtube.com/watch?v=hjrYrynGWGA&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=8
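A tiny sketch of that squashing (the z values are made up): wTx + b is unbounded, and sigmoid maps it into (0, 1) so it can be read as a probability.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([-10.0, 0.0, 3.7, 50.0])   # wTx + b can be far outside [0, 1]
print(sigmoid(z))                        # every output lies in (0, 1)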
Using the tanh function makes training easier because tanh has mean 0 (you can center your data around 0 instead of 0.5). But for Binary Classification you can use the sigmoid function for the output layer (to calculate y^ = {0, 1}).
However, the slopes of both tanh and sigmoid become small when the input is too large or too small.
We also have the ReLU function: a = max(0, z).
Conclusion: ReLU is the usual default for hidden layers; sigmoid is mainly kept for the binary output layer.
Ref: https://www.youtube.com/watch?v=Xvg00QnyaIY&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=30
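A short sketch of the activation functions compared above, together with their standard derivatives; the small-slope problem of sigmoid and tanh for large |z| is visible in the derivative formulas.

import numpy as np

def sigmoid(z):               # output layer for binary classification
    return 1 / (1 + np.exp(-z))

def tanh(z):                  # zero-centred, usually better than sigmoid for hidden layers
    return np.tanh(z)

def relu(z):                  # a = max(0, z): slope does not vanish for large positive z
    return np.maximum(0, z)

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)        # small when |z| is large -> slow learning

def tanh_prime(z):
    return 1 - np.tanh(z) ** 2   # also small when |z| is large

def relu_prime(z):
    return (z > 0).astype(float)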
g'(z): the derivative of the activation function g(z).
Ref: https://www.youtube.com/watch?v=y1xoI7mBtOc&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=12
A quantity and its gradient (e.g. w and dw) always have the same dimensions.
Ref: https://www.youtube.com/watch?v=yXcQ4B-YSjQ&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=34
n: number of input features. With the ReLU activation function, use 2/n as the variance of the weights. Initializing W neither much larger nor much smaller than 1 helps prevent the gradients from exploding or vanishing too quickly.
Ref: https://www.youtube.com/watch?v=s2coXdufOzE&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=11
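A minimal sketch of that initialization for a ReLU layer (the layer sizes are made up): scale the random weights by sqrt(2/n) so their variance is 2/n.

import numpy as np

n_prev, n_curr = 256, 128                                     # made-up layer sizes
W = np.random.randn(n_curr, n_prev) * np.sqrt(2.0 / n_prev)   # variance 2/n for ReLU layers
b = np.zeros((n_curr, 1))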
If dθapprox is far from dθ, check which components i have dθapprox[i] far from dθ[i]; for example, you may find that the computation of db[i] is wrong.
Ref: https://www.youtube.com/watch?v=4Ct3Yujl1dk&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=14
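A rough sketch of that gradient check for a single scalar parameter; `cost` and the analytic gradient `dtheta` are placeholders you would take from your own cost function and backprop.

import numpy as np

def grad_check(cost, theta, dtheta, eps=1e-7):
    """Compare the analytic gradient dtheta with a two-sided numerical estimate."""
    dtheta_approx = (cost(theta + eps) - cost(theta - eps)) / (2 * eps)
    diff = abs(dtheta_approx - dtheta) / (abs(dtheta_approx) + abs(dtheta))
    return dtheta_approx, diff        # diff around 1e-7 is fine; 1e-3 or worse suggests a bug

# toy example: J(theta) = theta**2, so dJ/dtheta = 2*theta
approx, diff = grad_check(lambda t: t ** 2, theta=3.0, dtheta=6.0)
print(approx, diff)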
Ex: in face recognition, the first layer does feature detection / edge detection by looking at very small groups of pixels in the picture. In the second layer we group the edges together to get the parts of the face (nose, chin, eyes, ...). In the third layer, by grouping the parts together, we get the complete face.
Ref: https://www.youtube.com/watch?v=5dWp1mw_XNk&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=39
Lambda: Regularization parameter.
Ref: https://www.youtube.com/watch?v=6g0t3Phly2M&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=4
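A minimal sketch of how lambda enters the cost (the helper name and the numbers are made up): the L2 penalty (lambda / (2m)) * sum of the squared norms of the W[l] matrices is added to the cross-entropy cost, so a larger lambda pushes the weights towards 0.

import numpy as np

def l2_regularized_cost(cross_entropy_cost, weights, lambd, m):
    """Add the L2 penalty (lambda / (2m)) * sum(||W[l]||^2) to the unregularized cost."""
    l2 = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost + l2

# made-up example: two weight matrices, unregularized cost 0.3
W1, W2 = np.random.randn(4, 3), np.random.randn(1, 4)
print(l2_regularized_cost(0.3, [W1, W2], lambd=0.7, m=100))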
If beta is larger, the weight given to θ(t) is smaller, so the average adapts slowly to changes in the temperature. With a smaller beta we average over a smaller window: there is a lot more noise, but it adapts quickly to temperature changes.
Ref: youtube.com/watch?v=lAq96T8FkTw&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=17
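A minimal sketch of the exponentially weighted average on a list of daily temperatures (the data is made up, and bias correction is left out): a larger beta averages over a wider window and adapts more slowly.

def ewma(thetas, beta):
    """Exponentially weighted average: v_t = beta * v_{t-1} + (1 - beta) * theta_t."""
    v = 0.0
    out = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return out

temps = [10, 12, 11, 14, 20, 21, 19, 23]   # made-up daily temperatures
print(ewma(temps, beta=0.9))               # smooth, adapts slowly
print(ewma(temps, beta=0.5))               # noisier, adapts quickly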
da[L]: the derivative of the Cost function with respect to a[L].
Ref: https://www.youtube.com/watch?v=qzPQ8cEsVK8&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=41
With Dropout we go through each layer and set a probability of eliminating each node in the NN. For example, with a coin toss, the probability of keeping / removing a node is 0.5 / 0.5.
d3: dropout vector for layer 3
Shutting off 10% of the nodes corresponds to keep_prob = 0.9. With nodes shut off, a[3] is reduced, so to keep z[4] unchanged in expectation we must divide a[3] by keep_prob. This inverted-dropout technique makes sure the expected value of a[3] does not change, and it makes test time easier because it avoids a scaling problem.
At test time, we don't use dropout.
Ref: https://www.youtube.com/watch?v=D8PJAL-MZv8&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=6
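A minimal sketch of inverted dropout for layer 3 as described above (the shapes are made up): dividing by keep_prob keeps the expected value of a3, and therefore z4, unchanged; at test time no mask or scaling is applied.

import numpy as np

keep_prob = 0.9                              # keep ~90% of the units in layer 3 (10% shut off)
a3 = np.random.randn(5, 10)                  # made-up activations for layer 3, shape (n3, m)
d3 = np.random.rand(*a3.shape) < keep_prob   # dropout mask: True with probability keep_prob
a3 = a3 * d3                                 # shut off ~10% of the units
a3 = a3 / keep_prob                          # inverted dropout: E[a3] stays unchanged
# at test time: use a3 directly, no dropout mask, no scaling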
If you stack up the inputs as columns, the outputs will be stacked up as columns too.
Ref: https://www.youtube.com/watch?v=kkWRbIb42Ms&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=29
Computation graph organizes computation from left to right (the blue arrows).
Ref: https://www.youtube.com/watch?v=hCP1vGoCdYU&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=13
If x1 ranges over 1...1000 and x2 over 0...1, W1 and W2 will have very different values. After normalizing the inputs, the cost surface is more symmetric, so gradient descent optimizes faster and does not need an overly small learning rate.
Ref: https://www.youtube.com/watch?v=FDCfw-YqWTE&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=9
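A minimal sketch of normalizing the inputs (subtract the mean, divide by the standard deviation), using made-up feature ranges like the x1/x2 example above; the same mu and sigma must be reused on the test set.

import numpy as np

X = np.random.rand(2, 100)
X[0] *= 1000                               # x1 roughly in 0...1000, x2 in 0...1 (made up)

mu = np.mean(X, axis=1, keepdims=True)     # per-feature mean, shape (2, 1)
sigma = np.std(X, axis=1, keepdims=True)   # per-feature std, shape (2, 1)
X_norm = (X - mu) / sigma                  # both features now on a comparable scale
# reuse the same mu and sigma to normalize the test set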
Don't count the input layer when counting the layers of a Neural Network.
Ref: https://www.youtube.com/watch?v=CcRkHl75Z-Y&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=26
Batch gradient descent: process the entire training set all at the same time.
Use vectorization to calculate the gradients for all of the examples in each mini-batch.
Ref: https://www.youtube.com/watch?v=4qJaSmvhxi8&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=15
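A rough sketch of the mini-batch loop; the `forward_backward` and `update` callables are placeholders for your own vectorized forward/backward pass and parameter update.

import numpy as np

def minibatch_epoch(X, Y, batch_size, forward_backward, update):
    """One epoch of mini-batch gradient descent; X is (n, m) with examples as columns."""
    m = X.shape[1]
    perm = np.random.permutation(m)               # shuffle the examples each epoch
    X, Y = X[:, perm], Y[:, perm]
    for t in range(0, m, batch_size):
        X_t = X[:, t:t + batch_size]              # mini-batch X{t}
        Y_t = Y[:, t:t + batch_size]              # mini-batch Y{t}
        grads, cost = forward_backward(X_t, Y_t)  # vectorized over the whole mini-batch
        update(grads)
    return cost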
[i] !== (i): [i] is the i-th layer in the neural network, while (i), as in x(i), is the i-th training example.
da, dz are used for the backward calculation.
Ref: https://www.youtube.com/watch?v=fXOsFF95ifk&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=25
Sum of array elements along a given axis:
np.sum([[0, 1], [0, 5]], axis=0) # axis = 0: vertically --> [0, 6]
np.sum([[0, 1], [0, 5]], axis=1) # axis = 1: horizontally --> [1, 5]
np.sum([[0, 1], [0, 5]]) # Not specify axis: sum of all elements in array. --> 6
Image Ref: https://qiita.com/Phoeboooo/items/b464b7df3c64a33caf94
If lambda is set too large, w[l] ~ 0 and our NN becomes closer to logistic regression (we move from high variance towards high bias).
With the tanh function, if Z is small enough, the graph is roughly linear.
When debugging gradient descent: plot the graph of cost function J against the number of iterations.
Ref: https://www.youtube.com/watch?v=NyG-7nRpsW8&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=5
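A tiny sketch of that debugging plot, assuming you record one cost value per iteration in a list (the numbers below are made up): J should decrease on every iteration of batch gradient descent.

import matplotlib.pyplot as plt

costs = [0.69, 0.52, 0.41, 0.35, 0.31, 0.29]   # made-up J values, one per iteration
plt.plot(costs)
plt.xlabel("iteration")
plt.ylabel("cost J")
plt.show()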
[x(i), y(i)]: the i-th (example, label) pair.
The value of y^ lies in [0, 1].
The Loss function is used for a single example. The Cost function evaluates over the whole set of examples. We try to find the pair (w, b) that minimizes the cost function.
Ref: https://www.youtube.com/watch?v=SHEPb1JHw5o&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=9
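A minimal sketch of that distinction using the standard cross-entropy formulas for logistic regression (the predictions and labels are made up): the loss is computed per example, the cost averages the loss over all m examples.

import numpy as np

def loss(y_hat, y):
    """Loss for a single example: -(y*log(y_hat) + (1-y)*log(1-y_hat))."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(Y_hat, Y):
    """Cost over the whole training set: average of the per-example losses."""
    m = Y.shape[1]
    return np.sum(loss(Y_hat, Y)) / m

Y_hat = np.array([[0.9, 0.2, 0.7]])   # made-up predictions in (0, 1)
Y     = np.array([[1,   0,   1  ]])
print(cost(Y_hat, Y))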
High bias: simple classifier.
High variance: complicated classifier.
Look at the training set error to check for a "high bias" problem. Look at the dev set error to check for a "high variance" problem. Sometimes the Bayes error may itself be high (blurry images, ...).
The purple line is both "high bias" and "high variance": the "linear part" is under-fitting (high bias), while the curvy part is over-fitting (high variance).
Ref: https://www.youtube.com/watch?v=SjQyLhQIXSM&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=2
You can use a linear activation function in the case of REGRESSION (e.g. predicting house prices).
Ref: https://www.youtube.com/watch?v=NkOv_k7r6no&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=31
When training for a long time, w will get closer to 0 if initialized with np.random.seed(0).
Orthogonalization: one task at a time (optimize w, b to minimize J, or avoid over-fitting).
Ref: https://www.youtube.com/watch?v=BOCLq2gpcGU&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=8
Instead of using two for loops, we use just one for loop.
Ref: https://www.youtube.com/watch?v=qsIrQi0fzbY&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=17
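A rough sketch of what that looks like for the logistic-regression gradients (the sizes and data are made up): the inner loop over the n features is replaced by vector operations, leaving only one loop over the m examples.

import numpy as np

n, m = 2, 4                                   # made-up sizes: features, examples
X = np.random.randn(n, m)
Y = np.array([[0, 1, 1, 0]])
w = np.zeros((n, 1)); b = 0.0

dw = np.zeros((n, 1)); db = 0.0
for i in range(m):                            # the single remaining loop, over the examples
    z_i = (w.T @ X[:, i:i+1]).item() + b
    a_i = 1 / (1 + np.exp(-z_i))
    dz_i = a_i - Y[0, i]
    dw += X[:, i:i+1] * dz_i                  # vectorized over the n features: no inner loop
    db += dz_i
dw /= m; db /= m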
The reason we see some noise in the graph of cost J with mini-batch gradient descent is that, for example, X{1}, Y{1} may be an easy mini-batch while X{2}, Y{2} is harder (e.g. some missing or wrong labels in Y{2} make the cost higher).
Stochastic gradient descent converges slowly (and loses the speed-up from vectorization).
A mini-batch size from 64 to 512 (a power of 2) is most common.
Ref: https://www.youtube.com/watch?v=-_4Zi8fCZO4&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=16
axis = 0 means you want to calculate vertically, axis = 1 means horizontally.
Ref: https://www.youtube.com/watch?v=tKcLaGdvabM&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=21
We now have tools that help us reduce "bias" without hurting "variance", and vice versa.
Ref: https://www.youtube.com/watch?v=C1N_PDHuJ6Q&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=3
We modify the parameters w1, w2, b to reduce L(a, y): the loss value.
Ref: https://www.youtube.com/watch?v=z_xiwjEdAC4&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=15
It's impossible to guess the best hyper-parameters the first time you train your model.
Use the "dev" set to test algorithms on and see which algorithm works better. With a small dataset a ratio like 60/20/20 is good, but with a dataset of millions of records the ratio 98/1/1 is better.
Pictures uploaded from the app have worse resolution than pictures from the internet.
Ref: https://www.youtube.com/watch?v=1waHlpKiNyY&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc
Hyper-parameters control the parameters.
Ref: https://www.youtube.com/watch?v=VTE2KlfoO3Q&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=42
In a deep NN, L is large, so 1.5^L will be very large too. If we change the factor to 0.5, then with large L, 0.5^L will be close to 0.
I: identity matrix. Vanishing/exploding gradients make it take a long time for gradient descent to learn anything, so training becomes very difficult.
Ref: https://www.youtube.com/watch?v=FDCfw-YqWTE&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=10
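A tiny numeric sketch of the point above, with a made-up depth of L = 50 layers: a per-layer factor of 1.5 explodes while 0.5 vanishes.

L = 50                      # made-up network depth
print(1.5 ** L)             # ~6.4e8: activations/gradients explode
print(0.5 ** L)             # ~8.9e-16: activations/gradients vanish towards 0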