C1 NN & DL Flashcards
ReLU
Rectified Linear Unit
Activation function; a breakthrough over sigmoid, whose near-zero gradient for large |z| slows training, whereas ReLU keeps a gradient of 1 for z > 0 and so trains faster
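A minimal numpy sketch (toy values; helper names are illustrative) contrasting the two gradients:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)          # ~0 when |z| is large -> vanishing gradient

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    return (z > 0).astype(float)  # constant 1 for z > 0 -> no saturation

z = np.array([-10.0, -1.0, 0.5, 10.0])
print(sigmoid_grad(z))  # tiny at the extremes
print(relu_grad(z))     # 0 or 1
```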
Neuron and notation
x -> o -> y
Labeled data: m pairs (x, y)
Neural Network
Several densely connected layers of nodes that learn the relation between input and output
NN types and applications
Standard NN - structured data
CNN - unstructured data (image)
RNN - temporal data (audio, language)
Binary classification
(x, y), x in R^nx, y in {0,1}
m training examples: m = m_train (vs m_test)
Training set: {(x(1),y(1)), ..., (x(m),y(m))}
X = [x(1) ... x(m)] in R^(nx*m)
Y = [y(1) ... y(m)] in R^(1*m)
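A minimal numpy sketch (toy sizes nx=3, m=4, made-up labels) of stacking the examples as columns of X and the labels as the row vector Y:

```python
import numpy as np

nx, m = 3, 4
examples = [np.random.randn(nx, 1) for _ in range(m)]   # m column vectors x(i)
labels = [0, 1, 1, 0]                                    # y(i) in {0, 1}

X = np.hstack(examples)             # shape (nx, m)
Y = np.array(labels).reshape(1, m)  # shape (1, m)
print(X.shape, Y.shape)             # (3, 4) (1, 4)
```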
Logistic regression (problem)
Algorithm for binary classification.
Predict ŷ = P(y=1 | x), x in R^nx
Parameters: w in R^nx, b in R
ŷ = σ(wTx+b) = σ(z), the sigmoid applied to the linear function z, with σ(z) = 1/(1+e^-z)
0 < ŷ < 1 (ŷ = 0.5 at z=0)
Need to learn parameters so that ŷ(i) ≈ y(i)
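A minimal numpy sketch of the prediction ŷ = σ(wᵀx + b) for one example (toy values; zero initialisation is just for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

nx = 3
w = np.zeros((nx, 1))    # parameters w in R^nx
b = 0.0                  # parameter b in R
x = np.random.randn(nx, 1)

z = np.dot(w.T, x) + b   # linear part, shape (1, 1)
y_hat = sigmoid(z)       # 0 < ŷ < 1, ŷ = 0.5 when z = 0
print(y_hat)
```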
Logistic regression (loss and cost functions)
Given ŷ = P(y=1 | x) for {(x1, y1),…,(xm,ym)}, we want ŷ≈y
Loss/error function to minimise ℒ(ŷ,y) tells how good ŷ is, applied to a single training sample.
We want to maximize P(y | x): if y=1, P(y | x) = ŷ; if y=0, P(y | x) = 1 - ŷ
So P(y | x) = ŷ^y(1-ŷ)^(1-y) (log is a strictly monotonically increasing function, so maximising x <=> maximising log x)
logP(y | x) = ylogŷ + (1-y)log(1-ŷ)
ℒ(ŷ,y) = -(ylogŷ + (1-y)log(1-ŷ)) (- because we minimise the loss)
Cost function measures the cost of the parameters over the whole training set (average loss)
P(Y | X) = ∏(i=1 > m) P(y(i) | x(i))
logP(Y | X) = ∑logP(y(i) | x(i))
J(w,b) = 1/m ∑ℒ(ŷ(i),y(i)) (1/m is a scaling factor; we minimise J instead of maximising log P)
(assuming training examples i.i.d. - independent and identically distributed, and log∏x = ∑logx)
We look for w, b minimising J(w,b)
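A minimal numpy sketch of the loss (one example) and the cost J (average over m examples) on toy predictions; the small eps guarding log(0) is an assumption, not part of the card:

```python
import numpy as np

def loss(y_hat, y, eps=1e-12):
    # ℒ(ŷ, y) = -(y log ŷ + (1-y) log(1-ŷ))
    return -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

A = np.array([[0.9, 0.2, 0.7, 0.4]])  # ŷ(i), shape (1, m)
Y = np.array([[1,   0,   1,   1  ]])  # y(i),  shape (1, m)

m = Y.shape[1]
J = np.sum(loss(A, Y)) / m            # cost = average loss
print(J)
```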
Gradient descent
J is convex, hence there is a global optimum.
Repeat:
  w = w - ɑ ∂J(w,b)/∂w
  b = b - ɑ ∂J(w,b)/∂b
with ɑ the learning rate.
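A minimal sketch of the repeated update on a toy convex function J(w) = (w - 3)^2 (not the logistic cost; just the update rule):

```python
alpha = 0.1                 # learning rate ɑ
w = 0.0
for _ in range(100):
    dw = 2 * (w - 3)        # ∂J/∂w for J(w) = (w - 3)**2
    w = w - alpha * dw
print(w)                    # ≈ 3, the global optimum of this convex J
```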
Derivatives (general, log)
f'(a) = lim(h→0) [f(a+h) - f(a)]/h (how much a small push on x changes y); log'(x) = 1/x
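A minimal numpy sketch checking the limit definition numerically against log'(x) = 1/x (the step h=1e-6 is an arbitrary small value):

```python
import numpy as np

def numeric_derivative(f, a, h=1e-6):
    # finite-difference approximation of f'(a)
    return (f(a + h) - f(a)) / h

a = 2.0
print(numeric_derivative(np.log, a))  # ≈ 0.5
print(1 / a)                          # exact derivative: 0.5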
Computational Graph for J(a,b,c) = 3(a + bc)
Forward vs backward propagation.
Useful to optimise an output variable J through intermediate variables: a left-to-right pass computes J, a right-to-left pass efficiently computes the derivatives using the chain rule.
u = bc, v = a+u, J = 3v
dJ/du = dJ/dv * dv/du (chain rule)
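A minimal Python sketch of the forward and backward passes on this graph (toy values):

```python
a, b, c = 5.0, 3.0, 2.0

# forward pass (left to right)
u = b * c          # u = bc
v = a + u          # v = a + u
J = 3 * v          # J = 3v

# backward pass (right to left, chain rule)
dJ_dv = 3.0
dJ_du = dJ_dv * 1.0        # dv/du = 1
dJ_da = dJ_dv * 1.0        # dv/da = 1
dJ_db = dJ_du * c          # du/db = c
dJ_dc = dJ_du * b          # du/dc = b
print(J, dJ_da, dJ_db, dJ_dc)   # 33.0 3.0 6.0 9.0
```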
Gradient Descent for Logistic Regression
xi, wi, b -> z = wTx+b -> ŷ = σ(z) -> ℒ(ŷ,y)
∂ℒ/∂ŷ = -y/ŷ + (1-y)/(1-ŷ)
∂ℒ/∂z = ∂ℒ/∂ŷ.∂ŷ/∂z = ŷ(1-ŷ)∂ℒ/∂ŷ = ŷ - y
∂ℒ/∂wi = ∂ℒ/∂z.∂z/∂wi = xi(ŷ - y)
∂ℒ/∂b = ∂ℒ/∂z.∂z/∂b = ŷ - y
∂J(w,b)/∂wi = 1/m ∑ ∂ℒ(ŷ(i),y(i))/∂wi
Algorithm, single iteration (a = ŷ, avoid explicit for loops!):
Z = w.T X + b
A = σ(Z)
dZ = A - Y
dw = 1/m X.dZ.T
db = 1/m np.sum(dZ)
w = w - ɑ.dw
b = b - ɑ.db
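A minimal numpy sketch of one such vectorized iteration on toy random data (names follow the card, A = ŷ):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

nx, m, alpha = 3, 5, 0.01
X = np.random.randn(nx, m)            # inputs, one column per example
Y = (np.random.rand(1, m) > 0.5) * 1  # toy labels in {0, 1}
w = np.zeros((nx, 1))
b = 0.0

Z = np.dot(w.T, X) + b                # (1, m)
A = sigmoid(Z)                        # ŷ for all examples
dZ = A - Y                            # (1, m)
dw = np.dot(X, dZ.T) / m              # (nx, 1)
db = np.sum(dZ) / m

w = w - alpha * dw
b = b - alpha * db
```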
Numpy (broadcasting, shapes, rank)
broadcasting: (m,n) +-*/ (1,n) -> (m,n)
for the smaller array, each dim must either match the other array's dim or be 1.
[a b c] may be a rank-1 array of shape (3,). Use .reshape(1,3) to make it a true row vector [[a b c]]! (otherwise e.g. .T won't work as expected)
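A minimal numpy sketch of broadcasting and the rank-1 pitfall:

```python
import numpy as np

M = np.random.randn(4, 3)
row = np.random.randn(1, 3)
print((M + row).shape)      # (4, 3): the (1, 3) row is broadcast over the 4 rows

v = np.array([1.0, 2.0, 3.0])
print(v.shape)              # (3,) rank-1 array: v.T is still (3,), not a column
v = v.reshape(1, 3)         # explicit row vector
print(v.shape, v.T.shape)   # (1, 3) (3, 1): transpose now behaves as expected
```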
Neural networks representation
Input layer, hidden layer, output layer (a 2-layer NN: the input layer is not counted).
X = a^[0], a^[1] = col(a1[1] ... a4[1]), a^[2] = ŷ
Number of nodes per layer: nx = n[0], n[1], ...
Parameters:
- W[1], b[1] of shapes (n[1], nx), (n[1], 1)
- W[2], b[2] of shapes (n[2], n[1]), (n[2], 1)
Neural network output (one sample x)
For each node i of layer 1:
- zi[1] = wi[1]Tx + bi[1]
- ai[1] = σ(zi[1])
Stacked per layer (over the n[l] nodes):
- z[1] = W[1]x + b[1], a[1] = σ(z[1])
- z[2] = W[2]a[1] + b[2], a[2] = σ(z[2])
where:
- W[l] stacks the row vectors w1[l]T ... w(n[l])[l]T, shape (n[l], n[l-1])
- x = a[0] = col(x1 ... xnx)
- b[l] = col(b1[l] ... b(n[l])[l])
- z[l] = col(z1[l] ... z(n[l])[l])
Neural network output (vectorized)
m samples, notation a[layer](sample)
X = [x(1) ... x(m)], shape (nx, m)
Z[1] = W[1]X + b[1], A[1] = σ(Z[1])
Z[2] = W[2]A[1] + b[2], A[2] = σ(Z[2])
where:
Z[i] = [z[i](1) ... z[i](m)], shape (n[i], m)
A[i] = [a[i](1) ... a[i](m)], shape (n[i], m)
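A minimal numpy sketch of this vectorized forward pass (toy sizes nx=3, n[1]=4, n[2]=1, m=5; the small random initialisation is an assumption):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

nx, n1, n2, m = 3, 4, 1, 5
X = np.random.randn(nx, m)          # A[0] = X, shape (nx, m)
W1, b1 = np.random.randn(n1, nx) * 0.01, np.zeros((n1, 1))
W2, b2 = np.random.randn(n2, n1) * 0.01, np.zeros((n2, 1))

Z1 = np.dot(W1, X) + b1             # (n1, m); b1 broadcasts over the m columns
A1 = sigmoid(Z1)
Z2 = np.dot(W2, A1) + b2            # (n2, m)
A2 = sigmoid(Z2)                    # ŷ for all m examples
print(Z1.shape, A2.shape)           # (4, 5) (1, 5)
```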