C1 NN & DL Flashcards Preview

Flashcards in C1 NN & DL Deck (21)
1

ReLU

Rectified Linear Unit
Activation function and a breakthrough over sigmoid: sigmoid's gradient is ~0 for large |z|, which slows training, whereas ReLU keeps a constant gradient for z > 0

2

Neuron and notation

x -> o -> y
Labeled data: m examples (x, y)

3

Neural Network

Several densely connected layers of nodes that learn the relation between input x and output y

4

NN types and applications

Standard NN - structured data
CNN - unstructured data (image)
RNN - temporal data (audio, language)

5

Binary classification

(x, y), x in R^nx, y in {0,1}
m training examples
m = mtrain (vs mtest); training set = {(x(1), y(1)), ..., (x(m), y(m))}
X = [x(1) ... x(m)] in R^(nx × m)
Y = [y(1) ... y(m)] in R^(1 × m)

6

Logistic regression (problem)

Algorithm for binary classification
Predict ŷ = P(y=1 | x), x in R^nx
Parameters w in R^nx, b in R
ŷ = σ(wTx + b) = σ(z), where z = wTx + b is linear in x; 0 < ŷ < 1 (ŷ = 0.5 at z = 0), σ(z) = 1/(1+e^-z)
Goal: learn w, b so that ŷ(i) ≈ y(i)
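A minimal NumPy sketch of the prediction step (the helper names sigmoid and predict, and the shapes, are assumptions for illustration):

import numpy as np

def sigmoid(z):
    # σ(z) = 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, x):
    # w: (nx, 1) weights, b: scalar bias, x: (nx, 1) input
    z = np.dot(w.T, x) + b   # linear part z = wTx + b
    return sigmoid(z)        # ŷ = P(y=1 | x), strictly between 0 and 1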

7

Logistic regression (loss and cost functions)

Given ŷ = P(y=1 | x) for {(x1, y1),...,(xm,ym)}, we want ŷ≈y
Loss/error function ℒ(ŷ, y), to be minimised, measures how good ŷ is on a single training example.
We want to maximise P(y | x): if y = 1, P(y | x) = ŷ; if y = 0, P(y | x) = 1 - ŷ
So P(y | x) = ŷ^y (1-ŷ)^(1-y) (log is strictly monotonically increasing, so maximising x <=> maximising log x)
logP(y | x) = ylogŷ + (1-y)log(1-ŷ)
ℒ(ŷ,y) = -(ylogŷ + (1-y)log(1-ŷ)) (- because we minimise the loss)
Cost function defines the cost of the parameters
P(Y | X) = ∏(i=1 > m) P(y(i) | x(i))
logP(Y | X) = ∑logP(y(i) | x(i))
J(w,b) = 1/m ∑ℒ(ŷ(i),y(i)) (the negative log-likelihood scaled by 1/m, to be minimised)
(assuming the training examples are i.i.d. - independent and identically distributed, and log∏x = ∑logx)
We look for w, b minimising J(w,b)
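A hedged NumPy sketch of the loss and cost, assuming A holds the predictions ŷ as a (1, m) array and Y the labels:

import numpy as np

def cost(A, Y):
    # A: (1, m) predictions ŷ, Y: (1, m) labels in {0, 1}
    m = Y.shape[1]
    losses = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))  # per-example loss ℒ(ŷ, y)
    return np.sum(losses) / m                            # J(w, b) = 1/m ∑ ℒ(ŷ(i), y(i))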

8

Gradient descent

J is convex, hence there is a global optimum
Repeat
w = w - ɑ ∂J(w,b)/∂w
b = b - ɑ ∂J(w,b)/∂b
With ɑ learning rate
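A tiny runnable illustration of the update rule on a toy convex cost J(w, b) = (w - 3)^2 + (b + 1)^2 (the toy cost, starting point and learning rate are assumptions, not from the course):

w, b, alpha = 0.0, 0.0, 0.1
for _ in range(100):
    dw = 2 * (w - 3)        # ∂J/∂w for the toy cost
    db = 2 * (b + 1)        # ∂J/∂b for the toy cost
    w = w - alpha * dw
    b = b - alpha * db
# w -> 3, b -> -1: the global optimum of the convex toy cost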

9

Derivatives (general, log)

f'(a) = lim(h→0) [f(a+h) - f(a)] / h (how much a small nudge to a changes f(a))
log'(x) = 1/x
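A quick numerical check of the definition against log'(x) = 1/x, assuming a = 2 and a small h:

import numpy as np

a, h = 2.0, 1e-6
numerical = (np.log(a + h) - np.log(a)) / h   # [f(a+h) - f(a)] / h
analytic = 1.0 / a                            # log'(a) = 1/a
print(numerical, analytic)                    # both ≈ 0.5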

10

Computational Graph for J(a,b,c) = 3(a + bc)

Forward vs backward propagation.
Useful to optimise an output variable J through intermediate variables: a left-to-right pass computes J, a right-to-left pass efficiently computes the derivatives using the chain rule.
u = bc, v = a+u, J = 3v
dJ/du = dJ/dv * dv/du
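A sketch of the forward (left-to-right) and backward (right-to-left) passes for J(a, b, c) = 3(a + bc); the sample values are assumptions:

a, b, c = 5.0, 3.0, 2.0

# forward pass
u = b * c            # u = bc
v = a + u            # v = a + u
J = 3 * v            # J = 3v

# backward pass (chain rule)
dJ_dv = 3.0          # dJ/dv
dJ_du = dJ_dv * 1.0  # dJ/du = dJ/dv * dv/du
dJ_da = dJ_dv * 1.0  # dJ/da = dJ/dv * dv/da
dJ_db = dJ_du * c    # dJ/db = dJ/du * du/db
dJ_dc = dJ_du * b    # dJ/dc = dJ/du * du/dc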

11

Gradient Descent for Logistic Regression

xi, wi, b -> z = wTx + b -> ŷ = σ(z) -> ℒ(ŷ,y)
∂L/∂ŷ = -y/ŷ + (1-y)/(1-ŷ)
∂L/∂z = ∂L/∂ŷ.∂ŷ/∂z = ŷ(1-ŷ)∂L/∂ŷ = ŷ - y
∂L/∂wi = ∂L/∂z.∂z/∂wi = xi*(ŷ - y)
∂L/∂b = ∂L/∂z.∂z/∂b = ŷ - y
∂J(w,b)/∂wi = 1/m ∑ ∂ℒ(ŷ(i),y(i))/∂wi (average of the per-example gradients)
Algorithm (A holds ŷ for all m samples; avoid explicit for loops!):
Z = w.T X + b
A = σ(Z)
dZ = A - Y
dw = 1/m X dZ.T
db = 1/m np.sum(dZ)
w = w - ɑ dw
b = b - ɑ db
(Single iteration)
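A runnable NumPy sketch of one vectorized iteration; the random data X, Y, the dimensions and the learning rate are assumptions:

import numpy as np

nx, m, alpha = 4, 100, 0.01
X = np.random.randn(nx, m)                      # inputs, shape (nx, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)  # labels, shape (1, m)
w, b = np.zeros((nx, 1)), 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one gradient descent iteration, no explicit loop over the m examples
Z = np.dot(w.T, X) + b     # (1, m)
A = sigmoid(Z)             # ŷ for all m examples
dZ = A - Y                 # (1, m)
dw = np.dot(X, dZ.T) / m   # (nx, 1)
db = np.sum(dZ) / m        # scalar
w = w - alpha * dw
b = b - alpha * db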

12

Numpy (broadcasting, shapes, rank)

broadcasting: (m,n) +-*/ (1,n) -> (m,n)
for the smaller array, each dimension must either match or be 1.
[a b c] may be a rank-1 array of shape (3,). Use .reshape(1,3) to make an explicit row vector! (otherwise e.g. .T has no effect)
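A short NumPy illustration of both points (the array values are assumptions):

import numpy as np

# broadcasting: (m, n) op (1, n) -> (m, n)
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])      # shape (2, 3)
r = np.array([[10.0, 20.0, 30.0]])   # shape (1, 3)
print(A + r)                         # r is broadcast over the 2 rows

# rank-1 pitfall: shape (3,) vs explicit row vector (1, 3)
v = np.array([1.0, 2.0, 3.0])        # shape (3,), rank 1
print(v.T.shape)                     # still (3,): transpose has no effect
row = v.reshape(1, 3)
print(row.T.shape)                   # (3, 1): a proper column vector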

13

Neural networks representation

Input layer, hidden layer, output layer (counted as a 2-layer NN: the input layer is not counted)
X = a^[0]
a^[1] = column(a1[1] ... a4[1])
a^[2] = ŷ
Number of nodes nx=n[0], n[1]...
Parameters:
- W[1], b[1]: shapes (n[1], n[0]=nx) and (n[1], 1)
- W[2], b[2]: shapes (n[2], n[1]) and (n[2], 1)

14

Neural network output (one sample x)

For each node i:
- zi[1] = wi[1]Tx+bi[1]
- ai[1] = σ(zi[1])
Stacked per layer (over the nodes):
- z[1] = W[1]x + b[1], a[1] = σ(z[1])
- z[2] = W[2]a[1] + b[2], a[2] = σ(z[2])
where:
- W[i] stacks the n[i] row vectors wj[i]T (j = 1..n[i]), shape (n[i], n[i-1])
- x = a[0] = col(x1 ... xnx), shape (nx, 1)
- b[i] = col(b1[i] ... bn[i][i]), shape (n[i], 1)
- z[i] = col(z1[i] ... zn[i][i]), shape (n[i], 1)

15

Neural network output (vectorized)

m samples; notation a[l](i) = activations of layer l for sample i
X = [x(1) ... x(m)] (nx,m)
Z[1] = W[1]X+b[1], A[1]=σ(Z[1])
Z[2] = W[2]A[1]+b[2], A[2]=σ(Z[2])
Where:
Z[i]=[z[i](1) ... z[i](m)] (n[i],m)
A[i]=[a[i](1) ... a[i](m)] (n[i],m)
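A hedged sketch of the vectorized forward pass for a 2-layer network; the layer sizes and the random inputs/parameters are assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

nx, n1, n2, m = 3, 4, 1, 5                 # n[0], n[1], n[2], number of samples
X = np.random.randn(nx, m)                 # X = [x(1) ... x(m)], shape (nx, m)
W1, b1 = np.random.randn(n1, nx) * 0.01, np.zeros((n1, 1))
W2, b2 = np.random.randn(n2, n1) * 0.01, np.zeros((n2, 1))

Z1 = np.dot(W1, X) + b1   # (n1, m); b1 is broadcast over the m columns
A1 = sigmoid(Z1)          # (n1, m)
Z2 = np.dot(W2, A1) + b2  # (n2, m)
A2 = sigmoid(Z2)          # (n2, m), the predictions ŷ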

16

Activation functions

Sigmoid: a = 1/(1+e^-z), range (0, 1), value 0.5 at z = 0
Hyperbolic tangent: tanh(z) = (e^z - e^-z)/(e^z + e^-z), a shifted/rescaled sigmoid, range (-1, 1), value 0 at z = 0
ReLU: a = max(0, z), not differentiable at 0
Leaky ReLU: a = max(0.01z, z)
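A minimal NumPy sketch of the four activations:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # range (0, 1), 0.5 at z = 0

def tanh(z):
    return np.tanh(z)                 # range (-1, 1), 0 at z = 0

def relu(z):
    return np.maximum(0, z)           # max(0, z)

def leaky_relu(z):
    return np.maximum(0.01 * z, z)    # max(0.01z, z)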

17

Why non-linear activation fct

A linear activation is useless in hidden layers because a composition of linear functions is still linear (no new function is discovered). It can be used at the output layer to predict a real value (e.g. a price).

18

Derivatives activation fct

Sigmoid: a(1-a)
Tanh: 1-a^2
ReLU: 0 if z < 0, 1 if z > 0 (undefined at 0)
Leaky ReLU: 0.01 if z < 0, 1 if z > 0 (undefined at 0)
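A matching NumPy sketch of the derivatives (written in terms of a = g(z) for sigmoid/tanh and of z for the ReLUs; returning 0 resp. 0.01 at z = 0 is a common convention, assumed here):

import numpy as np

def dsigmoid(a):
    return a * (1 - a)                   # a = σ(z)

def dtanh(a):
    return 1 - a ** 2                    # a = tanh(z)

def drelu(z):
    return np.where(z > 0, 1.0, 0.0)     # 0 for z < 0, 1 for z > 0

def dleaky_relu(z):
    return np.where(z > 0, 1.0, 0.01)    # 0.01 for z < 0, 1 for z > 0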

19

Backpropagation neural network

dZ[2]=A[2]-Y, dW[2]=1/m.dZ[2]A[1]T, db[2]=1/mΣdZ[2]
dZ[1]=W[2]TdZ[2]*g[1]’(Z[1]), dW[1]=1/m.dZ[1]XT, db[1]=1/mΣdZ[1]

NumPy sum for db: np.sum(dZ, axis=1, keepdims=True)
Use computation graph with L(a,y)
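A self-contained NumPy sketch of one forward + backward pass for a 2-layer network; the layer sizes, random data and the choice g[1] = tanh, g[2] = σ are assumptions:

import numpy as np

nx, n1, m = 3, 4, 5
X = np.random.randn(nx, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)
W1, b1 = np.random.randn(n1, nx) * 0.01, np.zeros((n1, 1))
W2, b2 = np.random.randn(1, n1) * 0.01, np.zeros((1, 1))

# forward pass
Z1 = np.dot(W1, X) + b1
A1 = np.tanh(Z1)                                # g[1] = tanh
Z2 = np.dot(W2, A1) + b2
A2 = 1.0 / (1.0 + np.exp(-Z2))                  # g[2] = σ

# backward pass
dZ2 = A2 - Y                                    # (1, m)
dW2 = np.dot(dZ2, A1.T) / m                     # (1, n1)
db2 = np.sum(dZ2, axis=1, keepdims=True) / m    # keepdims avoids rank-1 arrays
dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)         # g[1]'(Z1) = 1 - A1^2 for tanh
dW1 = np.dot(dZ1, X.T) / m                      # (n1, nx)
db1 = np.sum(dZ1, axis=1, keepdims=True) / m    # (n1, 1)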

20

Gradient descent neural network

Cost function J(W[1], b[1], W[2], b[2]) = 1/m Σℒ(ŷ, y)

Repeat:
Compute predictions ŷ(i)
Compute gradients dW[1], db[1], dW[2], db[2]
Update param = param - α·dparam

21

Random initialization

For a NN, initialising the weights w to 0 does not work: the hidden units are symmetric and will compute the same function. We usually use small random values, to avoid ~0 gradients with sigmoid/tanh.
np.random.randn(2, 2) * 0.01 (randn takes the dimensions as separate arguments, not a tuple)
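A sketch of the corrected initialization for a 2-layer network; the layer sizes are assumptions:

import numpy as np

n0, n1, n2 = 2, 2, 1                  # nx, hidden units, output units
W1 = np.random.randn(n1, n0) * 0.01   # small random values break symmetry
b1 = np.zeros((n1, 1))                # biases can safely start at 0
W2 = np.random.randn(n2, n1) * 0.01
b2 = np.zeros((n2, 1))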