C1 NN & DL Flashcards

1
Q

ReLU

A

Rectified Linear Unit

Activation function; a breakthrough over sigmoid, whose near-zero gradient for large |z| slows training, whereas ReLU keeps a gradient of 1 for z > 0

2
Q

Neuron and notation

A

x -> o -> ŷ (input -> single neuron -> prediction)

Labeled data: m pairs (x, y)

3
Q

Neural Network

A

Several layers of densely connected nodes that learn the relations between inputs and outputs

4
Q

NN types and applications

A

Standard NN - structured data
CNN - unstructured data (image)
RNN - temporal data (audio, language)

5
Q

Binary classification

A
(x, y), x in R^nx, y in {0,1}
m training examples: m = m_train (vs m_test)
Training set = {(x(1),y(1)), ..., (x(m),y(m))}
X = [x(1) ... x(m)] in R^(nx×m) (one example per column)
Y = [y(1) ... y(m)] in R^(1×m)
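A minimal numpy sketch of this layout (shapes only; the values are made up for illustration):

import numpy as np

nx, m = 4, 100                                 # number of features, number of training examples
X = np.random.rand(nx, m)                      # each column x(i) is one training example
Y = (np.random.rand(1, m) > 0.5).astype(int)   # labels in {0, 1}, shape (1, m)
print(X.shape, Y.shape)                        # (4, 100) (1, 100)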
6
Q

Logistic regression (problem)

A
Algorithm for binary classification
Predict ŷ = P(y=1 | x), x in R^nx
Parameters w in R^nx, b in R
ŷ = σ(wTx+b) = σ(z), the sigmoid of a linear function of x; 0 < ŷ < 1 (ŷ = 0.5 at z = 0), σ(z) = 1/(1+e^-z)
Goal: learn w, b so that ŷ(i) ≈ y(i)
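A minimal sketch of the prediction step, assuming w (nx, 1), b (scalar) and a single example x (nx, 1) are already given:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, x):
    # y_hat = P(y=1 | x) = sigmoid(w.T x + b)
    z = np.dot(w.T, x) + b
    return sigmoid(z)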
7
Q

Logistic regression (loss and cost functions)

A

Given ŷ = P(y=1 | x) for {(x(1), y(1)), …, (x(m), y(m))}, we want ŷ ≈ y
Loss/error function ℒ(ŷ,y): tells how good ŷ is for a single training example; we minimise it.
We want to maximise P(y | x): if y=1, P(y | x) = ŷ; if y=0, P(y | x) = 1 - ŷ
So P(y | x) = ŷ^y (1-ŷ)^(1-y) (log is strictly monotonically increasing, so maximising x <=> maximising log x)
log P(y | x) = y log ŷ + (1-y) log(1-ŷ)
ℒ(ŷ,y) = -(y log ŷ + (1-y) log(1-ŷ)) (minus sign because we minimise the loss)
Cost function: the cost of the parameters over the whole training set
P(Y | X) = ∏(i=1 to m) P(y(i) | x(i))
log P(Y | X) = ∑ log P(y(i) | x(i))
J(w,b) = 1/m ∑ ℒ(ŷ(i),y(i)) (1/m is a scaling factor; minimising J maximises the log-likelihood)
(assuming training examples are i.i.d. - independent and identically distributed - and log ∏ x = ∑ log x)
We look for w, b minimising J(w,b)
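A minimal numpy sketch of the cost, assuming A holds the predictions ŷ for all m examples:

import numpy as np

def cost(A, Y):
    # A, Y: shape (1, m); cross-entropy cost J(w, b)
    m = Y.shape[1]
    return -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m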

8
Q

Gradient descent

A
J is convex for logistic regression, hence there is a single global optimum
Repeat
w = w - ɑ ∂J(w,b)/∂w
b = b - ɑ ∂J(w,b)/∂b
With ɑ learning rate
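A minimal, self-contained sketch of the update rule on a toy convex cost J(w,b) = w^2 + b^2 (chosen only to illustrate; its gradients ∂J/∂w = 2w, ∂J/∂b = 2b are worked out by hand):

alpha = 0.1          # learning rate
w, b = 5.0, -3.0     # arbitrary starting point
for _ in range(100):
    dw, db = 2 * w, 2 * b                       # dJ/dw, dJ/db for J = w^2 + b^2
    w, b = w - alpha * dw, b - alpha * db
print(w, b)          # both close to 0, the global optimum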
9
Q

Derivatives (general, log)

A
f'(a) = lim(h→0) [f(a+h) - f(a)] / h (how much a small push on the input changes the output)
log'(x) = 1/x (natural log)
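A minimal numerical check of both facts, using a small h to approximate the limit:

import numpy as np

def numerical_derivative(f, a, h=1e-6):
    # finite-difference approximation of f'(a)
    return (f(a + h) - f(a)) / h

print(numerical_derivative(np.log, 2.0))   # ~0.5 = 1/2, matching log'(x) = 1/x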
10
Q

Computational Graph for J(a,b,c) = 3(a + bc)

A

Forward vs backward propagation.
Useful to optimise an output variable J: introduce intermediate variables, do a left-to-right pass to compute J, then a right-to-left pass to efficiently compute its derivatives with the chain rule.
u = bc, v = a + u, J = 3v
e.g. dJ/du = dJ/dv · dv/du = 3 · 1 = 3
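A minimal sketch of the forward and backward passes on this graph (example values a=5, b=3, c=2):

a, b, c = 5.0, 3.0, 2.0

# forward pass (left to right)
u = b * c          # u = bc
v = a + u          # v = a + u
J = 3 * v          # J = 3v

# backward pass (right to left, chain rule)
dJ_dv = 3.0
dJ_du = dJ_dv * 1.0         # dv/du = 1
dJ_da = dJ_dv * 1.0         # dv/da = 1
dJ_db = dJ_du * c           # du/db = c
dJ_dc = dJ_du * b           # du/dc = b
print(dJ_da, dJ_db, dJ_dc)  # 3.0 6.0 9.0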

11
Q

Gradient Descent for Logistic Regression

A
x, w, b -> z = wTx + b -> ŷ = σ(z) -> ℒ(ŷ,y)
∂ℒ/∂ŷ = -y/ŷ + (1-y)/(1-ŷ)
∂ℒ/∂z = ∂ℒ/∂ŷ · ∂ŷ/∂z = ŷ(1-ŷ) · ∂ℒ/∂ŷ = ŷ - y
∂ℒ/∂wi = ∂ℒ/∂z · ∂z/∂wi = xi(ŷ - y)
∂ℒ/∂b = ∂ℒ/∂z · ∂z/∂b = ŷ - y
∂J(w,b)/∂wi = 1/m ∑ ∂ℒ(ŷ(i),y(i))/∂wi
Algorithm for a single iteration (A = ŷ over all samples, avoid explicit for loops!):
 Z = wTX + b
 A = σ(Z)
 dZ = A - Y
 dw = 1/m · X dZT
 db = 1/m · np.sum(dZ)
 w = w - ɑ·dw
 b = b - ɑ·db
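A minimal vectorized numpy sketch of one such iteration, assuming X has shape (nx, m), Y shape (1, m), w shape (nx, 1):

import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def gradient_descent_step(w, b, X, Y, alpha):
    m = X.shape[1]
    Z = np.dot(w.T, X) + b       # (1, m)
    A = sigmoid(Z)               # predictions y_hat
    dZ = A - Y                   # (1, m)
    dw = np.dot(X, dZ.T) / m     # (nx, 1)
    db = np.sum(dZ) / m          # scalar
    return w - alpha * dw, b - alpha * db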
12
Q

Numpy (broadcasting, shapes, rank)

A

broadcasting: (m,n) +-*/ (1,n) -> (m,n)
For the smaller array, each dimension must either match or be 1.
[a b c] may be a rank-1 array of shape (3,). Use .reshape(1,3) to make it a proper row vector [a b c]! (otherwise e.g. .T has no effect)
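A minimal demonstration of both points:

import numpy as np

A = np.ones((3, 4))
v = np.array([[1, 2, 3, 4]])      # shape (1, 4)
print((A + v).shape)              # (3, 4): the (1, 4) array is broadcast over the rows

r = np.array([1, 2, 3])           # rank-1, shape (3,)
print(r.T.shape)                  # (3,): .T has no effect on a rank-1 array
print(r.reshape(1, 3).T.shape)    # (3, 1): proper column vector after reshape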

13
Q

Neural networks representation

A
Input layer, hidden layer, output layer (counted as a 2-layer network: the input layer is not counted)
X = a^[0]
a^[1] = col(a1[1] ... a4[1]) (here 4 hidden units)
a^[2] = ŷ
Number of nodes per layer: nx = n[0], n[1], ...
Parameters:
- W[1] (n[1], n[0]), b[1] (n[1], 1)
- W[2] (n[2], n[1]), b[2] (n[2], 1)
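A minimal sketch of allocating parameters with these shapes (zeros only to show the shapes; see the random initialization card for why real init must be random):

import numpy as np

n0, n1, n2 = 3, 4, 1                            # nx = n[0] inputs, 4 hidden units, 1 output
W1, b1 = np.zeros((n1, n0)), np.zeros((n1, 1))
W2, b2 = np.zeros((n2, n1)), np.zeros((n2, 1))
print(W1.shape, b1.shape, W2.shape, b2.shape)   # (4, 3) (4, 1) (1, 4) (1, 1)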
14
Q

Neural network output (one sample x)

A
For each node i of layer 1:
 - zi[1] = wi[1]T x + bi[1]
 - ai[1] = σ(zi[1])
(Picture...)
Stacked per layer (one row per node):
 - z[1] = W[1]x + b[1], a[1] = σ(z[1])
 - z[2] = W[2]a[1] + b[2], a[2] = σ(z[2])
where:
 - W[l] = col(w1[l]T ... wn[l][l]T) (one row per node of layer l, so W[l] is (n[l], n[l-1]))
 - x = a[0] = col(x1 ... xnx)
 - b[l] = col(b1[l] ... bn[l][l]), shape (n[l], 1)
 - z[l] = col(z1[l] ... zn[l][l]), shape (n[l], 1)
15
Q

Neural network output (vectorized)

A
m samples; notation a[layer](sample): layer index in square brackets, sample index in parentheses
X = [x(1) ... x(m)] (nx,m)
Z[1] = W[1]X+b[1], A[1]=σ(Z[1])
Z[2] = W[2]A[1]+b[2], A[2]=σ(Z[2])
Where:
 Z[i]=[z[i](1) ... z[i](m)] (n[i],m)
 A[i]=[a[i](1) ... a[i](m)] (n[i],m)
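A minimal numpy sketch of this vectorized forward pass, reusing the parameter shapes above (b1, b2 broadcast over the m columns):

import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def forward(X, W1, b1, W2, b2):
    # X: (nx, m)
    Z1 = np.dot(W1, X) + b1     # (n[1], m)
    A1 = sigmoid(Z1)
    Z2 = np.dot(W2, A1) + b2    # (n[2], m)
    A2 = sigmoid(Z2)
    return Z1, A1, Z2, A2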
16
Q

Activation functions

A

Sigmoid: a = 1/(1+e^-z), range (0,1), a = 0.5 at z = 0
Hyperbolic tangent: tanh(z) = (e^z-e^-z)/(e^z+e^-z), a shifted/rescaled sigmoid, range (-1,1), a = 0 at z = 0
ReLU: a = max(0,z), not differentiable at 0
Leaky ReLU: a = max(0.01z, z)
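A minimal numpy sketch of the four activations:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)               # same as (e^z - e^-z)/(e^z + e^-z)

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z):
    return np.maximum(0.01 * z, z)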

17
Q

Why non-linear activation fct

A

A linear activation is useless in hidden layers: a composition of linear functions is still linear, so no new functions can be represented. It can be used at the output layer to predict a real value (e.g. a price).

18
Q

Derivatives activation fct

A

Sigmoid: g'(z) = a(1-a)
Tanh: g'(z) = 1 - a^2
ReLU: 0 if z < 0, 1 if z > 0 (undefined at z = 0)
Leaky ReLU: 0.01 if z < 0, 1 if z > 0 (undefined at z = 0)
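A minimal numpy sketch of these derivatives, taking a = g(z) where that is the natural argument:

import numpy as np

def dsigmoid(a):
    return a * (1.0 - a)              # a = sigmoid(z)

def dtanh(a):
    return 1.0 - a ** 2               # a = tanh(z)

def drelu(z):
    return (z > 0).astype(float)      # convention: 0 at z = 0

def dleaky_relu(z):
    return np.where(z > 0, 1.0, 0.01)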

19
Q

Backpropagation neural network

A

dZ[2] = A[2] - Y, dW[2] = 1/m · dZ[2] A[1]T, db[2] = 1/m Σ dZ[2]
dZ[1] = W[2]T dZ[2] * g[1]'(Z[1]) (element-wise product), dW[1] = 1/m · dZ[1] XT, db[1] = 1/m Σ dZ[1]

Numpy sum over the m samples: axis=1, keepdims=True
Derive using the computation graph of ℒ(a, y)
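A minimal numpy sketch of these formulas, assuming sigmoid hidden units (matching the forward sketch above), so g[1]'(Z1) = A1(1-A1):

import numpy as np

def backward(X, Y, A1, A2, W2):
    # X: (nx, m), Y and A2: (1, m), A1: (n[1], m)
    m = X.shape[1]
    dZ2 = A2 - Y
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(W2.T, dZ2) * (A1 * (1 - A1))   # sigmoid hidden layer: g'(Z1) = A1(1-A1)
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2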

20
Q

Gradient descent neural network

A

Cost function J(W[1], b[1], W[2], b[2]) = 1/m Σ ℒ(ŷ(i), y(i))

Repeat:
 - Compute predictions ŷ(i) (forward propagation)
 - Compute gradients dW[l], db[l] (backpropagation)
 - Update param = param - α·dparam
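A minimal sketch of one training iteration tying the previous cards together, assuming the forward and backward helpers sketched above and already-initialised parameters:

# one gradient descent iteration for the 2-layer network
Z1, A1, Z2, A2 = forward(X, W1, b1, W2, b2)
dW1, db1, dW2, db2 = backward(X, Y, A1, A2, W2)
W1, b1 = W1 - alpha * dW1, b1 - alpha * db1
W2, b2 = W2 - alpha * dW2, b2 - alpha * db2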

21
Q

Random initialization

A

Initialising the weights W to 0 does not work for a NN: the hidden units are then symmetric and all compute the same function. Use small random values so that, with sigmoid/tanh, z stays small and the gradient is not ~0:
np.random.randn(2, 2) * 0.01
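A minimal sketch of initialising a 2-layer network this way (layer sizes are arbitrary placeholders):

import numpy as np

n0, n1, n2 = 2, 4, 1                  # placeholder layer sizes
W1 = np.random.randn(n1, n0) * 0.01   # small random values break the symmetry
b1 = np.zeros((n1, 1))                # biases can start at zero
W2 = np.random.randn(n2, n1) * 0.01
b2 = np.zeros((n2, 1))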