DL Fundamentals Flashcards

Flashcards in DL Fundamentals Deck (23)

1

Representational learning

Engineering representations is hard - requires technical and domain expertise

- Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification (LeCun et al., 2015)

2

List some activation functions

- Sigmoid (logistic): g(z) = 1 / (1 + e^(-z))
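
A minimal NumPy sketch of the sigmoid above; tanh and ReLU are added here only as further common examples, not from the original card:

import numpy as np

def sigmoid(z):
    # Logistic activation: squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Hyperbolic tangent: squashes input into (-1, 1)
    return np.tanh(z)

def relu(z):
    # Rectified linear unit: passes positive values through, zeroes out negatives
    return np.maximum(0.0, z)

print(sigmoid(np.array([-2.0, 0.0, 2.0])))   # approx [0.119, 0.5, 0.881]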

3

Loss functions

1. Squared error
2. Log loss

4

Squared error loss function

1/2 * (M(d) - t)^2

5

Log loss function

-(t * log(M(d)) + (1 - t) * log(1 - M(d)))
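
A small sketch of both loss functions from the last two cards; the prediction/target values are made-up examples:

import numpy as np

def squared_error(prediction, target):
    # 1/2 * (M(d) - t)^2 for a single training instance
    return 0.5 * (prediction - target) ** 2

def log_loss(prediction, target, eps=1e-12):
    # -(t * log(M(d)) + (1 - t) * log(1 - M(d))) for a single instance, t in {0, 1}
    p = np.clip(prediction, eps, 1.0 - eps)   # avoid log(0)
    return -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))

print(squared_error(0.8, 1.0))   # approx 0.02
print(log_loss(0.8, 1.0))        # approx 0.223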

6

Difference between loss and cost function

- Loss function = measure of the prediction error on a single training instance
- Cost function = measure of the average prediction error across a set of training instances
- Cost functions allow us to add in regularization
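
A tiny illustration of the distinction, assuming squared-error loss, made-up predictions/targets/weights, and an assumed L2 penalty strength lam:

import numpy as np

predictions = np.array([0.9, 0.2, 0.7])    # hypothetical model outputs
targets     = np.array([1.0, 0.0, 1.0])    # hypothetical targets
weights     = np.array([0.5, -1.2, 0.3])   # hypothetical model weights
lam = 0.01                                 # assumed regularization strength

losses = 0.5 * (predictions - targets) ** 2         # loss: one value per training instance
cost = losses.mean() + lam * np.sum(weights ** 2)   # cost: average loss + L2 regularization term
print(losses, cost)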

7

Gradient Descent Algorithm

1. Choose random weights
2. Until convergence:
   - Set all gradient sums to zero
   - For each training instance:
     - Calculate the model output
     - Calculate the loss
     - Update the gradient sum for each weight and bias
   - Update the weights and biases using the weight update rule
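
A minimal sketch of this procedure for a single linear neuron with squared-error loss; the toy data, learning rate, and epoch count are assumptions:

import numpy as np

# Toy data: roughly y = 2x (assumed example)
X = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([0.1, 2.1, 3.9, 6.0])

w, b = np.random.randn(), 0.0            # 1. choose random weights
alpha = 0.1                              # learning rate (assumed)

for epoch in range(1000):                # 2. repeat "until convergence" (fixed epochs here)
    dw, db = 0.0, 0.0                    # set all gradient sums to zero
    for x_i, t_i in zip(X, t):           # for each training instance
        y_i = w * x_i + b                # calculate the model output M(d)
        loss = 0.5 * (y_i - t_i) ** 2    # calculate the loss
        dw += (y_i - t_i) * x_i          # update gradient sum for the weight
        db += (y_i - t_i)                # ... and for the bias
    w -= alpha * dw / len(X)             # update weight using the update rule
    b -= alpha * db / len(X)             # update bias using the update rule

print(w, b)                              # w should end up near 2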

8

Backpropagation Algorithm

- Werbos (1974)
- Not widely used until 1986

1. Initialize the weights to small random values (e.g., in a range based on the fan-in of each node)
2. Feedforward phase - feed the input data through the network from the inputs to the outputs
3. Update the training error for the network (based on the target values for all output nodes)
4. Error propagation phase - feed the error values back through the network, adjusting the weights along the way
5. Repeat from 2 until the error values are sufficiently small or some other stopping condition is met

9

Forward pass algo

Require: L, network depth
Require: W[i], i is an element of {1...L}, weight matrices for each layer
Require: b[i], i is an element of {1...L}, bias terms for each layer
Require: d, input descriptive features
Require: t, target features

a[0] = d
for i = 1 to L:
    z[i] = W[i] * a[i-1] + b[i]
    a[i] = g[i](z[i])

M(d) = a[L]
Calculate the loss L(M(d), t)
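
A NumPy sketch of this forward pass, assuming sigmoid activations at every layer and a made-up 2-3-1 network:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(d, weights, biases):
    # Feed one input column vector d through an L-layer network
    a = d                                   # a[0] = d
    for W, b in zip(weights, biases):       # for i = 1 to L
        z = W @ a + b                       # z[i] = W[i] a[i-1] + b[i]
        a = sigmoid(z)                      # a[i] = g[i](z[i])
    return a                                # M(d) = a[L]

# Hypothetical 2-3-1 network with random weights
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases  = [np.zeros((3, 1)), np.zeros((1, 1))]

d = np.array([[0.5], [-1.0]])               # one training instance
t = np.array([[1.0]])                       # its target
M_d = forward_pass(d, weights, biases)
loss = -(t * np.log(M_d) + (1 - t) * np.log(1 - M_d))   # log loss L(M(d), t)
print(M_d, loss)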

10

Backward Propagation algo

Require: A forward pass of network
Require: L, network depth
Require: W[i], i is an element of {1...L}, weight matrices for each layer
Require: b[i], i is an element of {1...L}, bias terms for each layer
Require: t, target features

Calculate da[L]   # derivative of the loss function with respect to a[L]
for i = L down to 1:
    dz[i] = da[i] * g[i]'(z[i])   # element-wise product
    dW[i] = dz[i] * a[i-1]^T
    db[i] = dz[i]
    da[i-1] = W[i]^T * dz[i]
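
A NumPy sketch of one forward plus backward pass for the same assumed sigmoid-activated 2-3-1 network, with log loss at the output:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(d, weights, biases):
    # Forward pass, caching a[i] and z[i] for the backward pass
    activations, pre_activations = [d], []
    a = d
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
        pre_activations.append(z)
        activations.append(a)
    return activations, pre_activations

def backward(activations, pre_activations, weights, t):
    # Backward pass for sigmoid layers with log loss at the output
    grads_W, grads_b = [], []
    a_L = activations[-1]
    da = -(t / a_L) + (1 - t) / (1 - a_L)          # da[L]: derivative of the log loss
    for i in reversed(range(len(weights))):        # for i = L down to 1
        g_prime = sigmoid(pre_activations[i]) * (1 - sigmoid(pre_activations[i]))
        dz = da * g_prime                          # dz[i] = da[i] * g[i]'(z[i])
        grads_W.insert(0, dz @ activations[i].T)   # dW[i] = dz[i] a[i-1]^T
        grads_b.insert(0, dz)                      # db[i] = dz[i]
        da = weights[i].T @ dz                     # da[i-1] = W[i]^T dz[i]
    return grads_W, grads_b

# Hypothetical 2-3-1 network, one training instance
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases  = [np.zeros((3, 1)), np.zeros((1, 1))]
d, t = np.array([[0.5], [-1.0]]), np.array([[1.0]])

activations, pre_activations = forward(d, weights, biases)
grads_W, grads_b = backward(activations, pre_activations, weights, t)
print([g.shape for g in grads_W])   # [(3, 2), (1, 3)]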

11

Stochastic/Online GD

- Choose random weights
- Until convergence:
  - Shuffle all training instances
  - For each training instance:
    * Perform a forward pass
    * Perform a backward pass
    * Update the weights and biases using the update rule

12

Stochastic GD Update Rule

W[i] = W[i] - αdW[i]
b[i] = b[i] - αdb[i]
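
A sketch of the stochastic GD loop and this update rule, using logistic regression on made-up data; the learning rate and epoch count are assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up binary classification data: 20 instances, 2 descriptive features
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))
t = (X[:, 0] + X[:, 1] > 0).astype(float)

W, b = rng.standard_normal(2), 0.0
alpha = 0.1                              # learning rate (assumed)

for epoch in range(50):                  # until convergence (fixed epochs here)
    for i in rng.permutation(len(X)):    # shuffle, then visit each training instance
        y = sigmoid(W @ X[i] + b)        # forward pass
        dW = (y - t[i]) * X[i]           # backward pass: log-loss gradient for this instance
        db = y - t[i]
        W = W - alpha * dW               # W[i] = W[i] - α dW[i]
        b = b - alpha * db               # b[i] = b[i] - α db[i]

print(W, b)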

13

GD Batch

- Choose random weights
- Until convergence:
  - Set all gradient sums to 0
  - For each training instance:
    * Perform a forward pass
    * Perform a backward pass
    * Update the gradient sum for each weight and bias term
  - Update the weights and biases using the update rule

14

Batch GD Update rule

W[i] = W[i] - α(1/m)Σ(from j=1 to m) dW[i]_j
b[i] = b[i] - α(1/m)Σ(from j=1 to m) db[i]_j
(m = number of training instances)
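
The same logistic-regression setup as the stochastic GD sketch, but with gradients summed over the whole training set and one averaged update per pass (learning rate and epoch count assumed):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))            # made-up data, as before
t = (X[:, 0] + X[:, 1] > 0).astype(float)

W, b = rng.standard_normal(2), 0.0
alpha, m = 0.5, len(X)                      # learning rate (assumed), m = number of instances

for epoch in range(200):                    # until convergence (fixed epochs here)
    dW_sum, db_sum = np.zeros(2), 0.0       # set all gradient sums to 0
    for j in range(m):                      # for each training instance
        y = sigmoid(W @ X[j] + b)           # forward pass
        dW_sum += (y - t[j]) * X[j]         # backward pass: accumulate gradients
        db_sum += y - t[j]
    W = W - alpha * dW_sum / m              # W[i] = W[i] - α (1/m) Σ dW[i]_j
    b = b - alpha * db_sum / m              # b[i] = b[i] - α (1/m) Σ db[i]_j

print(W, b)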

15

GD Mini-batch

- Choose random weights
- Until convergence:
  - Divide the training set into mini-batches of size s
  - For each mini-batch D(mb):
    * Set all gradient sums to 0
    * For each training instance in D(mb):
      - Perform a forward pass
      - Perform a backward pass
      - Update the gradient sum for each weight and bias term
    * Update the weights and biases using the update rule

16

GD Mini-batch update rule

W[i] = W[i] - α(1/s)Σ(from j=1 to s) dW[i]_j
b[i] = b[i] - α(1/s)Σ(from j=1 to s) db[i]_j
(s = mini-batch size)
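
The mini-batch variant under the same assumptions as the earlier sketches; note that s = 1 recovers stochastic GD and s = m recovers batch GD:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))              # made-up data, as before
t = (X[:, 0] + X[:, 1] > 0).astype(float)

W, b = rng.standard_normal(2), 0.0
alpha, s = 0.5, 4                             # learning rate and mini-batch size (assumed)

for epoch in range(100):                      # until convergence (fixed epochs here)
    order = rng.permutation(len(X))
    for start in range(0, len(X), s):         # divide the training set into mini-batches
        batch = order[start:start + s]        # one mini-batch D(mb)
        dW_sum, db_sum = np.zeros(2), 0.0     # set all gradient sums to 0
        for j in batch:                       # for each instance in D(mb)
            y = sigmoid(W @ X[j] + b)
            dW_sum += (y - t[j]) * X[j]
            db_sum += y - t[j]
        W = W - alpha * dW_sum / s            # W[i] = W[i] - α (1/s) Σ dW[i]_j
        b = b - alpha * db_sum / s            # b[i] = b[i] - α (1/s) Σ db[i]_j

print(W, b)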

17

Stochastic GD - Advantages

- Easy to implement
- Fast learning

18

Stochastic GD - Disadvantages

- Noisy gradient signal
- Computationally expensive

19

Batch GD - Advantages

- Computationally efficient
- Stable gradient signal

20

Batch GD - Disadvantages

- Requires gradient accumulation
- Premature convergence
- Involves loading large datasets into memory, so it can be slow

21

Mini-batch GD - Advantages

- Relatively computationally efficient
- Does not require full datasets to be loaded into memory
- Stable gradient signal

22

Mini-batch GD - Disadvantages

- Requires gradient accumulation
- Adds another hyper-parameter: the mini-batch size

23

Talk about representation learning in the context of classification tasks

- Higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations

From the Nature deep learning paper (LeCun, Bengio & Hinton, 2015): a key aspect of deep learning is that layers of features are not designed by human engineers; they are learned from data using general-purpose learning procedures