Flashcards in DL Fundamentals Deck (23)

1

## Representational learning

###
Engineering representations is hard - requires technical and domain expertise

- Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification (LeCun et al., 2015)

2

## List some activation functions

###
- Sigmoid: g(z) = 1 / (1 + e^(-z))

- Tanh: g(z) = tanh(z)

- ReLU: g(z) = max(0, z)
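A quick sketch of the sigmoid from the card, with tanh and ReLU added here for comparison; pure Python, no dependencies.

```python
import math

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)); squashes any input into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    # squashes input into (-1, 1); zero-centred, unlike sigmoid
    return math.tanh(z)

def relu(z):
    # g(z) = max(0, z); cheap to compute, common in hidden layers
    return max(0.0, z)
```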

3

## Loss functions

###
1. Squared error

2. Log loss

4

## Squared error loss function

### 1/2 * (M(d) - t)^2

5

## Log loss function

### -(t * log(M(d)) + (1 - t) * log(1 - M(d)))
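The two loss cards above, sketched for a single training instance with model output m = M(d) and target t:

```python
import math

def squared_error(m, t):
    # 1/2 * (M(d) - t)^2
    return 0.5 * (m - t) ** 2

def log_loss(m, t):
    # -(t*log(M(d)) + (1 - t)*log(1 - M(d))); assumes 0 < m < 1
    return -(t * math.log(m) + (1 - t) * math.log(1 - m))
```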

6

## Difference between loss and cost function

###
- Loss function = measure of the prediction error on a single training instance

- Cost function = measure of the average prediction error across a set of training instances

- Cost functions allow us to add in regularization
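A minimal sketch of the loss/cost distinction: loss scores one instance, cost averages over the set, and an (illustrative) L2 penalty enters at the cost level. The `lambda_` name is an assumption here, not from the cards.

```python
def loss(m, t):
    # per-instance squared-error loss
    return 0.5 * (m - t) ** 2

def cost(ms, ts, w, lambda_=0.0):
    # average loss across the set, plus an optional L2 regularization term
    avg = sum(loss(m, t) for m, t in zip(ms, ts)) / len(ms)
    return avg + lambda_ * w ** 2
```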

7

## Gradient Descent Algorithm

###
1. Choose random weights

2. Until convergence

- Set all gradient sums to zero

- For each training instance

* Calculate the model output

* Calculate the loss

* Update the gradient sum for each weight and bias

- Update the weights and biases using the weight update rule

8

## Backpropagation Algorithm

###
- Werbos (1974)

- Not widely used until 1986

1. Initialize the weights to small random values (scaled by fan-in)

2. Feedforward phase - feed input data through the network from the inputs to the outputs

3. Update the training error for the network (based on target values for all output nodes)

4. Error propagation phase - Feed error values back through the network, adjusting the weights along the way

5. Repeat from 2 until the error values are sufficiently small or some other stopping condition

9

## Forward pass algo

###
Require: L, network depth

Require: W[i], i is an element of {1...L}, weight matrices for each layer

Require: b[i], i is an element of {1...L}, bias terms for each layer

Require: d, input descriptive features

Require: t, target features

a[0] = d

for i = 1 to L:

z[i] = W[i]*a[i-1] + b[i]

a[i] = g[i](z[i])

M(d) = a[L]

Calculate L(M(d), t)
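The forward-pass pseudocode can be sketched in NumPy; the 2-layer sigmoid network, layer sizes, and random seed below are illustrative choices, not part of the card.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(W, b, d, g):
    """W, b, g: weight matrices, bias vectors, and activation functions
    for layers 1..L; d: input features. Returns the activations per layer."""
    a = [d]                          # a[0] = d
    for l in range(len(W)):
        z = W[l] @ a[l] + b[l]       # z[i] = W[i]*a[i-1] + b[i]
        a.append(g[l](z))            # a[i] = g[i](z[i])
    return a                         # a[-1] is the model output M(d)

# illustrative 2 -> 3 -> 1 network
rng = np.random.default_rng(0)
W = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
b = [np.zeros(3), np.zeros(1)]
out = forward_pass(W, b, np.array([0.5, -0.2]), [sigmoid, sigmoid])[-1]
```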

10

## Backward Propagation algo

###
Require: A forward pass of network

Require: L, network depth

Require: W[i], i is an element of {1...L}, weight matrices for each layer

Require: b[i], i is an element of {1...L}, bias terms for each layer

Require: t, target features

Calculate da[L] # derivative of the loss function

for i = L to 1:

dz[i] = da[i] * g[i]'(z[i])

dW[i] = dz[i] * a[i-1]^T

db[i] = dz[i]

da[i-1] = W[i]^T * dz[i]
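A NumPy sketch of the backward-pass pseudocode, assuming sigmoid layers and squared-error loss (so da[L] = a[L] - t); the tiny 2 -> 3 -> 1 network used to exercise it is illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backward_pass(W, zs, a, t):
    """W: weight matrices for layers 1..L; zs, a: pre-activations and
    activations cached during the forward pass; t: target.
    Returns the per-layer gradients dW, db."""
    L = len(W)
    dW, db = [None] * L, [None] * L
    da = a[-1] - t                         # da[L] for 1/2*(a[L] - t)^2
    for l in reversed(range(L)):
        s = sigmoid(zs[l])
        dz = da * s * (1 - s)              # dz[i] = da[i] * g[i]'(z[i])
        dW[l] = np.outer(dz, a[l])         # dW[i] = dz[i] * a[i-1]^T
        db[l] = dz                         # db[i] = dz[i]
        da = W[l].T @ dz                   # da[i-1] = W[i]^T * dz[i]
    return dW, db

# forward pass on an illustrative network, caching zs and a
rng = np.random.default_rng(1)
W = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
b = [np.zeros(3), np.zeros(1)]
a, zs = [np.array([0.4, -0.7])], []
for l in range(2):
    zs.append(W[l] @ a[l] + b[l])
    a.append(sigmoid(zs[l]))
dW, db = backward_pass(W, zs, a, np.array([1.0]))
```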

11

## Stochastic/Online GD

###
- Choose random weights

- Until convergence

- Shuffle all training instances

- For each training instance:

* Perform f/w pass

* Perform b/w pass

* Update weights and biases using update rule

12

## Stochastic GD Update Rule

###
W[i] = W[i] - αdW[i]

b[i] = b[i] - αdb[i]
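The stochastic update rule above, sketched for plain-Python lists of per-layer weight and bias values (scalars here for simplicity); alpha is the learning rate.

```python
def sgd_update(W, b, dW, db, alpha):
    # apply W[i] = W[i] - α*dW[i] and b[i] = b[i] - α*db[i] for every layer
    for i in range(len(W)):
        W[i] = W[i] - alpha * dW[i]
        b[i] = b[i] - alpha * db[i]
    return W, b

W, b = sgd_update([1.0], [0.5], [0.2], [0.1], alpha=0.1)
```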

13

## GD Batch

###
- Choose random weights

- Until convergence

- Set all gradient sums to 0

- For each training instance:

* Perform f/w pass

* Perform b/w pass

* Update gradient sum for each weight and bias term

- Update weights and biases using update rule

14

## Batch GD Update rule

###
W[i] = W[i] - α(1/m)Σ(from j=1 to m) dW[i]j

b[i] = b[i] - α(1/m)Σ(from j=1 to m) db[i]j
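The batch rule in one line of Python: average the m accumulated per-instance gradients before applying the learning rate, shown here for a single scalar parameter.

```python
def batch_update(w, grad_sum, m, alpha):
    # w = w - α*(1/m)*Σ dW_j, where grad_sum is the sum over m instances
    return w - alpha * grad_sum / m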

15

## GD Mini-batch

###
- Choose random weights

- Until convergence

- Divide the training set into mini-batches of size s

- For each mini-batch D(mb)

* Set all gradient sums = 0

* For each training instance in D(mb)

- Perform f/w pass

- Perform b/w pass

- Update gradient sum for each weight and bias term

* Update weights and biases using the update rule

16

## GD Mini-batch update rule

###
W[i] = W[i] - α(1/s)Σ(from j=1 to s) dW[i]j

b[i] = b[i] - α(1/s)Σ(from j=1 to s) db[i]j
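A sketch of the mini-batch mechanics: shuffle and split the training set into batches of size s, then average the s per-instance gradients when applying the update (scalar parameter shown for simplicity; the `seed` parameter is an addition for reproducibility).

```python
import random

def minibatches(data, s, seed=0):
    # divide the (shuffled) training set into mini-batches of size s
    data = list(data)
    random.Random(seed).shuffle(data)
    return [data[i:i + s] for i in range(0, len(data), s)]

def minibatch_update(w, grads, alpha):
    # w = w - α*(1/s)*Σ dW_j over the per-instance gradients in this batch
    return w - alpha * sum(grads) / len(grads)

batches = minibatches(range(10), 3)
```

Note the last batch may be smaller than s when s does not divide the dataset size, which is why the update divides by `len(grads)` rather than a fixed s.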

17

## Stochastic GD - Advantages

###
- Easy to implement

- Fast learning

18

## Stochastic GD - Disadvantages

###
- Noisy gradient signal

- Computationally expensive

19

## Batch GD - Advantages

###
- Computationally efficient

- Stable gradient signal

20

## Batch GD - Disadvantages

###
- Requires gradient accumulation

- Premature convergence

- Involves loading large datasets into memory

- Thus can be slow

21

## Mini-batch GD - Advantages

###
- Relatively computationally efficient

- Does not require full datasets to be loaded into memory

- Stable gradient signal

22

## Mini-batch GD - Disadvantages

###
- Requires gradient accumulation

- Another hyper-parameter - minibatch size

23