Week 5: Deep Discriminative Neural Networks Flashcards

1
Q

Deep Neural Networks

A

Neural networks with more than 3 layers, typically many more than 3. Deep neural networks are better at learning a hierarchy of representations with increasingly high levels of abstraction. In recurrent neural networks, connections take the output of every hidden neuron and feed it back in as additional input at the next iteration.

2
Q

Vanishing Gradient Problem

A

When doing backpropagation, the error gradient tends to get smaller as it is propagated backwards through the hidden layers. This means that neurons in earlier layers learn much more slowly than neurons in later layers.
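One way to see why (assuming sigmoid activations, which the card itself does not state): the sigmoid's derivative satisfies

\sigma'(x) = \sigma(x)(1 - \sigma(x)) \le 1/4

so each layer multiplies the backpropagated error by a factor of at most 1/4; after 10 layers that factor is at most (1/4)^{10} \approx 10^{-6}.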

3
Q

Exploding Gradient Problem

A

If weights are initialised to, or learn, large values, then the gradient gets larger as we move back through the hidden layers. As such, neurons in earlier layers often make large, random changes in their weights. Later layers can’t learn due to the constantly changing output of earlier layers.

4
Q

Activation Functions w/ Non-Vanishing Derivatives

A

These are activation functions whose derivatives do not shrink towards zero over most of their input range, so error gradients can still propagate back through many layers. Examples: ReLU, LReLU, PReLU.

5
Q

Backpropagation Stabilisation

A

To stabilise backpropagation, the following techniques are used:

  • Activation functions with non-vanishing gradients.
  • Better ways to initialise weights
  • Adaptive variations of standard backpropagation
  • Batch normalisation
  • Skip connections
  • Making sure datasets are very large and labelled
  • Making sure there are large computational resources
6
Q

Rectified Linear Unit (ReLU)

A

\varphi(net_j) = \begin{cases} net_j & \text{if } net_j \ge 0 \\ 0 & \text{if } net_j < 0 \end{cases}

7
Q

Leaky Rectified Linear Unit (LReLU)

A

\varphi(net_j) = \begin{cases} net_j & \text{if } net_j \ge 0 \\ a \times net_j & \text{if } net_j < 0 \end{cases}, where 0 < a < 1, with a usually being very close to 0

8
Q

Parametric Rectified Linear Unit (PReLU)

A

\varphi(net_j) = \begin{cases} net_j & \text{if } net_j \ge 0 \\ a \times net_j & \text{if } net_j < 0 \end{cases}, where a is a learned parameter
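A minimal NumPy sketch of the three rectifier variants above (the value of a shown here is illustrative; for PReLU it would be learned by backpropagation):

  import numpy as np

  def relu(net):
      # max(0, net): passes positive inputs through, zeroes out negative ones
      return np.maximum(0.0, net)

  def leaky_relu(net, a=0.01):
      # a small fixed slope a on negative inputs keeps the gradient non-zero
      return np.where(net >= 0, net, a * net)

  def prelu(net, a):
      # same form as leaky ReLU, but a is a parameter learned during training
      return np.where(net >= 0, net, a * net)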

9
Q

Xavier/Glorot Weight Initialisation

A

Activation Function: sigmoid/tanh

Given m inputs to n outputs, choose weights from a uniform distribution with the range

(- \sqrt{6/(m+n)}, \sqrt{6/(m+n)}), or from a normal distribution with mean = 0 and std = \sqrt{2/(m+n)}
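A minimal NumPy sketch of the uniform variant, assuming the layer's weights are stored as an (m, n) matrix (that layout is an assumption, not from the card):

  import numpy as np

  def xavier_uniform(m, n):
      # m inputs, n outputs: sample weights uniformly from (-limit, +limit)
      limit = np.sqrt(6.0 / (m + n))
      return np.random.uniform(-limit, limit, size=(m, n))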

10
Q

Kaiming/He Weight Initialisation

A

Activation Function: ReLU / LReLU / PReLU

Given m inputs to n outputs, choose weights from a uniform distribution with the range

(- \sqrt{6/m}, \sqrt{6/m}), or from a normal distribution with mean = 0 and std = \sqrt{2/m}
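A minimal NumPy sketch of the normal variant, again assuming an (m, n) weight matrix (an illustrative layout):

  import numpy as np

  def he_normal(m, n):
      # m inputs: sample weights from a normal distribution with std = sqrt(2/m)
      std = np.sqrt(2.0 / m)
      return np.random.normal(0.0, std, size=(m, n))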

11
Q

Adaptive Versions of Backpropagation

A

Regular backpropagation can fail in the following ways:

  • Gradient too low leads to slow learning
  • Gradient too large leads to failure to find optimal parameters
  • Learning rate too low leads to many iterations for little performance gain.
  • Learning rate too large leads to failure to find optimal parameters.

Introducing momentum, which accumulates past gradients into the update, allows the optimiser to climb out of local minima. A further adaptation is to vary the learning rate during training: increase the learning rate while the cost is decreasing, and decrease it when the cost is increasing.
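A minimal sketch of a gradient-descent weight update with momentum (the velocity buffer and hyperparameter values are illustrative assumptions, not taken from the card); it works elementwise on NumPy arrays or plain floats:

  def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
      # accumulate an exponentially decaying sum of past gradients (the velocity),
      # then move the weights along the velocity rather than the raw gradient
      velocity = beta * velocity - lr * grad
      return w + velocity, velocity

  # e.g. inside the training loop: w, v = momentum_step(w, grad, v)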

12
Q

Backpropagation Algorithms with Adaptive Learning

A

AdaGrad, RMSprop

13
Q

Backpropagation Algorithms with Adaptive Learning and Momentum

A

ADAM, Nadam

14
Q

Batch Normalisation

A

The output of each neuron is scaled so that it has a mean close to 0 and a standard deviation close to 1.

BN(x) = \beta + \gamma \frac{x - E(x)}{\sqrt{Var(x) + \epsilon}}

\beta, \gamma are parameters learnt through backpropagation

\epsilon is a small constant used to prevent division-by-zero errors

Batch normalisation can be applied before or after an activation function. The result is that values are kept within the range where the activation function's gradient is non-zero.
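A minimal NumPy sketch of the forward computation over one mini-batch (training-time statistics only; the running averages used at test time are omitted):

  import numpy as np

  def batch_norm(x, beta, gamma, eps=1e-5):
      # x: activations for one mini-batch, shape (batch_size, num_features)
      mean = x.mean(axis=0)
      var = x.var(axis=0)
      x_hat = (x - mean) / np.sqrt(var + eps)  # mean ~ 0, std ~ 1 per feature
      return beta + gamma * x_hat              # beta, gamma learnt by backprop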

15
Q

Skip Connections

A

These are connections that skip 1 or more layers of the network; networks built from them are also called residual networks. Skip connections let gradients bypass parts of the network where they have vanished, so the network effectively becomes shallower, though this may only be temporary.
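A minimal sketch of a skip (residual) connection around a generic layer, assuming the layer preserves the shape of its input (layer_fn is a placeholder name, not from the card):

  import numpy as np

  def residual_block(x, layer_fn):
      # output = layer(x) + x: the identity path lets gradients flow straight
      # through even where layer_fn's own gradient has vanished
      return layer_fn(x) + x

  # e.g. residual_block(x, lambda h: np.maximum(0.0, h @ W)) for a ReLU layer
  # with a square weight matrix W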

16
Q

Convolutional Neural Network

A

It’s a deep neural network that has at least 1 layer whose transfer function is implemented by convolution/cross-correlation, as opposed to vector multiplication. This allows for better tolerance to changes in spatial and temporal location. In such a layer, a neuron’s weights are defined as an array, called a mask/filter/kernel.
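A minimal NumPy sketch of a single-channel 2D cross-correlation with no padding and a stride of 1, which is the operation such a layer applies:

  import numpy as np

  def cross_correlate2d(image, kernel):
      # slide the kernel over the image, taking element-wise products and summing
      ih, iw = image.shape
      kh, kw = kernel.shape
      oh, ow = ih - kh + 1, iw - kw + 1  # output size with no padding, stride 1
      out = np.zeros((oh, ow))
      for r in range(oh):
          for c in range(ow):
              out[r, c] = np.sum(image[r:r+kh, c:c+kw] * kernel)
      return out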

17
Q

Padding

A

This is when 0’s are added around the edges of the input so that the mask can be applied at the borders. This increases the size of the output.

18
Q

Stride

A

This defines the number of pixels moved between calculations of the output. This decreases the size of the output.

19
Q

Pooling

A

This is when a collection of neighbouring locations are merged together; this can be done by taking the maximum value (Max Pooling) or the average (Average Pooling). This decreases the size of the output.

20
Q

Output Dimension Formula

A

output_dim = 1 + (input_dim - mask_dim + 2 \times padding)/stride
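A quick helper that applies the formula, with a couple of worked examples (the specific sizes are illustrative):

  def output_dim(input_dim, mask_dim, padding, stride):
      # e.g. input 32, mask 5, padding 2, stride 1 -> 1 + (32 - 5 + 4)/1 = 32
      return 1 + (input_dim - mask_dim + 2 * padding) // stride

  print(output_dim(32, 5, 2, 1))  # 32
  print(output_dim(28, 3, 0, 2))  # 1 + (28 - 3)//2 = 13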

21
Q

Gradient Clipping

A

A technique which involves calculating the norm of the, seeing if it exceeds a predefined threshold \theta, and then scaling the gradients down if necessary. The primary benefit of gradient clipping is the prevention of the exploding gradient problem. \theta is a predefined parameter. If set too low, \theta may hinder the network’s ability to learn effectively by restricting the gradient updates too much.