College 5 Flashcards

1
Q

What is the objective of gradient descent?

A

Move 𝚹 (parameters) in the direction that minimizes the loss function J(𝚹)

2
Q

How does batch gradient descent work?

A

Updates the parameters 𝚹 by calculating the gradients using the whole dataset.
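
The update above can be sketched in a few lines. This is a toy illustration (not from the lecture), minimising a hypothetical loss J(𝚹) = (𝚹 − 3)² whose gradient stands in for the full-dataset gradient:

```python
# Toy sketch: batch gradient descent on J(theta) = (theta - 3)^2,
# whose gradient over the "whole dataset" is here just 2 * (theta - 3).
def batch_gradient_descent(lr=0.1, steps=100):
    theta = 0.0
    for _ in range(steps):
        grad = 2 * (theta - 3)   # gradient computed from all data at once
        theta -= lr * grad       # move theta against the gradient
    return theta

print(batch_gradient_descent())  # converges towards the minimum at theta = 3
```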

3
Q

How does stochastic gradient descent work?

A

Updates the parameters 𝚹 by calculating the gradients using randomly selected samples (one at a time).

4
Q

What are the advantages of batch gradient descent?

A

less noisy, more precise, smoother path

5
Q

What are the advantages of stochastic gradient descent?

A

computationally fast, easier to fit in memory

6
Q

How does mini-batch gradient descent work?

A

Updates the parameters 𝚹 by calculating the gradients using "small" batches of samples (somewhere between stochastic and batch gradient descent).
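
A toy sketch of this (not the lecture's code), fitting a hypothetical model y = w·x on a tiny synthetic dataset; with batch_size = 1 it behaves like stochastic GD, and with batch_size = len(data) like batch GD:

```python
import random

# Toy sketch: mini-batch gradient descent for fitting y = w * x.
def minibatch_gd(data, lr=0.05, batch_size=2, epochs=200, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # new random mini-batches each epoch
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # gradient of the mean squared error over this mini-batch
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]  # true w = 2
print(minibatch_gd(data))
```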

7
Q

Name 6 optimisers

A
  • SGD
  • Momentum
  • NAG
  • Adagrad
  • Adadelta
  • RMSprop
8
Q

What are the limitations of (mini-batch) stochastic gradient descent?

A

● It might get stuck in local minima.
● Choosing a proper learning rate can be difficult.
● The same learning rate applies to all parameter updates.

9
Q

What is the purpose of exponentially weighted averages?

A

To filter/smooth noisy (temporal) data, reducing the chance of getting stuck in a local minimum.
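
This can be sketched as follows (a toy illustration, not from the lecture), using the standard recurrence v_t = β·v_{t−1} + (1 − β)·x_t:

```python
# Toy sketch: exponentially weighted average of a noisy sequence,
# v = beta * v + (1 - beta) * x at every timestep.
def ewa(values, beta=0.9):
    smoothed, v = [], 0.0
    for x in values:
        v = beta * v + (1 - beta) * x
        smoothed.append(v)
    return smoothed

noisy = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
print(ewa(noisy))  # fluctuates far less than the raw signal
```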

10
Q

Why is the filtered version of a noisy signal shifted to the right?

A

Each filtered value is mostly the previous average plus only a small fraction of the current sample, so the average needs time to catch up with changes in the signal; this lag shifts the filtered curve to the right.

11
Q

How does momentum work?

A

Updates are calculated using an exponentially weighted average of the gradients that were calculated for previous batches.
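
A toy sketch of this idea (not the lecture's code), on a hypothetical one-parameter loss J(𝚹) = 𝚹²; the velocity v is the exponentially weighted average of past gradients:

```python
# Toy sketch: gradient descent with momentum on J(theta) = theta^2.
def momentum_gd(lr=0.1, beta=0.9, steps=200):
    theta, v = 5.0, 0.0
    for _ in range(steps):
        grad = 2 * theta                    # gradient of theta^2
        v = beta * v + (1 - beta) * grad    # smooth the gradients over batches
        theta -= lr * v                     # update with the smoothed gradient
    return theta

print(momentum_gd())  # approaches the minimum at 0
```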

12
Q

What is the difference between standard and Nesterov momentum?

A

With Nesterov momentum you calculate the gradient at the look-ahead point you would reach by continuing in the direction you were already moving, instead of at your current position.
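
One common formulation of this can be sketched as follows (a toy illustration on J(𝚹) = 𝚹², not the lecture's exact notation); note the gradient is evaluated at the look-ahead point rather than at 𝚹:

```python
# Toy sketch: Nesterov momentum on J(theta) = theta^2 (one common variant).
def nesterov_gd(lr=0.1, beta=0.9, steps=200):
    theta, v = 5.0, 0.0
    for _ in range(steps):
        lookahead = theta - lr * beta * v   # where we would land by coasting
        grad = 2 * lookahead                # gradient at the look-ahead point
        v = beta * v + grad
        theta -= lr * v
    return theta

print(nesterov_gd())
```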

13
Q

Name three algorithms with adaptive learning rates

A
  • Adagrad
  • RMSprop
  • Adam
14
Q

How does adagrad work?

A

Adagrad uses a different learning rate for each parameter 𝚹i.

For each parameter, accumulate the sum of squared gradients: r ← r + g²
Update: 𝚹 ← 𝚹 − (learning rate / (δ + √r)) × g
where δ is a small constant to avoid division by zero.

r gets bigger over time, so the effective learning rate decreases.
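
The update can be sketched as follows; this is a toy illustration on a hypothetical one-parameter loss J(𝚹) = 𝚹², not the lecture's code:

```python
import math

# Toy sketch: Adagrad on J(theta) = theta^2, showing the accumulator r
# and the shrinking effective learning rate lr / (delta + sqrt(r)).
def adagrad(lr=0.5, delta=1e-8, steps=500):
    theta, r = 5.0, 0.0
    for _ in range(steps):
        grad = 2 * theta
        r += grad * grad                            # accumulate squared gradients
        theta -= lr / (delta + math.sqrt(r)) * grad # per-parameter step size
    return theta

print(adagrad())
```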

15
Q

Name an advantage and a disadvantage of Adagrad

A

+ It avoids the need for manual tuning of the learning rate.

− The accumulated sum of squared gradients can cause a fast decrease in the learning rate and make learning difficult.
− The sum of squared gradients needs to be stored in memory.
16
Q

What is the goal of RMSprop?

A

The goal is to improve on the monotonically decreasing learning rate of Adagrad

17
Q

What is the difference between Adagrad and RMSprop?

A

In Adagrad you divide by the square root of the accumulated sum of squared gradients; in RMSprop you divide by the square root of an exponentially weighted average of the squared gradients.
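
A toy sketch of RMSprop (not the lecture's code), again on a hypothetical J(𝚹) = 𝚹²; compare the update of r with Adagrad's ever-growing sum:

```python
import math

# Toy sketch: RMSprop on J(theta) = theta^2. r is an exponentially weighted
# average of squared gradients, so the effective learning rate does not
# decay monotonically the way Adagrad's does.
def rmsprop(lr=0.01, beta=0.9, delta=1e-8, steps=2000):
    theta, r = 5.0, 0.0
    for _ in range(steps):
        grad = 2 * theta
        r = beta * r + (1 - beta) * grad * grad
        theta -= lr / (delta + math.sqrt(r)) * grad
    return theta

print(rmsprop())
```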

18
Q

What does Adam stand for?

A

Adaptive Moment Estimation

19
Q

How does Adam work?

A

It uses an exponentially weighted average of the gradient (first moment) and an exponentially weighted average of the squared gradient (second moment) to calculate the update. A bias correction is introduced to avoid a bias towards zero at the beginning of training.
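
This can be sketched as follows (a toy illustration on J(𝚹) = 𝚹² with common default hyperparameters, not the lecture's code); m_hat and v_hat are the bias-corrected moments:

```python
import math

# Toy sketch: Adam on J(theta) = theta^2, with first moment m, second
# moment v, and bias corrections that counteract the zero initialisation.
def adam(lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    theta, m, v = 5.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2 * theta
        m = beta1 * m + (1 - beta1) * grad         # EWA of gradients
        v = beta2 * v + (1 - beta2) * grad * grad  # EWA of squared gradients
        m_hat = m / (1 - beta1 ** t)               # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

print(adam())
```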

20
Q

Explain adadelta, adamax and nadam.

A

Adadelta: Extends Adagrad

Adamax: Extends Adam

Nadam: Combines Adam and Nesterov

21
Q

What is an advantage of batch normalisation?

A

The influence of each input neuron is more equal.
It can prevent neurons from dying out.
It can speed up learning.

22
Q

How does batch normalisation work?

A

Take the activation of each neuron and subtract the mean activation over the whole batch. Then divide by the square root of the variance plus a small epsilon to avoid division by zero; this gives the z-norm.
Next, multiply by a parameter gamma and add beta; both values are learned. This is done because if you have a lot of values around zero with SD = 1, you can get stuck in the linear part of an S-shaped activation function.

In practice, during training the mean and standard deviation are calculated from training batches and then applied to the test set.
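
The per-batch computation can be sketched as follows (a toy illustration for one neuron's activations across a batch, not the lecture's code):

```python
import math

# Toy sketch: batch normalisation of one neuron's activations over a batch,
# followed by the learned scale (gamma) and shift (beta).
def batch_norm(activations, gamma=1.0, beta=0.0, eps=1e-5):
    n = len(activations)
    mean = sum(activations) / n
    var = sum((a - mean) ** 2 for a in activations) / n
    z_norm = [(a - mean) / math.sqrt(var + eps) for a in activations]
    return [gamma * z + beta for z in z_norm]

batch = [2.0, 4.0, 6.0, 8.0]
print(batch_norm(batch))  # zero mean, unit variance (with gamma=1, beta=0)
```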

23
Q

Explain bias

A

Bias comes from assumptions made by a model to learn more easily.
● Low bias: fewer assumptions about the target function. (Flexible model, many parameters)
● High bias: more assumptions about the target function. (Simple model, few parameters)

24
Q

Explain variance

A

Variance relates to how an estimated function would change if different training data were used.
● Low variance: small changes with different training data. (Not memorizing noise)
● High variance: large changes with different training data. (Memorizing noise)

25
Q

Define regularisation

A

Changes in the learning of a model to make it satisfy certain constraints or preferences.

26
Q

Name examples of regularisation strategies

A

● Constraints on weights

● Additional terms on the objective (Loss) function

27
Q

How does regularisation work?

A

A parameter norm penalty Ω is added to the objective function.

The objective function is normally the loss function.
Normally only the weights are regularised, not the biases.

28
Q

What is L2 regularisation also known as?

A

weight decay

29
Q

How does L2 regularisation work?

A

You add the sum of squared parameters to the loss function; this keeps the weights small.
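
A toy sketch of the penalised objective (not the lecture's code); the gradient of the penalty term, 2·λ·w, is what pulls each weight towards zero:

```python
# Toy sketch: L2-regularised loss = data loss + lambda * sum of w^2.
def l2_penalised_loss(data_loss, weights, lam=0.01):
    return data_loss + lam * sum(w * w for w in weights)

weights = [3.0, -2.0, 0.5]
print(l2_penalised_loss(1.0, weights))  # 1.0 + 0.01 * 13.25 = 1.1325
```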

30
Q

How does L1 regularisation work?

A

You add the sum of absolute values of the parameters to the loss function; this keeps the weights small.

Can lead to sparsity in the weight parameters.

31
Q

What is the goal of early stopping?

A

Keeping track of training and validation error to get the best performing model.

The best model is to be defined in terms of the performance metrics.

32
Q

How does dropout work?

A

It randomly removes units for each training iteration.

Dropout trains an ensemble consisting of all subnetworks that can be constructed by removing non-output units from an underlying base network.

33
Q

What are the advantages of using dropout?

A

Dropout encourages modular representations which work well in the absence of some parts of the network (Hinton et al 2012).

One advantage of dropout is that it is computationally very cheap.

34
Q

How is dropout implemented during training?

A

Sample a binary variable from a Bernoulli distribution with probability p_hidden, then multiply each unit's activation by the sampled variable.

35
Q

How is dropout implemented during testing?

A

● Maintain expected total input to a given unit at test time similar to the expected total input to that unit at train time (when some units were missing).
● Note: the alternative is to do inverse-scaling during training, and no scaling for testing.

The point is that each unit receives the same expected total input during training and testing.
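
Both phases can be sketched together (a toy illustration, not the lecture's code); here training multiplies each activation by a Bernoulli(p_keep) sample, and testing scales every activation by p_keep so the expected total input matches:

```python
import random

# Toy sketch of dropout: Bernoulli masking at train time,
# scaling by the keep probability at test time.
def dropout_train(activations, p_keep=0.8, seed=0):
    rng = random.Random(seed)
    return [a * (1 if rng.random() < p_keep else 0) for a in activations]

def dropout_test(activations, p_keep=0.8):
    return [a * p_keep for a in activations]

acts = [1.0, 2.0, 3.0, 4.0]
print(dropout_train(acts))  # some activations zeroed at random
print(dropout_test(acts))   # all activations scaled by p_keep
```

The alternative mentioned above (inverse scaling) would instead divide the surviving activations by p_keep during training and leave the test pass unscaled.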

36
Q

In what classification problem is data augmentation used the most?

A

object recognition

37
Q

How can you apply data augmentation in object recognition / images?

A
● translation
● crop
● scaling
● rotation
● flipping
● adding noise
38
Q

What is the vanishing and exploding gradients problem?

A

In the network you keep multiplying by weights and summing. With large weights, the output, the loss, and the gradients can get very big, which makes the updates in back-propagation very big -> exploding gradients.

With very small weights you get very small activations and gradients, so the network barely learns anything -> vanishing gradients.

39
Q

How can you solve the vanishing and exploding gradients problem?

A

Better initialisation of the parameters.

A good practice to initialize weights is by drawing values from a normal distribution with zero mean and a specific variance.
The most used methods are:
● Xavier
● He

40
Q

How does the Xavier method work?

A

The variance is given by:
1 / fan_in
Where fan_in corresponds to the number of incoming neurons.

41
Q

How does the He method work?

A

The variance is given by:
2 / fan_in
Where fan_in corresponds to the number of incoming neurons.
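
Both schemes can be sketched in one helper (a toy illustration, not the lecture's code): draw weights from a zero-mean normal with variance 1/fan_in (Xavier) or 2/fan_in (He):

```python
import math
import random

# Toy sketch: Xavier / He weight initialisation for one layer.
def init_weights(fan_in, fan_out, method="xavier", seed=0):
    rng = random.Random(seed)
    var = (1.0 if method == "xavier" else 2.0) / fan_in
    std = math.sqrt(var)  # gauss() takes a standard deviation, not a variance
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]

w = init_weights(fan_in=100, fan_out=4, method="he")
print(len(w), len(w[0]))  # 4 rows of 100 weights
```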

42
Q

When should you use Xavier and when should you use He?

A

Xavier works better with sigmoid or tanh activation functions.

He works better with the ReLU activation function.