College 5 Flashcards
(42 cards)
What is the objective of gradient descent?
Move θ (the parameters) in the direction that minimizes the loss function J(θ)
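A minimal sketch of the update rule θ ← θ − α∇J(θ), assuming a toy loss J(θ) = θ² with gradient 2θ (an illustrative example, not from the cards):

```python
import numpy as np

# Gradient descent on a toy loss J(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([5.0])
learning_rate = 0.1

for step in range(100):
    grad = 2 * theta                      # gradient of J at the current theta
    theta = theta - learning_rate * grad  # move against the gradient

print(theta)  # close to the minimum at 0
```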
How does batch gradient descent work?
Updates the parameters θ by calculating the gradients using the whole dataset.
How does stochastic gradient descent work?
Updates the parameters θ by calculating the gradients using randomly selected samples (one at a time).
What are the advantages of batch gradient descent?
less noisy, more precise, smoother path
What are the advantages of stochastic gradient descent?
computationally fast, easier to fit in memory
How does mini-batch gradient descent work?
Updates the parameters θ by calculating the gradients using "small" batches of samples (somewhere between Stochastic and Batch Gradient Descent).
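A rough sketch of how the batch size ties the three variants together, assuming a toy linear-regression setup chosen here for illustration: batch_size = len(X) gives batch gradient descent, batch_size = 1 gives stochastic gradient descent, and anything in between is mini-batch.

```python
import numpy as np

# Toy linear regression y = X @ w; batch_size controls the GD variant:
# batch_size = len(X) -> batch GD, batch_size = 1 -> SGD, otherwise mini-batch.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
learning_rate, batch_size = 0.1, 16

for epoch in range(50):
    idx = rng.permutation(len(X))                # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        err = X[batch] @ w - y[batch]
        grad = X[batch].T @ err / len(batch)     # MSE gradient on this batch
        w -= learning_rate * grad

print(w)  # close to true_w
```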
Name 6 optimisers
- SGD
- momentum
- NAG
- Adagrad
- Adadelta
- RMSprop
What are the limitations of (mini-batch) stochastic gradient descent?
- It might get stuck in local minima.
- Choosing a proper learning rate can be difficult.
- The same learning rate applies to all parameter updates.
What is the purpose of exponentially weighted averages?
To filter/smooth noisy (temporal) data and reduce the chance of getting stuck in a local minimum
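A minimal sketch of an exponentially weighted average, v_t = β·v_{t−1} + (1 − β)·x_t, applied to an assumed noisy sine signal:

```python
import numpy as np

# Exponentially weighted average: v_t = beta * v_{t-1} + (1 - beta) * x_t.
# Higher beta -> smoother but more lagged (right-shifted) curve.
def ewa(signal, beta=0.9):
    v, smoothed = 0.0, []
    for x in signal:
        v = beta * v + (1 - beta) * x
        smoothed.append(v)
    return np.array(smoothed)

noisy = np.sin(np.linspace(0, 6, 200)) + np.random.normal(0, 0.3, 200)
print(ewa(noisy)[:5])
```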
Why is the filtered version of a noisy signal shifted to the right?
The average takes the previous timestep into account and only adds a small fraction of the current value, so it takes time for changes in the underlying signal to show up in the average.
How does momentum work?
updates are calculated using an exponentially weighted average of the gradients that were calculated for previous batches
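A minimal sketch of a momentum update, where the velocity v is the exponentially weighted average of the gradients; the toy gradient 2θ is an assumed example:

```python
# Momentum sketch: v is an exponentially weighted average of past gradients.
def momentum_step(theta, v, grad, learning_rate=0.1, beta=0.9):
    v = beta * v + (1 - beta) * grad    # smoothed gradient direction
    return theta - learning_rate * v, v

# Toy usage on J(theta) = theta^2 (gradient 2 * theta), an assumed example.
theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = momentum_step(theta, v, grad=2 * theta)
print(theta)  # approaches 0
```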
What is the difference between standard and Nesterov momentum?
With Nesterov momentum you calculate the gradient assuming you keep on moving in the direction you were moving instead of where you are.
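A minimal sketch of a Nesterov-style step, where the gradient is evaluated at the look-ahead position rather than at the current θ; grad_fn and the toy gradient are assumed for illustration, and exact formulations vary:

```python
# Nesterov momentum sketch: evaluate the gradient at the "look-ahead"
# position theta - learning_rate * beta * v instead of at theta itself.
def nesterov_step(theta, v, grad_fn, learning_rate=0.1, beta=0.9):
    lookahead = theta - learning_rate * beta * v   # where we would end up
    v = beta * v + (1 - beta) * grad_fn(lookahead)
    return theta - learning_rate * v, v

theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = nesterov_step(theta, v, grad_fn=lambda t: 2 * t)
print(theta)  # approaches 0
```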
Name three algorithms with adaptive learning rates
- Adagrad
- RMSprop
- Adam
How does adagrad work?
Adagrad uses a different learning rate for each parameter θi.
The update is: new θ = previous θ − (learning rate / (δ + √r)) × gradient, where r is the accumulated sum of squared gradients per parameter and δ is a small constant to avoid division by zero.
r gets bigger over time, so the effective learning rate decreases.
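A minimal sketch of the Adagrad update described above; the toy gradient 2θ is an assumed example:

```python
import numpy as np

# Adagrad sketch: r accumulates squared gradients per parameter, so each
# parameter gets its own, steadily shrinking, effective learning rate.
def adagrad_step(theta, r, grad, learning_rate=0.5, delta=1e-8):
    r = r + grad ** 2                                    # accumulate g^2
    theta = theta - learning_rate / (delta + np.sqrt(r)) * grad
    return theta, r

theta, r = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(500):
    theta, r = adagrad_step(theta, r, grad=2 * theta)
print(theta)  # moves towards 0, ever more slowly as r grows
```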
Name advantages and disadvantages of Adagrad
+ It avoids the need for manual tuning of the learning rate
- The accumulated sum of squared gradients may cause the learning rate for some parameters to decrease too quickly, making learning difficult
- The sum of squared gradients needs to be stored in memory
What is the goal of RMSprop?
The goal is to improve on the monotonically decreasing learning rate of Adagrad
What is the difference between Adagrad and RMSprop?
In Adagrad you divide by the square root of the accumulated sum of squared gradients; in RMSprop you divide by the square root of an exponentially weighted average of the squared gradients.
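A minimal RMSprop sketch, showing r as an exponentially weighted average of the squared gradients instead of a running sum; the toy gradient is an assumed example:

```python
import numpy as np

# RMSprop sketch: unlike Adagrad's ever-growing sum, r is an exponentially
# weighted average of the squared gradients, so the learning rate can recover.
def rmsprop_step(theta, r, grad, learning_rate=0.01, rho=0.9, delta=1e-8):
    r = rho * r + (1 - rho) * grad ** 2
    theta = theta - learning_rate / (delta + np.sqrt(r)) * grad
    return theta, r

theta, r = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(1000):
    theta, r = rmsprop_step(theta, r, grad=2 * theta)
print(theta)  # settles near the minimum at 0
```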
What does Adam stand for?
Adaptive Moment Estimation
How does Adam work?
It uses an exponentially weighted average of the gradient and an exponentially weighted average of the squared gradient to calculate the update. A correction is introduced to avoid bias towards zero at the beginning of training.
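A minimal Adam sketch with the bias correction on both moment estimates; the toy gradient and hyperparameter values are assumed for illustration:

```python
import numpy as np

# Adam sketch: m and v are exponentially weighted averages of the gradient
# and the squared gradient; m_hat and v_hat correct their bias towards zero.
def adam_step(theta, m, v, grad, t, learning_rate=0.1,
              beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad             # first moment
    v = beta2 * v + (1 - beta2) * grad ** 2        # second moment
    m_hat = m / (1 - beta1 ** t)                   # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.array([5.0, -3.0]), np.zeros(2), np.zeros(2)
for t in range(1, 1001):
    theta, m, v = adam_step(theta, m, v, grad=2 * theta, t=t)
print(theta)  # ends up close to the minimum at 0
```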
Explain adadelta, adamax and nadam.
Adadelta: Extends Adagrad
Adamax: Extends Adam
Nadam: Combines Adam and Nesterov
What are the advantages of batch normalisation?
- The influence of each input neuron is more equal.
- It can prevent neurons from dying out.
- It can speed up learning.
How does batch normalisation work?
You take the activation of each neuron and subtract the mean activation over the whole batch, then divide by the square root of the variance plus a small epsilon to avoid division by zero. This gives the normalised activation z_norm.
Next you multiply z_norm by a parameter gamma and add a parameter beta; both values are learned. This is done because if all values sit around zero with SD = 1, you can get stuck in the linear part of an s-shaped activation function.
In practice, the mean and standard deviation are estimated from the training batches during training, and those estimates are then used at test time.
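A minimal sketch of batch normalisation for one layer's activations; gamma and beta are fixed here for illustration, whereas in practice they are learned:

```python
import numpy as np

# Batch normalisation sketch for activations `a` of shape (batch_size, n_neurons).
def batch_norm(a, gamma, beta, eps=1e-5):
    mean = a.mean(axis=0)                    # per-neuron mean over the batch
    var = a.var(axis=0)                      # per-neuron variance over the batch
    z_norm = (a - mean) / np.sqrt(var + eps) # normalised activations
    return gamma * z_norm + beta             # learned scale and shift

a = np.random.normal(loc=3.0, scale=2.0, size=(32, 4))
out = batch_norm(a, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))     # roughly 0 and 1 per neuron
```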
Explain bias
Bias comes from assumptions made by a model to make learning easier.
- Low bias: fewer assumptions about the target function (a more flexible model with many parameters).
- High bias: more assumptions about the target function (a simpler model with too few parameters).
Explain variance
Variance relates to how the estimated function would change if different training data were used.
- Low variance: small changes with different training data (not memorizing noise).
- High variance: large changes with different training data (memorizing noise).