05 - Optimization Flashcards

1
Q

Optimization: Short recap of Gradient Descent.

A

f → highly non-linear function that maps input data x to output data y using parameters (W,b)

L → loss that quantifies the difference between predictions and desired output, generally non-convex

g → gradient of L with respect to function parameters

In gradient descent we follow the negative gradient of L in small steps, starting from a random initialization
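
A minimal sketch of this update rule (W ← W − ε g) in Python/NumPy; the quadratic toy loss and all names here are illustrative assumptions, not from the card:

    import numpy as np

    def gradient_descent(grad_L, W0, lr=0.01, steps=100):
        """Follow the negative gradient of L in small steps from an initialization W0."""
        W = W0.copy()
        for _ in range(steps):
            g = grad_L(W)   # gradient of the loss at the current parameters
            W = W - lr * g  # small step against the gradient
        return W

    # Toy example: minimize L(W) = ||W||^2, whose gradient is 2W
    W_star = gradient_descent(lambda W: 2 * W, W0=np.random.randn(3), lr=0.1)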

2
Q

What types of Gradient Descent do we have?

A

Batch Gradient Descent, Stochastic Gradient Descent (SGD), and Mini-Batch Gradient Descent (mini-batch SGD)

3
Q

Explain Batch GD

A
  • All training data is taken into consideration for each step: the gradient is averaged over all training examples and that mean gradient is used to update the parameters (see the sketch below).
    • Pro: great for convex or smooth error manifolds (a polygon is convex if all interior angles are less than 180 degrees; likewise, convex error manifolds have no local minima or saddle points)
    • Con: infeasible or very slow for huge training data sets, and possibly redundant, since all examples are used even if many are near-duplicates
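
A minimal sketch of one batch-GD step, assuming a linear least-squares toy problem as a stand-in for the network (all names are illustrative assumptions):

    import numpy as np

    def batch_gd_step(W, X, y, lr=0.1):
        """One batch-GD step: the gradient is averaged over ALL training examples."""
        preds = X @ W                   # predictions for the whole training set
        g = X.T @ (preds - y) / len(y)  # mean gradient over every example
        return W - lr * g

    # Toy usage on a small synthetic regression problem
    X = np.random.randn(100, 3)
    y = X @ np.array([1.0, -2.0, 0.5])
    W = np.zeros(3)
    for _ in range(200):
        W = batch_gd_step(W, X, y)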
4
Q

Explain SGD

A
  • one example at a time is considered
  • One Epoch: take 1 example → feed to NN → calculate its gradient → update weights → repeat for all examples in the training dataset (see the sketch below)
  • The cost fluctuates a lot since we are only looking at one example at a time, but over the long run it should decrease
    • Pro: high variance (more exploration), faster per update, better suited to large data sets
    • Con: not very smooth, will never really reach a minimum but keeps dancing around it; a vectorized implementation is not possible (which can slow down computation)
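
A minimal sketch of one SGD epoch, again assuming a linear least-squares toy problem (illustrative, not from the card):

    import numpy as np

    def sgd_epoch(W, X, y, lr=0.01):
        """One epoch: shuffle the data, then update on ONE example at a time."""
        for i in np.random.permutation(len(y)):
            pred = X[i] @ W
            g = X[i] * (pred - y[i])  # gradient from a single example -> high variance
            W = W - lr * g            # update immediately, then move on to the next example
        return W

Because each update uses a single example, the loss curve is noisy but trends downward, matching the description above.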
5
Q

Explain Mini Batch GD

A
  • use mini-batches instead of all examples or just one example at a time
  • helps achieve the advantages of both former variants
  • One Epoch: pick a mini-batch → feed to NN → calculate the mean gradient of the batch → update weights → repeat for all created mini-batches (see the sketch below)

Mini-batch size:
Trade-off between:
- The fewer points we have, the higher the variance/exploration and the faster we get a gradient, so we learn faster
- More points make it smoother, because they stabilize the gradient estimates

We should also think about the data we have: e.g. for the number classes mentioned above, all images of a class are very much alike, so a batch size of 5 should be enough and 10 is probably not much better

Anders usually starts with a batch size of 16 or 32; of course, if we have 100 classes that might be too low
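
A minimal sketch of one mini-batch epoch, assuming the same kind of least-squares toy setup (the batch size and names are illustrative assumptions):

    import numpy as np

    def minibatch_epoch(W, X, y, lr=0.05, batch_size=32):
        """One epoch: shuffle, split into mini-batches, update on each batch's mean gradient."""
        idx = np.random.permutation(len(y))
        for start in range(0, len(y), batch_size):
            batch = idx[start:start + batch_size]
            preds = X[batch] @ W
            g = X[batch].T @ (preds - y[batch]) / len(batch)  # mean gradient of the mini-batch
            W = W - lr * g
        return W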

6
Q

Name some challenges with gradient descent.

A

Non-convexity of L:
- a polygon is convex if all interior angles are less than 180 degrees; likewise, convex error manifolds have no local minima or saddle points. If L is non-convex, we have local minima and saddle points

Steep cliffs:
- regions where L is very steep. These can cause exploding gradients and overshooting of the optimal point because the optimizer takes too-large steps

The empirical risk:
- we learn on the training set while caring about test-set performance
- we use a smooth (differentiable) loss like log loss while we actually care about the non-differentiable 0-1 classification loss

7
Q

What are typical learning rate schedulers and what do we use them for?

A

Common approaches for SGD:
- Constant LR
- Linear decay
- Piecewise constant or “staircase”
- Exponential decay

If the learning rate is too large close to the optimal point, the model bounces around chaotically and cannot settle down. If the learning rate is too small in the beginning, we risk getting stuck at critical points.

A common choice is a step function for the learning rate during training: whenever a milestone is reached, the learning rate is divided by some factor x (see the sketch below).
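
A minimal sketch of such a staircase schedule in plain Python; the milestone epochs and the divisor are illustrative assumptions:

    def staircase_lr(epoch, base_lr=0.1, milestones=(30, 60, 90), factor=10):
        """Piecewise-constant ('staircase') schedule: divide the learning rate by
        `factor` every time a milestone epoch has been passed."""
        lr = base_lr
        for m in milestones:
            if epoch >= m:
                lr = lr / factor
        return lr

    # e.g. epochs 0-29 -> 0.1, 30-59 -> 0.01, 60-89 -> 0.001, 90+ -> 0.0001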

8
Q

Explain Momentum and Nesterov Momentum

A

Momentum
Using momentum accelerates and stabilizes optimization. Instead of updating the parameters based only on the current gradient, a moving average of past gradients, the momentum term, is introduced. This helps gain speed in directions with consistent gradients and decreases oscillations in directions with high curvature.

The update rule we had before: W ← W − ε g
The momentum update rule: v ← α v − ε g, then W ← W + v, where v is a velocity vector initialized to zero and α is the momentum parameter

Nesterov Momentum
Instead of using the gradient at the current parameters, the gradient is evaluated at the “look-ahead” point W + α v

All in all, momentum is much faster and has a chance of escaping local minima (see the sketch below)
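
A minimal sketch of both update rules, using a quadratic toy loss and illustrative hyperparameters (not from the card):

    import numpy as np

    def momentum_step(W, v, grad_L, lr=0.01, alpha=0.9):
        """Classical momentum: v is a decaying moving average of past gradients."""
        v = alpha * v - lr * grad_L(W)
        return W + v, v

    def nesterov_step(W, v, grad_L, lr=0.01, alpha=0.9):
        """Nesterov momentum: evaluate the gradient at the look-ahead point W + alpha*v."""
        v = alpha * v - lr * grad_L(W + alpha * v)
        return W + v, v

    # Toy usage: minimize L(W) = ||W||^2 (gradient 2W)
    W, v = np.random.randn(3), np.zeros(3)
    for _ in range(100):
        W, v = nesterov_step(W, v, lambda W: 2 * W)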

9
Q

Why is it good to use Adaptive Learning Rates?

A

The learning rate is a hyperparameter and is hard to tune.

In a high-dimensional error landscape, the step size should ideally decrease in steep directions and increase in almost-flat directions.

Adaptive Learning Rates:

  • Three variants mentioned in the book: AdaGrad, RMSProp, Adam
  • A good learning rate to start with could be 0.001
  • Most people use Adam or a variation of it, which has momentum built in (a minimal usage sketch follows below)
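
A minimal usage sketch, assuming PyTorch (the card does not name a framework; the tiny linear model is purely illustrative):

    import torch

    model = torch.nn.Linear(10, 2)                              # stand-in model, purely illustrative
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # 0.001 as the suggested starting LR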
10
Q

Explain the differences between SGD, AdaGrad, RMSProp and Adam.

A

Stochastic Gradient Descent (SGD):
The simplest optimization algorithm. It updates the model parameters using the negative gradient of the loss with respect to the parameters, multiplied by a learning rate. It may use momentum to speed things up.
Drawbacks: it can converge slowly and is prone to getting stuck at critical points.

AdaGrad (Adaptive Gradient Algorithm):
Improvement: AdaGrad adapts the learning rate for each parameter based on historical gradients.
It uses a different learning rate for each parameter, scaling the learning rates inversely proportional to the square root of the sum of the squared gradients for each parameter.
Drawback: because the sum of squared gradients only grows, the learning rates eventually become very small and progress gets very slow

RMSProp (Root Mean Square Propagation):
Improvement: RMSProp addresses the issue of rapidly diminishing learning rates in AdaGrad by using a moving average of squared gradients.
Mechanism: It scales the learning rates by the root mean square of the past gradients, giving more weight to recent gradients.
Effect: way faster than AdaGrad

Adam (Adaptive Moment Estimation):
Combination of Ideas: Adam combines the ideas of momentum and RMSProp.
Mechanism: It uses both a momentum term (like in SGD with momentum) and an adaptive learning rate term (like in RMSProp).
Benefits: Adam tends to perform well across a wide range of tasks and has become a popular choice for many applications due to its adaptive nature.
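
A minimal side-by-side sketch of the three adaptive update rules in NumPy (hyperparameter values are common defaults, used here only for illustration):

    import numpy as np

    def adagrad_step(W, r, g, lr=0.01, eps=1e-8):
        """AdaGrad: per-parameter LR scaled by 1/sqrt(sum of all past squared gradients)."""
        r = r + g**2
        return W - lr * g / (np.sqrt(r) + eps), r

    def rmsprop_step(W, r, g, lr=0.001, rho=0.9, eps=1e-8):
        """RMSProp: a moving average of squared gradients instead of a growing sum."""
        r = rho * r + (1 - rho) * g**2
        return W - lr * g / (np.sqrt(r) + eps), r

    def adam_step(W, m, v, g, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
        """Adam: momentum-like first moment plus RMSProp-like second moment, bias-corrected."""
        m = b1 * m + (1 - b1) * g     # first moment (momentum)
        v = b2 * v + (1 - b2) * g**2  # second moment (adaptive scaling)
        m_hat = m / (1 - b1**t)       # bias correction; t counts steps starting at 1
        v_hat = v / (1 - b2**t)
        return W - lr * m_hat / (np.sqrt(v_hat) + eps), m, v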

11
Q

Explain batch normalization

A

We already talked about normalizing the data before we start training (preprocessing), but it can be a good idea to normalize throughout the net as well. With Batch Norm we normalize the output of the activation functions before feeding it forward in the net.

It calculates the mean and standard deviation of each feature in a mini-batch and normalizes the inputs of a layer based on these statistics; a learned scale and shift are then applied so the layer keeps its expressive power (see the sketch below).
Benefits:
- Stability: reduces internal covariate shift, making training more stable.
- Faster convergence: accelerates convergence by allowing higher learning rates.
- Regularization: acts as a form of regularization, reducing the need for other regularization techniques.
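
A minimal sketch of the batch-norm forward computation over a mini-batch, assuming inputs of shape (batch, features); gamma and beta are the learned scale and shift:

    import numpy as np

    def batch_norm_forward(x, gamma, beta, eps=1e-5):
        """Normalize each feature over the mini-batch, then apply the learned
        scale (gamma) and shift (beta)."""
        mu = x.mean(axis=0)                    # per-feature mean over the mini-batch
        var = x.var(axis=0)                    # per-feature variance over the mini-batch
        x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance per feature
        return gamma * x_hat + beta

    # Toy usage
    x = np.random.randn(32, 4) * 5 + 3
    out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))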
