Optimization Algorithms Flashcards

1
Q

what are batch and mini-batch gradient descent?

A

if the training set has too many samples -> split it into mini-batches
batch: process the entire training set at the same time
mini-batch: process 1 smaller batch at a time. In each training step we process only 1 mini-batch, update the weights, then move on to the next mini-batch
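A minimal NumPy sketch of the split, assuming toy arrays X and y and a batch size of 64 (the array shapes and batch size are illustrative, not from the card):

import numpy as np

# toy data: 1,000 samples with 5 features each (sizes chosen only for illustration)
X = np.random.randn(1000, 5)
y = np.random.randn(1000)

batch_size = 64                               # a commonly used mini-batch size
perm = np.random.permutation(len(X))          # shuffle once per epoch
X_shuffled, y_shuffled = X[perm], y[perm]

for start in range(0, len(X_shuffled), batch_size):
    # each step sees only one mini-batch of (up to) 64 samples
    X_batch = X_shuffled[start:start + batch_size]
    y_batch = y_shuffled[start:start + batch_size]
    # ...compute the gradient on this mini-batch, update the weights, move on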

2
Q

is mini-batch gradient descent faster?

A

when the training set is large, mini-batch gradient descent runs much faster than batch gradient descent

3
Q

how does the cost behave during training for batch vs mini-batch gradient descent?

A

batch: the cost only goes down (it decreases on every iteration, given a suitable learning rate)
mini-batch: the cost fluctuates, but with a downward trend

4
Q

how do different batch sizes compare?

A

batch gradient descent: size = m, smooth progress but each step takes too long
stochastic gradient descent: size = 1, can be extremely noisy (a smaller alpha reduces the noise) but loses the speedup from vectorization
mini-batch: some size in between, fastest learning in practice: you keep the vectorization speedup and still make progress without processing the whole training set (usually works better than the other 2)

5
Q

how to choose minibatch size?

A

small training set: just use batch gradient descent
typical mini-batch sizes: 64, 128, 256, 512 (the most common), sometimes 1024
make sure each mini-batch fits in GPU or CPU memory

6
Q

goal of gradient descent

A

minimize the loss function of a machine learning problem

7
Q

idea of basic GD algo

A

the gradient points uphill, so the opposite direction of the gradient points toward lower values of the loss -> iteratively take steps in the direction opposite to the gradient

8
Q

vanilla GD algo

A

delta = -learning_rate * gradient
theta += delta

theta is the parameter we want to optimize
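A minimal NumPy sketch of vanilla gradient descent, assuming a toy quadratic loss L(theta) = theta_1^2 + theta_2^2 (the loss, starting point, and learning rate are illustrative assumptions):

import numpy as np

theta = np.array([3.0, -2.0])            # the parameter we want to optimize
learning_rate = 0.1

for _ in range(100):
    gradient = 2 * theta                 # gradient of the toy quadratic loss at theta
    delta = -learning_rate * gradient    # step in the opposite direction of the gradient
    theta += delta

print(theta)                             # ends up very close to the minimum at [0, 0]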

9
Q

momentum algo?

A

in each step, in addition to the current gradient, it also adds on the (decayed) movement from the previous step
sum_of_gradient = gradient + prev_sum_of_gradient * decay_rate
delta = -lr * sum_of_gradient
theta += delta
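A minimal NumPy sketch of the momentum update, reusing the assumed toy quadratic loss from the vanilla GD card (starting point, learning rate, and decay rate are illustrative):

import numpy as np

theta = np.array([3.0, -2.0])
learning_rate = 0.1
decay_rate = 0.9                          # 0 recovers vanilla GD, 1 never slows down
sum_of_gradient = np.zeros_like(theta)    # accumulated movement from previous steps

for _ in range(200):
    gradient = 2 * theta                  # gradient of the toy quadratic loss
    sum_of_gradient = gradient + sum_of_gradient * decay_rate
    delta = -learning_rate * sum_of_gradient
    theta += delta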

10
Q

what if the decay rate = 0? = 1? what are reasonable decay rates?

A

decay rate = 0 -> vanilla GD (no momentum)
decay rate = 1 -> the cost goes back and forth endlessly (given a reasonably small lr)
typically 0.8 or 0.9 (like a surface with a little bit of friction, so it eventually slows down and stops)

11
Q

how is momentum better than vanilla GD?

A

it simply moves faster, thanks to all the momentum it accumulates
it has a shot at escaping local minima, because the accumulated momentum may propel it out of a local minimum

12
Q

AdaGrad (adaptive gradient) algo

A

AdaGrad keeps track of the running sum of squared gradients for each parameter
sum_of_gradient_squared = prev_sum_of_gradient_squared + gradient^2
delta = -lr * gradient / sqrt(sum_of_gradient_squared)
theta += delta
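A minimal NumPy sketch of AdaGrad on the same assumed toy quadratic loss; the small eps added to the denominator is a common practical safeguard against division by zero, not part of the card's formula:

import numpy as np

theta = np.array([3.0, -2.0])
learning_rate = 0.5
sum_of_gradient_squared = np.zeros_like(theta)
eps = 1e-8                                 # practical safeguard, not in the card's formula

for _ in range(200):
    gradient = 2 * theta                   # gradient of the toy quadratic loss
    sum_of_gradient_squared = sum_of_gradient_squared + gradient ** 2
    delta = -learning_rate * gradient / (np.sqrt(sum_of_gradient_squared) + eps)
    theta += delta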

13
Q

idea of AdaGrad

A

in ML optimization, some features are very sparse -> the average gradient for sparse features is usually small, so those features get trained at a much slower rate
AdaGrad: the more you have already updated a feature, the less you will update it in the future -> sparse features can catch up
-> it also escapes saddle points faster

14
Q

what are some gradient-squared-based methods?

A

AdaGrad, RMSProp, Adam

15
Q

gradient squared based methods vs vanilla GD?

A

vanilla GD: slides down the steepest slope first, then maybe worries about the slower directions later
gradient-squared-based methods: slow down the fast-learning features so the sparse features can catch up -> all features get updated by roughly similar amounts at each step

16
Q

drawbacks of AdaGrad?

A

it becomes very slow, because the sum of squared gradients only grows and never shrinks, so the effective step size keeps shrinking

17
Q

RMSProp (Root Mean Square Propagation) algo?

A

sum_of_gradient_squared = prev_sum_of_gradient_squared * decay_rate + gradient^2 * (1 - decay_rate)
delta = -lr * gradient / sqrt(sum_of_gradient_squared)
theta += delta
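A minimal NumPy sketch of RMSProp on the same assumed toy quadratic loss; as in the AdaGrad sketch, the eps in the denominator is a practical addition rather than part of the card's formula:

import numpy as np

theta = np.array([3.0, -2.0])
learning_rate = 0.1
decay_rate = 0.9
sum_of_gradient_squared = np.zeros_like(theta)
eps = 1e-8

for _ in range(200):
    gradient = 2 * theta                   # gradient of the toy quadratic loss
    sum_of_gradient_squared = (sum_of_gradient_squared * decay_rate
                               + gradient ** 2 * (1 - decay_rate))
    delta = -learning_rate * gradient / (np.sqrt(sum_of_gradient_squared) + eps)
    theta += delta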

18
Q

how does RMSProp tackle AdaGrad's issue? (the idea of the decay rate)

A

it introduces a decay rate -> the sum of squared gradients is actually a decayed sum of squared gradients.
the decay rate says that only recent gradients matter; the ones from long ago are gradually forgotten
in addition to decaying the previous sum, the incoming gradient^2 is scaled by (1 - decay_rate), so the sum stays much smaller than AdaGrad's ever-growing sum -> the step is larger -> eventually faster than AdaGrad

19
Q

Adam (adaptive moment estimation) algo?

A

takes the best of momentum and RMSProp
momentum part: sum_of_grad = prev_sum_of_grad * beta1 + grad * (1 - beta1)
RMSProp part: sum_of_grad_squared = prev_sum_of_grad_squared * beta2 + grad^2 * (1 - beta2)
delta = -lr * sum_of_grad / sqrt(sum_of_grad_squared)
theta += delta

beta1: decay rate for the first moment (sum of gradients), commonly set to 0.9
beta2: decay rate for the second moment (sum of squared gradients), commonly set to 0.999
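A minimal NumPy sketch of Adam on the same assumed toy quadratic loss; it follows the card's formulas and therefore omits the bias correction used in the published Adam algorithm, and the eps in the denominator is a practical addition:

import numpy as np

theta = np.array([3.0, -2.0])
learning_rate = 0.1
beta1, beta2 = 0.9, 0.999
sum_of_grad = np.zeros_like(theta)          # first moment (momentum part)
sum_of_grad_squared = np.zeros_like(theta)  # second moment (RMSProp part)
eps = 1e-8

for _ in range(200):
    grad = 2 * theta                        # gradient of the toy quadratic loss
    sum_of_grad = sum_of_grad * beta1 + grad * (1 - beta1)
    sum_of_grad_squared = (sum_of_grad_squared * beta2
                           + grad ** 2 * (1 - beta2))
    delta = -learning_rate * sum_of_grad / (np.sqrt(sum_of_grad_squared) + eps)
    theta += delta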

20
Q

Adam advantages?

A

Adam gets its speed from momentum and the ability to adapt step sizes in different directions from RMSProp -> a powerful combination

21
Q

why might mini-batch be faster than batch?

A

although there are more steps per epoch:
typically, especially early in training, the gradients computed on different subsets of the data point in roughly the same direction -> a gradient evaluated on just a subset roughly matches the direction of the gradient on the whole set but requires much less computation
convergence on a highly nonlinear deep network typically requires thousands or millions of iterations no matter how good the gradients are -> it makes sense to do many updates based on cheap gradient estimates rather than a few updates based on exact ones