Optimization in neural networks Flashcards

1
Q

What is backpropagation? How does it work? Why do we need it? ‍⭐️

A

Backpropagation is the algorithm that computes the gradient of the error (loss) function with respect to every weight in the network by applying the chain rule layer by layer, from the output back to the input. Using these gradients, a technique called the delta rule or gradient descent searches weight space for the minimum of the error function; the weights that minimize the error function are then considered to be a solution to the learning problem.

We need backpropagation because it supplies the gradients that drive this training loop:

Calculate the error – how far the model output is from the actual output.
Check the error – see whether the error is already small enough (minimized).
Update the parameters – if the error is still large, update the weights and biases using the gradients from backpropagation, then check the error again.
Repeat the process until the error becomes acceptably small.
Model is ready to make a prediction – once the error is minimized, you can feed inputs to the model and it will produce the output (see the sketch below).
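As a rough illustration of that loop, here is a minimal NumPy sketch of backpropagation through a one-hidden-layer network; the data, network sizes and learning rate are made-up assumptions, not part of the original answer.

    import numpy as np

    # Toy data: 4 samples, 3 features, 1 target (illustrative values only)
    X = np.random.randn(4, 3)
    y = np.random.randn(4, 1)

    # Randomly initialised weights for a 3 -> 5 -> 1 network
    W1, b1 = np.random.randn(3, 5) * 0.1, np.zeros((1, 5))
    W2, b2 = np.random.randn(5, 1) * 0.1, np.zeros((1, 1))
    lr = 0.1

    for step in range(1000):
        # Forward pass
        h = np.tanh(X @ W1 + b1)              # hidden activations
        y_hat = h @ W2 + b2                   # model output
        error = y_hat - y
        loss = (error ** 2).mean()            # 1. calculate the error (MSE)

        # Backward pass: chain rule from the output back to each weight
        d_yhat = 2 * error / len(X)
        dW2 = h.T @ d_yhat
        db2 = d_yhat.sum(axis=0, keepdims=True)
        d_h = (d_yhat @ W2.T) * (1 - h ** 2)  # derivative of tanh
        dW1 = X.T @ d_h
        db1 = d_h.sum(axis=0, keepdims=True)

        # 2. update the parameters with gradient descent (delta rule)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

        # 3. check the error and repeat until it is small enough
        if loss < 1e-4:
            break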

2
Q

Which optimization techniques for training neural nets do you know? ‍⭐️

A

Gradient Descent (full-batch)
Stochastic Gradient Descent (SGD)
Mini-Batch Gradient Descent (usually the best trade-off among the gradient descent variants)
Momentum
Nesterov Accelerated Gradient (a look-ahead variant of momentum)
Adagrad
AdaDelta
Adam (often the best default: fast convergence with little tuning; see the sketch below)
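As a sketch, most of these optimizers are available off the shelf, e.g. in PyTorch's torch.optim; the placeholder model and the learning-rate values below are illustrative assumptions (in practice you would pick just one of them for training).

    import torch

    model = torch.nn.Linear(10, 1)   # placeholder model

    sgd      = torch.optim.SGD(model.parameters(), lr=0.01)                 # (mini-batch) SGD
    momentum = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    nesterov = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
    adagrad  = torch.optim.Adagrad(model.parameters(), lr=0.01)
    adadelta = torch.optim.Adadelta(model.parameters())
    adam     = torch.optim.Adam(model.parameters(), lr=0.001)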

3
Q

How do we use SGD (stochastic gradient descent) for training a neural net? ‍⭐️

A

SGD approximates the expected gradient using a few randomly selected samples (a mini-batch) instead of the full data set. Compared to batch gradient descent, this lets us approximate the gradient efficiently on large data sets. For neural networks this reduces training time a lot, even though the random sampling adds noise to the gradient and therefore typically requires more update steps to converge.
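A minimal sketch of how this looks in practice, with a DataLoader drawing random mini-batches for SGD; the synthetic data, model architecture and hyperparameters are assumptions for illustration.

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Synthetic data set: 1000 samples with 20 features (illustrative)
    X, y = torch.randn(1000, 20), torch.randn(1000, 1)
    loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)   # random sampling

    model = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(10):
        for xb, yb in loader:            # each mini-batch approximates the full-data gradient
            optimizer.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()              # gradient estimated from the mini-batch only
            optimizer.step()             # cheap but noisy parameter update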

4
Q

What’s the learning rate? 👶

A

The learning rate is an important hyperparameter that controls how quickly the model adapts to the problem during training. It can be seen as the “step size” of the parameter updates, i.e. how far the weights are moved in the direction of the minimum of our optimization problem at each step.
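Schematically, the learning rate is the factor that scales every gradient step; the numbers in this tiny sketch are arbitrary.

    # Gradient descent on f(w) = w**2, whose minimum is at w = 0
    lr = 0.1            # the learning rate ("step size")
    w = 5.0             # arbitrary starting point
    for _ in range(20):
        grad = 2 * w            # derivative of w**2
        w = w - lr * grad       # move a step of size lr * |grad| towards the minimum
    print(w)            # close to 0 for a reasonable learning rate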

5
Q

What happens when the learning rate is too large? Too small? 👶

A

A large learning rate can accelerate training. However, it is possible that each step “shoots” past the minimum of the function we want to optimize, so training oscillates or diverges and does not reach the best solution. On the other hand, training with a small learning rate takes more time, but it is possible to find a more precise minimum. The downside is that training may be so slow that it effectively gets stuck, for example in a poor local minimum, with the weights barely updating even though a better solution exists.
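A toy quadratic f(w) = w² makes the trade-off concrete; the specific learning-rate values below are illustrative assumptions.

    def descend(lr, w=5.0, steps=20):
        for _ in range(steps):
            w = w - lr * 2 * w      # gradient of w**2 is 2*w
        return w

    print(descend(0.1))     # ~0.06: converges nicely
    print(descend(0.001))   # ~4.8:  too small, barely moved after 20 steps
    print(descend(1.1))     # ~192:  too large, every step overshoots and the iterates diverge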

6
Q

How to set the learning rate? ‍⭐️

A

There is no straightforward way of finding an optimal learning rate for a model; it involves a lot of trial and error. Starting with a small value such as 0.01 is usually a good starting point, and you then tweak it so that training neither overshoots nor converges too slowly.
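A rough sketch of that trial-and-error process: train briefly with a few candidate learning rates and keep the one with the lowest validation loss. The synthetic data, tiny model and candidate values are assumptions for illustration.

    import torch

    def val_loss_after_short_training(lr, steps=200):
        # Quick, cheap experiment: a small linear model on synthetic data
        torch.manual_seed(0)
        X, y = torch.randn(256, 10), torch.randn(256, 1)
        X_val, y_val = torch.randn(64, 10), torch.randn(64, 1)
        model = torch.nn.Linear(10, 1)
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        loss_fn = torch.nn.MSELoss()
        for _ in range(steps):
            opt.zero_grad()
            loss_fn(model(X), y).backward()
            opt.step()
        with torch.no_grad():
            return loss_fn(model(X_val), y_val).item()

    candidates = [0.0001, 0.001, 0.01, 0.1]
    results = {lr: val_loss_after_short_training(lr) for lr in candidates}
    best_lr = min(results, key=results.get)     # lowest validation loss wins
    print(results, best_lr)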

7
Q

What is Adam? What’s the main difference between Adam and SGD? ‍⭐️

A

Adam (Adaptive Moment Estimation) is an optimization technique for training neural networks and is, on average, one of the best general-purpose optimizers. It maintains first- and second-order moment estimates (momentum-like running averages of the gradients and of their squares) and uses them to give each parameter its own adaptive step size. The intuition behind Adam is that we don’t want to roll downhill so fast that we jump over the minimum; we want to decrease the velocity a little bit for a careful search.

Adam tends to converge faster, while SGD often converges to more optimal solutions. The high variance of SGD’s updates is smoothed out by Adam’s moment estimates, which is one of Adam’s main advantages.
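A schematic NumPy sketch of the Adam update on a single parameter vector: running first- and second-moment estimates, bias correction, then a per-parameter adaptive step. The toy loss, starting point and learning rate are illustrative assumptions; beta1, beta2 and eps are the commonly used defaults.

    import numpy as np

    lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

    w = np.array([5.0, -3.0])      # parameters (illustrative starting point)
    m = np.zeros_like(w)           # first-moment estimate (running mean of gradients)
    v = np.zeros_like(w)           # second-moment estimate (running mean of squared gradients)

    def grad(w):                   # stand-in gradient for the toy loss ||w||^2
        return 2 * w

    for t in range(1, 501):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g            # update biased first moment
        v = beta2 * v + (1 - beta2) * g ** 2       # update biased second moment
        m_hat = m / (1 - beta1 ** t)               # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # adaptive per-parameter step

    print(w)   # much closer to the minimum at [0, 0] than the starting point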

8
Q

When would you use Adam and when SGD? ‍⭐️

A

Adam tends to converge faster and needs less learning-rate tuning, so it is a good default when you want quick results, are prototyping, or are dealing with sparse gradients. SGD (usually with momentum and a tuned learning-rate schedule) often converges to solutions that generalize better, so it is preferred when final model quality matters more than training time.

9
Q

Do we want to have a constant learning rate or we better change it throughout training? ‍⭐️

A

Generally, it is recommended to start with a relatively high learning rate and then gradually decrease it, so the model does not overshoot the minima later in training; at the same time we don’t want to start with a very low learning rate, as the model would take too long to converge. There are many techniques for decaying the learning rate. For example, in PyTorch you can use the StepLR scheduler, which multiplies the learning rate of each parameter group by a factor gamma every step_size epochs (both passed as arguments).
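For instance, a minimal StepLR sketch in PyTorch; the placeholder model and the particular schedule values (step_size=30, gamma=0.1) are illustrative assumptions.

    import torch

    model = torch.nn.Linear(10, 1)                              # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)     # start relatively high
    # Multiply the learning rate by gamma every step_size epochs: 0.1 -> 0.01 -> 0.001
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

    for epoch in range(90):
        # ... one epoch of training with `optimizer` would go here ...
        scheduler.step()                                        # apply the decay schedule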

10
Q

How do we decide when to stop training a neural net? 👶

A

A simple rule is to stop training when the validation error reaches its minimum. In practice this is done with early stopping: monitor the validation error after every epoch and stop once it has not improved for a given number of epochs (the “patience”), keeping the weights from the best epoch.
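A minimal early-stopping sketch of that rule; the simulated validation error and the patience value are illustrative assumptions (in a real setup val_error would come from evaluating the network on the validation set).

    import random

    best_val_error = float("inf")
    patience, epochs_without_improvement = 5, 0

    for epoch in range(100):
        # Stand-in for one epoch of training followed by validation
        val_error = 1.0 / (epoch + 1) + random.uniform(0, 0.05)

        if val_error < best_val_error:
            best_val_error = val_error       # new minimum: remember it (and save a checkpoint)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Stopping at epoch {epoch}: no improvement for {patience} epochs")
                break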

11
Q

What is model checkpointing? ‍⭐️

A

Model checkpointing means saving the weights learned by the model (and usually the optimizer state) at intervals during training. For long-running training processes this lets you resume training from the last checkpoint instead of starting over, and also lets you keep the best-performing version of the model.
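A common way to do this in PyTorch; the placeholder model, the file name and the epoch/loss values are illustrative assumptions.

    import torch

    model = torch.nn.Linear(10, 1)                    # placeholder model
    optimizer = torch.optim.Adam(model.parameters())
    epoch, val_loss = 7, 0.123                        # illustrative values

    # Save a checkpoint mid-training
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "val_loss": val_loss,
    }, "checkpoint.pt")

    # Later: restore the states and resume training from the next epoch
    checkpoint = torch.load("checkpoint.pt")
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    start_epoch = checkpoint["epoch"] + 1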
