Optimisation Flashcards
(23 cards)
What is the goal of optimisation in supervised learning?
To find parameters that minimize the total loss over the dataset.
In linear regression, what are we trying to minimize?
The sum of squared differences between predictions and true values.
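A minimal numpy sketch of this objective, assuming hypothetical arrays X (inputs), y (targets), a weight vector w and bias b:

```python
import numpy as np

def sum_squared_error(X, y, w, b):
    """Sum of squared residuals for a linear model y_hat = X @ w + b."""
    y_hat = X @ w + b          # model predictions
    residuals = y_hat - y      # prediction errors
    return np.sum(residuals ** 2)
```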
What does the gradient of a loss function represent?
The direction of steepest increase of the loss with respect to the parameters (so gradient descent steps in the opposite direction).
What does the learning rate α control?
The size of the step taken during each update.
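For reference, the basic gradient descent update these cards describe, with θ the parameters, α the learning rate and L the loss:

```latex
\theta_{t+1} = \theta_t - \alpha \, \nabla_\theta L(\theta_t)
```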
What happens if the learning rate is too small?
The model trains slowly and may take too long to converge.
What happens if the learning rate is too large?
The updates may overshoot the minimum and diverge.
What kind of function guarantees a single global minimum?
A convex function.
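The standard definition behind this card: a function is convex when every chord lies on or above its graph,

```latex
f\bigl(\lambda x + (1-\lambda) y\bigr) \le \lambda f(x) + (1-\lambda) f(y)
\quad \text{for all } x, y \text{ and } \lambda \in [0, 1].
```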
Why are neural network loss surfaces non-convex?
Because the stacked non-linear layers (and the symmetries between hidden units) produce a loss surface with many local minima and saddle points rather than a single basin.
What is stochastic gradient descent (SGD)?
An optimisation method that updates parameters using mini-batches instead of the full dataset.
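A minimal SGD loop sketch, assuming a hypothetical helper `loss_grad(w, X_batch, y_batch)` that returns the mini-batch gradient:

```python
import numpy as np

def sgd(w, X, y, loss_grad, lr=0.01, batch_size=32, epochs=10):
    """Plain mini-batch SGD: shuffle the data, then step on each mini-batch."""
    n = len(X)
    for _ in range(epochs):
        idx = np.random.permutation(n)            # fresh shuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            g = loss_grad(w, X[batch], y[batch])  # gradient on the mini-batch only
            w = w - lr * g                        # parameter update
    return w
```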
What are the advantages of using SGD?
It is faster per update, more memory efficient, and the noise in its updates can help escape shallow local minima.
What are the disadvantages of SGD?
Updates are noisy and convergence can be unstable.
What does momentum add to gradient descent?
An inertia term that smooths updates and helps traverse valleys.
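One common form of the momentum update, with β the momentum coefficient and v the velocity (inertia) term:

```latex
v_{t+1} = \beta v_t + \nabla_\theta L(\theta_t), \qquad
\theta_{t+1} = \theta_t - \alpha \, v_{t+1}
```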
How does Nesterov Accelerated Gradient improve on momentum?
It computes the gradient at the predicted future position for more accurate updates.
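One common form of the Nesterov update, where the gradient is evaluated at the look-ahead point rather than at the current parameters:

```latex
v_{t+1} = \beta v_t + \nabla_\theta L(\theta_t - \alpha \beta v_t), \qquad
\theta_{t+1} = \theta_t - \alpha \, v_{t+1}
```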
What is Adam short for?
Adaptive Moment Estimation.
What two ideas does Adam combine?
Momentum and adaptive per-parameter learning rates (as in RMSProp).
What are the first and second moments in Adam?
The first moment is an exponential moving average of the gradients; the second moment is an exponential moving average of the squared gradients.
Why do we apply bias correction in Adam?
Because both moment estimates are initialised at zero, they are biased towards zero early in training; bias correction compensates for this.
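Putting the last few cards together, the standard Adam update in one place (g_t is the current gradient, β₁ and β₂ are decay rates, ε a small constant, and m̂, v̂ the bias-corrected moments):

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, &
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2, \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, &
\hat{v}_t &= \frac{v_t}{1 - \beta_2^t}, \\
\theta_{t+1} &= \theta_t - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}.
\end{aligned}
```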
What is the main benefit of Adam over SGD?
It adapts learning rates for each parameter and typically converges faster.
What is a hyperparameter in optimisation?
A setting (like learning rate) that is not learned from data but chosen beforehand.
What are some key hyperparameters in training?
Learning rate, batch size, momentum coefficient, optimiser choice, and the learning rate schedule.
What is the purpose of using a learning rate schedule?
To reduce the learning rate over time for more stable convergence.
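A small sketch of one possible schedule (exponential decay; the decay rate of 0.96 is just an illustrative value):

```python
def exponential_decay(initial_lr, step, decay_rate=0.96, decay_steps=1000):
    """Learning rate that shrinks smoothly as training progresses."""
    return initial_lr * decay_rate ** (step / decay_steps)

# Example: the learning rate falls from 0.1 towards 0 over training.
for step in (0, 1000, 5000):
    print(step, exponential_decay(0.1, step))
```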
What is the Frobenius norm used for in optimisation problems?
To measure the overall size of a matrix, or the error between a matrix and its approximation.
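The definition behind this card, for a matrix A with entries a_ij:

```latex
\|A\|_F = \sqrt{\sum_{i}\sum_{j} a_{ij}^2}
```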
What does optimisation help us achieve in training neural networks?
It tunes the parameters to reduce prediction error on the training data and, ideally, to improve generalisation to new data.