Optimisation Flashcards
What is the goal of optimisation in supervised learning?
To find parameters that minimize the total loss over the dataset.
What is the general form of a supervised learning optimisation objective?
θ* = arg min_θ Σᵢ L(yᵢ, ŷᵢ(θ))
In linear regression, what are we trying to minimize?
The sum of squared differences between predictions and true values.
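As a concrete illustration of the objective above, here is a minimal NumPy sketch of the linear-regression sum-of-squares loss. The dataset, array names (`X`, `y`, `theta`), and the linear model ŷ = Xθ are illustrative assumptions, not something prescribed by the cards.

```python
import numpy as np

def sum_squared_error(theta, X, y):
    """Sum over i of (y_i - y_hat_i)^2, with predictions y_hat = X @ theta."""
    residuals = y - X @ theta
    return np.sum(residuals ** 2)

# Tiny illustrative dataset: a bias column plus one feature.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(sum_squared_error(np.zeros(2), X, y))  # loss at theta = 0 is 4 + 9 + 16 = 29
```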
What does the gradient of a loss function represent?
The direction of steepest increase in loss.
What is the update rule in gradient descent?
θ ← θ - α ∇_θ L
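A minimal sketch of this update rule applied to the squared-error loss from the linear-regression sketch above; the learning rate and step count are arbitrary illustrative values.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.02, steps=2000):
    """Repeatedly apply theta <- theta - alpha * grad of the squared-error loss."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = -2.0 * X.T @ (y - X @ theta)  # gradient of the sum of squared errors
        theta -= alpha * grad
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(gradient_descent(X, y))  # approaches the exact solution [1, 1]
```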
What does the learning rate α control?
The size of the step taken during each update.
What happens if the learning rate is too small?
The model trains slowly and may take too long to converge.
What happens if the learning rate is too large?
The updates may overshoot the minimum and diverge.
What kind of function guarantees a single global minimum?
A convex function.
Why are neural network loss surfaces non-convex?
Because stacking many nonlinear layers makes the loss a non-convex function of the weights, so the surface contains many local minima and saddle points.
What is stochastic gradient descent (SGD)?
An optimisation method that updates parameters using the gradient of a single example or a small mini-batch rather than the full dataset.
What are the advantages of using SGD?
Each update is cheap to compute, it is more memory efficient, and the gradient noise can help escape shallow local minima and saddle points.
What are the disadvantages of SGD?
Updates are noisy and convergence can be unstable.
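A minimal mini-batch SGD sketch for the same squared-error setup; the batch size, epoch count, and random seed are arbitrary illustrative choices.

```python
import numpy as np

def sgd(X, y, alpha=0.01, batch_size=2, epochs=2000, seed=0):
    """Update theta using gradients from shuffled mini-batches, not the full dataset."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = -2.0 * X[idx].T @ (y[idx] - X[idx] @ theta)  # noisy mini-batch gradient
            theta -= alpha * grad
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(sgd(X, y))  # approaches the least-squares solution [1, 1]
```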
What does momentum add to gradient descent?
An inertia term that smooths updates and helps traverse valleys.
What is the formula for the momentum update step?
v_t = βv_{t-1} + (1 - β)∇_θ L; θ ← θ - αv_t
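A minimal sketch of this momentum update applied to the same squared-error gradient; β = 0.9 is a common but illustrative choice, as are the learning rate and step count.

```python
import numpy as np

def momentum_gd(X, y, alpha=0.02, beta=0.9, steps=2000):
    """v_t = beta * v_{t-1} + (1 - beta) * grad;  theta <- theta - alpha * v_t."""
    theta = np.zeros(X.shape[1])
    v = np.zeros_like(theta)
    for _ in range(steps):
        grad = -2.0 * X.T @ (y - X @ theta)
        v = beta * v + (1.0 - beta) * grad  # exponentially averaged gradient ("inertia")
        theta -= alpha * v
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(momentum_gd(X, y))  # approaches [1, 1], with smoother steps than raw gradients
```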
How does Nesterov Accelerated Gradient improve on momentum?
It computes the gradient at the predicted future position for more accurate updates.
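A minimal Nesterov-style sketch on the same quadratic loss: the gradient is evaluated at a look-ahead point θ - αβv rather than at θ. This uses one common (non-EMA) formulation of Nesterov momentum; conventions vary between texts, and the hyperparameter values are illustrative.

```python
import numpy as np

def nesterov_gd(X, y, alpha=0.005, beta=0.9, steps=2000):
    """Evaluate the gradient at the look-ahead point, then apply a momentum update."""
    theta = np.zeros(X.shape[1])
    v = np.zeros_like(theta)
    for _ in range(steps):
        lookahead = theta - alpha * beta * v     # predicted future position
        grad = -2.0 * X.T @ (y - X @ lookahead)  # gradient at the look-ahead point
        v = beta * v + grad
        theta -= alpha * v
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(nesterov_gd(X, y))  # approaches [1, 1]
```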
What is Adam short for?
Adaptive Moment Estimation.
What two ideas does Adam combine?
Momentum (a first-moment estimate) and adaptive per-parameter learning rates (a second-moment estimate, as in RMSProp).
What are the first and second moments in Adam?
The first moment is an exponential moving average of the gradients; the second moment is an exponential moving average of the squared gradients.
What is the Adam update rule?
θ ← θ - α · (m̂_t / (√v̂_t + ε))
Why do we apply bias correction in Adam?
Because the moment estimates are initialised at zero, they are biased towards zero early in training; bias correction compensates for this.
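Putting the Adam cards together, a minimal sketch with the two moment estimates, bias correction, and the update rule above; β₁ = 0.9, β₂ = 0.999, and ε = 1e-8 are the commonly quoted defaults, while the learning rate, step count, and dataset are illustrative.

```python
import numpy as np

def adam(X, y, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    """Adam: bias-corrected EMAs of the gradient (m) and squared gradient (v),
    then theta <- theta - alpha * m_hat / (sqrt(v_hat) + eps)."""
    theta = np.zeros(X.shape[1])
    m = np.zeros_like(theta)  # first moment estimate
    v = np.zeros_like(theta)  # second moment estimate
    for t in range(1, steps + 1):
        grad = -2.0 * X.T @ (y - X @ theta)
        m = beta1 * m + (1 - beta1) * grad       # EMA of gradients
        v = beta2 * v + (1 - beta2) * grad ** 2  # EMA of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction: estimates start at zero
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(adam(X, y))  # approaches the least-squares solution [1, 1]
```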
What is the main benefit of Adam over SGD?
It adapts learning rates for each parameter and typically converges faster.
What is a hyperparameter in optimisation?
A setting (like learning rate) that is not learned from data but chosen beforehand.
What are some key hyperparameters in training?
Learning rate, batch size, momentum coefficient, choice of optimiser, and the learning-rate schedule.
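A small illustrative sketch of how these hyperparameters might be gathered into a single training configuration; every key and value here is an assumption for illustration, not something prescribed by the cards.

```python
# Hypothetical training configuration grouping the hyperparameters above.
config = {
    "optimiser": "adam",     # optimiser choice
    "learning_rate": 1e-3,   # step size alpha
    "batch_size": 32,        # mini-batch size for SGD-style updates
    "momentum": 0.9,         # momentum coefficient beta
    "lr_schedule": "cosine", # learning-rate schedule
}
```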