Optimisation Flashcards

(27 cards)

1
Q

What is the goal of optimisation in supervised learning?

A

To find parameters that minimize the total loss over the dataset.

2
Q

What is the general form of a supervised learning optimisation objective?

A

θ* = arg min_θ Σ L(yᵢ, ŷᵢ(θ))

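A minimal sketch of this objective in NumPy, assuming a linear model ŷᵢ = xᵢ·θ with a squared-error loss (the model, loss, and toy data are illustrative assumptions, not part of the card):

    import numpy as np

    def total_loss(theta, X, y):
        """Sum over the dataset of L(y_i, yhat_i(theta)) for a linear model."""
        y_hat = X @ theta                   # predictions yhat_i(theta)
        return np.sum((y - y_hat) ** 2)     # squared-error loss, summed over all examples

    # The optimiser's job is to find theta* = argmin_theta total_loss(theta, X, y).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    true_theta = np.array([1.0, -2.0, 0.5])
    y = X @ true_theta + 0.1 * rng.normal(size=100)

    print(total_loss(np.zeros(3), X, y))    # loss at an arbitrary theta
    print(total_loss(true_theta, X, y))     # much smaller loss near the minimiser
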
3
Q

In linear regression, what are we trying to minimize?

A

The sum of squared differences between predictions and true values.

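For linear regression this minimiser can also be found in closed form; a small sketch using np.linalg.lstsq on hypothetical data:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 2))
    y = X @ np.array([3.0, -1.0]) + 0.05 * rng.normal(size=200)

    # Least-squares solution: minimises sum((y - X @ theta) ** 2) over theta.
    theta_hat, residual_sse, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(theta_hat)        # close to [3.0, -1.0]
    print(residual_sse)     # the minimised sum of squared differences
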
4
Q

What does the gradient of a loss function represent?

A

The direction of steepest increase in loss.

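One way to see this numerically is a central-difference estimate of the gradient; the bowl-shaped loss below is just an illustrative choice:

    import numpy as np

    def loss(theta):
        return np.sum(theta ** 2)               # simple convex, bowl-shaped loss

    def numerical_gradient(f, theta, eps=1e-6):
        """Central-difference estimate of the gradient of f at theta."""
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            step = np.zeros_like(theta)
            step[i] = eps
            grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
        return grad

    theta = np.array([1.0, -2.0])
    print(numerical_gradient(loss, theta))      # ~[2, -4]: the direction of steepest increase;
                                                # gradient descent moves the opposite way
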
5
Q

What is the update rule in gradient descent?

A

θ ← θ - α ∇_θ L

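A minimal gradient-descent loop implementing θ ← θ - α∇_θL, assuming the sum-of-squared-errors loss from card 3 (the toy data, learning rate, and step count are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

    theta = np.zeros(3)
    alpha = 0.001                               # learning rate
    for step in range(500):
        grad = -2 * X.T @ (y - X @ theta)       # gradient of the sum of squared errors
        theta = theta - alpha * grad            # update rule: theta <- theta - alpha * grad
    print(theta)                                # approaches [1.0, -2.0, 0.5]
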
6
Q

What does the learning rate α control?

A

The size of the step taken during each update.

7
Q

What happens if the learning rate is too small?

A

The model trains slowly and may take too long to converge.

8
Q

What happens if the learning rate is too large?

A

The updates may overshoot the minimum and diverge.

9
Q

What kind of function guarantees a single global minimum?

A

A convex function (a strictly convex one if the minimiser must also be unique).

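For reference, the defining inequality of a convex function f (a standard definition, not stated on the card):

    f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)   for all x, y and λ ∈ [0, 1]

Any local minimum of such a function is a global minimum, which is what makes convex objectives well behaved for gradient descent.
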
10
Q

Why are neural network loss surfaces non-convex?

A

Because stacking linear maps with non-linear activations (plus the weight-permutation symmetries this creates) produces a loss surface with many local minima and saddle points.

11
Q

What is stochastic gradient descent (SGD)?

A

An optimisation method that updates parameters using mini-batches instead of the full dataset.

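A minimal mini-batch SGD sketch for the same linear-regression setup (the batch size, learning rate, and toy data are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(1000, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

    theta = np.zeros(3)
    alpha, batch_size = 0.05, 32
    for epoch in range(10):
        order = rng.permutation(len(X))                 # reshuffle the data each epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = -2 / len(idx) * Xb.T @ (yb - Xb @ theta)   # gradient on the mini-batch only
            theta = theta - alpha * grad
    print(theta)                                        # noisy, but close to [1.0, -2.0, 0.5]
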
12
Q

What are the advantages of using SGD?

A

Each update is faster and more memory-efficient, and the gradient noise can help escape poor local minima.

13
Q

What are the disadvantages of SGD?

A

Updates are noisy and convergence can be unstable.

14
Q

What does momentum add to gradient descent?

A

An inertia term that smooths updates and helps traverse valleys.

15
Q

What is the formula for the momentum update step?

A

v_t = βv_{t-1} + (1 - β)∇_θ L; θ ← θ - αv_t

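The same update written as code, following the card's exponential-moving-average form of momentum (the loss, data, and hyperparameters are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

    theta, v = np.zeros(3), np.zeros(3)
    alpha, beta = 0.05, 0.9
    for step in range(300):
        grad = -2 / len(X) * X.T @ (y - X @ theta)   # gradient of the mean squared error
        v = beta * v + (1 - beta) * grad             # v_t = beta * v_{t-1} + (1 - beta) * grad
        theta = theta - alpha * v                    # theta <- theta - alpha * v_t
    print(theta)
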
16
Q

How does Nesterov Accelerated Gradient improve on momentum?

A

It computes the gradient at the predicted future position for more accurate updates.

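A sketch of the look-ahead step, assuming the common formulation that evaluates the gradient at the position predicted by the momentum term (the exact formulation varies between texts; data and hyperparameters are illustrative):

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

    def grad(theta):
        return -2 / len(X) * X.T @ (y - X @ theta)

    theta, v = np.zeros(3), np.zeros(3)
    alpha, beta = 0.05, 0.9
    for step in range(300):
        lookahead = theta - alpha * beta * v     # predicted future position
        v = beta * v + grad(lookahead)           # gradient taken at the look-ahead point
        theta = theta - alpha * v
    print(theta)
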
17
Q

What is Adam short for?

A

Adaptive Moment Estimation.

18
Q

What two ideas does Adam combine?

A

Momentum and adaptive learning rates (via RMSProp).

19
Q

What are the first and second moments in Adam?

A

The first moment is an exponential moving average of the gradients; the second moment is an exponential moving average of the squared gradients.

20
Q

What is the Adam update rule?

A

θ ← θ - α · (m̂_t / (√v̂_t + ε))

21
Q

Why do we apply bias correction in Adam?

A

To correct the initialization bias in moment estimates early in training.

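Putting cards 19-21 together, a minimal Adam step in NumPy (the β₁, β₂, and ε values are the commonly used defaults and, like the toy data, are assumptions rather than part of the cards):

    import numpy as np

    rng = np.random.default_rng(6)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

    theta = np.zeros(3)
    m, v = np.zeros(3), np.zeros(3)
    alpha, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8
    for t in range(1, 501):
        grad = -2 / len(X) * X.T @ (y - X @ theta)
        m = beta1 * m + (1 - beta1) * grad            # first moment: average of gradients
        v = beta2 * v + (1 - beta2) * grad ** 2       # second moment: average of squared gradients
        m_hat = m / (1 - beta1 ** t)                  # bias correction for the zero initialisation
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)   # the Adam update rule
    print(theta)
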
22
Q

What is the main benefit of Adam over SGD?

A

It adapts learning rates for each parameter and typically converges faster.

23
Q

What is a hyperparameter in optimisation?

A

A setting (like learning rate) that is not learned from data but chosen beforehand.

24
Q

What are some key hyperparameters in training?

A

Learning rate, batch size, momentum coefficient, choice of optimiser, and the learning rate schedule.

25
Q

What is the purpose of using a learning rate schedule?

A

To reduce the learning rate over time for more stable convergence.
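
A small sketch of one common schedule, exponential decay (the decay form and constants are illustrative assumptions):

    def exponential_decay(initial_lr, decay_rate, step):
        """Learning rate after `step` updates: initial_lr * decay_rate ** step."""
        return initial_lr * decay_rate ** step

    for step in range(0, 500, 100):
        print(step, exponential_decay(0.1, 0.99, step))   # the learning rate shrinks as training proceeds
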
26
Q

What is the Frobenius norm used for in optimisation problems?

A

To measure the size of a matrix or the error of a matrix approximation.
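
For example, the error of a low-rank approximation measured in the Frobenius norm (the rank-1 SVD truncation and random matrix are just an illustration):

    import numpy as np

    rng = np.random.default_rng(7)
    A = rng.normal(size=(5, 4))

    # Rank-1 approximation built from the largest singular value and its vectors.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A1 = s[0] * np.outer(U[:, 0], Vt[0])

    print(np.linalg.norm(A, "fro"))        # size of A: square root of the sum of squared entries
    print(np.linalg.norm(A - A1, "fro"))   # approximation error in the Frobenius norm
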
27
Q

What does optimisation help us achieve in training neural networks?

A

To reduce prediction error and improve generalisation by tuning parameters.