Optimisation Flashcards

(23 cards)

1
Q

What is the goal of optimisation in supervised learning?

A

To find parameters that minimize the total loss over the dataset.

2
Q

In linear regression, what are we trying to minimize?

A

The sum of squared differences between predictions and true values.
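
A minimal NumPy sketch of this objective, assuming a linear model with weights w and bias b (the names X, y, w, b are illustrative, not fixed by the card):

```python
import numpy as np

def sum_squared_error(X, y, w, b):
    """Sum of squared differences between predictions X @ w + b and targets y."""
    predictions = X @ w + b        # linear model predictions
    residuals = predictions - y    # prediction errors
    return np.sum(residuals ** 2)  # total squared error over the dataset
```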

3
Q

What does the gradient of a loss function represent?

A

The direction of steepest increase in loss.
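
One way to see this concretely is to estimate the gradient by finite differences on a toy quadratic loss; this sketch assumes NumPy and an illustrative loss function:

```python
import numpy as np

def loss(theta):
    """Toy convex loss, used only for illustration."""
    return np.sum((theta - np.array([1.0, -2.0])) ** 2)

def numerical_gradient(f, theta, eps=1e-6):
    """Central finite-difference estimate of the gradient of f at theta."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

theta = np.array([0.0, 0.0])
g = numerical_gradient(loss, theta)
# Moving a small distance along +g increases the loss fastest;
# gradient descent therefore steps along -g.
```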

4
Q

What does the learning rate α control?

A

The size of the step taken during each update.
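
A sketch of a single gradient descent update, showing exactly where α enters (the array values here are placeholders):

```python
import numpy as np

def gradient_descent_step(theta, grad, alpha):
    """Move against the gradient; alpha scales the step size."""
    return theta - alpha * grad

theta = np.array([0.5, -0.5])
grad = np.array([1.0, 2.0])  # gradient of the loss at theta (placeholder values)
theta = gradient_descent_step(theta, grad, alpha=0.1)
```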

5
Q

What happens if the learning rate is too small?

A

The model trains slowly and may take too long to converge.

6
Q

What happens if the learning rate is too large?

A

The updates may overshoot the minimum and diverge.

7
Q

What kind of function guarantees a single global minimum?

A

A convex function.
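
The standard definition, which is what guarantees that any local minimum is also the global one:

```latex
% f is convex iff, for all x, y and all \lambda \in [0, 1]:
f\bigl(\lambda x + (1-\lambda)y\bigr) \le \lambda f(x) + (1-\lambda) f(y)
```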

8
Q

Why are neural network loss surfaces non-convex?

A

Because the composition of many nonlinear layers produces a loss surface with many local minima and saddle points, so no single global minimum is guaranteed.

9
Q

What is stochastic gradient descent (SGD)?

A

An optimisation method that updates parameters using the gradient of a single example or small mini-batch rather than the full dataset.
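
A minimal mini-batch SGD loop for the linear least-squares loss above; the batch size, learning rate, and epoch count are illustrative defaults, not values fixed by the card:

```python
import numpy as np

def sgd(X, y, w, alpha=0.01, batch_size=32, epochs=10, seed=0):
    """Mini-batch SGD for linear least squares (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)                      # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]       # indices of the current mini-batch
            Xb, yb = X[idx], y[idx]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)  # gradient of the batch mean squared error
            w = w - alpha * grad                        # parameter update
    return w
```

Because each step sees only a subset of the data, the gradient is a noisy estimate of the full-dataset gradient, which is where the advantages and disadvantages in the next two cards come from.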

10
Q

What are the advantages of using SGD?

A

It is faster, more memory efficient, and helps escape local minima.

11
Q

What are the disadvantages of SGD?

A

Updates are noisy and convergence can be unstable.

12
Q

What does momentum add to gradient descent?

A

An inertia term that smooths updates and helps traverse valleys.
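
A sketch of the classical (heavy-ball) momentum update; beta is the momentum coefficient, commonly set around 0.9:

```python
def momentum_step(theta, grad, velocity, alpha=0.01, beta=0.9):
    """Heavy-ball momentum: velocity accumulates past gradients and smooths updates."""
    velocity = beta * velocity - alpha * grad  # inertia term plus the current gradient step
    theta = theta + velocity                   # move along the accumulated direction
    return theta, velocity
```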

13
Q

How does Nesterov Accelerated Gradient improve on momentum?

A

It computes the gradient at the predicted future position for more accurate updates.
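
A sketch of one Nesterov step; grad_fn is an assumed callable that returns the gradient of the loss at a given point:

```python
def nesterov_step(theta, velocity, grad_fn, alpha=0.01, beta=0.9):
    """Nesterov momentum: look ahead along the velocity before taking the gradient."""
    lookahead = theta + beta * velocity  # predicted future position
    grad = grad_fn(lookahead)            # gradient evaluated at the look-ahead point
    velocity = beta * velocity - alpha * grad
    theta = theta + velocity
    return theta, velocity
```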

14
Q

What is Adam short for?

A

Adaptive Moment Estimation.

15
Q

What two ideas does Adam combine?

A

Momentum and per-parameter adaptive learning rates (as in RMSProp).

16
Q

What are the first and second moments in Adam?

A

First moment is the mean of gradients; second moment is the mean of squared gradients.

17
Q

Why do we apply bias correction in Adam?

A

To correct the initialization bias in moment estimates early in training.
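
A sketch of one Adam update tying cards 15-17 together: the first moment plays the role of momentum, the second moment gives per-parameter adaptive step sizes (as in RMSProp), and the bias correction compensates for both moments being initialised at zero. The default hyperparameters shown are the commonly used values, not requirements:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are the running moment estimates, t is the step count (starting at 1)."""
    m = beta1 * m + (1 - beta1) * grad       # first moment: exponential mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment: exponential mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction: the moments start at zero,
    v_hat = v / (1 - beta2 ** t)             # so early estimates are biased towards zero
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return theta, m, v
```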

18
Q

What is the main benefit of Adam over SGD?

A

It adapts learning rates for each parameter and typically converges faster.

19
Q

What is a hyperparameter in optimisation?

A

A setting (like learning rate) that is not learned from data but chosen beforehand.

20
Q

What are some key hyperparameters in training?

A

Learning rate, batch size, momentum coefficient, choice of optimiser, and learning rate schedule.
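
One way these might be gathered into a single training configuration; the names and values below are purely illustrative:

```python
config = {
    "learning_rate": 1e-3,    # step size alpha
    "batch_size": 64,         # mini-batch size for SGD
    "momentum": 0.9,          # momentum coefficient beta
    "optimiser": "adam",      # which update rule to use
    "lr_schedule": "cosine",  # how the learning rate changes over training
}
```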

21
Q

What is the purpose of using a learning rate schedule?

A

To reduce the learning rate over time for more stable convergence.
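
One common choice is exponential decay; the functional form and constants here are illustrative (step decay and cosine schedules are equally common):

```python
def exponential_decay(alpha0, step, decay_rate=0.96, decay_steps=1000):
    """Learning rate after `step` updates: starts at alpha0 and shrinks geometrically."""
    return alpha0 * decay_rate ** (step / decay_steps)

# e.g. the learning rate to use at update number `step`:
# alpha = exponential_decay(alpha0=0.01, step=step)
```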

22
Q

What is the Frobenius norm used for in optimisation problems?

A

To measure the overall size of a matrix (the square root of the sum of its squared entries), for example the error of a matrix approximation.
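
A quick NumPy check, comparing the explicit computation with np.linalg.norm:

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 0.0]])

explicit = np.sqrt(np.sum(A ** 2))  # square root of the sum of squared entries
builtin = np.linalg.norm(A, "fro")  # NumPy's Frobenius norm; both equal 5.0
# For an approximation B of A, np.linalg.norm(A - B, "fro") measures the approximation error.
```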

23
Q

What does optimisation help us achieve in training neural networks?

A

To reduce prediction error and improve generalisation by tuning parameters.