Gradient Descent Flashcards

(18 cards)

1
Q

What is gradient descent?

A

Gradient descent adjusts w iteratively in the direction that gives the biggest decrease in E(w), i.e. it repeatedly steps along the negative gradient −∇E(w).

2
Q

What is the process for gradient descent?

A

Initialise w with zeros or small random values near zero, then repeatedly apply the update rule for a given number of iterations or until ∇E(w) is (close to) a vector of zeros.

3
Q

What is the equation for gradient descent?

A

w = w − η∇E(w), where η is the learning rate.
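
A minimal sketch of this update applied iteratively in Python (the quadratic loss used here is a placeholder for illustration, not the course's actual E(w)):

    import numpy as np

    def grad_E(w):
        # Placeholder gradient: for the toy loss E(w) = ||w - 3||^2, grad E(w) = 2(w - 3)
        return 2 * (w - 3)

    eta = 0.1                   # learning rate
    w = np.zeros(2)             # initialise w with zeros (or small random values)
    for _ in range(1000):       # fixed number of iterations
        g = grad_E(w)
        if np.allclose(g, 0):   # stop early once grad E(w) is (nearly) a vector of zeros
            break
        w = w - eta * g         # the update rule: w = w - eta * grad E(w)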

4
Q

What is the equation for the change in loss?

A

∇E(w) = Σᵢ (p(1 | x(i), w) − y(i)) · x(i), where the sum runs over the training examples i.
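
Assuming p(1 | x, w) is the usual logistic (sigmoid) model, a sketch of this gradient computation in Python might look like:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def grad_E(w, X, y):
        # X: (n, d) matrix whose rows are the x(i); y: (n,) vector of 0/1 labels y(i)
        p = sigmoid(X @ w)       # p(1 | x(i), w) for every training example
        return X.T @ (p - y)     # sum_i (p(1 | x(i), w) - y(i)) * x(i)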

5
Q

What does a large learning rate lead to?

A

A large learning rate leads to overshooting: the updates jump back and forth across the optimum, so the iterations are unstable.

6
Q

Can gradient descent get stuck at a local minimum of E(w)?

A

No, because E(w) is strictly convex with respect to w and so has a unique (global) minimum.

7
Q

What does a small learning rate lead to?

A

A small learning rate results in a longer time to converge to the optimum, since each step makes only a small amount of progress.

8
Q

What is differential curvature?

A

Differential curvature means the loss surface curves at different rates in different directions. It is visualised with contour plots of the loss function, where each contour line joins points at which the loss is the same.

9
Q

What are the drawbacks with differential curvature?

A

Under differential curvature, the negative gradient is only the instantaneously best direction of movement, not the best long-term direction, so optimisation is slow.

10
Q

What causes partial derivatives to differ across weights?

A

Partial derivatives w.r.t. different weights can differ in magnitude because the input variables have different scales and variances, which affects the shape of the loss function. Standardising the input variables helps.
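
A small sketch of that fix in Python, rescaling each input variable (each column of a NumPy feature matrix X) to zero mean and unit variance:

    import numpy as np

    def standardise(X):
        # Rescale each column (input variable) to zero mean and unit variance
        return (X - X.mean(axis=0)) / X.std(axis=0)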

11
Q

What are second order derivatives?

A

Second-order derivatives show whether the gradient is changing too quickly for a weight update to be reliable.

A positive second derivative means the function is convex at that point; a negative one means it is concave (e.g. E(w) = w² has E″(w) = 2 > 0, so it is convex).

12
Q

What is the Newton-Raphson equation (Univariate)?

A

w = w − E′(w) / E″(w)
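
A sketch of this rule applied iteratively in Python, using a made-up loss E(w) = (w − 2)² for illustration:

    def newton_raphson_1d(dE, d2E, w=0.0, n_iter=20):
        # Repeatedly apply w = w - E'(w) / E''(w)
        for _ in range(n_iter):
            w = w - dE(w) / d2E(w)
        return w

    # For E(w) = (w - 2)^2: E'(w) = 2(w - 2) and E''(w) = 2
    print(newton_raphson_1d(lambda w: 2 * (w - 2), lambda w: 2.0))   # prints 2.0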

13
Q

Explain the new weight update rule for Newton-Raphson

A

The weight update rule is based on a quadratic approximation of the loss given by a Taylor polynomial of degree 2. It jumps to the optimum of that quadratic approximation in a single step.
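
A short sketch of where the rule comes from (the standard Taylor-expansion argument, not spelled out on the card): approximate E near the current weight w₀ by

    E(w) ≈ E(w₀) + E′(w₀)(w − w₀) + ½ E″(w₀)(w − w₀)²

Setting the derivative of this quadratic to zero, E′(w₀) + E″(w₀)(w − w₀) = 0, gives its optimum at w = w₀ − E′(w₀)/E″(w₀), which is exactly the Newton-Raphson update.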

14
Q

What are partial derivatives?

A

A partial derivative is the rate of change of a function of multiple variables w.r.t. one of those variables, with the others held constant.

15
Q

What is the Hessian?

A

The Hessian is the square matrix containing all second-order partial derivatives of the loss with respect to the weights.

16
Q

What is the Newton-Raphson equation (Multivariate)?

A

w = w − H⁻¹∇E(w), where H is the Hessian of E(w)
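
A minimal sketch of one such step in Python, where grad_E and hess_E are assumed (hypothetical) functions returning the gradient vector and Hessian matrix of the loss at w:

    import numpy as np

    def newton_step(w, grad_E, hess_E):
        # w = w - H^(-1) grad E(w); solve H d = grad E(w) rather than forming the inverse explicitly
        return w - np.linalg.solve(hess_E(w), grad_E(w))

Solving the linear system instead of explicitly inverting H is the usual choice because it is cheaper and numerically more stable.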

17
Q

Why use Newton-Raphson Multivariate?

A

This update takes us to the optimum of the quadratic approximation in a single step. However, since the quadratic approximation is not the true loss function, the rule must be applied iteratively.

18
Q

Can we apply a learning rate to the Newton-Raphson method?

A

Yes, a learning rate can be added, since the quadratic model is only an approximation. However, Newton-Raphson is not well suited to large datasets or limited memory, because the Hessian matrices must be computed, stored and inverted.