Gradient Descent Flashcards
(18 cards)
What is gradient descent?
Gradient descent adjusts w iteratively in the direction that leads to the biggest decrease in E(w)
What is the process for gradient descent?
Initialise w with zeros or small random values near zero, then repeat the update for a given number of iterations or until ∇E(w) is a vector of zeros.
What is the equation for gradient descent?
w = w − η∇E(w)
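A minimal sketch of this update loop in Python; the quadratic loss, function names, learning rate and stopping settings are illustrative choices, not taken from the cards:

```python
import numpy as np

def gradient_descent(grad_E, w0, eta=0.1, n_iters=100, tol=1e-8):
    """Repeat w = w - eta * grad_E(w) for n_iters steps or until ∇E(w) is (almost) a vector of zeros."""
    w = np.array(w0, dtype=float)      # initialise w (zeros or small random values near zero)
    for _ in range(n_iters):
        g = grad_E(w)                  # ∇E(w) at the current weights
        if np.all(np.abs(g) < tol):    # stop once the gradient is (near) zero
            break
        w = w - eta * g                # the gradient descent update rule
    return w

# Illustrative loss E(w) = w1^2 + w2^2, so ∇E(w) = 2w and the minimum is at the origin.
print(gradient_descent(lambda w: 2 * w, w0=[3.0, -2.0]))   # close to [0, 0]
```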
What is the equation for the gradient of the loss?
∇E(w) = Σᵢ (p(1 | x(i), w) − y(i)) x(i), where the sum runs over the training examples i
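This is the gradient of the cross-entropy loss for logistic regression, assuming p(1 | x, w) = σ(wᵀx). A minimal sketch of computing it, with an illustrative design matrix X of shape (n, d) and labels y in {0, 1}:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_E(w, X, y):
    """∇E(w) = Σᵢ (p(1 | x(i), w) - y(i)) x(i), written as a single matrix product."""
    p = sigmoid(X @ w)      # p(1 | x(i), w) for every training example
    return X.T @ (p - y)    # sums (p(i) - y(i)) x(i) over all examples
```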
What does a large learning rate lead to?
A large learning rate causes the updates to jump back and forth across the optimum, making the optimisation unstable
Can gradient descent get stuck at a local minimum of E(w)?
No, because E(w) is strictly convex with respect to w and so has a unique (global) minimum.
What does a small learning rate lead to?
A small learning rate results in a longer time to converge to the optimum
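A tiny sketch of both effects on the illustrative loss E(w) = w², where ∇E(w) = 2w; the learning rates and iteration count are arbitrary demonstration values:

```python
def run(eta, w=1.0, n_iters=20):
    for _ in range(n_iters):
        w = w - eta * 2 * w    # gradient descent on E(w) = w^2
    return w

print(run(eta=1.1))    # |w| grows every step: jumps across the optimum and diverges
print(run(eta=0.01))   # w ≈ 0.67 after 20 steps: converging, but slowly
print(run(eta=0.4))    # w ≈ 0: a moderate learning rate reaches the optimum quickly
```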
What is differential curvature?
Differential curvature is when the loss function curves at different rates in different directions (w.r.t. different weights). It can be seen in contour plots of the loss function, where each line corresponds to points where the loss is the same: elongated, elliptical contours indicate differential curvature.
What are the drawbacks with differential curvature?
Under differential curvature, the gradient gives only the instantaneous direction of best movement, not the best long-term direction, so optimisation is slow
What causes partial derivatives w.r.t. different weights to differ in size?
Different partial derivatives w.r.t. different weights can result from the input variables having different scales and variances, which affects the shape of the loss function. Standardising the input variables helps.
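A minimal sketch of standardising the input variables column-wise to zero mean and unit standard deviation; the example array is made up:

```python
import numpy as np

def standardise(X):
    """Rescale each input variable (column) to zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])   # two input variables on very different scales
print(standardise(X))           # each column now has mean 0 and standard deviation 1
```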
What are second-order derivatives?
Second-order derivatives show how quickly the gradient is changing, i.e. whether it is changing too quickly for large weight updates to be reliable.
Positive = convex, negative = concave
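For example, E(w) = w² has E′′(w) = 2 > 0, so it is convex everywhere, while E(w) = −w² has E′′(w) = −2 < 0, so it is concave.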
What is the Newton-Raphson equation (Univariate)?
w = w − E′(w) / E′′(w)
Explain the new weight update rule for Newton-Raphson
The weight update rule is based on a quadratic approximation of the loss, given by a Taylor polynomial of degree 2. The update jumps to the optimum of this quadratic approximation in a single step.
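A minimal sketch of the univariate rule; the loss E(w) = (w − 3)² is an illustrative choice, picked because it shows the single-step property described above:

```python
def newton_raphson_1d(dE, d2E, w=0.0, n_iters=1):
    """Repeat w = w - E'(w) / E''(w)."""
    for _ in range(n_iters):
        w = w - dE(w) / d2E(w)
    return w

# For E(w) = (w - 3)^2: E'(w) = 2(w - 3) and E''(w) = 2.
# Because E is already quadratic, a single Newton-Raphson step lands exactly on the minimum.
print(newton_raphson_1d(dE=lambda w: 2 * (w - 3), d2E=lambda w: 2.0))   # 3.0
```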
What are partial derivatives?
A partial derivative is the rate of change of a function of multiple variables w.r.t. one of those variables, holding the others constant.
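For example, if E(w₁, w₂) = w₁² + 3w₁w₂, then ∂E/∂w₁ = 2w₁ + 3w₂, treating w₂ as a constant.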
What is the Hessian?
The Hessian is the matrix of all second-order partial derivatives of E(w) with respect to the weights
What is the Newton-Raphson equation (Multivariate)?
w = w − H⁻¹∇E(w), where H is the Hessian of E(w)
Why use Newton-Raphson Multivariate?
This update takes us to the optimum of the quadratic approximation in a single step. However, as the quadratic approximation is not the true loss function, we need to apply the rule iteratively.
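A minimal sketch of the multivariate rule, assuming functions that return ∇E(w) and the Hessian H(w); it solves the linear system H · step = ∇E(w) rather than forming H⁻¹ explicitly, and the quadratic test loss is purely illustrative:

```python
import numpy as np

def newton_raphson(grad_E, hess_E, w0, n_iters=10, eta=1.0):
    """Repeat w = w - eta * H(w)^(-1) ∇E(w); eta = 1 is the pure Newton-Raphson step,
    eta < 1 gives the damped variant with a learning rate (see the next card)."""
    w = np.array(w0, dtype=float)
    for _ in range(n_iters):
        step = np.linalg.solve(hess_E(w), grad_E(w))   # H^(-1) ∇E(w) without an explicit inverse
        w = w - eta * step
    return w

# Illustrative quadratic loss E(w) = 0.5 wᵀAw - bᵀw, so ∇E(w) = Aw - b and H(w) = A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
print(newton_raphson(lambda w: A @ w - b, lambda w: A, w0=[0.0, 0.0], n_iters=1))
# One step lands on the exact minimiser A^(-1) b, since this loss really is quadratic.
```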
Can we apply a learning rate to the Newton-Raphson method?
We can add a learning rate, since the quadratic model is only an approximation of the true loss. However, Newton-Raphson itself is not ideal for large datasets or limited memory, because of the cost of storing and inverting the Hessian matrices.