Gradient Descent Flashcards
(18 cards)
What is gradient descent?
Gradient descent adjusts w iteratively in the direction that leads to the biggest decrease in E(w)
What is the process for gradient descent?
Initialise w with zeros or small random values near zero, then repeat the update for a given number of iterations or until ∇E(w) is a vector of zeros.
What is the equation for gradient descent?
w = w − η∇E(w)
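A minimal sketch of this update loop in Python; the quadratic loss, function names, learning rate and stopping settings are illustrative choices, not taken from the cards:

```python
import numpy as np

def gradient_descent(grad_E, w0, eta=0.1, n_iters=100, tol=1e-8):
    """Repeat w = w - eta * grad_E(w) for n_iters steps or until ∇E(w) is (almost) a vector of zeros."""
    w = np.array(w0, dtype=float)      # initialise w (zeros or small random values near zero)
    for _ in range(n_iters):
        g = grad_E(w)                  # ∇E(w) at the current weights
        if np.all(np.abs(g) < tol):    # stop once the gradient is (near) zero
            break
        w = w - eta * g                # the gradient descent update rule
    return w

# Illustrative loss E(w) = w1^2 + w2^2, so ∇E(w) = 2w and the minimum is at the origin.
print(gradient_descent(lambda w: 2 * w, w0=[3.0, -2.0]))   # close to [0, 0]
```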
What is the equation for the gradient of the loss?
∇E(w) = Σᵢ (p(1 | x(i), w) − y(i)) x(i), where the sum runs over the training examples i
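This is the gradient of the cross-entropy loss for logistic regression, assuming p(1 | x, w) = σ(wᵀx). A minimal sketch of computing it, with an illustrative design matrix X of shape (n, d) and labels y in {0, 1}:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_E(w, X, y):
    """∇E(w) = Σᵢ (p(1 | x(i), w) - y(i)) x(i), written as a single matrix product."""
    p = sigmoid(X @ w)      # p(1 | x(i), w) for every training example
    return X.T @ (p - y)    # sums (p(i) - y(i)) x(i) over all examples
```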
What does a large learning rate lead to?
A large learning rate causes the updates to jump back and forth across the optimum, making the optimisation unstable
Can gradient descent get stuck at a local minimum of E(w)?
No, because E(w) is strictly convex with respect to w and so has a unique (global) minimum.
What does a small learning rate lead to?
A small learning rate results in a longer time to converge to the optimum
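A tiny sketch of both effects on the illustrative loss E(w) = w², where ∇E(w) = 2w; the learning rates and iteration count are arbitrary demonstration values:

```python
def run(eta, w=1.0, n_iters=20):
    for _ in range(n_iters):
        w = w - eta * 2 * w    # gradient descent on E(w) = w^2
    return w

print(run(eta=1.1))    # |w| grows every step: jumps across the optimum and diverges
print(run(eta=0.01))   # w ≈ 0.67 after 20 steps: converging, but slowly
print(run(eta=0.4))    # w ≈ 0: a moderate learning rate reaches the optimum quickly
```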
What is differential curvature?
Differential curvature is when the loss function curves at different rates in different directions (w.r.t. different weights). It can be seen in contour plots of the loss function, where each line corresponds to points where the loss is the same: elongated, elliptical contours indicate differential curvature.
What are the drawbacks with differential curvature?
Under differential curvature, the gradient gives only the instantaneous direction of best movement, not the best long-term direction, so optimisation is slow
What causes partial derivatives w.r.t. different weights to differ in size?
Different partial derivatives w.r.t. different weights can result from the input variables having different scales and variances, which affects the shape of the loss function. Standardising the input variables helps.
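A minimal sketch of standardising the input variables column-wise to zero mean and unit standard deviation; the example array is made up:

```python
import numpy as np

def standardise(X):
    """Rescale each input variable (column) to zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])   # two input variables on very different scales
print(standardise(X))           # each column now has mean 0 and standard deviation 1
```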
What are second-order derivatives?
Second-order derivatives show how quickly the gradient is changing, i.e. whether it is changing too quickly for large weight updates to be reliable.
Positive = convex, negative = concave
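For example, E(w) = w² has E′′(w) = 2 > 0, so it is convex everywhere, while E(w) = −w² has E′′(w) = −2 < 0, so it is concave.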
What is the Newton-Raphson equation (Univariate)?
w = w − E′(w) / E′′(w)
Explain the new weight update rule for Newton-Raphson
The weight update rule is based on a quadratic approximation of the loss, given by a Taylor polynomial of degree 2. The update jumps to the optimum of this quadratic approximation in a single step.
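A minimal sketch of the univariate rule; the loss E(w) = (w − 3)² is an illustrative choice, picked because it shows the single-step property described above:

```python
def newton_raphson_1d(dE, d2E, w=0.0, n_iters=1):
    """Repeat w = w - E'(w) / E''(w)."""
    for _ in range(n_iters):
        w = w - dE(w) / d2E(w)
    return w

# For E(w) = (w - 3)^2: E'(w) = 2(w - 3) and E''(w) = 2.
# Because E is already quadratic, a single Newton-Raphson step lands exactly on the minimum.
print(newton_raphson_1d(dE=lambda w: 2 * (w - 3), d2E=lambda w: 2.0))   # 3.0
```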
What are partial derivatives?
A partial derivative is the rate of change of a function of multiple variables w.r.t. one of those variables, holding the others constant.
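For example, if E(w₁, w₂) = w₁² + 3w₁w₂, then ∂E/∂w₁ = 2w₁ + 3w₂, treating w₂ as a constant.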
What is the Hessian?
The Hessian is the matrix of all second-order partial derivatives of E(w) with respect to the weights
What is the Newton-Raphson equation (Multivariate)?
w = w − H⁻¹∇E(w), where H is the Hessian of E(w)
Why use Newton-Raphson Multivariate?
This update takes us to the optimum of the quadratic approximation in a single step. However, as the quadratic approximation is not the true loss function, we need to apply the rule iteratively.
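A minimal sketch of the multivariate rule, assuming functions that return ∇E(w) and the Hessian H(w); it solves the linear system H · step = ∇E(w) rather than forming H⁻¹ explicitly, and the quadratic test loss is purely illustrative:

```python
import numpy as np

def newton_raphson(grad_E, hess_E, w0, n_iters=10, eta=1.0):
    """Repeat w = w - eta * H(w)^(-1) ∇E(w); eta = 1 is the pure Newton-Raphson step,
    eta < 1 gives the damped variant with a learning rate (see the next card)."""
    w = np.array(w0, dtype=float)
    for _ in range(n_iters):
        step = np.linalg.solve(hess_E(w), grad_E(w))   # H^(-1) ∇E(w) without an explicit inverse
        w = w - eta * step
    return w

# Illustrative quadratic loss E(w) = 0.5 wᵀAw - bᵀw, so ∇E(w) = Aw - b and H(w) = A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
print(newton_raphson(lambda w: A @ w - b, lambda w: A, w0=[0.0, 0.0], n_iters=1))
# One step lands on the exact minimiser A^(-1) b, since this loss really is quadratic.
```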
Can we apply a learning rate to the Newton-Raphson method?
We can add a learning rate, since the quadratic model is only an approximation of the true loss. However, Newton-Raphson itself is not ideal for large datasets or limited memory, because of the cost of storing and inverting the Hessian matrices.