Lecture 8 - Gradient Methods Flashcards
What types of space affect optimisation? (qualities of the Hypothesis Space)
- Continuous: No sudden jumps; gradient methods work well.
- Discontinuous: May break gradient methods.
- Differentiable: Needed for gradient descent.
- Non-differentiable: Use direct methods instead.
- Low modality: Few minima (easier to optimize).
- High modality: Many local minima (harder).
- Pathological: Strange landscapes, e.g., spiky or flat plateaus.
Why is it important to know the effects of spaces?
Because gradient descent assumes a nice, smooth bowl. If the space is bumpy or weird, the method may fail or converge slowly.
What do derivatives tell us?
- First derivative f′(x): tells you the slope — use it to go up or down.
- Second derivative f′′(x): tells you the curvature — use it to judge how fast slope is changing.
What questions do derivatives answer?
- Where is the slope 0? ⇒ potential extrema (stationary points)
- Is the point a min, max, or saddle?
- Which way should I move for fastest descent?
What is Gradient Descent and its rule?
Gradient descent is an iterative optimization algorithm used to find the minimum of a function.
It works by moving in the direction of the negative gradient, which is the direction of steepest decrease in the function.
x ← x − α · f′(x)
What is Gradient Ascent and its rule?
Gradient ascent is the same idea, but instead of minimizing, it maximizes the function.
You move in the direction of the gradient — where the function increases most quickly.
x ← x + α · f′(x)
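A minimal sketch of both update rules in code (the function f(x) = x², the starting point, and α = 0.1 are my own illustrative choices, not from the lecture):

```python
# One update step of descent and ascent on f(x) = x^2 (illustrative example).
def f_prime(x):
    return 2 * x          # derivative of f(x) = x^2

alpha = 0.1               # step size (learning rate)

x = 4.0
x_descent = x - alpha * f_prime(x)   # gradient descent: move against the slope
x_ascent  = x + alpha * f_prime(x)   # gradient ascent: move with the slope

print(x_descent)  # 3.2 -> closer to the minimum at x = 0
print(x_ascent)   # 4.8 -> further up the slope
```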
What is the Gradient Ascent/Descent Algorithm process?
REFER TO SLIDES FOR BREAKDOWN
What is the Stopping Criteria of Gradient Ascent/Descent?
You stop when:
* f′(x)=0: The slope is flat — potential min or max.
* ∣f′(x)∣<ϵ: Close enough to flat.
* Time or iteration limit reached.
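Tying the algorithm card above together with these criteria, a rough sketch of the usual descent loop (f(x) = x², α, ε, and the iteration cap below are illustrative assumptions, not the slides' exact breakdown):

```python
# Sketch of gradient descent with the stopping criteria above.
# f(x) = x^2; alpha, epsilon and max_iters are illustrative choices only.
def f_prime(x):
    return 2 * x

def gradient_descent(x, alpha=0.1, epsilon=1e-6, max_iters=1000):
    for _ in range(max_iters):          # time / iteration limit
        grad = f_prime(x)
        if abs(grad) < epsilon:         # |f'(x)| < eps: close enough to flat
            break
        x = x - alpha * grad            # descent update
    return x

print(gradient_descent(4.0))   # converges towards the minimum at x = 0
```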
What do you need to consider about the Stopping Criteria of Gradient Ascent/Descent?
- A zero gradient (f′(x)=0) doesn’t always mean you’ve found a minimum
- Could be:
○ Maximum (gradient ascent)
○ Minimum (gradient descent)
○ Saddle point (neither — it’s flat but unstable)
- How do we tell the difference?
○ Use the second derivative, f′′(x)
How do you determine if it's a local min or local max?
f′(x) = 0, f′′(x) > 0 (minimum)
f′(x) = 0, f′′(x) < 0 (maximum)
f′(x) = 0, f′′(x) = 0 (inconclusive)
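A worked example of this test (the function f(x) = x³ − 3x is my own choice for illustration):

```python
# Second-derivative test on f(x) = x^3 - 3x (illustrative example).
# f'(x) = 3x^2 - 3 is zero at x = -1 and x = +1.
def df(x):  return 3 * x**2 - 3
def d2f(x): return 6 * x

for x in (-1.0, 1.0):
    if df(x) != 0:
        continue                      # not a stationary point
    if d2f(x) > 0:
        kind = "local minimum"
    elif d2f(x) < 0:
        kind = "local maximum"
    else:
        kind = "inconclusive"
    print(x, kind)
# -1.0 local maximum, 1.0 local minimum
```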
What is the Ideal Case of Gradient Ascent/Descent?
○ The function is smooth and differentiable
§ No jumps, kinks, or flat regions
§ Gradient exists everywhere
○ It has a single global minimum or maximum
§ No local minima or maxima to get stuck in
§ Gradient descent will always find the right answer
○ The gradient behaves predictably
§ Far from the minimum → big slope → big step
§ Close to the minimum → small slope → small step
This means you naturally slow down as you approach the minimum → smooth convergence.
Why does the Ideal Case for Gradient Ascent/Descent matter?
- No overshooting
- No weird local optima
- No need to adjust α much
- Convergence is fast and stable
What is Derivative Step Size?
- When you’re far from the minimum, the slope (derivative) is steep → the gradient is large → you take bigger steps
- When you’re close to the minimum, the slope flattens out → the gradient is small → you take smaller steps
Why is Derivative Step Size helpful?
Because in many functions (especially quadratics), the gradient acts like a natural brake:
- Far away? → move quickly
- Close to the target? → slow down automatically
- This helps avoid overshooting the minimum
Hence, the derivative acts like a natural step-size scaler when α=1.
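A small demo of this self-scaling behaviour with a fixed α (f(x) = x² and α = 0.1 are assumed values for the illustration):

```python
# With a fixed alpha, the step |alpha * f'(x)| shrinks automatically as x
# approaches the minimum of f(x) = x^2 (illustrative example).
def f_prime(x):
    return 2 * x

x, alpha = 5.0, 0.1
for _ in range(5):
    step = alpha * f_prime(x)
    print(f"x = {x:.3f}, step = {step:.3f}")
    x = x - step
# steps: 1.000, 0.800, 0.640, 0.512, 0.410 (the steps shrink as x approaches 0)
```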
What is the Rayleigh Distribution Case?
- A Rayleigh distribution is asymmetric:
○ Steep on one side
○ Flat on the other
- Gradient descent might:
○ Take small steps where slope is flat
○ Overshoot where slope is steep
Why is it a problem if the gradient takes small steps or overshoots in the Rayleigh Distribution case?
- Poor convergence and unpredictable behaviour; choosing α becomes difficult.
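A rough sketch of where those gradients come from, assuming a Rayleigh density with σ = 1 (the density, σ, and the sample points are my own illustrative choices):

```python
import math

# Rayleigh pdf with sigma = 1: p(x) = x * exp(-x^2 / 2), mode at x = 1.
# Its derivative p'(x) = (1 - x^2) * exp(-x^2 / 2) is steep near 0 and
# almost flat far out in the tail, so one fixed alpha suits neither side.
def p_prime(x):
    return (1 - x**2) * math.exp(-x**2 / 2)

print(p_prime(0.2))   # ~0.94   -> large gradient: a big alpha overshoots the mode
print(p_prime(3.0))   # ~-0.089 -> tiny gradient: the same alpha barely moves
```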
What are the trade-offs when choosing the Step Size for the Rayleigh distribution?
Trade-offs:
- Too small:
○ Converges slowly
○ Wastes computation
- Too large:
○ Overshoots the minimum
○ Can oscillate or diverge
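A quick illustration of both failure modes on f(x) = x² (the step sizes below are chosen only for the demo):

```python
# Step-size trade-off on f(x) = x^2, minimum at x = 0 (illustrative values).
def f_prime(x):
    return 2 * x

def run(alpha, steps=5, x=1.0):
    for _ in range(steps):
        x = x - alpha * f_prime(x)
    return x

print(run(alpha=0.001))  # ~0.99: too small, barely moves (slow convergence)
print(run(alpha=0.4))    # ~0.0003: reasonable, converges quickly
print(run(alpha=1.1))    # ~-2.49: too large, oscillates and diverges
```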
What is the Newton-Raphson Method?
The Newton-Raphson Method is a fast, iterative algorithm used to find:
1. The roots (zeros) of a function (where f(x)=0), or
2. The optima of a function (where f′(x)=0)
How do you find roots with the Newton-Raphson Method?
xₙ₊₁ = xₙ − f(xₙ) / f′(xₙ)
How is Newton-Raphson used for Optimisation?
You apply the same update to the derivative, i.e. xₙ₊₁ = xₙ − f′(xₙ) / f′′(xₙ), which finds points where f′(x) = 0 (minima or maxima)
Newton-Raphson Algorithm
REFER TO SLIDES
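The authoritative breakdown is on the slides; below is only a rough sketch of what the iteration usually looks like (the example functions, starting points, and tolerance are my own choices):

```python
# Sketch of Newton-Raphson (illustrative; see the slides for the exact breakdown).
# Root finding on f(x) = x^2 - 2: the iteration x <- x - f(x)/f'(x) converges
# to sqrt(2). For optimisation, the same update is applied to f' and f'' instead.
def newton_raphson(f, f_prime, x, tol=1e-10, max_iters=50):
    for _ in range(max_iters):
        fx = f(x)
        if abs(fx) < tol:            # close enough to a zero of f
            break
        x = x - fx / f_prime(x)      # Newton-Raphson update
    return x

f  = lambda x: x**2 - 2
df = lambda x: 2 * x
print(newton_raphson(f, df, x=1.0))      # ~1.41421356 (sqrt(2))

# Optimisation: find where f'(x) = 0 by passing (f', f'') to the same routine.
g_prime  = lambda x: 2 * x - 4           # f'(x) for f(x) = x^2 - 4x, minimum at x = 2
g_second = lambda x: 2.0                 # f''(x)
print(newton_raphson(g_prime, g_second, x=10.0))  # 2.0, the minimiser
```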
What is Smoothness?
Smoothness refers to how “nice” or well-behaved a function is, especially in terms of:
* Continuity (no jumps)
* Differentiability (has a slope)
* Second-order differentiability (has curvature)
Smoothness is described by smoothness classes (C^0, C^1, C^2, …) - REFER TO NOTES
Why is Smoothness important in Newton-Raphson?
It requires a C^2 function: the function is continuous, its first and second derivatives exist, and those derivatives are also continuous
What are the limitations of Newton-Raphson?
- Requires the second derivative - not all functions are twice differentiable
- Can diverge - if the starting point is too far from the root, or the second derivative is too small or zero
- Doesn't guarantee a global minimum - only finds a local optimum, which could be a min, max, or saddle point