Gradient Descent Flashcards
(65 cards)
What is Gradient Descent?
Gradient Descent is an iterative optimization algorithm used to minimize a loss/cost function by adjusting model parameters in the direction of the negative gradient. It is foundational in machine learning for training models like linear regression and neural networks.
What is the general update rule for Gradient Descent?
For parameters θ and learning rate α:
θ = θ - α ⋅ ∇J(θ),
where ∇J(θ) is the gradient of the cost function J(θ).
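The update rule above can be sketched in a few lines of Python. The cost J(θ) = θ² and its gradient 2θ are illustrative assumptions, not part of the card:

```python
# Minimal sketch of the update rule theta = theta - alpha * grad_J(theta),
# applied to the illustrative convex cost J(theta) = theta^2 (gradient: 2*theta).
def gradient_descent(grad_J, theta0, alpha=0.1, n_steps=100):
    theta = theta0
    for _ in range(n_steps):
        theta = theta - alpha * grad_J(theta)  # step along the negative gradient
    return theta

# Starting from theta = 5, the iterates shrink toward the minimizer at 0.
theta_min = gradient_descent(lambda t: 2 * t, theta0=5.0)
```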
What is Batch Gradient Descent (BGD)?
BGD computes the gradient using the entire training dataset and updates parameters once per epoch. It is stable but computationally expensive.
What is the cost function for BGD in linear regression (MSE)?
J(θ) = (1/2m) Σ(hθ(xⁱ) - yⁱ)²,
where m = number of training examples, hθ(x) = θᵀx.
What is the gradient of MSE for BGD?
∇J(θ) = (1/m) Σ(hθ(xⁱ) - yⁱ) ⋅ xⁱ.
For parameter θ_j: ∂J/∂θ_j = (1/m) Σ(hθ(xⁱ) - yⁱ) ⋅ x_jⁱ.
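A vectorized NumPy sketch of BGD using this MSE gradient; the synthetic dataset fitting y = 2x + 1 and the hyperparameter values are assumptions for demonstration:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.5, n_epochs=2000):
    # X: (m, n) design matrix with a leading column of ones for the intercept.
    # One parameter update per epoch, using the gradient over the full dataset.
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        grad = X.T @ (X @ theta - y) / m  # (1/m) * sum((h(x) - y) * x)
        theta -= alpha * grad
    return theta

# Noiseless synthetic data for y = 1 + 2x, x in [0, 1].
X = np.column_stack([np.ones(50), np.linspace(0, 1, 50)])
y = X @ np.array([1.0, 2.0])
theta = batch_gradient_descent(X, y)
```

Note that the entire dataset is touched on every epoch, which is exactly the "computationally expensive" part of the card.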
What are the pros and cons of BGD?
Pros: Stable convergence, accurate gradients. Cons: Slow for large datasets, high memory usage.
What is Stochastic Gradient Descent (SGD)?
SGD updates parameters per training example (one example at a time). It is faster but noisier than BGD.
What is the update rule for SGD?
For a single example (xⁱ, yⁱ):
θ = θ - α ⋅ (hθ(xⁱ) - yⁱ) ⋅ xⁱ.
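A sketch of the per-example update, with an assumed noiseless dataset for y = 0.5 − 1.5x; shuffling each epoch is a common convention, not part of the card:

```python
import numpy as np

def sgd(X, y, alpha=0.05, n_epochs=50, seed=0):
    # One parameter update per training example, in shuffled order each epoch.
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):
            error = X[i] @ theta - y[i]    # h(x_i) - y_i
            theta -= alpha * error * X[i]  # single-example gradient step
    return theta

X = np.column_stack([np.ones(100), np.linspace(-1, 1, 100)])
y = X @ np.array([0.5, -1.5])
theta = sgd(X, y)
```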
Why does SGD have noisy updates?
Because it uses a single example’s gradient, which may not represent the true gradient of the entire dataset.
What is Mini-Batch Gradient Descent?
A compromise between BGD and SGD. It uses small batches (e.g., 32, 64 examples) to compute gradients. Balances speed and stability.
What is the update rule for Mini-Batch GD?
For a batch size b:
θ = θ - α ⋅ (1/b) Σ(hθ(xⁱ) - yⁱ) ⋅ xⁱ
for i=1 to b.
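The batch-averaged update can be sketched as follows; the batch size, dataset (y = 3x − 1), and other hyperparameters are illustrative assumptions:

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.1, batch_size=32, n_epochs=300, seed=0):
    # Shuffle indices each epoch, then update once per batch of (up to) b examples.
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        idx = rng.permutation(m)
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = Xb.T @ (Xb @ theta - yb) / len(batch)  # (1/b) * sum over batch
            theta -= alpha * grad
    return theta

X = np.column_stack([np.ones(200), np.linspace(0, 1, 200)])
y = X @ np.array([-1.0, 3.0])
theta = minibatch_gd(X, y)
```

The inner gradient is a single matrix product over the batch, which is the vectorization advantage the next card refers to.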
Why is Mini-Batch GD widely used in deep learning?
It leverages vectorized operations for efficiency, avoids SGD’s noise, and scales well to large datasets.
What is the role of the learning rate (α)?
α controls the step size of parameter updates. Too small: slow convergence. Too large: oscillations/divergence.
How does learning rate decay improve convergence?
Reducing α over time (e.g., α = α₀ / (1 + decay_rate * epoch)) helps fine-tune parameters near the minimum.
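The 1/t schedule from this card, as a one-line sketch (α₀ and decay_rate values are illustrative):

```python
# alpha_epoch = alpha0 / (1 + decay_rate * epoch): large early steps,
# progressively smaller steps as training approaches the minimum.
def decayed_lr(alpha0, decay_rate, epoch):
    return alpha0 / (1 + decay_rate * epoch)

schedule = [decayed_lr(0.1, 0.5, e) for e in range(4)]  # monotonically decreasing
```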
What is the difference between BGD, SGD, and Mini-Batch GD in terms of updates?
- BGD: 1 update/epoch.
- SGD: m updates/epoch (m = dataset size).
- Mini-Batch: m/b updates/epoch (b = batch size).
What is an ‘epoch’?
One full pass through the entire training dataset.
What is a ‘local minimum’? How does SGD escape it?
A point where J(θ) is lower than nearby points but not the global minimum. SGD’s noisy updates can ‘jump’ out of local minima.
What is a saddle point? Why is it problematic?
A point where the gradient is zero but which is neither a minimum nor a maximum: the cost curves upward in some directions and downward in others (common in high-dimensional spaces). Plain GD can stall there; SGD's noise helps escape it.
Why is feature scaling important for Gradient Descent?
Scaling features (e.g., normalization or standardization) makes the cost surface less elongated, so one learning rate works well for every parameter and convergence is faster.
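One common scaling choice is standardization (z-score scaling); a minimal sketch, with an assumed toy matrix whose second feature is three orders of magnitude larger than the first:

```python
import numpy as np

# Standardization: shift each feature to mean 0 and rescale to std 1,
# so gradient steps are comparably sized across parameters.
def standardize(X):
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma

X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])
Xs = standardize(X)
```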
What is the convergence criterion for Gradient Descent?
Stop when ||∇J(θ)|| < ε (a small threshold) or when J(θ) changes by less than a small tolerance between iterations.
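A sketch of that stopping check; the threshold values (and the name `should_stop`) are assumptions for illustration:

```python
import numpy as np

# Halt when the gradient norm is below eps, or when the cost has barely
# changed since the previous check (tol is an assumed tolerance).
def should_stop(grad, J_prev, J_curr, eps=1e-6, tol=1e-9):
    return bool(np.linalg.norm(grad) < eps or abs(J_prev - J_curr) < tol)
```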
How does BGD guarantee convergence for convex functions?
For convex J(θ), BGD converges to the global minimum with a sufficiently small α.
What is the time complexity of BGD vs. SGD?
- BGD: O(m) per epoch.
- SGD: O(1) per update (total O(m) per epoch).
What is the ‘vanishing gradient’ problem?
In deep networks, gradients become extremely small as they are backpropagated through many layers, slowing down updates in early layers. It stems from network depth and activation functions rather than the choice of GD variant.
What is momentum in Gradient Descent?
Momentum (e.g., v = γv + α∇J(θ); θ = θ - v) accelerates updates in consistent gradient directions, reducing oscillations.
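The momentum update from this card as a runnable sketch; the cost J(θ) = θ² and the hyperparameter values are illustrative assumptions:

```python
# Classical momentum: v = gamma*v + alpha*grad_J(theta); theta = theta - v.
# The velocity v accumulates past gradients, accelerating consistent directions.
def momentum_gd(grad_J, theta0, alpha=0.1, gamma=0.9, n_steps=200):
    theta, v = theta0, 0.0
    for _ in range(n_steps):
        v = gamma * v + alpha * grad_J(theta)  # update velocity
        theta = theta - v                      # step by the velocity
    return theta

# Minimizing J(theta) = theta^2 (gradient 2*theta) from theta = 5.
theta_min = momentum_gd(lambda t: 2 * t, theta0=5.0)
```

With γ = 0 this reduces exactly to the plain update rule θ = θ − α⋅∇J(θ).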