Gradient Descent Flashcards
(65 cards)
What is Gradient Descent?
Gradient Descent is an iterative optimization algorithm used to minimize a loss/cost function by adjusting model parameters in the direction of the negative gradient. It is foundational in machine learning for training models like linear regression and neural networks.
What is the general update rule for Gradient Descent?
For parameters θ and learning rate α:
θ = θ - α ⋅ ∇J(θ),
where ∇J(θ) is the gradient of the cost function J(θ).
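The update rule above can be sketched in a few lines of Python. The cost J(θ) = θ² and its gradient 2θ are illustrative assumptions, not part of the card:

```python
# Minimal sketch of the update rule theta = theta - alpha * grad_J(theta),
# applied to the illustrative convex cost J(theta) = theta^2 (gradient: 2*theta).
def gradient_descent(grad_J, theta0, alpha=0.1, n_steps=100):
    theta = theta0
    for _ in range(n_steps):
        theta = theta - alpha * grad_J(theta)  # step along the negative gradient
    return theta

# Starting from theta = 5, the iterates shrink toward the minimizer at 0.
theta_min = gradient_descent(lambda t: 2 * t, theta0=5.0)
```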
What is Batch Gradient Descent (BGD)?
BGD computes the gradient using the entire training dataset and updates parameters once per epoch. It is stable but computationally expensive.
What is the cost function for BGD in linear regression (MSE)?
J(θ) = (1/2m) Σ(hθ(xⁱ) - yⁱ)²,
where m = number of training examples, hθ(x) = θᵀx.
What is the gradient of MSE for BGD?
∇J(θ) = (1/m) Σ(hθ(xⁱ) - yⁱ) ⋅ xⁱ.
For parameter θ_j: ∂J/∂θ_j = (1/m) Σ(hθ(xⁱ) - yⁱ) ⋅ x_jⁱ.
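A vectorized NumPy sketch of BGD using this MSE gradient; the synthetic dataset fitting y = 2x + 1 and the hyperparameter values are assumptions for demonstration:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.5, n_epochs=2000):
    # X: (m, n) design matrix with a leading column of ones for the intercept.
    # One parameter update per epoch, using the gradient over the full dataset.
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        grad = X.T @ (X @ theta - y) / m  # (1/m) * sum((h(x) - y) * x)
        theta -= alpha * grad
    return theta

# Noiseless synthetic data for y = 1 + 2x, x in [0, 1].
X = np.column_stack([np.ones(50), np.linspace(0, 1, 50)])
y = X @ np.array([1.0, 2.0])
theta = batch_gradient_descent(X, y)
```

Note that the entire dataset is touched on every epoch, which is exactly the "computationally expensive" part of the card.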
What are the pros and cons of BGD?
Pros: Stable convergence, accurate gradients. Cons: Slow for large datasets, high memory usage.
What is Stochastic Gradient Descent (SGD)?
SGD updates parameters per training example (one example at a time). It is faster but noisier than BGD.
What is the update rule for SGD?
For a single example (xⁱ, yⁱ):
θ = θ - α ⋅ (hθ(xⁱ) - yⁱ) ⋅ xⁱ.
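A sketch of the per-example update, with an assumed noiseless dataset for y = 0.5 − 1.5x; shuffling each epoch is a common convention, not part of the card:

```python
import numpy as np

def sgd(X, y, alpha=0.05, n_epochs=50, seed=0):
    # One parameter update per training example, in shuffled order each epoch.
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):
            error = X[i] @ theta - y[i]    # h(x_i) - y_i
            theta -= alpha * error * X[i]  # single-example gradient step
    return theta

X = np.column_stack([np.ones(100), np.linspace(-1, 1, 100)])
y = X @ np.array([0.5, -1.5])
theta = sgd(X, y)
```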
Why does SGD have noisy updates?
Because it uses a single example’s gradient, which may not represent the true gradient of the entire dataset.
What is Mini-Batch Gradient Descent?
A compromise between BGD and SGD. It uses small batches (e.g., 32, 64 examples) to compute gradients. Balances speed and stability.
What is the update rule for Mini-Batch GD?
For a batch size b:
θ = θ - α ⋅ (1/b) Σ(hθ(xⁱ) - yⁱ) ⋅ xⁱ
for i=1 to b.
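The batch-averaged update can be sketched as follows; the batch size, dataset (y = 3x − 1), and other hyperparameters are illustrative assumptions:

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.1, batch_size=32, n_epochs=300, seed=0):
    # Shuffle indices each epoch, then update once per batch of (up to) b examples.
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        idx = rng.permutation(m)
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = Xb.T @ (Xb @ theta - yb) / len(batch)  # (1/b) * sum over batch
            theta -= alpha * grad
    return theta

X = np.column_stack([np.ones(200), np.linspace(0, 1, 200)])
y = X @ np.array([-1.0, 3.0])
theta = minibatch_gd(X, y)
```

The inner gradient is a single matrix product over the batch, which is the vectorization advantage the next card refers to.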
Why is Mini-Batch GD widely used in deep learning?
It leverages vectorized operations for efficiency, avoids SGD’s noise, and scales well to large datasets.
What is the role of the learning rate (α)?
α controls the step size of parameter updates. Too small: slow convergence. Too large: oscillations/divergence.
How does learning rate decay improve convergence?
Reducing α over time (e.g., α = α₀ / (1 + decay_rate * epoch)) helps fine-tune parameters near the minimum.
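The 1/t schedule from this card, as a one-line sketch (α₀ and decay_rate values are illustrative):

```python
# alpha_epoch = alpha0 / (1 + decay_rate * epoch): large early steps,
# progressively smaller steps as training approaches the minimum.
def decayed_lr(alpha0, decay_rate, epoch):
    return alpha0 / (1 + decay_rate * epoch)

schedule = [decayed_lr(0.1, 0.5, e) for e in range(4)]  # monotonically decreasing
```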
What is the difference between BGD, SGD, and Mini-Batch GD in terms of updates?
- BGD: 1 update/epoch.
- SGD: m updates/epoch (m = dataset size).
- Mini-Batch: m/b updates/epoch (b = batch size).
What is an ‘epoch’?
One full pass through the entire training dataset.
What is a ‘local minimum’? How does SGD escape it?
A point where J(θ) is lower than nearby points but not the global minimum. SGD’s noisy updates can ‘jump’ out of local minima.
What is a saddle point? Why is it problematic?
A point where the gradient is zero but which is neither a minimum nor a maximum: the cost curves upward in some directions and downward in others (common in high-dimensional spaces). Plain GD can stall there; SGD's noise helps escape it.
Why is feature scaling important for Gradient Descent?
Scaling features (e.g., normalization or standardization) makes the cost surface less elongated, so one learning rate works well for every parameter and convergence is faster.
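One common scaling choice is standardization (z-score scaling); a minimal sketch, with an assumed toy matrix whose second feature is three orders of magnitude larger than the first:

```python
import numpy as np

# Standardization: shift each feature to mean 0 and rescale to std 1,
# so gradient steps are comparably sized across parameters.
def standardize(X):
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma

X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])
Xs = standardize(X)
```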
What is the convergence criterion for Gradient Descent?
Stop when ||∇J(θ)|| < ε (a small threshold) or when J(θ) changes by less than a small tolerance between iterations.
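A sketch of that stopping check; the threshold values (and the name `should_stop`) are assumptions for illustration:

```python
import numpy as np

# Halt when the gradient norm is below eps, or when the cost has barely
# changed since the previous check (tol is an assumed tolerance).
def should_stop(grad, J_prev, J_curr, eps=1e-6, tol=1e-9):
    return bool(np.linalg.norm(grad) < eps or abs(J_prev - J_curr) < tol)
```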
How does BGD guarantee convergence for convex functions?
For convex J(θ), BGD converges to the global minimum with a sufficiently small α.
What is the time complexity of BGD vs. SGD?
- BGD: O(m) per epoch.
- SGD: O(1) per update (total O(m) per epoch).
What is the ‘vanishing gradient’ problem?
In deep networks, gradients become extremely small as they are backpropagated through many layers, slowing down updates in early layers. It stems from network depth and activation functions rather than the choice of GD variant.
What is momentum in Gradient Descent?
Momentum (e.g., v = γv + α∇J(θ); θ = θ - v) accelerates updates in consistent gradient directions, reducing oscillations.
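The momentum update from this card as a runnable sketch; the cost J(θ) = θ² and the hyperparameter values are illustrative assumptions:

```python
# Classical momentum: v = gamma*v + alpha*grad_J(theta); theta = theta - v.
# The velocity v accumulates past gradients, accelerating consistent directions.
def momentum_gd(grad_J, theta0, alpha=0.1, gamma=0.9, n_steps=200):
    theta, v = theta0, 0.0
    for _ in range(n_steps):
        v = gamma * v + alpha * grad_J(theta)  # update velocity
        theta = theta - v                      # step by the velocity
    return theta

# Minimizing J(theta) = theta^2 (gradient 2*theta) from theta = 5.
theta_min = momentum_gd(lambda t: 2 * t, theta0=5.0)
```

With γ = 0 this reduces exactly to the plain update rule θ = θ − α⋅∇J(θ).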