Backpropagation Flashcards
(14 cards)
What problem does backpropagation solve in training neural networks?
It efficiently computes the gradient of the loss with respect to every learnable parameter so that optimisation algorithms can update the weights.
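A minimal PyTorch sketch (not part of the original card; the toy model and data are made up for illustration) showing where backpropagation sits in one training step:

```python
import torch

# Toy model and data (made up for illustration).
model = torch.nn.Linear(3, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 3)
y = torch.randn(8, 1)

loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()   # backpropagation: fills p.grad for every learnable parameter p
opt.step()        # the optimiser consumes those gradients to update the weights
opt.zero_grad()   # clear gradients before the next iteration
```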
What is a computational graph in the context of automatic differentiation?
A directed acyclic graph where each node represents the output of a primitive operation and edges indicate data flow, enabling systematic gradient computation.
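As an illustration (hypothetical expression chosen here), the same idea in PyTorch, writing one primitive per line so each line corresponds to one graph node:

```python
import torch

# Hypothetical expression y = log(x1 * x2) + x3, one primitive per line.
x1 = torch.tensor(2.0, requires_grad=True)
x2 = torch.tensor(3.0, requires_grad=True)
x3 = torch.tensor(1.0, requires_grad=True)

a = x1 * x2        # node: multiplication
b = torch.log(a)   # node: logarithm
y = b + x3         # node: addition (the graph's output)

# Each intermediate tensor remembers the operation that produced it
# (exact names such as MulBackward0 may vary between PyTorch versions).
print(a.grad_fn, b.grad_fn, y.grad_fn)
```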
Name the three pillars of automatic differentiation.
1) Computational graph representation, 2) Local derivatives for every primitive operation, 3) Chain rule to combine those derivatives across the graph.
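A toy pure-Python sketch (illustrative only, not how PyTorch is actually implemented) showing the three pillars working together:

```python
import math

class Node:
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value                # pillar 1: a node in the computational graph
        self.parents = parents
        self.local_grads = local_grads    # pillar 2: local derivatives of the primitive
        self.grad = 0.0

    def backward(self, upstream=1.0):
        # Simplified: a real engine traverses the graph in reverse topological order.
        self.grad += upstream
        for parent, local in zip(self.parents, self.local_grads):
            parent.backward(upstream * local)   # pillar 3: chain rule

def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))

def log(a):
    return Node(math.log(a.value), (a,), (1.0 / a.value,))

x1, x2 = Node(2.0), Node(3.0)
y = log(mul(x1, x2))
y.backward()
print(x1.grad, x2.grad)  # 0.5 and about 0.3333, i.e. 1/x1 and 1/x2
```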
What information is stored during the forward pass for use in backpropagation?
The output (and sometimes input) of each operation and a link (`grad_fn`) describing how to compute that operation's gradient.
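A small probe of this, assuming a recent PyTorch version that exposes the public saved-tensors hooks API (`torch.autograd.graph.saved_tensors_hooks`); the example values are made up:

```python
import torch

# Each tensor passed to `pack` is a value the forward pass stashes for backward.
saved = []
def pack(t):
    saved.append(t)
    return t
def unpack(t):
    return t

x = torch.tensor(4.0, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    y = torch.log(x)          # log must keep its input to evaluate 1/x later

print(y.grad_fn)              # the backward link recorded for this operation
print(saved)                  # the saved forward value (here, the input 4.0)
```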
Describe the backward pass (reverse‑mode autodiff) in plain words.
Starting from the loss node, gradients are propagated backwards through the graph, multiplying or adding local derivatives, to obtain gradients for all parameters.
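A hand-checkable example (numbers made up) of the reverse sweep filling in `.grad`:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(0.5, requires_grad=True)
x = torch.tensor(3.0)

loss = (w * x + b) ** 2          # loss = (2*3 + 0.5)^2 = 42.25
loss.backward()                  # reverse sweep starting from the loss node

print(w.grad)  # d(loss)/dw = 2*(w*x + b)*x = 39.0
print(b.grad)  # d(loss)/db = 2*(w*x + b)   = 13.0
```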
Explain the chain rule conceptually as it applies to backpropagation.
The loss's sensitivity to an earlier variable equals its sensitivity to each operation that uses that variable multiplied by that operation's local derivative with respect to the variable, summed over all paths through which the variable influences the loss.
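A quick numeric check (example function chosen here) that autograd's result matches the hand-applied chain rule:

```python
import torch

# For y = sin(x^2), the chain rule gives dy/dx = cos(x^2) * 2x.
x = torch.tensor(1.5, requires_grad=True)
y = torch.sin(x ** 2)
y.backward()

manual = torch.cos(x.detach() ** 2) * 2 * x.detach()
print(x.grad, manual)   # both approximately -1.885
```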
How does backprop handle operations with multiple inputs?
It computes a separate partial derivative for each input and sends the incoming gradient along each of those branches; when a variable feeds several operations, the gradients arriving from those paths are summed.
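A small illustration (values made up) of per-input partials and of summation when a variable is used twice:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)

z = x * y + x        # x is used twice: once in the product and once on its own
z.backward()

print(x.grad)  # dz/dx = y + 1 = 5.0  (contributions from both paths summed)
print(y.grad)  # dz/dy = x     = 3.0
```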
Why must forward activations be saved until the backward pass finishes?
Many gradient formulas reuse forward values (e.g., the gradient of log x is 1/x, which needs the stored input x), so keeping them avoids expensive recomputation.
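A hedged sketch of a custom primitive using PyTorch's `torch.autograd.Function` and `ctx.save_for_backward`, showing the saved input being reused by the 1/x rule:

```python
import torch

class MyLog(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)      # keep the forward value until backward runs
        return torch.log(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output / x        # d(log x)/dx = 1/x, reusing the saved input

x = torch.tensor(4.0, requires_grad=True)
MyLog.apply(x).backward()
print(x.grad)  # 0.25
```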
In PyTorch eager mode, how are operations recorded for gradient computation?
Each tensor that requires gradients keeps a reference to its creator operation in `tensor.grad_fn`, forming the backward graph implicitly.
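For inspection, the implicit backward graph can be walked through `grad_fn.next_functions` (an internal attribute whose exact contents may vary between PyTorch versions):

```python
import torch

w = torch.tensor(3.0, requires_grad=True)
x = torch.tensor(2.0, requires_grad=True)
y = torch.log(w * x)

# Follow the first parent at each step until the graph ends at a leaf accumulator.
node, chain = y.grad_fn, []
while node is not None:
    chain.append(type(node).__name__)
    node = node.next_functions[0][0] if node.next_functions else None

print(chain)  # e.g. ['LogBackward0', 'MulBackward0', 'AccumulateGrad']
```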
What qualifies as a primitive operation in autodiff frameworks?
Any operation with a hard‑coded derivative rule, such as addition, multiplication, logarithm, matrix multiplication, or ReLU.
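For example, ReLU's hard-coded rule (local derivative 1 where the input is positive, 0 elsewhere) can be seen directly (toy input chosen here):

```python
import torch

x = torch.tensor([-1.0, 2.0, 3.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)  # tensor([0., 1., 1.])
```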
Why is backpropagation memory‑intensive and how can memory be reduced?
All intermediate activations are kept for gradient computation; techniques like checkpointing re‑compute parts of the forward pass to trade compute for reduced memory.
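A sketch of checkpointing with `torch.utils.checkpoint.checkpoint` (the `use_reentrant=False` argument assumes a recent PyTorch version; sizes are arbitrary):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Activations inside `block` are not stored; they are recomputed during backward,
# trading extra compute for lower peak memory.
block = torch.nn.Sequential(
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
)

x = torch.randn(32, 256, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()            # the block is re-run here to recover its activations
print(x.grad.shape)           # torch.Size([32, 256])
```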
Compare eager mode and graph mode (TorchScript) in PyTorch.
Eager executes operations immediately and stores only backward links, while graph mode builds the full forward graph first for optimisation and deployment.
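A minimal comparison (toy module made up here) of eager execution versus a TorchScript-compiled graph:

```python
import torch

class Tiny(torch.nn.Module):
    def forward(self, x):
        return torch.log(x * 2.0)

eager_model = Tiny()                  # eager: ops run immediately when called
scripted = torch.jit.script(Tiny())   # graph mode: full forward graph captured first

x = torch.randn(4).abs() + 0.1
print(torch.allclose(eager_model(x), scripted(x)))  # True: same maths, different execution
print(scripted.graph)                               # the captured forward graph
```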
Why can’t we symbolically differentiate an entire deep network expression?
The symbolic expression would be immensely large and impractical; autodiff breaks computation into manageable primitives with known derivatives.
How does the runtime of the backward pass compare to the forward pass?
It is typically of the same order because each primitive’s gradient computation requires a similar amount of work as its forward computation.
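A rough timing sketch (layer sizes arbitrary, numbers machine-dependent) comparing the two passes on CPU:

```python
import time
import torch

model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(8)])
x = torch.randn(256, 1024)

t0 = time.perf_counter()
loss = model(x).sum()
t1 = time.perf_counter()
loss.backward()
t2 = time.perf_counter()

print(f"forward:  {t1 - t0:.4f}s")
print(f"backward: {t2 - t1:.4f}s   (comparable order of magnitude)")
```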