Backpropagation Flashcards
(14 cards)
What problem does backpropagation solve in training neural networks?
It efficiently computes the gradient of the loss with respect to every learnable parameter so that optimisation algorithms can update the weights.
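A minimal PyTorch sketch (not part of the original card; the toy model and data are made up for illustration) showing where backpropagation sits in one training step:

```python
import torch

# Toy model and data (made up for illustration).
model = torch.nn.Linear(3, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 3)
y = torch.randn(8, 1)

loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()   # backpropagation: fills p.grad for every learnable parameter p
opt.step()        # the optimiser consumes those gradients to update the weights
opt.zero_grad()   # clear gradients before the next iteration
```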
What is a computational graph in the context of automatic differentiation?
A directed acyclic graph where each node represents the output of a primitive operation and edges indicate data flow, enabling systematic gradient computation.
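As an illustration (hypothetical expression chosen here), the same idea in PyTorch, writing one primitive per line so each line corresponds to one graph node:

```python
import torch

# Hypothetical expression y = log(x1 * x2) + x3, one primitive per line.
x1 = torch.tensor(2.0, requires_grad=True)
x2 = torch.tensor(3.0, requires_grad=True)
x3 = torch.tensor(1.0, requires_grad=True)

a = x1 * x2        # node: multiplication
b = torch.log(a)   # node: logarithm
y = b + x3         # node: addition (the graph's output)

# Each intermediate tensor remembers the operation that produced it
# (exact names such as MulBackward0 may vary between PyTorch versions).
print(a.grad_fn, b.grad_fn, y.grad_fn)
```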
Name the three pillars of automatic differentiation.
1) Computational graph representation, 2) Local derivatives for every primitive operation, 3) Chain rule to combine those derivatives across the graph.
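A toy pure-Python sketch (illustrative only, not how PyTorch is actually implemented) showing the three pillars working together:

```python
import math

class Node:
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value                # pillar 1: a node in the computational graph
        self.parents = parents
        self.local_grads = local_grads    # pillar 2: local derivatives of the primitive
        self.grad = 0.0

    def backward(self, upstream=1.0):
        # Simplified: a real engine traverses the graph in reverse topological order.
        self.grad += upstream
        for parent, local in zip(self.parents, self.local_grads):
            parent.backward(upstream * local)   # pillar 3: chain rule

def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))

def log(a):
    return Node(math.log(a.value), (a,), (1.0 / a.value,))

x1, x2 = Node(2.0), Node(3.0)
y = log(mul(x1, x2))
y.backward()
print(x1.grad, x2.grad)  # 0.5 and about 0.3333, i.e. 1/x1 and 1/x2
```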
What information is stored during the forward pass for use in backpropagation?
The output (and sometimes input) of each operation and a link (`grad_fn`) describing how to compute that operation's gradient.
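A small probe of this, assuming a recent PyTorch version that exposes the public saved-tensors hooks API (`torch.autograd.graph.saved_tensors_hooks`); the example values are made up:

```python
import torch

# Each tensor passed to `pack` is a value the forward pass stashes for backward.
saved = []
def pack(t):
    saved.append(t)
    return t
def unpack(t):
    return t

x = torch.tensor(4.0, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    y = torch.log(x)          # log must keep its input to evaluate 1/x later

print(y.grad_fn)              # the backward link recorded for this operation
print(saved)                  # the saved forward value (here, the input 4.0)
```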
Describe the backward pass (reverse‑mode autodiff) in plain words.
Starting from the loss node, gradients are propagated backwards through the graph, multiplying or adding local derivatives, to obtain gradients for all parameters.
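A hand-checkable example (numbers made up) of the reverse sweep filling in `.grad`:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(0.5, requires_grad=True)
x = torch.tensor(3.0)

loss = (w * x + b) ** 2          # loss = (2*3 + 0.5)^2 = 42.25
loss.backward()                  # reverse sweep starting from the loss node

print(w.grad)  # d(loss)/dw = 2*(w*x + b)*x = 39.0
print(b.grad)  # d(loss)/db = 2*(w*x + b)   = 13.0
```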
Explain the chain rule conceptually as it applies to backpropagation.
The loss's sensitivity to an earlier variable equals its sensitivity to each operation that uses that variable multiplied by that operation's local derivative with respect to the variable, summed over all paths through which the variable influences the loss.
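A quick numeric check (example function chosen here) that autograd's result matches the hand-applied chain rule:

```python
import torch

# For y = sin(x^2), the chain rule gives dy/dx = cos(x^2) * 2x.
x = torch.tensor(1.5, requires_grad=True)
y = torch.sin(x ** 2)
y.backward()

manual = torch.cos(x.detach() ** 2) * 2 * x.detach()
print(x.grad, manual)   # both approximately -1.885
```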
How does backprop handle operations with multiple inputs?
It computes a separate partial derivative for each input and sends the incoming gradient along each of those branches; when a variable feeds several operations, the gradients arriving from those paths are summed.
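A small illustration (values made up) of per-input partials and of summation when a variable is used twice:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)

z = x * y + x        # x is used twice: once in the product and once on its own
z.backward()

print(x.grad)  # dz/dx = y + 1 = 5.0  (contributions from both paths summed)
print(y.grad)  # dz/dy = x     = 3.0
```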
Why must forward activations be saved until the backward pass finishes?
Many gradient formulas reuse forward values (e.g., the gradient of log x is 1/x, which needs the stored input x), so keeping them avoids expensive recomputation.
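A hedged sketch of a custom primitive using PyTorch's `torch.autograd.Function` and `ctx.save_for_backward`, showing the saved input being reused by the 1/x rule:

```python
import torch

class MyLog(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)      # keep the forward value until backward runs
        return torch.log(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output / x        # d(log x)/dx = 1/x, reusing the saved input

x = torch.tensor(4.0, requires_grad=True)
MyLog.apply(x).backward()
print(x.grad)  # 0.25
```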
In PyTorch eager mode, how are operations recorded for gradient computation?
Each tensor that requires gradients keeps a reference to its creator operation in `tensor.grad_fn`, forming the backward graph implicitly.
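For inspection, the implicit backward graph can be walked through `grad_fn.next_functions` (an internal attribute whose exact contents may vary between PyTorch versions):

```python
import torch

w = torch.tensor(3.0, requires_grad=True)
x = torch.tensor(2.0, requires_grad=True)
y = torch.log(w * x)

# Follow the first parent at each step until the graph ends at a leaf accumulator.
node, chain = y.grad_fn, []
while node is not None:
    chain.append(type(node).__name__)
    node = node.next_functions[0][0] if node.next_functions else None

print(chain)  # e.g. ['LogBackward0', 'MulBackward0', 'AccumulateGrad']
```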
What qualifies as a primitive operation in autodiff frameworks?
Any operation with a hard‑coded derivative rule, such as addition, multiplication, logarithm, matrix multiplication, or ReLU.
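For example, ReLU's hard-coded rule (local derivative 1 where the input is positive, 0 elsewhere) can be seen directly (toy input chosen here):

```python
import torch

x = torch.tensor([-1.0, 2.0, 3.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)  # tensor([0., 1., 1.])
```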
Why is backpropagation memory‑intensive and how can memory be reduced?
All intermediate activations are kept for gradient computation; techniques like checkpointing re‑compute parts of the forward pass to trade compute for reduced memory.
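A sketch of checkpointing with `torch.utils.checkpoint.checkpoint` (the `use_reentrant=False` argument assumes a recent PyTorch version; sizes are arbitrary):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Activations inside `block` are not stored; they are recomputed during backward,
# trading extra compute for lower peak memory.
block = torch.nn.Sequential(
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
)

x = torch.randn(32, 256, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()            # the block is re-run here to recover its activations
print(x.grad.shape)           # torch.Size([32, 256])
```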
Compare eager mode and graph mode (TorchScript) in PyTorch.
Eager executes operations immediately and stores only backward links, while graph mode builds the full forward graph first for optimisation and deployment.
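A minimal comparison (toy module made up here) of eager execution versus a TorchScript-compiled graph:

```python
import torch

class Tiny(torch.nn.Module):
    def forward(self, x):
        return torch.log(x * 2.0)

eager_model = Tiny()                  # eager: ops run immediately when called
scripted = torch.jit.script(Tiny())   # graph mode: full forward graph captured first

x = torch.randn(4).abs() + 0.1
print(torch.allclose(eager_model(x), scripted(x)))  # True: same maths, different execution
print(scripted.graph)                               # the captured forward graph
```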
Why can’t we symbolically differentiate an entire deep network expression?
The symbolic expression would be immensely large and impractical; autodiff breaks computation into manageable primitives with known derivatives.
How does the runtime of the backward pass compare to the forward pass?
It is typically of the same order because each primitive’s gradient computation requires a similar amount of work as its forward computation.
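A rough timing sketch (layer sizes arbitrary, numbers machine-dependent) comparing the two passes on CPU:

```python
import time
import torch

model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(8)])
x = torch.randn(256, 1024)

t0 = time.perf_counter()
loss = model(x).sum()
t1 = time.perf_counter()
loss.backward()
t2 = time.perf_counter()

print(f"forward:  {t1 - t0:.4f}s")
print(f"backward: {t2 - t1:.4f}s   (comparable order of magnitude)")
```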