Backpropagation Flashcards

(14 cards)

1
Q

What problem does backpropagation solve in training neural networks?

A

It efficiently computes the gradient of the loss with respect to every learnable parameter so that optimisation algorithms can update the weights.
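
A minimal PyTorch sketch (toy tensors, illustrative names not taken from the deck) of what this looks like in practice: after loss.backward(), each parameter’s .grad field holds the gradient the optimiser needs.

```python
import torch

w = torch.randn(3, requires_grad=True)      # learnable parameter
x = torch.tensor([1.0, 2.0, 3.0])           # input data (no gradient needed)
y_true = torch.tensor(2.0)                  # target

y_pred = (w * x).sum()                      # forward pass
loss = (y_pred - y_true) ** 2               # scalar loss

loss.backward()                             # backpropagation fills w.grad
print(w.grad)                               # dLoss/dw

with torch.no_grad():                       # a gradient-descent style update
    w -= 0.01 * w.grad
```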

2
Q

What is a computational graph in the context of automatic differentiation?

A

A directed acyclic graph where each node represents the output of a primitive operation and edges indicate data flow, enabling systematic gradient computation.
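
A small illustration (assuming PyTorch) of such a graph: each operation creates a node, and grad_fn.next_functions exposes the edges that the backward pass will follow.

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

c = a * b            # Mul node
d = torch.log(c)     # Log node
loss = d + a         # Add node (a is reused, so there are two paths to the loss)

print(loss.grad_fn)                  # <AddBackward0 ...>
print(loss.grad_fn.next_functions)   # edges back to the Log node and to leaf a
```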

3
Q

Name the three pillars of automatic differentiation.

A

1) Computational graph representation, 2) Local derivatives for every primitive operation, 3) Chain rule to combine those derivatives across the graph.

4
Q

What information is stored during the forward pass for use in backpropagation?

A

The output (and often the inputs) of each operation, plus a link (grad_fn in PyTorch) describing how to compute that operation’s gradient.

5
Q

Describe the backward pass (reverse‑mode autodiff) in plain words.

A

Starting from the loss node, gradients are propagated backwards through the graph: at each node the incoming gradient is multiplied by the local derivative, and gradients arriving along different paths are added, until every parameter has its gradient.
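
A hand-worked sketch of this process on a hypothetical toy expression, checked against autograd:

```python
import torch

w = torch.tensor(1.5, requires_grad=True)
x, y = torch.tensor(2.0), torch.tensor(1.0)

p = w * x                 # forward:  p = w*x
e = p - y                 #           e = p - y
loss = e ** 2             #           loss = e^2
loss.backward()

# Reverse pass by hand, starting from dloss/dloss = 1 at the loss node:
dloss_de = 2 * e.item()   # local derivative of e**2
de_dp = 1.0               # local derivative of (p - y) w.r.t. p
dp_dw = x.item()          # local derivative of (w * x) w.r.t. w
print(w.grad.item(), dloss_de * de_dp * dp_dw)   # same number: 8.0
```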

6
Q

Explain the chain rule conceptually as it applies to backpropagation.

A

The sensitivity of the loss to a variable is obtained by multiplying the loss’s sensitivity to each node that consumes the variable by that node’s local derivative with respect to it, then summing these products over all paths to the loss.
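
A tiny sketch (assuming PyTorch) of the “summed over all paths” part: x reaches the output along several paths, and its gradient is the sum of one chain-rule term per path.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x * x + 2 * x   # x*x uses x twice; 2*x is a third path to the output
y.backward()
print(x.grad)       # 8.0 = x + x + 2, one chain-rule term per path
```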

7
Q

How does backprop handle operations with multiple inputs?

A

It computes a partial derivative with respect to each input and sends a gradient back along each one; when a tensor is consumed by several operations, the gradients arriving from those operations are summed (accumulated).
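
A short sketch (PyTorch, illustrative values) of both halves of this: Mul produces one partial derivative per input, and a, which feeds two different operations, accumulates the sum of the gradients coming back from each.

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(5.0, requires_grad=True)

c = a * b          # Mul: backward emits one partial derivative per input
loss = c + 3 * a   # a is also consumed here, so its gradients accumulate
loss.backward()

print(a.grad)      # b + 3 = 8.0  (sum over both uses of a)
print(b.grad)      # a     = 2.0
```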

8
Q

Why must forward activations be saved until the backward pass finishes?

A

Many gradient formulas reuse forward values (e.g., the gradient of log x is 1/x, which needs the saved input x), so storing them avoids re‑running parts of the forward pass.
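
For example (a minimal PyTorch sketch), the backward rule for log needs the forward input, which is why it is kept until backward runs:

```python
import torch

x = torch.tensor(4.0, requires_grad=True)
y = torch.log(x)   # autograd saves x, because d(log x)/dx = 1/x
y.backward()
print(x.grad)      # 0.25 = 1/4, computed from the saved forward value
```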

9
Q

In PyTorch eager mode, how are operations recorded for gradient computation?

A

Every tensor produced by an operation on gradient‑requiring inputs keeps a reference to that operation in tensor.grad_fn, so the backward graph is built implicitly as the forward pass runs.
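
A quick look (PyTorch) at these references: user-created leaves have no grad_fn, while every tensor produced by an operation records the op that created it.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)   # leaf created by the user
y = w * 3                                   # produced by Mul
z = torch.relu(y)                           # produced by ReLU

print(w.grad_fn)   # None -- leaves have no creator
print(y.grad_fn)   # <MulBackward0 ...>
print(z.grad_fn)   # <ReluBackward0 ...>
```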

10
Q

What qualifies as a primitive operation in autodiff frameworks?

A

Any operation with a hard‑coded derivative rule, such as addition, multiplication, logarithm, matrix multiplication, or ReLU.
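
A sketch of what “hard‑coded derivative rule” means in PyTorch terms: a new primitive is a torch.autograd.Function whose backward is written by hand (Square here is a hypothetical example, not a built-in).

```python
import torch

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)        # keep forward value for the backward rule
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x      # hard-coded derivative: d(x^2)/dx = 2x

x = torch.tensor(3.0, requires_grad=True)
Square.apply(x).backward()
print(x.grad)                           # 6.0
```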

11
Q

Why is backpropagation memory‑intensive and how can memory be reduced?

A

All intermediate activations are kept for gradient computation; techniques like checkpointing re‑compute parts of the forward pass to trade compute for reduced memory.
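
A minimal checkpointing sketch (assuming a recent PyTorch with torch.utils.checkpoint; the layer sizes are arbitrary): activations inside the checkpointed block are not stored and are recomputed during the backward pass.

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
)
x = torch.randn(32, 256, requires_grad=True)

out = checkpoint(block, x, use_reentrant=False)  # activations inside block not kept
out.sum().backward()                             # block's forward is re-run here
```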

12
Q

Compare eager mode and graph mode (TorchScript) in PyTorch.

A

Eager executes operations immediately and stores only backward links, while graph mode builds the full forward graph first for optimisation and deployment.
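
A small comparison sketch (hypothetical TinyNet module): the same module can run eagerly, op by op, or be compiled into a TorchScript graph with torch.jit.script.

```python
import torch

class TinyNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 2)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = TinyNet()
x = torch.randn(1, 4)

eager_out = model(x)                # eager: ops run one by one as Python executes
scripted = torch.jit.script(model)  # graph mode: whole forward captured up front
graph_out = scripted(x)
print(scripted.graph)               # the captured forward graph
```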

13
Q

Why can’t we symbolically differentiate an entire deep network expression?

A

The symbolic expression would be immensely large and impractical; autodiff breaks computation into manageable primitives with known derivatives.

14
Q

How does the runtime of the backward pass compare to the forward pass?

A

It is typically within a small constant factor (roughly two to three times the forward cost), because each primitive’s gradient computation requires a similar amount of work as its forward computation.
