Backprop Flashcards
(24 cards)
What is backpropagation used for in neural networks?
To compute gradients of the loss function with respect to weights.
What is the key mathematical tool used in backpropagation?
The chain rule of calculus.
What is the goal of training a neural network?
To minimize the loss function by updating the model’s parameters.
What does the forward pass in a neural network compute?
The output prediction ŷ from input x using current weights.
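A minimal sketch of the forward pass for a two-layer sigmoid network (the layer sizes and function names are illustrative assumptions, not the lecture's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass: input x -> hidden activations h -> prediction y_hat."""
    h = sigmoid(W1 @ x + b1)       # hidden-layer activations
    y_hat = sigmoid(W2 @ h + b2)   # output prediction
    return h, y_hat
```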
What does the loss function measure?
The difference between predicted output ŷ and true output y.
What is the formula for mean squared error (MSE)?
MSE = ½ Σᵢ (yᵢ − ŷᵢ)² (the ½ is included so it cancels when the loss is differentiated)
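A quick numerical check of the formula, with illustrative values:

```python
import numpy as np

def squared_error(y, y_hat):
    """Loss in the form used above: one half times the sum of squared differences."""
    return 0.5 * np.sum((y - y_hat) ** 2)

# Illustrative targets and predictions (not from the lecture)
print(squared_error(np.array([0.0, 1.0]), np.array([0.2, 0.7])))  # 0.5 * (0.04 + 0.09) = 0.065
```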
What does gradient descent do?
Updates parameters in the direction that reduces the loss.
What is the general update rule in gradient descent?
w ← w - η · ∂J/∂w
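A minimal sketch of the rule on a made-up one-dimensional loss J(w) = w², whose gradient 2w is known in closed form:

```python
def gradient_descent_step(w, grad, lr):
    """Apply w <- w - eta * dJ/dw."""
    return w - lr * grad

w = 3.0
for _ in range(5):
    w = gradient_descent_step(w, grad=2 * w, lr=0.1)  # dJ/dw = 2w for J(w) = w^2
print(w)  # each step moves w toward the minimum at w = 0
```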
Why do we need the chain rule in multilayer networks?
Because each weight affects the loss only through a chain of intermediate functions (later activations and layers), so its gradient is the product of the local derivatives along that chain.
What does the backward pass in backpropagation compute?
Gradients of the loss with respect to each weight in the network.
How does backpropagation proceed through the network?
From the output layer backward to the input layer.
What does ∂J/∂w₅ represent?
The partial derivative of the loss with respect to weight w₅.
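Written out with the chain rule, for the squared-error loss and a sigmoid output (here z is the output pre-activation and h the hidden activation that w₅ multiplies; this naming is an assumption about the lecture's setup):

∂J/∂w₅ = ∂J/∂ŷ · ∂ŷ/∂z · ∂z/∂w₅ = (ŷ − y) · σ′(z) · h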
What is the derivative of the sigmoid function?
σ′(z) = σ(z) · (1 − σ(z))
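A short sketch of the identity, checked numerically with a finite difference at an arbitrary point:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z, eps = 0.5, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(sigmoid_prime(z), numeric)  # the two values agree to several decimal places
```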
What does each term in the chain rule represent during backprop?
A local gradient of one part of the network function.
In a two-layer network, where do we start backpropagation?
At the output layer.
What does the learning rate η control?
The size of each update step during training.
What is the effect of a high learning rate?
Potential overshooting or divergence.
What is the effect of a very low learning rate?
Very slow convergence.
In backpropagation, why do we update hidden layer weights?
Because they also influence the loss through their effect on the output.
What does the forward pass compute in numerical backprop?
Layer outputs using initial weights and activations.
What is the loss after the first forward pass in the lecture's numerical example?
~0.2984 (in the lecture example)
How is the gradient with respect to an output weight computed?
By multiplying the output error, the derivative of the output activation, and the hidden activation that feeds into that weight.
What does backprop through the hidden layer require?
Propagating the output error back through the output-layer weights and multiplying by the derivative of the hidden activation.
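The last few cards can be tied together in one numerical sketch. The starting values below are an assumption: they follow the widely circulated step-by-step worked example (two inputs, two sigmoid hidden units, two sigmoid outputs, targets 0.01 and 0.99), which is consistent with the ~0.2984 loss quoted above; if the lecture used different numbers, only the constants change, not the structure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed starting values (see the note above); treat them as illustrative.
x  = np.array([0.05, 0.10])                  # inputs
y  = np.array([0.01, 0.99])                  # targets
W1 = np.array([[0.15, 0.20], [0.25, 0.30]])  # hidden-layer weights (w1..w4)
b1 = 0.35
W2 = np.array([[0.40, 0.45], [0.50, 0.55]])  # output-layer weights (w5..w8)
b2 = 0.60

# Forward pass
h     = sigmoid(W1 @ x + b1)                 # hidden activations
y_hat = sigmoid(W2 @ h + b2)                 # output predictions
loss  = 0.5 * np.sum((y - y_hat) ** 2)
print(round(loss, 4))                        # ~0.2984 with these starting values

# Backward pass, output layer:
# delta_out = dJ/dy_hat * dy_hat/dz = (y_hat - y) * sigma'(z_out)
delta_out = (y_hat - y) * y_hat * (1 - y_hat)
dW2 = np.outer(delta_out, h)                 # dW2[0, 0] corresponds to dJ/dw5

# Backward pass, hidden layer:
# propagate the output error through W2, then through the hidden sigmoid's derivative
delta_hidden = (W2.T @ delta_out) * h * (1 - h)
dW1 = np.outer(delta_hidden, x)

print(round(dW2[0, 0], 6))                   # ~0.082167 with these starting values

# One gradient-descent update (eta = 0.5 here, chosen arbitrarily)
W2 -= 0.5 * dW2
W1 -= 0.5 * dW1
```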
What kind of function is used as activation in the numerical example?
Sigmoid function.