Quiz #3 Flashcards
What three partial derivatives must we calculate for backpropagation in a convolutional layer?
- dL/dH_in = dL/dH_out * dH_out/dH_in (i.e. the partial derivative of the loss w.r.t. the input from the previous layer. This is what gets passed back to the previous layer.)
- dL/dK = dL/dH_out * dH_out/dK (i.e. the partial derivative of the loss w.r.t. the kernel values)
- dL/dH_out (i.e. the partial derivative of the loss w.r.t. the output of the current layer. Remember that this is given because it is the "upstream gradient")
When calculating dL/dK, a kernel pixel does not affect all the values in the output? (True/False)
False, it does impact all the values of the output map. This is because we stride the kernel across the image and share the same weights at every position of the output map.
In a convolutional layer, when calculating the partial derivative of the loss w.r.t. the kernel (dL/dK), we must incorporate ALL the upstream gradients and apply the chain rule over all the output pixels? (True/False)
True. This is because a single kernel pixel impacts the entire output since the kernel is strided across the image and weights are shared.
If a node in a computation graph impacts multiple values in the output, what operation must be applied in the backward pass to ensure that information from each of those individual connections is incorporated in the backprop update?
We SUM the gradients from each of the upstream connections.
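A toy example (hypothetical numbers, plain Python) sketching why fan-out requires summing the per-path gradients:

```python
# A node x feeds two downstream values: y1 = 2*x and y2 = 3*x,
# and the loss is L = y1 + y2.  (Hypothetical toy example.)
x = 4.0
y1, y2 = 2 * x, 3 * x
L = y1 + y2

# Backward pass: each connection contributes its own gradient,
# and the total dL/dx is the SUM over both paths.
dL_dy1, dL_dy2 = 1.0, 1.0          # upstream gradients
dL_dx = dL_dy1 * 2 + dL_dy2 * 3    # chain rule per path, then sum
print(dL_dx)                       # 5.0 = 2 + 3
```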
If we take the partial derivative of the output pixel located at (r, c) w.r.t. the kernel pixel located at (a', b'), what expression represents the value of dY(r,c)/dK(a',b') if a' = b' = 0?
dY(r,c)/dK(a', b') = x(r + a', c + b'), so if a' = b' = 0 then the derivative for this location is simply x(r, c)
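A rough numpy sketch (assuming stride 1, no padding, a single channel; dL_dY denotes the given upstream gradient) of how this expression turns into dL/dK:

```python
import numpy as np

def conv_kernel_grad(x, dL_dY, k):
    """dL/dK for a stride-1, no-padding, single-channel convolution (sketch).

    Implements dL/dK(a, b) = sum over (r, c) of dL/dY(r, c) * x(r + a, c + b).
    """
    H_out, W_out = dL_dY.shape
    dL_dK = np.zeros((k, k))
    for a in range(k):
        for b in range(k):
            # Every output pixel uses K(a, b), so we sum over all of them.
            dL_dK[a, b] = np.sum(dL_dY * x[a:a + H_out, b:b + W_out])
    return dL_dK
```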
When calculating the partial derivatives for backpropagation in a convolutional layer, it is unnecessary to calculate the partial derivative of the loss L with respect to the input x (i.e. dL/dx) because that derivative does not impact the kernel weight value updates? (True/False).
False. While it’s true that dL/dx isn’t needed for updating the kernel values, this derivative is important because it is the gradient that gets passed back to the previous layer.
What gradient needs to be calculated in order to pass back to the previous layer?
dL/dx, i.e. the partial derivative of the loss w.r.t. the input of the current layer.
For input pixel x(r’, c’), what impact does this pixel have on the output when calculating the gradient dL/dx?
It impacts only the output pixels in the neighborhood around it, i.e. every output position whose kernel window touches that input pixel.
When calculating the gradient of the loss w.r.t. the input x (dL/dx), every pixel in the output is impacted by a given input pixel? (True/False)
False. Since we stride the kernel across the input x, only the output pixels whose kernel windows touch that input pixel are affected. Over that neighborhood we have to sum the gradients, which is how the contribution of each touching kernel position is incorporated.
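A minimal numpy sketch (assuming stride 1, no padding, a single channel) that makes the sum over the touching kernel positions explicit:

```python
import numpy as np

def conv_input_grad(K, dL_dY, H, W):
    """dL/dx for a stride-1, no-padding, single-channel convolution (sketch).

    Output pixel Y(r, c) depends on x(r + a, c + b) through K(a, b), so the
    upstream gradient at (r, c) is scattered back through every kernel
    position, and overlapping contributions are summed.
    """
    k = K.shape[0]
    H_out, W_out = dL_dY.shape
    dL_dx = np.zeros((H, W))
    for r in range(H_out):
        for c in range(W_out):
            dL_dx[r:r + k, c:c + k] += dL_dY[r, c] * K
    return dL_dx
```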
When calculating the gradient for a max pooling layer, every input pixel into the max pool layer impacts the gradient? (True/False)
False. Max pooling performs dimensionality reduction by keeping only the maximum value within each pooling window. Since only the max pixel in a window contributes to the output, the gradient with respect to every other pixel in that window is zero.
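A minimal numpy sketch (assuming non-overlapping 2x2 windows with stride 2) of routing the gradient only to the max pixel in each window:

```python
import numpy as np

def maxpool2x2_backward(x, dL_dY):
    """Backward pass for 2x2 max pooling with stride 2 (assumed, non-overlapping).

    Only the pixel that won the max in each window receives the upstream
    gradient; every other pixel in the window gets zero gradient.
    """
    dL_dx = np.zeros_like(x)
    H_out, W_out = dL_dY.shape
    for r in range(H_out):
        for c in range(W_out):
            window = x[2 * r:2 * r + 2, 2 * c:2 * c + 2]
            a, b = np.unravel_index(np.argmax(window), window.shape)
            dL_dx[2 * r + a, 2 * c + b] = dL_dY[r, c]
    return dL_dx
```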
A single pixel deep in a multi-layered CNN is only sensitive to its receptive field in layer n − 1 (the layer directly before it)? (True/False)
False. A single pixel in a deeper layer is influenced by a larger receptive field in the previous layer, which in turn is influenced by a larger receptive field in the layer before that, and so on. This is what gives CNNs their representational power.
What was the first major 21st century CNN architecture and when was it introduced?
AlexNet in 2012
We tend to use fewer convolutional kernels (i.e. feature maps) as we go deeper into the network? (True/False)
False, generally speaking. The number of feature maps typically increases as we go deeper (while the spatial resolution shrinks).
What was the first modern CNN architecture to use ReLU instead of sigmoid or tanh?
AlexNet
What activation function is used in AlexNet?
ReLU (it was the first to do this)
What are the 5 key aspects of the AlexNet architecture (per the lectures)?
- ReLU instead of sigmoid or tanh
- Specialized normalization layers
- PCA-based data augmentation
- Dropout
- Ensembling (7 models were trained and their predictions averaged)
As we go deeper into a CNN, the receptive field increases? (True/False)
True
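A small sketch (hypothetical layer stack, not from the lectures) of one common way to track how the receptive field grows as 3x3 convs and 2x2 pools are stacked:

```python
# Track the receptive field of one output pixel as layers are stacked.
# r: receptive field size; j: spacing between adjacent units, in input pixels.
layers = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2)]  # (kernel, stride), hypothetical stack
r, j = 1, 1
for k, s in layers:
    r = r + (k - 1) * j   # each new layer widens the field by (k-1) input-space steps
    j = j * s             # striding spreads adjacent units apart in the input
    print(f"kernel={k} stride={s} -> receptive field {r}x{r}")
```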
Which layers use the most memory and why?
Convolutional layers. We have to store the activations from the forward pass because the gradient calculation requires them in the backward pass. Since the output of a convolutional layer is so large (we're striding across the entire image, remember), this leads to a large memory footprint.
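For a rough sense of scale, a back-of-the-envelope calculation (hypothetical, VGG-like early layer; the numbers are illustrative, not from the lectures):

```python
# Hypothetical early conv layer: 64 feature maps at 224x224 resolution, float32.
channels, height, width, bytes_per_float = 64, 224, 224, 4
activation_bytes = channels * height * width * bytes_per_float
print(activation_bytes / 2**20, "MiB per image")   # ~12.25 MiB, for one layer of one image
```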
Convolutional layers tend to have more parameters than FC layers? (True/False)
False. Convolutional layers have a higher memory footprint, but FC layers have many more parameters since every input neuron is connected to every output neuron.
What layers tend to have the most parameters and why?
Fully connected layers. This is because (as implied by the name) every input is connected to every output, so the number of weights is the product of the input and output sizes.
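A quick back-of-the-envelope comparison (hypothetical but typical layer sizes, not taken from the lectures):

```python
# Convolutional layer: parameters = out_channels * in_channels * k * k (+ biases)
conv_params = 128 * 64 * 3 * 3          # a typical 3x3 conv: ~74K parameters
# Fully connected layer: parameters = in_features * out_features (+ biases)
fc_params = 4096 * 4096                 # a VGG-style FC layer: ~16.8M parameters
print(conv_params, fc_params)
```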
For a fully connected layer with 12 input neurons, 10 output neurons and 3 channels, how many parameters are there (excluding bias terms)?
12 × 10 × 3 = 360
What are the two key aspects of the VGG architecture?
- Repeated application of a simple block (see the sketch after this list):
  - 3x3 conv (stride=1, padding=1)
  - 2x2 max pool (stride=2)
- Very large number of parameters (mostly from big FC layers)
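A minimal PyTorch sketch of one such block (the channel counts and number of convs per block are illustrative, not the exact VGG-16 configuration):

```python
import torch.nn as nn

def vgg_block(in_channels, out_channels, num_convs=2):
    """One VGG-style block: repeated 3x3 convs (stride 1, padding 1) + 2x2 max pool."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# Example: stack two blocks, 3 -> 64 -> 128 channels (illustrative).
features = nn.Sequential(vgg_block(3, 64), vgg_block(64, 128))
```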
What are some of the main architectural differences between VGG and AlexNet?
- AlexNet used large kernels with a large stride in its early layers, which loses information. VGG uses small 3x3 kernels with a much smaller stride (1 for conv layers, 2 for max pool) to preserve information.
Roughly how many trainable parameters are required for VGG architectures versus AlexNet?
Well over 100 million for VGG (≈138M for VGG-16) compared to 60-70M for AlexNet