Deep Learning Basics Flashcards
- What happens if you apply a 5x5 filter to a 7x7 image with no padding, stride = 1?
- What happens if you apply a 5x5 filter to a 7x7 image with no padding, stride = 2?
- What happens if you apply a 5x5 filter to a 7x7 image with no padding, stride = 3?
- We get a 3x3 output
- We get a 2x2 output
- Not possible, as the filter would go outside the image
What is stride?
Stride is how many pixels along we move the filter each time.
Stride = 1 means we move the filter 1 pixel at a time.
What is the formula for the output size of an image?
What is the size of the following:
1. 7x7 image, filter = 3x3, stride = 1
2. 7x7 image, filter = 3x3, stride = 2
3. 7x7 image, filter = 3x3, stride = 3
((N-F)/Stride) + 1
stride 1 => ((7 - 3) / 1) + 1 = 5
stride 2 => ((7 - 3) / 2) + 1 = 3
stride 3 => ((7 - 3) / 3) + 1 = 2.33 - not recommended because it’s not an integer value.
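The formula above can be checked with a small sketch in Python (the helper name `conv_output_size` is just for illustration):

```python
def conv_output_size(n, f, stride):
    """Output size of a convolution with no padding: ((N - F) / stride) + 1."""
    return (n - f) / stride + 1

# 7x7 image, 3x3 filter, as in the flashcard
print(conv_output_size(7, 3, 1))  # 5.0
print(conv_output_size(7, 3, 2))  # 3.0
print(conv_output_size(7, 3, 3))  # 2.33... -> non-integer, so stride 3 doesn't fit cleanly
```

A non-integer result means the filter cannot tile the image evenly at that stride.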
What is the motivation of padding?
To obtain an output size that is the same as the input image size.
What formula do we use to calculate the size we would pad with?
How much should we pad for filter size:
- 3
- 5
- 7
(F-1)/2
(3-1)/2 = 1 (padding)
(5-1)/2 = 2 (padding)
(7-1)/2 = 3 (padding)
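The padding formula can be sketched the same way (the name `same_padding` is an assumption, not a standard API):

```python
def same_padding(f):
    """Zero-padding that preserves the input size for stride 1: (F - 1) / 2."""
    return (f - 1) // 2

for f in (3, 5, 7):
    print(f, same_padding(f))  # 3 -> 1, 5 -> 2, 7 -> 3
```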
How do we deal with images that have a depth larger than 1, e.g. RGB images with depth 3?
We must use a filter with a depth that matches the input image depth. E.g. for an RGB image of depth 3, the filter must have depth 3.
- We calculate the dot product across all channels to merge the 3 channel outputs into 1: perform the convolution on each channel, then add the 3 values to get a single output value.
What happens if the image depth and filter depth aren’t the same value?
We can’t calculate their dot product.
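The per-channel dot product can be sketched in numpy for a single filter position (the variable names and random data are illustrative only):

```python
import numpy as np

# One filter position on a depth-3 input: convolve each channel, then sum
patch = np.random.rand(5, 5, 3)    # 5x5 region of an RGB image
filt = np.random.rand(5, 5, 3)     # filter depth matches the input depth

per_channel = [(patch[:, :, c] * filt[:, :, c]).sum() for c in range(3)]
single_value = sum(per_channel)    # the 3 channel values merge into 1

# Equivalent to a single dot product over the whole 5x5x3 volume
print(single_value, (patch * filt).sum())
```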
We may want to use multiple filters; how does this affect the output size?
The number of filters decides the depth of the output; each filter produces one activation map, so the output depth equals the number of activation maps.
Convolutional Neural Networks (CNN) size
In one convolution layer, we have 128 filters of 3x3x3 applied to input volume 128x128x3 with stride 2 and pad 1. What is the size of the output volume? Give details of how you calculate the size of the output volume.
((N - F + 2P) / stride) + 1
((128 - 3 + 2*1) / 2) + 1 = 64.5
round down/floor to 64
so output is 64x64x128
- number of filters is always the same as the number of output activation maps
What are the formulas for calculating the output images height, width and depth:
a volume of size W1 x H1 x D1
Four hyperparameters are required: Number of filters K, Filter size F, stride S, amount of zero padding P
When W2 and H2 are integers:
* Next layer: a volume of size W2 x H2 x D2
W2 = ((W1 - F + 2P) / S) + 1
H2 = ((H1 - F + 2P) / S) + 1
D2 = K
When W2 and H2 are not integers:
* Next layer: a volume of size W2 x H2 x D2
W2 = floor((W1 - F + 2P) / S) + 1
H2 = floor((H1 - F + 2P) / S) + 1
D2 = K
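The four-hyperparameter formulas above can be collected into one sketch (the helper name `conv_layer_output` is an assumption):

```python
import math

def conv_layer_output(w1, h1, d1, k, f, s, p):
    """Output volume W2 x H2 x D2 for K filters of size FxFxD1,
    stride S and zero-padding P. floor() handles non-integer cases."""
    w2 = math.floor((w1 - f + 2 * p) / s) + 1
    h2 = math.floor((h1 - f + 2 * p) / s) + 1
    d2 = k  # output depth equals the number of filters
    return w2, h2, d2

# The worked example earlier: 128x128x3 input, 128 filters of 3x3x3, stride 2, pad 1
print(conv_layer_output(128, 128, 3, k=128, f=3, s=2, p=1))  # (64, 64, 128)
```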
Explain the role of the pooling layer in CNN:
- performs down-sampling making image representations smaller and more manageable, by aggregating information, reducing computation costs and memory usage
- e.g. given an input image of 200x200, it can make it 100x100.
- it operates over each activation map independently. The width and height get smaller but the depth remains the same
List two methods of down-sampling/pooling layers:
Max pooling: given a subregion, take the max value. The next region is where the next stride is. max(2, 9, 4, 5) = 9
Average pooling: given a subregion, calculate the average of the pixels. Next region is where the next stride is. avg(2, 9, 4, 5) = 20/4 = 5
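Both pooling methods can be sketched with numpy on a small 4x4 input (the function `pool2d` is an illustrative helper, not a library API; the top-left window matches the flashcard values 2, 9, 4, 5):

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Slide a size x size window over a 2D array with the given stride,
    taking either the max or the average of each window."""
    h = (x.shape[0] - size) // stride + 1
    w = (x.shape[1] - size) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            window = x[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

x = np.array([[2, 9, 1, 4],
              [4, 5, 3, 2],
              [6, 1, 0, 7],
              [3, 2, 8, 5]])
print(pool2d(x, mode="max"))  # top-left window (2, 9, 4, 5) -> 9
print(pool2d(x, mode="avg"))  # top-left window (2, 9, 4, 5) -> 20/4 = 5
```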
What is a drawback of max and average pooling:
- They may remove important information or whole features from an image when down-sampling.
Calculate the output size of the pooling layer:
Give the output size for this question:
image = 128x128x3, Filter/pooling size = 2x2, stride = 2, no padding
image: W x H x D, filter F, stride S
- depth is the same
- W = floor((W - F)/S) + 1
- H = floor((H - F)/S) + 1
D = 3
W = ((128-2)/2)+1 = 64
H = ((128-2)/2)+1 = 64
output = 64x64x3
Explain the role of the fully connected layer in CNN:
- connects all neurons, so every input of the input vector influences every output of the output vector.
- maps learned high level features to the final output
Explain the process of the fully connected layer:
What is the output size of a fully connected layer with input image 32x32x3, 10 categories.
How many parameters would there be:
step 1) stretch channels into a 1D vector
step 2) connect each input node to each output node
step 3) y = Wx (ignoring bias)
- the 32x32x3 image becomes a 3072x1 vector
- the weights matrix would consist of 10x3072, perform matrix multiplication to get a 10x1 output
there would be 30720 parameters without bias. (30730 with bias)
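The three steps above can be sketched with numpy (random values stand in for real activations and learned weights):

```python
import numpy as np

# Fully connected step, assuming a 32x32x3 input and 10 classes
x = np.random.rand(32, 32, 3)
x_flat = x.reshape(-1)         # step 1: stretch channels into a 3072-vector
W = np.random.rand(10, 3072)   # step 2: one weight row per output class
y = W @ x_flat                 # step 3: y = Wx (ignoring bias), shape (10,)

print(x_flat.shape, W.size, y.shape)  # (3072,) 30720 (10,)
```

`W.size` gives the 30720 weight parameters; a bias vector would add 10 more, for 30730.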
What is the role of the activation function in CNN:
- it introduces non-linearity to the network, enabling it to learn more complex relationships/patterns in the data.
- They can mitigate vanishing gradients
What is the role of the BatchNorm layer in CNN?
- It normalises the output of each layer in the network
- Ensures outputs of a layer have mean = 0 and standard deviation = 1 which improves training stability
- Improves convergence speed. Can use random initialisation and a big learning rate
When is the BatchNorm layer applied:
After the convolution layer but before the activation function
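A minimal sketch of the normalisation itself, assuming plain per-feature statistics over the batch dimension (training-mode only; no moving averages, and γ, β are plain floats rather than learned per-channel parameters):

```python
import numpy as np

def batchnorm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise a batch of activations to mean 0 / std 1 per feature,
    then apply the learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.random.randn(64, 8) * 5 + 3   # activations with mean ~3, std ~5
out = batchnorm(batch)
print(out.mean(axis=0).round(6))         # ~0 for every feature
print(out.std(axis=0).round(3))          # ~1 for every feature
```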
What happens if we don’t perform batchNorm:
The activation distributions of each layer become uncontrolled: they are unknown and vary between iterations, which makes training harder.
Why do we forward batches through model instead of single images:
Statistics (mean and standard deviation) computed over a batch are more stable estimates than those from a single image, which creates a more stable network.
Why does batchNorm have two learnable parameters?
- γ (scale) and β (shift) for each channel
- The parameters γ and β allow the network to learn the optimal scale and shift for each feature map.
Why is the training process more stable using batchNorm?
Because a moving average of the mean and standard deviation is used to update the running statistics, rather than relying on a single batch.
What are the differences between a standard network and a network that uses batchNorm?
- standard has to initialise parameters beforehand and requires a carefully designed initialisation strategy
- BatchNorm can use random initialisation
- standard sometimes can’t use a big learning rate, converges slower
- batchNorm uses a big learning rate so converges quicker
- networks with batchNorm can be trained with many more layers than standard networks