Lecture 4 Flashcards
1
Q
Convolutional neural networks
A
- Apply filter F to input channel A
- If no padding is used, the output channel is smaller than the input channel
o But we can apply zero-padding around the input channel to get an output channel of the same size
o If F is an n x n filter, we need to pad (n-1)/2 pixels per side to get a "same" convolution
- Stride: step size when applying the convolution
o Influences the size of the output while keeping the filter size the same (see the output-size sketch after this card)
- Applying multiple filters gives multiple output channels
- Can also add the outputs of multiple filters into one output channel
o E.g. RGB images have multiple input channels
- The result of the dot product between the filter and a small chunk of A is one number
o Note that a bias is always added
- We get an activation map (also called a feature map) by sliding the filter over all spatial locations of A
o Receptive field: size of the filter
- A convolutional neural network is a stack of convolutional layers
- Benefits over using a fully-connected layer:
o Fully connected layer:
▪ Many weights
▪ Increases model capacity (risk of over-fitting)
▪ Needs a lot of data and memory
▪ No robustness to distortions or shifts of the input
▪ Does not take the topology of the data into account
o Convolutional layer:
▪ Filter is applied to the entire image, so far fewer weights (weights are shared over different input regions)
▪ Robust to shifts of the input
- Pooling: encodes a degree of invariance with respect to translations and reduces the size of the layer
o Smaller feature maps are good for the memory footprint
o Max-pooling: take the maximum of an N x M region; the standard pooling approach
▪ Keeps the highest activations
o Min-pooling: take the minimum of an N x M region
- The final layer before the output layer is always a fully connected layer
o The final layer is usually a soft-max layer, which outputs probabilities that sum to 1
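A minimal sketch (the function and variable names are mine, not from the lecture) of how filter size, padding and stride determine the output size of a convolution, matching the (n-1)/2 padding rule above:

    # Output size of a convolution for a given input size, filter size, padding and stride.
    def conv_output_size(input_size, filter_size, padding=0, stride=1):
        return (input_size + 2 * padding - filter_size) // stride + 1

    print(conv_output_size(32, 3))                       # 30: no padding shrinks the output
    print(conv_output_size(32, 3, padding=1))            # 32: "same" convolution, padding = (3-1)/2 = 1
    print(conv_output_size(32, 3, padding=1, stride=2))  # 16: stride 2 roughly halves the output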
2
Q
Parameters of a ConvNet
A
- Output shape depends on the type of convolutions we do
o Except for the final layer: its output shape is a vector whose length is the number of classes
- Convolutional layer parameters: filter size * #filters + #filters (one bias per filter; see the sketch after this card)
- Pooling: no parameters
- Fully connected layer parameters: previous layer shape * #filters + #filters
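A minimal sketch (the layer sizes are my own example, not from the lecture) of these parameter counts, checked against PyTorch:

    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3),  # filter size * #filters + #filters = (3*3*3)*16 + 16 = 448
        nn.ReLU(),
        nn.MaxPool2d(2),                  # pooling: no parameters
        nn.Flatten(),
        nn.Linear(16 * 15 * 15, 10),      # previous layer shape * #units + #units = 3600*10 + 10 = 36010
    )

    print(sum(p.numel() for p in model.parameters()))  # 448 + 36010 = 36458 (for 32x32 RGB inputs)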
3
Q
ReLU Initialisation
A
- We need an activation function whose derivative does not saturate towards 0 -> otherwise backpropagation of the gradient is killed (vanishing gradient problem)
- ReLU accelerates convergence of gradient descent compared to sigmoid/tanh functions
- Very inexpensive
- No vanishing gradient
- LeakyReLU: introduces a small negative slope for negative function inputs
o Helps to maintain better information flow during training
o Solves the problem where many neurons only output 0 when using ReLU ("dying ReLU")
- Often combined with He initialisation for the weights (see the sketch after this card)
o Sample from a zero-mean Gaussian distribution with standard deviation sqrt(2/N_inputs)
o The distribution of initial activations should fall in the range where the activation function has the largest gradient
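A minimal sketch (the layer sizes are my own assumption) of He initialisation for a layer feeding into a (Leaky)ReLU:

    import numpy as np

    n_inputs, n_units = 256, 128
    std = np.sqrt(2.0 / n_inputs)                  # He initialisation: std = sqrt(2 / N_inputs)
    W = np.random.randn(n_units, n_inputs) * std   # zero-mean Gaussian weights
    b = np.zeros(n_units)                          # biases usually start at zero

    def leaky_relu(x, slope=0.01):
        # small negative slope keeps a non-zero gradient for negative inputs
        return np.where(x > 0, x, slope * x)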
4
Q
Regularisation
A
- Limit the freedom of the parameters of the model -> reduces overfitting
o By regulating the weights
- Modify the loss function by adding a term that discourages large weights (see the sketch after this card)
- Typically use L1 or L2 regularisation
o Keeps the weights small (L1 can drive some weights to exactly zero, effectively shrinking the network)
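A minimal sketch (my own helper functions, not from the lecture) of adding an L1 or L2 penalty term to the data loss:

    import numpy as np

    def l2_regularised_loss(data_loss, weights, lam=1e-4):
        # L2: add the sum of squared weights, scaled by lam, to discourage large weights
        return data_loss + lam * sum(np.sum(W ** 2) for W in weights)

    def l1_regularised_loss(data_loss, weights, lam=1e-4):
        # L1: add the sum of absolute weights; tends to push weights towards zero
        return data_loss + lam * sum(np.sum(np.abs(W)) for W in weights)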
5
Q
Dropout
A
- For some probability P (usually 0.5), remove a neuron
o We remove each neuron with probability P
- Done per training sample (see the sketch after this card)
o Left with a smaller network
▪ Smaller network -> less chance of overfitting
o Each sample is processed by a different network
- As a result, each neuron becomes unable to rely on any one feature
o Spreads out the weights, shrinking their norm similar to L2 regularisation
- Typically done in fully-connected layers (but also possible for convolutional layers)
o Often (slightly) improves performance
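A minimal sketch of dropping neurons at training time (this is "inverted" dropout, a common implementation choice not spelled out on the card; the helper name is mine):

    import numpy as np

    def dropout(activations, p=0.5, training=True):
        # At test time the layer is left untouched.
        if not training:
            return activations
        # Each activation is kept with probability 1 - p.
        keep = (np.random.rand(*activations.shape) > p).astype(activations.dtype)
        # Rescale by 1/(1-p) so the expected activation matches test time.
        return activations * keep / (1.0 - p)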
6
Q
Data Augmentation
A
- When the dataset is too small, make modifications to the images
o E.g. mirror, blur or rotate the image (see the sketch after this card)
- Must not modify the main characteristic that we are interested in
- Rigid transformations preserve distances and angles:
o Rotation
o Translation
- Affine transformations may modify distances and angles:
o Scaling
o Shear
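A minimal sketch of such an augmentation pipeline using torchvision.transforms (the lecture does not prescribe a library; the specific parameter values are my own):

    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),   # mirror
        transforms.GaussianBlur(kernel_size=3),   # blur
        transforms.RandomRotation(degrees=15),    # rigid: rotation
        transforms.RandomAffine(degrees=0,        # affine: scaling and shear
                                scale=(0.9, 1.1),
                                shear=10),
        transforms.ToTensor(),
    ])
    # augmented = augment(pil_image)  # applied per sample; the label stays unchanged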
7
Q
Batch Normalisation
A
- Normalise the activations from the previous layer (see the sketch after this card)
o Makes learning more efficient & faster
- Makes the weights of deep layers more robust to changes in previous layers
o Makes sure that the mean and variance of their inputs don't change
- Regularisation effect
o Mean and variance are computed per mini-batch
o Mini-batches are random -> adds randomness to the normalised values
o Similar to drop-out if we use small mini-batches
▪ But larger than n=1
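A minimal sketch (training-time forward pass only; the helper name is mine) of normalising a mini-batch and rescaling with learnable parameters:

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        # x: (batch_size, n_features); mean and variance are computed per mini-batch.
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance per feature
        return gamma * x_hat + beta              # learnable scale and shift
        # At test time, running averages of mean and variance are used instead.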
8
Q
Update Rules
A
- Want to be able to change the rate at which parameters are updated
o E.g. take larger steps if we are further away -> faster convergence
- Add momentum: build up velocity over time as long as we keep improving in the same direction
- Personalise the update per parameter: RMSProp
- ADAM: combines momentum and personalised per-parameter updates (see the sketch after this card)
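A minimal sketch (NumPy, my own variable names) of one momentum step and one ADAM step, showing how velocity and per-parameter scaling modify the plain gradient update:

    import numpy as np

    def sgd_momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
        # Velocity accumulates past gradients: it keeps speeding up while they point the same way.
        velocity = beta * velocity - lr * grad
        return w + velocity, velocity

    def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * grad       # momentum (first moment)
        v = beta2 * v + (1 - beta2) * grad ** 2  # per-parameter scaling (second moment, as in RMSProp)
        m_hat = m / (1 - beta1 ** t)             # bias correction for the first steps (t starts at 1)
        v_hat = v / (1 - beta2 ** t)
        return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v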