Lecture 4 Flashcards
1
Q
Convolutional neural networks
A
- Apply filter F to input channel A
- If no padding is used, the output channel is smaller than the input channel
o But we can apply zero-padding around the input channel to get an output channel of the same size
o If F is an n x n filter, we need to pad (n-1)/2 pixels per side to get a "same" convolution
- Stride: step size when applying the convolution
o Influences the size of the output while keeping the filter size the same (see the output-size sketch after this card)
- Applying multiple filters gives multiple output channels
- Can also add the outputs of multiple filters into one output channel
o E.g. RGB images have multiple input channels
- The result of the dot product between the filter and a small chunk of A is one number
o Note that a bias is always added
- We get an activation map (also called a feature map) by sliding the filter over all spatial locations of A
o Receptive field: size of the filter
- A convolutional neural network is a stack of convolutional layers
- Benefits over using a fully-connected layer:
o Fully connected layer:
▪ Many weights
▪ Increases model capacity (risk of over-fitting)
▪ Needs a lot of data and memory
▪ No robustness to distortions or shifts of the input
▪ Does not take the topology of the data into account
o Convolutional layer:
▪ Filter is applied to the entire image, so far fewer weights (weights are shared over different input regions)
▪ Robust to shifts of the input
- Pooling: encodes a degree of invariance with respect to translations and reduces the size of the layer
o Smaller feature maps are good for the memory footprint
o Max-pooling: take the maximum of an N x M region; the standard pooling approach
▪ Keeps the highest activations
o Min-pooling: take the minimum of an N x M region
- The final layer before the output layer is always a fully connected layer
o The final layer is usually a soft-max layer, which outputs probabilities that sum to 1
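A minimal sketch (the function and variable names are mine, not from the lecture) of how filter size, padding and stride determine the output size of a convolution, matching the (n-1)/2 padding rule above:

    # Output size of a convolution for a given input size, filter size, padding and stride.
    def conv_output_size(input_size, filter_size, padding=0, stride=1):
        return (input_size + 2 * padding - filter_size) // stride + 1

    print(conv_output_size(32, 3))                       # 30: no padding shrinks the output
    print(conv_output_size(32, 3, padding=1))            # 32: "same" convolution, padding = (3-1)/2 = 1
    print(conv_output_size(32, 3, padding=1, stride=2))  # 16: stride 2 roughly halves the output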
2
Q
Parameters of a ConvNet
A
- Output shape depends on the type of convolutions we do
o Except for the final layer: its output shape is a vector whose length is the number of classes
- Convolutional layer parameters: filter size * #filters + #filters (one bias per filter; see the sketch after this card)
- Pooling: no parameters
- Fully connected layer parameters: previous layer shape * #filters + #filters
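A minimal sketch (the layer sizes are my own example, not from the lecture) of these parameter counts, checked against PyTorch:

    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3),  # filter size * #filters + #filters = (3*3*3)*16 + 16 = 448
        nn.ReLU(),
        nn.MaxPool2d(2),                  # pooling: no parameters
        nn.Flatten(),
        nn.Linear(16 * 15 * 15, 10),      # previous layer shape * #units + #units = 3600*10 + 10 = 36010
    )

    print(sum(p.numel() for p in model.parameters()))  # 448 + 36010 = 36458 (for 32x32 RGB inputs)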
3
Q
ReLU Initialisation
A
- We need an activation function whose derivative does not saturate towards 0 -> otherwise backpropagation of the gradient is killed (vanishing gradient problem)
- ReLU accelerates convergence of gradient descent compared to sigmoid/tanh functions
- Very inexpensive
- No vanishing gradient
- LeakyReLU: introduces a small negative slope for negative function inputs
o Helps to maintain better information flow during training
o Solves the problem where many neurons only output 0 when using ReLU ("dying ReLU")
- Often combined with He initialisation for the weights (see the sketch after this card)
o Sample from a zero-mean Gaussian distribution with standard deviation sqrt(2/N_inputs)
o The distribution of initial activations should fall in the range where the activation function has the largest gradient
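A minimal sketch (the layer sizes are my own assumption) of He initialisation for a layer feeding into a (Leaky)ReLU:

    import numpy as np

    n_inputs, n_units = 256, 128
    std = np.sqrt(2.0 / n_inputs)                  # He initialisation: std = sqrt(2 / N_inputs)
    W = np.random.randn(n_units, n_inputs) * std   # zero-mean Gaussian weights
    b = np.zeros(n_units)                          # biases usually start at zero

    def leaky_relu(x, slope=0.01):
        # small negative slope keeps a non-zero gradient for negative inputs
        return np.where(x > 0, x, slope * x)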
4
Q
Regularisation
A
- Limit the freedom of the parameters of the model -> reduces overfitting
o By regulating the weights
- Modify the loss function by adding a term that discourages large weights (see the sketch after this card)
- Typically use L1 or L2 regularisation
o Keeps the weights small (L1 can drive some weights to exactly zero, effectively shrinking the network)
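A minimal sketch (my own helper functions, not from the lecture) of adding an L1 or L2 penalty term to the data loss:

    import numpy as np

    def l2_regularised_loss(data_loss, weights, lam=1e-4):
        # L2: add the sum of squared weights, scaled by lam, to discourage large weights
        return data_loss + lam * sum(np.sum(W ** 2) for W in weights)

    def l1_regularised_loss(data_loss, weights, lam=1e-4):
        # L1: add the sum of absolute weights; tends to push weights towards zero
        return data_loss + lam * sum(np.sum(np.abs(W)) for W in weights)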
5
Q
Dropout
A
- For some probability P (usually 0.5), remove a neuron
o We remove each neuron with probability P
- Done per training sample (see the sketch after this card)
o Left with a smaller network
▪ Smaller network -> less chance of overfitting
o Each sample is processed by a different network
- As a result, each neuron becomes unable to rely on any one feature
o Spreads out the weights, shrinking their norm similar to L2 regularisation
- Typically done in fully-connected layers (but also possible for convolutional layers)
o Often (slightly) improves performance
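A minimal sketch of dropping neurons at training time (this is "inverted" dropout, a common implementation choice not spelled out on the card; the helper name is mine):

    import numpy as np

    def dropout(activations, p=0.5, training=True):
        # At test time the layer is left untouched.
        if not training:
            return activations
        # Each activation is kept with probability 1 - p.
        keep = (np.random.rand(*activations.shape) > p).astype(activations.dtype)
        # Rescale by 1/(1-p) so the expected activation matches test time.
        return activations * keep / (1.0 - p)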
6
Q
Data Augmentation
A
- When the dataset is too small, make modifications to the images
o E.g. mirror, blur or rotate the image (see the sketch after this card)
- Must not modify the main characteristic that we are interested in
- Rigid transformations preserve distances and angles:
o Rotation
o Translation
- Affine transformations may modify distances and angles:
o Scaling
o Shear
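A minimal sketch of such an augmentation pipeline using torchvision.transforms (the lecture does not prescribe a library; the specific parameter values are my own):

    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),   # mirror
        transforms.GaussianBlur(kernel_size=3),   # blur
        transforms.RandomRotation(degrees=15),    # rigid: rotation
        transforms.RandomAffine(degrees=0,        # affine: scaling and shear
                                scale=(0.9, 1.1),
                                shear=10),
        transforms.ToTensor(),
    ])
    # augmented = augment(pil_image)  # applied per sample; the label stays unchanged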
7
Q
Batch Normalisation
A
- Normalise the activations from the previous layer (see the sketch after this card)
o Makes learning more efficient & faster
- Makes the weights of deep layers more robust to changes in previous layers
o Makes sure that the mean and variance of their inputs don't change
- Regularisation effect
o Mean and variance are computed per mini-batch
o Mini-batches are random -> adds randomness to the normalised values
o Similar to drop-out if we use small mini-batches
▪ But larger than n=1
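A minimal sketch (training-time forward pass only; the helper name is mine) of normalising a mini-batch and rescaling with learnable parameters:

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        # x: (batch_size, n_features); mean and variance are computed per mini-batch.
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance per feature
        return gamma * x_hat + beta              # learnable scale and shift
        # At test time, running averages of mean and variance are used instead.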
8
Q
Update Rules
A
- Want to be able to change the rate at which parameters are updated
o E.g. take larger steps if we are further away -> faster convergence
- Add momentum: build up velocity over time as long as we keep improving in the same direction
- Personalise the update per parameter: RMSProp
- ADAM: combines momentum and personalised per-parameter updates (see the sketch after this card)
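A minimal sketch (NumPy, my own variable names) of one momentum step and one ADAM step, showing how velocity and per-parameter scaling modify the plain gradient update:

    import numpy as np

    def sgd_momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
        # Velocity accumulates past gradients: it keeps speeding up while they point the same way.
        velocity = beta * velocity - lr * grad
        return w + velocity, velocity

    def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * grad       # momentum (first moment)
        v = beta2 * v + (1 - beta2) * grad ** 2  # per-parameter scaling (second moment, as in RMSProp)
        m_hat = m / (1 - beta1 ** t)             # bias correction for the first steps (t starts at 1)
        v_hat = v / (1 - beta2 ** t)
        return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v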