Lecture 4 Flashcards

1
Q

Convolutional neural networks

A
  • Apply filter F to input channel A
  • If no padding is used, the output channel is smaller than the input channel
    o But we can zero-pad around the input channel to get an output channel of the same size
    o If F is an n by n filter, we need to pad (n-1)/2 pixels per side to get a "same" convolution (see the output-size sketch at the end of this card)
  • Stride: step size when applying convolutions
    o Influences the size of the output while the filter size stays the same
  • Applying multiple filters gives multiple output channels
  • The outputs of filters applied to several input channels can also be summed into one output channel
    o E.g. RGB images have multiple input channels
  • The result of the dot product between the filter and a small chunk of A is one number
    o Note that a bias is always added
  • We get an activation map (also called a feature map) by sliding the filter over all spatial locations of A
    o Receptive field: size of the filter
  • Convolutional neural network is a stack of convolutional layers
  • Benefits over using a fully-connected layer:
    o Fully-connected layer:
    ▪ Many weights
    ▪ High capacity (prone to over-fitting)
    ▪ Needs a lot of data and memory
    ▪ No robustness to distortions or shifts of the input
    ▪ Does not take the topology of the data into account
    o Convolutional layer:
    ▪ The filter is applied to the entire image, so there are far fewer weights
    ▪ Weights are shared over different input regions
    ▪ Robust to shifts of the input
  • Pooling: encodes a degree of invariance with respect to translations and reduces the size of the layer
    o Smaller feature maps are good for the memory footprint
    o Max-pooling: take the maximum of an NxM region; the standard pooling approach
    ▪ Keeps the highest activations
    o Min-pooling: take the minimum of an NxM region
  • The final layer before the output layer is always a fully connected layer
    o The output layer itself is usually a soft-max layer, i.e. it outputs probabilities that sum to 1
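A minimal sketch of how the spatial output size follows from the filter size, padding, and stride mentioned above; the function name conv_output_size, the floor convention, and the example numbers are illustrative assumptions, not from the lecture.

```python
# Spatial output size of a convolution for a square input and square filter
# (assumes the usual floor convention; names and numbers are illustrative).
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """output = floor((input + 2*padding - filter) / stride) + 1"""
    return (input_size + 2 * padding - filter_size) // stride + 1

print(conv_output_size(32, 3, padding=0, stride=1))  # 30: shrinks without padding
print(conv_output_size(32, 3, padding=1, stride=1))  # 32: "same" convolution, pad = (3-1)/2
print(conv_output_size(32, 3, padding=1, stride=2))  # 16: larger stride shrinks the output
```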
2
Q

Parameters of a ConvNet

A
  • The output shape depends on the type of convolutions we do
    o Except for the final layer: its output shape is a vector whose length equals the number of classes
  • Convolutional layer parameters: filter size * #filters + #filters (one bias per filter; see the counting sketch at the end of this card)
  • Pooling: no parameters
  • Fully connected layer parameters: previous layer size * #neurons + #neurons (one bias per neuron)
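A minimal sketch of these parameter counts; the helper names and the example layer sizes are assumptions for illustration.

```python
# Parameter counts per layer type (illustrative helper names).
def conv_params(filter_h, filter_w, in_channels, n_filters):
    # filter size * #filters + #filters (one bias per filter)
    return filter_h * filter_w * in_channels * n_filters + n_filters

def fc_params(n_inputs, n_neurons):
    # previous layer size * #neurons + #neurons (one bias per neuron)
    return n_inputs * n_neurons + n_neurons

print(conv_params(3, 3, 3, 64))  # 1792: 64 filters of size 3x3x3 plus 64 biases
print(fc_params(4096, 10))       # 40970: e.g. a final layer with 10 classes
```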
3
Q

ReLU Initialisation

A
  • We need an activation function whose derivative does not go to 0 -> otherwise the backpropagated gradient is killed (vanishing gradient problem)
  • ReLU accelerates the convergence of gradient descent compared to sigmoid/tanh functions
  • Very inexpensive to compute
  • No vanishing gradient
  • LeakyReLU: introduces a small negative slope for negative function inputs
    o Helps to maintain better information flow during training
    o Solves the problem where many neurons only ever output 0 when using ReLU ("dying ReLUs")
  • Often combined with He initialisation for the weights (see the sketch at the end of this card)
    o Samples from a zero-mean Gaussian distribution with standard deviation sqrt(2/N_inputs)
    o The distribution of the initial activations should fall in the range where the function has the largest gradient
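A minimal sketch of ReLU, LeakyReLU, and He initialisation as described above; the function names, the negative slope of 0.01, and the layer sizes are illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    # Small negative slope keeps a nonzero gradient for negative inputs
    return np.where(x > 0, x, slope * x)

def he_init(n_inputs, n_outputs, rng=None):
    # Zero-mean Gaussian with standard deviation sqrt(2 / N_inputs)
    rng = rng if rng is not None else np.random.default_rng(0)
    return rng.normal(0.0, np.sqrt(2.0 / n_inputs), size=(n_inputs, n_outputs))

W = he_init(512, 256)
print(W.std())  # roughly sqrt(2/512) = 0.0625
```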
4
Q

Regularisation

A
  • Limit the freedom of the model's parameters -> reduces overfitting
    o By constraining the weights
  • Modify the loss function by adding a term that discourages large weights
  • Typically use L1 or L2 regularisation (see the sketch at the end of this card)
    o Forces the weights, and hence the effective network, to stay small
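A minimal sketch of adding an L1 or L2 weight penalty to the data loss; the function name, the regularisation strength lam, and the example weights are assumptions for illustration.

```python
import numpy as np

def regularised_loss(data_loss, weights, lam=1e-4, kind="l2"):
    # Add a term to the ordinary data loss that discourages large weights
    if kind == "l2":
        penalty = sum(np.sum(w ** 2) for w in weights)
    else:  # "l1"
        penalty = sum(np.sum(np.abs(w)) for w in weights)
    return data_loss + lam * penalty

rng = np.random.default_rng(0)
weights = [rng.normal(size=(100, 50)), rng.normal(size=(50, 10))]
print(regularised_loss(0.7, weights))  # data loss plus a small weight penalty
```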
5
Q

Dropout

A
  • For some probability P (usually 0.5), remove neurons (see the sketch at the end of this card)
    o Each neuron is removed with probability P
  • Done for each training sample
    o We are left with a smaller network
    ▪ Smaller network -> less chance of overfitting
    o Each sample is processed by a different network
  • As a result, each neuron becomes unable to rely on any one feature
    o Weights get spread out, shrinking their norm, similar to the effect of L2 regularisation
  • Typically done in fully-connected layers (but also possible for convolutional layers)
    o Often (slightly) improves performance
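A minimal sketch of (inverted) dropout applied to one layer's activations; the rescaling by 1/(1-P) at training time and the function name are common practice but assumptions here, not stated on the card.

```python
import numpy as np

def dropout(activations, p_drop=0.5, rng=None, training=True):
    # At test time the full network is used, so nothing is dropped
    if not training:
        return activations
    rng = rng if rng is not None else np.random.default_rng(0)
    keep = 1.0 - p_drop
    mask = rng.random(activations.shape) < keep  # each neuron kept with probability 1 - P
    return activations * mask / keep             # rescale so the expected activation is unchanged

a = np.ones((4, 8))
print(dropout(a, p_drop=0.5))  # roughly half the entries zeroed, the rest scaled to 2.0
```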
6
Q

Data Augmentation

A
  • When the dataset is too small, make modifications to the images
    o E.g. mirror, blur, or rotate the image (see the sketch after this list)
  • Must not modify the main characteristic that we are interested in
  • Rigid transformations preserve distances and angles:
    o Rotation
    o Translation
  • Affine transformations may modify distances and angles:
    o Scaling
    o Shear
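A minimal sketch of a few such augmentations on an image array of shape H x W x C; the helper names are illustrative, and only rigid flips/rotations are shown.

```python
import numpy as np

def mirror(img):
    # Horizontal flip: a rigid transformation that preserves distances and angles
    return img[:, ::-1]

def rotate90(img, k=1):
    # Rotate by k * 90 degrees in the image plane
    return np.rot90(img, k, axes=(0, 1))

img = np.random.default_rng(0).integers(0, 256, size=(32, 32, 3))
augmented = [img, mirror(img), rotate90(img), rotate90(img, 2)]
print([a.shape for a in augmented])
```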
7
Q

Batch Normalisation

A
  • Normalise the activations from the previous layer (see the sketch at the end of this card)
    o Makes learning more efficient & faster
  • Makes the weights of deep layers more robust to changes in previous layers
    o Makes sure that the mean and variance don't change
  • Regularisation effect
    o Mean and variance are computed per mini-batch
    o Mini-batches are random -> adds randomness to the normalised values
    o Similar to drop-out if we use small mini-batches
    ▪ But larger than n=1
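A minimal sketch of batch normalisation over one mini-batch in training mode; gamma, beta, and eps are the usual learnable scale/shift and a small constant, and the example batch is illustrative.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mean = x.mean(axis=0)  # mean per feature, computed over the mini-batch
    var = x.var(axis=0)    # variance per feature, computed over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(5.0, 3.0, size=(64, 10))  # mini-batch of 64 samples
y = batch_norm(x)
print(y.mean(), y.std())  # approximately 0 and 1
```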
8
Q

Update Rules

A
  • Want to be able to change the rate at which parameters are updated
    o E.g. take larger steps if we are further away from the optimum -> faster convergence
  • Add momentum: build up velocity over time while we keep improving
  • Personalise the update per parameter: RMSProp
  • ADAM: combines momentum and per-parameter updates (sketched below)
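A minimal sketch of a single Adam update step, combining a momentum term with a per-parameter step size; the hyperparameter values are the commonly used defaults and the helper name is illustrative.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # momentum: running average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # RMSProp-style running average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the first steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # personalised (per-parameter) step size
    return w, m, v

# One step on a toy parameter vector (t starts at 1)
w = np.zeros(3); m = np.zeros(3); v = np.zeros(3)
w, m, v = adam_step(w, np.array([0.1, -2.0, 0.5]), m, v, t=1)
print(w)
```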