Lecture 6 Flashcards

Question 1

Q

GoogLeNet

Answer

A

new layer architecture
Outputs are concatenated
Problems
o Concatenation makes the matrix very big
o Max-pooling does not change the number of feature maps, only their spatial size
Thus, we should reduce the number of feature maps at some locations in the architecture
Solution
o Use additional 1x1 convolutions -> does not change the spatial size of the feature maps
▪ Applying one 1x1xM convolutional filter to an NxNxM feature map gives an NxN feature map
▪ Applying K 1x1xM convolutional filters to an NxNxM feature map gives an NxNxK feature map
o Can tune the number of output feature maps K -> dimension reduction when K < M
o Combines all activations across feature maps and makes a smaller set
Also called inception network -> often repeated multiple times in a bigger network
Uses two additional classifiers during training
o To counteract the problem that a very deep network may not propagate gradients back through all layers effectively
o Encourages discrimination in lower stages
o Increases gradient signal that gets propagated back
Fully connected layers are replaced by average pooling
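The 1x1 convolution step can be illustrated with a minimal pure-Python sketch. The function name `conv1x1` and the toy values are assumptions for illustration only; a real network would use a deep learning library and learned weights.

```python
def conv1x1(feature_map, filters):
    """Apply K 1x1xM filters to an N x N x M feature map.

    feature_map: nested list of shape [N][N][M]
    filters:     nested list of shape [K][M], one weight vector per filter
    returns:     nested list of shape [N][N][K]
    """
    return [
        [
            # Each output value is a weighted sum over the M input maps
            # at a single spatial position.
            [sum(w * v for w, v in zip(f, pixel)) for f in filters]
            for pixel in row
        ]
        for row in feature_map
    ]


# Hypothetical example: N = 2, M = 3 input maps reduced to K = 2 output maps.
fmap = [[[1, 2, 3], [4, 5, 6]],
        [[7, 8, 9], [1, 0, 1]]]
filters = [[1, 0, 0],   # copies the first input map
           [1, 1, 1]]   # sums all input maps
out = conv1x1(fmap, filters)
```

The spatial size (2x2) is unchanged, while the depth drops from M = 3 to K = 2.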

Question 2

Q

ResNet

Answer

A

Deeper networks did not necessarily improve training error
o Gradients no longer flow back to the input signal
▪ Use skip connections
o As we go deeper, representations become difficult to learn
▪ Learn the residual
A skip connection bypasses one activation function; the skipped input is added back just before the next one
In DenseNet, skip connections go to all future activation
functions
o Strong gradient flow
o Computationally efficient -> each layer adds only a constant number K of channels
o Low complexity features -> final classifier sees
features from all layers
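The idea of learning the residual can be sketched in plain Python. This is a toy sketch assuming fully connected layers and the form y = relu(F(x) + x); the helper names are hypothetical and real ResNet blocks use convolutions and batch normalisation.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear(v, weights):
    """Multiply vector v by a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """y = relu(F(x) + x), where F(x) = W2 * relu(W1 * x)."""
    residual = linear(relu(linear(x, w1)), w2)
    # Skip connection: the input is added back before the final activation.
    return relu([r + xi for r, xi in zip(residual, x)])


x = [1.0, -2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
y = residual_block(x, zeros, zeros)
```

With all weights at zero the residual F(x) is zero, so the block simply passes relu(x) through. Falling back to the identity is therefore easy for the block, which is part of why very deep stacks remain trainable.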

Question 3

Q

Using pre-trained Networks

Answer

A

Common networks (VGG16, Inception, ResNet) are trained on ImageNet and publicly available
Helps when we have limited data
Very small dataset: feature extractor
o Only train final layer, freeze all other layers
Medium dataset: fine-tuning
o Train last few layers, freeze all other layers
Much faster, because low-level features are often similar between tasks

Question 4

Q

Detection Tasks

Answer

A

Detection Tasks
* Find certain structures in an image
o E.g. can we find any nodules within this scan?
* Possible approaches:
o Patch classification (is the patch centred on the structure?)
o Segmentation (U-Net, Dilated Networks)
o Predict location

Question 5

Q

Sampling Strategies

Answer

A

Often most of the image area is easy to recognise
Few target objects
Difficult negatives are difficult to identify upfront
Thus, sampling on a uniform grid or at random is unlikely to find target objects or difficult negatives
Can train a first CNN (CNN-1) using random sampling to identify possible matches
Then, use a second CNN (CNN-2) that focuses on the (hard) negative samples (hard-negative mining)
o Negative samples to which CNN-1 assigns a high likelihood are sampled more often for CNN-2
o Can use a deep network to identify difficult negatives (often healthy cells that look a lot like unhealthy cells)
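The weighted sampling step can be sketched as follows. This is a minimal sketch that assumes CNN-1 has already produced a likelihood score for every negative patch; the patch ids, scores and function name are hypothetical.

```python
import random


def sample_hard_negatives(negatives, scores, k, seed=0):
    """Draw k negatives, weighting each by the likelihood CNN-1 gave it.

    negatives: list of negative sample ids
    scores:    CNN-1 predicted likelihood of being a target, per negative
    """
    rng = random.Random(seed)
    # Sampling with replacement; high-scoring negatives are drawn more often.
    return rng.choices(negatives, weights=scores, k=k)


patch_ids = ["easy_1", "easy_2", "hard_1"]
cnn1_scores = [0.01, 0.02, 0.90]  # hard_1 looks like a target to CNN-1
batch = sample_hard_negatives(patch_ids, cnn1_scores, k=1000)
```

Most of the resulting training batch for CNN-2 consists of the hard negative, even though it is only one of three candidates.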

Question 6

Q

Intersection over Union (IoU)

Answer

A

Measure of object detection quality when using bounding boxes
o A bounding box is a box drawn around an object of interest
IoU is the area of the intersection divided by the area of the union of the ground truth and detection boxes
o A larger score means that the detection and the ground truth overlap more closely (1 = identical boxes, 0 = no overlap)
Often use a threshold (e.g. 0.5) as hit criterion
o I.e. the intersection must be at least half of the union of the detection and ground truth boxes
o Also depends on the application, for higher accuracy the threshold is often higher
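A minimal sketch of the IoU computation, assuming axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates. That box format and the example values are assumptions for illustration.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    # Corners of the intersection rectangle (may be empty).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
detection = (5, 0, 15, 10)      # covers exactly half of the ground truth
score = iou(ground_truth, detection)
is_hit = score >= 0.5
```

Here the detection covers half of the ground truth box, yet the IoU is only 50 / 150 = 1/3, so it does not count as a hit at a threshold of 0.5.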

Question 7

Q

Region Proposal Networks

Answer

A

Using ConvNets for detection tasks
For example the Regions with CNN features (R-CNN) network
Region proposal is based on selective search
o Proposes about 2000 bounding boxes per image
o Very slow
Each region is resized and processed with AlexNet
o Regions may have different shapes, so each is resized to match the input size of AlexNet exactly
o Pre-trained on ImageNet, which does not have bounding boxes
o Extract a vector of 4096 features from the last fully connected layer from AlexNet
▪ We remove the output layer from AlexNet
We train a linear Support Vector Machine (SVM) per class
o Positive examples: bounding boxes with IoU > 0.5 with ground truth
o Negative examples: all other bounding boxes
o Very slow: one SVM per class
R-CNN performs multiple forward passes (one per proposed region)
o Fast R-CNN performs a single forward pass
▪ Use VGG16 to process the whole image
▪ Obtain feature maps from the convolution + pooling layers
▪ Proposed bounding boxes are applied to the feature map
▪ Region of interest is cropped and downscaled (to a fixed size) via a RoI pooling layer
▪ Rest of the network processes the RoI:
Softmax predicts the class
Bbox regressor refines bounding box
▪ Does not need an additional SVM
▪ But still uses region proposal -> slow, most computation is in region proposal
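The RoI pooling step (crop a region from the feature map and reduce it to a fixed size) can be sketched in plain Python. This is a simplified sketch on a single 2D feature map with integer bin boundaries; the function name and the toy values are assumptions, and library implementations differ in detail.

```python
def roi_max_pool(feature_map, roi, out_size):
    """Crop roi = (r1, c1, r2, c2) (end-exclusive) from a 2D feature map
    and max-pool it down to out_size x out_size bins."""
    r1, c1, r2, c2 = roi
    h, w = r2 - r1, c2 - c1
    pooled = []
    for i in range(out_size):
        # Row range of bin i; each bin covers at least one cell.
        rs = r1 + (i * h) // out_size
        re = max(rs + 1, r1 + ((i + 1) * h) // out_size)
        row = []
        for j in range(out_size):
            cs = c1 + (j * w) // out_size
            ce = max(cs + 1, c1 + ((j + 1) * w) // out_size)
            row.append(max(feature_map[r][c]
                           for r in range(rs, re)
                           for c in range(cs, ce)))
        pooled.append(row)
    return pooled


fmap = [[1, 2, 3, 4, 5, 6],
        [7, 8, 9, 1, 2, 3],
        [4, 5, 6, 7, 8, 9],
        [1, 3, 5, 7, 9, 2]]
small = roi_max_pool(fmap, (0, 0, 2, 2), 2)   # 2x2 region
large = roi_max_pool(fmap, (0, 0, 4, 6), 2)   # 4x6 region
```

Both regions end up as 2x2 outputs, which is what allows the rest of the network to process regions of different shapes.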
Faster R-CNN: another level of improvements by using a single
network
o Tasks performed by network:
▪ Region proposal
▪ Classification
▪ Bounding box refinement
o New part is region proposal network
▪ Input: feature map
A small convolutional layer slides over the feature map and produces a 256-dimensional feature vector at each location
Each vector is processed to produce object bounding boxes and scores
▪ Output: set of rectangular object proposals, each with an objectness score
Likelihood of containing an object rather than background
Anchor boxes: boxes of different sizes -> objects can appear at different sizes. Number of boxes is typically a hyper-parameter
o Example in the lecture figure (left): smaller object detections lie within the boxes that detect the lungs
o In the figure, the bottom part is a fully-convolutional network
o In the figure, the top part is Fast R-CNN
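Anchor boxes at one feature-map location can be generated as in the sketch below. The scale and aspect-ratio values are hypothetical, and varying the aspect ratio as well as the size is an assumption beyond what the card states.

```python
def make_anchors(cx, cy, scales, ratios):
    """Return anchors as (x1, y1, x2, y2), all centred on (cx, cy).

    scales: side length of the square anchor at each scale
    ratios: width / height aspect ratios; the area stays scale ** 2
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * r ** 0.5
            h = s / r ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


# Two scales x two ratios = four anchors at this location.
anchors = make_anchors(50, 50, scales=[16, 32], ratios=[1.0, 4.0])
```

The number of anchors per location, here len(scales) * len(ratios), is the hyper-parameter mentioned above.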

Question 8

Q

YOLO: You Only Look Once

Answer

A

A single network does:
o Bounding box prediction
o Class prediction for each bounding box
The image is divided into an SxS grid
o Each cell predicts B bounding boxes and a confidence score for each of those boxes
▪ Bounding boxes = anchor boxes
o If the centre of an object falls into a grid cell, that grid cell is
responsible for detecting the object
Each bounding box consists of 5 predictions
o (x,y): centre of the bounding box relative to the bounds of
the grid cell
o (w,h): predicted width and height relative to the whole
image (0 <= (w,h) <= 1)
o Confidence: IoU between predicted box and any ground truth box -> not for class, but
for background-foreground
▪ Probability of having an object in a cell multiplied by IoU of the bounding box
▪ No object: confidence should be 0
▪ If there is an object: confidence should be equal to the IoU with ground
truth
Classification:
o Each cell also predicts C conditional class probabilities
▪ Conditioned on the grid cell containing an object
▪ In practice, a vector of C values is produced for each grid cell
Output as a tensor with several components:
o SxS: grid cells
o B: bounding boxes per cell
o x, y, w, h, confidence: per bounding box
o C: class probabilities per cell
o Shape of output tensor is S x S x (B*5 + C)
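The output layout can be checked with a short sketch. The values S = 7, B = 2, C = 20 are example settings, and `decode_box` is a hypothetical helper that converts the cell-relative centre into image-relative coordinates.

```python
def output_shape(S, B, C):
    """Each of the S x S cells holds B boxes of 5 values plus C class scores."""
    return (S, S, B * 5 + C)


def decode_box(row, col, S, x, y, w, h):
    """Convert one predicted box to (cx, cy, w, h) in image-relative units.

    (x, y) are offsets within grid cell (row, col), each in [0, 1];
    (w, h) are already relative to the whole image.
    """
    cx = (col + x) / S
    cy = (row + y) / S
    return (cx, cy, w, h)


shape = output_shape(S=7, B=2, C=20)
box = decode_box(row=3, col=3, S=7, x=0.5, y=0.5, w=0.2, h=0.4)
```

A box centred in the middle cell of a 7x7 grid decodes to the centre of the image, (0.5, 0.5).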
Network with 24 convolutional layers followed by 2 fully connected layers; the convolutional layers are pre-trained on ImageNet
Uses a custom loss function
o Minimises sum-squared error
o Sum over all bounding boxes B and all grid cells SxS
o First term: want to predict the centre of the bounding box with a low error
o Second term: want to predict the size of the bounding box with low error
▪ The same absolute error matters more for small bounding boxes than for large ones (hence we take the square root of width and height)
o Third term: confidence should be high when there is an object in the ground truth
▪ But there are also many grid cells without objects, where the confidence should be very small
o Fourth term: accounts for grid cells without objects
▪ Coefficients λcoord (up-weights the coordinate terms) and λnoobj (down-weights the no-object confidence term) avoid instability during gradient descent
o Fifth term: classification error should be small
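The five terms described above can be written out as one equation. The notation is an assumption, since the slide equation is not part of these notes: hats mark the targets, C_i is the box confidence, p_i(c) the class probability, and the indicator is 1 when box j in cell i is responsible for an object ("obj") or when no object is assigned to it ("noobj").

```latex
\begin{aligned}
L ={}& \lambda_{\text{coord}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\text{obj}}
       \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
     &+ \lambda_{\text{coord}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\text{obj}}
       \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
     &+ \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
     &+ \lambda_{\text{noobj}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
     &+ \sum_{i=1}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The rows correspond, in order, to the first through fifth terms listed above.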
Limitations:
o Imposes strong spatial constraints on bounding box
▪ Limits number of nearby objects that model can predict
o Struggles with small objects that appear in groups
Multiple (anchor) boxes can be found for the same object
o Each has its own probability of containing an object
o Typically use the non-max suppression algorithm
▪ Ensures we end up with a single bounding box for an object
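A minimal sketch of greedy non-max suppression, assuming boxes in (x1, y1, x2, y2) format and an IoU threshold of 0.5. Both the format and the threshold are assumptions for illustration.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes that are kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap strongly with a kept one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

The second box overlaps the first with an IoU of about 0.68 and is suppressed, while the third box belongs to a different object and is kept.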
