Lecture 6 Flashcards
- new layer architecture
- Outputs are concatenated
- Problems
o Concatenation makes the matrix very big
o Max-pooling does not change the number of feature maps, only their size
- Thus, we should reduce the number of feature maps at some locations in the architecture
- Solution
o Use additional 1x1 convolutions -> does not change the spatial size of the feature maps
▪ Applying a single 1x1xM convolutional filter to a feature map of NxNxM results in an NxN feature map
▪ Applying K 1x1xM convolutional filters to a feature map of NxNxM results in an NxNxK feature map
o Can tune the number of output feature maps -> dimensionality reduction
o Combines all activations across feature maps and makes a smaller set
- Also called inception network -> often repeated multiple times in a bigger network
- Uses two additional classifiers during training
o To counteract very deep network that may not propagate gradients back through all
layers in an effective manner
o Encourages discrimination in lower stages
o Increases gradient signal that gets propagated back
o Fully connected layers are replaced by average pooling
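The 1x1 convolution used for reducing the number of feature maps can be sketched in plain Python. This is a minimal illustration, assuming nested lists for the NxNxM feature map and no bias or activation; the name conv1x1 and the example values are hypothetical.

```python
def conv1x1(feature_map, filters):
    """Apply K 1x1xM filters to an N x N x M feature map.

    feature_map: nested lists of shape N x N x M
    filters: K lists, each holding M weights
    Returns nested lists of shape N x N x K (spatial size is unchanged).
    """
    return [
        [
            # Each output value mixes all M input channels at one pixel.
            [sum(w * v for w, v in zip(filt, pixel)) for filt in filters]
            for pixel in row
        ]
        for row in feature_map
    ]


# Illustrative values: N = 2, M = 3 input maps, K = 2 output maps.
feature_map = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
filters = [
    [1, 0, 0],  # copies the first input channel
    [1, 1, 1],  # sums all input channels
]
reduced = conv1x1(feature_map, filters)
```

Choosing K smaller than M is what shrinks the stack of feature maps while N stays the same.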
- Deeper networks did not necessarily improve training error
o Gradients no longer flow back to the input signal
▪ Use skip connections
o As we go deeper, representations become difficult to learn
▪ Learn the residual
- Skip connection skips one activation function and is added to the next
- In DenseNet, skip connections go to all future activations
o Strong gradient flow
o Computationally efficient -> controlled by constant K channels per layer
o Low complexity features -> final classifier sees features from all layers
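The residual idea can be sketched as follows: the block only has to learn the residual F(x), and the input x is added back before the next activation. This is a minimal sketch on flat lists of numbers, assuming a ReLU activation; residual_block and the transform argument are hypothetical names.

```python
def relu(values):
    """Element-wise ReLU on a list of numbers."""
    return [max(0.0, v) for v in values]


def residual_block(x, transform):
    """Return relu(transform(x) + x).

    transform stands in for the learned layers F(x); the skip
    connection adds the unchanged input x to its output.
    """
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("transform must preserve the length of x")
    return relu([f + xi for f, xi in zip(fx, x)])


# If the learned residual is zero, the block passes x through unchanged
# (for non-negative x), which keeps very deep stacks trainable.
x = [1.0, 2.0, 3.0]
y = residual_block(x, lambda v: [0.0 for _ in v])
```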
Using pre-trained Networks
- Common networks (VGG16, Inception, ResNet) are trained on ImageNet and available
- Helps when we have limited data
- Very small dataset: feature extractor
o Only train final layer, freeze all other layers
- Medium dataset: fine-tuning
o Train last few layers, freeze all other layers
- Much faster, because low-level features are often similar between tasks
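The two regimes above differ only in how many of the last layers stay trainable. A framework-independent sketch, assuming the network is described by an ordered list of layer names (the names and the helper trainable_flags are hypothetical, not part of any library):

```python
def trainable_flags(layer_names, n_trainable):
    """Mark the last n_trainable layers as trainable and freeze the rest."""
    if not 0 <= n_trainable <= len(layer_names):
        raise ValueError("n_trainable must be between 0 and the number of layers")
    cutoff = len(layer_names) - n_trainable
    return {name: index >= cutoff for index, name in enumerate(layer_names)}


# Hypothetical layer names for a small pre-trained network.
layers = ["conv1", "conv2", "conv3", "fc"]

feature_extractor = trainable_flags(layers, 1)  # very small dataset
fine_tuning = trainable_flags(layers, 2)        # medium dataset
```

In a real framework the flags would be applied to each layer's parameters; how that is done depends on the library.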
Detection Tasks
* Find certain structures in an image
o E.g. can we find any nodules within this scan?
* Possible approaches:
o Patch classification (is the patch centred on the structure)
o Segmentation (U-Net, Dilated Networks)
o Predict location
Sampling Strategies
- Often most of the image area is easy to recognise
- Few target objects
- Difficult negatives are difficult to identify upfront
- Thus sampling using a uniform grid or random sampling is not likely to find target objects or difficult negatives
- Can train a CNN to first identify possible matches using random sampling
- Then, use a second CNN to find the (hard) negative samples (hard-negative mining)
o Negative samples with a high likelihood are sampled more often in CNN-2
o Can use a deep network to identify difficult negatives (often healthy cells that look a lot like unhealthy cells)
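One way to sample hard negatives more often is to weight each negative by the likelihood that the first CNN assigned to it. A minimal sketch, assuming those scores are already available as a list; the floor parameter (so easy negatives are still seen occasionally) and both function names are assumptions for illustration.

```python
import random


def sampling_weights(negative_scores, floor=0.05):
    """Weight each negative by CNN-1's likelihood of it being a target."""
    return [max(score, floor) for score in negative_scores]


def sample_hard_negatives(negatives, negative_scores, n, seed=0):
    """Draw n negatives (with replacement) for training CNN-2."""
    rng = random.Random(seed)
    weights = sampling_weights(negative_scores)
    return rng.choices(negatives, weights=weights, k=n)


# Illustrative values: the "hard" patch looks like a target to CNN-1.
negatives = ["easy_patch", "hard_patch"]
scores = [0.01, 0.95]
batch = sample_hard_negatives(negatives, scores, n=1000)
```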
Intersection over Union (IoU)
- Measure of object detection quality when using bounding boxes
o A bounding box is a box around an object of interest
- IoU measures how large the intersection is compared to the total combined area (the union) of the ground truth and detection
o A larger score means that the detection overlaps more tightly with the ground truth
- Often use a threshold (e.g. 0.5) as hit criterion
o I.e. the intersection of the detected box and the ground truth box must be at least half of their union
o Also depends on the application, for higher accuracy the threshold is often set higher
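IoU for two axis-aligned boxes can be computed directly from their corners. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2; the function names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_hit(detection, ground_truth, threshold=0.5):
    """Apply the hit criterion: IoU must reach the threshold."""
    return iou(detection, ground_truth) >= threshold
```

For example, two 2x2 boxes shifted by one unit overlap in an area of 2 and have a union of 6, so their IoU is 1/3 and they do not count as a hit at threshold 0.5.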
Region Proposal Networks
- Using ConvNets for detection tasks
- For example the Regions with CNN features (R-CNN) network
- Region proposal is based on selective search
o Proposes 2000 bounding boxes per image
o Very slow
- Each region is resized and processed with AlexNet
o Regions may have different shapes, but need to match exactly with the input size of AlexNet
o Pre-trained on ImageNet, which does not have bounding boxes
o Extract a vector of 4096 features from the last fully connected layer of AlexNet
▪ We remove the output layer from AlexNet
- We train a linear Support Vector Machine (SVM) per class
o Positive examples: bounding boxes with IoU > 0.5 with ground truth
o Negative examples: all other bounding boxes
o Very slow: one per class
- R-CNN performs multiple forward passes
o Fast R-CNN performs one single forward pass
▪ Use VGG16 to process whole image
▪ Obtain feature maps from convolution + pooling layers
▪ Proposed bounding boxes are applied to the feature map
▪ Region of interest is cropped and downscaled (to a fixed size) via a RoI pooling layer
▪ Rest of the network processes RoI:
- Softmax predicts the class
- Bbox regressor refines bounding box
▪ Does not need additional SVM
▪ But still uses region proposal -> slow, most computation is in region proposal
- Faster R-CNN: another level of improvements by using a single network
o Tasks performed by network:
▪ Region proposal
▪ Classification
▪ Bounding box refinement
o New part is region proposal network
▪ Input: feature map
- Each feature map is processed by a small convolutional layer, which produces a 256-feature vector
- Vector processed to produce object bounding box and score
▪ Output: set of rectangular object proposals, each with an objectness score
- Likelihood of containing an object or background
- Anchor boxes: different size boxes -> objects can be present in different sizes. Number of boxes is typically a hyper-parameter
o For example on the left, we have smaller object detections
within the boxes that detect the lungs
o Bottom part is a fully-convolutional network
o Top part is Fast R-CNN
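The anchor boxes mentioned above can be generated per location from a set of scales and aspect ratios. A minimal sketch, assuming three scales and three width/height ratios (illustrative hyper-parameter values, not taken from the lecture) and boxes given as (x1, y1, x2, y2); make_anchors is a hypothetical name.

```python
import math


def make_anchors(cx, cy, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Return one anchor box per (scale, ratio) pair, centred on (cx, cy).

    ratio is width / height; each anchor keeps an area of scale * scale.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            width = scale * math.sqrt(ratio)
            height = scale / math.sqrt(ratio)
            anchors.append((
                cx - width / 2, cy - height / 2,
                cx + width / 2, cy + height / 2,
            ))
    return anchors


# 3 scales x 3 ratios gives 9 anchors at this location.
anchors = make_anchors(100, 100)
```

The region proposal network then scores each anchor for objectness and refines its coordinates.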
YOLO: You Only Look Once
- A single network does:
o Bounding box prediction
o Class prediction for each bounding box
- The image is divided into an SxS grid
o Each cell predicts B bounding boxes and a confidence score for each of those boxes
▪ Bounding boxes = anchor boxes
o If the centre of an object falls into a grid cell, that grid cell is responsible for detecting the object
- Each bounding box consists of 5 predictions
o (x,y): centre of the bounding box relative to the bounds of
the grid cell
o (w,h): predicted width and height relative to the whole
image (0 <= (w,h) <= 1)
o Confidence: IoU between predicted box and any ground truth box -> not for class, but
for background-foreground
▪ Probability of having an object in a cell multiplied by IoU of the bounding box
▪ No object: confidence should be 0
▪ If there is an object: confidence should be equal to the IoU with ground truth
- Classification:
o Each cell also predicts C conditional class probabilities
▪ Conditioned on the grid cell containing an object
▪ In practice, a vector of C values is produced for each grid cell
- Output as a tensor with several components:
o SxS: grid cell
o B: bounding boxes
o x,y,w,h,confidence: per bounding box
o C probabilities
o Shape of output tensor is S x S x (B*5 + C)
- 24-layer convolutional network, pre-trained on ImageNet
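The output tensor layout described above can be checked with a small sketch. The example values S = 7, B = 2, C = 20 and the ordering inside a cell vector (all boxes first, then class probabilities) are assumptions for illustration; both function names are hypothetical.

```python
def yolo_output_shape(S, B, C):
    """Shape of the output tensor: S x S x (B*5 + C)."""
    return (S, S, B * 5 + C)


def split_cell_vector(cell, B, C):
    """Split one grid cell's vector into B boxes and C class probabilities.

    Assumed layout: B groups of (x, y, w, h, confidence), then C values.
    """
    if len(cell) != B * 5 + C:
        raise ValueError("cell vector must have B*5 + C entries")
    boxes = [tuple(cell[i * 5:(i + 1) * 5]) for i in range(B)]
    class_probs = list(cell[B * 5:])
    return boxes, class_probs


shape = yolo_output_shape(S=7, B=2, C=20)  # 7 x 7 x 30
```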
- Uses a custom loss function
o Minimises sum-squared error
o Sum over all bounding boxes B and
all grid cells SxS
o First term: want to predict the
centre of the bounding box with a
low error
o Second term: want to predict the size of the bounding box with low error
▪ The same absolute error matters more in small bounding boxes than in large bounding boxes (hence we use a square root)
o Third term: confidence should be high when there is an object in the ground truth
▪ But also many grid cells without objects where the confidence is very small
o Fourth term: accounts for grid cells without objects and avoids instability during gradient descent
▪ Adds coefficients λcoord and λnoobj
o Fifth term: classification error should be small
- Limitations:
o Imposes strong spatial constraints on bounding box
▪ Limits number of nearby objects that model can predict
o Struggles with small objects that appear in groups
- Multiple (anchor) boxes can be found for the same object
o Each has its own probability of containing an object
o Typically use non-max suppression
▪ Ensures we end up with a single bounding box for an object
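Non-max suppression can be sketched as a greedy loop: keep the highest-scoring box, drop every remaining box that overlaps it too much, and repeat. A minimal sketch, assuming boxes as (x1, y1, x2, y2) and an IoU threshold of 0.5 (an illustrative value); the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too strongly.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two overlapping detections of one object plus one separate object.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

In this example the second box has an IoU of about 0.68 with the first and is suppressed, leaving one box per object.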