09-10 - Segmentation & Object Detection Flashcards

1
Q

Explain the GMM model

A

You need to initialize the number of clusters as in k-means clustering, but instead of spherical (Euclidean) distances we model each cluster as an ellipsoid (a Gaussian with a full covariance matrix), so we take the Mahalanobis distance.

GMM does not only estimate the K means/covariances, but also the mixing coefficients → the priors/probabilities of each cluster.
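
A minimal sketch of fitting a GMM with scikit-learn (the toy 2D data is made up for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 2D data: two elongated (non-spherical) clusters.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=200),
    rng.multivariate_normal([6, 4], [[1.0, -0.8], [-0.8, 2.0]], size=200),
])

# covariance_type="full" lets each cluster be its own ellipsoid.
gmm = GaussianMixture(n_components=2, covariance_type="full").fit(X)

print(gmm.means_)              # estimated cluster means
print(gmm.weights_)            # mixing coefficients (priors), sum to 1
probs = gmm.predict_proba(X)   # soft cluster memberships, N x K
```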

2
Q

Explain the EM algorithm for GMM optimization

A

The algorithm for optimizing a GMM, called EM, alternates between the Expectation and the Maximization step.

Expectation:

  • calculate the probabilities of all data points belonging to each cluster (the responsibilities)
  • For N data points and K clusters: N×K probabilities

Maximization:

  • estimate the effective number of points assigned to each cluster from the responsibilities
  • update the cluster means, the covariances, and the mixing coefficients

The beauty of it is the automatic estimation of both cluster memberships and distribution parameters.
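
A minimal NumPy/SciPy sketch of the two alternating steps (initialization and stopping criteria are simplified assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100):
    """Minimal EM for a Gaussian mixture (illustrative sketch)."""
    N, D = X.shape
    rng = np.random.default_rng(0)
    means = X[rng.choice(N, K, replace=False)]   # random data points as means
    covs = np.stack([np.eye(D)] * K)             # identity covariances
    pis = np.full(K, 1.0 / K)                    # uniform priors

    for _ in range(n_iter):
        # E-step: responsibilities, an N x K matrix of probabilities.
        resp = np.stack([
            pis[k] * multivariate_normal.pdf(X, means[k], covs[k])
            for k in range(K)
        ], axis=1)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: effective counts, then update means/covs/priors.
        Nk = resp.sum(axis=0)                    # points per cluster
        means = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            d = X - means[k]
            covs[k] = (resp[:, k, None] * d).T @ d / Nk[k]
        pis = Nk / N

    return means, covs, pis
```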

3
Q

Explain the update method for the mean shift algorithm

A
  1. estimate the density at all data points (kernel density estimation)
  2. compute a mean shift vector that points in the direction of the maximum increase in density
  3. update the data points → add the mean shift vector to the current position
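
A sketch of one update step for a single point, assuming a Gaussian kernel with bandwidth h:

```python
import numpy as np

def mean_shift_step(points, x, h):
    """One mean shift update for a single point x (illustrative sketch).

    points : (N, D) data, x : (D,) current position, h : bandwidth."""
    # Gaussian kernel weights from the density estimate around x.
    sq_dists = np.sum((points - x) ** 2, axis=1)
    w = np.exp(-sq_dists / (2 * h ** 2))
    # The mean shift vector is (weighted mean - x), which points
    # toward the local density maximum.
    mean = (w[:, None] * points).sum(axis=0) / w.sum()
    return x + (mean - x)   # i.e. move x to the weighted mean
```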
4
Q

What is the role of the kernel and the bandwidth for mean shift?

A

The bandwidth h is the window size.
- A smaller bandwidth leads to more detailed, fine-grained clusters.
- If it is too small, the result fragments into way too many clusters.
- A larger bandwidth results in smoother, coarser clusters.
- If it is way too big, everything collapses into one big cluster.

There are different options for kernel functions, but the Gaussian kernel is the standard choice.
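
A quick sanity check of the bandwidth effect, using scikit-learn's MeanShift on made-up blob data:

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# Small bandwidth -> many fine-grained clusters; large -> few coarse ones.
for h in (0.5, 2.0, 10.0):
    labels = MeanShift(bandwidth=h).fit_predict(X)
    print(f"bandwidth={h}: {len(np.unique(labels))} clusters")
```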

5
Q

Know the differences between and pros/cons of k-means, GMMs, and mean shift

A

k-means and GMM need the number of clusters specified up front, which mean shift does not. Mean shift is computationally heavy, though. GMM is more flexible with respect to cluster shape and size than k-means; mean shift is the most flexible of the three.

6
Q

What is the structure of UNet?

A

A labeled training dataset is needed.
- INPUT: color images
- Left (encoder): a series of 3x3 convolutions and activations followed by downsampling
- Right (decoder): decoding & upsampling of high-level features back to the original image dimensions
- The horizontal "copy and crop" or skip connection arrows allow for inclusion of both low- and high-level information during upsampling (the numbers in the diagram indicate the width + height of the feature maps)
- OUTPUT: probability map, so like classification, but we get a probability for each individual pixel, not one for the whole image
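
A minimal PyTorch sketch of this structure; unlike the original paper it uses padded convolutions, so no cropping is needed at the skip connections (input size assumed divisible by 4):

```python
import torch
import torch.nn as nn

def block(c_in, c_out):
    """Two 3x3 convolutions with ReLU, as in each U-Net stage."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.enc1, self.enc2 = block(3, 32), block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = block(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = block(128, 64)       # 128 = 64 (skip) + 64 (upsampled)
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = block(64, 32)
        self.head = nn.Conv2d(32, n_classes, 1)   # per-pixel class scores

    def forward(self, x):
        e1 = self.enc1(x)                          # encoder (left side)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)   # logits; softmax gives the probability map
```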

7
Q

What are U-Net's modified loss weighting terms?

A

Two important things make U-Net so good: the structure, and the extra weighting.
The extra weighting function consists of two parts:

  1. Class balancing term: w_c
  2. A border term between touching segments, weighted by w_0

This promotes the small weird/different parts of images and focuses training on the difficult borders between touching segments.
For each pixel, we check the distances to the two nearest "blobs"; at an edge between touching segments, both are very close. That pixel gets a high weight, which makes the algorithm focus more on that region.
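
In the U-Net paper the two parts combine as w(x) = w_c(x) + w_0 * exp(-(d1(x) + d2(x))^2 / (2*sigma^2)), where d1 and d2 are the distances to the nearest and second-nearest segment. A sketch with SciPy distance transforms (w_c is assumed precomputed; w_0 = 10 and sigma = 5 are the paper's values):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def unet_weight_map(masks, w_c, w0=10.0, sigma=5.0):
    """masks: list of binary instance masks (at least two instances).
    w_c: per-pixel class-balancing weights of the same shape."""
    # Distance from every pixel to the border of each object instance.
    dists = np.stack([distance_transform_edt(m == 0) for m in masks])
    dists.sort(axis=0)                  # ascending per pixel
    d1, d2 = dists[0], dists[1]         # nearest and second-nearest object
    return w_c + w0 * np.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2))
```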

8
Q

What does it take to set up a training/testing pipeline on a new segmentation task using U-Net?

A

We need labeled data (images with per-pixel ground-truth masks), but fortunately not that much, because U-Net relies heavily on data augmentation and the loss is pixel-wise, so every pixel acts as a training example.
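
A minimal training-loop sketch, assuming the U-Net sketch from the previous card and a hypothetical train_set of (image, mask) pairs:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# `train_set` is a hypothetical Dataset yielding (image, mask) pairs,
# with masks holding one class index per pixel; augmentation (flips,
# elastic deformations, ...) would live inside the Dataset.
loader = DataLoader(train_set, batch_size=4, shuffle=True)

model = TinyUNet(n_classes=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()   # per-pixel classification loss

for epoch in range(10):
    for images, masks in loader:
        logits = model(images)
        loss = loss_fn(logits, masks)
        opt.zero_grad()
        loss.backward()
        opt.step()
```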

9
Q

Pros and cons of using U-Net vs. clustering methods for segmentation

A

Pros:
- Does not rely on color consistency
- The multi-scale structure allows it to use both local and semi-global information around each pixel in the decision process
- Simple to implement
- Needs few training examples

Cons:
- Limited interpretability (black box, no insight into the reasoning)
- Prone to overfitting
- Computationally more demanding than clustering algorithms

10
Q

What is the idea behind Linemod and what modalities does it use?

A

Object detection methods often have problems either with real-time applicability (learning-based methods) or with handling background clutter (template matching). Linemod solves this by using multiple modalities instead of just one:
1. Color features (pixel gradients)
2. Depth features (surface normals)

11
Q

How are Linemod features quantized?

A

The computed image and depth gradients are normalized so that we are left with only a direction and no magnitude. The sign of the image gradient is omitted.

The result is quantized into predefined orientation bins:

2D bins for the color gradients (left), 3D bins for the normal vectors (right).
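
A sketch of the 2D color-gradient case (the number of bins is a free parameter here):

```python
import numpy as np

def quantize_orientations(gx, gy, n_bins=8):
    """Quantize image gradient directions into orientation bins.

    The sign is omitted, so angles are folded into [0, pi)."""
    angles = np.arctan2(gy, gx) % np.pi          # direction only, no sign
    bin_width = np.pi / n_bins
    return (angles / bin_width).astype(int) % n_bins
```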

12
Q

Explain how Linemod matching works and write the matching function for color/depth gradients, both pixel-wise and for an image window

A

Notation:
- Set of modalities M
- Set of reference images {O_m}_{m in M} of the object
- List of pairs P of (r, m): feature locations r and their modality m
- Template T = ({O_m}_{m in M}, P)

Pixelwise similarity measure: the quantized color/depth gradient maps are used to match between image and object. At each pixel location the orientations are compared via the dot product of the unit direction vectors, i.e. the (absolute) cosine of the orientation difference.

Window similarity: finally, a matching function is used for a single object template placed at image position c, searching a small neighborhood R(c + r) around each feature:

E(I, T, c) = sum over (r, m) in P of max over t in R(c + r) of f_m(O_m(r), I_m(t))

So we take the sum, over all pairs in P, of the maximum similarity score within each neighborhood (the highest similarity amongst all nearby positions in the test image where the template feature can be aligned).
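
A sketch of the window similarity for the color-gradient modality only (tmpl_ori, img_ori, and P are assumed precomputed orientation maps and feature locations, with all neighborhoods inside the image):

```python
import numpy as np

def window_similarity(tmpl_ori, img_ori, P, c, nbh=2):
    """Linemod-style matching score for one template at image position c.

    tmpl_ori / img_ori: per-pixel gradient orientations (radians).
    P: list of (ry, rx) feature locations relative to the template origin.
    nbh: half-size of the neighborhood R(c + r) searched per feature."""
    score = 0.0
    for (ry, rx) in P:
        o = tmpl_ori[ry, rx]
        y, x = c[0] + ry, c[1] + rx
        region = img_ori[y - nbh:y + nbh + 1, x - nbh:x + nbh + 1]
        # Pixelwise similarity: |cos| of the orientation difference;
        # keep the best match within the neighborhood.
        score += np.abs(np.cos(region - o)).max()
    return score
```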

13
Q

How are 2D detections parameterized in CenterNet, and how is this different from regular anchor-based detectors?

A

CenterNet is a category-level detector, but can be retrained as an instance detector instead.
Difference from anchor-based detectors: no predefined anchor-box proposals; objects are represented directly by their center points. Therefore faster.
We only look at image data now, no depth anymore. CenterNet has three heads/components:

  1. Center point (keypoint heatmap)
  2. Width and height
  3. Object category (label prediction)
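
A sketch of the three heads in PyTorch (channel counts are illustrative assumptions):

```python
import torch
import torch.nn as nn

class CenterNetHeads(nn.Module):
    """Sketch of CenterNet's prediction heads on top of a backbone
    feature map (output stride R = 4)."""
    def __init__(self, c_feat=64, n_classes=80):
        super().__init__()
        def head(c_out):
            return nn.Sequential(
                nn.Conv2d(c_feat, c_feat, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(c_feat, c_out, 1),
            )
        self.heatmap = head(n_classes)   # per-class center-point heatmap
        self.wh = head(2)                # width and height at each center
        self.offset = head(2)            # sub-pixel offset from downsampling

    def forward(self, feat):
        return (torch.sigmoid(self.heatmap(feat)),
                self.wh(feat), self.offset(feat))
```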
14
Q

How is the three-term loss built in CenterNet, and what do the terms mean?

A

L = L_class + lambda_size * L_size + lambda_off * L_offset

Offset loss: compensates for having to upscale after downsampling with stride R = 4 (when finding the center point). If the low-resolution heatmap were simply upsampled again, there would automatically be an offset from the ground truth, so a small sub-pixel offset is regressed per center.

Classification (heatmap) loss: a focal-style loss that prioritizes the few foreground (center) pixels and down-weights background pixels so they don't overshadow the loss (there are typically far more background than foreground pixels).

Object size loss: penalizes differences between the predicted width/height and the true width/height.
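
A sketch of how the three terms combine, assuming sigmoid-activated heatmap predictions and a boolean mask of ground-truth center locations (the lambda values follow the paper):

```python
import torch
import torch.nn.functional as F

def centernet_loss(hm_pred, hm_gt, wh_pred, wh_gt, off_pred, off_gt, mask,
                   lam_size=0.1, lam_off=1.0):
    # Focal-style heatmap loss: up-weights the rare center (foreground)
    # pixels and down-weights the many easy background pixels.
    pos = hm_gt.eq(1).float()
    neg = 1.0 - pos
    l_class = -(pos * (1 - hm_pred) ** 2 * torch.log(hm_pred + 1e-6)
                + neg * (1 - hm_gt) ** 4 * hm_pred ** 2
                * torch.log(1 - hm_pred + 1e-6)).sum() / pos.sum().clamp(min=1)

    # L1 penalties, supervised only at ground-truth center locations.
    l_size = F.l1_loss(wh_pred[mask], wh_gt[mask])
    l_off = F.l1_loss(off_pred[mask], off_gt[mask])

    return l_class + lam_size * l_size + lam_off * l_off
```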

15
Q

How is the prediction output converted back to a bounding box in the full image?

A

The predicted center, size, and offset are used to recover the bounding box: take the peaks of the class heatmap as centers, add the predicted sub-pixel offset, scale back to input resolution by the output stride R = 4, and place a box of the predicted width/height around each center.
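
A sketch for decoding one heatmap peak, assuming the predicted width/height is in input-image pixels (implementations differ on this):

```python
def decode_box(center, offset, wh, R=4):
    """Recover a full-resolution bounding box from one heatmap peak.

    center: integer (x, y) peak location on the low-resolution heatmap,
    offset: predicted sub-pixel offset, wh: predicted width/height,
    R: output stride of the network."""
    cx = (center[0] + offset[0]) * R   # back to input-image coordinates
    cy = (center[1] + offset[1]) * R
    w, h = wh
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)  # x1, y1, x2, y2
```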

16
Q

How to repurpose CenterNet to other tasks?

A

3D boxes instead of 2D
- Requires 3 additional attributes per center point: depth, 3D dimensions, and orientation

Human body pose estimation
- Find joint locations by detecting key points of objects and treating the joints as classes.
- For each joint class a heatmap is generated, and all heatmaps are combined into one.
- Correctly associating joint predictions with the right person is crucial in situations with multiple people.