L10: Object detection Flashcards

1
Q

Linemod

❗️❗️❗️The modalities used by Linemod:

A
  1. Color features:
    Local gradients → take the maximum gradient across the R/G/B channels at each pixel position (no greyscale conversion)
  2. Depth features:
    Depth gradient → local surface normal vector, estimated from point cloud data (has an orientation/direction)
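The max-over-channels color gradient can be sketched in numpy (an illustrative simplification; central differences and the function name are my own, not from the Linemod paper):

```python
import numpy as np

def color_gradient(img):
    """Per-pixel gradient taken from the color channel with the largest
    gradient magnitude (no greyscale conversion). img: H x W x 3 floats."""
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    # Central-difference gradients per channel (borders left at zero)
    gx[:, 1:-1, :] = (img[:, 2:, :] - img[:, :-2, :]) / 2.0
    gy[1:-1, :, :] = (img[2:, :, :] - img[:-2, :, :]) / 2.0
    mag = np.hypot(gx, gy)                    # H x W x 3 magnitudes
    best = np.argmax(mag, axis=2)             # channel with max magnitude
    ii, jj = np.indices(best.shape)
    return gx[ii, jj, best], gy[ii, jj, best], mag[ii, jj, best]
```

Taking the strongest channel (rather than converting to greyscale first) keeps gradients that would cancel out in a luminance image.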
2
Q

Linemod

❗️❗️❗️How features are quantized

A

The resulting directions are quantized into predefined orientation bins.

  • Color gradients → Bin the gradient directions into a set of predefined orientation bins over the 0 to 180 degree range. Negative (opposite-sign) gradient directions are omitted, i.e. folded into the same range.
  • Normal vectors → 3D orientations. Normals pointing out towards the camera lie inside a 3D cone; each normal vector is binned into the nearest of a set of predefined sections of that cone.

Both gradients are normalized, so only the direction remains. After quantization, each gradient is represented by an integer bin index instead of coordinates.
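A minimal sketch of orientation binning (the bin count here is an assumption for illustration; the paper uses a small fixed number of bins):

```python
import numpy as np

N_BINS = 8  # assumed bin count, for illustration only

def quantize_orientation(gx, gy):
    """Map a 2D gradient direction to an integer bin index in [0, N_BINS).
    The sign is ignored, so angles are folded into [0, 180) degrees."""
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    return int(angle // (180.0 / N_BINS))
```

Note that opposite gradients (e.g. a dark-to-light vs. light-to-dark edge) land in the same bin, which is what makes the feature robust to contrast reversals.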

3
Q

Linemod

❗️❗️❗️The matching function for color/depth gradients, both pixel-wise and for an image window

A

Cross-correlation between the quantized color and depth gradients:
- Color domain: dot product of each object template gradient with the corresponding image gradient. Since the gradient sign is omitted, anti-parallel gradients also match (absolute value of the cosine).
- Depth domain: same dot product, but surface normals have a defined direction, so only the positive cosine counts (anti-parallel normals do not match).
For an image window, the per-pixel similarities of all template features are summed.
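The pixel-wise similarity can be sketched as follows (my own reading of the card: absolute cosine for color, clamped-positive cosine for normals; function names are illustrative):

```python
import numpy as np

def color_similarity(t, g):
    """Template vs. image color gradient: |cos| of the angle between the
    normalized 2D gradients (sign ignored, so anti-parallel also matches)."""
    t = t / np.linalg.norm(t)
    g = g / np.linalg.norm(g)
    return abs(float(np.dot(t, g)))

def depth_similarity(t, g):
    """Surface normals have a defined direction, so only the positive
    cosine counts: anti-parallel normals score zero."""
    t = t / np.linalg.norm(t)
    g = g / np.linalg.norm(g)
    return max(0.0, float(np.dot(t, g)))
```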

4
Q

Linemod

What is Linemod?

A

An object detection method based on template matching, using multimodal (color + depth) templates.
Advantage of Linemod: plain template matching requires a large number of templates; Linemod reduces that number and speeds matching up, and it can handle scale, viewpoint, and illumination changes.

5
Q

What levels are there?

A

Instance-level → Detect a specific instance of an object

Category-level → Detect an instance of a certain object type (like dog, fridge, oven, dining_table, etc)

6
Q

Linemod

What is spreading and binarization?

A

Spreading introduces a tolerance to small shifts and deformations: each quantized orientation is spread to its neighboring pixel locations. Binarization then encodes the set of spread orientations at each location as a bitmask, so matching reduces to fast bitwise operations and lookup tables. Together they speed up the matching process.
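A sketch of the spreading step (illustrative; the real implementation uses precomputed response lookup tables on top of these bitmasks):

```python
import numpy as np

def spread_orientations(bins, T=3):
    """bins: H x W array of quantized orientation indices (-1 = no feature).
    Returns H x W integer bitmasks where bit k is set if orientation k
    occurs anywhere in the T x T neighborhood (the 'spreading' tolerance)."""
    H, W = bins.shape
    r = T // 2
    padded = np.full((H + 2 * r, W + 2 * r), -1, dtype=int)
    padded[r:r + H, r:r + W] = bins
    masks = np.zeros((H, W), dtype=int)
    for dy in range(T):
        for dx in range(T):
            win = padded[dy:dy + H, dx:dx + W]
            # set bit 'win' wherever a feature exists in this shifted copy
            masks |= np.where(win >= 0, 1 << win.clip(min=0), 0)
    return masks
```

A template feature with bin k then matches a location if `masks[y, x] & (1 << k)` is nonzero, i.e. the template orientation only has to occur *somewhere* in the neighborhood.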

7
Q

Linemod

What is being matched in the gradients?

A

Compares the quantized gradient features of the object template with the corresponding gradient features extracted from the input image.

  • This can be done in both the color and the depth domain.
8
Q

CenterNet

What is CenterNet?

A

It is a category-level detector (it can also be used at instance level).
It is trained to predict bounding boxes around the detected objects.
- It predicts object centers plus object sizes within the image.

9
Q

CenterNet

What are anchor boxes?

A

Anchor boxes are fixed initial bounding-box guesses, tiled over the image at a set of pixel locations, scales, and aspect ratios. The network classifies each anchor and then regresses ("stretches") it to fit the object it has classified.
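A minimal sketch of anchor generation at one location (scales and ratios here are illustrative values, not from any particular detector):

```python
def make_anchors(cx, cy, scales=(64, 128), ratios=(0.5, 1.0, 2.0)):
    """Fixed initial box guesses centred at (cx, cy): one (x1, y1, x2, y2)
    box per (scale, aspect-ratio) pair, all with area scale**2."""
    boxes = []
    for s in scales:
        for r in ratios:
            w, h = s * r ** 0.5, s / r ** 0.5
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes
```

With these defaults, every pixel location considered by the detector gets 6 candidate boxes, which is why anchor-based detectors evaluate far more candidates than a center-point detector.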

10
Q

CenterNet

❗️❗️❗️How 2D detections are parameterized, and how this is different from regular anchor-based detectors

A

CenterNet scans the image with stride R = 4. At each stride location the classifier predicts whether it is an object center. The object size is also predicted, and an offset correction compensates for the inaccuracy introduced by striding.

In anchor-based detectors, an anchor counts as positive when its overlap with the ground truth is IoU > 0.7; CenterNet instead assigns positives purely by location (one center per object), with no overlap thresholds.

Strides: means the detector moves over the image in 4-pixel steps in the vertical and horizontal directions.
IoU: Intersection over Union (the higher, the better!)
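IoU is the standard overlap measure referenced above; a minimal implementation for axis-aligned (x1, y1, x2, y2) boxes:

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

IoU is 1.0 for identical boxes, 0.0 for disjoint ones, so the > 0.7 threshold demands a fairly tight match.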

11
Q

CenterNet

❗️❗️❗️How the three-term loss is built and what the terms mean

A

Training is guided by a three-term loss:
1. Classification ("focal") loss → focuses training on objects, compensating for the overwhelming amount of background. Penalizes wrong center predictions with a modified log-loss.
2. Size loss → penalizes the discrepancy between the predicted and the true object size (a distance penalty; the paper uses an L1 loss).
3. Offset loss → compensates for the downsampling from stride R = 4: when predictions are scaled up 4×, there is a residual offset from the ground-truth center position. The network actively predicts this offset (the offset between the downscaled ground-truth center and the predicted one).
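The focal term can be sketched as follows (a numpy sketch of the penalty-reduced focal loss with the paper's alpha = 2, beta = 4; the ground-truth heatmap is 1 at centers and Gaussian-decayed around them):

```python
import numpy as np

def center_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-12):
    """Penalty-reduced pixel-wise focal loss on the center heatmap.
    pred, gt: arrays of scores in [0, 1]; gt == 1 exactly at object centers."""
    pos = gt == 1
    # Positive locations: down-weighted when the prediction is already high
    pos_loss = -((1 - pred[pos]) ** alpha * np.log(pred[pos] + eps)).sum()
    # Negative locations: (1 - gt)**beta softens the penalty near true centers
    neg_loss = -(((1 - gt[~pos]) ** beta) * (pred[~pos] ** alpha)
                 * np.log(1 - pred[~pos] + eps)).sum()
    n = max(pos.sum(), 1)  # normalize by the number of objects
    return (pos_loss + neg_loss) / n
```

The `(1 - pred)**alpha` and `pred**alpha` factors are what make the loss "focal": easy, confidently correct pixels contribute almost nothing, so the abundant background cannot dominate training.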

12
Q

CenterNet

❗️❗️❗️How the prediction output is converted back to a bounding box in the full image

A

The box is recovered from the predicted center, size, and offset. Nearby pixels can also get classified as object centers, so an 8-neighbor non-maximum suppression (a center is kept only if it is the maximum of its 3×3 neighborhood) removes overlapping center predictions.
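A sketch of the decoding step (illustrative shapes and names; assumes sizes are predicted in full-image pixels and offsets in output-grid units):

```python
import numpy as np

def decode_centers(heat, wh, offset, R=4, thresh=0.5):
    """Turn network outputs back into full-image boxes.
    heat: H x W center scores; wh: H x W x 2 sizes (full-image pixels);
    offset: H x W x 2 sub-stride corrections; R: output stride.
    A peak is kept only if it is the maximum of its 3x3 (8-neighbor)
    neighborhood -- CenterNet's replacement for box NMS."""
    H, W = heat.shape
    pad = np.pad(heat, 1, constant_values=-np.inf)
    # Max over the 3x3 neighborhood (self + 8 neighbors) of each pixel
    neigh = np.max([pad[dy:dy + H, dx:dx + W]
                    for dy in range(3) for dx in range(3)], axis=0)
    boxes = []
    for y, x in zip(*np.where((heat >= neigh) & (heat > thresh))):
        cx = (x + offset[y, x, 0]) * R   # offset-corrected, scaled back up
        cy = (y + offset[y, x, 1]) * R
        w, h = wh[y, x]
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, heat[y, x]))
    return boxes
```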

13
Q

CenterNet

❗️❗️❗️How to repurpose CenterNet to other tasks, e.g. 3D detection and human body pose estimation by joint positions

A

It can predict more things than only a center point and 2D box:
- 3D box detection → replace/extend the regression heads for the new prediction task (e.g. depth, 3D size, orientation)
- Human body pose estimation → regress a set of joint locations relative to the center
