Final Untouched Variation Flashcards

(280 cards)

1
Q

What is the formal definition of Computer Vision?

A

CV is concerned with the computational processes that allow representations of the viewed environment to be recovered from images

2
Q

What are some overarching low-level problems for Computer Vision?

A
  • Corner Detection
  • Edge Detection
3
Q

What are some overarching medium-level problems for Computer Vision?

A
  • 3D Reconstruction
  • Segmentation
  • Tracking
4
Q

What are some overarching high-level problems for Computer Vision?

A
  • Object Detection
  • Pose Estimation
  • Semantic Segmentation
5
Q

What are some use cases for Computer Vision?

A
  • Healthcare: Diagnosis and prognosis
  • Agriculture: Farming, harvesting
  • Automated driving
  • Games, movies
  • Sports
  • Security
6
Q

What does the term Principal or Optical axis mean in regards to pinhole cameras?

A

The Z-axis is often referred to with these terms, and it assumes that the camera orientation is aligned with the Z-axis.

7
Q

What is the point, where the Z-axis hits the image plane, called?

A

The principal point

8
Q

How is the principal point located?

A

It is located at (0, 0, -f) from origin, where f is the focal length.

9
Q

How does the Focal Length relate to Field of View?

A

As focal length increases, field of view decreases i.e. zooming in on something is an example of this.
As focal length decreases, field of view increases i.e. zooming out on a scene

10
Q

How do you work out the coordinates of the projection on the image plane mathematically?

A

x = fX / Z
y = fY / Z
Where (X, Y, Z) is the point on the object, f is the focal length, and (x, y) is the projection of that point on the image plane.
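A minimal NumPy sketch of this projection (the function name and the example numbers are illustrative, not from the course):

```python
import numpy as np

def project_pinhole(point_3d, f):
    """Project a 3D point (X, Y, Z) onto the image plane of an ideal pinhole camera."""
    X, Y, Z = point_3d
    x = f * X / Z   # horizontal image coordinate
    y = f * Y / Z   # vertical image coordinate
    return np.array([x, y])

# Example: a point 2 m in front of a camera with a 50 mm focal length
print(project_pinhole((0.4, 0.1, 2.0), f=0.05))   # -> [0.01, 0.0025] (metres on the image plane)
```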

11
Q

What happens to parallel lines when they ‘head towards’ the horizon?

A

Parallel lines will eventually appear to converge on a vanishing point

12
Q

What are some problems with pinhole cameras?

A
  • Pinhole size (aperture) must be ‘very small’ to obtain a clear image. However, if pinhole size is made smaller, then less light is received by the image plane.
  • If pinhole is comparable to wavelength of incoming light, then diffraction effects blur the image
13
Q

How do you ensure that the pinhole camera captures the sharpest image possible?

A

Diameter of pinhole ≈ 2√(f × λ), where λ is the wavelength of the light.
Example:
If f = 50mm and λ = 600nm, then diameter ≈ 0.35mm

14
Q

What are some advantages in using the pinhole camera?

A
  • Simple to understand
  • Infinite depth of field
  • No lens distortion
15
Q

What are lenses used for?

A

Lenses are used to avoid the problems of the pinhole camera: they gather more light from the scene and focus it onto the image plane, while retaining the same projection.

16
Q

What does the term ‘f’ stand for in terms of lenses?

A

f = Focal length of the lens, which determines the lens’s ability to bend/refract light

17
Q

How is intensity measured numerically?

A

Intensity = 0 if pixel is black
Intensity = 255 if pixel is white in an 8-bit image

18
Q

What factors affect the colour of a pixel in an image?

A

Light sources:
- Emittance spectrum
- Geometry
- Directional attenuation

Objects’ surface properties:
- Reflectance spectrum
- Geometry
- Absorption

19
Q

What are some typical use cases for Image Feature Representation?

A
  • Image alignment
  • 3D reconstruction
  • Motion tracking
  • Object/face recognition
  • Indexing and database retrieval
20
Q

What are image features?

A

A feature is a measurable property that describes the characteristics of an image or a region of images

21
Q

How are image features often represented?

A

Often represented by scalars, vectors, matrices or tensors.

22
Q

What factors make image matching hard to perform?

A
  • Change in lighting
  • Change in viewpoint
  • Occlusions
  • Partial matching
  • Change over time
23
Q

What are image regions & patches?

A

Image regions & patches are segments or rectangular image patches that are used to collect a wider area of information from an image

24
Q

What is a feature vector?

A

A feature vector is built by computing a set of descriptive features and concatenating their values into a single vector.

25
What is the idea behind using feature vectors?
The idea is to remove redundant or irrelevant data
26
What makes histograms a good representation of colour?
- Invariant to translation and rotation
- Change slowly as viewing direction changes
- Change slowly with object size
- Change slowly with occlusion
27
What are texture features?
Texture features measure the frequency with which patterns of colour/grey levels appear
28
What are gradient-based features?
Gradient-based features are derived from areas of an image where intensity changes sharply (a spike in the intensity gradient), which typically indicate the boundaries of objects.
29
How do you estimate gradients using spatial filtering?
You take your source pixel and the area surrounding it, and place a convolution kernel over that area. You then multiply each pixel value by the corresponding kernel value, and sum the results to form the new value of the source pixel.
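A small sketch of gradient estimation by spatial filtering, using the well-known Sobel kernels (the choice of kernel and the toy image are illustrative assumptions):

```python
import numpy as np
from scipy.ndimage import convolve

# Sobel kernels: one responds to horizontal intensity changes, the other to vertical ones
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

def gradient_magnitude(image):
    """Estimate the intensity gradient at every pixel by spatial filtering."""
    gx = convolve(image, sobel_x)   # response to horizontal changes
    gy = convolve(image, sobel_y)   # response to vertical changes
    return np.hypot(gx, gy)         # combined gradient strength per pixel

image = np.zeros((8, 8)); image[:, 4:] = 255.0   # a vertical step edge
print(gradient_magnitude(image)[4])              # large values around column 4
```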
30
What is the formal definition of noise?
Small random bits of data added or taken away from the true value
31
How does a mean filter remove noise?
Move a kernel across the image and calculate a new pixel value based on the average of its surrounding neighbours.
32
How does a Gaussian filter work?
It works almost like a mean filter, except it adjusts the kernel to use a weighted average. The weighted average is stronger towards the centre of the area.
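A minimal sketch of the mean and Gaussian filters from the two cards above (the kernel size and sigma are illustrative choices):

```python
import numpy as np
from scipy.ndimage import convolve

def mean_filter(image, size=3):
    """Replace each pixel by the unweighted average of its size x size neighbourhood."""
    kernel = np.ones((size, size)) / (size * size)
    return convolve(image, kernel)

def gaussian_smooth(image, size=3, sigma=1.0):
    """Like the mean filter, but with weights that fall off away from the centre."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    kernel /= kernel.sum()                      # weights sum to 1
    return convolve(image, kernel)

noisy = np.full((5, 5), 100.0); noisy[2, 2] = 200.0   # one noisy pixel
print(mean_filter(noisy)[2, 2], gaussian_smooth(noisy)[2, 2])
```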
33
What is the formal way of finding edges?
Edges are found through the use of difference filtering in order to pick out the areas of high contrast.
34
How does edge detection work?
Edge detection works by looking for sharp changes in intensity
35
How does a Histogram of Oriented Gradients work?
- Divide the patch into smaller cells (e.g. 8x8 pixels)
- Define slightly larger blocks covering several cells (e.g. 2x2 cells)
- Compute gradient magnitude and orientation at each pixel
- Compute a local weighted histogram of gradient orientations for each cell, weighting by some function of magnitude
- Concatenate histogram entries to form a HoG vector for each block
- Normalise the vector values by dividing by some function of the vector length
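A simplified sketch of the per-cell step (gradient magnitude/orientation, then a magnitude-weighted orientation histogram); real HOG normalises per block rather than per cell, and the bin count here is just a common choice:

```python
import numpy as np

def cell_hog(cell, n_bins=9):
    """Magnitude-weighted histogram of gradient orientations for one cell (e.g. 8x8 pixels)."""
    gy, gx = np.gradient(cell.astype(float))             # simple finite-difference gradients
    magnitude = np.hypot(gx, gy)
    orientation = np.degrees(np.arctan2(gy, gx)) % 180   # unsigned orientation, 0-180 degrees
    hist, _ = np.histogram(orientation, bins=n_bins, range=(0, 180), weights=magnitude)
    return hist / (np.linalg.norm(hist) + 1e-6)          # normalised (per cell here, for simplicity)

cell = np.tile(np.arange(8, dtype=float), (8, 1))        # a smooth horizontal intensity ramp
print(cell_hog(cell))                                    # energy concentrated in one orientation bin
```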
36
Why is invariance important?
Invariance dictates that similar results should be produced even if the conditions vary, such as scale, translation, rotation and illumination changes
37
How does Scale Invariance work?
- Find points whose surrounding patches (at some scale) are distinctive
- Convolution with a Gaussian mask gives some idea of what is going on around a pixel
- Gaussian masks have a natural scale: their standard deviation
38
What are some key properties of SIFT?
- Fast and Efficient, can run in real time - Can handle: Changes in viewpoint, significant changes in illumination
39
What is Clipping in the sense of Brightness?
Clipping occurs when the pixels are too bright to be correctly recorded in the numeric range available.
40
What is the formal definition of Shutter Speed?
Shutter Speed defines how long the light is allowed onto the film/sensor for
41
How does Shutter Speed, Aperture and Gain relate?
If one goes up, then you can effectively maintain the same brightness level by decreasing the others. However, that does cause other adverse effects e.g. Depth of Field
42
What happens when you increase the Aperture?
A larger aperture means more light, but it also reduces the depth of field
43
What happens when you have a longer shutter speed?
A longer shutter speed means more motion blur
44
What makes a good scientific image for CV?
- Slightly underexpose to prevent clipping, although this introduces more noise
- Centre the subject against a simple background
- Record a calibration target for colour balance
- Optimise other settings for increased image clarity
45
What is the general consensus when collecting ranges of images for scientific analysis?
It is often cheaper and faster in the long run to spend a while making sure the images you capture are captured well and stored correctly.
46
Why is segmentation conducted?
Segmentation is used to assist completion of higher-level tasks, such as recognition, tracking, image database retrieval, feature quantification and registration
47
What is the main challenge with segmentation?
Divide the image into regions/segments, where each region presents a distinguished item. Each region should have similar importance
48
How does Otsu's Thresholding work?
Exhaustively search for the threshold that minimises the intra-class variance and maximises the inter-class variance.
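A minimal sketch of this search (maximising the inter-class variance, which is equivalent to minimising the intra-class variance); the toy bi-modal data is illustrative:

```python
import numpy as np

def otsu_threshold(image, n_levels=256):
    """Exhaustively search for the threshold that maximises the inter-class variance."""
    hist, _ = np.histogram(image, bins=n_levels, range=(0, n_levels))
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, n_levels):
        w0, w1 = prob[:t].sum(), prob[t:].sum()            # class weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0          # class means
        mu1 = (np.arange(t, n_levels) * prob[t:]).sum() / w1
        between = w0 * w1 * (mu0 - mu1) ** 2                # inter-class variance
        if between > best_var:
            best_t, best_var = t, between
    return best_t

image = np.concatenate([np.random.normal(60, 10, 5000), np.random.normal(180, 10, 5000)])
print(otsu_threshold(np.clip(image, 0, 255)))               # roughly midway between the two modes
```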
49
What is the primary disadvantage of using Otsu's Thresholding?
It's not robust enough on its own for most real-world applications.
50
What key assumption does the Otsu thresholding algorithm make about the data?
It assumes the data forms a bi-modal histogram
51
How does Region Growing work for segmentation?
- Start with one or more seed points
- Iteratively check their neighbouring points: if the intensity difference between a neighbour and the region is smaller than a threshold, assign the neighbour to the segmented region
- Stop when there are no new assignments
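A minimal sketch of region growing with a single seed, comparing neighbours to the seed intensity rather than a running region statistic for simplicity:

```python
import numpy as np
from collections import deque

def region_grow(image, seed, threshold=10):
    """Grow a region from a seed pixel, adding 4-connected neighbours whose intensity
    differs from the seed by less than `threshold`."""
    h, w = image.shape
    mask = np.zeros((h, w), dtype=bool)
    queue = deque([seed])
    mask[seed] = True
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and not mask[nr, nc] \
                    and abs(float(image[nr, nc]) - float(image[seed])) < threshold:
                mask[nr, nc] = True
                queue.append((nr, nc))
    return mask

image = np.zeros((6, 6)); image[1:4, 1:4] = 100    # a bright square on a dark background
print(region_grow(image, seed=(2, 2)).astype(int))
```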
52
What are the advantages of Region Growing for segmentation?
- Enables multiple class segmentation - Parameter is easy to adjust
53
What are the disadvantages of Region Growing for Segmentation?
- Local region solution - Computationally expensive - Leakage along weak boundaries - Sensitive to seed points
54
What is a Snake in regards to Segmentation?
It is a spline, which is a series of control points, with some function/s that govern the curve between the points. A snake is only interested in the location of the control points, but not the connection between them.
55
How does Active Contour, using Snake, work for segmentation?
- Represent the object boundary as a parametric curve
- Associate a cost function with the curve, so the boundary is found by optimising the cost function
- The cost function is defined as the sum of three terms
- Iteratively update the contour points using something such as gradient descent
56
What is the mathematical equation that defines a snake in segmentation?
Esnake = (alpha * Einternal) + (beta * Eimage) + (gamma * Econstraint)
Where:
- Einternal = Contour smoothness, point spacing, etc.
- Eimage = Image features e.g. lines, edges
- Econstraint = External constraints such as user-interaction keypoints
57
How is the Snake term, called Einternal, defined?
Einternal = Econt + Ecurve
Where:
- Econt = Continuity i.e. control point distribution along the curve
- Ecurve = Curvature i.e. promote round curves where possible
58
How is the ECont term defined for Snake in segmentation?
Econt = (Davg - ||Pi - P(i-1)||)^2
Where:
- Davg is the average distance between adjacent points along the curve
- Pi is a point on the curve, and P(i-1) is the point before it
59
How is the Ecurve term defined for Snake in segmentation?
Ecurve = ||P(i-1) - 2P(i) + P(i+1)||^2
Where P stands for a point on the curve
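A small sketch that evaluates the two internal-energy terms above for a closed contour (the circle example is illustrative):

```python
import numpy as np

def snake_internal_energy(points):
    """Compute E_cont and E_curve for a closed contour given as an (N, 2) array of points."""
    prev_pts = np.roll(points, 1, axis=0)               # P(i-1)
    next_pts = np.roll(points, -1, axis=0)              # P(i+1)
    dists = np.linalg.norm(points - prev_pts, axis=1)
    d_avg = dists.mean()
    e_cont = (d_avg - dists) ** 2                       # penalises uneven point spacing
    e_curve = np.linalg.norm(prev_pts - 2 * points + next_pts, axis=1) ** 2   # penalises sharp bends
    return e_cont.sum(), e_curve.sum()

theta = np.linspace(0, 2 * np.pi, 20, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # evenly spaced points on a circle
print(snake_internal_energy(circle))                        # both terms are small for a smooth circle
```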
60
What happens if you remove the Davg term from the continuity measure for Snakes in Segmentation?
If you remove it, then the snake will become a closed loop that effectively wraps around the target object
61
How is a closed loop snake used for segmentation?
Closed snakes separate the areas on either side of the Snake. Inside of the shape the Snake creates is the foreground, and outside of the shape is the background.
62
What are some drawbacks of Active Contour?
- Node Distribution - Sharp Corners - Topology Changes
63
How is Explicit Geometry defined?
Explicit Geometry - Parameterised boundaries
64
How is Implicit Geometry defined?
Implicit Geometry - Boundaries given by zero level set
65
What are some key properties of the Chan-Vese model?
- No parameterisation required - Less sensitive to the contour initialisation and noise - Computationally efficient - Topological changes can be handled implicitly - Based on regional statistics rather than boundary information
66
How does Graph Construction/Cut work in segmentation?
- Consider the image as a graph with vertices (pixels), edges between neighbouring pixels, and costs assigned to the edges
- Find the optimal cut which produces the minimum cost
67
What two terms make up an energy function in a graph cut problem in segmentation?
Data term (Unary) - A function derived from the observed data that measures the cost of assigning a label to pixel p
Smoothness term (Pairwise) - Measures the cost of assigning labels to adjacent pixels p and q; it is used to impose spatial smoothness
68
What does the smoothness term measure in graph cut problems in segmentation?
- Check all pairs of neighbour pixels - Penalise adjacent pixels with different labels - Function penalises a lot for discontinuities between pixels of similar intensities - However, if pixels are very different, then the penalty is small
69
What are the key steps of feature-based method using clustering?
- Represent the characteristics of the local region for each pixel e.g. intensity, filtering, SIFT, HOG, etc... - Define similarity function e.g. Euclidean, Cosine, Manhattan, etc... - Region partition is handled as pixel classification problem using clustering analysis, such as K-Means.
70
What is the general algorithm for K-Means clustering?
- Randomly select K points as the initial centroids
- Repeat:
  - Assign each point to its closest centroid
  - Re-compute the centroid of each cluster
- Until the centroids/clusters do not change
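A minimal NumPy sketch of this loop (no empty-cluster handling, and the toy data is illustrative):

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Plain K-means clustering on an (N, D) array of feature vectors."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]   # random initial centroids
    for _ in range(n_iters):
        # Assign each point to its closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-compute centroids; stop when they no longer change
        new_centroids = np.array([points[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

points = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(points, k=2)
print(centroids)   # roughly (0, 0) and (5, 5)
```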
71
What are some advantages of using K-Means algorithm?
- Simple and efficient
72
What are some disadvantages of using K-Means algorithm?
- Need to specify the number of clusters - Sensitive to outliers - Solution dependent on the initialisation
73
What are the key steps when using Feature-based supervised learning methods?
- Represent the characteristics of the local region for each pixel e.g. intensity, filtering, HOG, etc... - Region partition is handled as pixel classification problem using supervised machine learning methods such as SVMs, Random Forest, etc... - Requires a training process with ground truth labels
74
What is a key drawback of using Feature-based Supervised Learning Methods?
It is application dependent.
75
Why is image registration useful?
- Information fusion - Information comparison - Transformation estimation - Statistical modelling and analysis based on large sets of aligned images
76
What are some sample applications of Image Registration?
- Medical e.g. pre and post treatment comparison - Remote sensing e.g. road map, satellite map - Augmented Reality e.g. aligning 3D virtual model to 2D images
77
What is the main aim for Image Registration?
To transform a source image to match with a target image
78
What are some key elements used for Image Registration?
- Geometric Transformations e.g. rigid, affine or deformable - Similarity Measurement e.g. point correspondence, intensity based - Parameter optimisation e.g. closed form solution, gradient descent
79
What is the general form for Geometric Transformation, using either Euclidean or Affine?
2D Transformations:
- Euclidean: 2 translations, 1 rotation
- Affine: 2 translations, 1 rotation, 2 scale
3D Transformations:
- Euclidean: 3 translations, 3 rotations
- Affine: 3 translations, 3 rotations, 3 scale
In both cases the affine transformation starts from the same values as the Euclidean one: 2D = 2 translations, 1 rotation; 3D = 3 translations, 3 rotations.
80
What makes image registration problems easier in regards to point-based methods?
If correspondence points can be determined either manually or automatically, then the image registration problem becomes easier
81
How does the Iterative Closest Point Algorithm work?
- For each point in the source point cloud, match the closest point in the target point cloud (e.g. by Euclidean distance)
- Estimate the transformation using a point-to-point distance metric (e.g. a root-mean-square minimisation technique) that best aligns each source point to the match found in the previous step
- Transform the source points using the obtained transformation
- Iterate the previous three steps until the transformation parameters remain unchanged
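A sketch of a single ICP iteration for 2D point clouds, using an SVD-based least-squares rigid alignment of the matched pairs (the data and shift are illustrative):

```python
import numpy as np

def icp_step(source, target):
    """One ICP iteration: match closest points, then find the rigid transform (R, t)
    that best aligns the matched pairs in a least-squares sense."""
    # 1. For each source point, find its closest target point (Euclidean distance)
    dists = np.linalg.norm(source[:, None, :] - target[None, :, :], axis=2)
    matches = target[dists.argmin(axis=1)]
    # 2. Least-squares rigid alignment of the matched pairs (via SVD of the covariance)
    src_c, tgt_c = source.mean(axis=0), matches.mean(axis=0)
    H = (source - src_c).T @ (matches - tgt_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:               # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = tgt_c - R @ src_c
    # 3. Transform the source points; full ICP repeats this until convergence
    return source @ R.T + t, R, t

rng = np.random.default_rng(0)
target = rng.random((30, 2))
source = target + np.array([0.05, -0.02])   # the same cloud, slightly shifted
aligned, R, t = icp_step(source, target)
print(np.round(t, 2))                        # approximately [-0.05, 0.02], undoing the shift
```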
82
What is the fundamental problem with trying to recover the 3D structure of the scene from points matched between two images?
Fundamental ambiguity: any scene point on the ray OP (from the optical centre O through the image point P) projects to the same image location P, so depth cannot be recovered from a single image.
83
What is Stereo Correspondence?
Find matching pixels/features in 2 or more images and convert their 2D positions into 3D depths
84
What can resolve the fundamental ambiguity in stereo correspondence?
A second camera can resolve the ambiguity enabling measurement via triangulation
85
How do you achieve depth recovery using two cameras?
You use triangulation, which requires: - Knowledge of absolute and relative camera geometry i.e. Calibration - Point correspondence i.e. which rays to intersect
86
What are the properties of camera calibration?
It recovers the intrinsic parameters of the cameras e.g. focal length, pixel size, principal point, lens distortion Relative poses between cameras, also called extrinsic parameters, are also factored in e.g. rotation, translation, scale that transforms left image on to right
87
What is the easiest way to perform camera calibration?
Simplest approach is to use a known calibration target object
88
What additional geometric distortions are present within lenses used in cameras?
- Decentering errors: Displacement of the lens centre from optical axes - Radial distortion: Variations in light refractions, mostly in wide angle lenses
89
What is the image warping parameter?
The image warping parameter is estimated to warp the ideal projected coordinates to the observed distorted coordinates; the vector k contains the warping (distortion) parameters.
90
What is the equation for image warping in stereo correspondence?
x' = warp(x, k)
Where:
- x = Ideal image (no distortion)
- x' = Observed image with distortion
91
How are points generally defined in 3D space in stereo correspondence?
Points in 3D space are expressed in terms of a different coordinate frame known as the world coordinate frame. The relation between the coordinates of a point P in the camera and world coordinate systems is:
Xcam = R(Xw - c)
Where:
- c = 3x1 vector representing the coordinates of the camera centre in the world coordinate system
- R = 3x3 matrix representing the orientation of the camera
92
What is the purpose of camera calibration in a mathematical sense?
It is to calculate the intrinsic, extrinsic and distortion parameters.
93
What is epipolar geometry?
Given the two optical centres and a point in one image, we can compute the epipolar plane and so the corresponding epipolar line in the other image
94
Why is epipolar geometry important for camera calibration?
Given two calibrated cameras, it's possible to retrieve the actual 3D coordinate of a corner in the image
95
How can correspondence be used to measure depth of an object in an image?
Correspondence allows measurement of disparity: The difference in the image coordinates of the projections of a given world point into each camera. Depth is inversely proportional to disparity.
96
How does correspondence search work?
- Find a window in the original image - Slide it along the right scanline and compare the content of that window with that of the reference window in the original image
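A minimal sketch of this window-based search for a single pixel, using a sum-of-squared-differences matching cost (window size, disparity range and the toy images are assumptions):

```python
import numpy as np

def disparity_for_pixel(left, right, row, col, window=5, max_disp=16):
    """Slide a window along the same scanline of the right image and pick the shift
    (disparity) whose content best matches the reference window in the left image."""
    half = window // 2
    ref = left[row - half:row + half + 1, col - half:col + half + 1]
    best_d, best_cost = 0, np.inf
    for d in range(max_disp):
        c = col - d                                   # candidate column in the right image
        if c - half < 0:
            break
        cand = right[row - half:row + half + 1, c - half:c + half + 1]
        cost = np.sum((ref.astype(float) - cand.astype(float)) ** 2)   # SSD matching cost
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d

left = np.zeros((20, 40)); left[8:12, 20:24] = 255       # a small bright block
right = np.zeros((20, 40)); right[8:12, 15:19] = 255      # same block, shifted left by 5 pixels
print(disparity_for_pixel(left, right, row=10, col=21))   # -> 5
```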
97
What is the effect of a window size in correspondence search?
- Larger window size: Smooth disparity maps but less detail captured - Smaller window size: More detail, but also more noise captured
98
What are some problems with correlation-based stereo?
- Window size is fixed across the image, but viewed objects differ in size and depth - Uniform regions always match - Can provide a dense disparity map, but values are only reliable where there is some local variation in intensity e.g. near edges - Dense disparity is computationally expensive in spatial domain
99
How do you get ground truth data?
- Alternative/competing sensors - Artificial images - Real images
100
What are some problems that can be encountered when gathering ground truth data?
- Automatic methods can have errors - Manual methods are slow, subjective and also error prone - What if standard sets don't have the properties you are attempting to evaluate your images on?
101
What is True Positive defined as?
True Positive - The algorithm makes a correct prediction about the presence of an object in an image
102
What is False Positive defined as?
The algorithm predicts the presence of an object but that object is not present in the image
103
What is False Negative defined as?
The algorithm misses an object
104
What is Precision's equation, and how is it defined?
Precision = TP / (TP + FP)
Fraction of responses that were correct
105
What is Recall's equation, and how is it defined?
Recall = TP / (TP + FN)
Fraction of the ground-truth objects that were correctly identified
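A tiny worked example of the two formulas (the counts are made up):

```python
def precision_recall(tp, fp, fn):
    """Precision: fraction of detections that were correct.
    Recall: fraction of ground-truth objects that were found."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Example: a detector fires 10 times, 8 of them on real objects, and it misses 4 objects
print(precision_recall(tp=8, fp=2, fn=4))   # -> (0.8, 0.666...)
```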
106
How do you evaluate recognition?
Using Precision-Recall curves as a visualisation tool
107
What are some properties of Precision-Recall curves?
- Plot of precision against recall as some parameter is varied - Parameter is the threshold used to decide if the model and image are similar enough to be considered equal - Increasing threshold imposes a tighter requirement on matching, which reduces False Positives, but increases False Negatives
108
How would you measure accuracy in a classification problem?
For classification, you can use a confusion matrix, which shows what category images are confused with others
109
How is ground truth defined?
Ground truth is a set of manually-drawn bounding boxes on an image
110
What are two key elements of measurement and recognition?
- Accuracy - Robustness
111
What does a Precision Plot measure?
A Precision Plot measures the percentage of frames whose estimated location is within a given threshold distance of the ground truth.
112
When using algorithms that learn, what properties must it fulfill?
- Must be representative of the data - Must not be too specific - Must not use training data in the evaluation of performance
113
What does the term 'confining search leftward' mean in terms of Disparity Maps?
When you know which image is the left image and which is the right image, you can make safe assumptions, such as that a point x in the left image could never appear further to the right in the right image, thereby narrowing the search space.
114
Question
Answer
115
What are the 4 main Recognition problems?
- Recognition: Identify the main object in an image
- Detection: Find the location of all objects
- Segmentation: Assign all pixels to objects
- Pose: Find the location of the object parts
116
What is the formal definition of the 'Detection' problem in Recognition problems?
Find the location of all objects in the scenes in terms of providing a bounding box
117
What is Semantic Image Segmentation?
It's the process of partitioning the image into 'meaningful' segments. You group pixels based on 'common' properties.
118
What is Instance Image Segmentation used for?
If you need to differentiate different instances of the same object
119
What three main features are required for a good Object Recognition Model?
- Data: Images containing objects from that class and images from all other classes - Feature Extraction: Work with features extracted from images - Machine Learning: From the features extracted, initiate and train a model that recognises this particular object class
120
How do HOG features work?
- Divide image into a grid of cells e.g. 8x8 - Compute edges and their orientation for every pixel location - Compute histogram of gradient orientations in each cell
121
What is Bag of Features?
Bag of Features methods analyse the large set of very specific features generated by a training set of images and identify a small set of useful, more generic features
122
How does Object Recognition work with Bag of Features?
- Take a bunch of images: Extract features, build up a 'dictionary' of common features - Then, given a new image, extract features: -- For each feature, find the closest visual word in the dictionary -- Build a histogram to represent the image
123
How does Viola-Jones Recognition work in practice?
- Slide a window across the image and evaluate a face model at every location
124
What are the key ideas that were pulled from Viola-Jones Recognition?
- Integral images for fast feature evaluation - Boosting for feature selection - Attentional cascade for fast rejection of non-face windows
125
How does the Integral Image work?
The integral image computes a value at each pixel that is the sum of the pixel values above and to the left of the source pixel inclusive.
126
How is Feature Extraction performed using Integral Images?
Features are extracted from sub-windows of a sample window, and each of the four feature types are scaled and shifted across all possible combinations.
127
What is the formal definition of Boosting?
Boosting is a classification scheme that works by combining weak learners into a more accurate ensemble classifier, where a weak learner is defined as a learner that does only slightly better than random chance.
128
How does Boosting work?
- Need a training set of labelled examples
- Start with all examples equally weighted
- Learn a series of recognition rules
- Re-weight examples so that an incorrect recognition by the nth classifier makes that example more important to the (n+1)th
- No single rule/classifier can separate complex objects from complex backgrounds, but a combination can
- Weights are determined automatically
129
What is the Classical Approach in Computer Vision?
Apply learned operations to user-defined features: - Design/choose features - Design/choose a classifier - Train the classifier
130
How does Bag of Words reduce reliance on the user?
It clusters the results of applying the user-defined set of feature-detection operators to form a more generic visual vocabulary.
131
What are the differences between MLPs and CNNs?
MLP:
- Wide application scenario, not just images
- Neurons are fully connected, so it can't scale well to large-size data such as images
CNN:
- Neurons are arranged in '3D'; each neuron is only connected to a small region of the previous layer
- Typical CNN structure: Input - Convolution Layer - Activation Layer - Pooling Layer - Fully Connected Layer - Output Layer
132
What are the four parameters in a Convolution Layer?
- Input size (W) - Filter Size (F) - Zero Padding (P) - Stride (S)
133
What is the equation used to calculate the output volume size from a Convolution Layer?
(W - F + 2P) / S + 1
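A one-function sketch of this formula with an illustrative layer configuration:

```python
def conv_output_size(w, f, p, s):
    """Output spatial size of a convolution layer: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1

# Example: a 224x224 input, 7x7 filters, padding 3, stride 2 (a common early-layer setup)
print(conv_output_size(w=224, f=7, p=3, s=2))   # -> 112
```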
134
What is the primary aim of a Convolution Layer?
Convolution layers are filter banks performing convolutions with the learned kernels/masks.
135
What is the main purpose of a Pooling layer?
A pooling layer pools the outputs of the convolution layers over local neighbourhoods, which reduces the resolution of the filter outputs. Subsequent convolutional layers therefore access larger areas of the image.
136
What is the effect of the Pooling Layer?
It reduces the spatial size of the representation and reduces the amount of parameters, effectively down-sampling the input to increase the receptive field size.
137
What are the four types of activation functions?
- Sigmoid - Tanh - ReLU - Leaky ReLU
138
What are the properties of Sigmoid?
- Range of values is [0, 1] - Saturates and kills gradients, with very small gradients near the extremes (0 and 1)
139
What are the properties of Tanh?
- Has a zero-centred range of [-1, 1] - Saturates and kills gradients, with very small gradients near the extremes (-1 and 1)
140
What are the properties of ReLU?
- Solves vanishing/exploding gradient problems - Simple to calculate - Some neurons can be 'dead' with negative input
141
What are the properties of Leaky ReLU?
- Overcomes the dying neuron problem - Performance is not consistent
142
What is the function of the Fully Connected Layer with Softmax?
Fully Connected Layer is fully connected to all activations in the previous layer The Softmax function converts the prediction range to the range of [0, 1] for each class
143
How does a Softmax function work?
A softmax function takes a set of numbers as input and outputs a probability distribution over a set of k classes, giving a probability for each class in the range [0, 1].
144
What is a key property of the Softmax function?
The probabilities produced all sum to 1 across the classes. Example: Class 1 = 0.4, Class 2 and 3 = 0.3 Therefore 0.4 + 0.3 + 0.3 = 1
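A minimal NumPy softmax sketch (the max-subtraction for numerical stability is a standard trick, not something stated in the card):

```python
import numpy as np

def softmax(logits):
    """Convert raw class scores into probabilities in [0, 1] that sum to 1."""
    shifted = logits - np.max(logits)        # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # e.g. ~[0.66, 0.24, 0.10], summing to 1
```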
145
What is the typical architecture for CNNs in Classification problems?
Typical block: - Normalisation Layer - Filter Bank - Non-linear - Feature Pooling After multiple blocks of the above: - Classifier
146
What makes a CNN effective?
- Increased depth - Smaller filters at the lower levels - More convolutional than fully connected layers
147
What is a hierarchical model in terms of CNNs?
A hierarchical model is a type of CNN which learns details at the pixel level, and then with each block 'zooms out' and finds less 'zoomed in' details. Example: First stage looks at pixel-level details, then the second stage looks at edges. Third stage then looks at object parts i.e. combination of edges, before the final stage looks at object models
148
What is the concept of Transfer Learning?
Transfer Learning is a method that leverages a pre-trained network for new tasks without needing vast amounts of new training data.
149
How does Transfer Learning work?
- A network is initially trained on a large, general-purpose dataset like ImageNet - Then, if the specific dataset is too small, you freeze the pre-trained network's weights and only re-train the classifier layer - For a medium sized dataset, you start with the pre-trained weights, then re-train the whole network or just the higher layers using a reduced learning rate.
150
What are some properties of the ImageNet dataset?
- 1.2 million high-resolution images - 1000 different classes
151
What are some historical examples of Classification CNNs?
- AlexNet - VGG - Residual Network
152
What are some historical examples of Segmentation CNNs?
- U-Net - SegNet
153
What are some historical examples of Object Detection CNNs?
- RCNN - Fast RCNN - Mask RCNN - YOLO - nnU-Net
154
How does Eye Tracking get used effectively in Computer Vision?
Deep learning performs better when regions are marked before the learning process. As such, eye tracking is a cheap and easy way to record where someone has been looking. Once this data is filtered, it can be passed to a suitable deep learning system.
155
What can help to 'pad-out' a dataset if it isn't suitably large enough for training?
Creating and using synthetic data, as it can be used to augment the original dataset with 'new' data
156
What are some factors to consider when rendering synthetic images?
- Accuracy of material - Lighting - Background - How to create annotations
157
What is Saliency in the context of Computer Vision?
Saliency networks try and predict this map of interesting areas in the image, which aims to simulate human attention.
158
What are the definitions for local and global context?
Local - Detailed neighbourhood attention Global - Where the object sits in the scene
159
How does an unsupervised GAN work?
The aim is for the generator to fool the discriminator network, so that it can't tell the difference between real and fake.
160
What are some problems with GANs?
They are tough to train:
- If the discriminator behaves badly, then the generator does not get accurate feedback and the loss function cannot represent reality
- If the discriminator does a great job, then the gradient of the loss function drops close to 0 and learning becomes very slow or even stalls
Mode collapse:
- During training, the generator may collapse to a setting where it produces similar outputs with low variety
161
How do you train a diffusion model?
Training a diffusion model is done by a forward process that destroys the image data over a number of timesteps Afterwards, a neural network learns to perform a reverse process that allows us to re-create the image by removing noise
162
How can you generate new samples when training a diffusion model?
At inference time, new random noise can be used to generate new samples from the learned distribution
163
How does the Forward Process in Diffusion models work?
- Noise is added in a Markov Chain - Each step's noise is only dependent on the previous step - This is a fixed process, containing no learned parameters.
164
What is the Schedule in terms of Diffusion Models?
The schedule is part of the model that determines the rate at which noise is added to the image. Smaller steps are taken when there is little noise, and larger steps are taken when the image is mostly noise
165
How do you perform the Reverse step in a Diffusion Model?
You use a U-net style model that learns the reverse diffusion process. This is done by predicting the noise in the image given the current timestep. By removing the predicted noise from the image, we get an approximate of the original image.
166
What are the main components of the Training Loop in Diffusion Models?
Loop until converged:
- Get the next sample image from the dataset
- Get a random timestep t between 0 and T
- Using the forward process, add noise to produce the noisy image at timestep t, recording the noise that was used
- Predict the added noise using the denoising UNet
- Apply an MSE loss between the predicted and ground-truth noise
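A toy NumPy sketch of the closed-form forward (noising) step and the MSE training target used in this loop; the linear beta schedule, array shapes and the zero "prediction" stand-in are assumptions, and no real network is involved:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)              # assumed linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)            # cumulative product used in the closed form

def forward_noise(x0, t, rng):
    """Closed-form forward process: jump straight from x0 to the noisy x_t."""
    eps = rng.standard_normal(x0.shape)         # the ground-truth noise the network must predict
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((32, 32))              # stand-in for a training image
t = rng.integers(0, T)                          # random timestep, as in the training loop
xt, eps = forward_noise(x0, t, rng)
eps_pred = np.zeros_like(eps)                   # placeholder for the denoising UNet's prediction
loss = np.mean((eps_pred - eps) ** 2)           # MSE between predicted and ground-truth noise
print(t, loss)
```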
167
What are the main components of the Inference Loop in Diffusion Models?
- Sample xT, i.e. random Gaussian noise
- For all timesteps t from T down to 0:
  - Predict the noise in the current image
  - Remove the predicted noise to get an approximation of x0, i.e. the original image
  - Add noise back to the image to get x(t-1)
168
What are some comparison points between Diffusion Models and GANs/VAEs?
- Diffusion models can now generate higher quality images than either GANs or VAE models - Much easier to train and more consistent compared to GANs
169
What are some ethical concerns in using Generative AI?
- Data usage - Misinformation - Privacy & Consent - Impact on the creative industry - Impact on wider society
170
How do text prompts work in the context of Diffusion Models?
- Text prompts rely on learning a mapping between text and image embeddings - A point in that embedding space can be used to condition a diffusion model to produce semantically relevant images - A model called a prior learns to convert text embeddings to image embedding, which can then condition the diffusion model
171
What does Spatial Conditioning do to prompting?
It injects additional spatial conditioning into the denoising UNet, which is used alongside the text prompt to guide the generated image.
172
What is one method that can make Diffusion Models more efficient?
You can perform the diffusion process in a latent space, similar to a Variational AutoEncoder (VAE), where the diffusion process takes place in the lower dimensional space.
173
What are some challenges with Video Diffusion models?
- Temporal Consistency: -- Frame to Frame consistency -- Object permanence - Realism: -- Interactions between objects -- Physics and fluid dynamics - Technical Challenges: -- Dataset availability -- Compute cost
174
What is Tracking in Computer Vision?
Tracking involves following a specific object through a video
175
What are some issues with pixel-level motion tracking solutions?
- They do not make predictions for new locations - They do not estimate velocities of objects/targets in the image - They do not handle uncertainty of position, or challenges like occluding objects
176
What are Flow Fields used for?
It is used in tracking, and can be used to infer motion, stabilise images, stabilise a device or help with frame interpolation
177
What are the two types of approaches for Motion Detection?
- Motion Difference - Background Subtraction
178
How does Motion Difference work in Motion Detection?
- Take two images from a sequence - Compute the change in brightness at each pixel in the image - Threshold and filter for noise
179
How does Background Subtraction work in Motion Detection?
- Capture an image of the background - Use the difference between the current frame and the background to find moving objects
180
What is the best way to take background images for Background Subtraction models in Motion Detection?
Take several at different points in time, so that the average of a given pixel's value over some time period is better
181
How do you work out the average value for a background pixel in Motion Detection?
Averaging often becomes a Gaussian model, where you compute the deviation as well as the mean: given a new pixel value and the Gaussian model, you can estimate how likely that value is to belong to the background.
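A minimal sketch of a per-pixel running Gaussian background model (the learning rate and the 2.5-sigma foreground test are illustrative choices):

```python
import numpy as np

def update_background(mean, var, frame, alpha=0.05):
    """Running per-pixel Gaussian background model: flag pixels far from the model as
    foreground, then slowly adapt the mean and variance towards the new frame."""
    foreground = np.abs(frame - mean) > 2.5 * np.sqrt(var)      # e.g. a 2.5-sigma test
    mean = (1 - alpha) * mean + alpha * frame                   # adapt the mean
    var = (1 - alpha) * var + alpha * (frame - mean) ** 2       # adapt the variance
    return mean, var, foreground

frames = [np.full((4, 4), 100.0) + np.random.randn(4, 4) for _ in range(20)]
mean, var = frames[0].copy(), np.full((4, 4), 25.0)
for f in frames:
    mean, var, fg = update_background(mean, var, f)             # learn the static background
moving = frames[-1].copy(); moving[1, 1] = 200.0                # a "moving object" pixel
_, _, fg = update_background(mean, var, moving)
print(fg.astype(int))                                           # only the changed pixel is flagged
```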
182
What is an example algorithm that uses confidence to predict the movement of an object in Tracking, and how does it work?
Kalman Filter - Handles certainty of measurements and prediction - Has a motion model - Assumes we have a unimodal Gaussian i.e. can only predict one location
183
What is a short description of a Particle Filter?
A particle filter uses a set of samples to approximate the target probability density.
184
How does a Particle Filter work?
A Particle Filter treats each sample as a particle, which is a hypothesis with some probability of being correct
185
What is the plain English description of the theory behind a Particle Filter?
- Start with many guesses about where the object could be
- Predict where each guess thinks the object will be next
- Compare those guesses to what you actually see in the new image, i.e. the measurement
- Update the score of each guess based on how well it matches the real observation
- Pick new guesses based on the best scores, and repeat
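A toy 1D sketch of exactly this loop (the constant-velocity motion model, Gaussian weighting and noise levels are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n_particles = 500
true_pos = 0.0
particles = rng.uniform(-10, 10, n_particles)     # initial guesses about the object's position
weights = np.full(n_particles, 1.0 / n_particles)

for step in range(20):
    true_pos += 1.0                                          # the object moves right each frame
    particles += 1.0 + rng.normal(0, 0.5, n_particles)       # predict with a motion model + noise
    measurement = true_pos + rng.normal(0, 1.0)              # noisy observation from the image
    weights = np.exp(-0.5 * (particles - measurement) ** 2)  # score guesses against the observation
    weights /= weights.sum()
    # Resample: keep many copies of the good guesses, drop the poor ones
    idx = rng.choice(n_particles, n_particles, p=weights)
    particles = particles[idx]

print(round(true_pos, 1), round(particles.mean(), 1))        # the estimate tracks the true position
```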
186
What is the basic algorithm for CONDENSATION in Tracking?
- Select a particle - Project forward in time - Add noise - Compute weight from measurement
187
What are some features of CONDENSATION in Tracking?
- The hypotheses can have any shape of probability distribution
- The observations are used to update the probabilities of the hypotheses via an appearance model
- Many hypotheses are considered at once, which increases the chances of finding the target
- Works in substantial clutter at or close to video frame rate
- Can use different numbers of particles as a trade-off between running very fast and having higher accuracy and greater robustness
188
What is an Appearance model in CONDENSATION?
An appearance model is like a predicted silhouette of what the object will look like. A measurement is then derived from it, for example: the weight is 1 if the silhouette lies over a red pixel patch, and tends to 0 as it drifts away from this.
189
What is Contour Tracking?
Contour Tracking establishes a state that contains a set of points representing a contour, so each particle represents a possible contour This state has higher dimensions, which are called the parameters of the contour. The mean state is usually drawn to summarise results
190
What is a mixed-state CONDENSATION model?
It saves multiple transition probabilities of an object switching behaviour, and then builds multiple models using this data into a single CONDENSATION tracker.
191
What is the step-by-step process for a Mixed-State CONDENSATION tracker?
- Select a particle, which includes a gamma(t) - Randomly generate gamma(t + 1) in line with the transition probability table - Project particle forward with motion model associated with gamma(t + 1) - Add noise - Compute pi(t)
192
What happens to the particles in a Mixed-State CONDENSATION tracker?
- Mixed-state CONDENSATION produces a mix of particles, depending on the transition probabilities. - Different particles contain and follow different motion models - Clouds of particles can form
193
What is an issue with using transition tables for transition probability predictions in tracking?
It was hard-coded, and as such, is difficult to do and is inflexible
194
How does the number of particles affect the performance of particle filter methods?
Lower number: - Fast, but coarse - May have unsampled areas Higher Number: - Slower - More potential for representing state space
195
What affects the use of CONDENSATION models/trackers?
The Curse of Dimensionality
196
What is MCMC tracking, and what are some key properties?
Markov Chain Monte Carlo:
- Allows us to explore the space of possible tracker/location states more efficiently
- Don't apply the motion model to everything inside the state
- Improve a particle by selecting one target inside it, moving only that target, and seeing if the result is better or worse than before
- Only add improved samples to the new particle set
- Effectively a random walk looking for improved target configurations
197
How does the MCMC tracking work?
Generate a new set of samples using the Metropolis-Hastings algorithm:
- Propose a new state
- Evaluate the acceptance ratio
- If the acceptance ratio is greater than 1, accept the state and update the target we considered
- Otherwise, accept it with probability equal to the acceptance ratio (i.e. if a random draw falls below the ratio)
- Otherwise, keep the current state and reject the proposed state
198
What are some key features of MCMC tracking?
- More efficient sampling, as only optimising one target per iteration, so good with high-dimensional joint states - Quality of samples produced tends to increase with each iteration - Can produce an estimate by taking an average position across the samples for each target
199
What is coalescence in tracking?
It's the result of when similar targets interact. When they do, there's a chance that their trackers can become 'distracted' and all track a single entity.
200
How do you handle interactions and attempt to offset the problem of coalescence?
You need to lower the weight of samples that predict a position too close to other targets This assumes that targets cannot occupy the same spot on the image
201
What is the mathematical representation of the solution attempting to deal with coalescence?
- Construct a graph of targets close enough to interact - Use an interaction function to penalise predictions close to other targets, such as exponential fall-off based on distance
202
What is the concept of sharing motion information?
It's like mixed-state condensation, but pairs of targets that are moving similarly can use each other's velocity estimates, which helps to recover from occlusion.
203
Explain the term Epipolar Plane
It's a combination of the image feature and two optical centres, all of which define a plane of interpretation. The world feature generating the known image feature and the corresponding feature in the other view must lie in this plane.
204
Why does the epipolar plane simplify the stereo correspondence problem?
- The intersection of the plane of interpretation and the second image plane is a straight line (called an epipolar line) The epipolar plane determines the epipolar lines. Given a feature extracted from one view, the corresponding feature must lie on the corresponding epipolar line in the other. - Search space is reduced from two dimensions to one
205
How is the Integral Image calculated?
- Computes a value at each pixel that is the sum of the pixel values above and to the left of the source pixel (inclusive)
- Cumulative row sum: s(x, y) = s(x-1, y) + i(x, y)
- Integral image: ii(x, y) = ii(x, y-1) + s(x, y)
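A minimal sketch of building an integral image and using the 4-lookup rectangle sum:

```python
import numpy as np

def integral_image(image):
    """Each entry is the sum of all pixel values above and to the left (inclusive)."""
    return image.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of pixel values in the rectangle from (r0, c0) to (r1, c1) inclusive, using 4 lookups."""
    total = ii[r1, c1]
    if r0 > 0: total -= ii[r0 - 1, c1]
    if c0 > 0: total -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0: total += ii[r0 - 1, c0 - 1]
    return total

image = np.arange(16).reshape(4, 4)
ii = integral_image(image)
print(rect_sum(ii, 1, 1, 2, 2), image[1:3, 1:3].sum())   # both give 30 (5 + 6 + 9 + 10)
```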
206
What benefit does using Integral images bring?
- Advantages of speed i.e. more efficient calculations. Only 4 numbers are needed to calculate the sum of intensity values for each rectangle
207
What are advantages of a pinhole camera?
- Simple to understand - No lens distortion due to lack of lens - Infinite depth of field: No depth of field effect distorting the image
208
What is a one line description for segmentation?
Assign all pixels to objects
209
What is a one line description for Recognition?
Identify the main object in the image
210
What is a one line description for Detection?
Find the location of all objects
211
What is a one line description for Pose?
Find all of the object parts
212
What are characteristics of a good feature?
- Invariant to scale and rotation - Reflect useful object properties - Unique and repeatable
213
What are example types of object recognition tasks?
- Segmentation - Pose Estimation - Detection
214
Why are Diffusion Models preferred over GANs?
- Easier to train - Generate high-quality images - Avoid GAN's issues such as mode collapse or tough to train due to discriminator balance
215
What are important concepts in Background Subtraction?
- Gaussian Mixture Models - Thresholding
216
What features are commonly used in Classical Computer Vision?
- HOG descriptors - Colour histograms - Texture Features
217
Describe the structure of a pinhole camera
- The pinhole camera's optical point/centre is located on the principal axis / Z axis - The 3D scene is projected onto the image plane through the pinhole located at the optical point - The focal length narrows or widens the field of view by increasing or decreasing in length respectively.
218
Explain the forward process in a diffusion model
- Noise is added in a Markov Chain - Each step’s noise is only dependent on the previous step - This is a fixed process, containing no learned parameters.
219
Explain the reverse process in a Diffusion model
- Train and use a U-net style model that learns the reverse diffusion process - This is trained by predicting the noise in the image given the current timestep - By removing predicted noise from the image, we get an approximation of the original image
220
Describe the process of object detection for classification
- Features are extracted using feature extraction methods - They are then used to train a classifier model, which identifies regions in test images likely to contain objects and classifies them accordingly
221
Explain how a particle filter updates object tracking
- Each particle represents a hypothesis - Predictions are made based on a motion model - Measured positions are then used to update the particle weights based on how well they match the observation
222
Describe the architecture of a U-net model?
- U-net uses an encoder to compress features - Followed by a decoder to reconstruct segmentation maps - Skip connections between matching levels of encoder and decoder to preserve spatial information
223
Explain both semantic and instance segmentation
- Semantic labels every pixel by class - Instance distinguishes individual objects of the same class
224
What are the challenges of using a single Gaussian Background Model?
- Cannot model dynamic backgrounds - Sensitive to lighting changes - Cannot represent multiple background modes at a pixel
225
What's an advantage and disadvantage of using a Kalman Filter over a Particle Filter?
Advantage - Kalman filters require less computation
Disadvantage - Kalman filters assume a unimodal Gaussian, whereas particle filters can handle multi-modal distributions
226
Why is SIFT useful?
- Invariant to scale, rotation and illumination - It's robust in regards to matching across different views
227
What is the importance of illumination modelling?
Helps accurately interpret scene radiance, ensuring features and colours are not misinterpreted due to lighting changes
228
What factors affect the quality of images captured in digital photography?
- Shutter speed - Aperture - ISO Setting
229
What are valid applications of saliency prediction in Computer Vision?
- Smart image cropping - Object Detection - Content-aware compression
230
In Stereo vision, what helps estimate depth from two images?
- Epipolar geometry: Contains correspondence search - Disparity maps: Allow depth calculation
231
What are components of a Kalman filter used in tracking?
- Motion model - Measurement Update - State Prediction
232
Explain how ISO, Shutter Speed and Aperture interact to control image brightness?
- ISO controls sensor sensitivity - Aperture determines how much light enters the lens - Shutter Speed determines how long the light is exposed for Increasing one typically requires reducing another to maintain consistent exposure
233
Describe how disparity maps are used in Stereo Vision to infer depth
Disparity maps represent pixel shifts between left and right images. Depth is inversely proportional to disparity. The larger the disparity, the closer the object is to the camera that took the image.
234
How does a Kalman Filter handle uncertainty during object tracking?
Kalman filters maintain a mean and a covariance for state estimates. Prediction steps add uncertainty and measurement updates reduce it
235
What is the formal way of explaining the role of Pooling layers?
Pooling layers downsample feature maps to reduce computation and enforce spatial hierarchy.
236
What is the formal way of explaining the role of Activation Functions in CNNs?
Activation functions introduce non-linearity, enabling complex pattern learning.
237
What is the function of a Disparity Map in Stereo Vision?
Disparity maps capture the shift of corresponding pixels between stereo image pairs. This disparity allows calculation of depth via triangulation
238
Compare background subtraction using a single Gaussian model against a Gaussian Mixture Model
Single Gaussian:
- Assumes a static background
Gaussian Mixture:
- Can represent dynamic backgrounds like moving leaves by modelling multiple modes per pixel
239
How does Motion Difference capture movement?
Motion difference compares sequential frames to detect pixel intensity changes, indicating movement
240
What is a major challenge of evaluating saliency maps?
Human attention is subjective, making ground truths hard to define. Multiple metrics exist, but interpreting results remains complex
241
What is the main advantage of using U-Net in image segmentation tasks?
U-Net preserves spatial resolution through skip connections and so they are very efficient for small dataset segmentation tasks.
242
What is the epipolar plane?
Given two optical centres and a point in an image, you can compute the epipolar plane.
243
What is an epipolar line?
An epipolar line is defined by both the epipolar plane and the image plane for a camera. The epipolar line emerges where the image plane intersects the epipolar plane.
244
How does an epipolar plane help with stereo vision?
It reduces the dimension for the correspondence problem from 2D to 1D, making it more efficient.
245
What is the cost volume in regards to correspondence search?
The cost volume stores matching costs for each pixel over a range of disparities, which represents how well that pixel matches with a shifted pixel in the other image.
246
How is the cost volume used in regards to correspondence search?
The cost volume is used to compute the disparity match by selecting the disparity with the lowest cost for each pixel
247
What is the fine-grained version of how a U-net is trained on image data for segmentation?
- Feed it image-label pairs, where each label is a pixel-wise segmentation map
- The architecture uses an encoder to downsample features and a decoder with skip connections to reconstruct spatial details
- During training, a loss function compares predictions with the ground-truth segmentation maps
- The network updates via backpropagation
- Performance is evaluated using a test set
248
How does Background Subtraction work in Object Tracking?
Background subtraction works by comparing current frames to a background model using changes in pixel intensities. It can use Gaussian Mixture Models, which allow it to handle dynamic backgrounds.
249
What are advantages of using SIFT descriptors over raw pixel intensities?
- Invariant to scale, rotation and minor illumination changes - Uses local gradient orientation histograms, which are resistant to noise and misalignment
250
What are the benefits from using drop-out when training networks?
- Improves generalisation - Reduces overfitting - Forces the network to learn more robust representations
251
How does a Gaussian Mixture Model work?
- It models each pixel as a mixture of several Gaussian distributions, representing different background states. - Each pixel is then compared to these different Gaussian models to determine whether it fits into the background or the foreground i.e. an object of interest.
252
How does histogram equalisation work?
- Redistributes pixel intensity values so that they span the full range of possible values - Computes the cumulative distribution function (CDF) of the image histogram, and maps the original intensities to new ones - Spreads out frequent intensity values, improving contrast in low-contrast images
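A minimal sketch of this CDF mapping for an 8-bit greyscale image (the masked-array handling of empty bins is an implementation choice):

```python
import numpy as np

def equalise(image):
    """Histogram equalisation for an 8-bit greyscale image using the CDF mapping."""
    hist, _ = np.histogram(image.flatten(), bins=256, range=(0, 256))
    cdf = hist.cumsum()
    cdf_masked = np.ma.masked_equal(cdf, 0)                    # ignore empty bins
    cdf_scaled = (cdf_masked - cdf_masked.min()) * 255 / (cdf_masked.max() - cdf_masked.min())
    lut = np.ma.filled(cdf_scaled, 0).astype(np.uint8)         # intensity look-up table
    return lut[image]

low_contrast = np.random.randint(100, 140, (64, 64), dtype=np.uint8)   # intensities bunched together
out = equalise(low_contrast)
print(low_contrast.min(), low_contrast.max(), out.min(), out.max())    # output spans ~0 to 255
```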
253
What is the equation used to calculate disparity between two images?
Disparity = Xl - Xr
Where:
- Xl = Point observed in the left image
- Xr = Point observed in the right image
254
What is the equation used to calculate depth, using disparity?
Z = (f × T) / D
Where:
- Z = Depth
- f = Focal length
- T = Real-world distance between the two cameras (the baseline)
- D = Disparity
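A tiny worked example of this formula (the focal length, baseline and disparity values are made up):

```python
def depth_from_disparity(f_pixels, baseline_m, disparity_pixels):
    """Z = f * T / D: depth is inversely proportional to disparity."""
    return f_pixels * baseline_m / disparity_pixels

# Example: 700-pixel focal length, 0.1 m baseline, 20-pixel disparity -> 3.5 m away
print(depth_from_disparity(700, 0.1, 20))
```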
255
How does a Particle Filter work?
- Generates particles, each of which represents a different hypothesis that 'guesses' the position of the object in the next time-frame - Each particle has an assigned weight, which is computed through the use of an observation/motion model that compares predicted measurements with real sensor data - Over time, particles with lower weights contribute less, leading to degeneracy
256
How does resampling alleviate the problem of degeneracy in Particle Filters?
It duplicates high-weight particles and discards low-weight particles when generating the next set of particles. It thereby focuses computation on more likely hypotheses and maintains tracking accuracy
257
What is semantic segmentation?
Assigns a class label to each pixel in an image, grouping pixels by category without differentiating between individual objects.
258
What is instance segmentation?
Assigns a class label to each pixel in an image, and also identifies and separates each instance of an object i.e. it identifies each object separately unlike semantic segmentation
259
How is training loss computed for a denoising diffusion model?
- Generate noisy versions of the training image using the forward process
- Compare the noise predicted by the denoising network with the actual noise added to that image, using the MSE loss function
- Repeat this process for each sampled timestep
260
What are some disadvantages of using VAEs for generating images?
- Blurry image generation, due to the pixel-wise reconstruction loss and the probabilistic decoder
- The KL-divergence latent-space regularisation term also contributes to blurry images
261
How is illumination invariance implemented in feature detection?
- HOG descriptors - SIFT descriptors Both of these are able to implement illumination invariance through focusing on edge orientations rather than absolute intensities, as well as normalising local patches to reduce brightness variations
262
What is a formal definition of lower-level tasks?
Low-level tasks involve basic image processing such as edge detection or noise reduction
263
What is a formal definition of mid-level tasks?
Mid-level tasks involve interpreting groups of pixels, including segmentation and depth estimation
264
What is a formal definition of high-level tasks?
High-level tasks refer to semantic understanding such as object detection, recognition and pose estimation
265
What is the sensing stage in Computer Vision?
It involves acquiring raw image data through devices like cameras or depth sensors, and provides the initial input to the CV system.
266
What is in a basic image processing pipeline?
- Image acquisition - Pre-processing - Feature extraction - Classification
267
What are the pinhole camera's limitations?
- Low brightness, due to a lack of light - Image blur if the hole is too large - Diffraction effects if the hole is too small
268
How are 2D coordinates derived from 3D coordinates?
- The 2D coordinates are derived through a projection process that uses the camera projection matrix. - The matrix combines the camera's intrinsic and extrinsic parameters; the 3D point is transformed into the camera coordinate system and then projected onto the image plane.
269
What are some edge detection algorithms that are commonly used?
- Sobel - Canny
270
How does Sobel work?
Sobel uses convolutional kernels to compute gradients in the horizontal and vertical directions
271
What is an advantage and disadvantage of using Sobel to detect edges?
Advantage - Simple and computationally efficient Disadvantage - Sensitive to noise
272
What are keypoints in feature detection?
Keypoints are distinct and repeatable locations in an image, such as corners or edges, that are stable under various transformations like rotation and scale.
273
What are local descriptors?
Local descriptors capture distinctive information from small regions around keypoints in an image, which are robust to changes in scale, rotation and illumination.
274
What is Thresholding in regards to grayscale images?
Thresholding converts a greyscale image into a binary image by selecting a threshold value. Pixels with intensity values above the threshold are considered as foreground objects, whereas pixels below the threshold are considered as part of the background.
275
Describe the Region Growing segmentation method
- Start with user-defined or automatic selection of seed points in the image - Expand regions by including neighbouring pixels that meet pre-defined criteria, such as intensity - Growth continues until no more similar pixels are found, resulting in segmented regions.
276
What is the formal definition of Affine Transformations?
They are a linear mapping method that preserves points, straight lines and planes. It includes transformations such as rotation, translation, scaling and shearing.
277
How could affine transformations be used in image registration?
They are used to align two images to each other by correcting geometric distortions, enabling one image to be mapped to the other
278
How does intensity-based registration work?
It utilises the pixel intensity values and compares them directly between two images, rather than relying on extracted features. It is typically achieved through minimising the sum of squared differences.
279
Why are geometric transformation techniques important for image alignment problems?
They are important because they bring different views of the same scene or object into a common coordinate system.
280
What is a basic pipeline for image alignment?
- Image acquisition & loading - Use a similarity measurement - Use a registration algorithm to find the best transformation parameters - Apply geometric transformations