Final Untouched Variation Flashcards

(280 cards)

1
Q

What is the formal definition of Computer Vision?

A

CV is concerned with the computational processes that allow representations of the viewed environment to be recovered from images

2
Q

What are some overarching low-level problems for Computer Vision?

A
  • Corner Detection
  • Edge Detection
3
Q

What are some overarching medium-level problems for Computer Vision?

A
  • 3D Reconstruction
  • Segmentation
  • Tracking
4
Q

What are some overarching high-level problems for Computer Vision?

A
  • Object Detection
  • Pose Estimation
  • Semantic Segmentation
5
Q

What are some use cases for Computer Vision?

A
  • Healthcare: Diagnosis and prognosis
  • Agriculture: Farming, harvesting
  • Automated driving
  • Games, movies
  • Sports
  • Security
6
Q

What does the term Principal or Optical axis mean in regards to pinhole cameras?

A

The Z-axis is often referred to with these terms, and it assumes that the camera orientation is aligned with the Z-axis.

7
Q

What is the point, where the Z-axis hits the image plane, called?

A

The principal point

8
Q

How is the principal point located?

A

It is located at (0, 0, -f) from origin, where f is the focal length.

9
Q

How does the Focal Length relate to Field of View?

A

As focal length increases, field of view decreases i.e. zooming in on something is an example of this.
As focal length decreases, field of view increases i.e. zooming out on a scene

10
Q

How do you work out the coordinates of the projection on the image plane mathematically?

A

x = fX / Z
y = fY / Z
Where (X, Y, Z) is the point on the object, f is the focal length, and (x, y) is the projection of that point on the image plane.
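A minimal NumPy sketch of this projection (the function name and the example numbers are illustrative, not from the course):

```python
import numpy as np

def project_pinhole(point_3d, f):
    """Project a 3D point (X, Y, Z) onto the image plane of an ideal pinhole camera."""
    X, Y, Z = point_3d
    x = f * X / Z   # horizontal image coordinate
    y = f * Y / Z   # vertical image coordinate
    return np.array([x, y])

# Example: a point 2 m in front of a camera with a 50 mm focal length
print(project_pinhole((0.4, 0.1, 2.0), f=0.05))   # -> [0.01, 0.0025] (metres on the image plane)
```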

11
Q

What happens to parallel lines when they ‘head towards’ the horizon?

A

Parallel lines will eventually appear to converge on a vanishing point

12
Q

What are some problems with pinhole cameras?

A
  • Pinhole size (aperture) must be ‘very small’ to obtain a clear image. However, if pinhole size is made smaller, then less light is received by the image plane.
  • If pinhole is comparable to wavelength of incoming light, then diffraction effects blur the image
13
Q

How do you ensure that the pinhole camera captures the sharpest image possible?

A

Diameter of pinhole ≈ 2√(f × λ), where λ is the wavelength of the light.
Example:
If f = 50mm and λ = 600nm, then diameter ≈ 0.35mm

14
Q

What are some advantages in using the pinhole camera?

A
  • Simple to understand
  • Infinite depth of field
  • No lens distortion
15
Q

What are lenses used for?

A

Lenses are used to avoid the problems of the pinhole camera: they gather more light from the scene and focus it onto the image plane, while retaining the same projection.

16
Q

What does the term ‘f’ stand for in terms of lenses?

A

f = Focal length of the lens, which determines the lens’s ability to bend/refract light

17
Q

How is intensity measured numerically?

A

Intensity = 0 if pixel is black
Intensity = 255 if pixel is white in an 8-bit image

18
Q

What factors affect the colour of a pixel in an image?

A

Light sources:
- Emittance spectrum
- Geometry
- Directional attenuation

Objects’ surface properties:
- Reflectance spectrum
- Geometry
- Absorption

19
Q

What are some typical use cases for Image Feature Representation?

A
  • Image alignment
  • 3D reconstruction
  • Motion tracking
  • Object/face recognition
  • Indexing and database retrieval
20
Q

What are image features?

A

A feature is a measurable property that describes the characteristics of an image or a region of images

21
Q

How are image features often represented?

A

Often represented by scalars, vectors, matrices or tensors.

22
Q

What factors make image matching hard to perform?

A
  • Change in lighting
  • Change in viewpoint
  • Occlusions
  • Partial matching
  • Change over time
23
Q

What are image regions & patches?

A

Image regions & patches are segments or rectangular image patches that are used to collect a wider area of information from an image

24
Q

What is a feature vector?

A

A feature vector is built by computing a set of descriptive features and concatenating their values into a single vector.

25
What is the idea behind using feature vectors?
The idea is to remove redundant or irrelevant data
26
What makes histograms a good representation of colour?
- Invariant to translation and rotation
- Change slowly as viewing direction changes
- Change slowly with object size
- Change slowly with occlusion
27
What are texture features?
Texture features measure the frequency with which patterns of colour/grey levels appear
28
What are gradient-based features?
Gradient-based features are derived from areas of an image where intensity changes sharply (a spike in the intensity gradient), which typically indicate the boundaries of objects.
29
How do you estimate gradients using spatial filtering?
You take your source pixel and the area surrounding it, and place a convolution kernel over that area. You then multiply each pixel value by the corresponding kernel value, and sum the results to form the new value of the source pixel.
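A small sketch of gradient estimation by spatial filtering, using the well-known Sobel kernels (the choice of kernel and the toy image are illustrative assumptions):

```python
import numpy as np
from scipy.ndimage import convolve

# Sobel kernels: one responds to horizontal intensity changes, the other to vertical ones
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

def gradient_magnitude(image):
    """Estimate the intensity gradient at every pixel by spatial filtering."""
    gx = convolve(image, sobel_x)   # response to horizontal changes
    gy = convolve(image, sobel_y)   # response to vertical changes
    return np.hypot(gx, gy)         # combined gradient strength per pixel

image = np.zeros((8, 8)); image[:, 4:] = 255.0   # a vertical step edge
print(gradient_magnitude(image)[4])              # large values around column 4
```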
30
What is the formal definition of noise?
Small random bits of data added or taken away from the true value
31
How does a mean filter remove noise?
Move a kernel across the image and calculate a new pixel value based on the average of its surrounding neighbours.
32
How does a Gaussian filter work?
It works almost like a mean filter, except it adjusts the kernel to use a weighted average. The weighted average is stronger towards the centre of the area.
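A minimal sketch of the mean and Gaussian filters from the two cards above (the kernel size and sigma are illustrative choices):

```python
import numpy as np
from scipy.ndimage import convolve

def mean_filter(image, size=3):
    """Replace each pixel by the unweighted average of its size x size neighbourhood."""
    kernel = np.ones((size, size)) / (size * size)
    return convolve(image, kernel)

def gaussian_smooth(image, size=3, sigma=1.0):
    """Like the mean filter, but with weights that fall off away from the centre."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    kernel /= kernel.sum()                      # weights sum to 1
    return convolve(image, kernel)

noisy = np.full((5, 5), 100.0); noisy[2, 2] = 200.0   # one noisy pixel
print(mean_filter(noisy)[2, 2], gaussian_smooth(noisy)[2, 2])
```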
33
What is the formal way of finding edges?
Edges are found through the use of difference filtering in order to pick out the areas of high contrast.
34
How does edge detection work?
Edge detection works by looking for sharp changes in intensity
35
How does a Histogram of Oriented Gradients work?
- Divide the patch into smaller cells (e.g. 8x8 pixels)
- Define slightly larger blocks covering several cells (e.g. 2x2 cells)
- Compute gradient magnitude and orientation at each pixel
- Compute a local weighted histogram of gradient orientations for each cell, weighting by some function of magnitude
- Concatenate histogram entries to form a HoG vector for each block
- Normalise the vector values by dividing by some function of the vector length
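A simplified sketch of the per-cell step (gradient magnitude/orientation, then a magnitude-weighted orientation histogram); real HOG normalises per block rather than per cell, and the bin count here is just a common choice:

```python
import numpy as np

def cell_hog(cell, n_bins=9):
    """Magnitude-weighted histogram of gradient orientations for one cell (e.g. 8x8 pixels)."""
    gy, gx = np.gradient(cell.astype(float))             # simple finite-difference gradients
    magnitude = np.hypot(gx, gy)
    orientation = np.degrees(np.arctan2(gy, gx)) % 180   # unsigned orientation, 0-180 degrees
    hist, _ = np.histogram(orientation, bins=n_bins, range=(0, 180), weights=magnitude)
    return hist / (np.linalg.norm(hist) + 1e-6)          # normalised (per cell here, for simplicity)

cell = np.tile(np.arange(8, dtype=float), (8, 1))        # a smooth horizontal intensity ramp
print(cell_hog(cell))                                    # energy concentrated in one orientation bin
```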
36
Why is invariance important?
Invariance dictates that similar results should be produced even if the conditions vary, such as scale, translation, rotation and illumination changes
37
How does Scale Invariance work?
- Find points whose surrounding patches (at some scale) are distinctive
- Convolution with a Gaussian mask gives some idea of what is going on around a pixel
- Gaussian masks have a natural scale: their standard deviation
38
What are some key properties of SIFT?
- Fast and Efficient, can run in real time - Can handle: Changes in viewpoint, significant changes in illumination
39
What is Clipping in the sense of Brightness?
Clipping occurs when the pixels are too bright to be correctly recorded in the numeric range available.
40
What is the formal definition of Shutter Speed?
Shutter Speed defines how long the light is allowed onto the film/sensor for
41
How does Shutter Speed, Aperture and Gain relate?
If one goes up, then you can effectively maintain the same brightness level by decreasing the others. However, that does cause other adverse effects e.g. Depth of Field
42
What happens when you increase the Aperture?
A larger aperture means more light, but it also reduces the depth of field
43
What happens when you have a longer shutter speed?
A longer shutter speed means more motion blur
44
What makes a good scientific image for CV?
- Slightly underexpose to prevent clipping, although this introduces more noise
- Centre the subject against a simple background
- Record a calibration target for colour balance
- Optimise other settings for increased image clarity
45
What is the general consensus when collecting ranges of images for scientific analysis?
It is often cheaper and faster in the long run to spend a while making sure the images you capture are captured well and stored correctly.
46
Why is segmentation conducted?
Segmentation is used to assist completion of higher-level tasks, such as recognition, tracking, image database retrieval, feature quantification and registration
47
What is the main challenge with segmentation?
Divide the image into regions/segments, where each region presents a distinguished item. Each region should have similar importance
48
How does Otsu's Thresholding work?
Exhaustively search for the threshold that minimises the intra-class variance and maximises the inter-class variance.
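A minimal sketch of this search (maximising the inter-class variance, which is equivalent to minimising the intra-class variance); the toy bi-modal data is illustrative:

```python
import numpy as np

def otsu_threshold(image, n_levels=256):
    """Exhaustively search for the threshold that maximises the inter-class variance."""
    hist, _ = np.histogram(image, bins=n_levels, range=(0, n_levels))
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, n_levels):
        w0, w1 = prob[:t].sum(), prob[t:].sum()            # class weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0          # class means
        mu1 = (np.arange(t, n_levels) * prob[t:]).sum() / w1
        between = w0 * w1 * (mu0 - mu1) ** 2                # inter-class variance
        if between > best_var:
            best_t, best_var = t, between
    return best_t

image = np.concatenate([np.random.normal(60, 10, 5000), np.random.normal(180, 10, 5000)])
print(otsu_threshold(np.clip(image, 0, 255)))               # roughly midway between the two modes
```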
49
What is the primary disadvantage of using Otsu's Thresholding?
It's not robust enough on its own for most real-world applications.
50
What key assumption does the Otsu thresholding algorithm make about the data?
It assumes the data forms a bi-modal histogram
51
How does Region Growing work for segmentation?
- Start with one or more seed points
- Iteratively check their neighbouring points: if the intensity difference between a neighbour and the region is smaller than a threshold, assign the neighbour to the segmented region
- Stop when there are no new assignments
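A minimal sketch of region growing with a single seed, comparing neighbours to the seed intensity rather than a running region statistic for simplicity:

```python
import numpy as np
from collections import deque

def region_grow(image, seed, threshold=10):
    """Grow a region from a seed pixel, adding 4-connected neighbours whose intensity
    differs from the seed by less than `threshold`."""
    h, w = image.shape
    mask = np.zeros((h, w), dtype=bool)
    queue = deque([seed])
    mask[seed] = True
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and not mask[nr, nc] \
                    and abs(float(image[nr, nc]) - float(image[seed])) < threshold:
                mask[nr, nc] = True
                queue.append((nr, nc))
    return mask

image = np.zeros((6, 6)); image[1:4, 1:4] = 100    # a bright square on a dark background
print(region_grow(image, seed=(2, 2)).astype(int))
```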
52
What are the advantages of Region Growing for segmentation?
- Enables multiple class segmentation - Parameter is easy to adjust
53
What are the disadvantages of Region Growing for Segmentation?
- Local region solution - Computationally expensive - Leakage along weak boundaries - Sensitive to seed points
54
What is a Snake in regards to Segmentation?
It is a spline, which is a series of control points, with some function/s that govern the curve between the points. A snake is only interested in the location of the control points, but not the connection between them.
55
How does Active Contour, using Snake, work for segmentation?
- Represent the object boundary as a parametric curve
- Associate a cost function with the curve, so the boundary is found by optimising the cost function
- The cost function is defined as the sum of three terms
- Iteratively update the contour points using something such as gradient descent
56
What is the mathematical equation that defines a snake in segmentation?
Esnake = (alpha * Einternal) + (beta * Eimage) + (gamma * Econstraint)
Where:
- Einternal = Contour smoothness, point spacing, etc.
- Eimage = Image features e.g. lines, edges
- Econstraint = External constraints such as user-interaction keypoints
57
How is the Snake term, called Einternal, defined?
Einternal = Econt + Ecurve
Where:
- Econt = Continuity i.e. control point distribution along the curve
- Ecurve = Curvature i.e. promote round curves where possible
58
How is the ECont term defined for Snake in segmentation?
Econt = (Davg - ||Pi - P(i-1)||)^2
Where:
- Davg is the average distance between adjacent points along the curve
- Pi is a point on the curve, and P(i-1) is the point before it
59
How is the Ecurve term defined for Snake in segmentation?
Ecurve = ||P(i-1) - 2P(i) + P(i+1)||^2
Where P stands for a point on the curve
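A small sketch that evaluates the two internal-energy terms above for a closed contour (the circle example is illustrative):

```python
import numpy as np

def snake_internal_energy(points):
    """Compute E_cont and E_curve for a closed contour given as an (N, 2) array of points."""
    prev_pts = np.roll(points, 1, axis=0)               # P(i-1)
    next_pts = np.roll(points, -1, axis=0)              # P(i+1)
    dists = np.linalg.norm(points - prev_pts, axis=1)
    d_avg = dists.mean()
    e_cont = (d_avg - dists) ** 2                       # penalises uneven point spacing
    e_curve = np.linalg.norm(prev_pts - 2 * points + next_pts, axis=1) ** 2   # penalises sharp bends
    return e_cont.sum(), e_curve.sum()

theta = np.linspace(0, 2 * np.pi, 20, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # evenly spaced points on a circle
print(snake_internal_energy(circle))                        # both terms are small for a smooth circle
```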
60
What happens if you remove the Davg term from the continuity measure for Snakes in Segmentation?
If you remove it, then the snake will become a closed loop that effectively wraps around the target object
61
How is a closed loop snake used for segmentation?
Closed snakes separate the areas on either side of the Snake. Inside of the shape the Snake creates is the foreground, and outside of the shape is the background.
62
What are some drawbacks of Active Contour?
- Node Distribution - Sharp Corners - Topology Changes
63
How is Explicit Geometry defined?
Explicit Geometry - Parameterised boundaries
64
How is Implicit Geometry defined?
Implicit Geometry - Boundaries given by zero level set
65
What are some key properties of the Chan-Vese model?
- No parameterisation required - Less sensitive to the contour initialisation and noise - Computationally efficient - Topological changes can be handled implicitly - Based on regional statistics rather than boundary information
66
How does Graph Construction/Cut work in segmentation?
- Consider the image as a graph with vertices (pixels), edges between neighbouring pixels, and costs assigned to the edges
- Find the optimal cut which produces the minimum cost
67
What two terms make up an energy function in a graph cut problem in segmentation?
Data term (Unary) - A function derived from the observed data that measures the cost of assigning a label to pixel p
Smoothness term (Pairwise) - Measures the cost of assigning labels to adjacent pixels p and q; it is used to impose spatial smoothness
68
What does the smoothness term measure in graph cut problems in segmentation?
- Check all pairs of neighbour pixels - Penalise adjacent pixels with different labels - Function penalises a lot for discontinuities between pixels of similar intensities - However, if pixels are very different, then the penalty is small
69
What are the key steps of feature-based method using clustering?
- Represent the characteristics of the local region for each pixel e.g. intensity, filtering, SIFT, HOG, etc... - Define similarity function e.g. Euclidean, Cosine, Manhattan, etc... - Region partition is handled as pixel classification problem using clustering analysis, such as K-Means.
70
What is the general algorithm for K-Means clustering?
- Randomly select K points as the initial centroids
- Repeat:
  - Assign each point to its closest centroid
  - Re-compute the centroid of each cluster
- Until the centroids/clusters do not change
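A minimal NumPy sketch of this loop (no empty-cluster handling, and the toy data is illustrative):

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Plain K-means clustering on an (N, D) array of feature vectors."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]   # random initial centroids
    for _ in range(n_iters):
        # Assign each point to its closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-compute centroids; stop when they no longer change
        new_centroids = np.array([points[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

points = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(points, k=2)
print(centroids)   # roughly (0, 0) and (5, 5)
```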
71
What are some advantages of using K-Means algorithm?
- Simple and efficient
72
What are some disadvantages of using K-Means algorithm?
- Need to specify the number of clusters - Sensitive to outliers - Solution dependent on the initialisation
73
What are the key steps when using Feature-based supervised learning methods?
- Represent the characteristics of the local region for each pixel e.g. intensity, filtering, HOG, etc... - Region partition is handled as pixel classification problem using supervised machine learning methods such as SVMs, Random Forest, etc... - Requires a training process with ground truth labels
74
What is a key drawback of using Feature-based Supervised Learning Methods?
It is application dependent.
75
Why is image registration useful?
- Information fusion - Information comparison - Transformation estimation - Statistical modelling and analysis based on large sets of aligned images
76
What are some sample applications of Image Registration?
- Medical e.g. pre and post treatment comparison - Remote sensing e.g. road map, satellite map - Augmented Reality e.g. aligning 3D virtual model to 2D images
77
What is the main aim for Image Registration?
To transform a source image to match with a target image
78
What are some key elements used for Image Registration?
- Geometric Transformations e.g. rigid, affine or deformable - Similarity Measurement e.g. point correspondence, intensity based - Parameter optimisation e.g. closed form solution, gradient descent
79
What is the general form for Geometric Transformation, using either Euclidean or Affine?
2D Transformations:
- Euclidean: 2 translations, 1 rotation
- Affine: 2 translations, 1 rotation, 2 scale
3D Transformations:
- Euclidean: 3 translations, 3 rotations
- Affine: 3 translations, 3 rotations, 3 scale
In both cases the affine transformation starts from the same values as the Euclidean one: 2D = 2 translations, 1 rotation; 3D = 3 translations, 3 rotations.
80
What makes image registration problems easier in regards to point-based methods?
If correspondence points can be determined either manually or automatically, then the image registration problem becomes easier
81
How does the Iterative Closest Point Algorithm work?
- For each point in the source point cloud, match the closest point in the target point cloud (e.g. by Euclidean distance)
- Estimate the transformation using a point-to-point distance metric (e.g. a root-mean-square minimisation technique) that best aligns each source point to the match found in the previous step
- Transform the source points using the obtained transformation
- Iterate the previous three steps until the transformation parameters remain unchanged
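A sketch of a single ICP iteration for 2D point clouds, using an SVD-based least-squares rigid alignment of the matched pairs (the data and shift are illustrative):

```python
import numpy as np

def icp_step(source, target):
    """One ICP iteration: match closest points, then find the rigid transform (R, t)
    that best aligns the matched pairs in a least-squares sense."""
    # 1. For each source point, find its closest target point (Euclidean distance)
    dists = np.linalg.norm(source[:, None, :] - target[None, :, :], axis=2)
    matches = target[dists.argmin(axis=1)]
    # 2. Least-squares rigid alignment of the matched pairs (via SVD of the covariance)
    src_c, tgt_c = source.mean(axis=0), matches.mean(axis=0)
    H = (source - src_c).T @ (matches - tgt_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:               # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = tgt_c - R @ src_c
    # 3. Transform the source points; full ICP repeats this until convergence
    return source @ R.T + t, R, t

rng = np.random.default_rng(0)
target = rng.random((30, 2))
source = target + np.array([0.05, -0.02])   # the same cloud, slightly shifted
aligned, R, t = icp_step(source, target)
print(np.round(t, 2))                        # approximately [-0.05, 0.02], undoing the shift
```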
82
What is the fundamental problem with trying to recover the 3D structure of the scene from points matched between two images?
Fundamental ambiguity: any scene point on the ray OP (from the optical centre O through the image point P) projects to the same image location P, so depth cannot be recovered from a single image.
83
What is Stereo Correspondence?
Find matching pixels/features in 2 or more images and convert their 2D positions into 3D depths
84
What can resolve the fundamental ambiguity in stereo correspondence?
A second camera can resolve the ambiguity enabling measurement via triangulation
85
How do you achieve depth recovery using two cameras?
You use triangulation, which requires: - Knowledge of absolute and relative camera geometry i.e. Calibration - Point correspondence i.e. which rays to intersect
86
What are the properties of camera calibration?
It recovers the intrinsic parameters of the cameras e.g. focal length, pixel size, principal point, lens distortion Relative poses between cameras, also called extrinsic parameters, are also factored in e.g. rotation, translation, scale that transforms left image on to right
87
What is the easiest way to perform camera calibration?
Simplest approach is to use a known calibration target object
88
What additional geometric distortions are present within lenses used in cameras?
- Decentering errors: Displacement of the lens centre from optical axes - Radial distortion: Variations in light refractions, mostly in wide angle lenses
89
What is the image warping parameter?
The image warping parameter is estimated to warp the ideal projected coordinates to the observed distorted coordinates; the vector k contains the warping (distortion) parameters.
90
What is the equation for image warping in stereo correspondence?
x' = warp(x, k)
Where:
- x = Ideal image (no distortion)
- x' = Observed image with distortion
91
How are points generally defined in 3D space in stereo correspondence?
Points in 3D space are expressed in terms of a different coordinate frame known as the world coordinate frame. The relation between the coordinates of a point P in the camera and world coordinate systems is:
Xcam = R(Xw - c)
Where:
- c = 3x1 vector representing the coordinates of the camera centre in the world coordinate system
- R = 3x3 matrix representing the orientation of the camera
92
What is the purpose of camera calibration in a mathematical sense?
It is to calculate the intrinsic, extrinsic and distortion parameters.
93
What is epipolar geometry?
Given the two optical centres and a point in one image, we can compute the epipolar plane and so the corresponding epipolar line in the other image
94
Why is epipolar geometry important for camera calibration?
Given two calibrated cameras, it's possible to retrieve the actual 3D coordinate of a corner in the image
95
How can correspondence be used to measure depth of an object in an image?
Correspondence allows measurement of disparity: The difference in the image coordinates of the projections of a given world point into each camera. Depth is inversely proportional to disparity.
96
How does correspondence search work?
- Find a window in the original image - Slide it along the right scanline and compare the content of that window with that of the reference window in the original image
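A minimal sketch of this window-based search for a single pixel, using a sum-of-squared-differences matching cost (window size, disparity range and the toy images are assumptions):

```python
import numpy as np

def disparity_for_pixel(left, right, row, col, window=5, max_disp=16):
    """Slide a window along the same scanline of the right image and pick the shift
    (disparity) whose content best matches the reference window in the left image."""
    half = window // 2
    ref = left[row - half:row + half + 1, col - half:col + half + 1]
    best_d, best_cost = 0, np.inf
    for d in range(max_disp):
        c = col - d                                   # candidate column in the right image
        if c - half < 0:
            break
        cand = right[row - half:row + half + 1, c - half:c + half + 1]
        cost = np.sum((ref.astype(float) - cand.astype(float)) ** 2)   # SSD matching cost
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d

left = np.zeros((20, 40)); left[8:12, 20:24] = 255       # a small bright block
right = np.zeros((20, 40)); right[8:12, 15:19] = 255      # same block, shifted left by 5 pixels
print(disparity_for_pixel(left, right, row=10, col=21))   # -> 5
```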
97
What is the effect of a window size in correspondence search?
- Larger window size: Smooth disparity maps but less detail captured - Smaller window size: More detail, but also more noise captured
98
What are some problems with correlation-based stereo?
- Window size is fixed across the image, but viewed objects differ in size and depth - Uniform regions always match - Can provide a dense disparity map, but values are only reliable where there is some local variation in intensity e.g. near edges - Dense disparity is computationally expensive in spatial domain
99
How do you get ground truth data?
- Alternative/competing sensors - Artificial images - Real images
100
What are some problems that can be encountered when gathering ground truth data?
- Automatic methods can have errors - Manual methods are slow, subjective and also error prone - What if standard sets don't have the properties you are attempting to evaluate your images on?
101
What is True Positive defined as?
True Positive - The algorithm makes a correct prediction about the presence of an object in an image
102
What is False Positive defined as?
The algorithm predicts the presence of an object but that object is not present in the image
103
What is False Negative defined as?
The algorithm misses an object
104
What is Precision's equation, and how is it defined?
Precision = TP / (TP + FP)
Fraction of responses that were correct
105
What is Recall's equation, and how is it defined?
Recall = TP / (TP + FN)
Fraction of the ground-truth objects that were correctly identified
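A tiny worked example of the two formulas (the counts are made up):

```python
def precision_recall(tp, fp, fn):
    """Precision: fraction of detections that were correct.
    Recall: fraction of ground-truth objects that were found."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Example: a detector fires 10 times, 8 of them on real objects, and it misses 4 objects
print(precision_recall(tp=8, fp=2, fn=4))   # -> (0.8, 0.666...)
```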
106
How do you evaluate recognition?
Using Precision-Recall curves as a visualisation tool
107
What are some properties of Precision-Recall curves?
- Plot of precision against recall as some parameter is varied - Parameter is the threshold used to decide if the model and image are similar enough to be considered equal - Increasing threshold imposes a tighter requirement on matching, which reduces False Positives, but increases False Negatives
108
How would you measure accuracy in a classification problem?
For classification, you can use a confusion matrix, which shows what category images are confused with others
109
How is ground truth defined?
Ground truth is a set of manually-drawn bounding boxes on an image
110
What are two key elements of measurement and recognition?
- Accuracy - Robustness
111
What does a Precision Plot measure?
A Precision Plot measures the percentage of frames whose estimated location is within a given threshold distance of the ground truth.
112
When using algorithms that learn, what properties must it fulfill?
- Must be representative of the data - Must not be too specific - Must not use training data in the evaluation of performance
113
What does the term 'confining search leftward' mean in terms of Disparity Maps?
When you know which image is the left image and which is the right image, you can make safe assumptions, such as that a point x in the left image could never appear further to the right in the right image, thereby narrowing the search space.
114
Question
Answer
115
What are the 4 main Recognition problems?
- Recognition: Identify the main object in an image
- Detection: Find the location of all objects
- Segmentation: Assign all pixels to objects
- Pose: Find the location of the object parts
116
What is the formal definition of the 'Detection' problem in Recognition problems?
Find the location of all objects in the scenes in terms of providing a bounding box
117
What is Semantic Image Segmentation?
It's the process of partitioning the image into 'meaningful' segments. You group pixels based on 'common' properties.
118
What is Instance Image Segmentation used for?
If you need to differentiate different instances of the same object
119
What three main features are required for a good Object Recognition Model?
- Data: Images containing objects from that class and images from all other classes - Feature Extraction: Work with features extracted from images - Machine Learning: From the features extracted, initiate and train a model that recognises this particular object class
120
How do HOG features work?
- Divide image into a grid of cells e.g. 8x8 - Compute edges and their orientation for every pixel location - Compute histogram of gradient orientations in each cell
121
What is Bag of Features?
Bag of Features methods analyse the large set of very specific features generated by a training set of images and identify a small set of useful, more generic features
122
How does Object Recognition work with Bag of Features?
- Take a bunch of images: Extract features, build up a 'dictionary' of common features - Then, given a new image, extract features: -- For each feature, find the closest visual word in the dictionary -- Build a histogram to represent the image
123
How does Viola-Jones Recognition work in practice?
- Slide a window across the image and evaluate a face model at every location
124
What are the key ideas that were pulled from Viola-Jones Recognition?
- Integral images for fast feature evaluation - Boosting for feature selection - Attentional cascade for fast rejection of non-face windows
125
How does the Integral Image work?
The integral image computes a value at each pixel that is the sum of the pixel values above and to the left of the source pixel inclusive.
126
How is Feature Extraction performed using Integral Images?
Features are extracted from sub-windows of a sample window, and each of the four feature types are scaled and shifted across all possible combinations.
127
What is the formal definition of Boosting?
Boosting is a classification scheme that works by combining weak learners into a more accurate ensemble classifier, where a weak learner is defined as a learner that does only slightly better than random chance.
128
How does Boosting work?
- Need a training set of labelled examples
- Start with all examples equally weighted
- Learn a series of recognition rules
- Re-weight examples so that an incorrect recognition by the nth classifier makes that example more important to the (n+1)th
- No single rule/classifier can separate complex objects from complex backgrounds, but a combination can
- Weights are determined automatically
129
What is the Classical Approach in Computer Vision?
Apply learned operations to user-defined features: - Design/choose features - Design/choose a classifier - Train the classifier
130
How does Bag of Words reduce reliance on the user?
It clusters the results of applying the user-defined set of feature-detection operators to form a more generic visual vocabulary.
131
What are the differences between MLPs and CNNs?
MLP:
- Wide application scenario, not just images
- Neurons are fully connected, so it can't scale well to large-size data such as images
CNN:
- Neurons are arranged in '3D'; each neuron is only connected to a small region of the previous layer
- Typical CNN structure: Input - Convolution Layer - Activation Layer - Pooling Layer - Fully Connected Layer - Output Layer
132
What are the four parameters in a Convolution Layer?
- Input size (W) - Filter Size (F) - Zero Padding (P) - Stride (S)
133
What is the equation used to calculate the output volume size from a Convolution Layer?
(W - F + 2P) / S + 1
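A one-function sketch of this formula with an illustrative layer configuration:

```python
def conv_output_size(w, f, p, s):
    """Output spatial size of a convolution layer: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1

# Example: a 224x224 input, 7x7 filters, padding 3, stride 2 (a common early-layer setup)
print(conv_output_size(w=224, f=7, p=3, s=2))   # -> 112
```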
134
What is the primary aim of a Convolution Layer?
Convolution layers are filter banks performing convolutions with the learned kernels/masks.
135
What is the main purpose of a Pooling layer?
A pooling layer pools the outputs of the convolution layers over local neighbourhoods, which reduces the resolution of the filter outputs. Subsequent convolutional layers therefore access larger areas of the image.
136
What is the effect of the Pooling Layer?
It reduces the spatial size of the representation and reduces the amount of parameters, effectively down-sampling the input to increase the receptive field size.
137
What are the four types of activation functions?
- Sigmoid - Tanh - ReLU - Leaky ReLU
138
What are the properties of Sigmoid?
- Range of values is [0, 1] - Saturates and kills gradients, with very small gradients near the extremes (0 and 1)
139
What are the properties of Tanh?
- Has a zero-centred range of [-1, 1] - Saturates and kills gradients, with very small gradients near the extremes (-1 and 1)
140
What are the properties of ReLU?
- Solves vanishing/exploding gradient problems - Simple to calculate - Some neurons can be 'dead' with negative input
141
What are the properties of Leaky ReLU?
- Overcomes the dying neuron problem - Performance is not consistent
142
What is the function of the Fully Connected Layer with Softmax?
Fully Connected Layer is fully connected to all activations in the previous layer The Softmax function converts the prediction range to the range of [0, 1] for each class
143
How does a Softmax function work?
A softmax function takes a set of numbers as input and outputs a probability distribution over a set of k classes, giving a probability for each class in the range [0, 1].
144
What is a key property of the Softmax function?
The probabilities produced all sum to 1 across the classes. Example: Class 1 = 0.4, Class 2 and 3 = 0.3 Therefore 0.4 + 0.3 + 0.3 = 1
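A minimal NumPy softmax sketch (the max-subtraction for numerical stability is a standard trick, not something stated in the card):

```python
import numpy as np

def softmax(logits):
    """Convert raw class scores into probabilities in [0, 1] that sum to 1."""
    shifted = logits - np.max(logits)        # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # e.g. ~[0.66, 0.24, 0.10], summing to 1
```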
145
What is the typical architecture for CNNs in Classification problems?
Typical block: - Normalisation Layer - Filter Bank - Non-linear - Feature Pooling After multiple blocks of the above: - Classifier
146
What makes a CNN effective?
- Increased depth - Smaller filters at the lower levels - More convolutional than fully connected layers
147
What is a hierarchical model in terms of CNNs?
A hierarchical model is a type of CNN which learns details at the pixel level, and then with each block 'zooms out' and finds less 'zoomed in' details. Example: First stage looks at pixel-level details, then the second stage looks at edges. Third stage then looks at object parts i.e. combination of edges, before the final stage looks at object models
148
What is the concept of Transfer Learning?
Transfer Learning is a method that leverages a pre-trained network for new tasks without needing vast amounts of new training data.
149
How does Transfer Learning work?
- A network is initially trained on a large, general-purpose dataset like ImageNet - Then, if the specific dataset is too small, you freeze the pre-trained network's weights and only re-train the classifier layer - For a medium sized dataset, you start with the pre-trained weights, then re-train the whole network or just the higher layers using a reduced learning rate.
150
What are some properties of the ImageNet dataset?
- 1.2 million high-resolution images - 1000 different classes
151
What are some historical examples of Classification CNNs?
- AlexNet - VGG - Residual Network
152
What are some historical examples of Segmentation CNNs?
- U-Net - SegNet
153
What are some historical examples of Object Detection CNNs?
- RCNN - Fast RCNN - Mask RCNN - YOLO - nnU-Net
154
How does Eye Tracking get used effectively in Computer Vision?
Deep learning performs better when regions are marked before the learning process. As such, eye tracking is a cheap and easy way to record where someone has been looking. Once this data is filtered, it can be passed to a suitable deep learning system.
155
What can help to 'pad-out' a dataset if it isn't suitably large enough for training?
Creating and using synthetic data, as it can be used to augment the original dataset with 'new' data
156
What are some factors to consider when rendering synthetic images?
- Accuracy of material - Lighting - Background - How to create annotations
157
What is Saliency in the context of Computer Vision?
Saliency networks try and predict this map of interesting areas in the image, which aims to simulate human attention.
158
What are the definitions for local and global context?
Local - Detailed neighbourhood attention Global - Where the object sits in the scene
159
How does an unsupervised GAN work?
The aim is for the generator to fool the discriminator network, so that it can't tell the difference between real and fake.
160
What are some problems with GANs?
They are tough to train:
- If the discriminator behaves badly, then the generator does not get accurate feedback and the loss function cannot represent reality
- If the discriminator does a great job, then the gradient of the loss function drops close to 0 and learning becomes very slow or even stalls
Mode collapse:
- During training, the generator may collapse to a setting where it produces similar outputs with low variety
161
How do you train a diffusion model?
Training a diffusion model is done by a forward process that destroys the image data over a number of timesteps Afterwards, a neural network learns to perform a reverse process that allows us to re-create the image by removing noise
162
How can you generate new samples when training a diffusion model?
At inference time, new random noise can be used to generate new samples from the learned distribution
163
How does the Forward Process in Diffusion models work?
- Noise is added in a Markov Chain - Each step's noise is only dependent on the previous step - This is a fixed process, containing no learned parameters.
164
What is the Schedule in terms of Diffusion Models?
The schedule is part of the model that determines the rate at which noise is added to the image. Smaller steps are taken when there is little noise, and larger steps are taken when the image is mostly noise
165
How do you perform the Reverse step in a Diffusion Model?
You use a U-net style model that learns the reverse diffusion process. This is done by predicting the noise in the image given the current timestep. By removing the predicted noise from the image, we get an approximate of the original image.
166
What are the main components of the Training Loop in Diffusion Models?
Loop until converged:
- Get the next sample image from the dataset
- Get a random timestep t between 0 and T
- Using the forward process, add noise to produce the noisy image at timestep t, recording the noise that was used
- Predict the added noise using the denoising UNet
- Apply an MSE loss between the predicted and ground-truth noise
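A toy NumPy sketch of the closed-form forward (noising) step and the MSE training target used in this loop; the linear beta schedule, array shapes and the zero "prediction" stand-in are assumptions, and no real network is involved:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)              # assumed linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)            # cumulative product used in the closed form

def forward_noise(x0, t, rng):
    """Closed-form forward process: jump straight from x0 to the noisy x_t."""
    eps = rng.standard_normal(x0.shape)         # the ground-truth noise the network must predict
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((32, 32))              # stand-in for a training image
t = rng.integers(0, T)                          # random timestep, as in the training loop
xt, eps = forward_noise(x0, t, rng)
eps_pred = np.zeros_like(eps)                   # placeholder for the denoising UNet's prediction
loss = np.mean((eps_pred - eps) ** 2)           # MSE between predicted and ground-truth noise
print(t, loss)
```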
167
What are the main components of the Inference Loop in Diffusion Models?
- Sample xT, i.e. random Gaussian noise
- For all timesteps t from T down to 0:
  - Predict the noise in the current image
  - Remove the predicted noise to get an approximation of x0, i.e. the original image
  - Add noise back to the image to get x(t-1)
168
What are some comparison points between Diffusion Models and GANs/VAEs?
- Diffusion models can now generate higher quality images than either GANs or VAE models - Much easier to train and more consistent compared to GANs
169
What are some ethical concerns in using Generative AI?
- Data usage - Misinformation - Privacy & Consent - Impact on the creative industry - Impact on wider society
170
How do text prompts work in the context of Diffusion Models?
- Text prompts rely on learning a mapping between text and image embeddings - A point in that embedding space can be used to condition a diffusion model to produce semantically relevant images - A model called a prior learns to convert text embeddings to image embedding, which can then condition the diffusion model
171
What does Spatial Conditioning do to prompting?
It injects additional spatial conditioning into the denoising UNet, which is used alongside the text prompt to guide the generated image.
172
What is one method that can make Diffusion Models more efficient?
You can perform the diffusion process in a latent space, similar to a Variational AutoEncoder (VAE), where the diffusion process takes place in the lower dimensional space.
173
What are some challenges with Video Diffusion models?
- Temporal Consistency: -- Frame to Frame consistency -- Object permanence - Realism: -- Interactions between objects -- Physics and fluid dynamics - Technical Challenges: -- Dataset availability -- Compute cost
174
What is Tracking in Computer Vision?
Tracking involves following a specific object through a video
175
What are some issues with pixel-level motion tracking solutions?
- They do not make predictions for new locations - They do not estimate velocities of objects/targets in the image - They do not handle uncertainty of position, or challenges like occluding objects
176
What are Flow Fields used for?
It is used in tracking, and can be used to infer motion, stabilise images, stabilise a device or help with frame interpolation
177
What are the two types of approaches for Motion Detection?
- Motion Difference - Background Subtraction
178
How does Motion Difference work in Motion Detection?
- Take two images from a sequence - Compute the change in brightness at each pixel in the image - Threshold and filter for noise
179
How does Background Subtraction work in Motion Detection?
- Capture an image of the background - Use the difference between the current frame and the background to find moving objects
180
What is the best way to take background images for Background Subtraction models in Motion Detection?
Take several at different points in time, so that the average of a given pixel's value over some time period is better
181
How do you work out the average value for a background pixel in Motion Detection?
Averaging often becomes a Gaussian model, where you compute the deviation as well as the mean: given a new pixel value and the Gaussian model, you can estimate how likely that value is to belong to the background.
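A minimal sketch of a per-pixel running Gaussian background model (the learning rate and the 2.5-sigma foreground test are illustrative choices):

```python
import numpy as np

def update_background(mean, var, frame, alpha=0.05):
    """Running per-pixel Gaussian background model: flag pixels far from the model as
    foreground, then slowly adapt the mean and variance towards the new frame."""
    foreground = np.abs(frame - mean) > 2.5 * np.sqrt(var)      # e.g. a 2.5-sigma test
    mean = (1 - alpha) * mean + alpha * frame                   # adapt the mean
    var = (1 - alpha) * var + alpha * (frame - mean) ** 2       # adapt the variance
    return mean, var, foreground

frames = [np.full((4, 4), 100.0) + np.random.randn(4, 4) for _ in range(20)]
mean, var = frames[0].copy(), np.full((4, 4), 25.0)
for f in frames:
    mean, var, fg = update_background(mean, var, f)             # learn the static background
moving = frames[-1].copy(); moving[1, 1] = 200.0                # a "moving object" pixel
_, _, fg = update_background(mean, var, moving)
print(fg.astype(int))                                           # only the changed pixel is flagged
```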
182
What is an example algorithm that uses confidence to predict the movement of an object in Tracking, and how does it work?
Kalman Filter - Handles certainty of measurements and prediction - Has a motion model - Assumes we have a unimodal Gaussian i.e. can only predict one location
183
What is a short description of a Particle Filter?
A particle filter uses a set of samples to approximate the target probability density.
184
How does a Particle Filter work?
A Particle Filter treats each sample as a particle, which is a hypothesis with some probability of being correct
185
What is the plain English description of the theory behind a Particle Filter?
- Start with many guesses about where the object could be
- Predict where each guess thinks the object will be next
- Compare those guesses to what you actually see in the new image, i.e. the measurement
- Update the score of each guess based on how well it matches the real observation
- Pick new guesses based on the best scores, and repeat
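A toy 1D sketch of exactly this loop (the constant-velocity motion model, Gaussian weighting and noise levels are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n_particles = 500
true_pos = 0.0
particles = rng.uniform(-10, 10, n_particles)     # initial guesses about the object's position
weights = np.full(n_particles, 1.0 / n_particles)

for step in range(20):
    true_pos += 1.0                                          # the object moves right each frame
    particles += 1.0 + rng.normal(0, 0.5, n_particles)       # predict with a motion model + noise
    measurement = true_pos + rng.normal(0, 1.0)              # noisy observation from the image
    weights = np.exp(-0.5 * (particles - measurement) ** 2)  # score guesses against the observation
    weights /= weights.sum()
    # Resample: keep many copies of the good guesses, drop the poor ones
    idx = rng.choice(n_particles, n_particles, p=weights)
    particles = particles[idx]

print(round(true_pos, 1), round(particles.mean(), 1))        # the estimate tracks the true position
```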
186
What is the basic algorithm for CONDENSATION in Tracking?
- Select a particle - Project forward in time - Add noise - Compute weight from measurement
187
What are some features of CONDENSATION in Tracking?
- The hypotheses can have any shape of probability distribution
- The observations are used to update the probabilities of the hypotheses via an appearance model
- Many hypotheses are considered at once, which increases the chances of finding the target
- Works in substantial clutter at or close to video frame rate
- Can use different numbers of particles as a trade-off between running very fast and having higher accuracy and greater robustness
188
What is an Appearance model in CONDENSATION?
An appearance model is like a predicted silhouette of what the object will look like. A measurement is then derived from it, for example: the weight is 1 if the silhouette lies over a red pixel patch, and tends to 0 as it drifts away from this.
189
What is Contour Tracking?
Contour Tracking establishes a state that contains a set of points representing a contour, so each particle represents a possible contour This state has higher dimensions, which are called the parameters of the contour. The mean state is usually drawn to summarise results
190
What is a mixed-state CONDENSATION model?
It saves multiple transition probabilities of an object switching behaviour, and then builds multiple models using this data into a single CONDENSATION tracker.
191
What is the step-by-step process for a Mixed-State CONDENSATION tracker?
- Select a particle, which includes a gamma(t) - Randomly generate gamma(t + 1) in line with the transition probability table - Project particle forward with motion model associated with gamma(t + 1) - Add noise - Compute pi(t)
192
What happens to the particles in a Mixed-State CONDENSATION tracker?
- Mixed-state CONDENSATION produces a mix of particles, depending on the transition probabilities. - Different particles contain and follow different motion models - Clouds of particles can form
193
What is an issue with using transition tables for transition probability predictions in tracking?
It was hard-coded, and as such, is difficult to do and is inflexible
194
How does the number of particles affect the performance of particle filter methods?
Lower number: - Fast, but coarse - May have unsampled areas Higher Number: - Slower - More potential for representing state space
195
What affects the use of CONDENSATION models/trackers?
The Curse of Dimensionality
196
What is MCMC tracking, and what are some key properties?
Markov Chain Monte Carlo:
- Allows us to explore the space of possible tracker/location states more efficiently
- Don't apply the motion model to everything inside the state
- Improve a particle by selecting one target inside it, moving only that target, and seeing if the result is better or worse than before
- Only add improved samples to the new particle set
- Effectively a random walk looking for improved target configurations
197
How does the MCMC tracking work?
Generate a new set of samples using the Metropolis-Hastings algorithm:
- Propose a new state
- Evaluate the acceptance ratio
- If the acceptance ratio is greater than 1, accept the state and update the target we considered
- Otherwise, accept it with probability equal to the acceptance ratio (i.e. if a random draw falls below the ratio)
- Otherwise, keep the current state and reject the proposed state
198
What are some key features of MCMC tracking?
- More efficient sampling, as only optimising one target per iteration, so good with high-dimensional joint states - Quality of samples produced tends to increase with each iteration - Can produce an estimate by taking an average position across the samples for each target
199
What is coalescence in tracking?
It's the result of when similar targets interact. When they do, there's a chance that their trackers can become 'distracted' and all track a single entity.
200
How do you handle interactions and attempt to offset the problem of coalescence?
You need to lower the weight of samples that predict a position too close to other targets This assumes that targets cannot occupy the same spot on the image
201
What is the mathematical representation of the solution attempting to deal with coalescence?
- Construct a graph of targets close enough to interact - Use an interaction function to penalise predictions close to other targets, such as exponential fall-off based on distance
202
What is the concept of sharing motion information?
It's like mixed-state condensation, but pairs of targets that are moving similarly can use each other's velocity estimates, which helps to recover from occlusion.
203
Explain the term Epipolar Plane
It's a combination of the image feature and two optical centres, all of which define a plane of interpretation. The world feature generating the known image feature and the corresponding feature in the other view must lie in this plane.
204
Why does the epipolar plane simplify the stereo correspondence problem?
- The intersection of the plane of interpretation and the second image plane is a straight line (called an epipolar line) The epipolar plane determines the epipolar lines. Given a feature extracted from one view, the corresponding feature must lie on the corresponding epipolar line in the other. - Search space is reduced from two dimensions to one
205
How is the Integral Image calculated?
- Computes a value at each pixel that is the sum of the pixel values above and to the left of the source pixel (inclusive)
- Cumulative row sum: s(x, y) = s(x-1, y) + i(x, y)
- Integral image: ii(x, y) = ii(x, y-1) + s(x, y)
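A minimal sketch of building an integral image and using the 4-lookup rectangle sum:

```python
import numpy as np

def integral_image(image):
    """Each entry is the sum of all pixel values above and to the left (inclusive)."""
    return image.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of pixel values in the rectangle from (r0, c0) to (r1, c1) inclusive, using 4 lookups."""
    total = ii[r1, c1]
    if r0 > 0: total -= ii[r0 - 1, c1]
    if c0 > 0: total -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0: total += ii[r0 - 1, c0 - 1]
    return total

image = np.arange(16).reshape(4, 4)
ii = integral_image(image)
print(rect_sum(ii, 1, 1, 2, 2), image[1:3, 1:3].sum())   # both give 30 (5 + 6 + 9 + 10)
```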
206
What benefit does using Integral images bring?
- Advantages of speed i.e. more efficient calculations. Only 4 numbers are needed to calculate the sum of intensity values for each rectangle
207
What are advantages of a pinhole camera?
- Simple to understand - No lens distortion due to lack of lens - Infinite depth of field: No depth of field effect distorting the image
208
What is a one line description for segmentation?
Assign all pixels to objects
209
What is a one line description for Recognition?
Identify the main object in the image
210
What is a one line description for Detection?
Find the location of all objects
211
What is a one line description for Pose?
Find all of the object parts
212
What are characteristics of a good feature?
- Invariant to scale and rotation - Reflect useful object properties - Unique and repeatable
213
What are example types of object recognition tasks?
- Segmentation - Pose Estimation - Detection
214
Why are Diffusion Models preferred over GANs?
- Easier to train - Generate high-quality images - Avoid GAN's issues such as mode collapse or tough to train due to discriminator balance
215
What are important concepts in Background Subtraction?
- Gaussian Mixture Models - Thresholding
216
What features are commonly used in Classical Computer Vision?
- HOG descriptors - Colour histograms - Texture Features
217
Describe the structure of a pinhole camera
- The pinhole camera's optical point/centre is located on the principal axis / Z axis - The 3D scene is projected onto the image plane through the pinhole located at the optical point - The focal length narrows or widens the field of view by increasing or decreasing in length respectively.
218
Explain the forward process in a diffusion model
- Noise is added in a Markov Chain - Each step’s noise is only dependent on the previous step - This is a fixed process, containing no learned parameters.
219
Explain the reverse process in a Diffusion model
- Train and use a U-net style model that learns the reverse diffusion process - This is trained by predicting the noise in the image given the current timestep - By removing predicted noise from the image, we get an approximation of the original image
220
Describe the process of object detection for classification
- Features are extracted using feature extraction methods - They are then used to train a classifier model, which identifies regions in test images likely to contain objects and classifies them accordingly
221
Explain how a particle filter updates object tracking
- Each particle represents a hypothesis - Predictions are made based on a motion model - Measured positions are then used to update the particle weights based on how well they match the observation
222
Describe the architecture of a U-net model?
- U-net uses an encoder to compress features - Followed by a decoder to reconstruct segmentation maps - Skip connections between matching levels of encoder and decoder to preserve spatial information
223
Explain both semantic and instance segmentation
- Semantic labels every pixel by class - Instance distinguishes individual objects of the same class
224
What are the challenges of using a single Gaussian Background Model?
- Cannot model dynamic backgrounds - Sensitive to lighting changes - Cannot represent multiple background modes at a pixel
225
What's an advantage and disadvantage of using a Kalman Filter over a Particle Filter?
Advantage - Kalman filters require less computation
Disadvantage - Kalman filters assume a unimodal Gaussian, whereas particle filters can handle multi-modal distributions
226
Why is SIFT useful?
- Invariant to scale, rotation and illumination - It's robust in regards to matching across different views
227
What is the importance of illumination modelling?
Helps accurately interpret scene radiance, ensuring features and colours are not misinterpreted due to lighting changes
228
What factors affect the quality of images captured in digital photography?
- Shutter speed - Aperture - ISO Setting
229
What are valid applications of saliency prediction in Computer Vision?
- Smart image cropping - Object Detection - Content-aware compression
230
In Stereo vision, what helps estimate depth from two images?
- Epipolar geometry: Contains correspondence search - Disparity maps: Allow depth calculation
231
What are components of a Kalman filter used in tracking?
- Motion model - Measurement Update - State Prediction
232
Explain how ISO, Shutter Speed and Aperture interact to control image brightness?
- ISO controls sensor sensitivity - Aperture determines how much light enters the lens - Shutter Speed determines how long the light is exposed for Increasing one typically requires reducing another to maintain consistent exposure
233
Describe how disparity maps are used in Stereo Vision to infer depth
Disparity maps represent pixel shifts between left and right images. Depth is inversely proportional to disparity. The larger the disparity, the closer the object is to the camera that took the image.
234
How does a Kalman Filter handle uncertainty during object tracking?
Kalman filters maintain a mean and a covariance for state estimates. Prediction steps add uncertainty and measurement updates reduce it
235
What is the formal way of explaining the role of Pooling layers?
Pooling layers downsample feature maps to reduce computation and enforce spatial hierarchy.
236
What is the formal way of explaining the role of Activation Functions in CNNs?
Activation functions introduce non-linearity, enabling complex pattern learning.
237
What is the function of a Disparity Map in Stereo Vision?
Disparity maps capture the shift of corresponding pixels between stereo image pairs. This disparity allows calculation of depth via triangulation
238
Compare background subtraction using a single Gaussian model against a Gaussian Mixture Model
Single Gaussian:
- Assumes a static background
Gaussian Mixture:
- Can represent dynamic backgrounds like moving leaves by modelling multiple modes per pixel
239
How does Motion Difference capture movement?
Motion difference compares sequential frames to detect pixel intensity changes, indicating movement
240
What is a major challenge of evaluating saliency maps?
Human attention is subjective, making ground truths hard to define. Multiple metrics exist, but interpreting results remains complex
241
What is the main advantage of using U-Net in image segmentation tasks?
U-Net preserves spatial resolution through skip connections and so they are very efficient for small dataset segmentation tasks.
242
What is the epipolar plane?
Given two optical centres and a point in an image, you can compute the epipolar plane.
243
What is an epipolar line?
An epipolar line is defined by both the epipolar plane and the image plane for a camera. The epipolar line emerges where the image plane intersects the epipolar plane.
244
How does an epipolar plane help with stereo vision?
It reduces the dimension for the correspondence problem from 2D to 1D, making it more efficient.
245
What is the cost volume in regards to correspondence search?
The cost volume stores matching costs for each pixel over a range of disparities, which represents how well that pixel matches with a shifted pixel in the other image.
246
How is the cost volume used in regards to correspondence search?
The cost volume is used to compute the disparity match by selecting the disparity with the lowest cost for each pixel
247
What is the fine-grained version of how a U-net is trained on image data for segmentation?
- Feed it image-label pairs, where each label is a pixel-wise segmentation map
- The architecture uses an encoder to downsample features and a decoder with skip connections to reconstruct spatial details
- During training, a loss function compares predictions with the ground-truth segmentation maps
- The network updates via backpropagation
- Performance is evaluated using a test set
248
How does Background Subtraction work in Object Tracking?
Background subtraction works by comparing current frames to a background model using changes in pixel intensities. It can use Gaussian Mixture Models, which allow it to handle dynamic backgrounds.
249
What are advantages of using SIFT descriptors over raw pixel intensities?
- Invariant to scale, rotation and minor illumination changes - Uses local gradient orientation histograms, which are resistant to noise and misalignment
250
What are the benefits from using drop-out when training networks?
- Improves generalisation - Reduces overfitting - Forces the network to learn more robust representations
251
How does a Gaussian Mixture Model work?
- It models each pixel as a mixture of several Gaussian distributions, representing different background states. - Each pixel is then compared to these different Gaussian models to determine whether it fits into the background or the foreground i.e. an object of interest.
252
How does histogram equalisation work?
- Redistributes pixel intensity values so that they span the full range of possible values - Computes the cumulative distribution function (CDF) of the image histogram, and maps the original intensities to new ones - Spreads out frequent intensity values, improving contrast in low-contrast images
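A minimal sketch of this CDF mapping for an 8-bit greyscale image (the masked-array handling of empty bins is an implementation choice):

```python
import numpy as np

def equalise(image):
    """Histogram equalisation for an 8-bit greyscale image using the CDF mapping."""
    hist, _ = np.histogram(image.flatten(), bins=256, range=(0, 256))
    cdf = hist.cumsum()
    cdf_masked = np.ma.masked_equal(cdf, 0)                    # ignore empty bins
    cdf_scaled = (cdf_masked - cdf_masked.min()) * 255 / (cdf_masked.max() - cdf_masked.min())
    lut = np.ma.filled(cdf_scaled, 0).astype(np.uint8)         # intensity look-up table
    return lut[image]

low_contrast = np.random.randint(100, 140, (64, 64), dtype=np.uint8)   # intensities bunched together
out = equalise(low_contrast)
print(low_contrast.min(), low_contrast.max(), out.min(), out.max())    # output spans ~0 to 255
```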
253
What is the equation used to calculate disparity between two images?
Disparity = Xl - Xr
Where:
- Xl = Point observed in the left image
- Xr = Point observed in the right image
254
What is the equation used to calculate depth, using disparity?
Z = (f × T) / D
Where:
- Z = Depth
- f = Focal length
- T = Real-world distance between the two cameras (the baseline)
- D = Disparity
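A tiny worked example of this formula (the focal length, baseline and disparity values are made up):

```python
def depth_from_disparity(f_pixels, baseline_m, disparity_pixels):
    """Z = f * T / D: depth is inversely proportional to disparity."""
    return f_pixels * baseline_m / disparity_pixels

# Example: 700-pixel focal length, 0.1 m baseline, 20-pixel disparity -> 3.5 m away
print(depth_from_disparity(700, 0.1, 20))
```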
255
How does a Particle Filter work?
- Generates particles, each of which represents a different hypothesis that 'guesses' the position of the object in the next time-frame - Each particle has an assigned weight, which is computed through the use of an observation/motion model that compares predicted measurements with real sensor data - Over time, particles with lower weights contribute less, leading to degeneracy
256
How does resampling alleviate the problem of degeneracy in Particle Filters?
It duplicates high-weight particles and discards low-weight particles when generating the next set of particles. It thereby focuses computation on more likely hypotheses and maintains tracking accuracy
257
What is semantic segmentation?
Assigns a class label to each pixel in an image, grouping pixels by category without differentiating between individual objects.
258
What is instance segmentation?
Assigns a class label to each pixel in an image, and also identifies and separates each instance of an object i.e. it identifies each object separately unlike semantic segmentation
259
How is training loss computed for a denoising diffusion model?
- Generate noisy versions of the training image using the forward process
- Compare the noise predicted by the denoising network with the actual noise added to that image, using the MSE loss function
- Repeat this process for each sampled timestep
260
What are some disadvantages of using VAEs for generating images?
- Blurry image generation, due to the pixel-wise reconstruction loss and the probabilistic decoder
- The KL-divergence latent-space regularisation term also contributes to blurry images
261
How is illumination invariance implemented in feature detection?
- HOG descriptors - SIFT descriptors Both of these are able to implement illumination invariance through focusing on edge orientations rather than absolute intensities, as well as normalising local patches to reduce brightness variations
262
What is a formal definition of lower-level tasks?
Low-level tasks involve basic image processing such as edge detection or noise reduction
263
What is a formal definition of mid-level tasks?
Mid-level tasks involve interpreting groups of pixels, including segmentation and depth estimation
264
What is a formal definition of high-level tasks?
High-level tasks refer to semantic understanding such as object detection, recognition and pose estimation
265
What is the sensing stage in Computer Vision?
It involves acquiring raw image data through devices like cameras or depth sensors, and provides the initial input to the CV system.
266
What is in a basic image processing pipeline?
- Image acquisition - Pre-processing - Feature extraction - Classification
267
What are the pinhole camera's limitations?
- Low brightness, due to a lack of light - Image blur if the hole is too large - Diffraction effects if the hole is too small
268
How are 2D coordinates derived from 3D coordinates?
- The 2D coordinates are derived through a projection process that uses the camera projection matrix. - The matrix combines the camera's intrinsic and extrinsic parameters; the 3D point is transformed into the camera coordinate system and then projected onto the image plane.
269
What are some edge detection algorithms that are commonly used?
- Sobel - Canny
270
How does Sobel work?
Sobel uses convolutional kernels to compute gradients in the horizontal and vertical directions
271
What is an advantage and disadvantage of using Sobel to detect edges?
Advantage - Simple and computationally efficient Disadvantage - Sensitive to noise
272
What are keypoints in feature detection?
Keypoints are distinct and repeatable locations in an image, such as corners or edges, that are stable under various transformations like rotation and scale.
273
What are local descriptors?
Local descriptors capture distinctive information from small regions around keypoints in an image, which are robust to changes in scale, rotation and illumination.
274
What is Thresholding in regards to grayscale images?
Thresholding converts a greyscale image into a binary image by selecting a threshold value. Pixels with intensity values above the threshold are considered as foreground objects, whereas pixels below the threshold are considered as part of the background.
275
Describe the Region Growing segmentation method
- Start with user-defined or automatic selection of seed points in the image - Expand regions by including neighbouring pixels that meet pre-defined criteria, such as intensity - Growth continues until no more similar pixels are found, resulting in segmented regions.
276
What is the formal definition of Affine Transformations?
They are a linear mapping method that preserves points, straight lines and planes. It includes transformations such as rotation, translation, scaling and shearing.
277
How could affine transformations be used in image registration?
They are used to align two images to each other by correcting geometric distortions, enabling one image to be mapped to the other
278
How does intensity-based registration work?
It utilises the pixel intensity values and compares them directly between two images, rather than relying on extracted features. It is typically achieved through minimising the sum of squared differences.
279
Why are geometric transformation techniques important for image alignment problems?
They are important because they bring different views of the same scene or object into a common coordinate system.
280
What is a basic pipeline for image alignment?
- Image acquisition & loading - Use a similarity measurement - Use a registration algorithm to find the best transformation parameters - Apply geometric transformations