Computer Vision Flashcards

1
Q

What are the key components of a computer vision system, and how do they interact with each other?

A

Input Data: This refers to the raw data that the computer vision system receives, which is typically in the form of images or videos. It can come from various sources such as cameras, sensors, or pre-recorded datasets.

Preprocessing: Before performing any analysis, the input data often requires preprocessing to enhance its quality or extract relevant information. This may involve tasks like resizing, cropping, color correction, denoising, or normalization.

Feature Extraction: In this step, meaningful features are extracted from the preprocessed data. These features can be specific patterns, textures, edges, corners, or higher-level descriptors that are representative of the objects or characteristics of interest in the image.

Object Detection/Recognition: Once the features are extracted, object detection or recognition algorithms are applied to identify and locate specific objects or patterns within the image. This may involve using techniques such as template matching, feature matching, or more advanced methods like convolutional neural networks (CNNs).

Post-processing: After object detection, post-processing steps can be applied to refine the results, filter out false positives or noise, and improve the accuracy of the system. Techniques like non-maximum suppression, clustering, or geometric constraints can be used for this purpose.

Interpretation/Decision Making: The final step involves interpreting the results obtained from the previous stages and making decisions based on the analyzed data. This can include tasks like object classification, scene understanding, behavior prediction, or any other task specific to the application domain.
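A minimal, illustrative sketch of such a pipeline using classical OpenCV operations (the file name, parameter values, and the contour-based stand-in for the detection stage are assumptions for demonstration, not part of any specific system):

```python
import cv2

def run_pipeline(image_path):
    # Input data: read the raw image (it could equally come from a camera or video stream).
    image = cv2.imread(image_path)

    # Preprocessing: resize and apply light denoising.
    image = cv2.resize(image, (640, 480))
    image = cv2.GaussianBlur(image, (3, 3), 0)

    # Feature extraction: edges as a simple hand-crafted feature.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)

    # "Detection": contours in the edge map stand in for a real detector here.
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    # Post-processing: drop tiny contours (likely noise) and keep bounding boxes.
    boxes = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 100]

    # Interpretation/decision making: here, simply return the surviving candidates.
    return boxes

boxes = run_pipeline("example.jpg")  # "example.jpg" is a placeholder path
```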

2
Q

How do you handle challenges related to image quality, such as noise, blurriness, or low resolution?

A

Handling challenges related to image quality is a crucial aspect of computer vision engineering. Here are some common strategies for addressing such challenges:

Noise Reduction: Noise in images can be caused by various factors, such as sensor limitations, compression artifacts, or environmental conditions. Techniques like median filtering, Gaussian filtering, or denoising algorithms such as BM3D can be employed to reduce noise and enhance image quality.

Image Enhancement: When dealing with blurry or low-resolution images, enhancement techniques can be applied to improve their quality. These techniques may involve deconvolution algorithms, super-resolution methods, or adaptive filtering to sharpen the image and enhance details.

Image Registration: In cases where images are misaligned or have geometric distortions, image registration techniques can be used to align them properly. This can involve algorithms like feature-based registration, intensity-based registration, or geometric transformations to align images accurately.

Illumination Normalization: Lighting conditions can significantly impact image quality and the performance of computer vision algorithms. Techniques such as histogram equalization, adaptive histogram equalization, or more advanced methods like Retinex-based algorithms can be used to normalize and enhance the illumination in images.

Upsampling and Interpolation: In situations where images have low resolution or need to be resized, upsampling and interpolation techniques can be employed. These methods use interpolation algorithms, such as bilinear or bicubic interpolation, to increase the resolution and improve the visual quality of the image.

Deep Learning-Based Approaches: Deep learning models, such as generative adversarial networks (GANs), can be utilized for image restoration and enhancement tasks. These models can learn to reconstruct high-quality images from noisy or degraded inputs, effectively addressing challenges related to image quality.

It’s important to note that the specific approach taken to handle image quality challenges depends on the nature of the problem, available resources, and the desired outcome. Computer vision engineers often experiment with different techniques and combinations to find the most suitable approach for a given scenario.
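As a concrete illustration, a short OpenCV sketch combining a few of the techniques above (the parameter values and the file name are assumptions chosen for demonstration):

```python
import cv2

img = cv2.imread("degraded.jpg")  # placeholder file name

# Noise reduction: median filter for salt-and-pepper noise,
# non-local means for general photographic noise.
median = cv2.medianBlur(img, 5)
denoised = cv2.fastNlMeansDenoisingColored(img, None, 10, 10, 7, 21)

# Illumination normalization: CLAHE (adaptive histogram equalization)
# applied to the luminance channel only.
lab = cv2.cvtColor(denoised, cv2.COLOR_BGR2LAB)
l, a, b = cv2.split(lab)
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
normalized = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)

# Upsampling: bicubic interpolation to double the resolution.
upscaled = cv2.resize(normalized, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
```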

3
Q

What are the different computer vision tasks?

A

Image Classification: This task involves assigning a label or a category to an input image. The goal is to train a model to accurately classify images into predefined classes, such as distinguishing between different objects, scenes, or patterns.

Object Detection: Object detection involves identifying and localizing multiple objects within an image. The output typically includes bounding boxes around the detected objects, along with their corresponding class labels. Object detection is widely used in applications like autonomous driving, surveillance, and robotics.

Image Segmentation: Image segmentation involves partitioning an image into multiple regions or segments based on similar visual characteristics. Each segment corresponds to a specific object or region of interest. Semantic segmentation assigns a class label to each pixel, while instance segmentation differentiates between individual instances of objects.

Object Tracking: Object tracking focuses on following the trajectory of a specific object across consecutive frames in a video or a sequence of images. It is useful in applications like surveillance, action recognition, and video analysis.

Pose Estimation: Pose estimation aims to determine the spatial position and orientation of objects or human body joints in an image or video. It is used in applications like augmented reality, robotics, and motion capture.

Image Captioning: Image captioning combines computer vision and natural language processing to generate a textual description or caption for an input image. The model needs to understand the visual content of the image and generate a coherent and relevant description.

Face Recognition: Face recognition involves identifying and verifying individuals based on their facial features. It can be used for identity verification, access control, surveillance, and personalized user experiences.

Image Generation: Image generation involves creating new images based on a given set of constraints or learned patterns. Generative models, such as Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs), are commonly used for image synthesis tasks.

Optical Character Recognition (OCR): OCR is the task of extracting text from images or documents. It involves detecting and recognizing characters in images and converting them into machine-readable text.

3D Reconstruction: 3D reconstruction aims to create a three-dimensional representation of objects or scenes from multiple images or depth information. It is used in applications such as 3D modeling, virtual reality, and medical imaging.

4
Q

What are common performance metrics for computer vision problems?

A

Accuracy/Classification Accuracy: This metric is widely used in image classification tasks and measures the percentage of correctly classified images. It is calculated as the ratio of the number of correctly predicted samples to the total number of samples.

Precision, Recall, and F1 Score: These metrics are commonly used in object detection, segmentation, and instance segmentation tasks.

    Precision: It represents the fraction of true positive detections out of all positive detections. It measures the accuracy of the model in identifying objects.

    Recall (also known as sensitivity or true positive rate): It measures the fraction of true positive detections out of all actual positive instances. It represents the ability of the model to find all relevant objects.

    F1 Score: It combines precision and recall into a single metric by taking their harmonic mean. It provides a balanced measure of both metrics and is useful when there is an uneven class distribution or when false positives and false negatives have different costs.

Intersection over Union (IoU)/Jaccard Index: IoU is used in object detection, segmentation, and instance segmentation tasks to measure the overlap between predicted and ground truth regions. It is calculated as the intersection area divided by the union area of the predicted and ground truth regions. Higher IoU values indicate better object localization or segmentation accuracy.

Mean Average Precision (mAP): mAP is a common metric in object detection tasks that measures the overall performance of the model across multiple object categories. It considers precision and recall at various detection thresholds and averages the results over different object categories.

Mean Squared Error (MSE) or Root Mean Squared Error (RMSE): These metrics are used for tasks like image denoising, super-resolution, or image reconstruction. MSE measures the average squared difference between predicted and ground truth pixel values, while RMSE takes the square root of the MSE.

Receiver Operating Characteristic (ROC) Curve and Area Under the Curve (AUC): These metrics are used for tasks involving binary classification, such as face recognition. The ROC curve plots the true positive rate against the false positive rate at different classification thresholds. The AUC summarizes the ROC curve and provides a single metric to evaluate the model’s performance.

Mean Average Precision at Different Intersection over Union (mAP@[IoU Thresholds]): This metric is used in object detection and instance segmentation tasks to evaluate the accuracy at different IoU thresholds. It provides insights into the model’s performance at different levels of object overlap.
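A small sketch showing how precision, recall, F1, and IoU are computed from raw counts and box coordinates (the example numbers are arbitrary):

```python
def precision_recall_f1(tp, fp, fn):
    # Precision = TP / (TP + FP), Recall = TP / (TP + FN),
    # F1 = harmonic mean of precision and recall.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); IoU = intersection area / union area.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, 0.667, 0.727)
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))    # 25 / 175 ≈ 0.143
```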

5
Q

What are CNNs?

A

Convolutional Neural Networks (CNNs) are a class of deep learning models that have been highly successful in various computer vision tasks, including image classification, object detection, segmentation, and more. CNNs are specifically designed to process and analyze visual data, leveraging their unique architecture and convolutional layers.

The key components of a CNN include:

  1. Convolutional Layers: These layers perform the core operation of convolving input images with learnable filters, also known as kernels or feature detectors. Convolutional operations capture local spatial patterns by sliding the filters over the input data and computing element-wise multiplications and summations. This process allows the network to learn hierarchical representations of visual features, such as edges, textures, or shapes, at different levels of abstraction.
  2. Pooling Layers: Pooling layers reduce the spatial dimensions of the feature maps produced by the convolutional layers, while retaining the essential information. Common pooling techniques include max pooling (selecting the maximum value within each pooling region) or average pooling (taking the average value). Pooling helps to downsample the feature maps, making the network more robust to variations in object location and scale.
  3. Activation Functions: Activation functions introduce non-linearity into the network, enabling it to learn complex, nonlinear relationships in the data. The Rectified Linear Unit (ReLU) is a commonly used activation function in CNNs; it sets negative values to zero and keeps positive values unchanged, which promotes sparse and efficient representations in the network.
  4. Fully Connected Layers: These layers are typically placed towards the end of the CNN architecture. They are traditional neural network layers where each neuron is connected to every neuron in the previous layer. Fully connected layers help in capturing higher-level abstractions and making class predictions based on the learned features.
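A minimal PyTorch sketch of these four components (the class name, layer sizes, and the 32x32 input are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """A tiny CNN illustrating the four component types described above."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer
            nn.ReLU(),                                   # activation function
            nn.MaxPool2d(2),                             # pooling layer
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # fully connected layer

    def forward(self, x):
        x = self.features(x)           # e.g. 3x32x32 -> 32x8x8
        x = torch.flatten(x, 1)
        return self.classifier(x)

model = SimpleCNN()
logits = model(torch.randn(4, 3, 32, 32))  # batch of 4 RGB 32x32 images
print(logits.shape)                        # torch.Size([4, 10])
```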

The training process of a CNN involves forward propagation, where input data is passed through the layers, and the network makes predictions. The predicted outputs are then compared to the ground truth labels, and an appropriate loss function, such as cross-entropy, is used to measure the discrepancy between the predicted and actual values. Backpropagation is then employed to iteratively update the network’s parameters (weights and biases) by minimizing the loss function, optimizing the network’s ability to make accurate predictions.
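A single step of that training loop, reusing the SimpleCNN sketched above with random tensors standing in for a real dataset (purely illustrative):

```python
import torch
import torch.nn as nn

images = torch.randn(4, 3, 32, 32)       # dummy batch of images
labels = torch.randint(0, 10, (4,))      # dummy ground truth labels

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

optimizer.zero_grad()
outputs = model(images)                  # forward propagation
loss = criterion(outputs, labels)        # cross-entropy vs. ground truth
loss.backward()                          # backpropagation computes gradients
optimizer.step()                         # update weights and biases
```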

CNNs excel in learning hierarchical representations of visual data, automatically extracting meaningful features at different levels of abstraction. They have shown remarkable performance in various computer vision tasks, often outperforming traditional feature engineering-based approaches. The ability to capture spatial relationships and exploit local dependencies in the data makes CNNs particularly suitable for tasks that involve understanding images, videos, or any form of grid-like data.

6
Q

What are common loss functions in computer vision problems?

A

Loss functions play a crucial role in training computer vision models by quantifying the discrepancy between the predicted outputs and the ground truth labels or targets. The choice of an appropriate loss function depends on the specific computer vision task and the nature of the data. Here are some commonly used loss functions in computer vision problems:

  1. Cross-Entropy Loss: Cross-entropy loss is widely used in image classification tasks. It measures the dissimilarity between predicted class probabilities and the true class labels. The loss is calculated as the negative logarithm of the predicted probability assigned to the correct class. It encourages the model to assign high probabilities to the correct classes and penalizes incorrect predictions.
  2. Binary Cross-Entropy Loss: Binary cross-entropy loss is used in binary classification tasks, such as object detection with foreground-background classification. It measures the dissimilarity between predicted probabilities and binary ground truth labels. It is computed as the negative logarithm of the predicted probability for the true class.
  3. Mean Squared Error (MSE) Loss: MSE loss is often used in regression tasks in computer vision, such as image denoising or super-resolution. It calculates the average squared difference between predicted values and ground truth targets. It penalizes larger errors more heavily than smaller errors.
  4. Smooth L1 Loss: Smooth L1 loss is a variation of the traditional L1 loss and is commonly used in object detection tasks, such as bounding box regression. It combines the advantages of both L1 and L2 losses by providing a smooth transition between the two. It is less sensitive to outliers compared to the traditional L1 loss.
  5. Dice Loss: Dice loss is frequently used in image segmentation tasks. It measures the overlap between predicted and ground truth segmentation masks. The loss is calculated based on the Dice coefficient, which is twice the intersection of the predicted and ground truth masks divided by the sum of their areas. Dice loss encourages better alignment between the predicted and ground truth segmentation masks.
  6. Kullback-Leibler (KL) Divergence: KL divergence is used in tasks like image generation, where the goal is to approximate a given distribution. It measures the difference between the predicted probability distribution and the target distribution. KL divergence encourages the model to match the target distribution.
  7. Adversarial Loss: Adversarial loss, commonly used in generative models like Generative Adversarial Networks (GANs), quantifies the discrepancy between the generator’s output and the real data distribution. It is calculated based on the ability of a discriminator network to classify generated samples as real or fake.
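A short PyTorch sketch of a few of these losses (the dice_loss helper and all tensor shapes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

# Cross-entropy for classification: logits vs. integer class labels.
logits = torch.randn(4, 10)              # batch of 4, 10 classes
labels = torch.randint(0, 10, (4,))
ce = F.cross_entropy(logits, labels)

# MSE for regression-style tasks such as denoising or super-resolution.
pred, target = torch.rand(4, 3, 32, 32), torch.rand(4, 3, 32, 32)
mse = F.mse_loss(pred, target)

def dice_loss(pred_mask, true_mask, eps=1e-6):
    # Dice loss = 1 - 2|A ∩ B| / (|A| + |B|); pred_mask holds probabilities in [0, 1].
    inter = (pred_mask * true_mask).sum()
    return 1 - (2 * inter + eps) / (pred_mask.sum() + true_mask.sum() + eps)

dice = dice_loss(torch.rand(4, 1, 32, 32),
                 torch.randint(0, 2, (4, 1, 32, 32)).float())
print(ce.item(), mse.item(), dice.item())
```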
7
Q

Why can't a standard CNN followed by a fully connected layer be used directly for object detection?

A

The major reason you cannot solve this problem by building a standard convolutional network followed by a fully connected layer is that the length of the output layer is variable, not constant: the number of occurrences of the objects of interest is not fixed, so a fixed-size output layer cannot represent every possible set of detections.

8
Q

Object Detection: Naive Approach

A

A naive approach to this problem would be to take different regions of interest from the image and use a CNN to classify the presence of the object within each region. The problem with this approach is that the objects of interest might have different spatial locations within the image and different aspect ratios, so you would have to select a huge number of regions, which quickly becomes computationally prohibitive. Algorithms like R-CNN and YOLO have therefore been developed to find these occurrences, and to find them fast.
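A rough sketch of this sliding-window idea (the classifier callable, window size, scales, and threshold are all hypothetical), which makes the cost problem obvious:

```python
import cv2

def sliding_window_detect(image, classifier, win=64, stride=32):
    """Naive detection: classify every window at several scales (illustrative only)."""
    detections = []
    for scale in (1.0, 0.75, 0.5):                      # crude multi-scale search
        resized = cv2.resize(image, None, fx=scale, fy=scale)
        h, w = resized.shape[:2]
        for y in range(0, h - win + 1, stride):
            for x in range(0, w - win + 1, stride):
                patch = resized[y:y + win, x:x + win]
                score = classifier(patch)               # hypothetical CNN forward pass per window
                if score > 0.5:
                    detections.append((int(x / scale), int(y / scale), win, score))
    # The number of classifier calls explodes with image size, scales,
    # and aspect ratios, which is exactly the weakness described above.
    return detections
```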

9
Q

Object Detection: R-CNN

A

To bypass the problem of selecting a huge number of regions, Ross Girshick et al. proposed a method that uses selective search to extract just 2000 regions from the image, which they called region proposals. Instead of trying to classify a huge number of regions, you therefore only need to work with 2000. These 2000 region proposals are generated by the selective search algorithm.

Algorithm
These 2000 candidate region proposals are warped into a square and fed into a convolutional neural network that produces a 4096-dimensional feature vector as output. The CNN acts as a feature extractor, and the extracted features are fed into an SVM to classify the presence of the object within each candidate region proposal. In addition to predicting the presence of an object within the region proposals, the algorithm also predicts four offset values to increase the precision of the bounding box. For example, given a region proposal, the algorithm might predict the presence of a person while the person's face within that proposal is cut in half; the offset values help adjust the bounding box of the region proposal.
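A simplified structural sketch of this pipeline (every callable here is a hypothetical stand-in, and the offsets are applied naively rather than with the paper's exact regression parameterization):

```python
import cv2

def rcnn_detect(image, selective_search, cnn, svm, bbox_regressor, top_k=2000):
    """Structure of the R-CNN pipeline; all callables are hypothetical stand-ins."""
    proposals = selective_search(image)[:top_k]                  # ~2000 region proposals
    detections = []
    for (x, y, w, h) in proposals:
        crop = cv2.resize(image[y:y + h, x:x + w], (227, 227))   # warp region to a fixed size
        features = cnn(crop)                                     # 4096-dimensional feature vector
        label, score = svm(features)                             # classify the region
        dx, dy, dw, dh = bbox_regressor(features)                # offsets refining the box
        if label != "background":
            detections.append((x + dx, y + dy, w + dw, h + dh, label, score))
    return detections
```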

Problems with R-CNN

  • It still takes a huge amount of time to train the network as you would have to classify 2000 region proposals per image.
  • It cannot be run in real time, as it takes around 47 seconds per test image.
  • The selective search algorithm is a fixed algorithm. Therefore, no learning is happening at that stage. This could lead to the generation of bad candidate region proposals.
10
Q

Object Detection: Fast R-CNN

A

The approach is similar to the R-CNN algorithm, but instead of feeding the region proposals to the CNN, we feed the input image to the CNN to generate a convolutional feature map. From the convolutional feature map, we identify the region proposals, warp them into squares, and use an RoI pooling layer to reshape them into a fixed size so that they can be fed into a fully connected layer. From the RoI feature vector, we use a softmax layer to predict the class of the proposed region along with the offset values for the bounding box.

The reason Fast R-CNN is faster than R-CNN is that you don’t have to feed 2000 region proposals to the convolutional neural network every time. Instead, the convolution operation is done only once per image, and a feature map is generated from it.
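A small torchvision sketch of RoI pooling, the piece that turns arbitrarily sized proposals into fixed-size feature maps (the feature-map size and the boxes are made-up values):

```python
import torch
from torchvision.ops import roi_pool

# Feature map for one image (batch of 1), e.g. the output of a backbone CNN.
feature_map = torch.randn(1, 256, 50, 50)

# Two region proposals in feature-map coordinates, as a list of (x1, y1, x2, y2) boxes.
proposals = [torch.tensor([[0.0, 0.0, 20.0, 20.0],
                           [10.0, 10.0, 45.0, 30.0]])]

# RoI pooling reshapes every proposal, whatever its size, to a fixed 7x7 grid,
# so the result can be flattened and fed to fully connected layers.
pooled = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1.0)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])
```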

11
Q

Object Detection: Faster R-CNN

A

Both of the above algorithms (R-CNN and Fast R-CNN) use selective search to find the region proposals. Selective search is a slow and time-consuming process that affects the performance of the network. Therefore, Shaoqing Ren et al. came up with an object detection algorithm that eliminates the selective search algorithm and lets the network learn the region proposals.

Similar to Fast R-CNN, the image is provided as input to a convolutional network, which produces a convolutional feature map. Instead of using the selective search algorithm on the feature map to identify the region proposals, a separate network (the region proposal network) is used to predict them. The predicted region proposals are then reshaped using an RoI pooling layer, which is then used to classify the image within the proposed region and predict the offset values for the bounding boxes.
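For reference, torchvision ships a pretrained Faster R-CNN whose built-in region proposal network replaces selective search; a minimal usage sketch (assuming a recent torchvision; older versions take pretrained=True instead of weights):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Pretrained Faster R-CNN with a ResNet-50 FPN backbone; the RPN is part of the model.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)            # a dummy RGB image with values in [0, 1]
with torch.no_grad():
    output = model([image])[0]             # one dict per input image

print(output["boxes"].shape, output["labels"].shape, output["scores"].shape)
```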

12
Q

Object Detection: You Only Look Once (YOLO)

A

All of the previous object detection algorithms use regions to localize the object within the image. The network does not look at the complete image; instead, it looks at parts of the image that have high probabilities of containing the object. YOLO, or You Only Look Once, is an object detection algorithm quite different from the region-based algorithms above: a single convolutional network predicts the bounding boxes and the class probabilities for those boxes.

How YOLO works is that we take an image and split it into an SxS grid, and within each grid cell we take m bounding boxes. For each bounding box, the network outputs class probabilities and offset values for the box. Bounding boxes with a class probability above a threshold value are selected and used to locate the object within the image.
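A decoding sketch for a YOLO-style output tensor (the values S=7, B=2, C=20 follow the original YOLO layout, but the threshold is illustrative and non-maximum suppression is omitted):

```python
import torch

# Per cell the network predicts B boxes of (x, y, w, h, conf) plus C class scores.
S, B, C = 7, 2, 20
pred = torch.rand(S, S, B * 5 + C)   # stand-in for a real network output

boxes = []
for i in range(S):
    for j in range(S):
        cell = pred[i, j]
        class_probs = cell[B * 5:]
        for b in range(B):
            x, y, w, h, conf = cell[b * 5:b * 5 + 5]
            score = conf * class_probs.max()          # box confidence * best class probability
            if score > 0.5:                           # keep only confident boxes
                # (x, y) are offsets inside cell (i, j); convert to image-relative coordinates.
                boxes.append(((j + x.item()) / S, (i + y.item()) / S,
                              w.item(), h.item(), score.item()))
# Non-maximum suppression would normally follow to remove duplicate boxes.
```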

YOLO is orders of magnitude faster (45 frames per second) than other object detection algorithms. The limitation of the YOLO algorithm is that it struggles with small objects within the image; for example, it might have difficulty detecting a flock of birds. This is due to the spatial constraints of the algorithm.
