Deep Learning for Computer Vision Flashcards
What should the number of biases be equal to?
The number of output neurons
How do we count the number of layers in a neural network?
- typically only count the layers that have learnable parameters, e.g. convolution layers and fully connected layers
- pooling and ReLU layers, for example, have no parameters so are not counted; BatchNorm layers are conventionally not counted either (even though they do carry learnable scale/shift parameters); see the counting sketch below
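A minimal sketch of this counting convention, assuming PyTorch (the layer sizes are purely illustrative):

```python
import torch.nn as nn

# Illustrative model: only the Conv2d and Linear layers carry weights.
model = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),   # counted
    nn.ReLU(),                        # not counted
    nn.MaxPool2d(2),                  # not counted
    nn.Conv2d(6, 16, kernel_size=5),  # counted
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 10),        # counted
)

n_layers = sum(isinstance(m, (nn.Conv2d, nn.Linear)) for m in model.modules())
print(n_layers)  # 3
```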
What is the formula for calculating the number of parameters for a convolutional layer:
For an input image of 32x32 with convolutional layer of 5x5x6 filters, what is the number of parameters used?
(filter_width * filter_height * input_depth + 1 for bias) * number of filters
(5 * 5 * 1 + 1) * 6 = 156
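A quick check of this formula, assuming PyTorch:

```python
import torch.nn as nn

# 5x5 filters over a single-channel (grayscale) 32x32 input, 6 output filters.
conv = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
n_params = sum(p.numel() for p in conv.parameters())
print(n_params)  # (5*5*1 + 1) * 6 = 156
```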
What is the formula for calculating the number of parameters for a fully connected layer:
For 120 input nodes with 84 output nodes, what is the number of parameters used?
(Input nodes + 1 for bias) * output nodes
(120 + 1) * 84 = 10164
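The same check for a fully connected layer, again assuming PyTorch:

```python
import torch.nn as nn

fc = nn.Linear(in_features=120, out_features=84)
n_params = sum(p.numel() for p in fc.parameters())
print(n_params)  # (120 + 1) * 84 = 10164
```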
How many layers does the LeNet-5 model have?
5:
3 convolution layers and 2 fully connected layers
How many layers does the AlexNet model have?
8:
5 convolution layers and 3 fully connected layers
How many layers does the VGG-16 model have?
16:
13 convolution layers and 3 fully connected layers
Why do we decrease the feature map size and increase the depth of the channels/filters for each stage?
Shrinking the spatial size while increasing the channel depth preserves the content in a more compact form: spatial detail is traded for a richer set of increasingly abstract features, so information is retained as the feature maps shrink.
What is the advantage of using small filters/kernel size?
Stacking two 3x3 conv (stride 1) layers gives the same receptive field as one 5x5 conv layer; however, the stacked smaller filters use fewer parameters for the convolution and hence save memory.
What is the difference in parameter size by using a 3x3 filter rather than a 5x5 filter?
2 * (3^2) = 18 vs. 5^2 = 25 (ignoring bias and channel depth). So we save memory by using the smaller filters; see the sketch below for the comparison with channels included.
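A sketch of the same comparison with channel depth included, assuming PyTorch (the channel count C is illustrative):

```python
import torch.nn as nn

C = 64  # illustrative number of input/output channels
stacked = nn.Sequential(nn.Conv2d(C, C, 3, padding=1),
                        nn.Conv2d(C, C, 3, padding=1))
single = nn.Conv2d(C, C, 5, padding=2)

# Count weights only (biases are 1-D tensors, so p.dim() > 1 skips them).
count = lambda m: sum(p.numel() for p in m.parameters() if p.dim() > 1)
print(count(stacked), count(single))  # 2*(3*3)*C*C = 73728 vs. (5*5)*C*C = 102400
```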
Why use a deeper neural network over a shallow one?
- With good regularisation they generalise better on unseen data
- They’re able to learn more complex features
How does VGG-Net differ from AlexNet and LeNet?
VGG-Net is much deeper and uses only small, stacked 3x3 convolution filters throughout, rather than the larger filters used by AlexNet and LeNet.
How many layers does GoogLeNet have and what makes it different to other models?
It has 22 layers. It uses efficient building blocks such as the inception module:
- 12 times fewer parameters than AlexNet
What are auxiliary classifiers?
Additional classifiers attached at earlier layers of the network:
- help mitigate the vanishing gradient problem by injecting gradient signal into earlier layers
- act as a form of regularisation: earlier layers receive a direct training signal, so their parameters can be updated to be more accurate
How many auxiliary classifiers does GoogLeNet use?
- 2 auxiliary classifiers (plus the main classifier at the end of the network)
How does an inception module work?
- uses parallel convolution and pooling branches
- extracts features across multiple scales and concatenates them into a single output
- keeps computational cost manageable (a simplified sketch follows below)
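A simplified sketch of a dimension-reduction inception module, assuming PyTorch; the branch widths are illustrative, not GoogLeNet's exact configuration:

```python
import torch
import torch.nn as nn

class Inception(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 16, 1)                        # 1x1 branch
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),         # 1x1 reduce
                                nn.Conv2d(16, 24, 3, padding=1)) # then 3x3
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),         # 1x1 reduce
                                nn.Conv2d(16, 24, 5, padding=2)) # then 5x5
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 16, 1))         # pool projection

    def forward(self, x):
        # Run the branches in parallel and concatenate along the channel
        # axis, combining features at several scales into one output.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

out = Inception(32)(torch.randn(1, 32, 28, 28))
print(out.shape)  # torch.Size([1, 80, 28, 28])
```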
What are the two types of inception module:
Naive and dimension reduction
Very deep CNNs are prone to degradation; what can be used to avoid/handle this?
Using Residual blocks
What does ResNet use that makes it different from other NNs?
Residual blocks
What are residual blocks?
- blocks that connect the input feature map to the output using an element-wise sum (a skip connection); see the sketch below
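A minimal sketch of a residual block, assuming PyTorch (layer sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        out = self.conv2(F.relu(self.conv1(x)))
        # Element-wise sum of the input with the conv path (skip connection).
        return F.relu(out + x)

y = ResidualBlock(64)(torch.randn(1, 64, 8, 8))
print(y.shape)  # torch.Size([1, 64, 8, 8])
```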
What do residual networks solve?
The vanishing gradient problem: the skip connections let information propagate forward to the final layer and let gradients flow directly back to earlier layers.
What is DenseNet’s key feature?
- Dense connectivity
What is dense connectivity?
- within a dense block, each layer is connected to every subsequent layer via channel-wise concatenation, ensuring maximum feature reuse and good gradient flow (see the sketch below)
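A minimal sketch of a dense block, assuming PyTorch; the growth rate and number of layers are illustrative:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_ch, growth=12, n_layers=3):
        super().__init__()
        # Layer i sees the input plus the outputs of all i earlier layers.
        self.layers = nn.ModuleList(
            nn.Conv2d(in_ch + i * growth, growth, 3, padding=1)
            for i in range(n_layers))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Concatenate everything seen so far along the channel axis.
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

out = DenseBlock(16)(torch.randn(1, 16, 8, 8))
print(out.shape)  # torch.Size([1, 52, 8, 8]), i.e. 16 + 3*12 channels
```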
Why use dense connectivity over ResNet blocks?
- DenseNet uses far fewer parameters