Quiz 3 Flashcards

1
Q

As you add more convolution + pooling layers, what does each pixel represent?

A

Each pixel of a deep layer represents a larger receptive field from a previous layer/input.

2
Q

ImageNet

A

1.2 million images, 1000 classes.

3
Q

Type of errors: Optimization error

A

Failing to find good weights to model a function
(e.g. because of a bad optimization algorithm)

4
Q

Type of errors: Estimation error

A

Training error is minimized, but the model doesn’t generalize to the test set.
(Overfitting, learning features that don’t generalize well)

5
Q

Type of errors: Modeling error

A

Given a simple model, no set of weights can model the real-world task.

6
Q

Type of errors: Case study of multi-class logistic regression (MCLR) vs AlexNet

Which has high modeling error?

A

MCLR has high modeling error because the model is very simple. It just can’t model the complexity of the real world.

7
Q

Type of errors: Case study of multi-class logistic regression vs AlexNet

What kind of errors would AlexNet have, and why?

A

AlexNet may have a smaller modeling error than MCLR, but the same degree of estimation error could occur.

Possibly higher optimization error because a complex architecture is harder to optimize.

8
Q

Key idea of transfer learning

A

Reuse features learned on large dataset to learn new things

9
Q

Describe transfer learning in 3 steps

A
  1. Train on large-scale dataset (may be provided for you)
  2. Take custom data and initialize the network with weights trained in step 1
  3. Continue to train on new dataset.
10
Q

Limitations of transfer learning

A

Won’t work well if the target task is very different (e.g. applying a model pretrained to classify natural images to sketches)

11
Q

Benefit of transfer learning

A

Significantly reduces amount of labeled data needed to accomplish a task

12
Q

Using a larger capacity model will always reduce estimation error

A

False. A larger-capacity model can overfit more easily, so without regularization the estimation error could increase.

13
Q

Transfer learning: Example of what network changes you may need to make from a pretrained model to your own

A

Replace the last layer with a new fully connected layer that has one output node per new category

14
Q

Transfer learning: Ways to train from pretrained model’s weights

A
  1. Update all parameters
  2. Freeze parts of the network (e.g. only tune fully connected layers)
15
Q

Transfer learning: Why would you want to “freeze” parts of your network

A

Reduces the number of parameters that you need to learn given your new data set.

(If you don’t have enough data, you may not be able to fine-tune all the features in your network)

16
Q

Transfer learning: T/F - If you have a large data set for a target domain, training from random initialization may result in faster convergence

A

True

17
Q

Transfer learning: Explain the three data regimes with respect to data set size and generalization error

A
  1. Small data region - not enough data, hard to reduce error
  2. Power-law region - more training data keeps reducing error, following a power law (a straight line on a log-log plot)
  3. Irreducible error region - useful data saturated to point of irreducible error
18
Q

Modern networks: What was the key innovation introduced by AlexNet that made it a breakthrough in deep learning?

A

ReLU activation

19
Q

Modern networks: Which one of these architectures is known for its simplicity with a focus on using only 3x3 convolutional filters?

A

VGGNet used 3x3 convolutional filters exclusively

20
Q

Modern networks: Which architecture introduced the concept of residual learning, addressing the vanishing gradient problem and allowing the training of very deep networks?

A

ResNet introduced the concept of residual learning, where shortcut connections (or skip connections) were added to the network, allowing the gradient to flow more directly during training, thus addressing the vanishing gradient problem.

21
Q

Modern networks: Which architecture uses inception modules? Explain what they are

A

InceptionNet.

Uses multiple filter sizes in parallel to capture different features

22
Q

Modern networks: Which architecture was known for removing FC layers at the end of the network? What did it replace it with?

A

ResNet

Used global average pooling instead of FC layers. Global average pooling reduces overfitting and the total number of parameters in the network.

23
Q

CNN: During forward propagation in a convolutional layer, what operation(s) is performed between the input and the kernel?

A

element-wise multiplication and summation

24
Q

CNN: What is the purpose of backpropagation in the context of convolutional layers?

A

To compute the gradients for the kernel/filter

25
Q

CNN: During backpropagation in a convolutional layer, what operation is performed to compute the gradients for the kernel?

A

Element-wise multiplication between the gradient of the loss with respect to the output and the corresponding input values, then summed.

26
Q

CNN: What is the purpose of padding in a CNN?

A

To preserve spatial dimensions. Otherwise the feature maps in deeper layers become smaller and smaller.

27
Q

CNN: Valid padding vs same padding

A

Valid: No padding, window always within input image
Same: Padding added to keep output size equal to input

28
Q

CNN: Why use max-pooling

A

Reduces spatial dimensions through downsampling. Adds invariance to translation of features.

29
Q

CNN: Invariance

A

Property where a model is robust to certain transformations in the input.

Practically, this explains how a CNN may be able to classify an object in an image regardless of where in the image it is located.

30
Q

CNN: Equivariance

A

Property where a model can maintain the relationship between different elements after a transformation occurs (e.g. scaling, rotation, time shift)

31
Q

CNN: How is invariance achieved by CNNs

A

Shared weights and bias

32
Q

CNN: Equivariance

A

Convolution layers maintain spatial relationships between features.

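E.g. If a feature shifts in the input image, its response in the convolution output shifts by the same amount. (Standard convolutions are equivariant to translation, not to rotation.)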

33
Q

CNN: CNN vs FC - which has higher memory usage

A

CNN

34
Q

CNN: CNN vs FC - which has more parameters

A

FC

35
Q

CNN: How to calculate gradient of a kernel during backwards pass

A

Multiply downstream gradient elements into corresponding receptive field. Then add all the receptive fields together.
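
The rule above can be sketched in plain Python. This is a minimal illustration that assumes stride 1 and no padding; the function name `kernel_gradient` and the all-ones upstream gradient are hypothetical, chosen only for the example.

```python
def kernel_gradient(x, dout, k):
    """Gradient of the loss w.r.t. a k x k kernel (stride 1, no padding).

    x    : input image as a list of rows
    dout : upstream gradient, one value per output position
    """
    grad = [[0.0] * k for _ in range(k)]
    for i, row in enumerate(dout):        # each output position (i, j)
        for j, g in enumerate(row):
            for a in range(k):            # walk its receptive field in x
                for b in range(k):
                    # scale the receptive field by g and accumulate
                    grad[a][b] += g * x[i + a][j + b]
    return grad


x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
dout = [[1, 1], [1, 1]]  # hypothetical upstream gradient of all ones
print(kernel_gradient(x, dout, 2))  # [[12.0, 16.0], [24.0, 28.0]]
```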

36
Q

CNN: Given a 3x3 kernel, the top-left cell’s kernel weight affects all pixels in the image

A

True

37
Q

CNN: T/F - Given a constant kernel size, adding more layers increases the receptive field exponentially.

A

False. Adding more convolutional layers increases the receptive field size linearly: with stride 1, each extra layer grows the receptive field by the kernel size minus one.
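
A quick way to see the linear growth, assuming stride-1 convolutions with no pooling, where each layer adds (kernel size - 1) to the receptive field. The helper name `receptive_field` is made up for this sketch.

```python
def receptive_field(num_layers, kernel_size):
    """Receptive field of one output pixel after stacking stride-1 conv layers."""
    rf = 1  # a single input pixel before any layers
    for _ in range(num_layers):
        rf += kernel_size - 1  # each layer widens the field by k - 1
    return rf


for n in range(1, 5):
    print(n, receptive_field(n, 3))  # 3, 5, 7, 9: linear in the layer count
```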

38
Q

How to visualize FC layer

A

Reshape weights for a node back into size of image

39
Q

How to visualize CNN layer

A

For each kernel, scale its values to the range 0-255 and visualize it as an image. Each kernel becomes a feature map.

40
Q

t-SNE

A

Performs a non-linear mapping of high-dimensional data to 2D space. Aims to preserve pairwise similarities, so points that are close in the original space stay close in the map.

41
Q

What can a visualization output (aka activation/filter) map show with respect to the input?

A

Given an input image and a convolution kernel in the network, we can view which area of the image produced the highest activation for that kernel.

42
Q

Why can visualization interpretability be difficult?

A
  1. No intrinsic measure of utility. Need user studies to measure usefulness of visualization.
  2. Neural networks learn “distributed representation” - 1:1 mapping of node to feature not guaranteed.
43
Q

Gradient ascent

A

Updates the input in the direction of the gradient (rather than opposite in gradient descent)
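
A toy sketch of the update rule, using a one-dimensional function in place of a network score. The function f(x) = -(x - 3)^2, the learning rate, and the step count are assumptions picked for illustration.

```python
def gradient_ascent(grad_fn, x, lr=0.1, steps=100):
    """Repeatedly move x in the direction of the gradient to increase f(x)."""
    for _ in range(steps):
        x = x + lr * grad_fn(x)  # gradient descent would subtract instead
    return x


# f(x) = -(x - 3)^2 has gradient -2 * (x - 3) and its maximum at x = 3.
x_star = gradient_ascent(lambda x: -2.0 * (x - 3.0), x=0.0)
print(round(x_star, 6))  # 3.0
```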

44
Q

Guided backprop

A

Zeroes out negative gradients in the backward pass, in addition to the usual ReLU masking from the forward pass.

Improves visualization by only keeping positive gradients.

45
Q

Saliency map

A

Visualizes area of the image with high gradients

46
Q

How to use saliency map for bias

A

See which area of the image the network focused on (e.g. a classifier that looks at the snow rather than the dog when predicting "wolf")

47
Q

Grad-CAM

A

Generates heat maps highlighting regions of an input image that contribute the most to a specific class prediction

48
Q

Grad-CAM - How does it work?

A

Computes gradient of the target class score with respect to the feature maps of the last convolutional layer. Reweight feature maps per channel and apply ReLU.

49
Q

Difference between Grad-CAM and Guided Grad-CAM

A

Guided Grad-CAM multiplies the guided backprop output element-wise with the Grad-CAM heat map.

50
Q

One practical use of gradient ascent

A

Class visualization

51
Q

White-box attacks

A

Attacker has complete picture of the target model (network, params, data)

52
Q

Black-box attacks

A

Attacker has limited or no picture of the target model.
Generally uses trial-and-error attempts to craft adversarial examples.

53
Q

Key idea from Geirhos, “ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness”

A

CNNs tend to be more biased towards texture than shape. Remediating this bias improves accuracy and robustness.

54
Q

Losses in style transfer

A

Style-loss function - minimize squared diff between gram matrices
Content-loss function - match features of the content image and the generated image

55
Q

Gram matrix

A

Square matrix that represents relationships between vectors.
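
For style transfer, the vectors are typically the flattened feature maps of one layer, one per channel. A minimal sketch under that assumption (the function name and the tiny feature values are hypothetical):

```python
def gram_matrix(features):
    """features: list of C vectors, one flattened feature map per channel.

    Entry (i, j) is the dot product of channel i with channel j.
    """
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features]
            for fi in features]


feats = [[1, 0, 2], [0, 1, 1]]  # two channels, three spatial positions
print(gram_matrix(feats))  # [[5, 2], [2, 2]]
```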

56
Q

Difference of segmentation networks vs classification

A

Segmentation networks predict a class for each pixel instead of a single label for the whole image.

57
Q

Encoder-decoder CNN architecture - key idea

A

The decoder is symmetrical to the encoder. It takes small feature maps and upsamples them back to the original image size.

58
Q

Max unpooling

A

Puts the max output value back into its original position in the receptive field when decoding. Non-max pixels are left as zero.
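
A small sketch of pooling with remembered positions followed by unpooling. It assumes non-overlapping windows and input dimensions divisible by the window size; both function names are made up for the example.

```python
def max_pool_with_indices(x, size=2):
    """Non-overlapping max pooling that also records where each max came from."""
    pooled, indices = [], []
    for i in range(0, len(x), size):
        prow, irow = [], []
        for j in range(0, len(x[0]), size):
            cells = [(x[i + a][j + b], (i + a, j + b))
                     for a in range(size) for b in range(size)]
            value, pos = max(cells, key=lambda c: c[0])
            prow.append(value)
            irow.append(pos)
        pooled.append(prow)
        indices.append(irow)
    return pooled, indices


def max_unpool(pooled, indices, height, width):
    """Place each pooled value at its recorded position; everything else is 0."""
    out = [[0] * width for _ in range(height)]
    for prow, irow in zip(pooled, indices):
        for value, (r, c) in zip(prow, irow):
            out[r][c] += value
    return out


x = [[1, 3, 2, 0], [4, 2, 1, 5], [0, 1, 9, 2], [6, 0, 3, 4]]
pooled, idx = max_pool_with_indices(x)
print(pooled)                        # [[4, 5], [6, 9]]
print(max_unpool(pooled, idx, 4, 4))
```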

59
Q

What do max unpooling and deconvolution do with overlapping windows

A

Sums them

60
Q

Deconvolution (transposed convolution)

A

Each pixel in the input is multiplied by all of the kernel's values, and the scaled kernel is then “stamped” onto the output.
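
The "stamping" view can be sketched directly. This is an illustrative single-channel version with no padding; the function name `transposed_conv2d` is an assumption, and overlapping stamps are summed.

```python
def transposed_conv2d(x, kernel, stride=1):
    """Stamp a scaled copy of the kernel into the output for every input pixel."""
    k = len(kernel)
    out_h = (len(x) - 1) * stride + k
    out_w = (len(x[0]) - 1) * stride + k
    out = [[0] * out_w for _ in range(out_h)]
    for i, row in enumerate(x):
        for j, v in enumerate(row):
            for a in range(k):
                for b in range(k):
                    # overlapping stamps accumulate by summation
                    out[i * stride + a][j * stride + b] += v * kernel[a][b]
    return out


x = [[1, 2], [3, 4]]
kernel = [[1, 1], [1, 1]]
print(transposed_conv2d(x, kernel))  # [[1, 3, 2], [4, 10, 6], [3, 7, 4]]
```

With stride 2 the same 2x2 stamps no longer overlap, so the 2x2 input expands to a 4x4 output of four untouched copies of the scaled kernel.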

61
Q

U-net

A

Uses skip connections like ResNet but in an encoder-decoder network.

62
Q

Single-stage object detection

A

Task of identifying and setting a bounding box for an identified object.

63
Q

Single-stage object detection - what are its losses?

A

Cross-entropy loss for classification + Mean squared error for bounding box

64
Q

Multi-headed architecture

A

When an architecture performs several tasks with shared features.

65
Q

Single-shot detector (SSD) - key idea

A

Uses a grid and, for each grid cell, predicts K bounding boxes. Estimates refined boxes across multiple layers. Selects the box with the highest confidence score among a group of overlapping boxes for an object.

66
Q

YOLO - what makes it unique

A

Predict bounding box + classification in a single pass.
Special loss function to minimize both errors at once.

67
Q

Mean average precision in the context of bounding boxes

A

Take the intersection of the predicted and ground-truth bounding boxes and divide it by their union (IoU) to measure quality of fit. Compute the precision/recall curve for each class, take its average precision, then average over all classes.
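
The intersection-over-union step can be sketched as follows, assuming boxes are given as (x1, y1, x2, y2) corner coordinates with x1 < x2 and y1 < y2. The function name `iou` and the sample boxes are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # corners of the overlapping rectangle, if any
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union


pred, truth = (0, 0, 2, 2), (1, 1, 3, 3)
print(iou(pred, truth))  # overlap 1, union 7
```

A prediction usually counts as a true positive only when its IoU with a ground-truth box exceeds a chosen threshold, which is what feeds the precision/recall curve.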

68
Q

Two-stage object detection

A

Step 1 - determine regions of interest
Step 2 - classify those regions

69
Q

One way two-stage object detection finds candidate regions, though it is slow.

A

Unsupervised learning

70
Q

Fast R-CNN - key idea

A

Use bounding boxes within feature maps, then map to input image.

71
Q

Fast R-CNN - what is its benefit

A

Reuses computation

72
Q

ROI Pooling - key idea

A

Applies a fixed grid to the feature map and applies max pooling to each cell in the grid with respect to the corresponding feature map.

73
Q

ROI Pooling - what is its benefit

A

Can backpropagate

74
Q

Faster R-CNN - key idea

A

Uses a region proposal network (RPN) to generate candidate regions. Take top-K and classify.

75
Q

Mask R-CNN - Key Idea

A

Applies a mask to each box to detect which pixels belong to the object

76
Q

Given an input image
1 2 3
4 5 6
7 8 9

and filter:
1 0
0 -1

Compute the forward operation

A

For the top-left element of the output:
(1*1) + (2*0) + (4*0) + (5*(-1))
Result: -4

For the top-right element of the output:
(2*1) + (3*0) + (5*0) + (6*(-1))
Result: -4

For the bottom-left element of the output:
(4*1) + (5*0) + (7*0) + (8*(-1))
Result: -4

For the bottom-right element of the output:
(5*1) + (6*0) + (8*0) + (9*(-1))
Result: -4

The resulting 2x2 output matrix:
-4 -4
-4 -4
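
The forward operation for this card can be checked with a short sketch. It assumes the deep-learning convention of cross-correlation (the kernel is not flipped), no padding, and a hypothetical helper name `conv2d_valid`.

```python
def conv2d_valid(x, kernel, stride=1):
    """Slide the kernel over x; multiply element-wise and sum at each position."""
    k = len(kernel)
    out_h = (len(x) - k) // stride + 1
    out_w = (len(x[0]) - k) // stride + 1
    return [[sum(x[i * stride + a][j * stride + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(out_w)]
            for i in range(out_h)]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]
print(conv2d_valid(image, kernel))  # [[-4, -4], [-4, -4]]
```

Every window gives -4 because the kernel subtracts the bottom-right value from the top-left one, and those two always differ by 4 in this image.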

77
Q

Given a gradient:
1 2 3
4 5 6
7 8 9

and filter:
1 0 -1
2 0 -2
1 0 -1

Compute gradient with respect to the filter for the top-left element of the gradient (dL/d(1))

A

dL/d(1) = (1*1) + (2*2) + (3*3) + (4*4) + (5*5) + (6*6) + (7*7) + (8*8) + (9*9)

= 285

78
Q

Given a forward pass of stride 2 of a 4x4 input and a 2x2 kernel, what is:

The output shape

A

Output size = (4 - 2) / 2 + 1 = 2, so the output is 2x2.
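
The same formula as a small helper, extended with an optional padding term. The name `conv_output_size` is hypothetical, and floor division is assumed for sizes that do not divide evenly.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output length along one dimension for input n and kernel k."""
    return (n + 2 * padding - k) // stride + 1


print(conv_output_size(4, 2, stride=2))              # 2, so a 2x2 output
print(conv_output_size(32, 5, stride=1, padding=2))  # 32, size preserved
```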

79
Q

Given a forward pass of stride 2 of a 4x4 input and a 2x2 kernel, what is the shape of:

Downstream gradient

A

2x2 (same as output)

80
Q

Given a forward pass of stride 2 of a 4x4 input and a 2x2 kernel, what is:

Gradient wrt kernel

A

2x2 (same shape as kernel)

81
Q

Given a backward pass with stride 1 of a 4x4 input, 2x2 kernel and 2x2 gradient, what is the shape of:

Gradient wrt input

A

4x4

82
Q

Given:
Input - 32x32x3
Kernel: 5x5
Padding: 2
Stride: 1
Number of filters: 10

What is the parameter size?

A

760

Formula = (Channels * Kernel * Kernel + Bias) * Filters
= (3 * 5 * 5 +1) * 10
= 760
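
The formula can be wrapped in a helper to check other layer shapes. The function name `conv_param_count` is an assumption for this sketch; note that padding and stride do not affect the count.

```python
def conv_param_count(in_channels, kernel_size, num_filters, bias=True):
    """Learnable parameters in one conv layer with square kernels."""
    per_filter = in_channels * kernel_size * kernel_size + (1 if bias else 0)
    return per_filter * num_filters


print(conv_param_count(3, 5, 10))  # 760
```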

83
Q

Given an input (28x28x3) what is the memory requirement

A

2352

Memory requirement is the product of the input's height, width, and channels: 28 * 28 * 3 = 2352 values.