Quiz 3 Flashcards

1
Q

As you add more convolution + pooling layers, what does each pixel represent?

A

Each pixel of a deep layer represents a larger receptive field from a previous layer/input.

2
Q

ImageNet

A

About 1.2 million training images across 1000 classes (the ILSVRC subset).

3
Q

Type of errors: Optimization error

A

Failing to find good weights to model the function, even though the model could represent it
(e.g. a bad optimization algorithm)

4
Q

Type of errors: Estimation error

A

The model minimizes training error but doesn’t generalize to the test set.
(Overfitting: learning features that don’t generalize well)

5
Q

Type of errors: Modeling error

A

Given a model that is too simple, no set of weights can model the real-world task.

6
Q

Type of errors: Case study of multi-class logistic regression (MCLR) vs AlexNet

Which has high modeling error?

A

MCLR has high modeling error because the model is very simple; it can’t capture the complexity of the real world.

7
Q

Type of errors: Case study of multi-class logistic regression vs AlexNet

What kind of errors would AlexNet have, and why?

A

AlexNet may have a smaller modeling error than MCLR, but a similar degree of estimation error could occur.

Possibly higher optimization error because a complex architecture is harder to optimize.

8
Q

Key idea of transfer learning

A

Reuse features learned on large dataset to learn new things

9
Q

Describe transfer learning in 3 steps

A
  1. Train on large-scale dataset (may be provided for you)
  2. Take custom data and initialize the network with weights trained in step 1
  3. Continue to train on new dataset.
10
Q

Limitations of transfer learning

A

Won’t work well if the target task is very different from the source task (e.g. applying a model pretrained to classify natural images to sketches)

11
Q

Benefit of transfer learning

A

Significantly reduces amount of labeled data needed to accomplish a task

12
Q

T/F - Using a larger capacity model will always reduce estimation error

A

False. A larger-capacity model can overfit more easily, so without regularization the estimation error could increase.

13
Q

Transfer learning: Example of what network changes you may need to make from a pretrained model to your own

A

Replace the last layer with a new fully connected layer that has one output node per new category

14
Q

Transfer learning: Ways to train from pretrained model’s weights

A
  1. Update all parameters
  2. Freeze parts of the network (e.g. only tune fully connected layers)
15
Q

Transfer learning: Why would you want to “freeze” parts of your network

A

Reduces the number of parameters that you need to learn given your new data set.

(If you don’t have enough data, you may not be able to fine-tune all the features in your network)

16
Q

Transfer learning: T/F - If you have a large data set for a target domain, training from random initialization may result in faster convergence

A

True

17
Q

Transfer learning: Explain the three data regimes with respect to data set size and generalization error

A
  1. Small data region - not enough data, hard to reduce error
  2. Power-law region - more training data keeps reducing error, following a power law (a straight line on a log-log plot)
  3. Irreducible error region - extra data stops helping; error has saturated at the irreducible error
18
Q

Modern networks: What was the key innovation introduced by AlexNet that made it a breakthrough in deep learning?

A

ReLU activation

19
Q

Modern networks: Which one of these architectures is known for its simplicity with a focus on using only 3x3 convolutional filters?

A

VGGNet used 3x3 convolutional filters exclusively

20
Q

Modern networks: Which architecture introduced the concept of residual learning, addressing the vanishing gradient problem and allowing the training of very deep networks?

A

ResNet introduced the concept of residual learning, where shortcut connections (or skip connections) were added to the network, allowing the gradient to flow more directly during training, thus addressing the vanishing gradient problem.

21
Q

Modern networks: Which architecture uses inception modules? Explain what they are

A

InceptionNet.

Uses multiple filter sizes in parallel to capture different features

22
Q

Modern networks: Which architecture was known for removing FC layers at the end of the network? What did it replace it with?

A

ResNet

Used global average pooling instead of FC layers. Global average pooling reduces overfitting and the total number of parameters in the network.

23
Q

CNN: During forward propagation in a convolutional layer, what operation(s) is performed between the input and the kernel?

A

element-wise multiplication and summation

24
Q

CNN: What is the purpose of backpropagation in the context of convolutional layers?

A

To compute the gradients for the kernel/filter

25
CNN: During backpropagation in a convolutional layer, what operation is performed to compute the gradients for the kernel?
Element-wise multiplication between the gradient of the loss w.r.t. the output and the corresponding input values, then summed.
26
CNN: What is the purpose of padding in a CNN?
To preserve spatial dimensions. Otherwise the feature maps of deeper layers become smaller and smaller.
27
CNN: Valid padding vs same padding
Valid: No padding, window always within input image
Same: Padding added to keep output size equal to input
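A minimal sketch of the two padding modes, assuming a square input, a square kernel, and the usual output-size formula (n + 2p - k) / s + 1. The helper names `conv_output_size` and `same_padding` are illustrative, not from any library.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output width/height of a convolution over an n x n input."""
    return (n + 2 * padding - kernel) // stride + 1


def same_padding(kernel):
    """Padding per side that keeps output size equal to input size.

    Assumes stride 1 and an odd kernel size.
    """
    return (kernel - 1) // 2


# "Valid": no padding, so the output shrinks.
valid_size = conv_output_size(32, 5)  # 28
# "Same": pad so the output matches the input.
same_size = conv_output_size(32, 5, padding=same_padding(5))  # 32
print(valid_size, same_size)
```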
28
CNN: Why use max-pooling
Reduces spatial dimensions through downsampling. Adds invariance to translation of features.
29
CNN: Invariance
Property where a model is robust to certain transformations in the input. Practically, this explains how a CNN may be able to classify an object in an image regardless of where in the image it is located.
30
CNN: Equivariance
Property where transforming the input transforms the output in a corresponding way, so the model maintains the relationship between different elements after a transformation occurs (e.g. scaling, rotation, time shift)
31
CNN: How is invariance achieved by CNNs
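Shared weights and bias: the same kernel is applied at every location, so a feature is detected wherever it appears. Pooling then adds tolerance to small shifts.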
32
CNN: Equivariance
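Convolution layers maintain spatial relationships between features. E.g. if an object shifts in the image, its response in the feature map shifts by the same amount. (Standard convolutions are equivariant to translation, not to rotation.)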
33
CNN: CNN vs FC - which has higher memory usage
CNN
34
CNN: CNN vs FC - which has more parameters
FC
35
CNN: How to calculate gradient of a kernel during backwards pass
Multiply each downstream gradient element (the gradient of the loss w.r.t. one output pixel) into its corresponding receptive field of the input. Then add all the weighted receptive fields together.
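The rule above can be sketched in plain Python, assuming a single-channel input, stride 1, and no padding. The function name `kernel_gradient` and the toy values are illustrative only.

```python
def kernel_gradient(x, grad_out, kh, kw):
    """Gradient of the loss w.r.t. a kh x kw kernel.

    x        : input as a list of rows (H x W)
    grad_out : gradient of the loss w.r.t. the layer output
    Each output gradient scales its receptive field in x, and the
    scaled receptive fields are summed element by element.
    """
    out_h, out_w = len(grad_out), len(grad_out[0])
    return [
        [
            sum(
                grad_out[i][j] * x[i + a][j + b]
                for i in range(out_h)
                for j in range(out_w)
            )
            for b in range(kw)
        ]
        for a in range(kh)
    ]


x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
grad_out = [[1, 1], [1, 1]]  # toy gradient: every output pixel has gradient 1
dW = kernel_gradient(x, grad_out, 2, 2)
print(dW)  # [[12, 16], [24, 28]]
```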
36
CNN: T/F - Given a 3x3 kernel, the top-left cell's kernel weight affects all pixels in the output feature map
True
37
CNN: T/F - Given a constant kernel size, adding more layers increases the receptive field exponentially.
False. Adding more convolutional layers increases the receptive field size linearly: with stride 1, each extra layer increases the receptive field size by the kernel size minus one.
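A short sketch of the linear growth, assuming stride-1 convolutions with a constant square kernel and no pooling. The helper name `receptive_field` is illustrative.

```python
def receptive_field(num_layers, kernel_size):
    """Receptive field width of one output pixel after stacking
    `num_layers` stride-1 convolutions with the same kernel size."""
    rf = 1  # a single input pixel sees only itself
    for _ in range(num_layers):
        rf += kernel_size - 1  # each layer adds (kernel_size - 1)
    return rf


sizes = [receptive_field(n, 3) for n in range(1, 5)]
print(sizes)  # [3, 5, 7, 9]
```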
38
How to visualize FC layer
Reshape weights for a node back into size of image
39
How to visualize CNN layer
For each kernel, scale its values to the 0-255 range and visualize it. Each kernel is shown as its own small image.
40
t-SNE
Performs a non-linear mapping of high-dimensional data to 2D space. Aims to preserve pairwise similarities, so points that are close in the original space stay close in the map.
41
What can a visualization output (aka activation/filter) map show with respect to the input?
Given an input image and a convolution kernel in the network, we can view which areas of the image produced the highest activation for that kernel.
42
Why can visualization interpretability be difficult?
1. No intrinsic measure of utility. Need user studies to measure usefulness of visualization.
2. Neural networks learn "distributed representation" - 1:1 mapping of node to feature not guaranteed.
43
Gradient ascent
Updates the input in the direction of the gradient (rather than opposite in gradient descent)
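A toy sketch of the update rule, assuming a scalar input and a hand-written gradient function. The objective f(x) = -(x - 3)^2 is a made-up example, not from the course material.

```python
def gradient_ascent(grad_fn, x0, lr=0.1, steps=100):
    """Repeatedly move the input along the gradient to increase the objective."""
    x = x0
    for _ in range(steps):
        x = x + lr * grad_fn(x)  # gradient descent would subtract here
    return x


# Hypothetical objective f(x) = -(x - 3)^2, with gradient -2 * (x - 3).
# Its maximum is at x = 3.
x_star = gradient_ascent(lambda x: -2.0 * (x - 3.0), x0=0.0)
print(round(x_star, 4))
```

In class visualization the same loop runs over image pixels, with the gradient of a class score taken with respect to the input.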
44
Guided backprop
Uses the usual ReLU in the forward pass. In the backward pass, in addition to the usual ReLU masking, it zeroes out negative gradients. Keeping only positive gradients improves the visualization.
45
Saliency map
Visualizes area of the image with high gradients
46
How to use saliency map for bias
See which area of the image the network focused on (e.g. a network that tells a wolf from a dog by looking at snow in the background rather than at the animal)
47
Grad-CAM
Generates heat maps highlighting regions of an input image that contribute the most to a specific class prediction
48
Grad-CAM - How does it work?
Computes the gradient of the target class score with respect to the feature maps of the last convolutional layer. Averages the gradients of each channel to get one weight per channel, takes the weighted sum of the feature maps, and applies ReLU.
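A minimal sketch of the Grad-CAM combination step, assuming the feature maps and their gradients have already been extracted from the network as nested lists. The function name `grad_cam` and the toy numbers are illustrative.

```python
def grad_cam(feature_maps, gradients):
    """Combine K feature maps (each H x W) into one heat map.

    gradients[k] is the gradient of the class score w.r.t. feature_maps[k].
    """
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    # One weight per channel: the spatial average of that channel's gradients.
    weights = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    cam = [[0.0] * w for _ in range(h)]
    for alpha, fmap in zip(weights, feature_maps):
        for i in range(h):
            for j in range(w):
                cam[i][j] += alpha * fmap[i][j]
    # ReLU keeps only regions with a positive influence on the class.
    return [[max(0.0, v) for v in row] for row in cam]


feature_maps = [[[1, 2], [3, 4]], [[4, 3], [2, 1]]]
gradients = [[[1, 1], [1, 1]], [[-2, -2], [-2, -2]]]
heat_map = grad_cam(feature_maps, gradients)
print(heat_map)  # [[0.0, 0.0], [0.0, 2.0]]
```

In practice the heat map is then upsampled to the input image size before being overlaid on the image.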
49
Difference between Grad-CAM and Guided Grad-CAM
Guided Grad-CAM multiplies the guided backprop output and the Grad-CAM heat map element-wise.
50
One practical use of gradient ascent
Class visualization
51
White-box attacks
Attacker has complete picture of the target model (network, params, data)
52
Black-box attacks
Attacker has limited or no picture of the target model. Generally uses trial-and-error attempts to craft adversarial examples.
53
Key idea from Geirhos, "ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness"
CNNs tend to be more biased towards texture than shape. Remediating this bias improves accuracy and robustness.
54
Losses in style transfer
Style-loss function - minimize the squared difference between the Gram matrices of the style image and the generated image
Content-loss function - match the features of the content image and the generated image
55
Gram matrix
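Square matrix of inner products between a set of vectors; it represents the relationships between the vectors. In style transfer, the vectors are the flattened feature maps (channels) of a layer.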
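A small sketch of a Gram matrix and the squared-difference style loss built on it, assuming each channel has already been flattened into a vector. The names `gram_matrix` and `style_loss` are illustrative, and normalization constants used in practice are left out.

```python
def gram_matrix(features):
    """features: list of C channel vectors (flattened feature maps).

    Entry (i, j) is the inner product of channel i and channel j.
    """
    c = len(features)
    return [
        [sum(a * b for a, b in zip(features[i], features[j])) for j in range(c)]
        for i in range(c)
    ]


def style_loss(generated, style):
    """Sum of squared differences between the two Gram matrices."""
    g_gen, g_style = gram_matrix(generated), gram_matrix(style)
    return sum(
        (a - b) ** 2
        for row_gen, row_style in zip(g_gen, g_style)
        for a, b in zip(row_gen, row_style)
    )


features = [[1, 2], [3, 4]]  # 2 channels, 2 spatial positions each
G = gram_matrix(features)
print(G)  # [[5, 11], [11, 25]]
```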
56
Difference of segmentation networks vs classification
Predicts classes for each pixel.
57
Encoder-decoder CNN architecture - key idea
The decoder is symmetrical to the encoder. It takes the small feature maps and upsamples them back to the original image size.
58
Max unpooling
Puts the max output value back into the receptive field when decoding, at the position the max came from during pooling. Non-max pixels are left as zero.
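A sketch of max pooling that remembers where each maximum came from, followed by the matching unpooling step. It assumes 2x2 windows with stride 2 and even input dimensions; the function names are illustrative.

```python
def max_pool_2x2(x):
    """2x2 max pooling with stride 2. Returns pooled values and max positions."""
    pooled, positions = [], []
    for i in range(0, len(x), 2):
        pooled_row, pos_row = [], []
        for j in range(0, len(x[0]), 2):
            window = [(i + a, j + b) for a in range(2) for b in range(2)]
            r, c = max(window, key=lambda p: x[p[0]][p[1]])
            pooled_row.append(x[r][c])
            pos_row.append((r, c))
        pooled.append(pooled_row)
        positions.append(pos_row)
    return pooled, positions


def max_unpool_2x2(pooled, positions, height, width):
    """Place each pooled value back at its recorded position; the rest stay zero."""
    out = [[0] * width for _ in range(height)]
    for i, row in enumerate(pooled):
        for j, value in enumerate(row):
            r, c = positions[i][j]
            out[r][c] += value  # overlapping contributions would be summed
    return out


x = [[1, 2, 5, 6], [3, 4, 7, 8], [9, 10, 13, 14], [11, 12, 15, 16]]
pooled, positions = max_pool_2x2(x)
restored = max_unpool_2x2(pooled, positions, 4, 4)
print(pooled)    # [[4, 8], [12, 16]]
print(restored)  # [[0, 0, 0, 0], [0, 4, 0, 8], [0, 0, 0, 0], [0, 12, 0, 16]]
```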
59
What do max unpooling and deconvolution do with overlapping windows
Sums them
60
Deconvolution (transposed convolution)
Each pixel in the input is multiplied by all of the kernel's values, and the result is "stamped" onto the output at the matching location.
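A plain-Python sketch of the "stamping" view of transposed convolution, assuming a single channel and no padding. The function name `transposed_conv2d` is illustrative.

```python
def transposed_conv2d(x, kernel, stride=1):
    """Stamp a scaled copy of the kernel onto the output for every input pixel."""
    h, w = len(x), len(x[0])
    kh, kw = len(kernel), len(kernel[0])
    out_h = (h - 1) * stride + kh
    out_w = (w - 1) * stride + kw
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(h):
        for j in range(w):
            for a in range(kh):
                for b in range(kw):
                    # Overlapping stamps are summed.
                    out[i * stride + a][j * stride + b] += x[i][j] * kernel[a][b]
    return out


x = [[1, 2], [3, 4]]
ones = [[1, 1], [1, 1]]
overlapping = transposed_conv2d(x, ones, stride=1)
disjoint = transposed_conv2d(x, ones, stride=2)
print(overlapping)  # [[1, 3, 2], [4, 10, 6], [3, 7, 4]]
print(disjoint)     # [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```

With stride 1 the stamps overlap and their values add up; with stride 2 and a 2x2 kernel they tile the output without overlap.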
61
U-net
Uses skip connections like ResNet but in an encoder-decoder network.
62
Single-stage object detection
Task of identifying and setting a bounding box for an identified object.
63
Single-stage object detection - what are its losses?
Cross-entropy loss for classification + Mean squared error for bounding box
64
Multi-headed architecture
When an architecture performs several tasks with shared features.
65
Single-shot detector (SSD) - key idea
Uses a grid and, for each grid cell, makes K bounding boxes. Estimates refined boxes across multiple layers. Among a group of overlapping boxes for an object, selects the box with the highest confidence score (non-maximum suppression).
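The final selection step can be sketched as a greedy non-maximum suppression, assuming boxes are given as (x1, y1, x2, y2) corners and an overlap threshold of 0.5. The helper names and the threshold are illustrative choices, not SSD's exact settings.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box in each group of overlapping boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap a box that was already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```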
66
YOLO - what makes it unique
Predict bounding box + classification in a single pass. Special loss function to minimize both errors at once.
67
Mean average precision in the context of bounding boxes
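Compute intersection over union (IoU): take the intersection of the predicted and ground-truth bounding boxes and divide it by their union to measure quality of fit. Using an IoU threshold to decide which predictions count as correct, compute the precision/recall curve for each class, take its average precision, and average that over all classes.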
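A minimal sketch of the overlap measure used when scoring boxes, assuming each box is given as (x1, y1, x2, y2) corner coordinates. The function name `box_iou` is illustrative.

```python
def box_iou(pred, truth):
    """Intersection area divided by union area of two axis-aligned boxes."""
    inter_w = max(0.0, min(pred[2], truth[2]) - max(pred[0], truth[0]))
    inter_h = max(0.0, min(pred[3], truth[3]) - max(pred[1], truth[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_truth = (truth[2] - truth[0]) * (truth[3] - truth[1])
    union = area_pred + area_truth - inter
    return inter / union if union > 0 else 0.0


partial = box_iou((0, 0, 2, 2), (1, 1, 3, 3))   # overlap 1, union 7
perfect = box_iou((0, 0, 2, 2), (0, 0, 2, 2))
disjoint_boxes = box_iou((0, 0, 1, 1), (5, 5, 6, 6))
print(partial, perfect, disjoint_boxes)
```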
68
Two-stage object detection
Step 1 - determine regions of interest
Step 2 - classify those regions
69
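One way two-stage object detection can find candidate object regions (but it is slow)
Unsupervised learning (e.g. a non-learned region proposal method such as selective search, as used in R-CNN)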
70
Fast R-CNN - key idea
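Run the CNN once over the whole image and place the bounding boxes (regions of interest) on the resulting feature maps, mapping between feature-map and input-image coordinates, instead of running the CNN separately on each region.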
71
Fast R-CNN - what is its benefit
Reuses computation
72
ROI Pooling - key idea
Applies a fixed-size grid to each region of interest on the feature map and max-pools each grid cell, so every region yields an output of the same size.
73
ROI Pooling - what is its benefit
Can backpropagate
74
Faster R-CNN - key idea
Uses a region proposal network (RPN) to generate candidate regions. Take top-K and classify.
75
Mask R-CNN - Key Idea
Predicts a mask for each detected box to determine which pixels belong to the object
76
Given an input image [[1, 2, 3], [4, 5, 6], [7, 8, 9]] and filter [[1, 0], [0, -1]], compute the forward operation
Top-left element of the output: (1*1) + (2*0) + (4*0) + (5*(-1)) = -4
Top-right element of the output: (2*1) + (3*0) + (5*0) + (6*(-1)) = -4
Bottom-left element of the output: (4*1) + (5*0) + (7*0) + (8*(-1)) = -4
Bottom-right element of the output: (5*1) + (6*0) + (8*0) + (9*(-1)) = -4
The resulting 2x2 output matrix: [[-4, -4], [-4, -4]]
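The forward operation for this card can be checked with a short sketch, assuming stride 1, no padding, and no kernel flip (cross-correlation, as deep learning libraries usually implement convolution). The function name `conv2d_valid` is illustrative.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image; multiply element-wise and sum at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(kh)
                for b in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]
output = conv2d_valid(image, kernel)
print(output)
```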
77
Given a gradient [[1, 2, 3], [4, 5, 6], [7, 8, 9]] and filter [[1, 0, -1], [2, 0, -2], [1, 0, -1]], compute the gradient with respect to the filter for the top-left element of the gradient (dL/d(1))
dL/d(1) = (1*1) + (2*2) + (3*3) + (4*4) + (5*5) + (6*6) + (7*7) + (8*8) + (9*9) = 285
78
Given a forward pass of stride 2 of a 4x4 input and a 2x2 kernel, what is: The output shape
Output size = (4 - 2) / 2 + 1 = 2, so the output shape is 2x2
79
Given a forward pass of stride 2 of a 4x4 input and a 2x2 kernel, what is the shape of: Downstream gradient
2x2 (same as output)
80
Given a forward pass of stride 2 of a 4x4 input and a 2x2 kernel, what is: Gradient wrt kernel
2x2 (same shape as kernel)
81
Given a backward pass with stride 1 of a 4x4 input, 2x2 kernel and 2x2 gradient, what is the shape of: Gradient wrt input
4x4
82
Given: Input: 32x32x3, Kernel: 5x5, Padding: 2, Stride: 1, Number of filters: 10. What is the parameter size?
760
Formula = (Channels * Kernel * Kernel + Bias) * Filters = (3 * 5 * 5 + 1) * 10 = 760
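The formula can be written as a small helper, assuming square kernels and one bias term per filter. The function name `conv_param_count` is illustrative.

```python
def conv_param_count(in_channels, kernel, num_filters, bias=True):
    """Learnable parameters in a convolutional layer with square kernels."""
    per_filter = in_channels * kernel * kernel + (1 if bias else 0)
    return per_filter * num_filters


params = conv_param_count(in_channels=3, kernel=5, num_filters=10)
print(params)  # 760
```

Padding, stride, and the spatial size of the input do not enter the count, because the same weights are shared across all positions.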
83
Given an input (28x28x3), what is the memory requirement?
2352 values. The memory requirement is the product of the input's height, width, and number of channels: 28 * 28 * 3 = 2352
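A small sketch of the count, with a conversion to bytes under the assumption of 4 bytes per value (32-bit floats). That storage size is an assumption for illustration and is not stated on the card.

```python
def tensor_elements(height, width, channels):
    """Number of values stored for one height x width x channels tensor."""
    return height * width * channels


def tensor_bytes(height, width, channels, bytes_per_value=4):
    """Memory in bytes, assuming 32-bit floats unless told otherwise."""
    return tensor_elements(height, width, channels) * bytes_per_value


values = tensor_elements(28, 28, 3)
size_in_bytes = tensor_bytes(28, 28, 3)
print(values, size_in_bytes)
```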