Quiz 3 - CNN Architecture, Visualization, Advanced CV Architecture Flashcards

1
Q

T/F: Visualization makes assessing interpretability easy

A

False

  • Visualization leads to some interpretable representations, but they may be misleading or uninformative
  • Assessing interpretability is difficult
    • Requires user studies to show usefulness
  • Neural networks learn distributed representation
    • no one node represents a particular feature
    • makes interpretation difficult
2
Q

Steps to obtaining Gradient of Activation with respect to input

A
  • Pick a neuron
  • Run forward method up to layer we care about
  • Find gradient of its activation w.r.t input image
  • Can first find highest activated image patches using its corresponding neuron (based on receptive field)
3
Q

T/F: A single-pixel change can make a NN wrong

A

True (single-pixel attacks)

4
Q

Shape vs. Texture Bias

A
  • Ex: take picture of cat and apply texture of elephant
    • Humans are biased towards shape (will see cat)
    • Neural Networks are biased towards texture (will classify cat as elephant, likely)
5
Q

Estimation Error

A

Even the weights that best minimize training error may not generalize to the test set (e.g., overfitting, or features in the training data that do not generalize)

6
Q

Limitations to Transfer Learning

A
  • Transfer works poorly if the source dataset you train on is very different from the target dataset
  • If you have enough data for the target domain, it just results in faster convergence
7
Q

____ can be used to detect dataset bias

A

Gradient-based visualizations

8
Q

Saliency Maps

A
  • Shows us what we think the neural network may find important in the input
    • sensitivity of loss to individual pixel changes
    • large sensitivity implies important pixels
9
Q

What is non-semantic shift for label data?

A

Two images of the same category that look different (the appearance or domain changes, not the label)

Ex: Two pictures of a bird, one a photograph and one a sketch

10
Q

T/F: CNNs have scale invariance

A

True - but only some

11
Q

low-labeled setting: domain generalization

A
  • Source
    • multiple labeled
  • target
    • unknown
  • shift
    • non-semantic
12
Q

T/F: For larger networks, estimation error can increase

A

True - With a small amount of data and a large amount of parameters, we could overfit

13
Q

Backward Pass: Deconvnet

A
  • Pass back only the positive gradients
14
Q

AlexNet - Key aspects

A
  • ReLU instead of sigmoid/tanh
  • Specialized normalization layers
  • PCA-based data augmentation
  • Dropout
  • Ensembling
15
Q

Gram Matrix

A
  • Take a pair of channels in a feature map of n layers
    • Get correlation (dot product) between features and then sum it up
  • Feed into larger matrix (Gram) to get correlation of all features
  • Get Gram matrix loss for style image with respect to generated image
  • Get content (feature) loss for content image with respect to generated image
  • Sum up the losses with parameters (alpha, beta) for proportion of total loss contributed by each loss
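
A minimal NumPy sketch of the Gram matrix and a style loss built from it. The `(C, H, W)` feature layout, the division by `H * W`, and the mean-squared-difference loss are assumptions made for illustration; the card does not fix a normalization.

```python
import numpy as np

def gram_matrix(features):
    """Channel-by-channel correlations for one layer's feature map of shape (C, H, W)."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)   # one row per channel; spatial layout is discarded
    return flat @ flat.T / (h * w)      # entry (i, j) is the dot product of channels i and j

def style_loss(style_feats, generated_feats):
    """Mean squared difference between the two Gram matrices (assumed loss form)."""
    diff = gram_matrix(style_feats) - gram_matrix(generated_feats)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
style = rng.normal(size=(4, 8, 8))       # stand-in for style-image features
generated = rng.normal(size=(4, 8, 8))   # stand-in for generated-image features

G = gram_matrix(style)
loss = style_loss(style, generated)

# Shuffling spatial positions (the same way in every channel) leaves the Gram matrix unchanged.
perm = rng.permutation(8 * 8)
shuffled = style.reshape(4, -1)[:, perm].reshape(4, 8, 8)
```

Because the Gram matrix sums over all positions, it keeps which features co-occur but not where, which is why it suits texture and style.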
16
Q

Low-labeled setting: Semi-supervised learning

A
  • Source
    • single labeled (usually much less)
  • target
    • single unlabeled
  • shift
    • none
17
Q

low-labeled setting: cross-category transfer

A
  • Source
    • single labeled
  • target
    • single unlabeled
  • shift
    • semantic
18
Q

T/F: We can generate images from scratch using gradients to obtain an image with maximized score for a given class?

A

True - Image optimization

19
Q

Creating alternating layers in a CNN (convolution/non-linear, pooling, and fully connected layers at the end) results in a ________ receptive field.

A

It results in an increasing receptive field for a particular pixel deep inside the network.

20
Q

What is the problem for visualization in modern Neural Networks?

A

Small filters such as 3x3

Small convolution outputs are hard to interpret

21
Q

Increasing the depth of a NN leads to ___ error (higher/lower)

A

higher - hard to optimize (but can be mitigated with residual blocks/skip connections)

22
Q

Since the outputs of convolution and pooling layers are ______ we can __________ them

A

Since the outputs of convolution and pooling layers are (multi-channel) images we can sequence them just as any other layer

23
Q

What is semantic shift for labeled images?

A

Both are images, but of different things (the categories themselves differ)

24
Q

Most parameters in the ___ layer of a CNN

A

Fully Connected Layer - input x output dimensionality + bias
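
A quick arithmetic sketch of why this holds. The layer sizes are illustrative assumptions, chosen to match a VGG-style network: a 7x7x512 feature map feeding 4096 FC outputs, compared with a 3x3 convolution from 512 to 512 channels.

```python
def fc_params(n_in, n_out):
    """Weights (input x output) plus one bias per output node."""
    return n_in * n_out + n_out

def conv_params(kernel_size, c_in, c_out):
    """One kernel_size x kernel_size x c_in kernel per output channel, plus biases."""
    return kernel_size * kernel_size * c_in * c_out + c_out

fc = fc_params(7 * 7 * 512, 4096)   # first FC layer after the last pooling stage
conv = conv_params(3, 512, 512)     # a late 3x3 convolution layer
ratio = fc / conv                   # the FC layer is roughly 40x larger
```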

25
Q

Normal backpropagation is not always the best choice for gradient-based visualizations because…?

A
  • You may get parts of image that decrease the feature activation
    • likely lots of these input pixels
26
Q

Grad-CAM

A
  1. Feed image through CNN (only convolution part) for last Convolution Feature Map (most abstract features closest to classification on the network).
  2. Following CNN with any Task-specific network (classification, question/answering)
  3. Backprop until convolution
    1. Obtain a feature map the size of the original feature maps
    2. Obtain per-channel weighting (global average pooling for each channel of gradient) for neuron importance, then normalize
  4. Multiply feature maps with their weighting
  5. Feed through ReLU to obtain only positive features
  6. Final result, values that are important will have higher values
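
Steps 3 to 6 can be sketched in NumPy once the last-layer activations and their gradients are available. Here both are hand-made `(C, H, W)` arrays standing in for what a real forward and backward pass would produce, and the final rescaling to [0, 1] is an assumption added for display purposes.

```python
import numpy as np

def grad_cam(activations, gradients):
    """activations, gradients: (C, H, W) arrays for the last convolution feature map."""
    weights = gradients.mean(axis=(1, 2))              # global average pool of each channel's gradient
    cam = np.tensordot(weights, activations, axes=1)   # weighted sum of feature maps -> (H, W)
    cam = np.maximum(cam, 0.0)                         # ReLU keeps only positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                          # rescale to [0, 1] (assumed, for display)
    return cam

# Channel 0 supports the class (positive gradient), channel 1 opposes it.
activations = np.zeros((2, 3, 3))
activations[0, 1, 1] = 2.0
activations[1, 0, 0] = 5.0
gradients = np.stack([np.full((3, 3), 0.5), np.full((3, 3), -1.0)])

cam = grad_cam(activations, gradients)
```

The location that only the opposing channel responds to ends up at zero after the ReLU, while the supporting channel's hot spot dominates the map.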
27
Q

VGG - Key Aspects

A
  • Repeating particular blocks of layers
    • 3x3 conv with small strides
    • 2x2 max pooling stride 2
  • Very large number of parameters
28
Q

Convolution layers have the property of _____ and output has the property of _______

(choose translation equivariance or invariance for each)

A

Convolution layers have the property of translation equivariance and output has the property of invariance

Note: Some rotation invariance and scale invariance (only some)

29
Q

Visualizing Neural Network Methods

A
  • Weights (kernels)
    • See what edges are detected in kernels
  • Activations
    • What does image look like in activation layer
  • Gradients
    • Assess what is used for the optimization itself
  • Robustness
    • See what weaknesses/bias are of NN
30
Q

The gradient of the Convolution layer Kernel is equivalent to the _________

A

Cross-Correlation between the upstream gradient and input (until K1xK2 output)
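
A small NumPy sketch that checks this claim numerically, assuming a single-channel input, stride 1, and no padding. The helper `cross_correlate` and the test loss are illustrative names, not part of any library.

```python
import numpy as np

def cross_correlate(image, kernel):
    """Valid 2D cross-correlation: slide the kernel over the image without flipping it."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 5))          # input
k = rng.normal(size=(3, 3))          # kernel, K1 x K2 = 3 x 3
upstream = rng.normal(size=(3, 3))   # dL/dy, same shape as the layer output

# Analytic kernel gradient: cross-correlate the input with the upstream gradient.
grad_kernel = cross_correlate(x, upstream)

def loss(kernel):
    # Scalar loss whose gradient with respect to the layer output is exactly `upstream`.
    return np.sum(upstream * cross_correlate(x, kernel))

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.zeros_like(k)
for a in range(k.shape[0]):
    for b in range(k.shape[1]):
        step = np.zeros_like(k)
        step[a, b] = eps
        numeric[a, b] = (loss(k + step) - loss(k - step)) / (2 * eps)
```

The result has the kernel's K1 x K2 shape because the upstream gradient can only slide over the input in that many positions.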

31
Q

Defenses for adversarial attacks

A
  • training with adversarial examples
  • perturbations, noise, or re-encoding of inputs
  • there are no universal methods to prevent attacks
32
Q

T/F: Computer vision segmentation algorithms can be applied directly to gradients to get image segments

A

True

33
Q

Exploring the space of possible architecture (methods)

A
  • Evolutionary Learning and Reinforcement Learning
  • Prune over-parameterized networks
  • Learning of repeated blocks is typical
34
Q

The gradient of the loss with respect to the input image is equivalent to ____

A

Convolution between the upstream gradient and the kernel
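
A NumPy sketch that checks this numerically, assuming a single-channel input, stride 1, and no padding in the forward pass. "Convolution" here means a full convolution: zero-pad the upstream gradient and slide the flipped kernel over it. The helper names are illustrative.

```python
import numpy as np

def cross_correlate(image, kernel):
    """Valid 2D cross-correlation (the forward pass of a conv layer)."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def full_convolve(signal, kernel):
    """Full 2D convolution: zero-pad, flip the kernel, then cross-correlate."""
    kh, kw = kernel.shape
    padded = np.pad(signal, ((kh - 1, kh - 1), (kw - 1, kw - 1)))
    return cross_correlate(padded, kernel[::-1, ::-1])

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 5))
k = rng.normal(size=(3, 3))
upstream = rng.normal(size=(3, 3))   # dL/dy

# Analytic input gradient: convolve the upstream gradient with the kernel.
grad_input = full_convolve(upstream, k)

def loss(image):
    # Scalar loss whose gradient with respect to the layer output is exactly `upstream`.
    return np.sum(upstream * cross_correlate(image, k))

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.zeros_like(x)
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        step = np.zeros_like(x)
        step[i, j] = eps
        numeric[i, j] = (loss(x + step) - loss(x - step)) / (2 * eps)
```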

35
Q

Backward Pass:

Guided Backpropagation

A
  • Zero out gradient for negative values in forward pass
  • Zero out negative gradients
  • Only propagate positive influence
  • Like a combination of backprop and deconvnet
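
The three backward rules differ only in how a ReLU passes gradients back. A minimal NumPy sketch for a single ReLU follows; the function name `relu_backward` and its `mode` argument are made up for illustration.

```python
import numpy as np

def relu_backward(forward_input, upstream, mode="backprop"):
    """Gradient passed back through one ReLU under three visualization rules."""
    forward_positive = forward_input > 0   # the ReLU was active in the forward pass
    grad_positive = upstream > 0           # the incoming gradient is positive
    if mode == "backprop":
        keep = forward_positive
    elif mode == "deconvnet":
        keep = grad_positive
    elif mode == "guided":
        keep = forward_positive & grad_positive
    else:
        raise ValueError(f"unknown mode: {mode}")
    return np.where(keep, upstream, 0.0)

forward_input = np.array([1.0, -1.0, 2.0, -2.0])
upstream = np.array([3.0, 3.0, -3.0, -3.0])

backprop = relu_backward(forward_input, upstream, "backprop")     # [3, 0, -3, 0]
deconvnet = relu_backward(forward_input, upstream, "deconvnet")   # [3, 3, 0, 0]
guided = relu_backward(forward_input, upstream, "guided")         # [3, 0, 0, 0]
```

Guided backpropagation keeps a gradient only where both masks agree, which is why it reads as a combination of the other two.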
36
Q

Gradient Ascent

A
  • Compute the gradient of the score for a particular class with respect to the input image
    • Add the learning rate times gradient to maximize score (not subtracting)
  • Algorithm
    • Start from random/zero image
    • Compute forward pass
    • Compute gradients
    • Perform Ascent
    • Iterate
  • Note: Uses scores to avoid minimizing other class scores
  • Need regularization as well
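
The algorithm above can be sketched with a toy model. This assumes a hypothetical linear scorer `W @ image` in place of a real CNN (so the score gradient is simply a row of `W`) and an L2 penalty as the regularizer; the step count and rates are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, num_pixels = 3, 16
W = rng.normal(size=(num_classes, num_pixels))   # hypothetical linear "network"

def class_scores(image):
    """Pre-softmax class scores."""
    return W @ image

def score_gradient(image, target):
    # For a linear scorer, d score[target] / d image is the target's weight row.
    return W[target]

def gradient_ascent(target, steps=200, lr=0.1, weight_decay=0.5):
    image = np.zeros(num_pixels)                         # start from a zero image
    for _ in range(steps):
        grad = score_gradient(image, target) - weight_decay * image   # L2 regularization term
        image = image + lr * grad                        # ascent: add the gradient, do not subtract
    return image

image = gradient_ascent(target=1)
final_score = class_scores(image)[1]
```

Without the `weight_decay` term the image would grow without bound for this linear scorer, which is one way to see why regularization is needed.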
37
Q

How do we represent similarity in terms of textures?

A
  • Should remove most spatial information
  • Key ideas revolved around summary statistics
  • Gram Matrix
    • feature correlations
38
Q

We can take the activations of any layer (FC, conv, etc.) and perform _____________

A
  • dimensionality reduction
    • often to reduce to two dimensions for plotting
    • PCA
    • t-SNE (most common)
      • non-linear mapping to preserve pair-wise distances
  • good for visualizing decision boundaries (esp non-linear)
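
A minimal sketch of the PCA option using only NumPy. The random `activations` array stands in for features collected from a real layer, one row per image. t-SNE is not shown here since it needs an extra library.

```python
import numpy as np

def pca_2d(activations):
    """Project (N, D) activations onto their top two principal components."""
    centered = activations - activations.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
activations = rng.normal(size=(50, 8))   # stand-in for one layer's activations on 50 images
points = pca_2d(activations)             # (50, 2), ready for a scatter plot
```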
39
Q

What is the power-law region for data effectiveness?

A

Region where generalization error decreases as a power law of dataset size (a straight line on a log-log plot), once there is sufficient data
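
One way to write this down, using symbols that are assumptions here rather than notation from the card: $m$ is the training set size, $\varepsilon(m)$ the generalization error, and $\alpha$, $\beta$ fitted constants.

```latex
\varepsilon(m) \approx \alpha\, m^{\beta}, \quad \beta < 0
\qquad\Longrightarrow\qquad
\log \varepsilon(m) \approx \log \alpha + \beta \log m
```

Taking logs turns the power law into a straight line with slope $\beta$, which is the linear decrease seen on log-scale axes.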

40
Q

Modeling Error

A

Given a NN architecture, the actual function that represents the real world may not be in the space of functions that architecture can express. There may be no set of weights that models the real world.

I.e., a simple architecture or function may not be able to model complex reality (potentially low capacity)

41
Q

What can you do to train a CNN if you don’t have enough data?

A

Transfer Learning -

  1. Train on large-scale dataset and optimize parameters
  2. Take custom data set and initialize the network with weights trained before (step 1)
  3. Replace last layer with new fully-connected layer for output nodes per category
  4. Continue to train on new dataset (finetune - update parameters, freeze feature layer - update only last layer weights if not enough data)
42
Q

low-labeled setting: few-shot learning

A
  • Source
    • single labeled
  • target
    • single few-labeled
  • shift
    • semantic
43
Q

Most memory usage is in the ___ layers of a CNN

A

convolution layers - large output

44
Q

Residual block/ skip connections

A

Allow information from a layer to propagate to any future layer via an identity connection (i.e., no transform)

can help with better gradient flow

45
Q

low-labeled setting: domain adaptation

A
  • Source
    • single labeled
  • target
    • single unlabeled
  • shift
    • non-semantic
46
Q

T/F: Saliency maps use the loss to assess importance of input pixels

A

False

  • In practice, saliency maps find gradient of the classifier scores (pre-softmax)
  • softmax and then loss function adds some complexity (weird effects in terms of the gradient)
47
Q

How to preserve the content of an image

A
  • Match features at different layers
  • Use a loss for this
    • optimize image by minimizing the difference between the images (content and generated images)
  • Multiple losses
    • Backward edges going to same node are summed
    • Loss is sum of the difference across the identified layers
48
Q

Optimization Error

A

Optimization algorithm may not be able to find the weights that 100% model the world

49
Q

T/F: We have reached the point in complex CNN architectures where more data is not/barely improving performance

A

False - The ‘Irreducible Error Region’ has not been reached

50
Q

What does an input pixel affect at the output in convolution?

A

Neighborhood around it (where part of the kernel touches it)

51
Q

Visualizing Weights for CNN Layers

A
  • Fully Connected Layers
    • Reshape weights for a node back into size of image, then scale to 0-255
  • Convolution Layers
    • For each kernel, scale values from 0-255 and observe:
      • oriented edges
      • color
      • texture
52
Q

Receptive Field

A

Defines what set of input pixels in the original image affect the value of a particular node deep in the neural network.
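
A short sketch of how the receptive field grows through a stack of layers, using the standard recurrence for kernel size and stride (dilation and padding effects are ignored). The layer list is a made-up example.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, ordered from the input onward."""
    size = 1   # a single input pixel sees itself
    jump = 1   # distance in input pixels between neighboring nodes at the current layer
    for kernel_size, stride in layers:
        size += (kernel_size - 1) * jump
        jump *= stride
    return size

one_conv = receptive_field([(3, 1)])                           # 3
two_convs = receptive_field([(3, 1), (3, 1)])                  # 5
with_pool = receptive_field([(3, 1), (3, 1), (2, 2), (3, 1)])  # 10
```

Each extra layer widens the field, and a strided pooling layer makes every later layer widen it faster.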

53
Q

Where does a kernel pixel affect an output image during the convolution operation?

A

Everywhere!

The pixels in the kernel stride across the entire input image

54
Q

low-labeled setting: un/self-supervised

A
  • Source
    • single unlabeled
  • target
    • many labeled
  • shift
    • both/task
55
Q

For larger networks, optimization error will likely ___ in size

A

increase - dynamics of optimization could get more difficult with deeper network

56
Q

AlexNet - Architecture

A

Horizontal split architecture - couldn’t fit into one GPU

conv -> max pool -> norm (x2)

conv x 3 -> max pool

fully connected x3

57
Q

T/F: CNNs do not have rotation invariance

A

False - They have some

58
Q

A way to increase class scores or activations for an image

A

Gradient Ascent - optimization of an image to increase score for a particular class

59
Q

Effectiveness of Transfer Learning

A

Surprisingly effective

Features learned for 1000 object categories will work well for the 1001st!

Generalizes even across tasks (classification to object detection)

60
Q

For larger networks, modeling error will ___ in size

A

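likely decrease in size - a larger network has more capacity, so the space of functions it can represent is more likely to contain a good model of the real world.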

61
Q

What was used to show the benefits of Neural Networks?

A

Large-scale data benchmarking

62
Q

Inception Architecture

A
  • Repeated blocks composed of simple layers
  • parallel filters of different sizes
    • 1x1 convolution, 3x3 convolution, 5x5 convolution, 3x3 max pooling -> filter concatenation
    • increases computational complexity (4 times)
63
Q

T/F: You need a large amount of pixel changes to make a network confidently wrong

A

False - Gradient ascent perturbations can make model confidently wrong (adversarial noise)

64
Q

Key elements of practical application of saliency maps

A
  • Find gradient of classifier scores (pre soft-max), instead of loss
  • take absolute value of gradients
  • sum across channels
    • We don’t care specifically about RGB specifics
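
These three steps can be sketched in NumPy. A tiny two-layer ReLU network with random weights (`W1`, `W2`, both hypothetical) stands in for a trained CNN, and its gradient is written out by hand instead of using an autograd library.

```python
import numpy as np

rng = np.random.default_rng(0)
channels, height, width = 3, 4, 4
hidden_units, num_classes = 8, 5
W1 = rng.normal(size=(hidden_units, channels * height * width))   # hypothetical weights
W2 = rng.normal(size=(num_classes, hidden_units))
image = rng.normal(size=(channels, height, width))

def class_scores(image):
    """Pre-softmax scores of a toy two-layer ReLU network."""
    hidden = np.maximum(W1 @ image.reshape(-1), 0.0)
    return W2 @ hidden

def score_gradient(image, target):
    """Gradient of the target class score (not the loss) with respect to the image."""
    pre_activation = W1 @ image.reshape(-1)
    grad_hidden = W2[target] * (pre_activation > 0)   # backprop through the ReLU
    return (W1.T @ grad_hidden).reshape(image.shape)

def saliency_map(image, target):
    grad = score_gradient(image, target)
    return np.abs(grad).sum(axis=0)    # absolute value, then sum across the RGB channels

saliency = saliency_map(image, target=2)   # shape (4, 4): one value per pixel
```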
65
Q

Visualizing Output Maps

A
  • Visualization of activation/filter
  • Larger early in the network
  • Looking at activations across the input
    • which images have the highest activation?
66
Q

Computing the gradient of the loss with respect to the inputs for Convolution

A
67
Q

Semantic Segmentation

A
68
Q

Object Detection

A
69
Q

Instance Segmentation

A
70
Q

T/F: Fully connected layers explicitly retain spatial information

A

False

71
Q

Converting Fully Connected Layers to Convolution Layers

A
  • Each kernel has size of entire input
    • Equivalent to Wx+b
    • output is one scalar
  • One kernel per output node
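
A NumPy sketch of the equivalence, with small made-up sizes. The FC weight matrix is reshaped into one input-sized kernel per output node; on a larger input the same kernels produce a spatial map of outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
channels, height, width, n_out = 2, 3, 3, 4

weights = rng.normal(size=(n_out, channels * height * width))   # FC weight matrix W
bias = rng.normal(size=n_out)

def fully_connected(image):
    return weights @ image.reshape(-1) + bias                    # Wx + b

# One kernel per output node, each the size of the entire (C, H, W) input.
kernels = weights.reshape(n_out, channels, height, width)

def conv_valid(image):
    """Multi-channel valid cross-correlation with the reshaped FC weights."""
    kh, kw = kernels.shape[2:]
    out_h, out_w = image.shape[1] - kh + 1, image.shape[2] - kw + 1
    out = np.zeros((n_out, out_h, out_w))
    for o in range(n_out):
        for r in range(out_h):
            for c in range(out_w):
                out[o, r, c] = np.sum(image[:, r:r + kh, c:c + kw] * kernels[o]) + bias[o]
    return out

small = rng.normal(size=(channels, height, width))
large = rng.normal(size=(channels, 5, 5))

fc_out = fully_connected(small)   # shape (4,)
conv_out = conv_valid(small)      # shape (4, 1, 1): one scalar per kernel
conv_map = conv_valid(large)      # shape (4, 3, 3): the same layer applied at every position
```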
72
Q

Resulting output for Image Segmentation Networks

A

Probability distribution over classes for each pixel.

73
Q

Convolutions work on ____ input sizes

A

Convolutions work on arbitrary input sizes (because of striding)

74
Q

Max Unpooling

A
75
Q

In max-unpooling/deconvolution, contributions from multiple windows are ____

A

In max-unpooling, contributions from multiple windows are summed.

76
Q

Deconvolution (“transposed convolution”)

A

Take each input pixel, multiply by learnable kernel, “stamp” it on output
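
A minimal single-channel NumPy sketch of the "stamping" view. The stride of 2 and the all-ones kernel are arbitrary choices to make the overlaps easy to read; in a network the kernel would be learned.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2):
    """Stamp a scaled copy of the kernel onto the output for every input pixel."""
    kh, kw = kernel.shape
    out_h = (x.shape[0] - 1) * stride + kh
    out_w = (x.shape[1] - 1) * stride + kw
    out = np.zeros((out_h, out_w))
    for r in range(x.shape[0]):
        for c in range(x.shape[1]):
            # Overlapping stamps are summed.
            out[r * stride:r * stride + kh, c * stride:c * stride + kw] += x[r, c] * kernel
    return out

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
kernel = np.ones((3, 3))
y = transposed_conv2d(x, kernel, stride=2)   # 2x2 input upsampled to 5x5
```

The center of `y` receives a contribution from all four stamps, while the corners receive only one each.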

77
Q

Transfer Learning

A

Begin with a pre-trained trunk/backbone (e.g. network pretrained on ImageNet)

78
Q

For encoder/decoder connections, you can ___ to bypass bottlenecks

A

skip connections

79
Q

Object Detection

A

Given an image, output a list of bounding boxes with probability distribution over classes per box

80
Q

What are the key problems to address with object detection?

A

Variable number of boxes

Need to determine candidate regions (position and scale) first

81
Q

Architecture for Object Detection

A
  • multi-headed
    • classification
      • predicting distribution over class labels
    • regression
      • predicting bounding box for each image region
  • both heads share features
  • jointly optimized (summing gradients)
82
Q

Non-Maximal suppression (NMS)

A

Removing redundant, overlapping boxes so that each object in the image is left with a single bounding box (keep the highest-scoring box, suppress the rest)
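
A sketch of the common greedy variant in plain Python: keep the highest-scoring box, drop any remaining box that overlaps it too much, and repeat. The `(x1, y1, x2, y2)` box format and the 0.5 threshold are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes that survive greedy non-maximal suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not heavily overlap a higher-scoring kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)   # the second box overlaps the first and is suppressed
```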

83
Q

Single-Shot Detector (SSD)

A
  • uses grid idea as anchors
    • different scales
    • different aspect ratios
  • tricks used to increase resolution (decrease subsampling ratio)
84
Q

You Only Look Once (YOLO)

A

Single-scale

faster for same size than SSD

85
Q

Coco Dataset

A

large-scale object detection, segmentation, and captioning dataset

86
Q

Evaluation of bounding box for image threshold (steps)

A
  1. For each bounding box, calculate intersection over union (IoU)
    • extract intersection over union with closest ground truth
  2. Keep only those with IoU > threshold
  3. Calculate Precision/Recall curve across classification probability threshold
  4. Calculate average precision (AP) over recall of [0, 0.1, 0.2, …, 1.0]
  5. Average over all categories to get mean Average Precision (mAP)
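
Steps 1 and 2 can be sketched in plain Python. The `(x1, y1, x2, y2)` box format and the 0.5 threshold are assumptions; the precision/recall and averaging steps are not shown.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def keep_matches(predictions, ground_truth, threshold=0.5):
    """Keep predicted boxes whose IoU with the closest ground-truth box exceeds the threshold."""
    kept = []
    for box in predictions:
        best = max(iou(box, gt) for gt in ground_truth)
        if best > threshold:
            kept.append(box)
    return kept

ground_truth = [(0, 0, 2, 2)]
predictions = [(0, 0, 2, 2), (0, 0, 2, 1.5), (5, 5, 6, 6)]
kept = keep_matches(predictions, ground_truth)   # the far-away third box is dropped
```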
87
Q

R-CNN

A
  • Find regions of interest (ROIs) with object-like things
  • Classify those regions (refine their bounding boxes)
88
Q

Method to extract region of interest in an image

A
  • unsupervised (non-learned) algorithms
  • downsides
    • 1+ second per image
    • returns thousands of mostly background images
  • resize each candidate to full input size and classify
89
Q

Downside of R-CNN

A
  • Takes 1+ second per image
  • return thousands of (mostly background) boxes
90
Q

Inefficiency of R-CNN

A

Computations for convolutions are re-done for each image patch, even if overlapping

91
Q

Fast R-CNN difference

A
  • Reuse computation by finding regions in feature maps
    • feature extraction once per image
92
Q

Problem with R-CNN

A
  • Variable input size to FC layers due to different feature map sizes
93
Q

R-CNN fix for differing feature map sizes

A
  • ROI Pooling
    • Given an arbitrarily-sized feature map, we can use pooling across a grid (ROI Pooling Layer) to convert to fixed-sized representation
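
A simplified single-channel NumPy sketch of ROI max pooling. It assumes the ROI is already given in feature-map cells as `(r1, c1, r2, c2)` with exclusive ends and splits it into a grid by integer rounding; real implementations differ in how they quantize the grid.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool an arbitrary ROI of a (H, W) feature map down to a fixed grid."""
    r1, c1, r2, c2 = roi
    region = feature_map[r1:r2, c1:c2]
    out_h, out_w = output_size
    row_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    col_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    out = np.zeros(output_size)
    for i in range(out_h):
        for j in range(out_w):
            # Each grid cell covers at least one row and one column of the region.
            rows = slice(row_edges[i], max(row_edges[i + 1], row_edges[i] + 1))
            cols = slice(col_edges[j], max(col_edges[j + 1], col_edges[j] + 1))
            out[i, j] = region[rows, cols].max()
    return out

feature_map = np.arange(36, dtype=float).reshape(6, 6)
square = roi_max_pool(feature_map, (0, 0, 4, 4))   # 4x4 region -> 2x2
tall = roi_max_pool(feature_map, (1, 0, 6, 3))     # 5x3 region -> 2x2
```

Both ROIs come out at the same fixed size, which is what lets them feed the same FC layers.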
94
Q

Faster R-CNN key difference

A
  • Use Neural Networks for the region proposal
    • Region Proposal Network (RPN)
      • output: objectness score
      • top k selected for classification
      • complexity in implementation due to some non differentiable parts (gradient with respect to bounding box coordinates)
95
Q

Region Proposal Network (RPN)

A
  • Neural Network model to find regions of objects
  • Uses anchors in a grid
    • k anchor boxes
      • various sizes and shapes
        • hyperparameters
    • 2k scores
      • object or not-object like
    • 4k coordinates
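
A plain-Python sketch of laying out anchors on a grid. The stride, scales, and aspect ratios are made-up hyperparameters, and the width/height formula is one common convention that keeps each anchor's area at scale squared.

```python
import itertools

def make_anchors(grid_h, grid_w, stride, scales, ratios):
    """Return (x1, y1, x2, y2) anchors: k = len(scales) * len(ratios) per grid cell."""
    anchors = []
    for row, col in itertools.product(range(grid_h), range(grid_w)):
        cx, cy = (col + 0.5) * stride, (row + 0.5) * stride   # cell center in image pixels
        for scale, ratio in itertools.product(scales, ratios):
            w = scale * ratio ** 0.5    # ratio = width / height
            h = scale / ratio ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

scales, ratios = (32, 64), (0.5, 1.0, 2.0)
k = len(scales) * len(ratios)          # 6 anchors per location
anchors = make_anchors(2, 3, 16, scales, ratios)

scores_per_location = 2 * k            # object / not-object score for each anchor
coords_per_location = 4 * k            # box coordinates for each anchor
```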
96
Q
A
97
Q

Two-stage object detection methods are ___ compared to single-stage methods (YOLO/SSD)

A

Two-stage object detection methods are slower but more accurate