Quiz 3 - CNN Architecture, Visualization, Advanced CV Architecture Flashcards

1
Q

T/F: Visualization makes assessing interpretability easy

A

False

  • Visualization leads to some interpretable representations, but they may be misleading or uninformative
  • Assessing interpretability is difficult
    • Requires user studies to show usefulness
  • Neural networks learn distributed representation
    • no one node represents a particular feature
    • makes interpretation difficult
2
Q

Steps to obtaining Gradient of Activation with respect to input

A
  • Pick a neuron
  • Run forward method up to layer we care about
  • Find gradient of its activation w.r.t input image
  • Can first find highest activated image patches using its corresponding neuron (based on receptive field)
3
Q

T/F: A single-pixel change can make a NN wrong

A

True (single-pixel attacks)

4
Q

Shape vs. Texture Bias

A
  • Ex: take picture of cat and apply texture of elephant
    • Humans are biased towards shape (will see cat)
    • Neural Networks are biased towards texture (will classify cat as elephant, likely)
5
Q

Estimation Error

A

Even the weights that best minimize training error may not generalize to the test set (i.e., overfitting, or features in the training data that do not generalize)

6
Q

Limitations to Transfer Learning

A
  • If source dataset you train on is very different from target dataset
  • If you have enough data for the target domain, it just results in faster convergence
7
Q

____ can be used to detect dataset bias

A

Gradient-based visualizations

8
Q

Saliency Maps

A
  • Shows us what we think the neural network may find important in the input
    • sensitivity of loss to individual pixel changes
    • large sensitivity implies important pixels
9
Q

What is non-semantic shift for label data?

A

Two images of the same class that differ in appearance (style or domain) rather than in meaning

Ex: Two pictures of a bird, one a photograph and one a sketch

10
Q

T/F: CNNs have scale invariance

A

True - but only some

11
Q

low-labeled setting: domain generalization

A
  • Source
    • multiple labeled
  • target
    • unknown
  • shift
    • non-semantic
12
Q

T/F: For larger networks, estimation error can increase

A

True - With a small amount of data and a large amount of parameters, we could overfit

13
Q

Backward Pass: Deconvnet

A
  • Pass back only the positive gradients
14
Q

AlexNet - Key aspects

A
  • ReLU instead of sigmoid/tanh
  • Specialized normalization layers
  • PCA-based data augmentation
  • Dropout
  • Ensembling
15
Q

Gram Matrix

A
  • Take a pair of channels in a feature map of n layers
    • Get correlation (dot product) between features and then sum it up
  • Feed into larger matrix (Gram) to get correlation of all features
  • Get Gram matrix loss for style image with respect to generated image
  • Get content (feature-matching) loss for content image with respect to generated image
  • Sum up the losses with parameters (alpha, beta) for proportion of total loss contributed by each term
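A minimal NumPy sketch of the Gram matrix and a style loss built from it. The function names, the division by H*W, and the random feature maps are assumptions made for illustration; a real pipeline would take the features from a CNN layer.

```python
import numpy as np

def gram_matrix(features):
    """Channel-by-channel correlations for one layer's (C, H, W) feature map."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)   # flatten away the spatial layout
    return flat @ flat.T / (h * w)      # (C, C); dividing by H*W is an assumed normalization

def style_loss(style_features, generated_features):
    """Squared difference between the two Gram matrices."""
    diff = gram_matrix(style_features) - gram_matrix(generated_features)
    return float(np.sum(diff ** 2))

rng = np.random.default_rng(0)
style = rng.normal(size=(4, 8, 8))
# Shuffle pixel positions identically in every channel: texture statistics are unchanged.
perm = rng.permutation(8 * 8)
shuffled = style.reshape(4, -1)[:, perm].reshape(4, 8, 8)
other = rng.normal(size=(4, 8, 8))
print(style_loss(style, shuffled), style_loss(style, other))
```

The shuffled copy gives a near-zero style loss, which is the sense in which the Gram matrix discards spatial information.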
16
Q

Low-labeled setting: Semi-supervised learning

A
  • Source
    • single labeled (usually much less)
  • target
    • single unlabeled
  • shift
    • none
17
Q

low-labeled setting: cross-category transfer

A
  • Source
    • single labeled
  • target
    • single unlabeled
  • shift
    • semantic
18
Q

T/F: We can generate images from scratch using gradients to obtain an image with maximized score for a given class?

A

True - Image optimization

19
Q

Creating alternating layers in a CNN (convolution/non-linear, pooling, and fully connected layers at the end) results in a ________ receptive field.

A

It results in an increasing receptive field for a particular pixel deep inside the network.

20
Q

What is the problem for visualization in modern Neural Networks?

A

Small filters such as 3x3

Small convolution outputs are hard to interpret

21
Q

Increasing the depth of a NN leads to ___ error (higher/lower)

A

higher - hard to optimize (but can be mitigated with residual blocks/skip connections)

22
Q

Since the outputs of convolution and pooling layers are ______ we can __________ them

A

Since the outputs of convolution and pooling layers are (multi-channel) images, we can sequence them just as any other layer

23
Q

What is semantic shift for labeled images?

A

Both are images, but of different things (different classes)

24
Q

Most parameters in the ___ layer of a CNN

A

Fully Connected Layer - input x output dimensionality + bias
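As a rough illustration of the parameter count above, the sketch below compares one convolution layer with one fully connected layer. The layer sizes (3x3 kernels over 512 channels, a 7x7x512 input feeding 4096 outputs) are VGG-like values assumed for the example, not taken from this deck.

```python
def conv_params(k_h, k_w, c_in, c_out):
    """Weights of a conv layer: one k_h x k_w x c_in kernel per output channel, plus biases."""
    return k_h * k_w * c_in * c_out + c_out

def fc_params(n_in, n_out):
    """Weights of a fully connected layer: input x output dimensionality, plus biases."""
    return n_in * n_out + n_out

conv = conv_params(3, 3, 512, 512)   # assumed late conv layer
fc = fc_params(7 * 7 * 512, 4096)    # assumed first FC layer on a 7x7x512 feature map
print(conv, fc, round(fc / conv, 1))
```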

25
Normal backpropagation is not always the best choice for gradient-based visualizations because...?
  • You may get parts of image that decrease the feature activation
  • likely lots of these input pixels
26
Grad-CAM
  1. Feed image through CNN (only convolution part) for last Convolution Feature Map (most abstract features closest to classification on the network).
  2. Follow the CNN with any task-specific network (classification, question/answering)
  3. Backprop until convolution
    1. Obtain gradients the size of the original feature maps
    2. Obtain per-channel weighting (global average pooling for each channel of gradient) for neuron importance, then normalize
  4. Multiply feature maps with their weighting
  5. Feed through ReLU to obtain only positive features
  6. Final result: values that are important will have higher values
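The weighting and ReLU steps of Grad-CAM can be sketched in NumPy as follows. The feature maps and gradients are passed in as plain arrays (in practice they would come from a forward and backward pass through a real network), and `grad_cam` is a name chosen here for illustration.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """feature_maps, gradients: (C, H, W) arrays for the last conv layer.

    gradients holds d(class score) / d(feature_maps).
    """
    weights = gradients.mean(axis=(1, 2))              # global average pool per channel
    cam = np.tensordot(weights, feature_maps, axes=1)  # weighted sum of channels -> (H, W)
    cam = np.maximum(cam, 0.0)                         # ReLU: keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                          # scale to [0, 1] for display
    return cam

# Toy case: channel 0 supports the class, channel 1 counts against it.
feature_maps = np.zeros((2, 2, 2))
feature_maps[0, 0, 0] = 2.0
feature_maps[1, 1, 1] = 3.0
gradients = np.stack([np.ones((2, 2)), -np.ones((2, 2))])
cam = grad_cam(feature_maps, gradients)
print(cam)
```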
27
VGG - Key Aspects
  • Repeating particular blocks of layers
  • 3x3 conv with small strides
  • 2x2 max pooling stride 2
  • Very large number of parameters
28
Convolution layers have the property of _____ and output has the property of _______ (choose translation equivariance or invariance for each)
Convolution layers have the property of _translation equivariance_ and output has the property of _invariance_
Note: Some rotation invariance and scale invariance (only some)
29
Visualizing Neural Network Methods
  • Weights (kernels)
    • See what edges are detected in kernels
  • Activations
    • What does image look like in activation layer
  • Gradients
    • Assess what is used for the optimization itself
  • Robustness
    • See what weaknesses/bias are of NN
30
The gradient of the Convolution layer Kernel is equivalent to the _________
Cross-Correlation between the upstream gradient and input (until K1xK2 output)
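A small numerical check of this identity, assuming a single-channel, stride-1, no-padding layer whose forward pass is a cross-correlation (the usual deep learning convention). The helper name and array sizes are illustrative.

```python
import numpy as np

def correlate_valid(x, k):
    """Valid cross-correlation: slide k over x without flipping it."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 5))          # input image
k = rng.normal(size=(3, 3))          # kernel (K1 x K2)
upstream = rng.normal(size=(3, 3))   # dL/dy, same shape as the layer output

# Analytic kernel gradient: cross-correlate the input with the upstream gradient.
grad_k = correlate_valid(x, upstream)

# Finite-difference check using L = sum(upstream * y).
eps = 1e-6
numeric = np.zeros_like(k)
for a in range(3):
    for b in range(3):
        bump = np.zeros_like(k)
        bump[a, b] = eps
        diff = correlate_valid(x, k + bump) - correlate_valid(x, k - bump)
        numeric[a, b] = np.sum(upstream * diff) / (2 * eps)

print(np.allclose(grad_k, numeric, atol=1e-5))
```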
31
Defenses for adversarial attacks
  • training with adversarial examples
  • perturbations, noise, or re-encoding of inputs
  • there are *no* universal methods to prevent attacks
32
T/F: Computer vision segmentation algorithms can be applied directly to gradients to get image segments
True
33
Exploring the space of possible architecture (methods)
  • Evolutionary Learning and Reinforcement Learning
  • Prune over-parameterized networks
  • Learning of repeated blocks is typical
34
The gradient of the loss with respect to the input image is equivalent to ____
Convolution between the upstream gradient and the kernel
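The sketch below checks this numerically under the same assumptions as a typical conv layer: single channel, stride 1, no padding, forward pass implemented as cross-correlation. The input gradient is then a full convolution (padded, with a flipped kernel) of the upstream gradient with the kernel. Helper names are illustrative.

```python
import numpy as np

def correlate_valid(x, k):
    """Forward pass assumed here: valid cross-correlation, stride 1."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def input_gradient(upstream, k):
    """Full convolution of the upstream gradient with the kernel."""
    kh, kw = k.shape
    padded = np.pad(upstream, ((kh - 1, kh - 1), (kw - 1, kw - 1)))
    return correlate_valid(padded, k[::-1, ::-1])  # flipping the kernel makes it a convolution

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 5))
k = rng.normal(size=(3, 3))
upstream = rng.normal(size=(3, 3))   # dL/dy
grad_x = input_gradient(upstream, k)

# Finite-difference check using L = sum(upstream * y).
eps = 1e-6
numeric = np.zeros_like(x)
for m in range(5):
    for n in range(5):
        bump = np.zeros_like(x)
        bump[m, n] = eps
        diff = correlate_valid(x + bump, k) - correlate_valid(x - bump, k)
        numeric[m, n] = np.sum(upstream * diff) / (2 * eps)

print(np.allclose(grad_x, numeric, atol=1e-5))
```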
35
Backward Pass: Guided Backpropagation
  • Zero out gradient for negative values in forward pass
  • Zero out negative gradients
  • Only propagate positive influence
  • Like a combination of backprop and deconvnet
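The three backward rules differ only in how gradients pass through a ReLU. A minimal sketch, with `relu_backward` and its `mode` argument being names invented for this example:

```python
import numpy as np

def relu_backward(forward_input, upstream, mode="backprop"):
    """Gradient passed through one ReLU under three visualization rules."""
    was_active = forward_input > 0      # ReLU let this value through in the forward pass
    is_positive = upstream > 0          # incoming gradient is positive
    if mode == "backprop":
        return upstream * was_active                # standard backprop
    if mode == "deconvnet":
        return upstream * is_positive               # keep only positive gradients
    if mode == "guided":
        return upstream * was_active * is_positive  # both conditions must hold
    raise ValueError(f"unknown mode: {mode}")

forward_input = np.array([1.0, -1.0, 2.0, -2.0])
upstream = np.array([3.0, 4.0, -5.0, -6.0])
for mode in ("backprop", "deconvnet", "guided"):
    print(mode, relu_backward(forward_input, upstream, mode))
```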
36
Gradient Ascent
  • Compute the gradient of the score for a particular class with respect to the input image
  • Add the learning rate times gradient to maximize score (not subtracting)
  • Algorithm
    • Start from random/zero image
    • Compute forward pass
    • Compute gradients
    • Perform Ascent
    • Iterate
  • Note: Uses scores to avoid minimizing other class scores
  • Need regularization as well
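The loop can be sketched with a linear scorer standing in for the CNN, which is an assumption made so the gradient has a closed form (for a real network the gradient would come from backprop). The L2 penalty plays the role of the regularizer; all names and hyperparameters are illustrative.

```python
import numpy as np

def class_score_ascent(W, target, steps=200, lr=0.1, l2=0.5):
    """Maximize score[target] - l2 * ||x||^2 over the input x, where score = W @ x."""
    x = np.zeros(W.shape[1])               # start from a zero image
    for _ in range(steps):
        grad = W[target] - 2.0 * l2 * x    # gradient of the regularized class score w.r.t. x
        x = x + lr * grad                  # ascent: add, do not subtract
    return x

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))                # 3 classes, 5 input "pixels"
x = class_score_ascent(W, target=1)
print((W @ x)[1])                          # target score, up from 0 at the start
```

Without the L2 term the score of a linear model grows without bound, a simple way to see why regularization is needed.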
37
How do we represent similarity in terms of textures?
  • Should remove most spatial information
  • Key ideas revolved around summary statistics
  • Gram Matrix
    • feature correlations
38
We can take the activations of any layer (FC, conv, etc.) and perform _____________
  • dimensionality reduction
    • often to reduce to two dimensions for plotting
  • PCA
  • t-SNE (most common)
    • non-linear mapping to preserve pair-wise distances
    • good for visualizing decision boundaries (esp non-linear)
39
What is the power-law region for data effectiveness?
Region where, once there is sufficient data, generalization error decreases linearly with training set size on a log-log plot
40
Modeling Error
Given a NN architecture, the actual model that represents the real world may not be in its hypothesis space: there may be no set of weights that models the real world. I.e., a simple architecture or function may not be able to model complex reality (potentially low capacity)
41
What can you do to train a CNN if you don't have enough data?
Transfer Learning
  1. Train on large-scale dataset and optimize parameters
  2. Take custom data set and initialize the network with weights trained before (step 1)
  3. Replace last layer with new fully-connected layer for output nodes per category
  4. Continue to train on new dataset
    • finetune: update parameters
    • freeze feature layer: update only last layer weights if not enough data
42
low-labeled setting: few-shot learning
  • Source
    • single labeled
  • target
    • single few-labeled
  • shift
    • semantic
43
Most memory usage is in the ___ layers of a CNN
convolution layers - large output
44
Residual block/ skip connections
Allow information from a layer to propagate to any future layer via the identity (i.e., no transform); this can help with better gradient flow
45
low-labeled setting: domain adaptation
  • Source
    • single labeled
  • target
    • single unlabeled
  • shift
    • non-semantic
46
T/F: Saliency maps use the loss to assess importance of input pixels
False
  • In practice, saliency maps find gradient of the classifier *scores* (pre-softmax)
  • softmax and then loss function adds some complexity (weird effects in terms of the gradient)
47
How to preserve the content of an image
  • Match features at different layers
    • Use a loss for this
    • optimize image by minimizing the difference between the images (content and generated images)
  • Multiple losses
    • Backward edges going to same node are summed
    • Loss is sum of the difference across the identified layers
48
Optimization Error
Optimization algorithm may not be able to find the weights that 100% model the world
49
T/F: We have reached the point in complex CNN architectures where more data is not/barely improving performance
False - The 'Irreducible Error Region' has not been reached
50
What does an input pixel affect at the output in convolution?
Neighborhood around it (where part of the kernel touches it)
51
Visualizing Weights for CNN Layers
  • Fully Connected Layers
    • Reshape weights for a node back into size of image, then scale to 0-255
  • Convolution Layers
    • For each kernel, scale values from 0-255 and observe:
      • oriented edges
      • color
      • texture
52
Receptive Field
Defines what set of input pixels in the original image affect the value of a particular node deep in the neural network.
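How the receptive field grows with depth can be sketched with the standard recurrence: each layer adds (kernel - 1) times the product of all earlier strides. The layer stack below is a made-up example, not a specific architecture.

```python
def receptive_field(layers):
    """Receptive field width of one deep node, given (kernel_size, stride) per layer."""
    rf, jump = 1, 1   # jump = distance in input pixels between adjacent nodes at this depth
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

# Hypothetical stack: conv3x3, conv3x3, 2x2 max pool (stride 2), conv3x3.
stack = [(3, 1), (3, 1), (2, 2), (3, 1)]
for depth in range(1, len(stack) + 1):
    print(depth, receptive_field(stack[:depth]))
```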
53
Where does a kernel pixel affect an output image during the convolution operation?
Everywhere! The pixels in the kernel stride across the entire input image
54
low-labeled setting: un/self-supervised
  • Source
    • single unlabeled
  • target
    • many labeled
  • shift
    • both/task
55
For larger networks, optimization error will likely ___ in size
increase - dynamics of optimization could get more difficult with deeper network
56
AlexNet - Architecture
  • Horizontal split architecture - couldn't fit into one GPU
  • conv -> max pool -> norm (x2)
  • conv x3 -> max pool
  • fully connected x3
57
T/F: CNNs do not have rotation invariance
False - They have some
58
A way to increase class scores or activations for an image
Gradient Ascent - optimization of an image to increase score for a particular class
59
Effectiveness of Transfer Learning
  • Surprisingly effective
  • Features learned for 1000 object categories will work well for the 1001st!
  • Generalizes even across tasks (classification to object detection)
60
For larger networks, modeling error will ___ in size
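likely decrease in size - a larger network has more capacity, so its function space is more likely to contain a good model of reality.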
61
What was used to show the benefits of Neural Networks?
Large-scale data benchmarking
62
Inception Architecture
  • Repeated blocks composed of simple layers
  • parallel filters of different sizes
    • 1x1 convolution, 3x3 convolution, 5x5 convolution, 3x3 max pooling -> filter concatenation
  • increases computational complexity (4 times)
63
T/F: You need a large amount of pixel changes to make a network confidently wrong
False - Gradient ascent perturbations can make model confidently wrong (adversarial noise)
64
Key elements of practical application of saliency maps
  • Find gradient of classifier scores (pre soft-max), instead of loss
  • take absolute value of gradients
  • sum across channels
    • We don't care specifically about RGB specifics
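The post-processing steps can be sketched as below. The gradient of the pre-softmax class score with respect to the RGB input is assumed to be already computed (by backprop in a real network) and is passed in as a (3, H, W) array; the values are made up.

```python
import numpy as np

def saliency_map(score_gradient):
    """Collapse a (3, H, W) gradient of the class score into an (H, W) saliency map."""
    return np.abs(score_gradient).sum(axis=0)   # absolute value, then sum over channels

# Toy gradient for a 2x2 RGB image.
score_gradient = np.array([
    [[1.0, -2.0], [0.0, 0.0]],    # R
    [[-1.0, 0.0], [0.0, 3.0]],    # G
    [[0.5, 0.0], [0.0, -3.0]],    # B
])
saliency = saliency_map(score_gradient)
print(saliency)
```

Taking the absolute value first keeps large negative and large positive sensitivities from cancelling when channels are combined.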
65
Visualizing Output Maps
  • Visualization of activation/filter
    • Larger early in the network
  • Looking at activations across the input
    • which images have the highest activation?
66
Computing the gradient of the loss with respect to the inputs for Convolution
67
Semantic Segmentation
68
Object Detection
69
Instance Segmentation
70
T/F: Fully connected layers explicitly retain spatial information
False
71
Converting Fully Connected Layers to Convolution Layers
  • Each kernel has size of entire input
    • Equivalent to Wx+b
    • output is one scalar
  • One kernel per output node
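A quick NumPy check of the equivalence, using arbitrary small sizes chosen for the example: reshaping each row of the FC weight matrix into a kernel the size of the whole input gives the same outputs as Wx+b.

```python
import numpy as np

rng = np.random.default_rng(0)
channels, height, width, n_out = 2, 3, 3, 4   # illustrative sizes

x = rng.normal(size=(channels, height, width))
weight = rng.normal(size=(n_out, channels * height * width))
bias = rng.normal(size=n_out)

# Fully connected layer: flatten the input, then Wx + b.
fc_out = weight @ x.reshape(-1) + bias

# Same weights viewed as n_out kernels, each covering the entire input.
kernels = weight.reshape(n_out, channels, height, width)
conv_out = np.array([np.sum(kernels[o] * x) + bias[o] for o in range(n_out)])

print(np.allclose(fc_out, conv_out))
```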
72
Resulting output for Image Segmentation Networks
Probability distribution over classes for each pixel.
73
Convolutions work on ____ input sizes
Convolutions work on arbitrary input sizes (because of striding)
74
Max Unpooling
75
In max-unpooling/deconvolution, contributions from multiple windows are ____
In max-unpooling, contributions from multiple windows are summed.
76
Deconvolution ("transposed convolution")
Take each input pixel, multiply by learnable kernel, "stamp" it on output
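The "stamp" description maps directly onto a loop. This is a single-channel sketch with a fixed kernel standing in for the learnable one; the function name and the stride of 2 are choices made for the example.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2):
    """Stamp x[i, j] * kernel onto the output at (i * stride, j * stride); overlaps are summed."""
    h, w = x.shape
    kh, kw = kernel.shape
    out = np.zeros(((h - 1) * stride + kh, (w - 1) * stride + kw))
    for i in range(h):
        for j in range(w):
            top, left = i * stride, j * stride
            out[top:top + kh, left:left + kw] += x[i, j] * kernel
    return out

x = np.array([[1.0, 2.0], [3.0, 4.0]])
kernel = np.ones((3, 3))
out = transposed_conv2d(x, kernel)
print(out)
```

With a 3x3 kernel and stride 2 the stamps overlap by one row and column, so the center of the output receives contributions from all four input pixels.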
77
Transfer Learning
Begin with a pre-trained trunk/backbone (e.g. network pretrained on ImageNet)
78
For encoder/decoder connections, you can ___ to bypass bottlenecks
skip connections
79
Object Detection
Given an image, output a list of bounding boxes with probability distribution over classes per box
80
What are the key problems to address with object detection?
  • Variable number of boxes
  • Need to determine candidate regions (position and scale) first
81
Architecture for Object Detection
  • multi-headed
    • classification
      • predicting distribution over class labels
    • regression
      • predicting bounding box for each image region
  • both heads share features
  • jointly optimized (summing gradients)
82
Non-Maximal Suppression (NMS)
Reducing redundant, overlapping boxes to a single bounding box per object by keeping the highest-scoring box and suppressing the others
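One common greedy form of NMS is sketched below: visit boxes in order of decreasing score and keep a box only if it does not overlap a kept box too much. Boxes are (x1, y1, x2, y2) tuples; the 0.5 IoU threshold and the sample boxes are assumptions for the example.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap an already kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```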
83
Single-Shot Detector (SSD)
  • uses grid idea as anchors
    • different scales
    • different aspect ratios
  • tricks used to increase resolution (decrease subsampling ratio)
84
You Only Look Once (YOLO)
Single-scale; faster than SSD for the same input size
85
COCO Dataset
large-scale object detection, segmentation, and captioning dataset
86
Evaluation of bounding box for image threshold (steps)
  1. For each bounding box, calculate intersection over union (IoU)
    • extract intersection over union with closest ground truth
  2. Keep only those with IoU > threshold
  3. Calculate Precision/Recall curve across classification probability threshold
  4. Calculate average precision (AP) over recall of [0, 0.1, 0.2, ..., 1.0]
  5. Average over all categories to get mean Average Precision (mAP)
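A minimal sketch of steps 1 and 4 for a single class. It assumes each detection has already been marked as a true or false positive by the IoU threshold in step 2, and it uses the common "best precision at recall >= level" interpolation, which is one convention among several. Function names and the sample detections are illustrative.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_precision_11pt(detections, num_ground_truth):
    """detections: list of (confidence, is_true_positive) pairs for one class."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls, true_positives = [], [], 0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        true_positives += int(is_tp)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)
    total = 0.0
    for level in [i / 10 for i in range(11)]:        # recall levels 0, 0.1, ..., 1.0
        reachable = [p for p, r in zip(precisions, recalls) if r >= level]
        total += max(reachable, default=0.0)         # best precision at recall >= level
    return total / 11

detections = [(0.9, True), (0.8, False), (0.7, True)]
ap = average_precision_11pt(detections, num_ground_truth=2)
print(iou((0, 0, 2, 2), (1, 1, 3, 3)), ap)
```

mAP would then be the mean of this value over all categories.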
87
R-CNN
  • Find regions of interest (ROIs) with object-like things
  • Classify those regions (refine their bounding boxes)
88
Method to extract region of interest in an image
  • unsupervised (non-learned) algorithms
  • downsides
    • 1+ second per image
    • returns thousands of mostly background images
  • resize each candidate to full input size and classify
89
Downside of R-CNN
  • Takes 1+ second per image
  • returns thousands of (mostly background) boxes
90
Inefficiency of R-CNN
Computations for convolutions are re-done for each image patch, even if overlapping
91
Fast R-CNN difference
  • Reuse computation by finding regions in **feature maps**
  • feature extraction once per image
92
Problem with R-CNN
* Variable input size to FC layers due to different feature map sizes
93
R-CNN fix for differing feature map sizes
  • ROI Pooling
    • Given an arbitrarily-sized feature map, we can use pooling across a grid (ROI Pooling Layer) to convert to fixed-sized representation
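A single-channel sketch of the idea: split the region into a fixed grid and max-pool each cell, so any region size gives the same output size. The bin edges are computed with simple rounding down, which is one of several possible choices, and the region is assumed to be at least as large as the grid.

```python
import numpy as np

def roi_max_pool(region, out_h=2, out_w=2):
    """Max-pool an (H, W) region into a fixed (out_h, out_w) grid; needs H >= out_h, W >= out_w."""
    h, w = region.shape
    row_edges = np.linspace(0, h, out_h + 1).astype(int)   # cell boundaries along rows
    col_edges = np.linspace(0, w, out_w + 1).astype(int)   # cell boundaries along columns
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            cell = region[row_edges[i]:row_edges[i + 1], col_edges[j]:col_edges[j + 1]]
            out[i, j] = cell.max()
    return out

small = roi_max_pool(np.arange(16.0).reshape(4, 4))
large = roi_max_pool(np.arange(30.0).reshape(5, 6))
print(small.shape, large.shape)   # both regions map to the same fixed size
```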
94
Faster R-CNN key difference
  • Use Neural Networks for the region proposal
    • Region Proposal Network (RPN)
    • output: objectness score
    • top k selected for classification
  • complexity in implementation due to some non differentiable parts (gradient with respect to bounding box coordinates)
95
Region Proposal Network (RPN)
  • Neural Network model to find regions of objects
  • Uses anchors in a grid
    • *k* anchor boxes
      • various sizes and shapes
      • hyperparameters
    • *2k* scores
      • object or not-object like
    • *4k* coordinates
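The anchor grid can be sketched as follows. The scales, aspect ratios, and stride are hypothetical hyperparameters picked for the example; with 2 scales and 3 ratios there are k = 6 anchors per grid cell, so an RPN head would emit 2k = 12 scores and 4k = 24 coordinates per cell.

```python
def make_anchors(grid_h, grid_w, stride, scales=(64, 128), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors: len(scales) * len(ratios) per grid cell."""
    anchors = []
    for gy in range(grid_h):
        for gx in range(grid_w):
            cx, cy = (gx + 0.5) * stride, (gy + 0.5) * stride   # cell center in image pixels
            for scale in scales:
                for ratio in ratios:
                    w = scale * ratio ** 0.5    # ratio = width / height
                    h = scale / ratio ** 0.5    # area stays scale ** 2
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(grid_h=2, grid_w=3, stride=16)
k = 2 * 3   # anchors per cell for the default scales and ratios
print(len(anchors), 2 * k, 4 * k)
```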
96
97
Two-stage object detection methods are ___ compared to single-stage methods (YOLO/SSD)
Two-stage object detection methods are slower but more accurate