Quiz 2 - Optimization, CNN Flashcards

1
Q

Loss surface geometries difficult for optimization

A
  • local minima
  • plateaus
  • saddle points
    • the gradient is zero, yet the point is not a minimum
      • (a min along one direction, a max along another)
2
Q

tanh function

A
  • min: -1, max: 1
    • centered
  • saturates at both ends
  • gradients
    • vanishes at both ends
  • computationally heavy
3
Q

parameter sharing

A

regularize parameters to be close together by forcing sets of parameters to be equal

4
Q

Normalization is helpful as it can

A
  • improve gradient flow
  • improve learning
5
Q

Color jitter

A
  • Used for data augmentation
  • add/subtract a small or large value to RGB channels in an image
6
Q

Data Augmentation

A
  • Perform a range of transformations to a dataset
    • increases data for free
    • should not change meaning of data
    • ex: flip image, black/white, crop
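
A minimal torchvision sketch of the transforms listed above; the specific parameter values (crop size, jitter strength, probabilities) are illustrative assumptions, not prescribed settings.

```python
# Data-augmentation pipeline sketch (parameter values are illustrative).
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                 # flip image
    transforms.RandomGrayscale(p=0.1),                      # occasionally black/white
    transforms.RandomCrop(size=28, padding=2),              # crop
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # perturb RGB values
    transforms.ToTensor(),
])
# augmented = augment(pil_image)  # applied per sample during training
```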
7
Q

The key principle for NN training

A
  • Monitor everything to understand what is going on
    • loss/accuracy curves
    • gradient statistics/characteristics
    • other aspects of computation graph
8
Q

Sanity checks for learning after optimization

A
  • Check bounds of loss function
    • Check the initial loss at small random weight values (≈ -log(1/C) for cross-entropy with C classes)
  • start without regularization, then add it and make sure the loss increases
  • simplify dataset to make sure model can properly (over)fit before applying regularization
    • to ensure that model capacity is enough
    • model should be able to memorize
9
Q

L2 regularization results in a solution that is ___ sparse than L1

A

L2 regularization results in a solution that is less sparse than L1

10
Q

Why is initialization of parameters important

A
  • determines how statistics of outputs (given inputs) behave
  • determines if the gradients vanish at the beginning (dampening learning)
    • ie. gradient flow
  • allows bias at the start (linear)
  • faster convergence
11
Q

What suggests overfitting when looking at validation/training curve?

A
  • validation loss/accuracy starts to get worse after a while
12
Q

Shared Weights

A
  • Advantage
    • reduce params
    • explicitly maintain spatial information
  • Use same weights/params in computation graph
13
Q

sigmoid function

A
  • Gradient will be vanishingly small
  • Partial derivative of loss wrt weights (used for gradient descent) will be a very small number (multiplied by a small upstream gradient)
    • pass back the small gradients
  • Forward pass high values
    • causes larger and larger forward values
  • Issues in both directions
  • computationally heavy
14
Q

ReLU

A
  • min: 0, max: infinity
  • outputs always positive
  • no saturation on the positive end
    • better gradient flow
  • gradients
    • 0 if x <= 0 (dead ReLU)
      • other ReLUs can make up for this
  • computationally cheap
15
Q

Sigmoid is typically avoided unless ___

A

you want to clamp values to [0, 1] (ie. logistic regression)

16
Q

Simpler Xavier initialization (Xavier2)

A

W ~ N(0, 1) * sqrt(1 / nj), where nj is the fan-in (number of input nodes)
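
A minimal NumPy sketch of this simplified rule, assuming nj is the fan-in; the layer sizes below are illustrative.

```python
# Simplified Xavier ("Xavier2") initialization sketch.
import numpy as np

def xavier2_init(fan_in, fan_out, rng=None):
    rng = rng or np.random.default_rng(0)
    # Sample from N(0, 1), then scale by sqrt(1 / nj), where nj is the fan-in.
    return rng.normal(0.0, 1.0, size=(fan_in, fan_out)) * np.sqrt(1.0 / fan_in)

W = xavier2_init(256, 128)
print(W.std())  # roughly 1 / sqrt(256) = 0.0625
```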

17
Q

How to Prevent Co-Adapted Features

A
  • Dropout Regularization
    • Keep each node with probability p
      • dropped nodes get their activation set to 0
    • Choose which nodes to mask out at each iteration
    • multiply the activations by a {0, 1} mask
    • Note: no nodes are dropped during testing
    • Scale weights at test time by p so that inputs/outputs have similar distributions
      • At test time, all nodes are active, so this accounts for that (see the sketch below)
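
A NumPy sketch of the scheme described above (random mask at training time, scale by p at test time); note that many frameworks instead use inverted dropout, which scales by 1/p during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, p=0.5):
    # Keep each node with probability p; dropped nodes get activation 0.
    mask = (rng.random(h.shape) < p).astype(h.dtype)
    return h * mask

def dropout_test(h, p=0.5):
    # No nodes are dropped at test time; scale by p so the expected
    # activation matches what the next layer saw during training.
    return h * p
```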
18
Q

Fully Connected Neural Network

A

more and more abstract features from raw input

not well-suited for images

19
Q

Why does dropout work

A
  • Model should not rely too heavily on a particular feature
    • Probability (1 - p) of losing the feature it relies on
    • Spreads the weight across all of the features
  • Effectively training 2^n neural networks
    • n - number of nodes
    • 2^n distinct variations of the mask
    • ensemble effect
20
Q

Pooling Layer

A

Layer to explicitly down-sample image or feature maps (dimensionality reduction)

21
Q

What is the number of parameters for a convolution layer with N kernels of size K1 x K2 and 3 input channels?

A

N * (K1 * K2 * 3 + 1), where the +1 is each kernel's bias
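
A quick PyTorch check of the formula, using illustrative values N = 8 and K1 = K2 = 3.

```python
import torch.nn as nn

N, K1, K2 = 8, 3, 3                      # 8 kernels of size 3x3
conv = nn.Conv2d(in_channels=3, out_channels=N, kernel_size=(K1, K2))
n_params = sum(p.numel() for p in conv.parameters())
print(n_params, N * (K1 * K2 * 3 + 1))   # both print 224
```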

22
Q

L2 regularization

A
  • L2 norm
  • encourage small weights (but less zeros than L1)
23
Q

Sigmoid Function Key facts

A
  • min: 0, max: 1
  • outputs are always positive
  • saturates at both ends
  • gradient
    • vanishes at both ends
    • always positive
24
Q

Definition of accuracy with respect to TP, TN, FP, FN

A

(TP + TN) / (TP + TN + FP + FN)
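
A tiny Python helper matching the formula; the counts below are made up for illustration.

```python
def accuracy(tp, tn, fp, fn):
    # Fraction of all predictions that are correct.
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tp=40, tn=50, fp=5, fn=5))  # 0.9
```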

25
Q

Per-Parameter Learning Rate

A
  • Dynamic learning rate for each weight
  • Examples
    • RMSProp
    • Adagrad
    • Adam
26
Q

How can you mitigate the problem with Adam

A
  • Time-varying bias correction
    • beta1 = 0.9
    • beta2 = 0.999
27
Q

Difference between Convolution and Cross-Correlation

A
  • Convolution: flips the kernel (starts at the end of the kernel and moves backward)
  • Cross-correlation: starts at the beginning of the kernel and moves forward (same direction as the image)
    • a sliding dot product along the image
    • convolution is equivalent to cross-correlation with an already-flipped kernel
28
Q

T/F: The existence of local minima is the main issue in optimization

A
  • False - Other aspects of the loss surface cause issues
    • Noisy gradient estimates (ie. from mini-batches)
    • Saddle points
    • ill-conditioned loss surface
      • curvature/gradients higher in some directions
29
Q

Normalization as a layer (algorithm)

A

Normalize each mini-batch: subtract the per-dimension mini-batch mean and divide by the mini-batch standard deviation, then apply a learnable scale (gamma) and shift (beta). Note: a small epsilon is used for numerical stability (see the sketch below).
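
A minimal NumPy sketch of the training-time forward pass, assuming a 2-D input of shape (N, D); at inference the stored running statistics would be used instead.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization over a mini-batch x of shape (N, D)."""
    mu = x.mean(axis=0)                    # per-dimension mini-batch mean
    var = x.var(axis=0)                    # per-dimension mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # eps for numerical stability
    return gamma * x_hat + beta            # learnable scale and shift

x = np.random.randn(32, 64)
y = batchnorm_forward(x, gamma=np.ones(64), beta=np.zeros(64))
print(y.mean(), y.std())  # ≈ 0 and ≈ 1
```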

30
Q

Each node in a convolutional NN receives ___

A
  • Input from a K2 x K1 window (image patch)
  • region of input is called “receptive field”
  • Advantage
    • reduce parameters
    • explicitly maintain spatial information
31
Q

T/F: With dropout regularization, nodes are dropped during testing

A

False - All nodes are kept.

32
Q

What does a tiny loss change suggest?

A

too small of a learning rate

33
Q

Which non-linearity is the most common starting point?

A

ReLU

34
Q

T/F: In backprop and auto diff, the learning algorithm needs to be modified depending on what’s inside

A

False

35
Q

L1 Regularization

A
  • L1 Norm
  • encourages sparsity (most weights at or very near zero, only a few non-zero)
36
Q

Convolution has the property of _____

A
  • equivariance
    • if a feature is translated a little, the output values move by the same translation
  • holds regardless of whether a pooling layer is involved
37
Q

Method to get around difficult loss geometries (ie. plateaus or saddle points)

A
  • Momentum
    • Decay the velocity over time for past gradients and add the current gradient
    • Weight update uses the velocity (not the gradient)
    • Used to “pop out” of local plateaus or saddle points
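
A minimal sketch of the update described above; the learning rate and decay factor are illustrative defaults.

```python
# SGD + momentum update sketch (lr and beta values are illustrative).
def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    velocity = beta * velocity + grad   # decay past gradients, add the current one
    w = w - lr * velocity               # update uses the velocity, not the raw gradient
    return w, velocity
```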
38
Q

Nesterov Momentum

A
  • Rather than combining the velocity with the current gradient, first step along the velocity and then calculate the gradient at that new point
    • We know velocity is probably a reasonable direction
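
One common way to write the look-ahead update as a sketch (formulations vary slightly across references); grad_fn is a placeholder for whatever computes the gradient.

```python
# Nesterov momentum sketch: step along the velocity first, then evaluate
# the gradient at that look-ahead point (lr and beta values are illustrative).
def nesterov_step(w, velocity, grad_fn, lr=0.01, beta=0.9):
    lookahead = w - lr * beta * velocity             # move along the velocity direction first
    velocity = beta * velocity + grad_fn(lookahead)  # gradient at the look-ahead point
    return w - lr * velocity, velocity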
39
Q

What is the False Positive rate (FPR)?

A

fp / (fp + tn)

40
Q

Deep learning involves complex, compositional, non-linear functions, which cause the loss landscape to be ____

A

extremely non-convex

41
Q

Change in loss indicates speed of ____

A

learning

42
Q

Adam

A
  • Combines ideas from other algorithms (momentum + per-parameter adaptive rates)
  • Maintains both first and second moment statistics of the gradients
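
A minimal sketch of the Adam update with the time-varying bias correction mentioned on the earlier card; hyperparameter values are the usual defaults.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # t is the iteration count, starting at 1.
    m = beta1 * m + (1 - beta1) * grad       # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad**2    # second moment (mean of squared gradients)
    m_hat = m / (1 - beta1**t)               # bias correction for early iterations
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```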
43
Q

Limitation of Linear Layers

A
  • Images
    • a 1024 x 1024 image ≈ 1 million input elements (M)
    • fully connected layer with N hidden nodes
    • Parameters = M*N (weights) + N (biases)
      • hundreds of millions of params for one layer
  • More parameters => more data needed to fit
44
Q

Ways to analyze non-linear functions in DL models

A
  • min/max
  • correspondence between input & output stats
  • gradients
    • at initialization
    • at extremes
  • computational complexities
45
Q

Normalization methods

A
  • subtract mean, divide by standard deviation
    • (most common)
    • this can be done per dimension
  • whitening (ie. through PCA)
    • (not common)
46
Q

Combining Convolution and Pooling layers - Benefit?

A
  • Invariance
    • Pooling layer has invariance to translation of the features
    • If feature is translated (moved) a bit, output values still remain the same
      • pooling layer (ie max pooling) retains max values in patches as long as movement is not larger than pooling window
47
Q

The velocity term is an ____ of the gradient

A

exponential moving average

48
Q

Complexities of Batch Normalization

A
  • During training, compute the empirical mean and variance of each mini-batch at every iteration (ie. normalizing by a different amount each time)
    • causes noise in the mean/variance estimates
  • During inference, use the stored (running) mean/variance computed on the training set
  • Sufficient batch sizes must be used to get stable per-batch estimates during training
    • an issue with multi-GPU or multi-machine training
    • PyTorch can synchronize batch statistics across devices to fix this (e.g., SyncBatchNorm)
49
Q

Where should Batch Normalization be applied and why?

A
  • where
    • before every non-linearity
  • why
    • low/high values (un-normalized, imbalanced) cause saturation issues
50
Q

Relationship between forward pass and gradients in CNNs

A
  • Forward pass and gradients are opposite
    • If the forward pass is a cross-correlation, the backward pass (gradients) is a convolution (and vice versa)
51
Q

What is cowmask?

A
  • Combine two images using a patterned mask that determines which image each pixel is taken from
    • the result is a complex transformation
    • forces the NN to be robust to occlusion (ie. an object hidden behind another object in an image)
    • ground-truth labels are mixed in proportion to how many pixels came from each image
52
Q

Learning Rate Schedules

A
  • Hand-coded ways to schedule learning rate
  • Theoretical results rely on annealed learning rate
    • (learning rate with decay)
  • Empirically derived learning rate
    • graduate student
      • stare at loss curve and determine convergence
    • step-scheduler
      • ex: divide the learning rate by 10 every few epochs
    • exponential scheduler
    • cyclical scheduler (ex: cosine schedule)
      • alternate between a base and a max learning rate over a set step size (see the sketch below)
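
A small PyTorch sketch of a step schedule (with a cosine schedule shown commented out); the optimizer, step sizes, and epoch counts are illustrative placeholders.

```python
import torch
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR

params = [torch.nn.Parameter(torch.zeros(10))]        # placeholder model parameters
optimizer = torch.optim.SGD(params, lr=0.1)

step_sched = StepLR(optimizer, step_size=10, gamma=0.1)   # divide lr by 10 every 10 epochs
# cosine_sched = CosineAnnealingLR(optimizer, T_max=50)   # cosine decay over 50 epochs

for epoch in range(30):
    # ... train for one epoch ...
    optimizer.step()       # (normally called per mini-batch)
    step_sched.step()      # advance the schedule once per epoch
```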
53
Q

2D Discrete Convolution

A
  • Image
    • input image
  • Kernel
    • applied to image
    • initialize randomly and optimize
    • our params (plus bias)
  • Output filter / feature map
    • output of image and kernel
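
A naive NumPy sketch of the sliding dot product (written as cross-correlation, which is equivalent for a learned kernel); the input and kernel sizes are illustrative.

```python
import numpy as np

def conv2d_naive(image, kernel, bias=0.0):
    # Sliding dot product between the kernel and each image patch.
    H, W = image.shape
    k1, k2 = kernel.shape
    out = np.zeros((H - k1 + 1, W - k2 + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k1, j:j + k2] * kernel) + bias
    return out

feature_map = conv2d_naive(np.random.rand(8, 8), np.random.rand(3, 3))
print(feature_map.shape)  # (6, 6)
```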
54
Q

Deeper networks are ___ sensitive to initialization

(more or less?)

A
  • Deeper networks are more sensitive to initialization
  • With small initial values, activations get smaller as you move deeper through the layers, so their standard deviation shrinks
    • result: smaller updates
  • Larger initial values
    • result: saturation
55
Q

T/F: We can have a flat loss curve but increasing accuracy

A

True - accuracy only depends on the argmax of P(Y = yi | X = xi), so the correct class only has to score slightly higher than the others, even if the cross-entropy loss barely changes.

56
Q

Recurrent Neural Network

A

Better for non-fixed size inputs (ie. sentences, phrases, etc.)

alternate architecture: transformers

57
Q

Why don’t multi-kernel CNNs have kernels that learn the same filters?

A

The kernels are initialized to different values, so they follow different gradients and settle into different local minima.

58
Q

T/F: You can have higher training loss than validation loss

A
  • True - Validation runs without regularization (e.g., dropout), so it may perform better, and it is measured at the end of an epoch, while training loss is averaged over the iterations as you go (so the average is pulled up by the higher losses early in the epoch)
59
Q

Why is depth important in a neural network

A
  • modeling compositionality
  • parameter efficiency
  • dimensionality reduction
60
Q

How to optimize to find good weights

A
  • different optimizers have different biases
    • different weight updates
  • weight optimization
  • regularization
  • loss functions
61
Q

noisy gradients

A
  • caused by using a subset of the data at each iteration to estimate the loss/gradient
  • unbiased estimator with high variance
  • slower convergence
62
Q

Geometric transformation layers are more important for this type of DL problem

A

computer vision

63
Q

How do you balance the standard deviation across the DL layers?

A
  • Xavier Initialization
    • Sample from a uniform distribution whose range is scaled by the fan-in and fan-out (±sqrt(6 / (nj + nj+1)))
  • nj - fan in
    • number of input nodes
  • nj+1 - fan out
    • number of output nodes
64
Q

Condition Number

A
  • Ratio of the largest to the smallest eigenvalue (of the Hessian)
  • Tells us how different the curvature is along different dimensions
  • high value
    • SGD will make big steps in some dims, small in others
  • second-order optimization methods divide steps by curvature
    • expensive to compute
65
Q

What are ROC Curves?

A
  • TPR/FPR curves represent the inherent tradeoff between number of positive predictions and correctness of predictions
  • AUC (area under the curve) is a common single-number metric used to summarize the curve
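
A short sketch assuming scikit-learn is available; the labels and scores below are made-up illustrative values.

```python
from sklearn.metrics import roc_curve, auc

y_true  = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]   # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # trace the TPR/FPR tradeoff
print(auc(fpr, tpr))                                 # area under the ROC curve
```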
66
Q

The number of channels in the output map is equal to _____

A

the number of kernels

67
Q
A
68
Q

What is the problem with adagrad?

A
  • the effective learning rate decays toward zero as gradients accumulate (the denominator sums squared gradients over all iterations)
69
Q

Relationship between loss function and other metrics

A
  • Can be complex
  • Metrics (not loss) are often not differentiable
    • accuracy
    • precision/recall
70
Q

what is model capacity?

A

Number of parameters

71
Q

regularization

A

any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error

72
Q

What is the problem with Adam?

A
  • unstable at the beginning of training
  • one or both moment estimates start as tiny values
73
Q

Max Pooling

A

Stride a window across the image and perform a per-patch max operation

(ie. take the max of pixels in the patch)
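
A naive NumPy sketch assuming the window equals the stride and divides the input size evenly.

```python
import numpy as np

def max_pool2d_naive(x, window=2, stride=2):
    # Stride a window across the feature map and take the per-patch max.
    H, W = x.shape
    out = np.zeros((H // stride, W // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i * stride:i * stride + window, j * stride:j * stride + window]
            out[i, j] = patch.max()
    return out

print(max_pool2d_naive(np.arange(16.0).reshape(4, 4)))  # 2x2 output of patch maxima
```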

74
Q

First step of designing the architecture

A
  • understand data
  • ask experts
    • data types already have good architectures
    • use what others have discovered
  • understand the flow of gradients
    • learning is not equal across the architecture
    • could be bottlenecks
75
Q

what happens when initializing to a constant value

A
  • weights will be the same
    • shared weights
  • gradients will be the same
  • as a result: all weights will be updated the same
76
Q

Hessian

A
  • Matrix of second-order gradients
  • Gives information about the curvature of the loss surface
  • Not often used in Deep Learning (computationally inefficient)
77
Q

Geometric transformation

A
  • Used for data augmentation
    • translation
    • rotation
    • scale
    • shear
78
Q

Normalization can be done with learnable parameters

“Batch Normalization”

A
  • gamma (scale)
  • beta (shift)
  • determine what extent to normalize
    • or if not at all
79
Q

What is the True Positive Rate (TPR)?

A

tp / (tp + fn)

80
Q

Learning Many Features

A
  • Weights are not shared across different feature extractors
  • params (K1 x K2 + 1) * M
    • M - number of features to learn
81
Q

What suggests underfitting when looking at a validation/training curve?

A
  • validation loss very close to training loss, or both are high
  • should be able to get very low training loss in NN
82
Q

leaky ReLU

A
  • min: -infinity, max: infinity
  • small non-zero slope for x <= 0
    • the slope can be parameterized (parametric ReLU)
  • no saturation
  • gradients
    • no dead neurons
  • still cheap to compute
  • subgradients
    • not fully differentiable
83
Q

T/F: There is one activation function best for all applications

A

False

84
Q

How many parameters are learned in the max pooling layer?

A

None!

85
Q

When to use cross-validation

A
  • expensive, not done often in NN
  • useful if you may not have a lot of data
86
Q

What is convolution

A
  • Mathematical operation on two functions f and g producing a third function
    • the third function is a modified version of the original functions
    • it gives the area of overlap between the two functions as one is shifted across the other
  • similar to cross-correlation
87
Q

Convolution hyperparameters

A
  • in_channels
    • # channels in input image
  • out_channels
    • # channels produced by convolution
  • kernel_size
    • size of convolving kernel
  • stride
    • stride of the convolution (default: 1)
    • stride when moving across image
  • padding
    • zero-padding added to both sides of the input
  • padding_mode
    • zeros, reflect, replicate, circular
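
These hyperparameters map directly onto torch.nn.Conv2d; the channel counts and image size below are illustrative.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3,
                 stride=1, padding=1, padding_mode='zeros')
x = torch.randn(1, 3, 32, 32)        # batch of one 3-channel 32x32 image
print(conv(x).shape)                 # torch.Size([1, 16, 32, 32])
```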
88
Q

What are the most crucial hyperparameters to tune?

A
  • learning rate
  • weight decay
89
Q

Convolutional Neural Networks

A

Feature extractors applied over small, local image patches

Better for images

Applied to sentences too

90
Q

How does SGD + momentum work compared to adaptive methods?

A
  • Typically generalizes better, but requires more tuning
91
Q

Complexity of a model is only limited by _____

A

computation and memory

92
Q

T/F: Hyperparameters cannot be independently tuned

A
  • True - hyperparameters are interdependent
    • ex: batch norm and dropout may not be needed together
    • learning rate should be changed proportionally to batch size
      • gradients are more reliable/smooth
93
Q

How can you mitigate the adagrad problem with gradients?

A
  • Keep a moving average of squared gradients to avoid saturating the learning rate
  • Doesn’t go to zero but decays
94
Q

How do you tune hyperparameters?

A
  • Start with a coarse search, then refine around the best values
    • ex: learning rate ∈ {0.1, 0.05, 0.03, 0.01, 0.003, 0.001, 0.0005, 0.0001}
95
Q

Adagrad

A
  • Use gradient statistics to reduce learning rate across iterations
  • Accumulator takes previous accumulator plus the square of the gradient
  • Weight update: scale the gradient by learning rate / (sqrt(accumulator) + epsilon)
    • denominator sums up gradients over iterations
    • directions with high curvature will have higher gradients and reduce learning rate
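
A minimal sketch of the accumulator and update described above; the lr and eps values are illustrative.

```python
import numpy as np

def adagrad_step(w, grad, accumulator, lr=0.01, eps=1e-8):
    accumulator = accumulator + grad**2                 # sum of squared gradients over iterations
    w = w - lr * grad / (np.sqrt(accumulator) + eps)    # high-curvature directions get smaller steps
    return w, accumulator
```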
96
Q

T/F: combinations of only linear layers have the same representational powers as single layer non-linear models

A

False - Combination of only linear layers has the same representational power as one linear layer

97
Q

Striding across an image using larger steps results in _______

A

loss of information and dimensionality reduction (large strides are not recommended purely for dimensionality reduction)

98
Q

Convolutional & Pooling layers

A
  • Alternating Convolution and Pooling layers
  • Convolution + Non-linear layer
    • feature extraction
  • Pooling
    • reduce the dimension of the data (ex: a 3x3 image patch reduced to a single value)
  • End with a fully connected layer to classify
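
A small PyTorch sketch of the alternating pattern ending in a fully connected classifier; the layer sizes assume an illustrative 3x32x32 input.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),   # convolution + non-linearity: feature extraction
    nn.MaxPool2d(2),                                          # pooling: reduce spatial dimension
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                                # fully connected layer to classify
)
print(model(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```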
99
Q

Why are images less conducive to fully connected linear layers?

A
  • image features are spatially localized
    • edge/contour detectors can extract data from image patches
      • gradients (dark to light on images)
    • small features are repeated
      • color
      • motifs (corners, etc.)
      • edges
  • no reason to believe one feature tends to appear in one location of image
    • ie. bird beaks are not in the center of every image
100
Q

T/F: In convolutional NNs we do not need to learn location-specific features

A
  • True
  • Nodes in different locations can share features
    • ie. image edges can be similar on different parts of a bird image (ie. 2 wings)
101
Q

T/F: For a learned kernel, Convolution and Cross-Correlation are treated differently

A
  • False
    • Kernels are randomly initialized and learned
    • Doesn’t matter if we flip the kernel or not
      • no difference between convolution and cross-correlation
102
Q

What does the loss (and then the weights) turning to NaNs suggest?

A
  • too high of a learning rate
  • divide by zero
  • taking log(0) in the loss (causing divergence)
103
Q

How do you adapt the kernel for multi-channel input images?

A
  • Use 3-channel kernels
  • Use dot product with 3x3x3 kernel
    • element-wise multiplication between kernel and image patch, summing them up
104
Q

Backprop and auto diff allows us to optimize ___ composed of ____

A

Backprop and auto diff allows us to optimize any function composed of differentiable blocks