Quiz 2 Flashcards

1
Q

Receptive fields

A

Each node only receives input from a 𝐾1×𝐾2 window (image patch).

The region from which a node receives its input is called its receptive field.

2
Q

Shared Weights

A

Nodes in different locations can share features.

Uses the same weights/parameters in the computation graph.

  • Reduces parameters to (𝐾1×𝐾2+1) per feature map
  • Explicitly maintains spatial information
3
Q

Learning Many Features

A

Weights are not shared across different feature extractors.

Reduces parameters to (𝐾1×𝐾2+1) × M, where M is the number of features to be learned.

4
Q

Convolution

A

In mathematics, a convolution is an operation on two functions f and g that produces a third function, typically viewed as a modified version of one of the original functions. It gives the area of overlap between the two functions as a function of how far one of them is translated.
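For reference (not part of the original card), the standard continuous and discrete forms of this definition are:

    (f ∗ g)(t) = ∫ f(τ) g(t − τ) dτ        (continuous)
    (f ∗ g)[n] = Σ_m f[m] g[n − m]          (discrete)

Note that g is evaluated at (t − τ), i.e., flipped; this flip is what distinguishes convolution from the cross-correlation that deep-learning libraries actually compute (see the cross-correlation card below).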

5
Q

T or F: Convolutions are linear operations

A

True

6
Q

What are CNN hyperparameters

A
  1. in_channels (int): Number of channels in the input image
  2. out_channels (int): Number of channels produced by the convolution
  3. kernel_size (int or tuple): Size of the convolving kernel
  4. stride (int or tuple, optional): The step size used by the convolution (default is 1)
  5. padding (int or tuple, optional): Zero padding added to both sides of the input (default is 0)
  6. padding_mode (string): 'zeros', 'reflect', 'replicate', or 'circular' (default 'zeros')
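These names correspond to the arguments of torch.nn.Conv2d. A minimal sketch with illustrative values (3 input channels, 16 filters, a 3×3 kernel, one pixel of zero padding):

    import torch
    import torch.nn as nn

    # Illustrative hyperparameters: 3 input channels (e.g., RGB), 16 output
    # channels, 3x3 kernel, stride 1, one pixel of zero padding on each side.
    conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3,
                     stride=1, padding=1, padding_mode='zeros')

    x = torch.randn(1, 3, 32, 32)   # (batch, channels, height, width)
    y = conv(x)
    print(y.shape)                  # torch.Size([1, 16, 32, 32]) -- spatial size preserved
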
7
Q

Output size formula for a vanilla convolution operation

A

The output spatial dimension along each axis is:

(N - F + 2P) / S + 1

N: Input dimension
F: Filter dimension
P: Padding
S: Stride
+1: Accounts for the first (unshifted) filter position

The full output volume is output_dim_1 × output_dim_2 × N_filters, where N_filters is the number of kernels.
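A quick sanity check of the formula in PyTorch (illustrative numbers, assuming torch is available):

    import torch
    import torch.nn as nn

    N, F, P, S = 32, 5, 0, 1                  # input dim, filter dim, padding, stride
    expected = (N - F + 2 * P) // S + 1        # = 28

    conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=F, stride=S, padding=P)
    out = conv(torch.randn(1, 3, N, N))
    print(expected, out.shape)                 # 28 torch.Size([1, 8, 28, 28])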

8
Q

"valid" convolution

A

A convolution in which the kernel is applied only where it fits entirely inside the image (no padding).

9
Q

T or F: The larger the filter, the smaller the shrinkage

A

False. Larger filter = larger shrinkage.

10
Q

"Same" convolution

A

zero-padding the image borders to produce an output the same size as the raw input

11
Q

CNN: Max pooling

A

For each window, calculate its max.

Pros: No parameters to learn.

12
Q

CNN: Stride

A

The step size with which the convolution filter moves across the input.

13
Q

CNN: Pooling layer

A

Make the representations smaller and more manageable through downsampling.

Only pools width and height, not depth

14
Q

CNN: Cross-correlation

A

Takes the dot product of a small filter (also called a kernel or weights) and an overlapping region of the input image or feature map.

Unlike a true convolution, the kernel is not flipped.
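A minimal numpy/scipy sketch of the difference (illustrative arrays; assumes scipy is available): convolution equals cross-correlation with a kernel flipped along both axes.

    import numpy as np
    from scipy.signal import convolve2d, correlate2d

    img = np.arange(16, dtype=float).reshape(4, 4)
    kernel = np.array([[1., 0.],
                       [0., -1.]])

    corr = correlate2d(img, kernel, mode='valid')              # kernel used as-is
    conv = convolve2d(img, kernel, mode='valid')               # kernel flipped first
    flipped = correlate2d(img, kernel[::-1, ::-1], mode='valid')

    print(np.allclose(conv, flipped))                          # True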

15
Q

CNN: T or F - Using a stride greater than 1 results in loss of information.

A

True. Stride > 1 implies jumping over some pixels.

16
Q

CNN: Output size of vanilla/valid convolution vs full convolution

A

Vanilla/valid: m - k + 1
Full: m + k - 1

(m: input size, k: kernel size)
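A quick numeric check with numpy's 1-D convolution (m = 5, k = 3):

    import numpy as np

    signal = np.ones(5)     # m = 5
    kernel = np.ones(3)     # k = 3

    print(len(np.convolve(signal, kernel, mode='valid')))   # 3 = m - k + 1
    print(len(np.convolve(signal, kernel, mode='full')))    # 7 = m + k - 1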

17
Q

CNN: Benefit of pooling

A

Makes the representation invariant to small changes in the input

18
Q

CNN: Full convolution

A

Enough zeros are added to the borders such that every pixel is visited k times in each direction. This results in an output of size m + k - 1.

Full = output bigger than the original

19
Q

Sigmoid

A

Min = 0; max = 1; output is always positive

Saturates at both ends

Gradients vanish at both ends (as the output converges to 0 or 1, the gradient approaches zero) and are always positive

Computational complexity is high due to the exponential term
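For reference, the function and its derivative, which is why the gradient vanishes at both ends (the derivative peaks at 0.25 and goes to 0 as the output approaches 0 or 1):

    σ(x) = 1 / (1 + e^(−x))
    σ'(x) = σ(x) · (1 − σ(x))        (maximum 0.25, at x = 0)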

20
Q

tanh

A

Min = -1; max = 1; zero-centred
Saturates at both ends (-1, 1)
Gradients: vanish at both ends; always positive
Medium complexity: tanh is not as simple as, say, a multiplication

21
Q

ReLU

A

Min = 0; max = ∞; always positive
Does not saturate on the positive side
Gradients: 0 when x <= 0 (the "dead ReLU" problem); constant otherwise (doesn't vanish, which is good)
Cheap: it doesn't come much easier than a max function

22
Q

T or F: ReLU is differentiable

A

Technically no: it is not differentiable at zero, but it is differentiable everywhere else (in practice a subgradient is used at x = 0).

23
Q

Initialization: What happens if you initialize close to a bad local minima

A

Poor gradient flow

24
Q

Initialization: What happens if you initialize with large activations

A

Reach saturation quickly

25
Q

Initialization: What happens if you initialize with small activations

A

You stay in (or close to) the linear regime of the nonlinearity, so there is a strong gradient to learn from.

26
Q

Initialization: What happens if you initialize all weights with a constant

A

Every node learns the same thing (the symmetry is never broken).

27
Q

Initialization: Common practice

A

  1. Randomly sample from a small normal distribution N(μ=0, σ=0.01)
  2. Randomly sample from a uniform distribution

28
Q

Initialization: Why are equal (in terms of sampling from a distribution), small weights preferred

A

There is no a priori reason why some weights should be larger than others.

29
Q

Initialization: T or F - Deeper networks are less sensitive to initialization

A

False. Deeper networks are more sensitive because the activations get progressively smaller as they pass through the layers.

30
Q

Initialization: Fan-in / fan-out rule

A

Maintain the variance at the output of each layer to be similar to that of its input, so the variance stays roughly the same from layer to layer.

31
Q

Optimization: Issues that hinder optimization

A

  • Noisy gradient estimates (due to using mini-batches)
  • Saddle points
  • Ill-conditioned loss surfaces, where the curvature is high in one direction but not in another

32
Q

Optimization: Loss surfaces that can cause problems

A

  • Local minima
  • Plateaus
  • Saddle points (a point that is a minimum along one axis but a maximum along another)

33
Q

Optimization: Momentum

A

Overcomes plateaus by adding an exponential moving average of the gradient to the update. Helps move off of areas with low gradients (see the sketch after the next card).

34
Q

Optimization: Nesterov Momentum

A

Calculates the gradient AFTER applying the momentum term (i.e., at the look-ahead point).

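A runnable sketch of the two update rules from the previous two cards, under one common convention (beta is the momentum coefficient; the exact form varies between texts and libraries, and the toy quadratic loss is only for illustration):

    import numpy as np

    def grad(w):
        # Gradient of a toy quadratic loss f(w) = 0.5 * w^2 (illustration only).
        return w

    lr, beta = 0.1, 0.9

    # Classical momentum: exponential moving average of gradients, then step.
    w, v = 5.0, 0.0
    for _ in range(3):
        v = beta * v + grad(w)
        w = w - lr * v

    # Nesterov momentum: the gradient is evaluated AFTER the momentum (look-ahead) step.
    w_n, v_n = 5.0, 0.0
    for _ in range(3):
        v_n = beta * v_n + grad(w_n - lr * beta * v_n)
        w_n = w_n - lr * v_n

    print(w, w_n)   # both move from 5.0 toward the minimum at 0
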
35
Q

Optimization: How to use the Hessian

A

Use second-order derivatives to get information about the curvature of the loss surface.

36
Q

Optimization: Condition number

A

The ratio between the largest and smallest eigenvalues of the Hessian. It tells us how different the curvature is along different dimensions.

37
Q

Optimization: General idea of techniques like Adam

A

Apply per-parameter learning rates.

38
Q

Optimization: Adagrad

A

Adapts the learning rate for each parameter based on the historical gradients: parameters with larger gradients get a rapid decrease in learning rate, while those with small gradients get a slower decrease.

Pro: Works well in a gently sloped parameter space.
Con: Can prematurely make the learning rate too small.

39
Q

Optimization: RMSProp

A

Like AdaGrad, but replaces the accumulated sum of squared gradients with an exponential moving average.

40
Q

Optimization: Adam

A

Like AdaGrad/RMSProp, but also includes momentum terms.

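A compact sketch of the per-parameter scaling that the three cards above describe (illustrative hyperparameters; Adam additionally keeps a momentum term and applies bias correction, which are omitted here):

    import numpy as np

    g = np.array([0.5, 0.01])        # pretend gradients for two parameters
    lr, eps, rho = 0.1, 1e-8, 0.9

    # Adagrad: accumulate the sum of squared gradients, so parameters with
    # historically large gradients get their effective LR shrunk fastest.
    G = np.zeros_like(g)
    G += g ** 2
    step_adagrad = lr * g / (np.sqrt(G) + eps)

    # RMSProp: same idea, but with an exponential moving average, so old
    # gradients are gradually forgotten and the LR can recover.
    s = np.zeros_like(g)
    s = rho * s + (1 - rho) * g ** 2
    step_rmsprop = lr * g / (np.sqrt(s) + eps)

    print(step_adagrad, step_rmsprop)   # each parameter gets its own step size
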
41
Q

Regularization: L1 norm

A

Adds the sign of the weights, scaled by α, to the task gradient:

∇_W J_reg(W) = α sign(W) + ∇_W J(W)

Results in sparse parameters.

42
Q

Regularization: L2 norm

A

Adds a regularization term proportional to the weights at each update (weight decay):

∇_W J_reg(W) = α W + ∇_W J(W)

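For completeness, these gradient terms come from adding a penalty to the objective (one common convention; the 1/2 factor on the L2 term is sometimes omitted):

    L1:  J_reg(W) = J(W) + α ||W||_1          ∇_W J_reg = ∇_W J(W) + α sign(W)
    L2:  J_reg(W) = J(W) + (α/2) ||W||_2^2    ∇_W J_reg = ∇_W J(W) + α W
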
43
Q

Regularization: Dropout

A

Dropout is a technique in which a random subset of nodes/activations is masked out (i.e., multiplied by a 0/1 mask) during each training pass, so that subset does not contribute to learning on that pass. Makes the model less reliant on any particularly effective subset of parameters.

44
Q

Regularization: What needs to be done with dropout during inference

A

Either:
  1. Scale the outputs or weights by the keep probability p at test time: W_test = W * p, or
  2. Scale by 1/p during training (inverted dropout), so nothing needs to change at test time.

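A minimal numpy sketch of option 2 (inverted dropout), where p is the keep probability; because activations are rescaled by 1/p during training, inference is just the identity:

    import numpy as np

    def dropout(a, p=0.8, training=True):
        # p is the probability of KEEPING a unit.
        if not training:
            return a                                   # inference: no masking, no rescaling
        mask = (np.random.rand(*a.shape) < p).astype(a.dtype)
        return a * mask / p                            # rescale so the expected value is unchanged

    x = np.ones((2, 4))
    print(dropout(x, training=True))    # surviving entries become 1/0.8 = 1.25, others 0
    print(dropout(x, training=False))   # unchanged at test time
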
45
Q

Batch norm: Difference between batch vs layer norm

A

Batch norm: normalizes activations along the batch dimension.

Layer norm: normalizes activations along the feature (channel) dimension for each data point in the mini-batch.

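A sketch of the two normalization axes for an activation matrix of shape (batch, features); the learnable scale γ and shift β are omitted:

    import numpy as np

    x = np.random.randn(8, 4)    # 8 examples, 4 features

    # Batch norm: statistics per feature, computed across the batch dimension.
    bn = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-5)

    # Layer norm: statistics per example, computed across the feature dimension.
    ln = (x - x.mean(axis=1, keepdims=True)) / (x.std(axis=1, keepdims=True) + 1e-5)

    print(bn.shape, ln.shape)    # both (8, 4); only the axis of normalization differs
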
46
Q

Batch Norm: Definition

A

Normalizes the activations of a layer across a mini-batch of data, which helps stabilize and speed up training by reducing internal covariate shift.

47
Q

Batch Norm: How to use batch norm during inference

A

Use the running averages of the mean/variance accumulated during training (instead of per-batch statistics).

48
Q

Batch norm: Pros

A

  • Improves gradient flow
  • Allows higher learning rates
  • Reduces dependence on initialization
  • Differentiable

49
Q

Batch norm: Cons

A

Sufficiently large batch sizes must be used to get stable per-batch mean/variance estimates.

50
Q

Batch norm: Where to apply it in the network

A

Right before the activation function.

51
Q

Batch norm: T or F - Batch norm is useful for linear networks

A

False. This is because we have normalized out the first- and second-order statistics, which is all that a linear network can influence.

52
Q

Batch norm: T or F - Batch norm is useful for deep networks

A

True. In a deep neural network with nonlinear activation functions, the lower layers can perform nonlinear transformations of the data, so they remain useful.

53
Q

Batch norm: T or F - The bias term should be omitted during batch norm

A

True. The bias term becomes redundant with the β (shift) parameter applied by the batch normalization reparametrization.

54
Q

Batch norm: T or F - For CNNs, you should use different mean/variance values at each spatial location within a feature map

A

False. It is important to apply the same normalizing μ and σ at every spatial location within a feature map, so that the statistics of the feature map remain the same regardless of spatial location.

55
Q

Initialization: Variance of ReLU

A

Sample from N(0,1) and scale by sqrt(2 / n_j), where n_j is the fan-in (He initialization for ReLU).

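A one-line numpy version of this rule (illustrative layer sizes), often called He/Kaiming initialization:

    import numpy as np

    n_j, n_out = 256, 128                                    # fan-in, fan-out
    W = np.random.randn(n_out, n_j) * np.sqrt(2.0 / n_j)     # std ~ sqrt(2 / fan-in)
    print(W.std())                                           # roughly sqrt(2/256) ~ 0.088
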
56
Q

Batch: Batch gradient descent

A

The network goes through the forward/backward pass using the entire dataset.

Pro: Provides deterministic updates.
Con: Computationally expensive.

57
Q

Batch: Stochastic gradient descent

A

The network goes through the forward/backward pass using only one example. Updates are noisier, but this can have a regularizing effect by pushing the model out of local minima. Doesn't take advantage of parallelization.

58
Q

Batch: Mini-batch gradient descent

A

Takes N samples from the empirical distribution and runs the forward/backward pass on them. Benefits from parallel processing. More stable than SGD.

59
Q

Supervised pretraining

A

The model is initially trained on a related task that has labeled data (supervised learning) before being fine-tuned on the target task of interest. The knowledge gained during the initial training phase serves as a useful starting point for the target task.

60
Q

Formula to calculate the number of parameters (including biases) in a conv layer

A

K * (F1 * F2 * D + 1)

K: Number of kernels
F1, F2: Filter dimensions
D: Number of input channels
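A quick check of the formula against PyTorch (illustrative numbers: K = 32 kernels of size 3×3 over D = 3 input channels):

    import torch.nn as nn

    K, F1, F2, D = 32, 3, 3, 3
    formula = K * (F1 * F2 * D + 1)                               # 32 * (27 + 1) = 896

    conv = nn.Conv2d(in_channels=D, out_channels=K, kernel_size=(F1, F2))
    print(formula, sum(p.numel() for p in conv.parameters()))     # 896 896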