Quiz 2 Flashcards

1
Q

Receptive fields

A

Each node only receives input from a K1 × K2 window (image patch).

The region from which a node receives its input is called a receptive field.

2
Q

Shared Weights

A

Nodes in different locations can share features.

Uses the same weights/parameters in the computation graph.

  • Reduces parameters to (K1 × K2 + 1)
  • Explicitly maintains spatial information
3
Q

Learning Many Features

A

Weights are not shared across different feature extractors.

Reduces parameters to (K1 × K2 + 1) × M, where M is the number of features to be learned.
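
Worked example (numbers chosen for illustration): a 5×5 kernel with M = 32 features needs (5×5 + 1) × 32 = 832 parameters, independent of the image size.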

4
Q

Convolution

A

In mathematics, a convolution is an operation on two functions f and g that produces a third function, typically viewed as a modified version of one of the originals. It gives the area of overlap between the two functions as a function of the amount by which one of them is translated.
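
For reference, the standard discrete 1-D form (my notation, not from the card), for an input x and kernel w:

```latex
(x * w)[n] = \sum_{m} x[m]\, w[n - m]
```

The flipped index (n - m) is what distinguishes true convolution from cross-correlation (see the cross-correlation card later in the deck).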

5
Q

T or F: Convolutions are linear operations

A

True. A convolution is a weighted sum of its inputs, so conv(a·x + b·y) = a·conv(x) + b·conv(y).

6
Q

What are CNN hyperparameters

A
  1. in_channels (int): Number of channels in the input image
  2. out_channels (int): Number of channels produced by the convolution
  3. kernel_size (int; tuple): Size of the convolving kernel
  4. stride (int; tuple; optional): The step size used by the convolution (default is 1)
  5. padding (int; tuple; optional): Zero padding added to both sides of the input (default is 0)
  6. padding_mode (string): 'zeros', 'reflect', 'replicate', 'circular' (default 'zeros')

(See the usage sketch after this list.)
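
As a concrete illustration (the specific values are my own, not from the card), these are the arguments of PyTorch's `torch.nn.Conv2d`:

```python
import torch
import torch.nn as nn

# Hypothetical layer: 3 input channels (RGB), 16 output feature maps,
# 3x3 kernel, stride 1, 1 pixel of zero padding ("same"-sized output).
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3,
                 stride=1, padding=1, padding_mode='zeros')

x = torch.randn(1, 3, 32, 32)   # a batch containing one 32x32 RGB image
y = conv(x)
print(y.shape)                  # torch.Size([1, 16, 32, 32])
```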
7
Q

Output size formula for a vanilla convolution operation

A

Specifically, each output spatial dimension is:

(N - F + 2P) / S + 1

so the full output volume is output_dim_1 × output_dim_2 × N_channels, where N_channels is the number of filters.

N: Input dimension (height or width)
F: Filter dimension
P: Padding
S: Stride
+1: Accounts for the filter's starting position
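
Worked example (numbers chosen for illustration): a 32×32 input, 5×5 filter, P = 0, S = 1 gives (32 - 5 + 0)/1 + 1 = 28 per spatial dimension, so with 10 filters the output is 28×28×10.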

8
Q

"Valid" convolution

A

The kernel is applied only where it fits entirely inside the image (no padding); the output shrinks to m - k + 1 per dimension.

9
Q

T or F: The larger the filter, the smaller the shrinkage

A

False. Larger filter = larger shrinkage.

10
Q

"Same" convolution

A

zero-padding the image borders to produce an output the same size as the raw input

11
Q

CNN: Max pooling

A

For each window, calculate its max.

Pros: No parameters to learn.
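
A minimal sketch (shapes are my own choice): PyTorch's `nn.MaxPool2d` with a 2×2 window and stride 2 halves each spatial dimension while leaving the depth unchanged:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # non-overlapping 2x2 windows

x = torch.randn(1, 16, 28, 28)   # 16 feature maps of size 28x28
y = pool(x)
print(y.shape)                   # torch.Size([1, 16, 14, 14])
```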

12
Q

CNN: Stride

A

The step size by which the filter moves across the input between applications.

13
Q

CNN: Pooling layer

A

Makes the representations smaller and more manageable through downsampling.

Only pools width and height, not depth.

14
Q

CNN: Cross-correlation

A

Takes the dot product of a small filter (also called a kernel or weights) and an overlapping region of the input image or feature map.

Unlike true convolution, the kernel is not flipped.

15
Q

CNN: T or F - Using a stride greater than 1 results in loss of information.

A

True. Stride > 1 implies jumping over some pixels.

16
Q

CNN: Output size of vanilla/valid convolution vs full convolution

A

Vanilla: m-k+1
Full: m+k-1
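
Worked example (my numbers): with m = 5 and k = 3, a valid convolution gives 5 - 3 + 1 = 3 outputs per dimension, while a full convolution gives 5 + 3 - 1 = 7.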

17
Q

CNN: Benefit of pooling

A

Makes the representation invariant to small changes in the input

18
Q

CNN: Full convolution

A

Enough zeros are added to the borders such that every pixel is visited k times in each direction. This results in an output of size m + k - 1.

Full = Bigger size than original

19
Q

Sigmoid

A

Min = 0; max = 1

Output is always positive

Saturates at both ends

Gradients vanish at both ends (as the output converges to 0 or 1, the gradient approaches zero) but are always positive

Computational complexity is high due to the exponential term
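
For reference (standard definitions, not stated on the card), the sigmoid and its gradient are:

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr) \in (0, 0.25]
```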

20
Q

tanh

A

Min = -1; max = 1; output is zero-centred
Saturates at both ends (-1, 1)
Gradients vanish at both ends but are always positive
Medium computational complexity (tanh is not as simple as, say, a multiplication)
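
For reference (standard definitions, not from the card):

```latex
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad
\tanh'(x) = 1 - \tanh^2(x) \in (0, 1]
```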

21
Q

ReLU

A

Min = 0; max = ∞; output is always non-negative
Does not saturate on the positive side
Gradients: 0 when x <= 0 (the "dead ReLU" problem); constant otherwise (does not vanish, which is good)
Cheap to compute: it does not get much simpler than a max function
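
For reference (standard definitions, not from the card):

```latex
\mathrm{ReLU}(x) = \max(0, x), \qquad
\mathrm{ReLU}'(x) =
\begin{cases}
1 & x > 0 \\
0 & x < 0
\end{cases}
\quad (\text{undefined at } x = 0)
```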

22
Q

T or F: ReLU is differentiable

A

Technically no: ReLU is not differentiable at zero (it is differentiable everywhere else); in practice a subgradient of 0 or 1 is used there.

23
Q

Initialization: What happens if you initialize close to a bad local minima

A

Poor gradient flow

24
Q

Initialization: What happens if you initialize with large activations

A

Reach saturation quickly

25
Q

Initialization: What happens if you initialize with small activations

A

Activations stay in (or close to) the linear regime of the nonlinearity, so there is a strong gradient to learn from.

26
Q

Initialization: What happens if you initialize all weights with a constant

A

All nodes learn the same thing (the symmetry between them is never broken).

27
Q

Initialization: Common practice

A

1) Random sample from a small normal distribution N(μ = 0, σ = 0.01)

2) Random sample from a uniform distribution
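
A minimal sketch of both options in PyTorch (layer shape and uniform range are my own choices for illustration):

```python
import torch.nn as nn

layer = nn.Linear(256, 128)  # a hypothetical fully connected layer

# 1) Small zero-mean normal initialization, N(mu=0, sigma=0.01)
nn.init.normal_(layer.weight, mean=0.0, std=0.01)

# 2) Or: uniform initialization over a small symmetric range
nn.init.uniform_(layer.weight, a=-0.01, b=0.01)

nn.init.zeros_(layer.bias)   # biases are commonly initialized to zero
```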

28
Q

Initialization: Why are small weights, sampled identically from the same distribution, preferred?

A

There is no a priori reason why some weights should be larger than others.

29
Q

Initialization: T or F - Deeper networks are less sensitive to initialization

A

False. Deeper networks are more sensitive, because the activations get progressively smaller layer by layer.

30
Q

Initialization: Fan-in Fan-out rule

A

Maintain the variance of each layer's output to be similar to that of its input, keeping the variance roughly the same across layers.
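
A common concrete form of this rule is Xavier/Glorot initialization (my addition; the course may use a slightly different variant):

```latex
\operatorname{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}
\qquad \text{or, fan-in only:} \qquad
\operatorname{Var}(W) = \frac{1}{n_{\text{in}}}
```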

31
Q

Optimization: Issues that hinder optimization

A

Noisy gradient estimates (due to taking MiniBatches)

Saddle points

Ill-conditioned loss surface, where the curvature is high in one direction but not in another

32
Q

Optimization: Loss surfaces that can cause problems

A

Local minima

Plateaus

Saddle points (a minimum along one axis but a maximum along another)

33
Q

Optimization: Momentum

A

Overcomes plateaus by adding an exponential moving average of the gradient. Helps move the optimizer off areas with low gradients.
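
A common form of the update (my notation; β is the momentum coefficient, α the learning rate, g_t the gradient):

```latex
v_t = \beta\, v_{t-1} + g_t, \qquad
w_t = w_{t-1} - \alpha\, v_t
```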

34
Q

Optimization: Nesterov Momentum

A

Calculates gradient AFTER applying momentum term

35
Q

Optimization: How to use Hessian

A

Use 2nd order derivatives to get information about the loss surface

36
Q

Optimization: Condition number

A

The ratio between the largest and smallest eigenvalues of the Hessian.

Tells us how different the curvature is along different dimensions.

37
Q

Optimization: General idea of techniques like Adam

A

Apply per-parameter learning rates

38
Q

Optimization: Adagrad

A

Adapts the learning rate for each parameter based on its historical gradients.

Parameters with larger gradients see a rapid decrease in learning rate, while those with small gradients see a slower decrease.

Pro: Works well in a gently sloped parameter space.

Con: Can prematurely make the learning rate too small.
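
Sketch of the per-parameter update (my notation; g_t is the gradient and ε a small constant for numerical stability):

```latex
r_t = r_{t-1} + g_t^2, \qquad
w_t = w_{t-1} - \frac{\alpha}{\sqrt{r_t} + \epsilon}\, g_t
```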

39
Q

Optimization: RMSProp

A

Like AdaGrad, but replaces the accumulated sum of squared gradients with an exponential moving average.

40
Q

Optimization: Adam

A

Like Adagrad / RMSProp but includes momentum terms
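
Sketch of the update (my notation; the bias-correction terms are omitted for brevity):

```latex
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2, \qquad
w_t = w_{t-1} - \frac{\alpha\, m_t}{\sqrt{v_t} + \epsilon}
```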

41
Q

Regularization: L1 norm

A

Adds α times the sign of the weights as the regularization term in the gradient update. Results in sparse parameters.

J = α sign(W^{J-1}) + J(W^{J-1})

42
Q

Regularization: L2 norm

A

Applies a regularization term to the weights at each update

J = α W^{J-1} + J(W^{J-1})

43
Q

Regularization: Dropout

A

Dropout is a technique in which a set of units is randomly masked (i.e., multiplied by a matrix of 1s and 0s), so the corresponding parameters do not learn on that update.

Makes the model less reliant on any individual, highly effective parameters.
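
A minimal sketch of the masking idea (my own code; here p is the keep probability, matching the convention of the next card, and the 1/p scaling is the "inverted dropout" variant):

```python
import torch

def dropout(x: torch.Tensor, p: float = 0.5, training: bool = True) -> torch.Tensor:
    """Keep each element of x with probability p; zero it otherwise (inverted dropout)."""
    if not training or p >= 1.0:
        return x                              # no masking or rescaling at inference time
    mask = (torch.rand_like(x) < p).float()   # 1 with probability p, 0 with probability 1-p
    return x * mask / p                       # rescale so the expected activation is unchanged

x = torch.randn(4, 8)
y = dropout(x, p=0.5, training=True)
```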

44
Q

Regularization: What needs to be done with dropout during inference

A

1) Scale the outputs or weights at test time by the keep probability p:

W_test = W * p

2) Or scale activations by 1/p during training (inverted dropout), so no change is needed at test time.

45
Q

Batch norm: Diff between batch vs layer norm

A

Batch: Normalizes activations along the batch dimension
Layer: Normalizes activations along the feature (channel) dimension for each data point in the mini-batch

46
Q

Batch Norm: Definition

A

Normalizes the activations of a layer across a mini-batch of data, which helps stabilize and speed up training by reducing internal covariate shift.
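
The standard form (my notation; μ_B and σ²_B are the mini-batch mean and variance, γ and β are learned parameters):

```latex
\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y = \gamma\, \hat{x} + \beta
```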

47
Q

Batch Norm: How to use batch norm during inference

A

Use the running mean and variance averaged during training (instead of per-batch statistics).

48
Q

Batch norm: Pros

A

Improves gradient flow

Allows higher learning rates

Reduces dependence on initialization

Differentiable

49
Q

Batch norm: Cons

A

Sufficient batch sizes must be used to get stable per-batch mean/variance.

50
Q

Batch norm: Where to apply in network

A

Right before activation

51
Q

Batch norm: T or F - Batch norm is useful for linear networks

A

False. This is because we have normalized out the first- and second-order statistics, which is all that a linear network can influence

52
Q

Batch norm: T or F - Batch norm is useful for deep networks

A

True. In a deep neural network with nonlinear activation functions, the lower layers can perform nonlinear transformations of the data, so they remain useful.

53
Q

Batch norm: T or F - The bias term should be omitted during batch norm

A

True. The bias term should be omitted because it becomes redundant with the β parameter applied by the batch normalization reparametrization.

54
Q

Batch norm: T or F - For CNNs, you should use different mean/variance values at each spatial location within a feature map

A

False. It is important to apply the same normalizing μ and σ at every spatial location within a feature map, so that the statistics of the feature map remain the same regardless of spatial location.

55
Q

Initialization: Variance of ReLU

A

N(0, 1) × sqrt(2 / n_j), where n_j is the number of inputs (fan-in) to the layer.

56
Q

Batch: Batch gradient descent

A

Network goes through the forward/backward pass using the entire dataset.

Pro: Provides deterministic updates.

Con: Computationally expensive.

57
Q

Batch: Stochastic gradient descent

A

Network goes through f/b using only one example.

Noisier updates, but can have a regularization effect by pushing the model out of local minima.

Doesn't take advantage of parallelization.

58
Q

Batch: Mini-batch gradient descent

A

Takes N samples from the empirical distribution and runs f/b pass.

Benefits from parallel processing.

More stable than SGD.

59
Q

Supervised pretraining

A

Model is initially trained on a related task that has labeled data (supervised learning) before fine-tuning it on the target task of interest.

Knowledge gained during the initial training phase can serve as a useful starting point for the model when tackling the target task.

60
Q

Formula to calculate number of parameters and bias

A

K × (F1 × F2 × D + 1)

K: Number of kernels
F1, F2: Filter dimensions
D: Number of input channels
+1: One bias term per kernel
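
Worked example (my numbers): 64 kernels of size 3×3 over a 3-channel input give 64 × (3×3×3 + 1) = 64 × 28 = 1,792 parameters.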