Quiz 2 Flashcards

1
Q

Receptive fields

A

Each node only receives input from a K1 × K2 window (image patch).

The region from which a node receives its input is called a receptive field.

2
Q

Shared Weights

A

Nodes in different locations can share features.

Uses the same weights/parameters in the computation graph.

  • Reduces parameters to (K1 × K2 + 1)
  • Explicitly maintains spatial information
3
Q

Learning Many Features

A

Weights are not shared across different feature extractors.

Reduces parameters to (K1 × K2 + 1) × M, where M is the number of features to be learned.
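
Worked example (numbers chosen for illustration): a 5×5 kernel with M = 32 features needs (5×5 + 1) × 32 = 832 parameters, independent of the image size.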

4
Q

Convolution

A

In mathematics, a convolution is an operation on two functions f and g that produces a third function, typically viewed as a modified version of one of the originals. It gives the area of overlap between the two functions as a function of the amount by which one of them is translated.
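
For reference, the standard discrete 1-D form (my notation, not from the card), for an input x and kernel w:

```latex
(x * w)[n] = \sum_{m} x[m]\, w[n - m]
```

The flipped index (n - m) is what distinguishes true convolution from cross-correlation (see the cross-correlation card later in the deck).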

5
Q

T or F: Convolutions are linear operations

A

True. A convolution is a weighted sum of its inputs, so conv(a·x + b·y) = a·conv(x) + b·conv(y).

6
Q

What are CNN hyperparameters

A
  1. in_channels (int): Number of channels in the input image
  2. out_channels (int): Number of channels produced by the convolution
  3. kernel_size (int; tuple): Size of the convolving kernel
  4. stride (int; tuple; optional): The step size used by the convolution (default is 1)
  5. padding (int; tuple; optional): Zero padding added to both sides of the input (default is 0)
  6. padding_mode (string): 'zeros', 'reflect', 'replicate', 'circular' (default 'zeros')

(See the usage sketch after this list.)
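
As a concrete illustration (the specific values are my own, not from the card), these are the arguments of PyTorch's `torch.nn.Conv2d`:

```python
import torch
import torch.nn as nn

# Hypothetical layer: 3 input channels (RGB), 16 output feature maps,
# 3x3 kernel, stride 1, 1 pixel of zero padding ("same"-sized output).
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3,
                 stride=1, padding=1, padding_mode='zeros')

x = torch.randn(1, 3, 32, 32)   # a batch containing one 32x32 RGB image
y = conv(x)
print(y.shape)                  # torch.Size([1, 16, 32, 32])
```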
7
Q

Output size formula for a vanilla convolution operation

A

Specifically, each output spatial dimension is:

(N - F + 2P) / S + 1

so the full output volume is output_dim_1 × output_dim_2 × N_channels, where N_channels is the number of filters.

N: Input dimension (height or width)
F: Filter dimension
P: Padding
S: Stride
+1: Accounts for the filter's starting position
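
Worked example (numbers chosen for illustration): a 32×32 input, 5×5 filter, P = 0, S = 1 gives (32 - 5 + 0)/1 + 1 = 28 per spatial dimension, so with 10 filters the output is 28×28×10.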

8
Q

"Valid" convolution

A

The kernel is applied only where it fits entirely inside the image (no padding); the output shrinks to m - k + 1 per dimension.

9
Q

T or F: The larger the filter, the smaller the shrinkage

A

False. Larger filter = larger shrinkage.

10
Q

"Same" convolution

A

zero-padding the image borders to produce an output the same size as the raw input

11
Q

CNN: Max pooling

A

For each window, calculate its max.

Pros: No parameters to learn.
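
A minimal sketch (shapes are my own choice): PyTorch's `nn.MaxPool2d` with a 2×2 window and stride 2 halves each spatial dimension while leaving the depth unchanged:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # non-overlapping 2x2 windows

x = torch.randn(1, 16, 28, 28)   # 16 feature maps of size 28x28
y = pool(x)
print(y.shape)                   # torch.Size([1, 16, 14, 14])
```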

12
Q

CNN: Stride

A

The step size by which the filter moves across the input between applications.

13
Q

CNN: Pooling layer

A

Makes the representations smaller and more manageable through downsampling.

Only pools width and height, not depth.

14
Q

CNN: Cross-correlation

A

Takes the dot product of a small filter (also called a kernel or weights) and an overlapping region of the input image or feature map.

Unlike true convolution, the kernel is not flipped.

15
Q

CNN: T or F - Using a stride greater than 1 results in loss of information.

A

True. Stride > 1 implies jumping over some pixels.

16
Q

CNN: Output size of vanilla/valid convolution vs full convolution

A

Vanilla: m-k+1
Full: m+k-1
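
Worked example (my numbers): with m = 5 and k = 3, a valid convolution gives 5 - 3 + 1 = 3 outputs per dimension, while a full convolution gives 5 + 3 - 1 = 7.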

17
Q

CNN: Benefit of pooling

A

Makes the representation invariant to small changes in the input

18
Q

CNN: Full convolution

A

Enough zeros are added to the borders such that every pixel is visited k times in each direction. This results in an output of size m + k - 1.

Full = Bigger size than original

19
Q

Sigmoid

A

Min = 0; max = 1

Output is always positive

Saturates at both ends

Gradients vanish at both ends (as the output converges to 0 or 1, the gradient approaches zero) but are always positive

Computational complexity is high due to the exponential term
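
For reference (standard definitions, not stated on the card), the sigmoid and its gradient are:

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr) \in (0, 0.25]
```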

20
Q

tanh

A

Min = -1; max = 1; output is zero-centred
Saturates at both ends (-1, 1)
Gradients vanish at both ends but are always positive
Medium computational complexity (tanh is not as simple as, say, a multiplication)
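
For reference (standard definitions, not from the card):

```latex
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad
\tanh'(x) = 1 - \tanh^2(x) \in (0, 1]
```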

21
Q

ReLU

A

Min = 0; max = ∞; output is always non-negative
Does not saturate on the positive side
Gradients: 0 when x <= 0 (the "dead ReLU" problem); constant otherwise (does not vanish, which is good)
Cheap to compute: it does not get much simpler than a max function
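
For reference (standard definitions, not from the card):

```latex
\mathrm{ReLU}(x) = \max(0, x), \qquad
\mathrm{ReLU}'(x) =
\begin{cases}
1 & x > 0 \\
0 & x < 0
\end{cases}
\quad (\text{undefined at } x = 0)
```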

22
Q

T or F: ReLU is differentiable

A

Technically no: ReLU is not differentiable at zero (it is differentiable everywhere else); in practice a subgradient of 0 or 1 is used there.

23
Q

Initialization: What happens if you initialize close to a bad local minima

A

Poor gradient flow

24
Q

Initialization: What happens if you initialize with large activations

A

Reach saturation quickly

25
Q

Initialization: What happens if you initialize with small activations

A

Activations stay in (or close to) the linear regime of the nonlinearity, so there is a strong gradient to learn from.

26
Q

Initialization: What happens if you initialize all weights with a constant

A

All nodes learn the same thing (the symmetry between them is never broken).

27
Q

Initialization: Common practice

A

1) Random sample from a small normal distribution N(μ = 0, σ = 0.01)

2) Random sample from a uniform distribution
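
A minimal sketch of both options in PyTorch (layer shape and uniform range are my own choices for illustration):

```python
import torch.nn as nn

layer = nn.Linear(256, 128)  # a hypothetical fully connected layer

# 1) Small zero-mean normal initialization, N(mu=0, sigma=0.01)
nn.init.normal_(layer.weight, mean=0.0, std=0.01)

# 2) Or: uniform initialization over a small symmetric range
nn.init.uniform_(layer.weight, a=-0.01, b=0.01)

nn.init.zeros_(layer.bias)   # biases are commonly initialized to zero
```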

28
Q

Initialization: Why are small weights, sampled identically from the same distribution, preferred?

A

There is no a priori reason why some weights should be larger than others.

29
Q

Initialization: T or F - Deeper networks are less sensitive to initialization

A

False. Deeper networks are more sensitive, because the activations get progressively smaller layer by layer.

30
Q

Initialization: Fan-in Fan-out rule

A

Maintain the variance of each layer's output to be similar to that of its input, keeping the variance roughly the same across layers.
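
A common concrete form of this rule is Xavier/Glorot initialization (my addition; the course may use a slightly different variant):

```latex
\operatorname{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}
\qquad \text{or, fan-in only:} \qquad
\operatorname{Var}(W) = \frac{1}{n_{\text{in}}}
```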

31
Q

Optimization: Issues that hinder optimization

A

Noisy gradient estimates (due to taking MiniBatches)

Saddle points

Ill-conditioned loss surface, where the curvature is high in one direction but not in another

32
Q

Optimization: Loss surfaces that can cause problems

A

Local minima

Plateaus

Saddle points (a minimum along one axis but a maximum along another)

33
Q

Optimization: Momentum

A

Overcomes plateaus by adding an exponential moving average of the gradient. Helps move the optimizer off areas with low gradients.
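
A common form of the update (my notation; β is the momentum coefficient, α the learning rate, g_t the gradient):

```latex
v_t = \beta\, v_{t-1} + g_t, \qquad
w_t = w_{t-1} - \alpha\, v_t
```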

34
Q

Optimization: Nesterov Momentum

A

Calculates gradient AFTER applying momentum term

35
Q

Optimization: How to use Hessian

A

Use 2nd order derivatives to get information about the loss surface

36
Q

Optimization: Condition number

A

The ratio between the largest and smallest eigenvalues of the Hessian.

Tells us how different the curvature is along different dimensions.

37
Q

Optimization: General idea of techniques like Adam

A

Apply per-parameter learning rates

38
Q

Optimization: Adagrad

A

Adapts the learning rate for each parameter based on its historical gradients.

Parameters with larger gradients see a rapid decrease in learning rate, while those with small gradients see a slower decrease.

Pro: Works well in a gently sloped parameter space.

Con: Can prematurely make the learning rate too small.
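
Sketch of the per-parameter update (my notation; g_t is the gradient and ε a small constant for numerical stability):

```latex
r_t = r_{t-1} + g_t^2, \qquad
w_t = w_{t-1} - \frac{\alpha}{\sqrt{r_t} + \epsilon}\, g_t
```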

39
Q

Optimization: RMSProp

A

Like AdaGrad, but replaces the accumulated sum of squared gradients with an exponential moving average.

40
Q

Optimization: Adam

A

Like Adagrad / RMSProp but includes momentum terms
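
Sketch of the update (my notation; the bias-correction terms are omitted for brevity):

```latex
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2, \qquad
w_t = w_{t-1} - \frac{\alpha\, m_t}{\sqrt{v_t} + \epsilon}
```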

41
Q

Regularization: L1 norm

A

Adds α times the sign of the weights as the regularization term in the gradient update. Results in sparse parameters.

J = α sign(W^{J-1}) + J(W^{J-1})

42
Q

Regularization: L2 norm

A

Applies a regularization term to the weights at each update

J = α W^{J-1} + J(W^{J-1})

43
Q

Regularization: Dropout

A

Dropout is a technique in which a set of units is randomly masked (i.e., multiplied by a matrix of 1s and 0s), so the corresponding parameters do not learn on that update.

Makes the model less reliant on any individual, highly effective parameters.
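
A minimal sketch of the masking idea (my own code; here p is the keep probability, matching the convention of the next card, and the 1/p scaling is the "inverted dropout" variant):

```python
import torch

def dropout(x: torch.Tensor, p: float = 0.5, training: bool = True) -> torch.Tensor:
    """Keep each element of x with probability p; zero it otherwise (inverted dropout)."""
    if not training or p >= 1.0:
        return x                              # no masking or rescaling at inference time
    mask = (torch.rand_like(x) < p).float()   # 1 with probability p, 0 with probability 1-p
    return x * mask / p                       # rescale so the expected activation is unchanged

x = torch.randn(4, 8)
y = dropout(x, p=0.5, training=True)
```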

44
Q

Regularization: What needs to be done with dropout during inference

A

1) Scale the outputs or weights at test time by the keep probability p:

W_test = W * p

2) Or scale activations by 1/p during training (inverted dropout), so no change is needed at test time.

45
Q

Batch norm: Diff between batch vs layer norm

A

Batch: Normalizes activations along the batch dimension
Layer: Normalizes activations along the feature (channel) dimension for each data point in the mini-batch

46
Q

Batch Norm: Definition

A

Normalizes the activations of a layer across a mini-batch of data, which helps stabilize and speed up training by reducing internal covariate shift.
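
The standard form (my notation; μ_B and σ²_B are the mini-batch mean and variance, γ and β are learned parameters):

```latex
\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y = \gamma\, \hat{x} + \beta
```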

47
Q

Batch Norm: How to use batch norm during inference

A

Use the running mean and variance averaged during training (instead of per-batch statistics).

48
Q

Batch norm: Pros

A

Improves gradient flow

Allows higher learning rates

Reduces dependence on initialization

Differentiable

49
Q

Batch norm: Cons

A

Sufficient batch sizes must be used to get stable per-batch mean/variance.

50
Q

Batch norm: Where to apply in network

A

Right before activation

51
Q

Batch norm: T or F - Batch norm is useful for linear networks

A

False. This is because we have normalized out the first- and second-order statistics, which is all that a linear network can influence

52
Q

Batch norm: T or F - Batch norm is useful for deep networks

A

True. In a deep neural network with nonlinear activation functions, the lower layers can perform nonlinear transformations of the data, so they remain useful.

53
Q

Batch norm: T or F - The bias term should be omitted during batch norm

A

True. The bias term should be omitted because it becomes redundant with the β parameter applied by the batch normalization reparametrization.

54
Q

Batch norm: T or F - For CNNs, you should use different mean/variance values at each spatial location within a feature map

A

False. It is important to apply the same normalizing μ and σ at every spatial location within a feature map, so that the statistics of the feature map remain the same regardless of spatial location.

55
Q

Initialization: Variance of ReLU

A

N(0, 1) × sqrt(2 / n_j), where n_j is the number of inputs (fan-in) to the layer.

56
Q

Batch: Batch gradient descent

A

Network goes through the forward/backward pass using the entire dataset.

Pro: Provides deterministic updates.

Con: Computationally expensive.

57
Q

Batch: Stochastic gradient descent

A

Network goes through f/b using only one example.

Noisier updates, but can have a regularization effect by pushing the model out of local minima.

Doesn't take advantage of parallelization.

58
Q

Batch: Mini-batch gradient descent

A

Takes N samples from the empirical distribution and runs f/b pass.

Benefits from parallel processing.

More stable than SGD.

59
Q

Supervised pretraining

A

Model is initially trained on a related task that has labeled data (supervised learning) before fine-tuning it on the target task of interest.

Knowledge gained during the initial training phase can serve as a useful starting point for the model when tackling the target task.

60
Q

Formula to calculate number of parameters and bias

A

K × (F1 × F2 × D + 1)

K: Number of kernels
F1, F2: Filter dimensions
D: Number of input channels
+1: One bias term per kernel
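
Worked example (my numbers): 64 kernels of size 3×3 over a 3-channel input give 64 × (3×3×3 + 1) = 64 × 28 = 1,792 parameters.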