You have an input volume of 32×32×3. What are the dimensions of the resulting volume after convolving a 5×5 kernel with zero padding, stride of 1, and 2 filters?

Output dimensions = (32 − 5 + 1) × (32 − 5 + 1) × 2 = 28 × 28 × 2

Parameter count = (k1 × k2 × depth + 1) × no. of filters

Therefore, (5 × 5 × 3 + 1) × 2 = 152

- Consider a collection of 100 documents. Given a query q, the set of documents relevant to the user is D* = {d3, d12, d34, d56, d98}. An IR system retrieves the documents D = {d3, d12, d35, d56, d66, d88, d95}.

• Compute the number of True Positives, True Negatives, False Positives, and False Negatives.

• Compute Precision, Recall, and Accuracy.

TP = 3 (d3, d12, d56), TN = 91, FP = 4, FN = 2

Precision = TP / (TP + FP) = 3/7, Recall = TP / (TP + FN) = 3/5, Accuracy = (TP + TN) / 100 = 94/100
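The counts in this worked example can be checked with a short Python sketch (the document IDs are the ones from the question; variable names are mine):

```python
# Worked IR-metrics example: 100 documents, relevant set D*, retrieved set D.
relevant = {"d3", "d12", "d34", "d56", "d98"}
retrieved = {"d3", "d12", "d35", "d56", "d66", "d88", "d95"}
N = 100  # size of the collection

tp = len(relevant & retrieved)   # relevant and retrieved
fp = len(retrieved - relevant)   # retrieved but not relevant
fn = len(relevant - retrieved)   # relevant but not retrieved
tn = N - tp - fp - fn            # neither relevant nor retrieved

precision = tp / len(retrieved)  # 3/7
recall = tp / len(relevant)      # 3/5
accuracy = (tp + tn) / N         # 94/100
```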

You have an input volume of 32×32×3. What are the dimensions of

the resulting volume after convolving a 5×5 kernel with zero padding,

stride of 1, and 2 filters?

4. How many weights and biases would you have?

(k1 × k2 × depth) × no. of filters + no. of filters

= 5 × 5 × 3 × 2 + 2 = 152

Output size of vanilla Convolution

(H − k1 + 1) × (W − k2 + 1)
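The convolution arithmetic used throughout these questions can be captured in two small helpers (a sketch; function names are mine, not from the notes):

```python
def conv_output_size(h, w, k1, k2, padding=0, stride=1):
    """Spatial output size of a convolution: floor((dim - k + 2P) / S) + 1."""
    out_h = (h - k1 + 2 * padding) // stride + 1
    out_w = (w - k2 + 2 * padding) // stride + 1
    return out_h, out_w

def conv_param_count(k1, k2, depth, num_filters):
    """Weights + biases: each filter has k1*k2*depth weights and one bias."""
    return (k1 * k2 * depth + 1) * num_filters
```

For the 32×32×3 example, `conv_output_size(32, 32, 5, 5)` gives (28, 28) and `conv_param_count(5, 5, 3, 2)` gives 152.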

Suppose you have an input volume of dimension 64x64x16. How many

parameters would a single 1x1 convolutional filter have, including the

bias?

1 × 1 × 16 + 1 = 17

Suppose your input is a 300 by 300 color (RGB) image, and you use

a convolutional layer with 100 filters that are each 5x5. How many

parameters does this layer have including the bias parameters?

(5 × 5 × 3 + 1) × 100 = 7600

You have an input volume that is 63x63x16 and convolve it with 32

filters that are each 7x7, and stride of 1. You want to use a same

convolution. What is the padding?

((63 − 7 + 2P) / 1) + 1 = 63

⇒ 57 + 2P = 63, so P = 3
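Solving for the "same" padding generalizes to a one-liner (a sketch assuming stride 1 and an odd kernel, so P comes out an integer):

```python
def same_padding(size, kernel, stride=1):
    """Padding P such that ((size - kernel + 2P) / stride) + 1 == size."""
    return ((size - 1) * stride - size + kernel) / 2
```

For the question above, `same_padding(63, 7)` returns 3.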

Sigmoid

0 to 1

Lose gradient at both ends

Computing it involves an exponential term (relatively expensive)

Tanh

-1 to 1 (centered at 0)

Lose gradient at both ends

Still computationally heavy

ReLU

No saturation on positive end

Can cause dead neurons (zero gradient when x ≤ 0)

Cheap to compute

Leaky ReLU

Small negative slope for x < 0 (a learnable parameter in the PReLU variant)

No saturation

No dead neuron

Still cheap to compute

Which activation is best?

ReLU is typical starting point

Sigmoid is typically avoided
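The four activations compared above, as plain-Python sketches (the 0.01 leaky slope is a common default, not from the notes):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))  # range (0, 1); saturates at both ends

def tanh(x):
    return math.tanh(x)                # range (-1, 1); zero-centered but still saturates

def relu(x):
    return max(0.0, x)                 # no positive saturation; gradient is 0 for x <= 0

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x   # small slope for x < 0 avoids dead neurons
```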

Initialization

Initialization close to a good (local) minimum converges faster and to a better solution

Initializing all weights to the same constant leads to a degenerate solution: every neuron computes the same output and receives the same gradient!

Xavier Initialization –> Lesson 3, Slide 26

Issues with optimizers

Noisy gradient estimates

Saddle points

Ill-conditioned loss surface

Optimization types

RMSProp

Keep a moving average of squared gradients

Adagrad

Use gradient statistics to reduce learning rate across iterations

Adam

Maintains both first and second moment statistics for gradients
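The RMSProp and Adam update rules above can be sketched for a single scalar weight (hyperparameter defaults are common choices, not from the notes):

```python
import math

def rmsprop_step(w, grad, sq_avg, lr=1e-2, decay=0.9, eps=1e-8):
    # RMSProp: moving average of squared gradients scales the step size.
    sq_avg = decay * sq_avg + (1 - decay) * grad ** 2
    w = w - lr * grad / (math.sqrt(sq_avg) + eps)
    return w, sq_avg

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: first (m) and second (v) moment moving averages, bias-corrected.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)  # t is the 1-based iteration count
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```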

Dropout

Dropout: for each node, keep its output with probability p; activations of dropped nodes are set to zero

In practice, implement with a mask calculated each iteration

During testing, no nodes are dropped

Can be seen as:

Training 2^n networks, or

A regularizer: the model cannot rely too heavily on any particular feature
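The per-iteration mask idea can be sketched as inverted dropout (the 1/p rescaling is a common implementation detail so that no change is needed at test time):

```python
import random

def dropout_forward(activations, p, training=True):
    # Keep each activation with probability p; dropped ones become zero.
    # Scaling survivors by 1/p keeps the expected activation unchanged,
    # so at test time no nodes are dropped and no rescaling is needed.
    if not training:
        return list(activations)
    return [a / p if random.random() < p else 0.0 for a in activations]
```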

Methods to address class imbalance

Sampling

SMOTE (Synthetic Minority Oversampling Technique)

Identify nearest neighbors of a minority-class point in feature space, select one of those neighbors, then uniformly sample a synthetic point from the line segment connecting the point to that neighbor
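A minimal sketch of the interpolation step (neighbor search omitted; the function name is mine):

```python
import random

def smote_sample(point, neighbor):
    # Create a synthetic minority example by sampling uniformly along the
    # segment between a minority point and one of its minority neighbors.
    lam = random.random()  # same interpolation factor for every coordinate
    return [p + lam * (n - p) for p, n in zip(point, neighbor)]
```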

Cost-based learning

Focal Loss

Downweights easy examples (well-classified, high-probability examples) so training focuses on hard ones
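Focal loss for binary classification can be sketched as follows (γ = 2 is a common choice, not stated in the notes; with γ = 0 it reduces to plain cross-entropy):

```python
import math

def focal_loss(p, y, gamma=2.0):
    # Cross-entropy scaled by (1 - p_t)^gamma: easy, well-classified
    # examples (p_t near 1) contribute much less to the loss.
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```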