Deep Learning Flashcards

1
Q

What does ReLU stand for, and what does it mean?

A

Rectified Linear Unit

The rectified linear activation function, or ReLU for short, is a piecewise linear function that outputs the input directly if it is positive and outputs zero otherwise.
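A minimal numpy sketch of this definition (function name and test values are illustrative):

    import numpy as np

    def relu(z):
        # output the input directly when positive, otherwise zero
        return np.maximum(0, z)

    relu(np.array([-2.0, 0.0, 3.0]))  # -> array([0., 0., 3.])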

2
Q

What type of neural network architecture is used for, e.g., house price prediction or advertisement click probability?

A

Standard (fully connected) neural network architecture

3
Q

What type of neural network architecture is used for image recognition?

A

CNN (convolutional neural network)

4
Q

For sequence data, e.g. audio over time, what type of neural network architecture do we use?

A

Recurrent Neural Network

5
Q

What is the vanishing gradient problem?

A

Optimisation of the parameters uses gradient descent to find the best parameter values. The vanishing gradient problem occurs when the gradient becomes exponentially small, so the update to the parameter we are trying to learn becomes insignificant. The implications are that the model may never converge to the optimum, or that it takes much longer to train.

6
Q

Explain gradient descent

A

Gradient descent is a method of updating the coefficients a0 and a1 to minimise the cost function (e.g. MSE). A regression model uses gradient descent to fit the coefficients of the line (intercept a0 and slope a1) by starting from initial coefficient values and then iteratively updating them to reach the minimum of the cost function (a numpy sketch follows the steps below).

1) Start with random coefficients
2) Calculate the predicted values
3) Calculate the partial derivatives of the cost w.r.t. a0 and a1, substituting in the predicted values
4) Multiply each derivative by the learning rate and subtract it from the corresponding coefficient
5) Stop after a fixed number of iterations (e.g. 100) or once the error is low
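A minimal numpy sketch of these steps for a line y = a0 + a1*x with an MSE cost (learning rate and iteration count are illustrative):

    import numpy as np

    def fit_line(x, y, lr=0.01, iters=100):
        a0, a1 = 0.0, 0.0                              # 1) start coefficients (zeros here)
        m = len(x)
        for _ in range(iters):
            y_pred = a0 + a1 * x                       # 2) predicted values
            d_a0 = (2 / m) * np.sum(y_pred - y)        # 3) partial derivative of MSE w.r.t. a0
            d_a1 = (2 / m) * np.sum((y_pred - y) * x)  #    ...and w.r.t. a1
            a0 -= lr * d_a0                            # 4) step against the gradient
            a1 -= lr * d_a1
        return a0, a1                                  # 5) stop after a set number of iterations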

7
Q

What activation should you use for the output layer of a binary classification, and why?

A

Sigmoid, because you want to limit the output value to the range 0 to 1 so it can be read as a probability.

Note: in the sigmoid function, when x = 0, y = 0.5.

8
Q

Name other activation functions more suitable for hidden layers

A

Tanh: similar to the sigmoid function but limits the output range to -1 to 1; when x = 0, y = 0.

ReLU: the default choice when you don't know what to use, because training is faster compared to tanh and sigmoid due to the lack of vanishing gradients; a = max(0, z). Tanh and sigmoid have vanishing gradient problems at their tails.

Leaky ReLU: a = max(0.01z, z). When z is negative, instead of the slope being zero there is a small slope. The constant 0.01 can also be treated as another learnable parameter.
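A small numpy sketch of these activations (function names are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))           # output in (0, 1); 0.5 at z = 0

    def tanh(z):
        return np.tanh(z)                     # output in (-1, 1); 0 at z = 0

    def relu(z):
        return np.maximum(0, z)               # a = max(0, z)

    def leaky_relu(z, slope=0.01):
        return np.where(z > 0, z, slope * z)  # small slope instead of zero for negative z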

9
Q

What type of activation function should you use for a regression problem where the output is non-negative

A

ReLU

10
Q

Why is there a need for non-linear activation functions

A

If you use linear activation functions in every layer, the composition of linear functions is itself linear, so no matter how many hidden layers the network has, its output is the same as that of a network with a single linear layer.

11
Q

What are the two phases in a neural network?

A

During forward propagation, the input is fed into the neural network, and the network calculates the output. During backward propagation, the error between the predicted output and the actual output is calculated, and the weights and biases of each neuron are adjusted to reduce the error.

12
Q

What are the common steps for pre-processing a new dataset

A
  1. Figure out the dimensions and shapes of the problem (m_train, m_test, num_px)
  2. Reshape the dataset so that each example is a flattened vector, e.g. of shape (num_px * num_px * 3, 1)
  3. Standardise the data
13
Q

In a neural network with 1 hidden layer of 3 nodes and 4 input features, what is the shape of the weight matrix in layer 1?

A

(3, 4)

The number of rows in W is the number of neurons in that layer, and the number of columns is the number of inputs to the layer.
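A quick numpy sketch checking the shapes (the 5 example columns are illustrative):

    import numpy as np

    n_x, n_h = 4, 3                         # 4 input features, 3 hidden units
    W1 = np.random.randn(n_h, n_x) * 0.01   # shape (3, 4): (units in layer, inputs to layer)
    b1 = np.zeros((n_h, 1))                 # shape (3, 1)
    X = np.random.randn(n_x, 5)             # 5 examples stacked as columns
    print((np.dot(W1, X) + b1).shape)       # (3, 5)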

14
Q

How to build a 2 layer neural network?

A

1) Initialise parameters (see the numpy sketch after these steps):

Weight matrix 1 with shape (size of hidden layer, size of input layer) | small random values
Bias vector 1 with shape (size of hidden layer, 1) | all zeros
Weight matrix 2 with shape (size of output layer, size of hidden layer) | small random values
Bias vector 2 with shape (size of output layer, 1) | all zeros

2) Forward propagation:

Z = np.dot(W, A) + b

where A is the activation from the previous layer (or the input data), b is the bias vector, and W is the weight matrix.

3) Calculate the activation for the layer by applying the activation function to Z: A = g(Z)

4) Compute the cost function
- If it's a regression problem, the cost function is MAE, MSE or RMSE
- If it's a classification problem, the cost function is cross-entropy loss

5) Backward propagation
- Compute the derivative of the cost function with respect to AL (the final activation / probability vector)
- Use dAL to calculate the derivative of the cost function with respect to Z
- Use dZ to calculate the derivatives of the cost function with respect to W and b

6) Update the parameters
- The new W and b are obtained by subtracting the learning rate times the gradients computed in backward propagation

7) Repeat steps 2 to 6 for a set number of iterations like 1000 times or until the cost is at a satisfactory level
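A compact numpy sketch of these steps for a binary classifier with one tanh hidden layer and a sigmoid output (layer sizes, learning rate and iteration count are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def train_two_layer(X, Y, n_h=4, lr=0.1, iters=1000):
        n_x, m = X.shape                                     # X: (features, examples), Y: (1, examples)
        # 1) initialise parameters: small random weights, zero biases
        W1 = np.random.randn(n_h, n_x) * 0.01; b1 = np.zeros((n_h, 1))
        W2 = np.random.randn(1, n_h) * 0.01;   b2 = np.zeros((1, 1))
        for _ in range(iters):
            # 2)-3) forward propagation with activations
            Z1 = np.dot(W1, X) + b1;  A1 = np.tanh(Z1)
            Z2 = np.dot(W2, A1) + b2; A2 = sigmoid(Z2)
            # 4) cross-entropy cost
            cost = -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
            # 5) backward propagation
            dZ2 = A2 - Y
            dW2 = np.dot(dZ2, A1.T) / m; db2 = np.sum(dZ2, axis=1, keepdims=True) / m
            dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)
            dW1 = np.dot(dZ1, X.T) / m;  db1 = np.sum(dZ1, axis=1, keepdims=True) / m
            # 6) gradient descent update
            W1 -= lr * dW1; b1 -= lr * db1
            W2 -= lr * dW2; b2 -= lr * db2
        return (W1, b1, W2, b2), cost                        # 7) after a set number of iterations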

15
Q

How does L2 regularisation work in neural networks?

A

If lambda, the regularisation parameter, is large, then your weights W will be relatively small, because large weights are penalised in the cost function. Since z is a function of W (z = Wa + b), if W is very small then z will also take relatively small values. If z stays in a small range around zero, g(z) is roughly linear (for activations like tanh or sigmoid), so every layer behaves roughly linearly. And if every layer is linear, the whole network only computes a linear function, even if it is very deep, so it is unable to fit very complicated, highly non-linear decision boundaries; that is what stops it from overfitting.
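A minimal sketch of the L2-regularised cost, assuming the usual cross-entropy cost is already computed and `weights` is a list of the W matrices (names are illustrative):

    import numpy as np

    def l2_regularised_cost(cross_entropy_cost, weights, lambd, m):
        # add (lambda / 2m) * sum of squared weights over all layers
        l2_term = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
        return cross_entropy_cost + l2_term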

16
Q

How does dropout regularisation work in NN

A
  • Each layer has a probability (keep_prob) of keeping each node
  • Use a lower keep_prob for layers with bigger weight matrices (e.g. large hidden layers) to increase dropout, and a higher keep_prob for layers with fewer nodes
  • Intuition: the network can't rely on any one feature, so it has to spread out the weights
  • With dropout turned on, the cost function is no longer well defined; turn off dropout to check that the cost is decreasing (to ensure the model is working), then turn it back on (see the sketch below)
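A minimal sketch of inverted dropout for one layer's activations (function name and default keep_prob are illustrative):

    import numpy as np

    def dropout_forward(A, keep_prob=0.8):
        # randomly keep each unit with probability keep_prob, zero the rest
        mask = np.random.rand(*A.shape) < keep_prob
        # divide by keep_prob so the expected value of the activations is unchanged
        return (A * mask) / keep_prob, mask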
17
Q

What are the ways to speed up mini-batch gradient descent, and how do they work?

A

1) Gradient descent with momentum

The problem with mini-batch gradient descent is that it may take many steps to get to the minimum due to the noise of the batches, causing the cost to oscillate before reaching the minimum.

Momentum reduces the number of steps taken by smoothing out the movement using an exponentially weighted average of the gradients.

The smoothing constant, beta, is a hyperparameter, typically 0.9, which is roughly an average over the last 10 iterations.

2) RMSprop

  • Known as root mean square prop
  • Picture the vertical axis as b and the horizontal axis as w: the aim is to slow down learning in the vertical (b) direction and speed it up in the horizontal (w) direction
  • S_dW = beta * S_dW + (1 - beta) * dW^2 (element-wise)
  • W = W - alpha * dW / (sqrt(S_dW) + epsilon)

(A sketch of both updates follows.)
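A minimal sketch of both update rules for a single weight matrix (hyperparameter defaults are illustrative):

    import numpy as np

    def momentum_update(W, dW, v_dW, lr=0.01, beta=0.9):
        # exponentially weighted average of the gradients smooths out the oscillations
        v_dW = beta * v_dW + (1 - beta) * dW
        return W - lr * v_dW, v_dW

    def rmsprop_update(W, dW, s_dW, lr=0.01, beta=0.9, eps=1e-8):
        # divide by the root mean square of recent gradients to damp the noisy direction
        s_dW = beta * s_dW + (1 - beta) * np.square(dW)
        return W - lr * dW / (np.sqrt(s_dW) + eps), s_dW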
18
Q

Why the need for learning rate decay?

A

At the start of training you can afford to take bigger steps. But if the learning rate stays large as you near the minimum, the algorithm may wander around it and never settle because of the large, noisy steps. Learning rate decay makes the learning rate, and hence the steps, smaller towards the end of training, so the algorithm can better find the minimum.

19
Q

What is one epoch

A

One epoch means that each sample in the training dataset has had an opportunity to update the internal model parameters. So if you have 500 mini-batches of 100 samples each, one epoch is 500 iterations, i.e. the parameters have been updated 500 times and all 50,000 samples have been seen by the model.

20
Q

How is the learning rate alpha updated in LR decay

A

It is updated after each epoch:
alpha = alpha_0 / (1 + decay_rate * epoch_num)

Other methods
Exponential decay: alpha = 0.95^epoch_num * alpha_0
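A tiny sketch of both schedules (the 0.95 base matches the card; the decay_rate default is illustrative):

    def decayed_lr(alpha0, epoch_num, decay_rate=1.0):
        # inverse-time decay: the learning rate shrinks as epochs go by
        return alpha0 / (1 + decay_rate * epoch_num)

    def exp_decayed_lr(alpha0, epoch_num, base=0.95):
        # exponential decay variant
        return (base ** epoch_num) * alpha0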

21
Q

How to sample values on a log scale

A

1) Determine the upper and lower limits
2) Transform them with log10 to get a and b
3) Sample r uniformly between a and b (e.g. np.random.uniform(a, b)) and use 10^r

22
Q

How to sample values on a log scale when sampling for learning rate

A

1) Determine the upper and lower limits
2) a = log10(lower limit), b = log10(upper limit)
3) r = a random uniform value in [a, b]
4) lr = 10^r
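A minimal numpy sketch of this sampling (the limit values are illustrative):

    import numpy as np

    def sample_lr(low=1e-4, high=1e-1):
        # sample the exponent uniformly so every decade is equally likely
        a, b = np.log10(low), np.log10(high)
        r = np.random.uniform(a, b)
        return 10 ** r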

23
Q

What is batch norm

A

We know that normalising the inputs helps speed up training by transforming the optimisation problem from an elongated one into a more circular one.

Batch norm normalises the inputs to the next layer, e.g. a or z (where a = g(z) and z = Wa + b), to speed up training.

Batch norm also helps with regularisation:

Each mini-batch is scaled by the mean/variance computed on just that mini-batch. This adds some noise to the values z^[l] within that mini-batch, so, similar to dropout, it adds some noise to each hidden layer's activations, giving a slight regularisation effect.

24
Q

How to implement batchnorm

A

mu = (1/m) * sum(z)
variance = (1/m) * sum((z - mu)^2)
z_norm_i = (z_i - mu) / sqrt(variance + epsilon)

But sometimes you don't want z_norm to have mean 0 and variance 1, so:

z_tilde_i = gamma * z_norm_i + beta

where gamma and beta are trainable parameters.
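A minimal numpy sketch of the forward pass (assumes Z has shape (units, batch size) and gamma/beta have shape (units, 1)):

    import numpy as np

    def batchnorm_forward(Z, gamma, beta, eps=1e-8):
        # normalise over the mini-batch, then rescale and shift with learnable gamma/beta
        mu = np.mean(Z, axis=1, keepdims=True)
        var = np.var(Z, axis=1, keepdims=True)
        Z_norm = (Z - mu) / np.sqrt(var + eps)
        return gamma * Z_norm + beta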

25
Q

What activation function do you use in the output layer for a multi-class classification problem? Explain how it works.

A

Softmax. It exponentiates each output-layer value and divides by the sum of those exponentials, so the outputs form a probability distribution over the classes (non-negative and summing to 1); the class with the largest probability is the prediction.
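A tiny numpy sketch (the example logits are illustrative):

    import numpy as np

    def softmax(z):
        # subtract the max for numerical stability, exponentiate, then normalise
        e = np.exp(z - np.max(z))
        return e / np.sum(e)

    softmax(np.array([2.0, 1.0, 0.1]))  # -> probabilities that sum to 1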

26
Q

CNN

A

A Convolutional Neural Network, also known as CNN or ConvNet, is a class of neural networks that specializes in processing data that has a grid-like topology, such as an image. A digital image is a binary representation of visual data. It contains a series of pixels arranged in a grid-like fashion that contains pixel values to denote how bright and what color each pixel should be.

A CNN typically has three layers: a convolutional layer, a pooling layer, and a fully connected layer.

The convolution layer performs a dot product between two matrices, where one matrix is the set of learnable parameters otherwise known as a kernel, and the other matrix is the restricted portion of the receptive field.

During the forward pass, the kernel slides across the height and width of the image, producing a representation of each receptive region.

The pooling layer replaces the output of the network at certain locations with a summary statistic of the nearby outputs. This reduces the spatial size of the representation, which decreases the required amount of computation and the number of weights. Max pooling is the default choice.

The fully connected layer maps the learned representation to the final output.

27
Q

Other regularisation techniques

A

- Data augmentation: e.g. for images, flip or crop images to increase the dataset size
- Early stopping: plot cost against iterations for both the CV dataset and the training dataset, and stop training when the CV cost starts to increase

28
Q

Why normalise data

A

It is easier to converge to the minimum when using gradient descent, because features on similar scales make the cost surface less elongated.

29
Q

Minibatch or batch gradient descent for larger training sets? Why?

A

For training on a large dataset, use mini-batch gradient descent: it runs much faster than batch gradient descent.

With batch gradient descent, a single pass through the training set allows you to take only one gradient descent step. With mini-batch gradient descent, a single pass through the training set (one epoch) allows you to take as many steps as there are mini-batches, e.g. 5,000 steps for 5,000 mini-batches. Of course you usually still want to take multiple passes through the training set, with an outer loop over epochs.

30
Q

Difference between gradient descent and stochastic gradient descent

A
  • Because batch gradient descent uses the whole training set for each step, it takes relatively smooth, large steps towards the minimum. Because SGD uses each training sample as its own mini-batch, the descent is much noisier, since individual samples vary in quality; it tends not to settle exactly at the minimum but to wander around it.
31
Q

What is the con of SGD

A

You lose the speed-up from vectorisation, because each update processes only a single example.

32
Q

Mini batch size

A
  • If the training set has fewer than ~2000 examples, use batch gradient descent
  • Otherwise use a mini-batch size of 64, 128, 256 or 512 (powers of 2, due to computer memory configurations)
  • Also make sure the mini-batch fits in CPU/GPU memory
  • Try different values and see which is most efficient
33
Q

Pros of minibatch

A
  • You can make progress before processing the whole dataset
34
Q

What is RMS prop

A
  • Known as root mean square prop
  • Picture the vertical axis as b and the horizontal axis as w: the aim is to slow down learning in the vertical (b) direction and speed it up in the horizontal (w) direction
  • S_dW = beta * S_dW + (1 - beta) * dW^2 (element-wise)
  • W = W - alpha * dW / (sqrt(S_dW) + epsilon)
35
Q

What is Adam optimisation

A
  • Adaptive moment estimation
  • Combines momentum with RMSprop, using bias-corrected moving averages (see the sketch below)
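A minimal sketch of the Adam update for a single weight matrix, using the commonly quoted defaults (beta1=0.9, beta2=0.999, epsilon=1e-8):

    import numpy as np

    def adam_update(W, dW, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        # momentum-style first moment and RMSprop-style second moment
        v = beta1 * v + (1 - beta1) * dW
        s = beta2 * s + (1 - beta2) * np.square(dW)
        # bias correction for the first iterations (t starts at 1)
        v_hat = v / (1 - beta1 ** t)
        s_hat = s / (1 - beta2 ** t)
        return W - lr * v_hat / (np.sqrt(s_hat) + eps), v, s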
36
Q

Tuning DL networks

A

List of things to tune
- Number of layers
- Number of nodes
- Learning rate / LR decay rate
- Size of mini-batch
- Dropout

Use random search rather than grid search to cover more of the search area, then narrow down the search area.

Tuning importance (roughly in order):
1. Alpha / learning rate
2. Momentum term (typically ~0.9), number of hidden units, mini-batch size
3. Number of layers, LR decay

Almost never tuned: beta1, beta2 and epsilon (the Adam defaults usually work).

37
Q

Why use transfer learning

A

Transfer learning saves training time and gives better performance without needing a lot of data.

38
Q

How does transfer learning work

A

In computer vision, neural networks usually detect edges in the earlier layers, shapes in the middle layers and task-specific features in the later layers. In transfer learning, the early and middle layers are reused and only the later layers are retrained. This lets you leverage the labelled data of the task the network was initially trained on.

39
Q

Common data augmentation techniques

A

Random crop, mirroring, colour shifting (adding offsets to the RGB values), e.g. using PCA colour augmentation.

40
Q

Normal vs Depthwise convolution

A
  • For normal convolutions, you slide each filter block across the image block; at each position you multiply all the numbers in the filter block with the corresponding numbers in the image block and sum them up. Computational cost = number of filter parameters * number of filter positions * number of filters.
  • For depthwise convolution, the number of filters equals the number of image channels. Match each filter to a channel and slide each filter over its corresponding image channel, then perform the multiplications and sum them up. Computational cost = number of filter params * number of filter positions * number of filters. Then do a pointwise convolution with 1 x 1 x n_c filters, using n_c' of them (see the worked example below).
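A worked cost comparison in Python, with illustrative sizes (4x4 output, 3x3 filters, 3 input channels, 5 output filters; any realistic numbers show the same gap):

    f, n_c, n_out, n_filters = 3, 3, 4, 5

    normal = (f * f * n_c) * (n_out * n_out) * n_filters      # params x positions x filters
    depthwise = (f * f) * (n_out * n_out) * n_c               # one f x f filter per input channel
    pointwise = (1 * 1 * n_c) * (n_out * n_out) * n_filters   # n_filters filters of size 1 x 1 x n_c
    print(normal, depthwise + pointwise)                      # 2160 vs 672: separable is much cheaper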
41
Q

Advantages of Mobile nets

A
  • low computational cost at deployment
  • Useful for mobile and embedded vision applications
42
Q

Inception Network

A

Instead of choosing which convolution or pooling layer to apply, just do them all (e.g. 1x1, 3x3 and 5x5 convolutions plus max pooling), concatenate the resulting blocks together, and let the network learn which ones to use.

The drawback is the computational cost, but with 1x1 convolutions you can shrink the number of channels before applying the more expensive convolutions.

43
Q

Why does ResNet work?

A

It works because
a^[l+2] = g(z^[l+2] + a^[l])
        = g(W^[l+2] a^[l+1] + b^[l+2] + a^[l])

If you use L2 regularisation, W^[l+2] tends to shrink, and if W is 0 then the expression reduces to g(a^[l]), i.e. the identity. Hence it is easy for residual blocks to learn the identity function, so adding them does not hurt performance.

44
Q

Purpose of 1x1 convolution

A

Purpose 1: shrink the number of channels
- E.g. if you have a 28x28x192 volume and you want to shrink it to 28x28x32, you can apply 32 filters of size 1x1x192
Purpose 2: add non-linearity

45
Q

What are resnets?

A

ResNets, also known as residual networks, are built out of residual blocks that allow you to train very, very deep networks (>100 layers).

ResNet works by adding residual connections to the network, which help maintain the information flow through the network and prevent the gradients from vanishing. A residual connection is a shortcut that allows the information to bypass one or more layers and reach the output of the block directly.

In theory, the more layers the lower the error, but in practice the training error of a plain network eventually increases, meaning the network has a harder time learning. With ResNets, the training error keeps decreasing as layers are added, eventually flattening out and plateauing.
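A minimal Keras sketch of one residual block (assumes the input tensor already has `filters` channels so the shapes match for the addition):

    from tensorflow.keras import layers

    def residual_block(x, filters):
        shortcut = x
        # main path: two 3x3 same-padding convolutions
        x = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
        x = layers.Conv2D(filters, (3, 3), padding='same')(x)
        # skip connection: add the input back before the final activation
        x = layers.Add()([shortcut, x])
        return layers.Activation('relu')(x)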

46
Q

CNN why add padding

A
  • For every convolution layer the image shrinks: the new size is n - f + 1. So if you don't want your image to shrink a lot, especially when building very deep networks, you can add padding; the size after a padded convolution is n + 2p - f + 1.
  • Also, without padding a pixel in the corner contributes to far fewer outputs than a more central pixel, so you throw away a lot of information from the edges.
47
Q

How to build a TensorFlow CNN?

A

A sequential model, for example (filter counts, layer sizes and input shape below are illustrative):

from tensorflow.keras import Sequential, layers

model = Sequential([
    # conv + pooling blocks extract spatial features
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),  # (pixels, pixels, 3)
    layers.MaxPooling2D((2, 2)),            # 2x2 pooling window
    layers.Flatten(),                       # flatten the feature maps for the dense layers
    layers.Dense(128, activation='relu'),
    layers.Dense(1, activation='sigmoid'),  # n_output units / activation for the task
])

48
Q

Mini batch size

A
  • If the training set has fewer than ~2000 examples, use batch gradient descent
  • Otherwise use a mini-batch size of 64, 128, 256 or 512 (powers of 2, due to computer memory configurations)
  • Also make sure the mini-batch fits in CPU/GPU memory
  • Try different values and see which is most efficient
49
Q

Explain the architecture of an image classification with localisation problem

A

Architecture (two outputs)
- Conv net
- Softmax to output the different possible classes of object (e.g. car, pedestrian, motorcycle, background)
- Output a bounding box (bx, by, bh, bw), where (bx, by) is the centre of the bounding box

Y = vector(Pc i.e. is there an object, bx, by, bh, bw, C1 i.e. is it object C1, C2, C3)

If there is no object, the remaining components are labelled "don't care".

Loss function:
- If the actual label has an object, the loss is the sum of the squared differences between the prediction and the label over every component (Pc, bx, by, bh, bw, C1, C2, C3)
- If the actual label has no object, the loss is just the square of (Pc hat - Pc)

50
Q

Explain landmark detection

A

Landmark detection
- Landmark detection is detecting things like the eyes/nose on a face
- Label the training dataset with a certain number of coordinates (landmarks) that outline the feature you are trying to detect
- Change the CNN output layer to output whether the feature (e.g. a face) is present plus the coordinates of the landmark points, like this:
- Y = vector(face?, l1x, l1y, ... , l64x, l64y)

51
Q

Explain object detection

A

Object detection (sliding windows)
- Train a conv net to detect a car using closely cropped images
- Then, for an image, pick a window size, input that window's crop into the conv net and get a prediction. Slide the window and pass the next crop into the conv net. Do this until the whole image is covered; the stride can be customised
- Change the window size and do it again

Intersection over union (IoU)
= size of the intersection of the ground-truth box and the prediction box / size of the union of the two boxes

If IoU >= 0.5 the prediction is counted as correct.
More generally, IoU is a measure of the overlap between two bounding boxes (see the sketch below).
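A small sketch of IoU for two boxes given as (x1, y1, x2, y2) corner coordinates (an assumed representation; the card does not fix one):

    def iou(box_a, box_b):
        # intersection rectangle
        xi1, yi1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        xi2, yi2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
        # union = sum of the two areas minus the intersection
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)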

Non-max suppression is a way to make sure your algorithm detects each object only once instead of multiple times.

Steps:
- Discard all boxes with probability of an object <= 0.6
- Pick the box with the largest probability of an object as a prediction
- Discard any remaining box with IoU >= 0.5 with the box output in the previous step

Anchor boxes
- Aim to solve the problem that one grid cell can only detect one object, when the grid cell may contain two overlapping objects

Steps
- Encode y as

Y = vector(Pc i.e. is there an object, bx, by, bh, bw, C1 i.e. is it object C1, C2, C3, followed by a second set of Pc, bx, by, bh, bw, C1, C2, C3)

where the second set refers to anchor box 2.

Limitation: it struggles when both objects have a shape similar to the same anchor box.

Output shape = grid rows x grid cols x number of anchor boxes x 8 params (Pc i.e. is there an object, bx, by, bh, bw, C1 i.e. is it object C1, C2, C3)

52
Q

What architecture is used for image segmentation, e.g. medical imaging?

A
  • U-Net: downsample the image with an encoder, then blow the image back up in size in a decoder using transposed convolutions, with skip connections between the two paths

Contracting path (Encoder containing downsampling steps):
Images are first fed through several convolutional layers which reduce height and width, while growing the number of channels.
The contracting path follows a regular CNN architecture, with convolutional layers, their activations, and pooling layers to downsample the image and extract its features. In detail, it consists of the repeated application of two 3 x 3 same padding convolutions, each followed by a rectified linear unit (ReLU) and a 2 x 2 max pooling operation with stride 2 for downsampling. At each downsampling step, the number of feature channels is doubled.
Crop function: This step crops the image from the contracting path and concatenates it to the current image on the expanding path to create a skip connection.

Expanding path (Decoder containing upsampling steps):
The expanding path performs the opposite operation of the contracting path, growing the image back to its original size, while shrinking the channels gradually.
In detail, each step in the expanding path upsamples the feature map, followed by a 2 x 2 convolution (the transposed convolution). This transposed convolution halves the number of feature channels, while growing the height and width of the image.
Next is a concatenation with the correspondingly cropped feature map from the contracting path, and two 3 x 3 convolutions, each followed by a ReLU. You need to perform cropping to handle the loss of border pixels in every convolution.
Final Feature Mapping Block: In the final layer, a 1x1 convolution is used to map each 64-component feature vector to the desired number of classes. The channel dimensions from the previous layer correspond to the number of filters used, so when you use 1x1 convolutions, you can transform that dimension by choosing an appropriate number of 1x1 filters. When this idea is applied to the last layer, you can reduce the channel dimensions to have one layer per class.
The U-Net network has 23 convolutional layers in total.

53
Q

Neural style transfer

A

Cost function: J(G) = alpha * J_content(content image C, generated image G) + beta * J_style(style image S, generated image G)

Steps
1. Randomly initialise the generated image G (i.e. its pixel values are random)
2. Use gradient descent on the pixels of G to minimise the cost function

Content cost function
- Use a pre-trained conv net
- Let a^[l](C) and a^[l](G) be the activations of layer l on the content image and the generated image
- If a^[l](C) and a^[l](G) are similar, both images have similar content
- The similarity is measured by the sum of the squared element-wise differences between the two activation vectors

Style cost function
- The sum of the squared element-wise differences between the Gram (correlation) matrices of the style image and the generated image, where the Gram matrix measures the correlation between the channels of a hidden layer's activations (a numpy sketch follows)
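A minimal numpy sketch of the Gram matrix and one layer's style cost (the 1/(4 * (n_c * n_hw)^2) scaling is the commonly used normalisation, not stated on the card):

    import numpy as np

    def gram_matrix(A):
        # A: activations reshaped to (n_channels, height * width)
        return np.dot(A, A.T)

    def style_layer_cost(A_style, A_generated):
        n_c, n_hw = A_style.shape
        GS, GG = gram_matrix(A_style), gram_matrix(A_generated)
        # sum of squared element-wise differences between the two Gram matrices
        return np.sum((GS - GG) ** 2) / (4 * (n_c * n_hw) ** 2)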