ML4G - general deep learning Flashcards
What is the difference between nn.BCELoss and nn.BCEWithLogitsLoss?
Which one do we use generally?
nn.BCELoss calculates binary cross entropy on a one-hot-encoded target, but it does not apply the sigmoid that scales activations to between 0 and 1, so it expects probabilities as input.
nn.BCEWithLogitsLoss does the sigmoid and binary cross entropy in a single function. This is the one you normally want to use.
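A minimal sketch of the equivalence (the logits and targets below are made up for illustration):

import torch
import torch.nn as nn

logits = torch.randn(4, 1)                       # raw model outputs (logits)
targets = torch.randint(0, 2, (4, 1)).float()    # binary targets

# nn.BCELoss expects probabilities, so the sigmoid must be applied first.
loss_bce = nn.BCELoss()(torch.sigmoid(logits), targets)

# nn.BCEWithLogitsLoss applies the sigmoid internally (and is more numerically stable).
loss_bce_logits = nn.BCEWithLogitsLoss()(logits, targets)

print(loss_bce, loss_bce_logits)                 # the two values match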
What does to_fp16 do?
It converts numbers to half-precision (16-bit) floating-point values, which are less precise but take up less memory. This can speed up training.
What is the difference between resnet50 and resnet101 (for the afternoon)?
The number of layers. Larger models are generally able to better capture the real underlying relationships in your data, but they are also more likely to overfit to the training data.
What are discriminative learning rates?
Learning rates that are different depending on the depth of the layer. Use a lower learning rate for the early layers of the neural network and a higher learning rate for the later layers, particularly the randomly added layers.
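A minimal PyTorch sketch using optimizer parameter groups; the tiny three-part model is a made-up stand-in for a pretrained network with a newly added head:

import torch
import torch.nn as nn

early = nn.Sequential(nn.Linear(10, 20), nn.ReLU())   # stands in for early pretrained layers
late = nn.Sequential(nn.Linear(20, 20), nn.ReLU())    # stands in for later pretrained layers
head = nn.Linear(20, 2)                                # stands in for the randomly added head

optimizer = torch.optim.SGD(
    [
        {"params": early.parameters(), "lr": 1e-4},    # early layers: small steps
        {"params": late.parameters(), "lr": 1e-3},     # later layers: larger steps
        {"params": head.parameters(), "lr": 1e-2},     # new head: largest steps
    ],
    momentum=0.9,
)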
What are two properties of activations that softmax ensures? Why is this important?
It ensures that each activation is represented as a number from 0 to 1 and that the activations across all categories sum to 1. The raw activation values don’t have meaning by themselves – they represent the relative confidence of an input being in category 1 vs. category 2. What we care about is which activation is higher and by how much. We get this when all activations add up to 1: then we can think of each one as the probability of being in that category.
The second property is that if one of the numbers in our activations is slightly bigger than the others, the exponential in softmax will amplify this difference, which is useful when we really want the classifier to pick one image as the prediction.
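A quick check of both properties (the raw activations below are made up):

import torch

acts = torch.tensor([1.0, 2.0, 3.5])    # raw activations for three categories
probs = torch.softmax(acts, dim=0)

print(probs)            # every value lies between 0 and 1
print(probs.sum())      # tensor(1.) -- the values sum to 1
# The exponential amplifies the largest activation: the third category gets most of the probability mass.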
What are the two pieces that are combined into cross-entropy loss in PyTorch?
Log softmax and negative log likelihood (F.log_softmax followed by F.nll_loss; together they are equivalent to F.cross_entropy / nn.CrossEntropyLoss).
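A quick check of that equivalence (the logits and targets here are made up):

import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)              # activations for 3 classes
targets = torch.tensor([0, 2, 1, 2])

loss_combined = F.cross_entropy(logits, targets)
loss_two_step = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(loss_combined, loss_two_step)     # identical values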
What is out-of-distribution data?
It’s data seen in production that is significantly different from the data the model was trained on.
What is data augmentation?
Data augmentation refers to creating random variations of our input data so that they appear different but do not change the meaning of the data. We can use this to increase the amount of input data and to have our neural network learn different versions of pictures of the same object.
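A sketch of an image augmentation pipeline using torchvision; the particular transforms and parameters are illustrative, not a recommendation:

import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(),                      # mirror the image half the time
    T.RandomRotation(degrees=10),                  # small random rotations
    T.ColorJitter(brightness=0.2, contrast=0.2),   # small lighting changes
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),    # random crop, resized to 224x224
])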
The universal approximation theorem shows that any function can be approximated as closely as needed using just one nonlinearity. So why do we normally use more?
With more layers we don’t need as many parameters to get good results, so the model trains more quickly and takes up less memory than a single very wide layer would.
What’s the difference between F.relu and nn.ReLU?
They represent the same operation, but F.relu is a function and nn.ReLU is a PyTorch module.
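A quick comparison (same computation, two interfaces):

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-1.0, 0.5, 2.0])

y1 = F.relu(x)          # functional form: call it directly on a tensor
y2 = nn.ReLU()(x)       # module form: instantiate it, e.g. inside nn.Sequential

print(torch.equal(y1, y2))   # True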
What is an activation function?
It’s a nonlinear function placed between the linear layers of a model; without it, a stack of linear layers would collapse into a single linear layer.
What is ReLU?
Rectified Linear Unit. It’s an activation function that replaces every negative number with zero and leaves positive numbers unchanged.
What does the backward method do?
It calculates the gradients and stores them in the .grad attribute of every tensor in the computational graph that requires gradients.
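A minimal example (the tensors are made up):

import torch

w = torch.tensor([2.0, 3.0], requires_grad=True)
x = torch.tensor([1.0, 4.0])

loss = (w * x).sum()    # a scalar
loss.backward()         # computes d(loss)/dw for every tensor that requires grad

print(w.grad)           # tensor([1., 4.]) -- same shape as w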
What does the @ operator do in Python, and in PyTorch?
Used before a function definition, @ applies a decorator (a function that takes a function and returns a wrapped version of it). As a binary operator, @ means matrix multiplication, which PyTorch implements for tensors.
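A quick illustration of @ as matrix multiplication in PyTorch:

import torch

a = torch.randn(2, 3)
b = torch.randn(3, 4)

c = a @ b                                        # shape (2, 4)
print(torch.allclose(c, torch.matmul(a, b)))     # True: @ is matmul for tensors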
What does view do in PyTorch?
It changes the shape of a tensor without changing its contents; the new shape must contain the same total number of elements.
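For example:

import torch

x = torch.arange(6)        # tensor([0, 1, 2, 3, 4, 5])
y = x.view(2, 3)           # same 6 elements, now shaped (2, 3)
z = x.view(-1, 2)          # -1 lets PyTorch infer that dimension (3 here)

print(y.shape, z.shape)    # torch.Size([2, 3]) torch.Size([3, 2])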
Create a function that, if passed the two arguments [1, 2, 3, 4] and 'abcd', returns [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]. What is special about that output data structure?
def combine(a, b): return list(zip(a, b))
combine([1, 2, 3, 4], 'abcd')   # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
(This is the format of a Dataset in PyTorch: a collection that contains tuples of (X, y))
Write pseudocode showing the basic steps taken in each epoch for SGD.
for features, targets in the dataloader (one mini-batch at a time):
- preds = model(features)
- loss = loss_function(preds, targets)
- calculate the gradients for each parameter
- update the parameters by subtracting gradient * learning_rate
- reset gradients to zero for each parameter
Each epoch goes through all the minibatches.
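A minimal PyTorch version of one epoch; the model, loss function, learning rate, and dataloader here are stand-ins for illustration:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                  # illustrative model
loss_function = nn.MSELoss()
lr = 0.01

def one_epoch(dataloader):
    for features, targets in dataloader:  # one mini-batch at a time
        preds = model(features)
        loss = loss_function(preds, targets)
        loss.backward()                   # compute gradients for each parameter
        with torch.no_grad():
            for p in model.parameters():
                p -= p.grad * lr          # update parameters
                p.grad.zero_()            # reset gradients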
In PyTorch, what does the DataLoader class do?
It takes in a dataset, optionally shuffles it on every epoch, and collates the items into mini-batches. Iterating over it yields one batch at a time.
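For example (the toy dataset below is made up):

import torch
from torch.utils.data import DataLoader

# A Dataset can be any indexable collection of (X, y) tuples.
dataset = list(zip(torch.arange(10).float(), torch.arange(10).float() * 2))

loader = DataLoader(dataset, batch_size=4, shuffle=True)

for xb, yb in loader:
    print(xb.shape, yb.shape)   # each iteration yields one mini-batch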
What is special about the shape of the sigmoid function?
Why is ReLU generally preferred?
It looks like an “S”. It can take any input value, positive or negative, and output a value between 0 and 1.
ReLU is cheaper computationally and avoids the problem of the vanishing gradient.
Why can’t we use accuracy as a loss function?
The gradient of a function is its slope–how much the value of the function changes divided by how much we changed the input:
(ynew - yold)/(xnew - xold)
The problem is that a small change in weight (x) isn’t likely to cause the prediction to change, so (ynew - yold) will almost always be 0, i.e., the gradient is 0 almost everywhere. Thus, a small change in weight will often not change the accuracy at all. If the gradient is 0, the model can’t learn from that step. We need a function that can show differences from small changes in weights.
What is a gradient?
What is the shape of the gradient when we call backward on a scalar value?
A gradient is a derivative of the loss with respect to a parameter of the model.
The gradient has the same shape as the parameter tensor it is taken with respect to.
Why does SGD use mini-batches?
Instead of calculating the loss over all data items, which would take forever, SGD calculates the loss over a portion of data items at a time. This increases computational speed because it reduces the number of required calculations (derivatives).
What is SGD?
Stochastic Gradient Descent. It is an iterative algorithm that starts from a random point on a function and travels down its slope in steps until it reaches a (local) minimum; the gradient at each step is estimated from a random mini-batch, which is what makes it “stochastic”.
What is broadcasting?
What are the rules of broadcasting in PyTorch?
When doing an elementwise operation between tensors of different shapes, broadcasting expands the smaller tensor (virtually, without copying data) so the shapes match. The rules: shapes are compared from the trailing (rightmost) dimension backwards; two dimensions are compatible if they are equal or one of them is 1; missing leading dimensions are treated as size 1.
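A quick illustration of the rules:

import torch

a = torch.ones(3, 4)                  # shape (3, 4)
b = torch.arange(4.0)                 # shape (4,) is treated as (1, 4)
print((a + b).shape)                  # torch.Size([3, 4]) -- b is broadcast across the rows

col = torch.arange(3.0).view(3, 1)    # shape (3, 1)
print((a * col).shape)                # torch.Size([3, 4]) -- the size-1 dim is expanded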