ML4G - general deep learning Flashcards
What is the difference between nn.BCELoss and nn.BCEWithLogitsLoss?
Which one do we use generally?
nn.BCELoss calculates binary cross entropy on a one-hot-encoded target, but it does not apply the sigmoid that scales activations to between 0 and 1, so it expects probabilities as input.
nn.BCEWithLogitsLoss does the sigmoid and binary cross entropy in a single function. This is the one you normally want to use.
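A minimal sketch of the equivalence (the logits and targets below are made up for illustration):

import torch
import torch.nn as nn

logits = torch.randn(4, 1)                       # raw model outputs (logits)
targets = torch.randint(0, 2, (4, 1)).float()    # binary targets

# nn.BCELoss expects probabilities, so the sigmoid must be applied first.
loss_bce = nn.BCELoss()(torch.sigmoid(logits), targets)

# nn.BCEWithLogitsLoss applies the sigmoid internally (and is more numerically stable).
loss_bce_logits = nn.BCEWithLogitsLoss()(logits, targets)

print(loss_bce, loss_bce_logits)                 # the two values match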
What does to_fp16 do?
It converts numbers to half-precision (16-bit) floating-point values, which are less precise but take up less memory. This can speed up training.
What is the difference between resnet50 and resnet101 (for the afternoon)?
The number of layers. Larger models are generally able to better capture the real underlying relationships in your data, but they are also more likely to overfit to the training data.
What are discriminative learning rates?
Learning rates that are different depending on the depth of the layer. Use a lower learning rate for the early layers of the neural network and a higher learning rate for the later layers, particularly the randomly added layers.
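A minimal PyTorch sketch using optimizer parameter groups; the tiny three-part model is a made-up stand-in for a pretrained network with a newly added head:

import torch
import torch.nn as nn

early = nn.Sequential(nn.Linear(10, 20), nn.ReLU())   # stands in for early pretrained layers
late = nn.Sequential(nn.Linear(20, 20), nn.ReLU())    # stands in for later pretrained layers
head = nn.Linear(20, 2)                                # stands in for the randomly added head

optimizer = torch.optim.SGD(
    [
        {"params": early.parameters(), "lr": 1e-4},    # early layers: small steps
        {"params": late.parameters(), "lr": 1e-3},     # later layers: larger steps
        {"params": head.parameters(), "lr": 1e-2},     # new head: largest steps
    ],
    momentum=0.9,
)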
What are two properties of activations that softmax ensures? Why is this important?
It ensures that each activation is represented as a number from 0 to 1 and that the activations across all categories sum to 1. The raw activation values don’t have meaning by themselves – they represent the relative confidence of an input being in category 1 vs. category 2. What we care about is which activation is higher and by how much. We get this when all activations add up to 1: then we can think of each one as the probability of being in that category.
The second property is that if one of the numbers in our activations is slightly bigger than the others, the exponential in softmax will amplify this difference, which is useful when we really want the classifier to pick one image as the prediction.
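A quick check of both properties (the raw activations below are made up):

import torch

acts = torch.tensor([1.0, 2.0, 3.5])    # raw activations for three categories
probs = torch.softmax(acts, dim=0)

print(probs)            # every value lies between 0 and 1
print(probs.sum())      # tensor(1.) -- the values sum to 1
# The exponential amplifies the largest activation: the third category gets most of the probability mass.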
What are the two pieces that are combined into cross-entropy loss in PyTorch?
Log softmax and negative log likelihood (F.log_softmax followed by F.nll_loss; together they are equivalent to F.cross_entropy / nn.CrossEntropyLoss).
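A quick check of that equivalence (the logits and targets here are made up):

import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)              # activations for 3 classes
targets = torch.tensor([0, 2, 1, 2])

loss_combined = F.cross_entropy(logits, targets)
loss_two_step = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(loss_combined, loss_two_step)     # identical values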
What is out-of-distribution data?
It’s data seen in production that is significantly different from the data the model was trained on.
What is data augmentation?
Data augmentation refers to creating random variations of our input data so that they appear different but do not change the meaning of the data. We can use this to increase the amount of input data and to have our neural network learn different versions of pictures of the same object.
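A sketch of an image augmentation pipeline using torchvision; the particular transforms and parameters are illustrative, not a recommendation:

import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(),                      # mirror the image half the time
    T.RandomRotation(degrees=10),                  # small random rotations
    T.ColorJitter(brightness=0.2, contrast=0.2),   # small lighting changes
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),    # random crop, resized to 224x224
])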
The universal approximation theorem shows that any function can be approximated as closely as needed using just one nonlinearity. So why do we normally use more?
With more layers we don’t need as many parameters to get good results, so the model trains more quickly and takes up less memory than a single very wide layer would.
What’s the difference between F.relu and nn.ReLU?
They represent the same operation, but F.relu is a function and nn.ReLU is a PyTorch module.
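A quick comparison (same computation, two interfaces):

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-1.0, 0.5, 2.0])

y1 = F.relu(x)          # functional form: call it directly on a tensor
y2 = nn.ReLU()(x)       # module form: instantiate it, e.g. inside nn.Sequential

print(torch.equal(y1, y2))   # True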
What is an activation function?
It’s a nonlinear function placed between the linear layers of a model; without it, a stack of linear layers would collapse into a single linear layer.
What is ReLU?
Rectified Linear Unit. It’s an activation function that replaces every negative number with zero and leaves positive numbers unchanged.
What does the backward method do?
It calculates the gradients and stores them in the .grad attribute of every tensor in the computational graph that requires gradients.
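A minimal example (the tensors are made up):

import torch

w = torch.tensor([2.0, 3.0], requires_grad=True)
x = torch.tensor([1.0, 4.0])

loss = (w * x).sum()    # a scalar
loss.backward()         # computes d(loss)/dw for every tensor that requires grad

print(w.grad)           # tensor([1., 4.]) -- same shape as w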
What does the @ operator do in Python, and in PyTorch?
Used before a function definition, @ applies a decorator (a function that takes a function and returns a wrapped version of it). As a binary operator, @ means matrix multiplication, which PyTorch implements for tensors.
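A quick illustration of @ as matrix multiplication in PyTorch:

import torch

a = torch.randn(2, 3)
b = torch.randn(3, 4)

c = a @ b                                        # shape (2, 4)
print(torch.allclose(c, torch.matmul(a, b)))     # True: @ is matmul for tensors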
What does view do in PyTorch?
It changes the shape of a tensor without changing its contents; the new shape must contain the same total number of elements.
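For example:

import torch

x = torch.arange(6)        # tensor([0, 1, 2, 3, 4, 5])
y = x.view(2, 3)           # same 6 elements, now shaped (2, 3)
z = x.view(-1, 2)          # -1 lets PyTorch infer that dimension (3 here)

print(y.shape, z.shape)    # torch.Size([2, 3]) torch.Size([3, 2])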
Create a function that, if passed the two arguments [1, 2, 3, 4] and 'abcd', returns [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]. What is special about that output data structure?
def combine(a, b): return list(zip(a, b))
combine([1, 2, 3, 4], 'abcd')   # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
(This is the format of a Dataset in PyTorch: a collection that contains tuples of (X, y))
Write pseudocode showing the basic steps taken in each epoch for SGD.
for features, targets in the dataloader (one mini-batch at a time):
- preds = model(features)
- loss = loss_function(preds, targets)
- calculate the gradients for each parameter
- update the parameters by subtracting gradient * learning_rate
- reset gradients to zero for each parameter
Each epoch goes through all the minibatches.
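A minimal PyTorch version of one epoch; the model, loss function, learning rate, and dataloader here are stand-ins for illustration:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                  # illustrative model
loss_function = nn.MSELoss()
lr = 0.01

def one_epoch(dataloader):
    for features, targets in dataloader:  # one mini-batch at a time
        preds = model(features)
        loss = loss_function(preds, targets)
        loss.backward()                   # compute gradients for each parameter
        with torch.no_grad():
            for p in model.parameters():
                p -= p.grad * lr          # update parameters
                p.grad.zero_()            # reset gradients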
In PyTorch, what does the DataLoader class do?
It takes in a dataset, optionally shuffles it on every epoch, and collates the items into mini-batches. Iterating over it yields one batch at a time.
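For example (the toy dataset below is made up):

import torch
from torch.utils.data import DataLoader

# A Dataset can be any indexable collection of (X, y) tuples.
dataset = list(zip(torch.arange(10).float(), torch.arange(10).float() * 2))

loader = DataLoader(dataset, batch_size=4, shuffle=True)

for xb, yb in loader:
    print(xb.shape, yb.shape)   # each iteration yields one mini-batch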
What is special about the shape of the sigmoid function?
Why is ReLU generally preferred?
It looks like an “S”. It can take any input value, positive or negative, and output a value between 0 and 1.
ReLU is cheaper computationally and avoids the problem of the vanishing gradient.
Why can’t we use accuracy as a loss function?
The gradient of a function is its slope–how much the value of the function changes divided by how much we changed the input:
(ynew - yold)/(xnew - xold)
The problem is that a small change in weight (x) isn’t likely to cause the prediction to change, so (ynew - yold) will almost always be 0, i.e., the gradient is 0 almost everywhere. Thus, a small change in weight will often not change the accuracy at all. If the gradient is 0, the model can’t learn from that step. We need a function that can show differences from small changes in weights.
What is a gradient?
What is the shape of the gradient when we call backward on a scalar value?
A gradient is a derivative of the loss with respect to a parameter of the model.
The gradient has the same shape as the parameter tensor it is taken with respect to.
Why does SGD use mini-batches?
Instead of calculating the loss over all data items, which would take forever, SGD calculates the loss over a portion of data items at a time. This increases computational speed because it reduces the number of required calculations (derivatives).
What is SGD?
Stochastic Gradient Descent. It is an iterative algorithm that starts from a random point on a function and travels down its slope in steps until it reaches a (local) minimum; the gradient at each step is estimated from a random mini-batch, which is what makes it “stochastic”.
What is broadcasting?
What are the rules of broadcasting in PyTorch?
When doing an elementwise operation between tensors of different shapes, broadcasting expands the smaller tensor (virtually, without copying data) so the shapes match. The rules: shapes are compared from the trailing (rightmost) dimension backwards; two dimensions are compatible if they are equal or one of them is 1; missing leading dimensions are treated as size 1.
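A quick illustration of the rules:

import torch

a = torch.ones(3, 4)                  # shape (3, 4)
b = torch.arange(4.0)                 # shape (4,) is treated as (1, 4)
print((a + b).shape)                  # torch.Size([3, 4]) -- b is broadcast across the rows

col = torch.arange(3.0).view(3, 1)    # shape (3, 1)
print((a * col).shape)                # torch.Size([3, 4]) -- the size-1 dim is expanded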