wk3 Flashcards
(10 cards)
What is the equation for gradient descent in linear regression
w^(t+1) = w^(t) - η * ∇C(w^(t)), where η is the learning rate and ∇C is the gradient of the cost function
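A minimal sketch of one update in Python, assuming a squared-error cost C(w) = (1/n) * ||Xw - y||^2 (the card does not specify the cost; names are illustrative):

import numpy as np

def gradient_descent_step(w, X, y, lr):
    # full-batch gradient of C(w) = (1/n) * ||Xw - y||^2
    n = len(y)
    grad = (2.0 / n) * X.T @ (X @ w - y)
    return w - lr * grad      # w^(t+1) = w^(t) - lr * grad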
What is stochastic gradient descent
-initialise all weights
-for each epoch or time step:
- select a random sample i from the set of points with equal probability, compute the gradient of the loss at that point with respect to the weights, and update the weights (see the sketch below)
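A minimal sketch of this loop, again assuming a squared-error loss per point (variable names are illustrative):

import numpy as np

def sgd(X, y, lr=0.01, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                  # initialise all weights
    for _ in range(epochs):
        for _ in range(n):
            i = rng.integers(n)      # random sample i, chosen with equal probability
            grad_i = 2.0 * X[i] * (X[i] @ w - y[i])   # gradient at that single point
            w = w - lr * grad_i      # update the weights
    return w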
What are advantages and disadvantages of SGD vs GD
Advantages:
-faster execution: each update costs O(1) rather than O(n), since it uses a single point instead of the whole dataset
Disadvantages:
-it is not guaranteed to converge or find the optimal solution on any single run
-however, the updates are correct on average, so with enough iterations it gets close to the optimum
What is commonly the best learning rate
learning rate η_t = C / √t, for some constant C; the learning rate decreases as the algorithm begins to converge
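In code this schedule is just the following (C is an arbitrary constant chosen by the user; 0.1 here is purely illustrative):

def learning_rate(t, C=0.1):
    # eta_t = C / sqrt(t), for t = 1, 2, 3, ...
    return C / (t ** 0.5)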
What is minibatch SGD
rather than selecting individual points during gradient descent, we randomly select a batch of points and average their gradients (see the sketch below)
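A sketch of one minibatch update under the same squared-error assumption as above:

import numpy as np

def minibatch_step(w, X, y, lr, batch_size, rng):
    idx = rng.choice(len(y), size=batch_size, replace=False)  # random batch of points
    Xb, yb = X[idx], y[idx]
    grad = (2.0 / batch_size) * Xb.T @ (Xb @ w - yb)          # gradient averaged over the batch
    return w - lr * grad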
What are common choices of batch size
32, 64, 128
What are the two approaches to sampling
sampling with replacement and sampling without replacement
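A quick illustration of the two options with numpy (10 points, drawing 5; purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
rng.choice(10, size=5, replace=True)    # with replacement: a point can be drawn more than once
rng.choice(10, size=5, replace=False)   # without replacement: each point drawn at most once per pass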
Describe intuition behind margin-based loss
if y * w^T x is positive the prediction is correct; if it is negative the prediction is incorrect:
-w^T x is the raw prediction; we take its sign to get the predicted label (negative or positive). If the prediction has the same sign as the true label, it is correct; otherwise it is not.
-multiplying values of opposite sign gives a negative result, and multiplying values of the same sign gives a positive result, so y * w^T x > 0 exactly when the prediction is correct (see the sketch below)
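A small sketch of that sign check (labels assumed to be -1 or +1):

import numpy as np

def margin(w, x, y):
    return y * (w @ x)            # positive when sign(w^T x) matches y

def is_correct(w, x, y):
    return margin(w, x, y) > 0    # same signs multiply to a positive value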
What is loss function for margin based loss
L(y predicted, y actual) = g(y predicted * y actual), where g is a decreasing function of that product (the margin)
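Two standard choices of decreasing g, not stated on the card but commonly used, written as functions of the margin m = y * w^T x:

import numpy as np

def hinge(m):
    return np.maximum(0.0, 1.0 - m)     # hinge loss (used in SVMs)

def logistic(m):
    return np.log(1.0 + np.exp(-m))     # logistic loss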
Why do we want g to be a decreasing function
-we want to minimise the loss L; since g is a decreasing function of the margin, minimising L is equivalent to maximising the margin
-maximising the margin pushes predictions further onto the correct side of the boundary, which means better model performance