Mathematical foundations of deep learning Flashcards
Lecture 2 (21 cards)
What are some ways to find a minimum of a function?
- If we could calculate f(x) for every x, we could just try every possible value and take the smallest, but this is impossible because x ranges over infinitely many values
- Compute the derivative of the function and find its zeroes (then determine whether each is a minimum, maximum or a saddle point). Complicated functions are a problem because we may not be able to solve for the zeroes analytically
- Iterative, gradient-based optimization (covered in the cards below)
Linear function: what is its expression, how do we get the angle? What are the slope and intercept?
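The answer on this card is an image; as a hedged sketch, the standard form is:

```latex
f(x) = a x + b
```

where the slope a = tan(α) (α being the angle the line makes with the x-axis) and the intercept b = f(0) is where the line crosses the y-axis.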
How do we know if, at a point c, a function is rising or going down?
By calculating the derivative at that point. The formula in the image gives us the slope of the tangent at the point c; using that, we can determine whether the function is going up or down (and it helps us find the minimum/maximum)
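The formula referenced in the image is presumably the standard difference quotient; for reference:

```latex
f'(c) = \lim_{h \to 0} \frac{f(c + h) - f(c)}{h}
```

If f'(c) > 0 the function is going up at c, if f'(c) < 0 it is going down, and f'(c) = 0 marks a candidate minimum/maximum (or saddle point).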
What is the notation for a derivative?
We want a function D which, given a differentiable function f as input, produces another function g such that g(c) is the derivative of f at every point c.
Usually written as d/dx
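Common equivalent ways of writing this (standard notation, not taken from the card's image):

```latex
g = D(f) = f' = \frac{df}{dx} = \frac{d}{dx} f(x), \qquad g(c) = f'(c)
```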
What is the chain rule of derivatives?
We have two functions f, g that are differentiable. Then the derivative of g(f(x)) is g'(f(x)) * f'(x). The three expressions in the image are different notations of the same rule
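A small worked example (my own, not from the card), with g(u) = sin(u) and f(x) = x²:

```latex
\frac{d}{dx}\, g(f(x)) = \frac{d}{dx}\, \sin(x^2) = \underbrace{\cos(x^2)}_{g'(f(x))} \cdot \underbrace{2x}_{f'(x)}
```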
Solve this
What is the Rosenbrock function and why is it special?
It is a multivariate function f(x, y) that is very hard to minimize because it has a flat valley. When we are in the valley, the gradients (derivatives) are very small and the iterations barely move toward the minimum at all.
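A minimal sketch in Python, assuming the common parameter choice a = 1, b = 100 (the lecture may use different constants):

```python
def rosenbrock(x, y, a=1.0, b=100.0):
    """Rosenbrock 'banana' function; its global minimum is at (a, a**2), where f = 0."""
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

print(rosenbrock(1.0, 1.0))  # 0.0 -- the minimum
print(rosenbrock(0.0, 0.0))  # 1.0 -- a point inside the flat valley y = x**2
```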
Explain the gradient-based optimization.
It is an algorithm for finding the minimum of a function. We start at a random point and calculate the derivative (gradient) at that point, which gives us the strength and the direction of change there. We then move to a new point slightly in the downhill (negative gradient) direction and continue until the change is smaller than some threshold or we reach some number of iterations.
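A minimal sketch of the procedure for a 1-D function (my own toy example, not from the lecture), assuming we already have the derivative f':

```python
def gradient_descent(f_prime, x0, eta=0.1, tol=1e-6, max_iters=1000):
    """Minimize a 1-D function, given its derivative f_prime, starting from x0."""
    x = x0
    for _ in range(max_iters):
        step = eta * f_prime(x)   # strength and direction of the change at x
        x = x - step              # move downhill, i.e. against the derivative
        if abs(step) < tol:       # stop once the change is below the threshold
            break
    return x

# Example: f(x) = (x - 3)^2 has derivative 2 * (x - 3) and its minimum at x = 3.
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))  # ~3.0
```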
What are the prerequisites for using gradient-based optimization?
- We can calculate y = f(x) for any x
- We can calculate its derivative for any point x
How to calculate derivatives of multivariate functions?
By computing partial derivatives, i.e. the derivative with respect to each single variable while keeping the other variables constant. Collecting these values gives us a vector (the gradient).
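A small worked example (my own): for f(x₁, x₂) = x₁² x₂ + x₂³,

```latex
\frac{\partial f}{\partial x_1} = 2 x_1 x_2 \quad (x_2 \text{ treated as a constant}), \qquad
\frac{\partial f}{\partial x_2} = x_1^2 + 3 x_2^2 \quad (x_1 \text{ treated as a constant})
```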
Solve:
What is the gradient of a (multivariate) function?
When computing the derivative of a multivariate function using partial derivatives, we end up with multiple values (one for each variable of the function). Combining them into a vector gives us the gradient
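In symbols (standard notation), and continuing the example above:

```latex
\nabla f(x_1, \dots, x_n) = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right), \qquad
\nabla f(x_1, x_2) = \left( 2 x_1 x_2,\; x_1^2 + 3 x_2^2 \right)
```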
What is eta in gradient-based optimization?
It is the learning rate: basically, how big a jump we take when moving in the direction of the derivative (gradient). A big eta can make the algorithm bounce back and forth (end up in a loop); a small one makes it very slow.
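A quick sketch of this effect on the toy function f(x) = (x - 3)² (my own example, not from the lecture):

```python
def minimize(eta, x0=0.0, iters=20):
    """Run a few gradient steps on f(x) = (x - 3)^2 with learning rate eta."""
    x = x0
    for _ in range(iters):
        x = x - eta * 2 * (x - 3)   # f'(x) = 2 * (x - 3)
    return x

print(minimize(eta=0.1))    # ~2.97 -- steady progress toward the minimum at 3
print(minimize(eta=1.0))    # 0.0   -- jumps back and forth between 0 and 6, never converges
print(minimize(eta=0.001))  # ~0.12 -- far too slow: barely moved after 20 steps
```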
What changes when doing gradient-based optimization for multivariate functions?
The random position we start at is now a vector instead of a single number, and we compute the gradient at that point instead of the derivative (basically the same thing: the gradient is a vector of derivatives)
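A minimal multivariate sketch (my own code, using NumPy and the hand-derived gradient of the Rosenbrock function from the earlier card):

```python
import numpy as np

def rosenbrock_grad(p, a=1.0, b=100.0):
    """Gradient of (a - x)^2 + b * (y - x^2)^2 at the point p = (x, y)."""
    x, y = p
    return np.array([-2 * (a - x) - 4 * b * x * (y - x ** 2),
                     2 * b * (y - x ** 2)])

p = np.array([-1.0, 1.0])             # the starting position is now a vector
eta = 1e-3
for _ in range(50_000):
    p = p - eta * rosenbrock_grad(p)  # step against the gradient vector
print(p)                              # slowly approaches the minimum at (1, 1)
```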
How does the chain rule work for multivariate functions?
Suppose we have f3 = h(f1(x), f2(x)); then the derivative (written out below) is:
A sum of two chain rules: the derivative of f3 with respect to f1 multiplied by the derivative of f1 with respect to x (function - variable, function - variable), plus the same thing for f2.
When f1 and f2 are themselves multivariate, the result is a gradient
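Reconstructed from the description above (the card's own formula is an image):

```latex
\frac{d f_3}{d x}
  = \frac{\partial h}{\partial f_1} \cdot \frac{d f_1}{d x}
  + \frac{\partial h}{\partial f_2} \cdot \frac{d f_2}{d x}
```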
Compute gradients normally
What is a computational graph? Create one
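A tiny hypothetical example (not the one from the card's image): the function f(x, y) = (x + y) * y decomposed into primitive operations, each of which becomes a node of the graph:

```python
# Leaf nodes (the inputs)
x, y = 2.0, 3.0
# Intermediate node: one primitive operation per node
a = x + y        # a depends on x and y
# Output node
f = a * y        # f depends on a and y
print(f)         # 15.0
```

The edges run from each node to the nodes that consume its output: x -> a, y -> a, a -> f, y -> f.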
What are forward and backward passes of the computation graph when calculating gradients?
Forward is basically building the graph and caching the 'local' gradients (the derivative of each node with respect to each of its arguments/children). The backward pass is used to compute the final gradient: we go from node to node backwards and, to get the derivative of the global output (e) with respect to the current node's output, we sum over all higher-level functions (parents). See the sketch after the bullets.
- Forward computation: compute every node's output (and cache it)
- Backward computation (backprop): compute the overall function's partial derivative with respect to each node
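A minimal sketch of both passes for the toy graph f = (x + y) * y used above (my own example; the lecture's graph, whose output is called e, may differ):

```python
# Forward pass: compute every node's output and cache it,
# together with each node's 'local' derivatives w.r.t. its direct inputs.
x, y = 2.0, 3.0
a = x + y                      # a = 5.0;  local: da/dx = 1, da/dy = 1
f = a * y                      # f = 15.0; local: df/da = y, df/dy = a
da_dx, da_dy = 1.0, 1.0
df_da, df_dy_local = y, a

# Backward pass: walk from the output back to the leaves; for each node,
# sum the contributions coming from all of its parents.
df_df = 1.0                                    # output w.r.t. itself
grad_a = df_df * df_da                         # a's only parent is f
grad_x = grad_a * da_dx                        # x feeds only into a       -> 3.0
grad_y = grad_a * da_dy + df_df * df_dy_local  # y feeds into both a and f -> 8.0
print(grad_x, grad_y)                          # 3.0 8.0 (matches y and x + 2y)
```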
Compute the forward and backward pass of this computational graph:
What is backpropagation?
It is computing the gradient of a function via a forward and a backward pass through its computational graph (or doing the same thing recursively).
- Forward computation: compute every node's output (and cache it)
- Backward computation (backprop): compute the overall function's partial derivative with respect to each node
What are the steps in ML?
- Collect the data D (input/output pairs): x_i as the input and y_i as the ground truth
- Build a model to solve the task using the data D; it gives us some y_hat as an approximation of the ground truth. The model has some tunable parameters
- Compute the loss function (the distance between the prediction and the ground truth). Based on the loss, we update the parameters (backpropagation); see the sketch after this list
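A minimal end-to-end sketch of these three steps for a linear model, using NumPy and hand-written gradients (my own toy example with hypothetical variable names, not the lecture's code):

```python
import numpy as np

# 1. Collect the data D: inputs x_i with ground-truth outputs y_i.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=100)  # true relation y = 2x + 0.5, plus noise

# 2. Build a model with tunable parameters w, b: y_hat = w * x + b.
w, b = 0.0, 0.0

# 3. Compute the loss (mean squared distance between y_hat and y) and
#    repeatedly update the parameters by stepping against its gradient.
eta = 0.1
for _ in range(500):
    y_hat = w * x + b
    grad_w = np.mean(2 * (y_hat - y) * x)  # d loss / d w
    grad_b = np.mean(2 * (y_hat - y))      # d loss / d b
    w -= eta * grad_w
    b -= eta * grad_b

print(w, b)  # close to the true parameters 2.0 and 0.5
```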