Mathematical foundations of deep learning Flashcards

Lecture 2 (21 cards)

1
Q

What are some ways to find a minimum of a function?

A
  • Brute force: calculate f(x) for every possible x and take the smallest value. This is impossible in practice because x ranges over infinitely many values.
  • Analytically: compute the derivative of the function and find its zeroes (then determine whether each one is a minimum, maximum or saddle point). This becomes a problem for complicated functions where we cannot solve for the zeroes.
  • Iteratively: gradient-based optimization, i.e. start somewhere and repeatedly step in the direction that decreases the function (covered in the cards below).
2
Q

Linear function: what is its expression, how do we get the angle, and what are the slope and intercept?

A
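
The card's answer isn't included; assuming the standard convention, it is along the lines of:

    f(x) = a * x + b

where a is the slope, which gives the angle of the line via a = tan(angle) (so angle = arctan(a)), and b is the intercept, the value f(0) where the line crosses the y-axis.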
3
Q

How do we know if, at a point c, a function is rising or falling?

A

By calculating the derivative at that point. The formula in the image gives us the slope of the tangent at the point c, and from its sign we can determine whether the function is going up or down there (this also helps us find the minimum/maximum).
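
The image isn't included here; presumably it shows the standard difference-quotient definition of the derivative:

    f'(c) = lim_{h -> 0} ( f(c + h) - f(c) ) / h

If f'(c) > 0 the function is rising at c, if f'(c) < 0 it is falling, and f'(c) = 0 marks a candidate minimum, maximum or saddle point.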

4
Q

What is the notation of a derivative?

A

We want an operator D which, when given a differentiable function f as input, produces another function g that is the derivative of f at every point c.

Usually expressed as d/dx
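
In symbols (my paraphrase of the usual convention, not copied from the card):

    D(f) = g,  with g(c) = f'(c) for every point c

Equivalent notations for the same thing: f'(x), df/dx, (d/dx) f(x).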

5
Q

What is the chain rule of derivatives?

A

We have two functions f and g that have derivatives. Then the derivative of g(f(x)) is g'(f(x)) * f'(x). The three expressions in the image are different notations of the same rule.
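
Written out (the three notations from the image are not reproduced here):

    d/dx g(f(x)) = g'(f(x)) * f'(x)

or, in Leibniz notation, dz/dx = (dz/dy) * (dy/dx) with y = f(x) and z = g(y).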

6
Q

Solve this

A
7
Q

What is the Rosenbrock function and why is it special?

A

It is a multivariate function f(x, y) that is very hard to minimize because it has a long, flat valley. When we are in the valley, the gradients (derivatives) are very small and the iterations barely move us toward the minimum.
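
Its standard form (not shown on the card) is:

    f(x, y) = (a - x)^2 + b * (y - x^2)^2,  usually with a = 1 and b = 100

whose minimum lies at (x, y) = (a, a^2), inside the long, curved, flat valley.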

8
Q

Explain gradient-based optimization.

A

It is an algorithm for finding the minimum of a function. We start at a random point and calculate the derivative (gradient) at that point, which gives us the strength and direction of change there. We take a new point that lies slightly in the downhill (negative gradient) direction and repeat, until the change is smaller than some threshold or we reach a maximum number of iterations.
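
A minimal sketch of this loop for a one-variable function (the example function, learning rate eta and stopping thresholds below are illustrative choices, not from the lecture):

    def gradient_descent(f_prime, x0, eta=0.1, tol=1e-6, max_iters=1000):
        """Minimize a one-variable function, given its derivative f_prime."""
        x = x0
        for _ in range(max_iters):
            step = eta * f_prime(x)   # strength and direction of change at x
            x = x - step              # move slightly downhill
            if abs(step) < tol:       # change smaller than the threshold
                break
        return x

    # Example: minimize f(x) = (x - 3)^2, whose derivative is 2*(x - 3).
    print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))   # prints roughly 3.0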

9
Q

What are the prerequisites for using gradient-based optimization?

A
  • We can calculate y = f(x) for any x
  • We can calculate its derivative for any point x
10
Q

How to calculate derivatives of multivariate functions?

A

By computing partial derivatives, i.e. derivatives with respect to one variable at a time; collecting them gives us a vector (the gradient). When we compute the derivative with respect to x1, we keep the other variables constant.
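
A small worked example (my own, not from the card): for f(x1, x2) = x1^2 * x2,

    ∂f/∂x1 = 2 * x1 * x2    (x2 treated as a constant)
    ∂f/∂x2 = x1^2           (x1 treated as a constant)
    ∇f(x1, x2) = (2 * x1 * x2, x1^2)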

11
Q

Solve:

A
12
Q

What is the gradient of a (multivariate) function?

A

When computing the derivative of a multivariate function using partial derivatives, we end up with multiple values (one for each variable of the function). Combining them into a vector gives the gradient.
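
In symbols:

    ∇f(x1, …, xn) = ( ∂f/∂x1, …, ∂f/∂xn )

which points in the direction of steepest increase of f.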

13
Q

What is eta in gradient-based optimization?

A

It is the learning rate: how big a jump we take when moving in the direction of the derivative (gradient). A large eta can make the algorithm overshoot and bounce back and forth (end up in a loop); a small one makes it very slow.
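
The update rule in which eta appears, written out:

    x_{t+1} = x_t - eta * f'(x_t)    (or - eta * ∇f(x_t) in the multivariate case)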

14
Q

What changes when doing gradient-based optimization for multivariate functions?

A

The random position we start at is now a vector rather than a single number, and at each step we compute the gradient at that point instead of the derivative (essentially the same thing; the gradient is a vector of partial derivatives).
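
A sketch of the same loop for a multivariate function, using the Rosenbrock function from card 7 with a = 1, b = 100 (the starting vector, learning rate and iteration count are illustrative choices, not from the lecture):

    import numpy as np

    def rosenbrock_grad(p):
        """Gradient of f(x, y) = (1 - x)^2 + 100 * (y - x^2)^2."""
        x, y = p
        dx = -2 * (1 - x) - 400 * x * (y - x ** 2)
        dy = 200 * (y - x ** 2)
        return np.array([dx, dy])

    p = np.array([0.0, 0.0])        # starting vector (two coordinates, not one number)
    eta = 1e-3
    for _ in range(50_000):
        step = eta * rosenbrock_grad(p)
        p = p - step                # move against the gradient
        if np.linalg.norm(step) < 1e-8:
            break
    print(p)                        # slowly creeps toward the minimum at (1, 1)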

15
Q

How does the chain rule work for multivariate functions?

A

Suppose we have f3 = h(f1(x), f2(x)). Its derivative is:

A sum of two chain rules: the derivative of f3 with respect to f1 multiplied by the derivative of f1 with respect to x (always pairing function with variable), plus the same term for f2.

When f1 and f2 are themselves multivariate, the result is a gradient.
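
Written out (reconstructing the formula the card describes in words):

    df3/dx = (∂h/∂f1) * (df1/dx) + (∂h/∂f2) * (df2/dx)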

16
Q

Compute gradients normally

17
Q

What is a computational graph? Create one.
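
The card's answer isn't included; a minimal example of my own: to represent e = (a + b) * (b + 1), introduce the intermediate nodes c = a + b and d = b + 1, and then e = c * d. The graph has input (leaf) nodes a and b, internal nodes c and d, and output node e, with an edge from each argument to the node that uses it. (The same graph is used in the code sketch for card 18 below.)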

18
Q

What are forward and backward passes of the computation graph when calculating gradients?

A

Forward is basically building the graph and caching the 'local' gradients (the derivative of each node with respect to each of its arguments/children). The backward pass computes the final gradient: we walk backwards from node to node, and to compute the derivative of the global output (e) with respect to the current node's output, we sum over all higher-level functions (parents). A small code sketch follows below.

  • Forward computation: Compute all nodes’ output (and
    cache it)
  • Backward computation (Backprop): Compute the overall
    function’s partial derivative with respect to each node
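
A minimal hand-rolled sketch of the two passes on a tiny example graph (the function e = (a + b) * (b + 1) and the variable names are my own, not from the lecture):

    # Forward pass: compute every node's output (and cache the local derivatives).
    a, b = 2.0, 3.0
    c = a + b        # dc/da = 1, dc/db = 1
    d = b + 1.0      # dd/db = 1
    e = c * d        # de/dc = d, de/dd = c

    # Backward pass: walk backwards from the output, summing over each node's parents.
    de_dc = d
    de_dd = c
    de_da = de_dc * 1.0                 # a has a single parent, c
    de_db = de_dc * 1.0 + de_dd * 1.0   # b feeds both c and d, so sum the two paths
    print(de_da, de_db)                 # 4.0 9.0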
19
Q

Compute the forward and backward passes of this computational graph:

20
Q

What is backpropagation?

A

It is computing the gradient of a function by performing a forward and a backward pass through its computational graph (or doing it recursively).

  • Forward computation: Compute all nodes’ output (and
    cache it)
  • Backward computation (Backprop): Compute the overall
    function’s partial derivative with respect to each node
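
Not from the lecture, but automatic-differentiation libraries implement exactly this; for example, PyTorch's autograd records the graph during the forward pass and backpropagates when .backward() is called:

    import torch

    x = torch.tensor(2.0, requires_grad=True)
    y = torch.tensor(3.0, requires_grad=True)
    e = (x + y) * (y + 1.0)   # forward pass: the graph is recorded here
    e.backward()              # backward pass (backpropagation)
    print(x.grad, y.grad)     # tensor(4.) tensor(9.)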
21
Q

What are the steps in ML?

A
  1. Collect the data D (input/output pairs): Xi as input and Yi as the ground truth.
  2. Build a model for the task, using D: given Xi it produces some y_hat as an approximation of the ground truth. The model has tunable parameters.
  3. Compute the loss function (the distance between the prediction and the ground truth). Based on the loss, we update the parameters (backpropagation); see the sketch below.
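
A compact sketch of the three steps for a toy model, a linear fit y_hat = w * x + b trained with gradient descent (the data, model and learning rate are illustrative assumptions, not from the lecture):

    import numpy as np

    # 1. Collect the data D: inputs Xi with ground-truth outputs Yi (here y = 2x + 1).
    X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    Y = 2.0 * X + 1.0

    # 2. Build a model with tunable parameters w, b: y_hat = w * x + b.
    w, b = 0.0, 0.0

    # 3. Compute the loss (mean squared distance between prediction and ground truth)
    #    and update the parameters with its gradient.
    eta = 0.01
    for _ in range(5000):
        y_hat = w * X + b
        loss = np.mean((y_hat - Y) ** 2)
        grad_w = np.mean(2 * (y_hat - Y) * X)   # d loss / d w
        grad_b = np.mean(2 * (y_hat - Y))       # d loss / d b
        w -= eta * grad_w
        b -= eta * grad_b

    print(w, b)   # approaches 2.0 and 1.0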