Perceptron Optimisation and Gradient Descent Flashcards

1
Q

Explain local optimisation

A

If you graph the error against all values of a single weight, you can use the gradient at the current weight to decide which direction to move the weight in order to reduce the error.

2
Q

Explain the zero order gradient descent method for optimisation

A

This method takes steps towards the goal without using the magnitude of the gradient.
Mathematically, you only know whether the slope goes up or down, not how steep it is, so you simply take a fixed-size step in the downhill direction.
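
A minimal sketch of this idea in Python, using a hypothetical one-dimensional error function f(x) = (x - 3)^2: each step probes a fixed distance in both directions and keeps whichever evaluation is lower, using only function values and no gradient magnitude.

```python
def zero_order_step(x, f, step=0.1):
    """Take a fixed-size step in whichever direction lowers f, if any."""
    candidates = [x - step, x + step]
    best = min(candidates, key=f)       # direction is chosen by evaluating f
    return best if f(best) < f(x) else x

f = lambda x: (x - 3.0) ** 2            # hypothetical error curve, minimum at 3
x = 0.0
for _ in range(100):
    x = zero_order_step(x, f)
```

Because the step size is fixed, many small steps are needed and the method stalls once neither probe improves on the current point.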

3
Q

Explain first order gradient descent method for optimisation

A

You know the gradient of the slope, so you take the step with the greatest descent towards your goal.

4
Q

How useful is zero order gradient descent? Why?

A

> Not effective

> Slow

5
Q

What is the equation for first order gradient descent?

A

x_(t+1) = x_t - η∇f(x_t)
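
A minimal Python sketch of this update, using a hypothetical example f(x) = x^2, whose gradient is ∇f(x) = 2x:

```python
def gradient_step(x, grad_f, eta):
    """One first-order step: x_(t+1) = x_t - eta * grad_f(x_t)."""
    return x - eta * grad_f(x)

def grad_f(x):
    return 2.0 * x          # gradient of f(x) = x^2

x = 5.0
for _ in range(50):
    x = gradient_step(x, grad_f, eta=0.1)
# x has moved close to the minimum at x = 0
```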

6
Q

Explain the usefulness and benefits of second order gradient descent

A

> Extremely computationally intensive when you have a lot of variables
The benefit is that it can take a single (ideal) step in the right direction.
It is often easier to just take lots of small first-order steps.

7
Q

When do you stop with gradient descent?

A

When the change in the gradient is less than a threshold
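
One way to sketch this stopping rule in Python (assumed example f(x) = x^2): keep descending until the size of the update falls below a threshold.

```python
def descend(x, grad_f, eta=0.1, tol=1e-6, max_iters=10_000):
    """Run first-order descent until the step shrinks below tol."""
    for _ in range(max_iters):
        step = eta * grad_f(x)
        if abs(step) < tol:   # change below threshold: stop
            break
        x -= step
    return x

x_min = descend(10.0, lambda x: 2.0 * x)   # gradient of f(x) = x^2
```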

8
Q

What is the learning rate?

A

This is the step size: the factor η that scales the gradient in each update.

9
Q

What is the effect of the size of the learning rate?

A

If the step is too large you can travel too far and overshoot the minimum error.
This can cause divergence.
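
A small Python illustration using the hypothetical f(x) = x^2, where the update simplifies to x ← x(1 - 2η): a modest learning rate shrinks |x| every step, while an oversized one overshoots further each step and diverges.

```python
def run(eta, steps=20, x=1.0):
    # repeated update x <- x - eta * f'(x) with f(x) = x^2, f'(x) = 2x
    for _ in range(steps):
        x = x - eta * 2.0 * x
    return abs(x)

small = run(eta=0.1)   # |x| shrinks by factor 0.8 per step: converges
large = run(eta=1.1)   # |x| grows by factor 1.2 per step: diverges
```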

10
Q

What is the effect of local minima?

A

The error surface can have multiple minima, each with a different error value. Gradient descent does not take this into account, so it will converge on a local minimum, which may not be the global (smallest) minimum.

11
Q

What can we do to prevent local minima?

A

We can restart learning with random initial weights; each restart may converge on a different minimum.
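
A sketch of random restarts in Python, using a hypothetical double-well function with two minima of different depths: each restart descends from a random initial point, and the lowest minimum found is kept.

```python
import random

def f(x):
    return (x * x - 4.0) ** 2 + x        # two wells; the deeper one is near x = -2

def grad(x):
    return 4.0 * x * (x * x - 4.0) + 1.0

def descend(x, eta=0.01, steps=2000):
    for _ in range(steps):
        x -= eta * grad(x)
    return x

random.seed(0)                            # hypothetical seed for reproducibility
starts = [random.uniform(-4, 4) for _ in range(10)]
finishes = [descend(x0) for x0 in starts]
best = min(finishes, key=f)               # keep the lowest minimum found
```

Restarts that begin left of the hump converge near x = -2, the rest near x = +2; taking the minimum over all finishes recovers the better of the two.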

12
Q

What is the issue with calculating gradient using E(X) = ∑ | y_n - t_n | ?

A

The equation E(X) = ∑ | y_n - t_n | is piecewise constant. So even if we move the boundary closer to a correct classification, the error does not change unless a point actually crosses it, and the improvement is invisible. The function is not differentiable, so it provides no useful gradient.

13
Q

How do we measure error where the value is continuous even if the number of misclassified points does not change?

A

By measuring how close each point is to the decision boundary

14
Q

What is the equation for calculating the distance of a misclassified point to the decision boundary?

A
> x = x_p + d × w/||w||
> d = w^T (x - x_p) / ||w||
> d = distance to the hyperplane
> w = vector perpendicular to the hyperplane
> x_p = a point on the hyperplane
> x = the misclassified point
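
A small sketch with NumPy (assumed): rearranging x = x_p + d·w/||w|| gives d = w^T(x - x_p)/||w||, and since w^T x_p + w_0 = 0 on the boundary, this equals (w^T x + w_0)/||w||.

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed distance from point x to the hyperplane w^T x + w0 = 0."""
    return (w @ x + w0) / np.linalg.norm(w)

w = np.array([3.0, 4.0])             # ||w|| = 5
x = np.array([1.0, 1.0])             # hypothetical misclassified point
d = signed_distance(x, w, w0=-2.0)   # (3 + 4 - 2) / 5 = 1.0
```
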
15
Q

What is the new equation to calculate the total distance of all misclassified points?

A

E(X) = ∑ (w^T x_n + w_0), summed over the misclassified points x_n

16
Q

What is the new function for the error that is differentiable?

A

E(X) = ∑ w^T x_n (y_n - t_n)

17
Q

What is the equation of the gradient of the new error function?

A

∇E(X) = ∑ x_n (y_n - t_n)

18
Q

What is the gradient descent function with the new measurement of error?

A

w_(k+1) = w_k - η ∑ x_n (y_n - t_n)
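
A sketch of this batch update in Python with NumPy (a hypothetical toy dataset; the bias is folded in as a constant-1 feature, and y_n is taken as the perceptron's thresholded output):

```python
import numpy as np

def batch_update(w, X, t, eta):
    """One full-batch step: w <- w - eta * sum_n x_n (y_n - t_n)."""
    y = (X @ w >= 0).astype(float)   # perceptron outputs y_n for all points
    return w - eta * X.T @ (y - t)   # gradient summed over the whole dataset

# Hypothetical 1-D data with a bias column of ones appended:
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])
w = np.zeros(2)
for _ in range(20):
    w = batch_update(w, X, t, eta=0.1)
```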

19
Q

What is the issue of using this equation ( w_(k+1) = w_k - η ∑ x_n (y_n - t_n) ) for gradient descent?

A

It requires a lot of computation because the gradient must be summed over every data point at each step.

20
Q

How do you reduce the amount of computation for gradient descent?

A

Using stochastic gradient descent

21
Q

What is stochastic gradient descent?

A

Instead of minimising the total error, we randomly select points and minimise the average error of those points

22
Q

What is the equation for calculating the average error?

A

E(X) = (1/N) ∑ w^T x_n (y_n - t_n)

23
Q

What is the equation for stochastic gradient descent?

A

w_(k+1) = w_k - ηx(y - t)
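
A sketch of this update in Python with NumPy (hypothetical toy data; each step samples one point at random instead of summing over the whole dataset):

```python
import random
import numpy as np

def sgd_update(w, x, t, eta):
    """w_(k+1) = w_k - eta * x * (y - t), using a single point x."""
    y = float(x @ w >= 0)          # output for this one point only
    return w - eta * x * (y - t)

random.seed(1)                      # hypothetical seed for reproducibility
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])
w = np.zeros(2)
for _ in range(200):
    n = random.randrange(len(X))   # one randomly selected point per step
    w = sgd_update(w, X[n], t[n], eta=0.1)
```

Each step is cheap (one point rather than all N), at the cost of noisier updates: the error can rise on individual steps while trending downward overall.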

24
Q

What is the benefit of stochastic gradient descent? What is the trade off? What will happen overall?

A

Benefit: It uses much less computational power

Trade off: Sometimes the error can increase

Overall: The error will trend downward