Perceptron Optimisation and Gradient Descent Flashcards

1
Q

Explain local optimisation

A

If you graph the error against all values of a single weight, you can use the gradient at the current weight to decide which direction to move the weight in order to reduce the error.

2
Q

Explain the zero order gradient descent method for optimisation

A

This method takes steps towards the goal without using the magnitude of the gradient.
Mathematically, you only know whether the slope goes up or down, not how steep it is, so you simply take a fixed-size step in the downhill direction.
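
A minimal sketch of this idea in Python, using a hypothetical one-dimensional error function f(x) = (x - 3)^2: each step probes a fixed distance in both directions and keeps whichever evaluation is lower, using only function values and no gradient magnitude.

```python
def zero_order_step(x, f, step=0.1):
    """Take a fixed-size step in whichever direction lowers f, if any."""
    candidates = [x - step, x + step]
    best = min(candidates, key=f)       # direction is chosen by evaluating f
    return best if f(best) < f(x) else x

f = lambda x: (x - 3.0) ** 2            # hypothetical error curve, minimum at 3
x = 0.0
for _ in range(100):
    x = zero_order_step(x, f)
```

Because the step size is fixed, many small steps are needed and the method stalls once neither probe improves on the current point.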

3
Q

Explain first order gradient descent method for optimisation

A

You know the gradient of the slope, so you take the step with the greatest descent towards your goal.

4
Q

How useful is zero order gradient descent? Why?

A

> Not effective

> Slow

5
Q

What is the equation for first order gradient descent?

A

x_(t+1) = x_t - η∇f(x_t)
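
A minimal Python sketch of this update, using a hypothetical example f(x) = x^2, whose gradient is ∇f(x) = 2x:

```python
def gradient_step(x, grad_f, eta):
    """One first-order step: x_(t+1) = x_t - eta * grad_f(x_t)."""
    return x - eta * grad_f(x)

def grad_f(x):
    return 2.0 * x          # gradient of f(x) = x^2

x = 5.0
for _ in range(50):
    x = gradient_step(x, grad_f, eta=0.1)
# x has moved close to the minimum at x = 0
```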

6
Q

Explain the usefulness and benefits of second order gradient descent

A

> Extremely computationally intensive when you have a lot of variables
The benefit is that it can take a single (ideal) step in the right direction.
It is often easier to just take lots of small first-order steps.

7
Q

When do you stop with gradient descent?

A

When the change in the gradient is less than a threshold
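
One way to sketch this stopping rule in Python (assumed example f(x) = x^2): keep descending until the size of the update falls below a threshold.

```python
def descend(x, grad_f, eta=0.1, tol=1e-6, max_iters=10_000):
    """Run first-order descent until the step shrinks below tol."""
    for _ in range(max_iters):
        step = eta * grad_f(x)
        if abs(step) < tol:   # change below threshold: stop
            break
        x -= step
    return x

x_min = descend(10.0, lambda x: 2.0 * x)   # gradient of f(x) = x^2
```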

8
Q

What is the learning rate?

A

This is the step size: the factor η that scales the gradient in each update.

9
Q

What is the effect of the size of the learning rate?

A

If the step is too large you can travel too far and overshoot the minimum error.
This can cause divergence.
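
A small Python illustration using the hypothetical f(x) = x^2, where the update simplifies to x ← x(1 - 2η): a modest learning rate shrinks |x| every step, while an oversized one overshoots further each step and diverges.

```python
def run(eta, steps=20, x=1.0):
    # repeated update x <- x - eta * f'(x) with f(x) = x^2, f'(x) = 2x
    for _ in range(steps):
        x = x - eta * 2.0 * x
    return abs(x)

small = run(eta=0.1)   # |x| shrinks by factor 0.8 per step: converges
large = run(eta=1.1)   # |x| grows by factor 1.2 per step: diverges
```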

10
Q

What is the effect of local minima?

A

The error surface can have multiple minima, each with a different error value. Gradient descent does not take this into account, so it will converge on a local minimum, which may not be the global (smallest) minimum.

11
Q

What can we do to prevent local minima?

A

We can restart learning with random initial weights; each restart may converge on a different minimum.
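
A sketch of random restarts in Python, using a hypothetical double-well function with two minima of different depths: each restart descends from a random initial point, and the lowest minimum found is kept.

```python
import random

def f(x):
    return (x * x - 4.0) ** 2 + x        # two wells; the deeper one is near x = -2

def grad(x):
    return 4.0 * x * (x * x - 4.0) + 1.0

def descend(x, eta=0.01, steps=2000):
    for _ in range(steps):
        x -= eta * grad(x)
    return x

random.seed(0)                            # hypothetical seed for reproducibility
starts = [random.uniform(-4, 4) for _ in range(10)]
finishes = [descend(x0) for x0 in starts]
best = min(finishes, key=f)               # keep the lowest minimum found
```

Restarts that begin left of the hump converge near x = -2, the rest near x = +2; taking the minimum over all finishes recovers the better of the two.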

12
Q

What is the issue with calculating gradient using E(X) = ∑ | y_n - t_n | ?

A

The equation E(X) = ∑ | y_n - t_n | is piecewise constant. So even if we move the boundary closer to a correct classification, the error does not change unless a point actually crosses it, and the improvement is invisible. The function is not differentiable, so it provides no useful gradient.

13
Q

How do we measure error where the value is continuous even if the number of misclassified points does not change?

A

By measuring how close each point is to the decision boundary

14
Q

What is the equation for calculating the distance of a misclassified point to the decision boundary?

A
> x = x_p + d × w/||w||
> d = w^T (x - x_p) / ||w||
> d = distance to the hyperplane
> w = vector perpendicular to the hyperplane
> x_p = a point on the hyperplane
> x = the misclassified point
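
A small sketch with NumPy (assumed): rearranging x = x_p + d·w/||w|| gives d = w^T(x - x_p)/||w||, and since w^T x_p + w_0 = 0 on the boundary, this equals (w^T x + w_0)/||w||.

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed distance from point x to the hyperplane w^T x + w0 = 0."""
    return (w @ x + w0) / np.linalg.norm(w)

w = np.array([3.0, 4.0])             # ||w|| = 5
x = np.array([1.0, 1.0])             # hypothetical misclassified point
d = signed_distance(x, w, w0=-2.0)   # (3 + 4 - 2) / 5 = 1.0
```
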
15
Q

What is the new equation to calculate the total distance of all misclassified points?

A

E(X) = ∑ (w^T x_n + w_0), summed over the misclassified points x_n

16
Q

What is the new function for the error that is differentiable?

A

E(X) = ∑ w^T x_n (y_n - t_n)

17
Q

What is the equation of the gradient of the new error function?

A

∇E(X) = ∑ x_n (y_n - t_n)

18
Q

What is the gradient descent function with the new measurement of error?

A

w_(k+1) = w_k - η ∑ x_n (y_n - t_n)
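
A sketch of this batch update in Python with NumPy (a hypothetical toy dataset; the bias is folded in as a constant-1 feature, and y_n is taken as the perceptron's thresholded output):

```python
import numpy as np

def batch_update(w, X, t, eta):
    """One full-batch step: w <- w - eta * sum_n x_n (y_n - t_n)."""
    y = (X @ w >= 0).astype(float)   # perceptron outputs y_n for all points
    return w - eta * X.T @ (y - t)   # gradient summed over the whole dataset

# Hypothetical 1-D data with a bias column of ones appended:
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])
w = np.zeros(2)
for _ in range(20):
    w = batch_update(w, X, t, eta=0.1)
```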

19
Q

What is the issue of using this equation ( w_(k+1) = w_k - η ∑ x_n (y_n - t_n) ) for gradient descent?

A

It requires a lot of computation because the gradient must be summed over every data point at each step.

20
Q

How do you reduce the amount of computation for gradient descent?

A

Using stochastic gradient descent

21
Q

What is stochastic gradient descent?

A

Instead of minimising the total error, we randomly select points and minimise the average error of those points

22
Q

What is the equation for calculating the average error?

A

E(X) = (1/N) ∑ w^T x_n (y_n - t_n)

23
Q

What is the equation for stochastic gradient descent?

A

w_(k+1) = w_k - ηx(y - t)
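
A sketch of this update in Python with NumPy (hypothetical toy data; each step samples one point at random instead of summing over the whole dataset):

```python
import random
import numpy as np

def sgd_update(w, x, t, eta):
    """w_(k+1) = w_k - eta * x * (y - t), using a single point x."""
    y = float(x @ w >= 0)          # output for this one point only
    return w - eta * x * (y - t)

random.seed(1)                      # hypothetical seed for reproducibility
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])
w = np.zeros(2)
for _ in range(200):
    n = random.randrange(len(X))   # one randomly selected point per step
    w = sgd_update(w, X[n], t[n], eta=0.1)
```

Each step is cheap (one point rather than all N), at the cost of noisier updates: the error can rise on individual steps while trending downward overall.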

24
Q

What is the benefit of stochastic gradient descent? What is the trade off? What will happen overall?

A

Benefit: It uses much less computational power

Trade off: Sometimes the error can increase

Overall: The error will trend downward