Lecture 2 Flashcards
What are feed-forward networks?
A “feed-forward network” in machine learning is a type of artificial neural network in which data flows in only one direction, from the input layer through any hidden layers to the output layer, without any information looping back to earlier layers.
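A minimal sketch of this one-way flow, assuming a single hidden layer with a tanh activation; all names, shapes, and values below are illustrative, not from the lecture:

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """One forward pass: input -> hidden layer -> output, with no feedback loops."""
    h = np.tanh(W1 @ x + b1)   # hidden-layer activations
    z = W2 @ h + b2            # linear output (as used for regression)
    return z

# Toy example: 3 inputs, 4 hidden units, 2 outputs (hypothetical sizes)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
print(forward(rng.normal(size=3), W1, b1, W2, b2))
```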
What can we say about neural networks with one hidden layer?
If there are enough hidden units, they can approximate any function arbitrarily accurately.
What is the proof for saying a neural network with one hidden layer is a universal approximation?
A hand-waving argument rather than a mathematical proof (illustrated with regression).
We can approximate a curve with lots of step functions.
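One way to write the idea down (notation mine, not the lecture's): on a grid x_0 < x_1 < … < x_M the curve can be built up as a staircase of step functions, and a sigmoid hidden unit with a large weight approximates each step:

```latex
f(x) \;\approx\; f(x_0) \;+\; \sum_{j=1}^{M}\bigl[f(x_j) - f(x_{j-1})\bigr]\, H(x - x_j),
\qquad H(u) \approx \sigma(w u)\ \text{for large } w,
```

where H is the Heaviside step function and σ is the logistic sigmoid; more steps give a finer approximation.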
Assuming we have the data, what do we need to set for a NN model?
- Network architecture (connections etc.)
- Weights and biases
What error function do we use for regression?
MSE
[See flashcards]
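A generic statement of the mean squared error (MSE) for regression (the exact flashcard formula is not reproduced here, and conventions differ on the 1/2 or 1/N prefactor): for N training points with inputs x_n, targets t_n and network outputs z(x_n; w),

```latex
E(\mathbf{w}) \;=\; \frac{1}{2}\sum_{n=1}^{N}\bigl(z(\mathbf{x}_n;\mathbf{w}) - t_n\bigr)^{2}.
```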
How do we get the best set of weights for regression?
Minimising the error function (MSE) with respect to the weights.
Why does the error function for linear regression have an analytical solution?
The MSE is quadratic in the weights.
We can differentiate it and set the derivative to zero; for linear regression this gives linear equations that can be solved analytically for the weights (for a neural network, the same approach gives no closed-form solution).
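A sketch of the closed-form solution for the linear case (notation mine), assuming a linear model z = w^T x with the inputs stacked row-wise in a matrix X and the targets in a vector t:

```latex
\nabla_{\mathbf{w}} E \;=\; \mathbf{X}^{\top}(\mathbf{X}\mathbf{w} - \mathbf{t}) \;=\; \mathbf{0}
\quad\Rightarrow\quad
\mathbf{w} \;=\; (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{t}.
```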
What is the relationship between the outputs and inputs in a neural network?
In a neural network, the outputs zi are highly nonlinear functions of the inputs xi.
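For example, for a network with one hidden layer (notation mine), the outputs are a nested nonlinear function of the inputs:

```latex
z_k(\mathbf{x};\mathbf{w}) \;=\; \sum_j w^{(2)}_{kj}\, h\!\Bigl(\sum_i w^{(1)}_{ji} x_i + b^{(1)}_j\Bigr) + b^{(2)}_k,
```

where h is a nonlinear activation (e.g. tanh); because the weights sit inside h, the error is no longer quadratic in the weights.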
How do the error functions for linear regression and neural networks compare?
The linear regression error is a simple quadratic function with a single minimum.
The error function of a neural network has many complex minima, reached by different sets of weights. There is no analytical solution for the best weights in a neural network.
As there is no analytical formula for the best weights in a neural network, how do we find them?
Numerically.
How do we find the weights?
Begin with a “guess” for the weights, and change them in steps, decreasing the error each time.
When do we know we have reached a minima of the error function?
When the weight updates no longer decrease the error (the weights stop changing between steps).
What is the formula for going from step W(T) to W(T+1)?
[See flashcard]
With each step, we want to make a small change to the weights.
When do we have the biggest decrease in error?
When the change in w points in the same direction as the negative of the derivative of the error with respect to the weights: for a fixed step size, the dot product of the step with -dE/dw is then as large as possible, giving the biggest decrease in error.
Change in w proportional to -dE(w)/dw
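Written out, the standard gradient-descent update (a generic statement of the rule referred to above, writing w^(τ) for W(T) and v for the learning rate):

```latex
\mathbf{w}^{(\tau+1)} \;=\; \mathbf{w}^{(\tau)} \;-\; v\,\frac{\partial E}{\partial \mathbf{w}}\bigg|_{\mathbf{w}^{(\tau)}}.
```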
What is the parameter v?
The learning rate.
This controls how quickly we decrease the error.
We have to tune the learning rate to control the optimisation.
Describe how v affects the model.
If v is too large, we may overshoot the minimum.
If v is too small, progress is very slow and we might never get there.
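A tiny numerical illustration (everything here is made up for demonstration): gradient descent on E(w) = w² starting from w = 5. A small v crawls towards the minimum at w = 0, a moderate v converges, and a v that is too large overshoots and diverges.

```python
def gradient_descent(v, w=5.0, steps=20):
    """Minimise E(w) = w**2, whose derivative is dE/dw = 2*w."""
    for _ in range(steps):
        w = w - v * 2 * w   # step along the negative gradient
    return w

for v in (0.01, 0.3, 1.1):
    print(f"v = {v}: w after 20 steps = {gradient_descent(v):.4g}")
```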
If we are close to a local minimum, what happens?
We move towards this local minimum, rather than the global minimum.
How do we find the global minimum?
1 - train several networks from different random initial weights. We will find multiple minima which we can compare.
2 - use a minimisation method that may let us escape local minima (e.g. by “hopping over” them).
3 - use intuition to make an informed decision.
Why may it be useful not to use the entire data set to change the weights (batch learning)?
If there is redundancy in the data (repetitions), we are doing more work than we need to.
How could we update the weights instead of batch learning?
We could randomly choose a training point i.
Each step now minimises the error for one training point, rather than the total error.
We choose points at random (a stochastic method): the randomness introduces fluctuations that can move us away from local minima.
The error for the chosen point decreases at each step, but the total error may sometimes increase, which is what allows us to escape local minima.
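In symbols (my notation): if the total error decomposes over training points as E(w) = Σ_n E_n(w), a sequential/stochastic step uses only the randomly chosen point n:

```latex
\mathbf{w}^{(\tau+1)} \;=\; \mathbf{w}^{(\tau)} \;-\; v\,\frac{\partial E_n}{\partial \mathbf{w}}\bigg|_{\mathbf{w}^{(\tau)}}.
```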
Rather than batch learning or sequential learning, what do we use in practice?
Explain this.
Mini-batch learning - this is somewhere between batch and sequential learning, and has the benefits of both (sequential learning: the stochasticity/randomness of the training points; batch learning: using more information per update).
Each time the weights are updated, choose a subset of the training set (a mini-batch) at random. We then repeat this with different mini-batches.
This has the benefit that we only need to calculate the derivatives of the error with respect to the weights, and never need to solve for w directly. The derivatives are costly to calculate naively; we use back-propagation to avoid this cost.
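A minimal mini-batch sketch in code (the toy data, the linear model, and the grad_E helper are illustrative assumptions, not the lecture's notation):

```python
import numpy as np

def minibatch_descent(w, X, t, grad_E, v=0.1, batch_size=32, n_updates=1000, seed=0):
    """Mini-batch gradient descent: each weight update uses a random subset of the data."""
    rng = np.random.default_rng(seed)
    for _ in range(n_updates):
        idx = rng.choice(len(X), size=batch_size, replace=False)  # random mini-batch
        w = w - v * grad_E(w, X[idx], t[idx])                     # step down the mini-batch gradient
    return w

# Illustrative use with a linear model and its MSE gradient on toy data
def grad_mse_linear(w, Xb, tb):
    return Xb.T @ (Xb @ w - tb) / len(Xb)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=500)
print(minibatch_descent(np.zeros(3), X, t, grad_mse_linear))  # should approach [1, -2, 0.5]
```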
Why can we never be sure we have reached a global minimum?
In theory there can be an infinite number of minima, so we cannot check them all.
What is back propagation?
Backpropagation is a powerful algorithm in deep learning, primarily used to train artificial neural networks, particularly feed-forward networks. It efficiently computes the derivatives of the cost function with respect to the weights and biases; these derivatives are then used iteratively to adjust the weights and biases and minimise the cost function.
“Cost function” - measures the difference between the network’s predictions and the actual target values. A measure of how well the network is performing.
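A compact sketch of back-propagation for a one-hidden-layer regression network with a tanh hidden layer and squared-error cost (the architecture, shapes, and names are illustrative assumptions, not the lecture's exact setup):

```python
import numpy as np

def forward_backward(x, t, W1, b1, W2, b2):
    """Forward pass, then back-propagate the error to get all weight/bias gradients."""
    # Forward pass
    a = W1 @ x + b1                 # hidden pre-activations
    h = np.tanh(a)                  # hidden activations
    z = W2 @ h + b2                 # network outputs
    cost = 0.5 * np.sum((z - t) ** 2)

    # Backward pass: apply the chain rule from the output layer back towards the inputs
    delta_out = z - t                               # dCost/dz
    dW2, db2 = np.outer(delta_out, h), delta_out
    delta_hid = (W2.T @ delta_out) * (1 - h ** 2)   # tanh'(a) = 1 - tanh(a)**2
    dW1, db1 = np.outer(delta_hid, x), delta_hid
    return cost, (dW1, db1, dW2, db2)

# Toy call with hypothetical sizes: 3 inputs, 4 hidden units, 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
cost, grads = forward_backward(rng.normal(size=3), np.array([1.0, -1.0]), W1, b1, W2, b2)
```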
What does inputting xi into the network and getting the output zi involve?
The forward propagation of information through the network.
This is how we make predictions with a neural network.