Lecture 5 Flashcards
What is supervised learning in machine learning?
Given a set of inputs and outputs, we build a model to predict the output for a new input.
What are we interested in with unsupervised learning?
We provide only inputs, and aim to find patterns in the data.
There is not a known correct answer.
An unsupervised algorithm also outputs a model. Typical tasks include:
- Clustering - grouping together data points that are similar to each other
- Detecting anomalies - finding data points that do not appear to be from the same probability distribution, e.g. fraud detection
- Association - finding relationships between variables
What is a key use of unsupervised learning?
To infer probability distributions
What do we assume about the data in unsupervised learning?
What do we do for parametric methods?
We assume the data was drawn from a probability density p(x).
For parametric methods, we choose a functional form for the distribution, e.g. a Gaussian: $p(x) \propto \exp\left(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right)$.
We then want to find the parameters, e.g. by maximum likelihood: finding the parameter values that are most likely to have produced the data we observe.
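A minimal sketch of a parametric fit (the data and numbers here are illustrative, not from the lecture): for a Gaussian, the maximum-likelihood parameters have a closed form, namely the sample mean and the sample covariance.

```python
import numpy as np

# A minimal sketch, assuming the data were drawn from a multivariate
# Gaussian. The maximum-likelihood estimates are the sample mean and the
# (biased, i.e. divide-by-N) sample covariance.
rng = np.random.default_rng(0)
data = rng.multivariate_normal(mean=[1.0, -2.0],
                               cov=[[2.0, 0.5], [0.5, 1.0]], size=500)

mu_hat = data.mean(axis=0)                          # ML estimate of the mean
sigma_hat = np.cov(data, rowvar=False, bias=True)   # ML estimate of the covariance
print(mu_hat)
print(sigma_hat)
```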
What are the disadvantages of parametric methods?
Not very flexible.
Requires us to make some assumptions about the data.
What parametric method might you choose if you don’t know much about the data?
A Gaussian (normal) distribution, motivated by the central limit theorem.
- The central limit theorem tells us that sums of many independent effects tend to be normally distributed, so a Gaussian is a reasonable default for large data sets.
- It is still a very limited functional form.
- There is only a small number of parameters (the mean and covariance), which reduces flexibility.
In comparison to parametric methods, what does unsupervised learning give us access to?
Unsupervised learning gives us access to non-parametric methods that do not assume a fixed functional form for the distribution.
NB: 'non-parametric' may be misleading, as there are still parameters.
Describe an example of a non-parametric method that unsupervised learning gives us access to.
K-nearest neighbours (KNN).
We have N points and a parameter K set in advance. To find p(x) we sit at a point x and draw (hyper)spheres around this point until we find K points.
If $V$ is the volume of the final hypersphere, then $p(x) \approx K/(NV)$.
Rather than imposing a rigid functional form, the estimated probability distribution is driven by how much data you have: it is highest where the data points are most dense.
This is a data-driven estimate of the probability distribution.
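A minimal sketch of this estimator (the helper name and test data are illustrative): find the distance to the K-th nearest neighbour, compute the hypersphere volume, and apply p(x) ≈ K/(NV).

```python
import numpy as np
from math import gamma, pi

def knn_density(x, data, k):
    """Estimate p(x) as K/(NV), where V is the volume of the smallest
    hypersphere centred on x that contains the K nearest data points."""
    n, d = data.shape
    # Distances from x to all points; the K-th smallest is the sphere radius.
    r = np.sort(np.linalg.norm(data - x, axis=1))[k - 1]
    v = pi ** (d / 2) / gamma(d / 2 + 1) * r ** d   # volume of a d-dim hypersphere
    return k / (n * v)

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))
print(knn_density(np.zeros(2), data, k=10))   # densest near the mode, ~1/(2*pi)
```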
What functions do supervised methods often rely on?
Basis functions - we represent data by taking functions of the data.
We pick a set of functions to represent the data, e.g. 1, x and x^2 for the linear regression model a + bx + cx^2.
The data we put into the neural network might be a function, not a set of numbers, e.g. a histogram p(x,y). What is the problem with this?
There is a lot of redundant information. If we used the values of a histogram on a grid, we would have lots of information we don’t need (many of the numbers are zero).
We want to design better features to go into the model - i.e. a compact set of numbers that represents the histogram.
What could we use as features instead of using the values of histogram on a grid?
The projections of our histogram onto some basis functions, rather than using every value on the grid.
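A minimal sketch of the idea (the choice of Legendre polynomials as the basis is illustrative, not from the lecture): summarise a 1-D histogram by its projections onto the first few basis functions.

```python
import numpy as np

x = np.linspace(-1, 1, 200)               # bin centres of the histogram
hist = np.exp(-0.5 * (x / 0.3) ** 2)      # example histogram values
dx = x[1] - x[0]
hist /= hist.sum() * dx                   # normalise to unit area

coeffs = []
for k in range(4):
    phi = np.polynomial.legendre.Legendre.basis(k)(x)   # k-th basis function
    coeffs.append((hist * phi).sum() * dx)              # projection onto phi_k

print(coeffs)   # 4 numbers summarise 200 grid values
```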
How do we pick the set of basis functions?
- Sometimes we can choose them by physical intuition
- But often we don’t know which best represent the data
A note on K-means clustering, with a predetermined K: we randomly assign each data point to one of K sets, calculate the mean of each set, and then re-assign each data point to the set whose mean it is closest to. We repeat until the assignments no longer change.
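A minimal sketch of that loop in NumPy (the test data are illustrative, and for simplicity it assumes no set ever becomes empty):

```python
import numpy as np

def kmeans(data, k, rng):
    """Randomly assign points to K sets, then alternate computing set means
    and re-assigning each point to its nearest mean until nothing changes."""
    assign = rng.integers(k, size=len(data))
    while True:
        means = np.array([data[assign == j].mean(axis=0) for j in range(k)])
        dists = np.linalg.norm(data[:, None, :] - means[None, :, :], axis=2)
        new = dists.argmin(axis=1)            # nearest mean for every point
        if np.array_equal(new, assign):
            return assign, means
        assign = new

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centres = kmeans(data, k=2, rng=rng)
print(centres)    # should sit near (0, 0) and (5, 5)
```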
What are two examples of how deep learning can be used in an unsupervised way?
- Autoencoders
- Generative adversarial networks (GANs)
Briefly, what do autoencoders do?
These can encode data by projecting it onto a lower-dimensional representation. This projection is learned by training a network whose outputs are as close as possible to its inputs.
The resulting code is a non-linear combination of the data in a lower-dimensional space: information is projected through non-linearities down to fewer dimensions.
Briefly, what are GANs?
Generative adversarial networks - these are used to train generative networks that can produce data points from an appropriate distribution. Two networks compete with each other, one to generate plausible examples and one to detect false examples. (They compete in a zero-sum game.)
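A minimal sketch of the adversarial loop in PyTorch (toy 1-D data; the architectures and hyperparameters are illustrative, not from the lecture): the discriminator D is trained to output 1 on real samples and 0 on generated ones, while the generator G is trained to make D output 1 on its samples.

```python
import torch
import torch.nn as nn

# Toy 1-D example. G maps 8-D noise to a sample; D outputs the probability
# that its input came from the real data distribution.
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(1000):
    real = 0.5 * torch.randn(64, 1) + 2.0          # "real" data: N(2, 0.5^2)
    fake = G(torch.randn(64, 8))                   # generated samples

    # Discriminator step: label real as 1 and fake as 0.
    d_loss = (bce(D(real), torch.ones(64, 1))
              + bce(D(fake.detach()), torch.zeros(64, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to make D call the fakes real.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```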
What is the difference between GANs and CNNs?
GANs are generative models that can generate new examples from a given training set, while convolutional neural networks (CNN) are primarily used for classification and recognition tasks.
What is the need for dimensionality reduction?
Often, we use more features to represent our data points than we need to (redundancies and dependencies of features). Dimensionality reduction aims to eliminate this redundancy, going from a high dimensional feature space to a lower-dimensional representation.
Briefly, discuss autoencoders and dimensionality reduction.
Autoencoders use deep learning to reduce the dimensionality of a problem very efficiently, and in a non-linear way, which makes them powerful.
What is the network architecture of an auto encoder?
The network architecture is deep, but becomes narrow in the middle. Effectively it is like a deep neural network with a bottle neck region, where are fewer nodes than inputs (and outputs).
The central layer is the code - this is the bottleneck region, here the number of features is smaller than the input.
There are two steps in forward propagation: encoding and decoding.
What are the two steps in forward propagation in an autoencoder?
- Encoding
- Decoding
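A minimal sketch of such a bottleneck network in PyTorch (the layer sizes are illustrative): the encoder compresses 64 input features down to a 4-number code, the decoder maps the code back to the input space, and training minimises the mismatch between output and input.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """A deep network that narrows to a bottleneck (the code) and widens
    back out, so the output has the same shape as the input."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 4),              # the code / bottleneck layer
        )
        self.decoder = nn.Sequential(
            nn.Linear(4, 32), nn.ReLU(),
            nn.Linear(32, 64),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(128, 64)                   # stand-in data

for step in range(200):
    recon = model(x)
    loss = nn.functional.mse_loss(recon, x)  # target output = the input itself
    opt.zero_grad()
    loss.backward()
    opt.step()
```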
Briefly describe the encoding part of forward propagation in an autoencoder.
Features propagate through the network until you reach the code region, where the number of nodes is smaller than the number of inputs.
When we apply the encoding step we are compressing the data; however, we lose information - this is lossy compression.
Briefly describe the decoding part of forward propagation in an autoencoder.
After the input data has been encoded into a lower-dimensional representation, it is decoded back to the original input space.
What is the target output of an autoencoder?
We want to get back something very similar to what we put in.
It will not be exactly the same, because we lose information in passing through the lower-dimensional code. Recall that a single-layer neural network acts as a universal approximator, but only if we have lots of nodes.
Why are the inputs and outputs of an autoencoder not exactly the same?
Information is lost when the data is compressed into the lower-dimensional code.