Lecture 3 Flashcards
1
Q
Machine Learning Basics
A
- Identify some features (by hand or automatically)
- Feed the system with training data
  o The system figures out which features are useful
- Supervised learning: we have labels
  o E.g. an image from a healthy/unhealthy patient
  o For example multi-class problems, where we try to find linear boundaries between classes
- Unsupervised learning: we have no labels
  o Aims to find structure in the data (e.g. K-means; see the sketch at the end of this card)
- Semi-supervised learning: we have partially unlabelled data
  o Labelled data is very expensive
  o Unlabelled data is easy to obtain
  o How can we improve decision rules by means of unlabelled data?
- Classification: discrete (non-continuous) target variable
  o E.g. is the person in the image male or female?
  o Aims to label data
- Regression: continuous target variable
  o E.g. how old is the person in the image?
  o Aims to learn a function
- Discriminative model: generates no new examples, only predicts labels
- Generative model: can generate new examples
  o E.g. ChatGPT
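
A minimal sketch of finding structure in unlabelled data with K-means, assuming scikit-learn is available; the two point clouds and the cluster count are made up for illustration:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Unlabelled data: two made-up blobs of 2-D points
    points = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(kmeans.labels_[:5])        # cluster assignment per point (no labels were given)
    print(kmeans.cluster_centers_)   # discovered structure: one centre per cluster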
2
Q
Parametric Models
A
- The result of the supervised learning procedure is a function that predicts the label y from a given input x
- The model typically has some parameters W
  o The number of parameters defines the capacity
  o Balance between too few (under-fitting) and too many (over-fitting)
  o Compromise between:
    ▪ Best fit to the training data
    ▪ Best generalisation to future unseen data
  o The parameters need to be initialised, but must not all get the same value -> otherwise all parameters would receive the same update after calculating the gradient
    ▪ Best to use random initialisation from a normal distribution
    ▪ The variance of a neuron's output increases with its number of inputs -> we can normalise for this (see the sketch below)
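
A small sketch of such an initialisation, assuming NumPy; dividing by sqrt(fan_in) is one common way to keep the output variance roughly constant (the layer sizes are made up):

    import numpy as np

    rng = np.random.default_rng(42)
    fan_in, fan_out = 4, 2   # made-up layer sizes

    # Naive: the variance of Wx grows with the number of inputs (fan_in)
    W_naive = rng.normal(0.0, 1.0, (fan_out, fan_in))

    # Scaled: divide by sqrt(fan_in) so the output variance stays roughly constant
    W_scaled = rng.normal(0.0, 1.0, (fan_out, fan_in)) / np.sqrt(fan_in)
    b = np.zeros(fan_out)    # biases can safely start at zero (no symmetry problem)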
3
Q
Training Supervised Models
A
- Typically given N training samples
  o Each sample is a feature vector x with a label y
  o The goal is to learn a model (function) f(x, W) = y
    ▪ Needs to be good at predicting the right label for unseen data
- Loss function: used to steer the optimisation of the parameters
  o Should be differentiable
  o Often cross-entropy is used
  o Input: predicted label & true label
- Optimiser: uses the loss function and updates the parameters to reduce the total loss
  o I.e. we want to find a global minimum of the loss function, e.g. using gradient descent (see the sketch at the end of this card)
    ▪ Batch: use the entire training set to calculate gradient steps
    ▪ Stochastic: use single samples to calculate gradient steps
    ▪ Mini-batch: use a subset of the training data with more than 1 sample to calculate gradient steps. One pass over all mini-batches (i.e. the whole training set) is called one epoch
- Testing: present a set of test data that is unseen by the model and see how many labels are predicted correctly
- Forward pass: feeding data into the model and making a prediction
- Backward pass: using the prediction to optimise the parameters
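
A minimal sketch of this loop with mini-batch gradient descent, assuming NumPy and a simple logistic-regression model with cross-entropy loss; the data, learning rate and batch size are made up:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))            # N=100 made-up samples, 4 features each
    y = (X[:, 0] > 0).astype(float)          # made-up binary labels
    W, b = np.zeros(4), 0.0
    lr, batch_size = 0.1, 10

    for epoch in range(5):                   # one epoch = one pass over all mini-batches
        idx = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            xb = X[idx[start:start + batch_size]]
            yb = y[idx[start:start + batch_size]]
            p = 1.0 / (1.0 + np.exp(-(xb @ W + b)))   # forward pass: predict
            grad_W = xb.T @ (p - yb) / len(xb)        # backward pass: cross-entropy gradient
            grad_b = np.mean(p - yb)
            W -= lr * grad_W                          # optimiser step: reduce the loss
            b -= lr * grad_b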
4
Q
Data in Supervised Learning
A
- Data is represented as a point cloud in a high-dimensional vector space
- The labelled dataset is split into training data and test data
  o The test set is used to report the performance of the system
  o The same image must not occur in both sets
  o Specific to medical imaging: the same patient may not occur in both sets (even if the images are different)
- Problem: we tune parameters on the test set
  o Thus the test set is not independent of the system development
- Solution: we make a new split and add a validation set (see the sketch below)
  o Evaluate on the validation set
  o Only after finishing training, use the test set
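
A sketch of a patient-level train/validation/test split, assuming scikit-learn; GroupShuffleSplit is one way to guarantee that no patient ends up in more than one set (the data and split sizes are made up):

    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    X = np.arange(20).reshape(-1, 1)          # made-up images
    patient = np.repeat(np.arange(10), 2)     # two images per patient

    # First carve off the test set, grouping by patient
    outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    trainval_idx, test_idx = next(outer.split(X, groups=patient))

    # Then split the rest into training and validation, again by patient
    inner = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
    train_rel, val_rel = next(inner.split(X[trainval_idx], groups=patient[trainval_idx]))
    train_idx, val_idx = trainval_idx[train_rel], trainval_idx[val_rel]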
5
Q
K-fold Cross-Validation
A
- Divide the labelled dataset into K subsets
- Use one subset as the validation set
- Use one subset as the test set
- Use the remaining subsets as training data
- Repeat K times for all possible combinations
- Also here: a patient may not appear in different sets
- We end up with K trained systems (as each is trained on a different training set)
  o Can combine them into an ensemble system
  o Or use cross-validation to choose parameters and then retrain on the full dataset (see the sketch below)
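
A minimal sketch of the fold loop, assuming scikit-learn; GroupKFold keeps each patient inside a single fold. For brevity the held-out fold plays the role of the validation set here, without the extra test fold from the card (all data is made up):

    import numpy as np
    from sklearn.model_selection import GroupKFold

    X = np.arange(20).reshape(-1, 1)          # made-up samples
    y = np.tile([0, 1], 10)                   # made-up labels
    patient = np.repeat(np.arange(10), 2)     # two samples per patient

    for fold, (train_idx, val_idx) in enumerate(
            GroupKFold(n_splits=5).split(X, y, groups=patient)):
        # train one of the K systems on train_idx, evaluate it on val_idx
        print(fold, len(train_idx), len(val_idx))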
6
Q
Neural Networks
A
- Consists of an input layer, a number of hidden layers and an output layer
- In each neuron, each input from the previous layer has its own weight
  o A neuron also has an additional bias term
- Typically we use matrix notation (see the sketch at the end of this card)
  o The input layer becomes x (4x1)
  o The hidden layer becomes W (2x4)
  o The biases become b (2x1)
  o The output becomes Wx + b
- Before the output is passed to the next layer, we apply a non-linear activation function
  o Must be non-linear so that we can develop complex representations that are not possible with linear regression models
- Output layer: one neuron for each possible label in an n-class classification problem
  o Special activation function tailored to the question
    ▪ E.g. multi-class classification uses one-hot vectors as output
- Number of parameters: each neuron has a weight for each of its inputs plus one bias term
  o More hidden neurons can lead to over-fitting
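
A forward pass through one hidden layer using the card's dimensions, assuming NumPy (the values and the choice of ReLU as activation are made up):

    import numpy as np

    x = np.array([[1.0], [2.0], [3.0], [4.0]])   # input x: 4x1
    W = np.full((2, 4), 0.1)                     # hidden-layer weights W: 2x4
    b = np.zeros((2, 1))                         # biases b: 2x1

    z = W @ x + b                                # pre-activation Wx + b: 2x1
    h = np.maximum(0.0, z)                       # non-linear activation (ReLU)
    # Parameter count: one weight per input per neuron plus one bias each
    # -> 2*4 weights + 2 biases = 10 parameters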
7
Q
Backpropagation
A
- Recursively apply the chain rule to compute gradients of expressions (see the sketch at the end of this card)
- In blue (in the lecture diagram): the forward pass
  o We see that we always perform an operation on two variables at a time
- In purple: the backward pass
- Makes use of a learning rate -> how fast we update the parameter values
  o Usually pick an initial value and implement a strategy to decrease it over the number of epochs (e.g. linear decrease)
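
A tiny backpropagation sketch on f = (x + y) * z, performing one operation on two variables at a time; the values and the linear learning-rate schedule are made up:

    # forward pass
    x, y, z = -2.0, 5.0, -4.0
    q = x + y                    # q = 3.0
    f = q * z                    # f = -12.0

    # backward pass: recursively apply the chain rule
    df_dq = z                    # local gradient of q*z w.r.t. q
    df_dz = q                    # local gradient of q*z w.r.t. z
    df_dx = df_dq * 1.0          # chain rule through q = x + y
    df_dy = df_dq * 1.0
    print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0

    # One possible learning-rate schedule: linear decrease over the epochs
    lr0, epochs = 0.1, 10
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)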