Lecture 3 Flashcards
1
Q
Machine Learning Basics
A
- Identify some features (by hand or automatically)
- Feed the system with training data
  o The system figures out which features are useful
- Supervised learning: we have labels
  o E.g. an image from a healthy/unhealthy patient
  o For example multi-class problems, where we try to find linear boundaries between classes
- Unsupervised learning: we have no labels
  o Aims to find structure in the data (e.g. K-means; see the sketch at the end of this card)
- Semi-supervised learning: we have partially unlabelled data
  o Labelled data is very expensive
  o Unlabelled data is easy to obtain
  o How can we improve decision rules by means of unlabelled data?
- Classification: discrete (non-continuous) target variable
  o E.g. is the person in the image male or female?
  o Aims to label data
- Regression: continuous target variable
  o E.g. how old is the person in the image?
  o Aims to learn a function
- Discriminative model: generates no new examples, only predicts labels
- Generative model: can generate new examples
  o E.g. ChatGPT
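
A minimal sketch of finding structure in unlabelled data with K-means, assuming scikit-learn is available; the two point clouds and the cluster count are made up for illustration:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Unlabelled data: two made-up blobs of 2-D points
    points = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(kmeans.labels_[:5])        # cluster assignment per point (no labels were given)
    print(kmeans.cluster_centers_)   # discovered structure: one centre per cluster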
2
Q
Parametric Models
A
- The result of the supervised learning procedure is a function that predicts the label y from a given input x
- The model typically has some parameters W
  o The number of parameters defines the capacity
  o Balance between too few (under-fitting) and too many (over-fitting)
  o Compromise between:
    ▪ Best fit to the training data
    ▪ Best generalisation to future unseen data
  o The parameters need to be initialised, but must not all get the same value -> otherwise all parameters would receive the same update after calculating the gradient
    ▪ Best to use random initialisation from a normal distribution
    ▪ The variance of a neuron's output increases with its number of inputs -> we can normalise for this (see the sketch below)
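
A small sketch of such an initialisation, assuming NumPy; dividing by sqrt(fan_in) is one common way to keep the output variance roughly constant (the layer sizes are made up):

    import numpy as np

    rng = np.random.default_rng(42)
    fan_in, fan_out = 4, 2   # made-up layer sizes

    # Naive: the variance of Wx grows with the number of inputs (fan_in)
    W_naive = rng.normal(0.0, 1.0, (fan_out, fan_in))

    # Scaled: divide by sqrt(fan_in) so the output variance stays roughly constant
    W_scaled = rng.normal(0.0, 1.0, (fan_out, fan_in)) / np.sqrt(fan_in)
    b = np.zeros(fan_out)    # biases can safely start at zero (no symmetry problem)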
3
Q
Training Supervised Models
A
- Typically given N training samples
  o Each sample is a feature vector x with a label y
  o The goal is to learn a model (function) f(x, W) = y
    ▪ Needs to be good at predicting the right label for unseen data
- Loss function: used to steer the optimisation of the parameters
  o Should be differentiable
  o Often cross-entropy is used
  o Input: predicted label & true label
- Optimiser: uses the loss function and updates the parameters to reduce the total loss
  o I.e. we want to find a global minimum of the loss function, e.g. using gradient descent (see the sketch at the end of this card)
    ▪ Batch: use the entire training set to calculate gradient steps
    ▪ Stochastic: use single samples to calculate gradient steps
    ▪ Mini-batch: use a subset of the training data with more than 1 sample to calculate gradient steps. One pass over all mini-batches (i.e. the whole training set) is called one epoch
- Testing: present a set of test data that is unseen by the model and see how many labels are predicted correctly
- Forward pass: feeding data into the model and making a prediction
- Backward pass: using the prediction to optimise the parameters
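
A minimal sketch of this loop with mini-batch gradient descent, assuming NumPy and a simple logistic-regression model with cross-entropy loss; the data, learning rate and batch size are made up:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))            # N=100 made-up samples, 4 features each
    y = (X[:, 0] > 0).astype(float)          # made-up binary labels
    W, b = np.zeros(4), 0.0
    lr, batch_size = 0.1, 10

    for epoch in range(5):                   # one epoch = one pass over all mini-batches
        idx = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            xb = X[idx[start:start + batch_size]]
            yb = y[idx[start:start + batch_size]]
            p = 1.0 / (1.0 + np.exp(-(xb @ W + b)))   # forward pass: predict
            grad_W = xb.T @ (p - yb) / len(xb)        # backward pass: cross-entropy gradient
            grad_b = np.mean(p - yb)
            W -= lr * grad_W                          # optimiser step: reduce the loss
            b -= lr * grad_b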
4
Q
Data in Supervised Learning
A
- Data is represented as a point cloud in a high-dimensional vector space
- The labelled dataset is split into training data and test data
  o The test set is used to report the performance of the system
  o The same image must not occur in both sets
  o Specific to medical imaging: the same patient may not occur in both sets (even if the images are different)
- Problem: we tune parameters on the test set
  o Thus the test set is not independent of the system development
- Solution: we make a new split and add a validation set (see the sketch below)
  o Evaluate on the validation set
  o Only after finishing training, use the test set
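
A sketch of a patient-level train/validation/test split, assuming scikit-learn; GroupShuffleSplit is one way to guarantee that no patient ends up in more than one set (the data and split sizes are made up):

    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    X = np.arange(20).reshape(-1, 1)          # made-up images
    patient = np.repeat(np.arange(10), 2)     # two images per patient

    # First carve off the test set, grouping by patient
    outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    trainval_idx, test_idx = next(outer.split(X, groups=patient))

    # Then split the rest into training and validation, again by patient
    inner = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
    train_rel, val_rel = next(inner.split(X[trainval_idx], groups=patient[trainval_idx]))
    train_idx, val_idx = trainval_idx[train_rel], trainval_idx[val_rel]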
5
Q
K-fold Cross-Validation
A
- Divide the labelled dataset into K subsets
- Use one subset as the validation set
- Use one subset as the test set
- Use the remaining subsets as training data
- Repeat K times for all possible combinations
- Also here: a patient may not appear in different sets
- We end up with K trained systems (as each is trained on a different training set)
  o Can combine them into an ensemble system
  o Or use cross-validation to choose parameters and then retrain on the full dataset (see the sketch below)
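
A minimal sketch of the fold loop, assuming scikit-learn; GroupKFold keeps each patient inside a single fold. For brevity the held-out fold plays the role of the validation set here, without the extra test fold from the card (all data is made up):

    import numpy as np
    from sklearn.model_selection import GroupKFold

    X = np.arange(20).reshape(-1, 1)          # made-up samples
    y = np.tile([0, 1], 10)                   # made-up labels
    patient = np.repeat(np.arange(10), 2)     # two samples per patient

    for fold, (train_idx, val_idx) in enumerate(
            GroupKFold(n_splits=5).split(X, y, groups=patient)):
        # train one of the K systems on train_idx, evaluate it on val_idx
        print(fold, len(train_idx), len(val_idx))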
6
Q
Neural Networks
A
- Consists of an input layer, a number of hidden layers and an output layer
- In each neuron, each input from the previous layer has its own weight
  o A neuron also has an additional bias term
- Typically we use matrix notation (see the sketch at the end of this card)
  o The input layer becomes x (4x1)
  o The hidden layer becomes W (2x4)
  o The biases become b (2x1)
  o The output becomes Wx + b
- Before the output is passed to the next layer, we apply a non-linear activation function
  o Must be non-linear so that we can develop complex representations that are not possible with linear regression models
- Output layer: one neuron for each possible label in an n-class classification problem
  o Special activation function tailored to the question
    ▪ E.g. multi-class classification uses one-hot vectors as output
- Number of parameters: each neuron has a weight for each of its inputs plus one bias term
  o More hidden neurons can lead to over-fitting
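
A forward pass through one hidden layer using the card's dimensions, assuming NumPy (the values and the choice of ReLU as activation are made up):

    import numpy as np

    x = np.array([[1.0], [2.0], [3.0], [4.0]])   # input x: 4x1
    W = np.full((2, 4), 0.1)                     # hidden-layer weights W: 2x4
    b = np.zeros((2, 1))                         # biases b: 2x1

    z = W @ x + b                                # pre-activation Wx + b: 2x1
    h = np.maximum(0.0, z)                       # non-linear activation (ReLU)
    # Parameter count: one weight per input per neuron plus one bias each
    # -> 2*4 weights + 2 biases = 10 parameters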
7
Q
Backpropagation
A
- Recursively apply the chain rule to compute gradients of expressions (see the sketch at the end of this card)
- In blue (in the lecture diagram): the forward pass
  o We see that we always perform an operation on two variables at a time
- In purple: the backward pass
- Makes use of a learning rate -> how fast we update the parameter values
  o Usually pick an initial value and implement a strategy to decrease it over the number of epochs (e.g. linear decrease)
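
A tiny backpropagation sketch on f = (x + y) * z, performing one operation on two variables at a time; the values and the linear learning-rate schedule are made up:

    # forward pass
    x, y, z = -2.0, 5.0, -4.0
    q = x + y                    # q = 3.0
    f = q * z                    # f = -12.0

    # backward pass: recursively apply the chain rule
    df_dq = z                    # local gradient of q*z w.r.t. q
    df_dz = q                    # local gradient of q*z w.r.t. z
    df_dx = df_dq * 1.0          # chain rule through q = x + y
    df_dy = df_dq * 1.0
    print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0

    # One possible learning-rate schedule: linear decrease over the epochs
    lr0, epochs = 0.1, 10
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)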