ML Flashcards

(99 cards)

1
Q

Unsupervised Learning and how is the model trained

A

Only input data provided and models learn to extract patterns from the data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Supervised Learning and how is the model trained

A

Each input paired with an output (target) and model trained to minimise the error

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Regression

A

Finding the relationship between two variables where the model is linear

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Classification

A

Category labels e.g. dog or bagel

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Underfitting

A

A model that is too simple which does not fit the data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Overfitting

A

A model that fits minor variations or noise

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Model Selection and how does it work

A

Selecting correct model
Split dataset for training and validation (and testing)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Training dataset

A

Used to train/optimise the model

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Validation dataset

A

Used for validating the model

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Test dataset

A

Used to test model for general fitting quality

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Cross-validation

A

Split data into S groups so (S-1)/S data used for training

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

No free lunch theorem

A

All models are wrong, but some are useful

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Model parameters

A

Values learned from training data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Parametric model

A

of parameters stay the same as quantity of data increases

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Non-parametric model

A

of parameters increase/decrease as quantity of data increases

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Likelihood function

A

Probability of data given model parameters

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Maximum likelihood estimation

A

Method for estimating parameters of a probabilistic model

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

Linear regression model formula

A

p(y|x,w) = w^T x + e

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

What is the distribution of e in the linear regression model and what is the bias parameter in the linear regression model formula and what is it for

A

Gaussian distribution with mean 0: N(0, standard deviation squared)

Within the vector w by addition of dummy variable which always has value 1 and gives extra flexibility to fit the data 1

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

Linear regression to a feature vector

A

p(y|ϕ(x), w) = w^T ϕ(x) + e

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

Least-squares problem for linear regression formula

A

Theta = (X^T X)^-1 X^T y

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

Two key points about least-squares solutions

A

The solution has a closed-form
The solution is also the maximum likelihood solution

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

What does the Linear Discriminant compute (formula) and what class is x assigned to according to y

A

y=w^T x
y>=0 means C1
Otherwise C2

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

What assumptions are made for parameters to be learnt by applying MLE?

A
  1. Data for each class have a Gaussian distribution
  2. These two Gaussian distributions have the same covariance matrix
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Logistic sigmoid function
sigmoid(a) = 1/(1+e^-a)
25
Logistic regression model for two classes
p(C1|x) = sigmoid(w^T x) p(C2|x) = 1 - p(C1|x)
26
How can the MLE parameters for logistic regression be found?
Using gradient descent
27
Weights
Parameters that transform input data within each neuron
28
Activation function
Determines whether a neuron should be activated and how to transform the input signal into an output signal
29
Input layer and how many neurons
Consists of neurons that receive input data and pass to next layer - # of neurons = # of features in input dataset
30
Hidden Layer
Layer of neurons between input and output which processes input data using weights and activation functions
31
Output layer
Final layer that produces result/prediction
32
Cost/Loss function
Measures difference between predicted and true output
33
Forward pass process
1) Calculate activations of hidden layer, h 2) Pass result of step 1 through a nonlinear function e.g. sigmoid 3) Use step 2 to calculate activations of output layer, o 4) Compute predictions using sigmoid of step 3
34
Backpropagation / backward pass and formula and what do the symbols represent
Algorithm used to compute gradients of loss function with respect to each weight to update weights across multiple layers based on derivative of error wrt the weight formula is δj = h′(aj )(Sum to k of:wkj δk) δj = error signal for jth hidden unit wkj = weight connecting hidden unit j to output unit k h'(aj) = derivative of activation function δ k = error signal at kth output unit
35
Vanishing gradient problem
When gradients used to update the weights during backpropagation become too small which slows down/stops learning process
36
Exploding gradient problem
When gradients grow too large during backpropagation, causing weights to become large and degrading model performance
37
Gradient clipping and formula and what do the variables mean
Used to cap magnitude of gradient : g' = min(1, c/||g|| ) g c = constant value to limit size of gradient
38
Residual network and formula
Each layer has the form of a residual layer which is like a shortcut for gradients to flow directly from output to previous layer defined to be: F1' = F1(x)+x where F1(x) is the standard mapping (linear transformation and then actication)
38
Non-saturating activation function
Activation functions that don't compress inputs into a limited range, avoiding vanishing gradients
39
Parameter initialisation significance
Necessary to choose good initial parameters to avoid issues with gradient descent
40
Early stopping
Process of creating a validation set and stopping training once the error on the validation step starts to increase
41
Weight decay formula
Adds a term: (lambda)w^T w to the loss function where user chooses value of lambda>0 to penalise large weights
42
Dropout and pro
When all outgoing connections from a unit with some given probability are turned off during training to ensure that each unit can perform well without depending on other units This can drastically reduce overfitting
43
Classification tree
Decision tree where each internal node represents decision based on a feature and each leaf represents a class label
44
Regression tree
Decision tree where each internal node represents a decision based on a feature and each leaf node represents a predicted value
45
Are trees parametric or non-parametric?
Non-parametric
46
CART (Classification And Regression Trees) algorithm
Learning the tree structure greedy algorithm: 1) Start from root node 2) Run exhaustive search over each possible variable and threshold for a new node: For each variable and threshold: a. Compute average of target variable for each leaf of the proposed node b. Compute error if we stop adding nodes here 3) Choose variable and threshold that minimise the error 4) Add a new node for the chosen variable and threshold 5) Repeat step 2 until there are only n data points associated with each leaf node 6) Prune back the tree to remove branches that do not reduce error by more than a small tolerance value, e Pruning: 1) Start with a tree T0 2) Consider pruning each node in T0 by combining branches to obtain tree T 3) Compute C(T) = (Sum from T=1 to |T|) et(T) + lambda |T| where - |T| is # of leaves - et(T) is error associated with t'th leaf of tree - lambda is a penalty added for the # of leaves in the tree 4) If C(T) <= C(T0) keep the pruned tree, otherwise reinstate the pruned node
47
Kernel function and example
Used to compute dot product between two data points e.g. For two vectors, x and x', possible kernel function K(x,x') = ϕ(x).ϕ(x') where ϕ is mapping to a higher-dimensional space
48
Kernel trick
We just need the kernel function to make a prediction so evaluate the kernel function values e.g. k(x1, x) without first computing ϕ(x1) and ϕ(x) and then compute their scalar product
49
Dual parameter
Indicate how much each training sample contributes to the decision boundary
50
Gram matrix
Defined to be K = ϕϕ^T so Knm = ϕ(xn)^T ϕ(xm) which is the similarity between the nth and mth datapoint
51
Margin
Distance from decision boundary (hyperplane) to closest training datapoint
52
Support vectors
Data points that lie closest to decision boundary (margin)
53
Soft margin
Allows points to lie on wrong side of hyperplane in exchange for a penalty which is controlled by a regularisation parameter
54
Slack variables
Measures distance of misclassified points from decision boundary
55
How can SVMs be extended to more than two classes? (for k classes)
One-versus-one = train k(k-1)/2 SVM classifiers to distinguish between each pair of classes and take a vote for the predicted class One-versus-the-rest = train k SVM classifiers to distinguish between each class and all other classes
56
Bayesian network
Graphical model that represents a set of random variables and their conditional dependencies via a DAG
57
What is needed to construct a Bayesian network to represent a given joint distribution?
1) The DAG (structure of the BN) 2) Conditional probability distributions p(xk|pak) (parameters of the BN)
58
Collider
A node is a collider on a path if both arrows point to it on the path
59
Blocked path
A path is blocked by S if at least one of the following: 1) There is a collider that is not in S and none of its desendants are in S 2) There is a non-collider on the path that is in S
60
How can you represent the joint probability distribution for a Bayesian network?
If the network has nodes X1,X2,...,Xn the joint probability distribution is: P(X1,X2,..,Xn)=∏ i=1to n (P(Xi)|Parents(Xi))
61
What is plate notation
A plate is drawn around a subset of variables, and the number of repetitions is indicated on the plate.
62
How would you translate a ML model to a BN?
To translate a machine learning model into a Bayesian network, you identify the random variables in the model and their dependencies.
63
What does d-separated mean? What is its significance?
If all paths from node x to node y are blocked given nodes S then x and y are d-separated by S. This means x and y are conditionally independent given S in the DAG.
64
Prior distribution
Represents our beliefs about the parameters of a model before observing any data.
65
Posterior distribution
Combines the prior distribution and the likelihood to provide an updated belief about the parameters after observing the data.
66
What do we know about parameters and data in the Bayesian approach?
Parameters (unknown quantities we try to estimate) and latent variables (unobserved variables that influence the model) are treated as random variables Data (observed variables) are also considered random but are known quantities
67
Ancestral sampling and example for: P(A,B,C,D,E) = p(A)p(B)p(C|A,B)p(D|C)p(E|B,C)
Generates samples from a joint probability distribution from parents to children e.g. We first sample values for A and B, suppose we get A=0, B=1. Then we sample from C from the conditional distribution P(C|A=0, B=1) and so on
68
Rejection sampling example for P(A,B,C,D,E) = p(A)p(B)p(C|A,B)p(D|C)p(E|B,C)
If we wanted to sample from P(B,D|E=1) we would sample from the marginal distribution P(B,D,E) and throw away those samples where E!=1
69
Markov chain
Series of random variables z1,...,zm such that the following holds for mE{1,...,M-1}: p(zm+1|z1,...,zm) = p(zm+1|zm)
70
Homogeneous Markov chain
If p(zm+1|zm) is the same for all m
71
Initial distribution in Markov chain
p(z1)
72
Transition distribution in Markov chain
It captures how likely it is to transition from the current state to any possible next state.
73
Markov chain Monte Carlo and significance of samples
Sample a sequence of distributions which eventually get close to desired distribution: 1) Draw sample from each distribution in the sequence 2) Only keep samples when we get close enough to desired distribution *These samples are not independent
74
Target probability distribution
Used to construct a Markov chain For Bayesian ML this will be P(theta|D=d), the posterior distribution of the model parameters given the observed data
75
Metropolis-Hastings Algorithm
Define single transition probability distribution for a homogeneous Markov chain. Let current state be z^(T). Generate a value z* by sampling from a proposal distribution q(z|z^(T)) Accept z* as the new state with certain acceptance probability in which case z^(T+1)=z*. If we don't accept z* then stay where you are so z^(T+1)=z^(T)
75
Metropolis algorithm
Special case of the Metropolis-Hastings algorithm where the proposal distribution is symmetric (i.e., the probability of proposing a move to a state is the same as proposing a move back)
76
Proposal distribution
Distribution from which samples are drawn to general potential states in Markov chain
77
Acceptance probability for MH and M algorithm
MH: α(x,y) = min(1, π(y)q(y|x)/π(x)q(x|y) ) where π is target distribution and q is proposal distribution M: α(x,y) = min(1, π(y)/π(x))
78
Burn-in and what for
Initial period of MCMC simulation during which samples are discarded because early samples might not represent the target distribution well due to influence of initial state
79
Convergence in MCMC
Process by which distribution of samples generated by Markov chain approaches the target distribution
80
Why do we run several chains during MCMC?
Helps to diagnose convergence and explore parameter space
81
What can a sample from a distribution be used for
Can be used to approximate an expected value defined by that distribution
82
What is R hat used for?
Value computed from an MCMC run used to check for convergence. If the run is successful ie. there's been a convergence, it will be close to 1
83
Clustering
Unsupervised learning technique to group similar data points into groups
84
Soft clustering
Assigns data points to clusters probabilistically by applying MLE to a Gaussian mixture model, rather than strictly assigning each point to single cluster
85
Gaussian mixture model
Probabilistic model that assumes data is generated from a mixture of several Gaussian distributions (type of soft clustering model)
86
Mixing coefficient
Represents the proportion of the kth component in the overall distribution as a probability between 0 and 1
87
Responsibility in mixture model
The probability that a specific data point was generated by a particular cluster in the model
88
K-means algorithm
- Randomly initialise K cluster centroids - Assign each data point to the closest cluster centroid based on Euclidean distance - Recompute the centroids by calculating the mean of all data points assigned to each cluster - Repeat the assignment and update steps until convergence ie. when the cluster assignments no longer change significantly
89
EM algorithm purpose and steps
MLE for Gaussian mixtures: Intialise by choosing starting variables for u (mean), covariance, pi Then compute iterations of: E step: Compute values for the responsibilities given the current parameter values M step: Re-estimate the parameters using the current responsibilities
90
Hinderance of EM algorithm
There is no guarantee that it will succeed in maximising the log-likelihood. It may converge to a local maximum which is not a global maximum of the log-likelihood function.
91
3 EM equations
(log likelihood of observed data X given parameters theta) ln p(X∣θ)=L(q,θ)+KL(q∣∣p) (Lower bound on log-likelihood) L(q,θ)= (sum to Z) q(Z) ln(p(X,Z∣θ)/q(Z)) q(Z) = distribution over latent variables Z P(X, Z|θ) = joint probability of observed data X and latent variables Z given parameters (Kullback-Leibler divergence between probability distributions p1 and p2) KL(q||p)= -(sum to Z) q(Z) ln(p(Z∣X,θ)/q(Z))
92
Key fact about Kullback-Leibler
KL(q||p) ≥ 0 for any choice of q, so L(q, θ) ≤ ln p(X|θ).
93
What is increased in E-step?
We increase L(q, θ) by updating q (and leaving θ fixed)
94
What is increased in M step?
We increase L(q, θ) by updating θ (and leaving q fixed)
95
Why does the EM algorithm work?
After the E-step we have L(q, θ) = ln p(X|θ) (and so KL(q||p) = 0), so that in the following M-step increasing L(q, θ) will also increase ln p(X|θ).
96
What is the ReLU function and where is it not differentiable (what do we do here instead)?
Straight line with slope 1 for positive values of 𝑥 and a horizontal line at 0 for negative values of 𝑥. It is not differentiable at 0 but we just impose a gradient of 0.