General Flashcards
(89 cards)
Define bias
The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting)
Define variance
The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).
Define bias-variance tradeoff
It is the compromise in choosing a model that both accurately captures the regularities in its training data and generalises well to unseen data. High-variance learning methods represent their training set well but overfit to noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models that don’t tend to overfit but may underfit their training data, failing to capture important regularities.
Models with low bias are usually more complex (e.g. higher-order regression polynomials), enabling them to represent the training set more accurately. In the process, however, they may also represent a large noise component in the training set, making their predictions less accurate - despite their added complexity. In contrast, models with higher bias tend to be relatively simple (low-order or even linear regression polynomials) but may produce lower variance predictions when applied beyond the training set.
How to overcome overfitting
- Reduce the model complexity (fewer features)
- Regularization (features contribute less)
What is a vector norm
A way of measuring the length of a vector
Give examples of vector norms
- L1
- L2
Define length of L2 norm ||B||_2
√(B_0^2 + B_1^2)
Define length of L1 norm ||B||_1
|B_0|+|B_1|
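A quick sketch of both norms in plain Python, generalising the two-component definitions above to any number of components:

```python
import math

def l2_norm(beta):
    # ||B||_2 = square root of the sum of squared components
    return math.sqrt(sum(b * b for b in beta))

def l1_norm(beta):
    # ||B||_1 = sum of absolute components
    return sum(abs(b) for b in beta)

beta = [3.0, 4.0]
print(l2_norm(beta))  # 5.0
print(l1_norm(beta))  # 7.0
```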
Sketch ||B||_2 = 2 and ||B||_1 = 2
See https://en.wikipedia.org/wiki/File:L1_and_L2_balls.jpg: the L2 ball is a circle and the L1 ball a diamond, each crossing the axes at 2
Describe Ordinary Least Squares
OLS chooses the parameters of a linear function of a set of explanatory variables by minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being predicted) in the given dataset and those predicted by the linear function. Geometrically, this is the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression line.
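The description above can be sketched with the closed-form solution for simple (one-feature) linear regression; the data here is a made-up toy example:

```python
def ols_fit(xs, ys):
    # minimise sum of (y_i - (b0 + b1 * x_i))^2 via the closed-form
    # solution: b1 = cov(x, y) / var(x), b0 = mean(y) - b1 * mean(x)
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx
    return b0, b1

# toy data lying exactly on the line y = 2x + 1
b0, b1 = ols_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)  # 1.0 2.0
```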
What is iid?
A sequence or other collection of random variables is independent and identically distributed (iid) if each random variable has the same probability distribution as the others and all are mutually independent.
The assumption is important in the classical form of the central limit theorem, which states that the probability distribution of the sum (or average) of IID variables with finite variance approaches a normal distribution.
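A minimal simulation of the CLT claim, using averages of iid Uniform(0, 1) draws; the sample sizes here are arbitrary choices:

```python
import random
import statistics

random.seed(0)

# Each entry is the average of 50 iid Uniform(0, 1) draws. By the CLT the
# distribution of such averages should concentrate near the mean 0.5,
# with variance (1/12) / 50 ~= 0.00167, and look approximately normal.
averages = [statistics.mean(random.random() for _ in range(50))
            for _ in range(2000)]

print(round(statistics.mean(averages), 2))      # close to 0.5
print(round(statistics.variance(averages), 4))  # close to 0.0017
```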
What is the problem with highly correlated explanatory variables in OLS?
The coefficient estimates have very high variance between different samples, so feature weights can become abnormally large
What is C in Ridge Regression (L2)?
C is the radius of the CIRCLE (the L2 ball) constraining the coefficients,
where you define ||B||^2_2 <= C^2
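A minimal sketch of how the L2 penalty shrinks a coefficient, using no-intercept one-feature ridge regression, whose closed-form slope is sum(x*y) / (sum(x^2) + lambda); the data and lambda values are made up:

```python
def ridge_slope(xs, ys, lam):
    # no-intercept ridge: b = sum(x * y) / (sum(x^2) + lam)
    # lam = 0 recovers the OLS slope; larger lam shrinks b toward 0,
    # keeping the solution inside a smaller L2 ball ||B||_2 <= C
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs, ys = [1, 2, 3], [2, 4, 6]     # toy data on the line y = 2x
print(ridge_slope(xs, ys, 0.0))   # 2.0 (plain OLS)
print(ridge_slope(xs, ys, 14.0))  # 1.0 (shrunk: 28 / (14 + 14))
```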
What is the main difference in outcome between using L1 and L2 space for regularisation?
Given the L1 diamond shape as opposed to the L2 circle, you’re more likely to hit a corner which zeros coefficients.
Which regularisation gives a sparse response?
L1, as it zeros some coefficients
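One way to see why L1 zeros coefficients is the soft-thresholding operator, which appears in coordinate-descent solvers for L1-regularised regression; a minimal sketch with made-up numbers:

```python
def soft_threshold(b, t):
    # proximal operator of the L1 penalty: shifts b toward zero by t,
    # and sets any coefficient with |b| <= t exactly to zero
    if b > t:
        return b - t
    if b < -t:
        return b + t
    return 0.0

coefs = [2.5, -0.3, 0.1, -1.8]
print([soft_threshold(b, 0.5) for b in coefs])  # [2.0, 0.0, 0.0, -1.3]
```

An L2 penalty, by contrast, only rescales coefficients and never maps a nonzero value exactly to zero.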
What is a generative model?
A generative model describes how data is generated, in terms of a probabilistic model.
In the scenario of supervised learning, a generative model estimates the joint probability distribution of data P(X, Y) between the observed data X and corresponding labels Y
Give examples of generative models
- Naive Bayes
- Hidden Markov Models
- Latent Dirichlet Allocation
- Boltzmann Machines
Why would you choose a discriminative model?
Because you don’t have enough data to estimate the density f reliably, so a generative model’s estimates would have massive variance.
Generative
p(x,y) = f(x|y)p(y)
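A minimal Naive Bayes sketch instantiating this factorisation with counts; the toy data, binary features, and Laplace smoothing are my own choices, not from the card:

```python
from collections import Counter, defaultdict

def train(samples):
    # samples: list of (features tuple, label)
    # estimate p(y) from label counts and p(x_j | y) from per-feature counts,
    # i.e. the pieces of p(x, y) = p(x | y) * p(y)
    prior = Counter(label for _, label in samples)
    cond = defaultdict(Counter)  # cond[(j, label)][value] = count
    for feats, label in samples:
        for j, v in enumerate(feats):
            cond[(j, label)][v] += 1
    return prior, cond

def predict(prior, cond, feats):
    n = sum(prior.values())
    def score(label):
        # p(y) * product over features of p(x_j | y), with add-one
        # (Laplace) smoothing for the two possible binary values
        p = prior[label] / n
        for j, v in enumerate(feats):
            p *= (cond[(j, label)][v] + 1) / (prior[label] + 2)
        return p
    return max(prior, key=score)

data = [((1, 0), "spam"), ((1, 1), "spam"), ((0, 0), "ham"), ((0, 1), "ham")]
prior, cond = train(data)
print(predict(prior, cond, (1, 0)))  # spam
```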
Generative versus discriminative, discuss
Discriminative models the probability of a class given an observation, P(C|x); generative models the probability of an observation given a class, P(x|C). For generative, given data, you model the whole distribution. For discriminative, given data, you model only the decision boundary. https://www.youtube.com/watch?v=OWJ8xVGRyFA
Pros and cons of discriminative model
Pros: easy to train & needs fewer observations
Cons: can classify, but cannot generate the data/observations back
Pros and cons of generative model
Pros: you get the underlying idea of what the classifier is built on
Cons: very expensive (lots of parameters) and needs lots of data
Define SVM
A non-probabilistic binary linear classifier that separates the categories with a hyperplane (or set of hyperplanes) by a clear gap that is as wide as possible: the hyperplane is chosen so that the distance between it and the nearest point x_i from either group is maximised
How do SVM perform non-linear classification?
Via the kernel trick
Describe the kernel trick
The idea is that data that isn’t linearly separable in n-dimensional space may be linearly separable in a higher-dimensional space. But, thanks to the Lagrangian dual formulation, we need not compute the explicit transformation of our data; we only need the inner product of our data in that higher-dimensional space.
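A small check of the trick for the polynomial kernel k(u, v) = (u·v)^2: the kernel value computed in the original 2-D space equals the inner product under an explicit degree-2 feature map φ (the map used here is a standard textbook choice, not from the card):

```python
import math

def phi(v):
    # explicit degree-2 feature map for 2-D input (x1, x2):
    # (x1^2, x2^2, sqrt(2) * x1 * x2) -- three coordinates we never
    # need to materialise when using the kernel directly
    x1, x2 = v
    return [x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2]

def poly_kernel(u, v):
    # k(u, v) = (u . v)^2, computed entirely in the original 2-D space
    return sum(a * b for a, b in zip(u, v)) ** 2

u, v = [1.0, 2.0], [3.0, 4.0]
explicit = sum(a * b for a, b in zip(phi(u), phi(v)))
print(poly_kernel(u, v))  # 121.0
print(explicit)           # agrees up to float rounding
```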
https://towardsdatascience.com/understanding-the-kernel-trick-e0bc6112ef78