Machine Learning Flashcards

(60 cards)

1
Q

Supervised ML

A

deals with samples that come with labels: the learner is given a set of labelled samples and learns from it

2
Q

classification

A

the set of possible labels is finite; the output is discrete and categorical

3
Q

binary classification

A

has two possible labels (+1/-1)

4
Q

multi-class classification

A

has more than two, but finitely many, labels

5
Q

regression

A

the set of possible labels is infinite and the output is continuous

6
Q

batch learning

A

given a training set of labelled samples, work out the labels for the samples in a test set

7
Q

online learning

A

see a sample, work out a predicted label, check the true label, and carry on with the next sample

8
Q

training/exploration stage

A

analyse the training set to produce a hypothesis

9
Q

exploitation stage

A

apply the hypothesis to the test data

10
Q

deduction/induction/transduction triangle

A

• data -> hypothesis: induction
• hypothesis -> unknown: deduction
• data -> unknown: transduction

11
Q

IID assumption

A

independent identically distributed

labelled samples (xi, yi) are assumed to be generated independently from the same probability measure

12
Q

feature

A

an attribute/component of a sample's representation in the dataset

13
Q

label

A

categorises the sample into a certain class; it is the thing we are trying to predict

14
Q

conformal prediction

A

given a training set and a test sample, try each potential label for the test sample in turn

for each label, look at how plausible the augmented training set is under the IID assumption

a p-value is used for this
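
A minimal sketch of this procedure in Python; the distance-to-class-mean conformity score is an illustrative assumption, not necessarily the measure used in the course:

```python
import numpy as np

def conformity_scores(X, y):
    # Illustrative conformity score: minus each sample's distance
    # to the mean of its own class (an assumed, deliberately simple choice).
    scores = np.empty(len(X))
    for label in np.unique(y):
        mask = (y == label)
        centre = X[mask].mean(axis=0)
        scores[mask] = -np.linalg.norm(X[mask] - centre, axis=1)
    return scores

def p_value(X_train, y_train, x_test, y_postulated):
    # Augment the training set with the test sample and a postulated label,
    # then see what fraction of samples conform no better than the test sample.
    X_aug = np.vstack([X_train, x_test])
    y_aug = np.append(y_train, y_postulated)
    scores = conformity_scores(X_aug, y_aug)
    return np.mean(scores <= scores[-1])

X_train = np.array([[0.0], [0.2], [1.0], [1.2]])
y_train = np.array([0, 0, 1, 1])
for y in (0, 1):
    print("label", y, "p-value", p_value(X_train, y_train, np.array([0.1]), y))
```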

15
Q

conformity measure

A

evaluates how well the newly observed test data conforms to the existing training data, giving a conformity score; the measure is required to be equivariant (permuting the samples permutes the scores in the same way)

16
Q

p-value

A

evaluates how plausible the augmented training set is; a small p-value means the postulated label is implausible

17
Q

non conformity measure

A

similar to a conformity measure, but measures how strange (non-conforming) the test data is compared with the training data

18
Q

average false p-value

A

the average of the p-values for all postulated labels in the test set except the true labels

19
Q

training accuracy

A

accuracy on training set

20
Q

generalisation accuracy

A

how accurately the model predicts on the test set after being trained on the training set

21
Q

overfitting

A

when the model learns the details and noise of the training set to such an extent that it cannot generalise, which hurts its performance on the test set

high training accuracy, low generalisation accuracy

22
Q

underfitting

A

when the model is too simple to capture enough of the underlying patterns of the training data, so it cannot make accurate predictions on the test data

low training accuracy, low generalisation accuracy

23
Q

learning curve

A

a plot of accuracy against the size n of the training set
use: understanding cross-validation

24
Q

RSS

A

residual sum of squares: RSS = ∑i (yi - ŷi)²
a residual is the difference between the true and the predicted label
least squares aims to minimise the RSS

25
Q

feature engineering

A

adding derived features to the training set at will, to obtain more accurate predictions from the model

26
Q

TSS

A

total sum of squares: the sum of the squared differences between each true label and the mean of the true labels, TSS = ∑i (yi - ȳ)²

27
Q

R^2

A

R² = 1 - RSS/TSS: the percentage of variability in the label that is explained by the data

a good value on the training set is still compatible with poor performance on the test data, due to possible overfitting

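A small numpy sketch tying RSS, TSS and R² together (the numbers are made up):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # true labels
y_pred = np.array([2.8, 5.3, 6.9, 9.4])   # model's predictions

rss = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - rss / tss
print(rss, tss, r2)
```
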
28
Q

regularisation

A

each feature should have as little effect on the outcome as possible, to avoid overfitting

29
Q

α regularisation parameter

A

α = 0: ridge regression coincides with least squares; no penalty is applied, the size of the coefficients does not change, and the model stays complex -> overfitting

α -> ∞: the coefficient estimates are forced to shrink towards 0; the shrinkage penalty dominates and the RSS becomes less important, so the model may become too simple -> underfitting

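A sketch of the two regimes using scikit-learn's Ridge on synthetic data (the coefficients and α values are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
# the last two features are irrelevant by construction
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=50)

for alpha in (0.0, 1.0, 1000.0):
    print(alpha, np.round(Ridge(alpha=alpha).fit(X, y).coef_, 3))
# alpha = 0 reproduces the least squares coefficients;
# large alpha shrinks every coefficient towards 0
```
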
30
Q

Lasso

A

least absolute shrinkage and selection operator
- uses an L1 penalty instead of the L2/Euclidean norm
- minimises the RSS while setting many coefficients w[j] to exactly 0
- so LASSO performs model selection: it yields a sparse model involving only some of the features
- use it if only a few of the features are expected to be important, or when interpretability of the model matters

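A sketch of the sparsity effect with scikit-learn's Lasso, again on synthetic data (which coefficients get zeroed depends on the chosen α):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=50)

lasso = Lasso(alpha=0.5).fit(X, y)
print(np.round(lasso.coef_, 3))                  # typically several coefficients are exactly 0
print((lasso.coef_ != 0).sum(), "features kept") # sparse model -> model selection
```
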
31
Q

method chaining

A

concatenation of method calls

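For instance (the dataset and estimator are arbitrary choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), random_state=0)

# fit() returns the estimator itself, so the calls can be chained in one expression
y_pred = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict(X_test)
```
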
32
Q

data normalisation

A

putting all features of the dataset on the same scale, so that they are comparable for the model

33
Q

normalisation - least squares

A

not essential:
- if the first feature x[0] is measured in metres, w^[0] will be the corresponding least squares estimate
- if x[0] is instead measured in km, all the xi[0] decrease 1000-fold
- running least squares on the new dataset makes w^[0] increase 1000-fold, so the predictions do not change

34
Q

normalisation - ridge/lasso

A

essential, because the penalty terms are the same for all variables, so predictions change under rescaling

normalising the features prevents some features from being unfairly penalised by the penalty terms

35
Q

StandardScaler

A

makes each feature have mean 0 and standard deviation 1:
1) shift each feature down by its mean
2) divide each feature by its SD

36
Q

RobustScaler

A

makes each feature have median 0 and IQR 1:
1) shift each feature down by its median
2) divide each feature by its IQR

37
Q

MinMaxScaler

A

rescales each feature so that its values range between 0 and 1

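A sketch contrasting the three scalers from the last three cards on toy data with an outlier:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])   # one feature containing an outlier

for scaler in (StandardScaler(), RobustScaler(), MinMaxScaler()):
    print(type(scaler).__name__, scaler.fit_transform(X).ravel().round(2))
# the outlier dominates StandardScaler and MinMaxScaler,
# while RobustScaler (median/IQR) is much less affected by it
```
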
38
Q

data snooping

A

when the test set is used for developing the model, so the test set leaks into the model

e.g. normalising with statistics computed on the whole dataset (including the test set) is inaccurate normalisation and leads to data snooping: it affects the transformation of the data and can lead to overfitting/underfitting

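A sketch of snooping-free normalisation (random data as a stand-in):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(80, 3)), rng.normal(size=(20, 3))

# correct: the scaler learns its mean/SD from the training set only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # same transformation, no leakage

# snooping: fitting on np.vstack([X_train, X_test]) would leak test statistics
```
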
39
Q

Normalizer

A

divides each sample (rather than each feature) by its Euclidean norm

40
Q

parameter selection

A

split the training set (used for model checking) further into a smaller training set and a validation set; select the parameters that perform best on the validation set, and only then evaluate on the test set

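One possible shape of this procedure, assuming Ridge's α is the parameter being selected and synthetic data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# hold out a test set, then split the rest into training and validation sets
X_trval, X_test, y_trval, y_test = train_test_split(X, y, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_trval, y_trval, random_state=0)

best_alpha, best_score = None, -np.inf
for alpha in (0.01, 0.1, 1.0, 10.0, 100.0):
    score = Ridge(alpha=alpha).fit(X_tr, y_tr).score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score

# refit on training + validation data with the chosen parameter,
# and only now touch the test set
final = Ridge(alpha=best_alpha).fit(X_trval, y_trval)
print(best_alpha, final.score(X_test, y_test))
```
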
41
Q

inductive conformity measure

A

a function A : Z* x Z -> R, where A(C~, z) says how well z conforms to the set C~

there is no analogue of the equivariance requirement

42
Q

kernel

A

a function that lets a linear method handle non-linear problems

take a feature mapping F : X -> H from the sample space X into a feature space H equipped with a dot product; the kernel corresponding to this feature mapping is K(x, x') = F(x) . F(x')

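A concrete instance, assuming the quadratic kernel K(x, x') = (x . x')² on R² and its explicit feature map:

```python
import numpy as np

# assumed feature mapping F for the quadratic kernel on R^2
def F(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def K(x, xp):
    return np.dot(x, xp) ** 2   # the kernel avoids computing F explicitly

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(F(x), F(xp)), K(x, xp))   # identical values: F(x).F(x') == K(x, x')
```
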
43
Q

kernel trick

A

- write the algorithm so that the samples x appear only inside dot products
- replace the dot products with kernels

44
Q

kernel features

A

- symmetric: K(x, x') = K(x', x)
- positive definite: ∑i ∑j ai aj K(xi, xj) ≥ 0 for any a1, ..., an and x1, ..., xn

45
Q

decay factor λ

A

used to weigh the gaps between the characters of a substring

each substring is a dimension of the feature space; the value of that coordinate depends on how frequently and how compactly the substring is embedded in the text

46
Q

c

A

the length of the subsequences taken into account

47
Q

activation function

A

e.g. np.tanh, a function nicely mapping the real line R to (-1, 1)

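For example:

```python
import numpy as np

z = np.array([-100.0, -2.0, 0.0, 2.0, 100.0])
print(np.tanh(z))   # every value is squashed into (-1, 1)

# e.g. the hidden layer of a small network (random weights, purely illustrative)
rng = np.random.default_rng(0)
X, W = rng.normal(size=(4, 3)), rng.normal(size=(3, 5))
hidden = np.tanh(X @ W)   # all activations lie in (-1, 1)
print(hidden.min(), hidden.max())
```
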
48
Q

separating hyperplane

A

in the p-dimensional space R^p, a flat affine subspace of dimension p-1 that separates the two classes

49
Q

linear scoring function

A

a function that models the hyperplane for samples in a p-dimensional space; it separates the two classes: samples with a score less than 0 are classified as negative, and samples with a score greater than 0 as positive

50
Q

margin

A

the shortest perpendicular distance from the training samples to the separating hyperplane

51
Q

maximum margin hyperplane / optimal separating hyperplane

A

the separating hyperplane that is farthest from the training samples, i.e. the one with the largest margin

52
Q

maximum margin classifier

A

classifies a test sample based on which side of the maximum margin hyperplane it lies on

53
Q

support vectors

A

the training samples (vectors in the p-dimensional space R^p) that lie closest to the maximum margin hyperplane

- if they are moved slightly, the maximum margin hyperplane moves as well
- the region between them, equidistant on either side of the hyperplane, is the slab; a larger slab means greater confidence
- in the soft margin case, the support vectors lie directly on the margin or on the wrong side of it

54
Q

soft margin classifier

A

a hyperplane that only almost separates the classes, using a soft margin

- a shorter margin, but greater robustness and better classification overall
- "soft" because the margin can be violated by some of the training observations
- the solution to an optimisation problem (slide 8:46)

55
Q

slack variables

A

allow individual training samples to be on the wrong side of the margin or of the hyperplane

56
Q

tuning parameter C

A

determines the number and severity of the violations of the margin and the hyperplane that we tolerate

- C = ∞: no violations tolerated; the slack variables must all be 0
- C = 0: maximising the margin takes priority; all violations are tolerated

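A sketch with scikit-learn's SVC, where C plays this penalty role (scikit-learn requires C > 0, so C = 0 is approximated by a small value):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 1e6):
    svc = SVC(kernel="linear", C=C).fit(X, y)
    # small C tolerates many violations -> many support vectors;
    # very large C approaches the hard-margin case -> fewer support vectors
    print(C, svc.n_support_)
```
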
57
Q

Pipeline

A

glues multiple processing steps into a single scikit-learn estimator

- fit: trains the model on the training data, e.g. by transforming the data and then fitting an SVM
- score: evaluates on the test data

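For example (the dataset is an arbitrary choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    *load_breast_cancer(return_X_y=True), random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])
pipe.fit(X_train, y_train)         # scales the training data, then fits the SVM
print(pipe.score(X_test, y_test))  # applies the SAME scaling to the test data, then evaluates
```
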
58
Q

cross-conformal predictor

A

≈ a full conformal predictor, but with p(y) = (sum of all ranks + 1) / (n + 1):

1) calculate the conformity scores and rank them; subtract 1 from each rank
2) repeat for all folds
3) add up all the ranks, add 1, and divide by n + 1 to get the p-value
4) repeat for all postulated labels

point prediction = the label with the highest p-value
confidence = 1 - the second-highest p-value
credibility = the highest p-value

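A rough sketch of the fold-by-fold computation, reusing the assumed distance-to-class-mean score from the earlier conformal prediction sketch; this is an illustration, not the course's exact recipe:

```python
import numpy as np

def icm(train_X, train_y, x, label):
    # assumed inductive conformity measure: minus the distance of x
    # to the mean of class `label` in the proper training part
    centre = train_X[train_y == label].mean(axis=0)
    return -np.linalg.norm(x - centre)

def cross_conformal_p(X, y, x_test, y_post, K=3):
    n = len(X)
    folds = np.array_split(np.arange(n), K)
    total = 0
    for fold in folds:
        rest = np.setdiff1d(np.arange(n), fold)
        # score the fold's samples and the test sample against the other folds
        fold_scores = np.array([icm(X[rest], y[rest], X[i], y[i]) for i in fold])
        test_score = icm(X[rest], y[rest], x_test, y_post)
        total += np.sum(fold_scores <= test_score)   # this fold's contribution (rank - 1)
    return (total + 1) / (n + 1)

X = np.array([[0.0], [0.2], [0.4], [1.0], [1.2], [1.4]])
y = np.array([0, 0, 0, 1, 1, 1])
print(cross_conformal_p(X, y, np.array([0.1]), 0),
      cross_conformal_p(X, y, np.array([0.1]), 1))
```
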
59
Q

set predictor

A

a predictor that outputs prediction sets rather than point predictions, and takes a significance level as a parameter

60
Q

calibration curve

A

a plot of the error rate against the significance level