Machine Learning Flashcards

1
Q

Task

A

What we want to obtain given a set of data.

2
Q

Features

A

The properties of the data that are used as input by the model

3
Q

Model

A

Gets the data as input and returns an output.

4
Q

Learning algorithm

A

The algorithm that generates the model, given a specific set of data.

5
Q

Classification

A

Task: assign a label Ci, from a set of labels C, to a given input.

Learning task: find a function c*, called a classifier, that best approximates the true classification function c.

6
Q

Evaluation

A

We need to evaluate a model to know how well it works for the task. There are many measures that can be used to evaluate a model:

  • accuracy
  • recall

7
Q

Decision tree

A

A model for CLASSIFICATION TASKS that uses a tree in which the internal nodes are feature-wise questions to answer, the answers label the arcs to follow, and the leaves are the class labels.
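
A minimal sketch of how such a tree could be represented and traversed; the feature names and tree shape below are made up for illustration:

```python
# Hypothetical decision tree: internal nodes hold a feature-wise question,
# arcs are labeled by the possible answers, leaves hold the class labels.
tree = {
    "question": "outlook",                  # feature to test (made-up)
    "branches": {
        "sunny": {"label": "play"},         # leaf: class label
        "rainy": {
            "question": "windy",
            "branches": {
                True:  {"label": "stay home"},
                False: {"label": "play"},
            },
        },
    },
}

def classify(node, example):
    """Follow the arcs matching the example's feature values down to a leaf."""
    while "label" not in node:
        node = node["branches"][example[node["question"]]]
    return node["label"]

print(classify(tree, {"outlook": "rainy", "windy": False}))  # -> play
```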

8
Q

Accuracy

A

It’s an evaluation measure that calculates:

Number of correctly labeled examples
/
Number of labeled examples

(TRUE POSITIVES + TRUE NEGATIVES) / (ALL POSITIVES + ALL NEGATIVES)

9
Q

Recall

A

In a BINARY CLASSIFICATION we define:

  • POSITIVE RECALL = TRUE POSITIVES / ALL POSITIVES
  • NEGATIVE RECALL (SPECIFICITY) = TRUE NEGATIVES / ALL NEGATIVES
  • AVERAGE RECALL = (POSITIVE RECALL + NEGATIVE RECALL) / 2

10
Q

Contingency table

A

It’s a tool used in BINARY CLASSIFICATION: the columns hold the counts for the predicted classes, and the rows the counts for the actual classes.

             |      Predicted Class      |
             |  Positive   |  Negative   |
Actual Class |-------------|-------------|
  Positive   |     TP      |     FN      |
  Negative   |     FP      |     TN      |
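
A minimal sketch computing accuracy and the recalls from the four cells of such a table (the counts are made up):

```python
# Cells of a hypothetical contingency table
TP, FN = 60, 10   # actual positives:  Pos = TP + FN = 70
FP, TN = 5, 25    # actual negatives:  Neg = FP + TN = 30

pos, neg = TP + FN, FP + TN

accuracy = (TP + TN) / (pos + neg)            # 0.85
positive_recall = TP / pos                    # 0.857...
negative_recall = TN / neg                    # specificity, 0.833...
average_recall = (positive_recall + negative_recall) / 2

print(accuracy, positive_recall, negative_recall, average_recall)
```
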
11
Q

Coverage plot

A

It’s a graph of width N and height M.
N: the number of NEGATIVES; the x-axis counts FALSE POSITIVES
M: the number of POSITIVES; the y-axis counts TRUE POSITIVES

It can be used to COMPARE MODEL PERFORMANCE by finding the representing coordinates for each model in the graph, using the model’s numbers of TRUE and FALSE POSITIVES.

Classifiers with the same accuracy are connected by lines of slope 1 (DEMONSTRATION)

Classifiers with the same average recall are connected by lines parallel to the main diagonal (slope Pos/Neg) (DEMONSTRATION)

NEEDS NORMALIZATION to compare models evaluated on different datasets!

12
Q

ROC plot

A

It’s the coverage plot, normalized, so the graph has width 1 and height 1.
It lets us compare models whose performances are calculated on different datasets and class counts.

Classifiers with the same accuracy are connected by lines of slope Neg/Pos (DEMONSTRATION)

Classifiers with the same average recall are connected by lines of slope 1 (DEMONSTRATION)
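
A minimal sketch of the normalization, mapping a coverage-plot point (FP, TP) into the ROC unit square (the counts are made up):

```python
def to_roc(tp, fp, pos, neg):
    """Normalize coverage-plot coordinates into (false positive rate, true positive rate)."""
    return fp / neg, tp / pos

# Two hypothetical models evaluated on datasets of different sizes
print(to_roc(tp=60, fp=5, pos=70, neg=30))      # (0.1666..., 0.857...)
print(to_roc(tp=600, fp=50, pos=700, neg=300))  # same ROC point: now comparable
```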

13
Q

Scoring classification

A

This task is similar to classification, but the model is a SCORING CLASSIFIER s* that takes an input and returns an N-dimensional vector of scores, where N is the number of classes.

So it does not tell the class associated with the input, but the SCORE of each class for the input.

The TRAINING SET is the same as the one for CLASSIFICATION.

14
Q

Margin

A

For a SCORING BINARY CLASSIFIER, the margin is a function that takes an input and returns a positive value if the input is classified correctly, negative otherwise.
It can be written as:

margin(x) = c(x) * s(x) =
1. +|s(x)| if the classification is correct
2. -|s(x)| if the classification is not correct

where
margin(x): the margin function
c(x): the true class of x (+1 for the positive, -1 for the negative class)
s(x): the score given to x by the classifier

15
Q

Loss function

A

For the purpose of rewarding large positive margins and penalizing large negative margins, we define a LOSS FUNCTION.

A loss function is a function L:

L: R -> [0, +inf)

that maps each example’s margin z(x) to an associated LOSS L(x).

We bound, or assume, L(x) to be:

L(x) > 1 for z(x) < 0
L(x) = 1 for z(x) = 0
0 < L(x) < 1 for z(x) > 0

So the value of L is less than 1 for each correctly classified example, and at least 1 otherwise.

16
Q

0-1 Loss

A

A loss function:

L(x) = 1 for z(x) <= 0
L(x) = 0 for z(x) > 0

It has some problems:
- it is discontinuous, so it is not differentiable everywhere
- its derivatives are uninformative (it is composed of 2 flat segments of slope 0)
- once the examples in the training set are correctly classified, we cannot use the function to improve the model any further

17
Q

Hinge Loss

A

A loss function:

L(x) = 1 - z(x) for z(x) < 1
L(x) = 0 for z(x) >= 1

It solves the discontinuity problem of the 0-1 loss, but its derivative is still discontinuous at z(x) = 1.

18
Q

Logistic Loss

A

A continuous loss function:

L(x) = log2( 1 + exp( -z(x) ) )

It has a slope similar to the hinge loss, but is smooth everywhere.

19
Q

Exponential Loss

A

A continuous loss function

L(x) = exp( -z(x) )

20
Q

Quadratic Loss

A

A continuous loss function:

L(x) = (1 - z(x))^2

Among these simple loss functions it is the only one that goes to +inf as z(x) goes to +inf; this can be a problem in optimization, since large positive margins get penalized.
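
A minimal sketch of the five loss functions above, written as functions of the margin z(x) (here simply z):

```python
import math

def zero_one_loss(z):
    return 1.0 if z <= 0 else 0.0

def hinge_loss(z):
    return max(0.0, 1.0 - z)

def logistic_loss(z):
    return math.log2(1.0 + math.exp(-z))

def exponential_loss(z):
    return math.exp(-z)

def quadratic_loss(z):
    return (1.0 - z) ** 2

# All of them are 1 at z = 0 and grow for negative margins; only the
# quadratic loss also grows again for large positive margins.
for z in (-2, 0, 1, 3):
    print(z, hinge_loss(z), logistic_loss(z), exponential_loss(z), quadratic_loss(z))
```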

21
Q

Ranking

A

When using a scoring classifier, we can use the scores associated with the examples to sort them. This defines a RANKING function on the scores.

22
Q

Ranking error

A

When using a RANKING, we will sometimes put the (hopefully few) FALSE POSITIVES and FALSE NEGATIVES that were scored highly ahead of TRUE POSITIVES and TRUE NEGATIVES that were scored lower. This is a RANKING ERROR.

The RANKING ERROR RATE can be calculated by taking each possible couple of 1 POS and 1 NEG example, and counting how many of these couples are ordered wrongly in the ranking, i.e. the NEG is ranked before the POS:

RANKING ERROR RATE =
[ Numberof(s(x) < s(x*)) + 0.5 * Numberof(s(x) = s(x*)) ]
/
( Numberof(POS) * Numberof(NEG) )

where s(x) is the score of a POSITIVE and s(x*) is the score of a NEGATIVE.

Example:
++-++---

We have 1 - in front of 2 +, so there are 1*2 = 2 wrongly ordered couples. We assume each example has a different score, so there are no 0.5 terms. With a total of 4 POS and 4 NEG:

RANKING ERROR RATE = 2 / (4*4) = 2/16 = 1/8
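
A minimal sketch counting the wrongly ordered positive-negative couples; the made-up scores reproduce the ranking + + - + + - - - and the 1/8 of the example above:

```python
def ranking_error_rate(pos_scores, neg_scores):
    """Fraction of (POS, NEG) couples ranked wrongly; ties count as 0.5."""
    errors = sum(
        1.0 if p < n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return errors / (len(pos_scores) * len(neg_scores))

pos = [8, 7, 5, 4]  # hypothetical scores of the 4 positive examples
neg = [6, 3, 2, 1]  # hypothetical scores of the 4 negative examples
print(ranking_error_rate(pos, neg))  # 2 / 16 = 0.125 = 1/8
```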

23
Q

Binary classifiers and Ranking functions

A

If N examples are ordered by their SCORE, we can define N+1 BINARY CLASSIFIERS, one for each point where the ordered list can be split into 2 sets. We can determine how good those classifiers are by calculating the RANKING ERROR RATE on the ordered list.

24
Q

Class probability estimation

A

This is a scoring classifier with the peculiarity that the score is the PROBABILITY ESTIMATE that the example belongs to each class.

p*: X -> [0,1]^k

where
p* is the probability estimation function
X is the example space
k is the number of classes

So p*(x) is a vector of probability estimates that the given example x belongs to each of the k classes.

For a binary classification we omit the probability of the NEGATIVE class, since it can be derived from the POS probability.

25
Q

Mean squared probability error

A

This is one way to estimate the error in the probabilities assigned to an example x, knowing the true class c(x). It is calculated as follows:

MSE(x) = 0.5 * || p*(x) - Ic(x) ||^2 =
       = 0.5 * SUMi ( p*i(x) - Ici(x) )^2

where:
- p*(x) is the probability vector over the classes for the example x
- Ic(x) is the vector that has 1 in the position of the true class, and 0 otherwise
- the ||A||^2 operation is the squared Euclidean length of vector A, calculated as SUMi( Ai * Ai )
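
A minimal sketch of the computation for a single example (the probability vector is made up):

```python
def squared_probability_error(p, true_class):
    """0.5 times the squared Euclidean distance between p*(x) and I_c(x)."""
    indicator = [1.0 if i == true_class else 0.0 for i in range(len(p))]
    return 0.5 * sum((pi - ti) ** 2 for pi, ti in zip(p, indicator))

# Hypothetical probability vector over 3 classes; the true class is index 0
print(squared_probability_error([0.7, 0.2, 0.1], true_class=0))
# 0.5 * ((0.7-1)^2 + 0.2^2 + 0.1^2) = 0.07
```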

26
Q

Empirical probabilities

A

We can use empirical probabilities to define a reference for a class probability estimation. We define the EMPIRICAL PROBABILITY VECTOR as follows:

p^(S) = ( n1/|S|, n2/|S|, ..., nk/|S| )

where
S is the set of labeled examples
ni is the number of examples labeled with class Ci

27
Q

Laplace correction

A

It’s the most common way to smooth the relative frequencies of classes in the calculation of EMPIRICAL PROBABILITIES.
It can be applied uniformly, by adding 1 to each of the k class counts, so the empirical probability becomes:

p^i(S) = (ni + 1) / (|S| + k)

It can also be done with specific weights for each class:

p^i(S) = (ni + wi*m) / (|S| + m)

where
wi is the a priori probability of class i
m is a chosen number of pseudo-counts (the uniform case is wi = 1/k and m = k)
k is the number of classes
sum(wi) = 1
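
A minimal sketch of the empirical probabilities with both corrections (the class counts are made up):

```python
def empirical_probs(counts):
    """Raw relative frequencies n_i / |S|."""
    total = sum(counts)
    return [n / total for n in counts]

def laplace_uniform(counts):
    """Add 1 pseudo-count per class: (n_i + 1) / (|S| + k)."""
    total, k = sum(counts), len(counts)
    return [(n + 1) / (total + k) for n in counts]

def laplace_weighted(counts, weights, m):
    """(n_i + w_i * m) / (|S| + m), with sum(weights) == 1."""
    total = sum(counts)
    return [(n + w * m) / (total + m) for n, w in zip(counts, weights)]

counts = [8, 2, 0]                     # note the zero count for the third class
print(empirical_probs(counts))         # [0.8, 0.2, 0.0]
print(laplace_uniform(counts))         # [9/13, 3/13, 1/13] -- no zeros left
print(laplace_weighted(counts, [0.5, 0.3, 0.2], m=5))
```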

28
Q

Binary classifier Multiclass handling

A

A binary classifier can be used to generate a multiclass labeling. We have various approaches:

  • one vs rest
    • unordered learning
    • fixed-order learning
  • one vs one
    • symmetric
    • asymmetric
29
Q

1 vs rest - unordered

A

The classifier is made of n classifiers, where n is the number of classes. Each one is a binary classifier that returns +1 if it predicts its own class, -1 if it thinks the example belongs to any other class.

We can define an OUTPUT CODE MATRIX for this classifier (with n=3):

+1 -1 -1
-1 +1 -1
-1 -1 +1

where the columns are the results of the classifiers for C1, C2, C3
and the rows are the classes C1, C2, C3

30
Q

1 vs rest - fixed order

A

The classifier is made of n-1 classifiers, where n is the number of classes. Each classifier is a binary classifier, but we use the classifiers in a fixed order, ASSUMING that an example that reaches the k-th classifier cannot belong to any of the k-1 classes checked by the previous classifiers.

So for example, with 3 classes the checks are:

  1. C1 vs C2,C3 -> it is C1, or go to step 2.
  2. C2 vs C3 -> it is C2 or C3

For this example the OUTPUT CODE MATRIX will be:

+1  0
-1 +1
-1 -1

NB:
We see that there is a THIRD VALUE, the 0. It is not really a value returned by the binary classifier: it marks a class that the corresponding classifier never predicts.

31
Q

1 vs 1 - symmetric

A

The classifier is made of one binary classifier for each possible couple of classes, i.e. n(n-1)/2 classifiers, where n is the number of classes.
Each classifier determines which of its two classes the example belongs to.

32
Q

1 vs 1 - asymmetric

A

The classifier is made of two binary classifiers for each possible couple of classes (one for each ordering), i.e. n(n-1) classifiers.
Each classifier determines which of its two classes the example belongs to.

33
Q

Binary classifier Multiclass Decoding

A

When the classifier is trained and we have the OUTPUT CODE MATRIX available, we can use it to define the classification for a new example given to the model.

Given n classes, we receive a WORD w, a vector whose dimension equals the number of classifiers. We choose the class for the given example as the one that MINIMIZES the distance between w and the corresponding row of the OUTPUT CODE MATRIX:

D[Cx] = SUMi ( 1 - ci*wi ) / 2

where ci are the values of the selected row, and wi are the values of the vector w.

For example:
+1 -1 -1
-1 +1 -1
-1 -1 +1

w = (-1, -1, +1)
D[C1] = (1-(-1))/2 + (1-(1))/2 + (1-(-1))/2 = 1 + 0 + 1 = 2
D[C2] = (1-(1))/2 + (1-(-1))/2 + (1-(-1))/2 = 0 + 1 + 1 = 2
D[C3] = (1-(1))/2 + (1-(1))/2 + (1-(1))/2 = 0

So the row with the smallest distance is row 3, so the example is labeled C3.
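
A minimal sketch of the decoding step, using the one-vs-rest matrix of the example above:

```python
def decode(word, code_matrix):
    """Return the index of the row that minimizes the distance to the word."""
    def distance(row):
        # SUMi (1 - ci*wi) / 2 counts the positions where row and word disagree
        return sum((1 - c * w) / 2 for c, w in zip(row, word))
    distances = [distance(row) for row in code_matrix]
    return distances.index(min(distances))

M = [[+1, -1, -1],   # rows are the classes C1, C2, C3
     [-1, +1, -1],
     [-1, -1, +1]]

print(decode([-1, -1, +1], M))  # 2 -> third row, so the example is labeled C3
```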

34
Q

Binary classifier Multiclass Decoding With Same Distance

A

If, in the decoding step, while classifying new data I get multiple rows with the same minimum distance, I can:

  1. decide randomly between the rows with the same minimum distance
  2. reduce the chance of ties by adding new classifiers, merging several types of multiclass binary classifiers (e.g. adding asymmetric columns)
  3. instead of classifiers, which by definition return +1 for the right class and -1 (or 0) otherwise, use a scoring classifier or a probabilistic classifier; this makes equal distances much less likely.

35
Q

Regression

A

This is a task in which the input is an item from the example set, and the output is a number in R.

We do not want to find a class; we want to find a point on a function.

So, given the training set (xi, f(xi)), we want to define a function f* such that f*(xi) is as similar as possible to f(xi).

ATTENTION
The easiest way to solve the task during training is to define f* as a polynomial of degree n-1, where n is the number of examples in the training set. This way f* would match the training data perfectly, but this would be OVERFITTING, since f* would be excessively tailored to the example data.
So the rule of thumb is to have fewer parameters than the number of examples (this does not apply to neural networks).
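
A minimal sketch of the overfitting issue using polynomial fits (the data, noise level, and degrees are made up); it assumes numpy is available:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)                                # n = 10 training examples
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)   # noisy samples of f

for degree in (1, 3, 9):   # degree 9 = n - 1: interpolates the points exactly
    coeffs = np.polyfit(x, y, degree)
    train_error = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(degree, train_error)
# The training error shrinks to ~0 at degree n - 1, but such an f*
# follows the noise: it is overfitted and generalizes poorly.
```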

36
Q

Bias-Variance Dilemma

A

If we underestimate the number of parameters, we won’t be able to decrease the LOSS FUNCTION no matter how many examples we use in training (high bias).

If we overestimate the number of parameters, the model will be more dependent on the training set, and less flexible on new cases (high variance).
