Data Mining - Formulas Flashcards

1
Q

How do you compute accuracy in classification?

A

(True Positive + True Negative) / n, where n is the total number of classified records
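
A minimal sketch of the accuracy calculation, assuming a binary confusion matrix. The function name and the counts are made up for illustration.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all n records that were classified correctly."""
    n = tp + tn + fp + fn  # every record falls in exactly one cell
    return (tp + tn) / n


# Hypothetical counts: 85 of 100 records are classified correctly.
print(accuracy(tp=40, tn=45, fp=5, fn=10))
```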

2
Q

How do you compute total cost?

A

(TP * cost TP) + (FP * cost FP) + (TN * cost TN) + (FN * cost FN)
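
A small sketch of the total cost formula. The counts and the cost values below are hypothetical; in practice the costs come from the problem at hand.

```python
def total_cost(counts: dict, costs: dict) -> float:
    """Sum of (number of cases * cost per case) over the four outcomes."""
    return sum(counts[k] * costs[k] for k in ("TP", "FP", "TN", "FN"))


# Hypothetical example: correct predictions cost nothing,
# a false positive costs 1 and a false negative costs 5.
counts = {"TP": 40, "FP": 5, "TN": 45, "FN": 10}
costs = {"TP": 0, "FP": 1, "TN": 0, "FN": 5}
print(total_cost(counts, costs))  # 5 * 1 + 10 * 5 = 55
```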

3
Q

What is the Kappa statistic in multiclass prediction?

A

You can use it if you have the confusion matrix of the actual predictor and the confusion matrix of a random predictor.
-> It measures the improvement over the random predictor.

(success rate actual predictor - success rate random predictor) / (1 - success rate random predictor)
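
A sketch of the Kappa formula, assuming both success rates have already been read off the two confusion matrices. The function name and the rates are illustrative.

```python
def kappa(success_actual: float, success_random: float) -> float:
    """Improvement over the random predictor, relative to the maximum possible improvement."""
    return (success_actual - success_random) / (1 - success_random)


# Hypothetical rates: the actual predictor is right 80% of the time,
# the random predictor 50% of the time.
print(round(kappa(0.8, 0.5), 3))
```

A Kappa of 1 means perfect prediction, and a Kappa of 0 means the predictor does no better than the random one.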

4
Q

What is Recall?

A

The ability of the model to find all of the items of the class.

(True Positive) / (True Positive + False Negative)

5
Q

What is Precision?

A

The ability of the model to be correct when it predicts the class: the proportion of items predicted as the class that actually belong to it.

(True Positive) / (True Positive + False Positive)

6
Q

What is the F-Measure?

A

Takes into account both Recall and Precision (it is their harmonic mean).

F = 2 / ((1/R) + (1/P))
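
A sketch that ties Recall, Precision and the F-Measure together, with the F-Measure computed as the harmonic mean of the two. The function names and counts are made up.

```python
def recall(tp: int, fn: int) -> float:
    """Share of the actual class items that the model found."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Share of the predicted class items that are really in the class."""
    return tp / (tp + fp)


def f_measure(r: float, p: float) -> float:
    """Harmonic mean of recall and precision: 2 / ((1/R) + (1/P))."""
    return 2 / ((1 / r) + (1 / p))


# Hypothetical counts
r = recall(tp=30, fn=20)      # 30 / 50
p = precision(tp=30, fp=10)   # 30 / 40
print(r, p, round(f_measure(r, p), 4))
```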

7
Q

How do you calculate P (Play tennis = yes | outlook = sunny)?

A

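You count the sunny days on which you played tennis.

You divide that by the total number of sunny days.

So you compute a conditional probability: P(yes and sunny) / P(sunny). Dividing by the number of days on which you played tennis instead gives P(sunny | yes), the likelihood used in Naive Bayes.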
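
A counting sketch on a made-up list of days (not a real dataset). It computes both directions of the conditional so that the effect of the denominator is explicit: dividing by the sunny days gives P(yes | sunny), dividing by the days played gives P(sunny | yes).

```python
# Hypothetical records: (outlook, play tennis)
days = [
    ("sunny", "no"), ("sunny", "no"), ("sunny", "yes"), ("sunny", "yes"),
    ("overcast", "yes"), ("overcast", "yes"), ("rainy", "yes"), ("rainy", "no"),
]

sunny = sum(1 for outlook, _ in days if outlook == "sunny")
yes = sum(1 for _, play in days if play == "yes")
sunny_and_yes = sum(1 for outlook, play in days if outlook == "sunny" and play == "yes")

p_yes_given_sunny = sunny_and_yes / sunny  # condition on the sunny days
p_sunny_given_yes = sunny_and_yes / yes    # condition on the days played

print(p_yes_given_sunny, p_sunny_given_yes)
```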

8
Q

How do we use Laplace Smoothing?

A

For every probability, you add 1 to the numerator and the number of possible values (categories) of the attribute to the denominator, so that no probability is ever exactly zero.
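
A sketch of add-one (Laplace) smoothing. The parameter `k`, the number of categories added to the denominator, is a name chosen here for illustration, and the counts are made up.

```python
def laplace(count: int, total: int, k: int) -> float:
    """Smoothed probability: add 1 to the numerator and k to the denominator."""
    return (count + 1) / (total + k)


# Hypothetical case: a value was never seen (count 0) among 5 records,
# with k = 3 possible categories. The unsmoothed estimate would be 0.
print(laplace(count=0, total=5, k=3))
```

Applied to every category, the smoothed probabilities still add up to 1.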

9
Q

How do you calculate P(Play Tennis = yes)?

A

You count the outcomes that are yes and divide that by the total number of outcomes.

This extends to the other classes.

10
Q

How do you calculate the probability that a record belongs to a class given 3 variables?

A

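P(class) * P(x1 given class) * P(x2 given class) * P(x3 given class)

This score is proportional to P(class given x1, x2, x3). Divide it by the sum of the scores of all classes to get the actual probability.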
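
A sketch of the calculation, assuming the usual Naive Bayes form in which each per-variable factor is the likelihood P(x_i given class). All probabilities below are made-up numbers, and the function name is illustrative.

```python
from math import prod


def naive_bayes_score(prior: float, likelihoods: list[float]) -> float:
    """P(class) multiplied by P(x_i | class) for each variable."""
    return prior * prod(likelihoods)


# Hypothetical priors and likelihoods for 3 variables
scores = {
    "yes": naive_bayes_score(0.6, [0.5, 0.4, 0.8]),
    "no": naive_bayes_score(0.4, [0.25, 0.5, 0.5]),
}

# Normalize so that the class probabilities add up to 1.
total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}
print(max(posteriors, key=posteriors.get), posteriors)
```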

11
Q

What is the Euclidean distance and how does it work?

A

Most popular distance measure for numerical values.

dij = √((Xi1 - Xj1)^2 + (Xi2 - Xj2)^2 + … + (Xip - Xjp)^2)

12
Q

What is the Manhattan distance?

A

A more robust distance than the Euclidean distance (less sensitive to outliers). It looks at the absolute differences instead of the squared differences.

dij = |Xi1 - Xj1| + |Xi2 - Xj2| +…+ |Xip - Xjp|
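
A sketch comparing the Manhattan distance with the Euclidean distance on the same pair of records. The two records are made up.

```python
from math import sqrt


def euclidean(a: list[float], b: list[float]) -> float:
    """Square root of the sum of squared differences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: list[float], b: list[float]) -> float:
    """Sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical records with p = 3 numerical attributes
xi = [1, 2, 3]
xj = [4, 6, 3]
print(euclidean(xi, xj))  # sqrt(9 + 16 + 0) = 5.0
print(manhattan(xi, xj))  # 3 + 4 + 0 = 7
```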

13
Q

How do you normalize values?

A

(Value - average) / standard deviation
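
A sketch of this normalization (z-scores) with Python's statistics module. Using the population standard deviation (`pstdev`) rather than the sample one (`stdev`) is an assumption here; the card does not say which is meant. The values are made up.

```python
from statistics import mean, pstdev


def normalize(values: list[float]) -> list[float]:
    """(value - average) / standard deviation for every value."""
    avg = mean(values)
    sd = pstdev(values)  # assumption: population standard deviation
    return [(v - avg) / sd for v in values]


# Hypothetical values with average 5 and population standard deviation 2
print(normalize([2, 4, 4, 4, 5, 5, 7, 9]))
```

After normalization the values have average 0 and standard deviation 1, so attributes on different scales contribute comparably to a distance.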

14
Q

How do you calculate the OLS?

A

You calculate the error for every Yi that you have.

Yi is the actual observation for Xi.

The error for Yi is calculated by: Yi - Yhat

Yhat is the fitted value, i.e. the model’s estimate for Xi.

OLS (ordinary least squares) minimizes the sum of squared errors: SUM (Yi - Yhat)^2. You pick the model that has the smallest sum.

Remember to compute and square each error before you add them up.
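
A sketch of the sum of squared errors for two candidate models. The observations and both sets of predictions are made up.

```python
def sum_squared_errors(actual: list[float], predicted: list[float]) -> float:
    """Square each error (Yi - Yhat) first, then add the squares up."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))


# Hypothetical observations and the predictions of two models
actual = [3, 5, 7]
model_a = [2.5, 5, 7.5]
model_b = [3, 4, 9]

sse = {
    "A": sum_squared_errors(actual, model_a),  # 0.25 + 0 + 0.25
    "B": sum_squared_errors(actual, model_b),  # 0 + 1 + 4
}
print(min(sse, key=sse.get), sse)
```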

15
Q

How do you compute the GINI index?

A

A node can contain records from two (or more) classes.

For each class, you divide the #records in that class by the #records in that node. This gives you the proportion per class.

You square those proportions and subtract all of them from 1. That is your GINI.

Example: Class 1 has 2 and Class 2 has 4. Total in the node = 6.

1 - (2/6)^2 - (4/6)^2 = 4/9 ≈ 0.444 = GINI
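
A sketch of the GINI calculation that takes the class counts of one node as input and reproduces the example above. The function name is illustrative.

```python
def gini(class_counts: list[int]) -> float:
    """1 minus the sum of squared class proportions in a node."""
    total = sum(class_counts)
    return 1 - sum((count / total) ** 2 for count in class_counts)


# The example from the card: Class 1 has 2 records, Class 2 has 4.
print(round(gini([2, 4]), 4))  # 1 - (2/6)^2 - (4/6)^2 = 4/9
```

A pure node, in which all records belong to one class, has a GINI of 0.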

16
Q

What is the Entropy measure?

A

Similar to GINI, but a different computation.

Entropy = -(proportion class 1) * log2(proportion class 1) - (proportion class 2) * log2(proportion class 2) etc.
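
A sketch of the entropy calculation from the class counts of one node, assuming the proportions are the class proportions within that node. Classes with zero records are skipped, using the convention that 0 * log2(0) counts as 0.

```python
from math import log2


def entropy(class_counts: list[int]) -> float:
    """Minus the sum of p * log2(p) over the class proportions p in a node."""
    total = sum(class_counts)
    proportions = [count / total for count in class_counts if count > 0]
    return -sum(p * log2(p) for p in proportions)


# Hypothetical nodes: a 2 vs 4 mix and a perfectly balanced 3 vs 3 mix
print(round(entropy([2, 4]), 4))
print(entropy([3, 3]))
```

With two classes the entropy is highest (1) for a 50/50 mix and 0 for a pure node.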
17
Q

What is the combined impurity?

A

You calculate the GINI index for each of the nodes into which the node above is split.

You then take a weighted average to get the combined GINI:

((#records node 1 / (#records node 1 + #records node 2)) * GINI node 1) + ((#records node 2 / (#records node 1 + #records node 2)) * GINI node 2)
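
A sketch of the combined impurity as a weighted average of the GINI of each child node, with each node weighted by its share of the records. The class counts are made up, and the code handles any number of child nodes.

```python
def gini(class_counts: list[int]) -> float:
    """1 minus the sum of squared class proportions in a node."""
    total = sum(class_counts)
    return 1 - sum((count / total) ** 2 for count in class_counts)


def combined_gini(nodes: list[list[int]]) -> float:
    """Weighted average of the node GINIs, weighted by node size."""
    all_records = sum(sum(node) for node in nodes)
    return sum(sum(node) / all_records * gini(node) for node in nodes)


# Hypothetical split: node 1 holds 2 + 4 records, node 2 holds 3 + 1 records
print(round(combined_gini([[2, 4], [3, 1]]), 4))  # (6/10) * 4/9 + (4/10) * 3/8
```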

18
Q

How do you calculate support?

A

It is the number of times an itemset occurs in a dataset.

It can be divided by the total number of records to get a percentage.

19
Q

How do you calculate confidence?

A

For a rule A -> C: (frequency with which A and C occur together) / (frequency with which A occurs)

20
Q

How do you calculate lift?

A

Confidence / (frequency C / n)
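
A sketch of support, confidence and lift for a rule A -> C on a made-up list of transactions. The function names and items are illustrative.

```python
def support(itemset: set, transactions: list) -> int:
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(a: set, c: set, transactions: list) -> float:
    """Frequency of A and C together divided by the frequency of A."""
    return support(a | c, transactions) / support(a, transactions)


def lift(a: set, c: set, transactions: list) -> float:
    """Confidence divided by the share of transactions that contain C."""
    n = len(transactions)
    return confidence(a, c, transactions) / (support(c, transactions) / n)


# Hypothetical transactions
transactions = [
    {"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"},
    {"milk"}, {"bread", "milk"},
]
a, c = {"bread"}, {"milk"}
print(support(a | c, transactions), confidence(a, c, transactions), lift(a, c, transactions))
```

A lift above 1 suggests that A and C occur together more often than expected if they were independent, and a lift below 1 suggests the opposite.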

21
Q

How do you calculate distance for numerical values in clustering?

A
- Euclidean
- Manhattan

22
Q

How do you calculate distance for categorical values in clustering?

A

You count how many attributes differ between the two records.

This number is then divided by the number of attributes.

xA: (‘young’, ‘myope’, ‘no’, ‘reduced’, ‘none’)
xB: (‘young’, ‘hypermetrope’, ‘no’, ‘reduced’, ‘none’)

→ d(A,B) = 1 / 5
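
A sketch of this mismatch-based distance, using the two records from the example. The function name is illustrative.

```python
def categorical_distance(a: tuple, b: tuple) -> float:
    """Number of attributes that differ, divided by the number of attributes."""
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


x_a = ("young", "myope", "no", "reduced", "none")
x_b = ("young", "hypermetrope", "no", "reduced", "none")
print(categorical_distance(x_a, x_b))  # 1 of 5 attributes differs
```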