Unit 2: Classification with Decision Trees Flashcards

1
Q

Classification

Definition

A

Given a collection of records (dataset)
- Each record is characterised by a tuple (x, y), where x is the attribute set and y is the class label.

Task:
- Learn a model that maps each attribute set x into one of the predefined class labels y.

2
Q

Inference

A

The process of using a model to infer the class of a given record.

3
Q

Gini impurity

A

A metric describing how pure or mixed the data labels in a node are.

Pure nodes have a Gini impurity close to 0, whereas impure nodes have a Gini impurity closer to 0.5 (the maximum for a binary problem).

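As a concrete illustration (a minimal sketch; the function name `gini` is ours, not from the course), the impurity can be computed from the per-class record counts in a node:

```python
def gini(counts):
    """Gini impurity of a node, given a list of per-class record counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([10, 0]))  # 0.0  (pure node)
print(gini([5, 5]))   # 0.5  (maximally mixed binary node)
```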
4
Q

Information Gain of a Split

A

Refers to the amount of information gained by choosing one of the possible splits.

Information Gain
EQUALS
Impurity of the Parent
MINUS
Weighted Average Impurity of the Children

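This definition can be sketched in code (a hypothetical `info_gain` helper, using Gini impurity on per-class counts):

```python
def gini(counts):
    """Gini impurity from a list of per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def info_gain(parent, children):
    """Gain = impurity of the parent minus the weighted average impurity of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * gini(child) for child in children)
    return gini(parent) - weighted

# A 6/6 parent split into two pure children gains the full parent impurity:
print(info_gain([6, 6], [[6, 0], [0, 6]]))  # 0.5
```

A split that leaves each child as mixed as the parent has zero gain.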
5
Q

CART Algorithm

A

Grow(Data, Attributes):

If (number of rows |Data| < h):
* Create a leaf that has the dominant label of Data.
* Return the leaf with its label.

Else:
* AttributesGain = InfoGain(Attributes)
* attribute = Max(AttributesGain)
* bsNode = Create(attribute)
* For each value of the attribute:
* - Subset = Split(Data, attribute, value)
* - Child = Grow(Subset, Attributes)
* - bsNode = Add(Child to bsNode)
* Return bsNode

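A runnable sketch of this procedure in Python (the names `grow`, `info_gain`, and the toy dataset are illustrative, not from the course; stopping also on pure nodes, as in Hunt's algorithm):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def info_gain(rows, labels, attr):
    """Impurity of the parent minus weighted impurity of the children for one split."""
    n = len(labels)
    weighted = 0.0
    for value in set(r[attr] for r in rows):
        child = [y for r, y in zip(rows, labels) if r[attr] == value]
        weighted += len(child) / n * gini(child)
    return gini(labels) - weighted

def grow(rows, labels, attrs, h=2):
    """Grow(Data, Attributes): leaf if too few rows, pure, or no attributes left."""
    if len(rows) < h or len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]  # leaf: dominant label
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {"attr": best, "children": {}}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node["children"][value] = grow(
            [rows[i] for i in idx],
            [labels[i] for i in idx],
            [a for a in attrs if a != best],
            h,
        )
    return node

rows = [{"outlook": "sunny"}, {"outlook": "sunny"},
        {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
tree = grow(rows, labels, ["outlook"])
print(tree)  # splits on "outlook"; both children are pure leaves
```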
6
Q

Entropy Formula

A

-Σᵢ₌₁ⁿ pᵢ log₂(pᵢ)

where pᵢ is the proportion of records belonging to class i (log base 2 gives entropy in bits).

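A quick sketch of the formula in code (with log base 2, as is common for decision trees):

```python
from math import log2

def entropy(probs):
    """Entropy of a class distribution: -sum of p_i * log2(p_i)."""
    return sum(-p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0  (maximum for two classes)
print(entropy([1.0]))       # 0.0  (pure node)
```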
7
Q

Gini Formula

A

Σᵢ₌₁ⁿ pᵢ (1 - pᵢ) = 1 - Σᵢ₌₁ⁿ pᵢ²

8
Q

Classification error

A

1 - max(pᵢ)

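The three impurity measures (entropy, Gini, classification error) can be compared side by side; a small sketch:

```python
from math import log2

def entropy(probs):
    return sum(-p * log2(p) for p in probs if p > 0)

def gini(probs):
    return sum(p * (1 - p) for p in probs)

def class_error(probs):
    return 1 - max(probs)

# All three are 0 for a pure node and maximal at a 50/50 split:
print(class_error([1.0, 0.0]), class_error([0.5, 0.5]))  # 0.0 0.5
print(gini([0.75, 0.25]))                                # 0.375
print(class_error([0.75, 0.25]))                         # 0.25
```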
9
Q

Accuracy of a classifier

A

The proportion of correctly classified instances

(TP + TN) / N

10
Q

Error rate

A

Proportion of incorrectly classified instances.

(FP + FN) / N

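Accuracy and error rate are complements; a minimal sketch from confusion-matrix counts:

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of correctly classified instances."""
    return (tp + tn) / (tp + tn + fp + fn)

def error_rate(tp, tn, fp, fn):
    """Proportion of incorrectly classified instances."""
    return (fp + fn) / (tp + tn + fp + fn)

# 90 of 100 predictions correct:
print(accuracy(tp=50, tn=40, fp=6, fn=4))    # 0.9
print(error_rate(tp=50, tn=40, fp=6, fn=4))  # 0.1
```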
11
Q

General Structure of Hunt’s Algorithm

A

Let Dₜ be the set of training records that reach a node t

General Procedure:
- If Dₜ contains records that belong to the same class yₜ, then t is a leaf node labeled as yₜ
- If Dₜ contains records that belong to more than one class, use an attribute test to split the data into smaller subsets.

Recursively apply the procedure to each subset.

12
Q

Hunt’s Algorithm

2 Design Issues of Decision Tree Induction

A

How should training records be split?
- Method for specifying the test condition (which depends on attribute types).
- Measure for evaluating the goodness of a test condition.

How should the splitting procedure stop?
- Stop splitting if all the records belong to the same class or have identical attribute values.
- Early termination.

13
Q

4 Attribute types in a DT test condition

A
  • Binary
  • Nominal
  • Ordinal
  • Continuous
14
Q

2 ways of splitting based on continuous attributes

A
  1. Discretization to form an ordinal categorical attribute.
    Ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
  2. Binary Decision (A < v) or (A ≥ v)
    Consider all possible splits and find the best cut. This can be more compute-intensive.
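The exhaustive binary-decision search can be sketched as follows (the `best_cut` helper is illustrative): candidate thresholds are midpoints between consecutive sorted values, scored by the weighted impurity of the two children:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels)) if n else 0.0

def best_cut(values, labels):
    """Return the threshold v minimising the weighted impurity of (A < v) vs (A >= v)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_v, best_m = None, float("inf")
    for i in range(1, n):
        v = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint candidate
        left = [y for x, y in pairs if x < v]
        right = [y for x, y in pairs if x >= v]
        m = len(left) / n * gini(left) + len(right) / n * gini(right)
        if m < best_m:
            best_v, best_m = v, m
    return best_v, best_m

print(best_cut([1, 2, 8, 9], ["a", "a", "b", "b"]))  # (5.0, 0.0): a perfect split
```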
15
Q

Decision trees

Finding the best split

A
  1. Compute the impurity measure (P) before splitting
  2. Compute the impurity measure (M) after splitting. M is the weighted impurity of child nodes.
  3. Choose the attribute test condition that produces the highest gain.

Gain = P - M

16
Q

5 Advantages of Decision-Tree based classification

A
  • Relatively inexpensive to construct
  • Extremely fast at classifying unknown records
  • Easy to interpret for small-sized trees
  • Robust to noise (especially when methods to avoid overfitting are employed)
  • Can easily handle redundant or irrelevant attributes (unless the attributes are interacting)
17
Q

2 Disadvantages of Decision Tree-based classification

A
  • Due to the greedy nature of the splitting criterion, interacting attributes may be passed over in favour of other attributes that are less discriminating.
  • Each decision boundary involves only a single attribute.
18
Q

Type I error

A

False Positive

19
Q

Type II error

A

False Negative

20
Q

Holistic Metrics

A

Metrics that look at both the horizontal and vertical views of the confusion matrix and involve both the positive and negative classes.

E.g. F1 score & Matthews Correlation Coefficient

21
Q

Precision

A

Refers to how precise the classifier's predictions are with respect to the positive class: of the records predicted positive, the proportion that are actually positive, i.e. TP / (TP + FP).

A.k.a. positive predictive value

22
Q

Recall

true positive rate / hit rate

A

Measures how good the classifier is at detecting the actual positive cases: TP / (TP + FN).

23
Q

positive detection rate

p(detect₊)

A

TP / (TP + FN)

How many of the positives were caught by the model?

24
Q

NOT positive detection rate

¬ p(detect₊)

A

FN / (TP + FN)

How many of the positives were missed by the model?

25
Q

negative detection rate

p(detect₋)

A

TN / (TN + FP)

How many of the negatives were caught by the model?

26
Q

NOT negative detection rate

¬p(detect₋)

A

FP / (TN + FP)

How many of the negatives were missed by the model?

27
Q

positive prediction rate

p(predict₊)

A

TP / (TP + FP)

How many of those predicted to be positive, were actually positive?

28
Q

NOT positive prediction rate

¬p(predict₊)

A

FP / (TP + FP)

How many of those predicted to be positive, were actually negative?

29
Q

negative prediction rate

p(predict₋)

A

TN / (TN + FN)

How many of those predicted to be negative, were actually negative?

30
Q

NOT negative prediction rate

¬p(predict₋)

A

FN / (TN + FN)

How many of those predicted to be negative, were actually positive?

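The eight rates in the cards above all come from the same four confusion-matrix counts; a compact sketch (the dictionary keys are ours):

```python
def rates(tp, fn, tn, fp):
    """All detection/prediction rates from a binary confusion matrix."""
    return {
        "p(detect+)":   tp / (tp + fn),  # recall / true positive rate
        "!p(detect+)":  fn / (tp + fn),  # positives missed
        "p(detect-)":   tn / (tn + fp),  # true negative rate
        "!p(detect-)":  fp / (tn + fp),  # negatives missed
        "p(predict+)":  tp / (tp + fp),  # precision / positive predictive value
        "!p(predict+)": fp / (tp + fp),
        "p(predict-)":  tn / (tn + fn),
        "!p(predict-)": fn / (tn + fn),
    }

r = rates(tp=40, fn=10, tn=45, fp=5)
print(r["p(detect+)"])   # 0.8 -- 40 of the 50 actual positives were caught
print(r["p(predict+)"])  # 40/45 -- of 45 predicted positives, 40 were real
```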
31
Q

Accuracy for a multi-class problem

A

(∑ᵢ TCᵢ) / N

TCᵢ is the count of the correctly classified instances of class Cᵢ

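In code (a small sketch), this is just the count of correct instances across all classes divided by N:

```python
def multiclass_accuracy(y_true, y_pred):
    """Sum of correctly classified instances across all classes, over N."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

print(multiclass_accuracy(["a", "b", "c", "a"], ["a", "b", "a", "a"]))  # 0.75
```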
32
Q

Holistic metric for multiclass p(detect) or p(predict)

recall score / balanced accuracy score / precision score

A
  1. Take the arithmetic mean of the per-class rates.
  2. Take a weighted average of the per-class rates. This allows assigning different weights to the different classes.
33
Q

Holistic metric for multiclass: F1 Score

A

To generalise the F1 score to multi-class problems, we can:
1. Use the overall recall and precision and then apply the harmonic mean, as in the binary-class problem; or
2. Treat the classes as one vs. the rest and obtain the F1 score for each class separately, then take the average of these per-class F1 scores.

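Option 2 (one-vs-rest, then average) is often called the macro F1; a sketch under that interpretation (function name and data are ours):

```python
def macro_f1(y_true, y_pred, classes):
    """One-vs-rest F1 per class, then the unweighted (macro) average."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "a", "b", "c", "c", "c"]
print(round(macro_f1(y_true, y_pred, ["a", "b", "c"]), 4))  # 0.8222
```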
34
Q

Model generalisation

A

Using the training set to come up with a general relationship between the features and the class that can be utilised to later classify unseen examples.