Lecture 12 - Decision Tree Induction Part 2 Flashcards

1
Q

What is the pseudocode for decision tree induction?

A

FUNCTION buildDecTree(examples, atts)
    Create node N if necessary; // starting as a node, ending as a tree
    IF examples are all in same class THEN RETURN N labelled with that class;
    IF atts is empty THEN RETURN N labelled with modal example class;
    bestAtt = chooseBestAtt(examples, atts);
    label N with bestAtt;
    FOR each value a_i of bestAtt // each branch from node N
        s_i = subset of examples with bestAtt = a_i;
        IF s_i is not empty THEN
            newAtts = atts - bestAtt;
            subtree = buildDecTree(s_i, newAtts); // recursive
            attach subtree as child of N;
        ELSE
            create leaf node L;
            label L with modal example class;
            attach L as child of N;
    RETURN N;
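For concreteness, here is a minimal Python sketch of the same procedure (my own illustration, not from the lecture). It assumes examples are dicts mapping attribute names to values, with the class label under a hypothetical 'class' key, and returns the tree as nested dicts. Unlike the pseudocode, it branches only on values actually observed in the examples, so subsets are never empty:

from collections import Counter

def modal_class(examples):
    # Most frequent (modal) class label among the examples
    return Counter(ex['class'] for ex in examples).most_common(1)[0][0]

def build_dec_tree(examples, atts, choose_best_att):
    classes = {ex['class'] for ex in examples}
    if len(classes) == 1:             # all examples in the same class
        return next(iter(classes))    # leaf labelled with that class
    if not atts:                      # no attributes left to split on
        return modal_class(examples)  # leaf labelled with modal class
    best = choose_best_att(examples, atts)
    node = {'att': best, 'children': {}}
    for v in {ex[best] for ex in examples}:  # one branch per observed value
        subset = [ex for ex in examples if ex[best] == v]
        new_atts = [a for a in atts if a != best]
        node['children'][v] = build_dec_tree(subset, new_atts, choose_best_att)
    return node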

2
Q

How is the best attribute usually chosen for decision tree induction?

A

Information Gain
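A minimal sketch of computing information gain in Python, assuming the same example format as the sketch on card 1 (entropy is H(S) = -sum of p_c * log2(p_c) over the class proportions p_c; the gain of an attribute is the entropy of the set minus the weighted entropy of its subsets):

import math
from collections import Counter

def entropy(examples):
    # H(S) = -sum over classes c of p_c * log2(p_c)
    counts = Counter(ex['class'] for ex in examples)
    total = len(examples)
    return -sum(n / total * math.log2(n / total) for n in counts.values())

def info_gain(examples, att):
    # Gain(S, A) = H(S) - sum over values v of |S_v|/|S| * H(S_v)
    total = len(examples)
    remainder = 0.0
    for v in {ex[att] for ex in examples}:
        subset = [ex for ex in examples if ex[att] == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(examples) - remainder

def choose_best_att(examples, atts):
    # Best attribute = highest information gain against the target
    return max(atts, key=lambda a: info_gain(examples, a))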

3
Q

Explain the pseudocode for decision tree induction

A

Take the training set.

Work out the information gain of each attribute against the target attribute.

The attribute with the highest information gain is the best attribute to split on.

Examine the subsets produced by splitting on that attribute.

If all the values of the target attribute are the same within a subset, that subset is replaced by a leaf with that value.

Otherwise, run the whole procedure again on the subset, unless there are no more attributes to split on, in which case label the leaf with the most frequently occurring value of the target attribute.
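Putting the two sketches above together on a toy, made-up dataset (attribute names and data are illustrative only):

examples = [
    {'outlook': 'sunny', 'windy': 'no',  'class': 'play'},
    {'outlook': 'sunny', 'windy': 'yes', 'class': 'stay'},
    {'outlook': 'rainy', 'windy': 'no',  'class': 'stay'},
    {'outlook': 'rainy', 'windy': 'yes', 'class': 'stay'},
]
tree = build_dec_tree(examples, ['outlook', 'windy'], choose_best_att)
print(tree)
# e.g. {'att': 'outlook', 'children': {'rainy': 'stay',
#       'sunny': {'att': 'windy', 'children': {'no': 'play', 'yes': 'stay'}}}}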

4
Q

What are 3 issues with decision tree induction?

A

1) Inconsistent data
2) Numeric Attributes
3) Overfitting

5
Q

Why is inconsistent data a problem in decision tree induction?

How can it be solved?

A

Inconsistent examples (identical attribute values but different classes) mean a subset may still contain more than one class when there are no more attributes left to split on, so it can never be resolved into a single-class leaf.

The easiest solution is to label the leaf with the modal class value.

6
Q

Why are numeric attribute values a problem in decision tree induction?

How can the issue be alleviated?

A

With numeric attributes there is a very large number of possible values, and branching on each one produces massive trees.

An easy solution is to divide the values into ranges, e.g. 1-5, 6-10, 11-15, reducing the number of possible branches.
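A minimal sketch of that discretisation (fixed-width bins; the width of 5 is chosen only to match the example ranges above):

def to_range(value, width=5):
    # Map a numeric value to a fixed-width range label, e.g. 7 -> '6-10'
    low = (value - 1) // width * width + 1
    return f'{low}-{low + width - 1}'

print(to_range(3), to_range(7), to_range(11))  # 1-5 6-10 11-15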

7
Q

What is overfitting?

A

When a model starts to fit the random noise in the data as well as the real underlying pattern.

8
Q

What is overfitting specifically in relation to decision induction, and why does it occur?

How can it be alleviated?

A

When the decision tree models the training sample rather than the whole population, and so captures peculiarities of the training set which might not be true of the whole population.

It can be alleviated by:

Pre-pruning (stop growing the tree early)

Post-pruning (grow the full tree, then remove unreliable branches)

Increasing the number of high-quality samples
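As an illustration of pre-pruning, the recursive sketch from card 1 can simply stop early; the depth and subset-size thresholds here are hypothetical choices, not values from the lecture (modal_class is the helper defined on card 1):

def build_pruned(examples, atts, choose_best_att,
                 depth=0, max_depth=3, min_examples=5):
    classes = {ex['class'] for ex in examples}
    if len(classes) == 1:
        return next(iter(classes))
    # Pre-pruning: return a modal-class leaf once the tree is deep
    # enough or the subset is too small to split reliably
    if not atts or depth >= max_depth or len(examples) < min_examples:
        return modal_class(examples)
    best = choose_best_att(examples, atts)
    node = {'att': best, 'children': {}}
    for v in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == v]
        node['children'][v] = build_pruned(
            subset, [a for a in atts if a != best],
            choose_best_att, depth + 1, max_depth, min_examples)
    return node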
