Decision Trees and ID3 Algorithm Flashcards
Decision Tree Learning
A practical inductive inference (supervised learning) method.
Finds a Boolean function of the attributes.
Decision trees can be extended to functions with more than two output values
Widely used. Robust to noise
Can handle OR and AND expressions
Completely expressive hypothesis space
Easily interpretable.
Object Attribute Value (OAV) Data
Objects have attributes, and attributes have values. For example: a car has a door, and the door can be red, etc.
Usually stored as labelled CSV, JSON, or XML.
One of the attributes, usually the first, is the object ID.
Attributes can be numeric, categorical, or Boolean.
Many Kaggle datasets are OAV data in CSV or JSON format.
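To make this concrete, here is a minimal sketch of a hypothetical labelled OAV dataset for cars, represented as a Python list of dicts; the attribute names and values are invented for illustration.

```python
# Hypothetical labelled OAV data: each dict is one object (a car),
# each key is an attribute, and each value is that attribute's value.
# "ObjectID" is the ID attribute; "Desirable" is the class label.
cars = [
    {"ObjectID": 1, "Colour": "red",   "Doors": 2, "Desirable": "Yes"},
    {"ObjectID": 2, "Colour": "green", "Doors": 4, "Desirable": "No"},
    {"ObjectID": 3, "Colour": "red",   "Doors": 4, "Desirable": "Yes"},
]
```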
Concept
A concept is a subset of all possible OAV tuples of an object. The number of possible tuples is the number of all attribute-value combinations. A class label is used to define the concept, e.g. which cars are desirable and which are not.
One-Hot Encoding
Scheme where categorical variables are encoded as sparse Boolean vectors. This helps ML algorithms because the encodings are high-dimensional vectors rather than scalars or symbols.
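A minimal sketch of one-hot encoding in plain Python, with a hypothetical Colour attribute; library encoders (e.g. scikit-learn's OneHotEncoder) do the same thing at scale.

```python
# One-hot encode a categorical value into a sparse Boolean vector.
# The attribute ("Colour") and its categories are invented for illustration.
def one_hot(value, categories):
    return [1 if value == c else 0 for c in categories]

colours = ["red", "green", "blue"]
print(one_hot("red", colours))    # [1, 0, 0]
print(one_hot("blue", colours))   # [0, 0, 1]
```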
Classification
The process of adding a new attribute, the class label, to OAV data. We start with a labelled OAV dataset, which has a class label column. Find a model for the class attribute as a function of the values of the other attributes. Using the learned model, previously unseen records should be assigned to classes as accurately as possible.
What is a Decision Tree
A predictive model in the form of an upside-down tree. Each node (except the leaves) represents an attribute of a given object. Each edge represents a value of the parent attribute. Each leaf is a class label. Boolean attribute values are not usually shown: left means yes and right means no.
The tree itself forms a hypothesis:
Disjunctions (ORs) of conjunctions (ANDs)
Each path from the root to a leaf forms a conjunction of constraints on attributes (aka a rule)
Separate branches are disjunctions
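For illustration, the PlayTennis tree from Mitchell's book corresponds to the hypothesis:
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
Each parenthesised conjunction is one root-to-leaf path (a rule), and the paths are combined by disjunction.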
Naive Decision Tree Construction
Given a labelled dataset, we must construct a decision tree that segments the dataset into subsets which are either all labelled Yes or all labelled No. Each subset is associated with a rule: a path from the root to a leaf. Ideally we should create the shortest possible tree, since shorter trees are more general. Constructing the shortest possible decision tree requires brute-force search.
For any given dataset there are a combinatorial number of decision trees that are consistent with the training set.
Types of problems decision tree learning is good for:
Instances represented by OAV data
Attributes take on a small number of discrete values (As in Mitchell’s book)
Can be extended to real-valued attributes
Target function has discrete output values
Assume Boolean functions
Can be extended to use multiple output values
Decision Tree Uses
Hypothesis space can include disjunctive expressions. In fact, the hypothesis space is the complete space of finite discrete-valued functions.
Robust to imperfect training data: Classification errors, errors in attribute values, missing attribute values.
Ex. Medical diagnosis
Decision Tree Construction
Given a labelled OAV training set, how does one create a decision tree for the class label? If the training set is small, this can probably be done by visual inspection.
Decision trees are usually not unique. Given a labelled OAV training set, there may be many trees that are consistent with the training examples.
Which Decision Trees do we choose?
The set of all decision trees that are consistent with (fit) a labelled OAV training set is called the Hypothesis Space. Occam’s razor says we should choose the simplest one. It has been shown that the shortest decision tree is the best (more general).
How do we find the shortest decision tree?
Not by inspection, unless the tree is very small. Use the ID3 algorithm (Iterative Dichotomiser 3). It finds a short tree, but not necessarily the shortest, as that requires exhaustive brute-force search.
ID3 Algorithm Definition
Top-down, greedy search through the space of possible decision trees. Decision trees represent hypotheses, so this is a search through hypothesis space. As you proceed down, choose an attribute for each successive node. No backtracking. Goes from root to leaves.
What is Greedy Search?
At each step, make the decision which gives the greatest improvement in what you are trying to optimize. Don't backtrack unless you hit a dead end. Not likely to find the global optimum, but generally works well.
At each node, decide which attribute best classifies the training data at that point. Do this for each branch of the tree; the end result will be a tree structure representing the hypothesis which works best for the training data.
ID3 Algorithm (How it works)
Finds a shallow tree consistent with the training data. Uses two important concepts:
Entropy
Information Gain
Both used in order to determine which attribute is the best class discriminator. Choosing the best discriminator makes the tree shorter, which generalizes better.
ID3 has better time complexity but C4.5 is better at finding a shorter tree.
Entropy, used via information gain, determines which attribute best classifies the data.
Information Gain
Statistical quantity measuring how well an attribute classifies the data. Calculate the information gain for each attribute. Choose attribute with greatest information gain.
Entropy
Entropy measures information content of a random process:
Takes on largest value when events are equi-probable.
Takes on smallest value when only one event has a non-zero probability.
Computing Entropy
H(S) = -p(+) log2(p(+)) - p(-)log2(p(-))
Example: to find the entropy of a class of 50 students, 30 girls and 20 boys, let girls be positive and boys be negative, then substitute the corresponding proportions, p(+) = 30/50 and p(-) = 20/50, into the equation.
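A minimal sketch of this calculation in Python, using the proportions from the example above:

```python
# Boolean entropy H(S) = -p(+)log2(p(+)) - p(-)log2(p(-)).
# 0*log2(0) is treated as 0 so that a pure set has zero entropy.
from math import log2

def entropy(p_pos, p_neg):
    return -sum(p * log2(p) for p in (p_pos, p_neg) if p > 0)

p_girls, p_boys = 30 / 50, 20 / 50   # girls positive, boys negative
print(entropy(p_girls, p_boys))      # about 0.971 bits
```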
Computing Information Gain
Gain(S, Ai) = H(S) - Σ P(Ai=v) H(Sv), where the sum runs over the values v of attribute Ai and Sv is the subset of S for which Ai = v.
The attribute with the highest information gain is usually chosen as the splitting attribute at each node in the decision tree. The goal is to maximize information gain at each decision node, thereby incrementally reducing uncertainty or entropy and making the data subsets at the leaves of the tree as pure as possible.
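A minimal sketch of Gain(S, A) for a labelled OAV dataset stored as a list of dicts; the attribute and class-label names are placeholders, not tied to any particular dataset.

```python
from collections import Counter
from math import log2

def entropy(labels):
    # H(S) computed from the class-label proportions in S.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, label="Class"):
    # Gain(S, A) = H(S) - sum over values v of P(A=v) * H(S_v).
    n = len(rows)
    total = entropy([r[label] for r in rows])
    remainder = 0.0
    for v in {r[attribute] for r in rows}:
        subset = [r[label] for r in rows if r[attribute] == v]
        remainder += (len(subset) / n) * entropy(subset)
    return total - remainder
```

The splitting attribute at a node is then the one with the largest information_gain over the examples that reach that node.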
ID3 for Boolean Valued Class Labels
Calculate the entropy once for the full training set, using its positive and negative subsets. Determine which single attribute best classifies the training examples using information gain: for each attribute, find the information gain, then use the attribute with the greatest information gain as the root. The function p() denotes the proportion of positive (or negative) examples.
ID3 Choosing the Root Node
Loop over all attributes and compute the information gain for each one (using the entropy of the resulting subsets). Choose the attribute with the greatest information gain as the root.
ID3 Restrictions
Attributes are excluded from consideration if they already appear higher in the tree (on the same path).
The process continues at each new leaf node until either every attribute has already been included along the path through the tree, or the training examples associated with that leaf all have the same target attribute value.
ID3 Considerations
If there are contradictions in the data, we always take the majority vote. This handles noisy data. Attributes are eliminated once they are assigned to a node and are never reconsidered on the same root-to-leaf path.
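Tying the cards together, here is a minimal recursive ID3 sketch under the same assumptions as the earlier sketches (dataset as a list of dicts, discrete attribute and class values); the helper names are invented, and this is an illustration rather than a reference implementation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr, label):
    # Gain(S, A) = H(S) - sum over values v of P(A=v) * H(S_v).
    n = len(rows)
    remainder = 0.0
    for v in {r[attr] for r in rows}:
        subset = [r[label] for r in rows if r[attr] == v]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy([r[label] for r in rows]) - remainder

def id3(rows, attributes, label="Class"):
    # e.g. id3(cars, ["Colour", "Doors"], label="Desirable") with the
    # hypothetical cars data sketched earlier.
    labels = [r[label] for r in rows]
    if len(set(labels)) == 1:                 # pure node: a single class label
        return labels[0]
    if not attributes:                        # contradictions/noise: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, a, label))
    remaining = [a for a in attributes if a != best]   # never reused on this path
    return {best: {v: id3([r for r in rows if r[best] == v], remaining, label)
                   for v in {r[best] for r in rows}}}
```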