Decision Trees Flashcards

1
Q

Definition

A

A decision tree is made of a series of splitting rules that divide the dataset into nodes. It does this without making any concrete assumptions about how the target and the predictors relate.

Decision trees can be used for both regression and classification problems.

Recursive Binary Splitting is the algorithm used to determine the splits. As each split is defined by one predictor, a tree partitions the predictor space.

It is a greedy algorithm: at each node, it attempts to divide that node's region of the predictor space into two smaller nodes, and the best split is chosen with no consideration of what future splits might be.
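As a rough sketch in R (the rpart package implements recursive binary splitting; dat and loss are a hypothetical data frame and target used only for illustration):

library(rpart)

# Fit a regression tree by recursive binary splitting.
# method = "anova" gives a regression tree; method = "class" gives a classification tree.
tree <- rpart(loss ~ ., data = dat, method = "anova")

# Each non-leaf row of the frame is one split, chosen greedily at that node
# with no look-ahead to future splits.
tree$frame[tree$frame$var != "<leaf>", c("var", "n")]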

2
Q

Tree Structure

A

Each tree visual starts at the top with the root node, representing all of the training observations. A predictor is used to divide/split the observations into two groups/nodes. At the bottom of the tree (where the observations have stopped being divided) are the terminal nodes or leaves.

The terminal nodes determine the tree predictions, such that all observations in the same terminal node have the same prediction.

Regression tree: predicted value = sample mean of the node
–top # is the predicted value; bottom # is the proportion of observations in the node

Classification tree: predicted value = majority class of the node
–top # is the predicted class (0 or 1); bottom #s are the predicted probabilities of class 0 and class 1
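A rough sketch of inspecting the fitted tree in R, assuming the tree object from the previous card and the rpart.plot package (the exact node labels shown depend on the plotting options used):

library(rpart.plot)

rpart.plot(tree)   # root node at the top, terminal nodes (leaves) at the bottom

# Every observation routed to the same terminal node receives the same prediction,
# e.g. the node's sample mean for a regression tree.
head(predict(tree, newdata = dat))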

3
Q

Tree Flexibility

A

The number of terminal nodes is the measure of a tree’s flexibility.

Pruning controls flexibility: typically the tree is first allowed to grow many terminal nodes, which are then reduced by pruning.
–By starting with an overly complex tree and then pruning back, we retain valuable splits that occur beneath less valuable splits. However, if we instead grew a new tree directly with that final complexity parameter, those less valuable splits would not occur in the first place, and thus the more valuable splits beneath them would never be discovered.
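A small sketch of the grow-then-prune idea in R, again using the hypothetical dat/loss setup:

library(rpart)

# Grow an intentionally over-complex tree; cp = 0 allows many terminal nodes.
big_tree <- rpart(loss ~ ., data = dat, method = "anova",
                  control = rpart.control(cp = 0))

# Prune back afterwards. Valuable splits found beneath less valuable ones are kept;
# growing directly with cp = 0.01 could have stopped before ever reaching them.
small_tree <- prune(big_tree, cp = 0.01)

sum(small_tree$frame$var == "<leaf>")   # number of terminal nodes (flexibility)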

4
Q

Stopping Criteria

A

To prevent a tree from overfitting (i.e. having too many terminal nodes), stopping criteria determine when a node should stop splitting. A node will not be split if doing so would violate any of these hyperparameters (the direction in parentheses is the change that makes the tree less flexible):
–min # observations a node must contain before it can be split (increase)
–min # observations permitted in a terminal node (increase)
–max depth of any node (decrease)
–min reduction in the SSE required for a split, set by the complexity parameter cp (increase)

All of these impact the tree’s flexibility.
All can be cross-validated to determine their optimal values, but pruning focuses on tuning the cp.
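A minimal sketch of how the four criteria map onto rpart's control hyperparameters (the values shown are illustrative assumptions, not recommendations):

library(rpart)

ctrl <- rpart.control(
  minsplit  = 20,    # min # observations a node must contain before a split is attempted
  minbucket = 7,     # min # observations permitted in a terminal node
  maxdepth  = 5,     # max depth of any node (root counted as depth 0)
  cp        = 0.01   # min improvement in fit (scaled reduction in SSE) a split must achieve
)

tree <- rpart(loss ~ ., data = dat, method = "anova", control = ctrl)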

5
Q

Pruning

A

Cost-complexity pruning identifies the optimal tree size by removing splits that do not provide sufficient improvement; the threshold is chosen by cross-validation. Pruning reduces the complexity of a decision tree and protects against overfitting. Each split is evaluated to determine whether it is necessary for optimal model performance. If removing the split (i.e., turning it into a terminal node and erasing the subtree below it) does not sufficiently decrease the model’s accuracy, then that split gets removed (“pruned”). This process is repeated for each remaining split until further pruning would meaningfully decrease model accuracy (i.e., every remaining split provides sufficient improvement).

We can use cross-validation to optimize the complexity parameter: models are repeatedly trained and tested on different folds of the data for different values of the complexity parameter, and the value with the lowest cross-validation error is selected as the optimal choice. We then prune back our trained tree using the complexity parameter from the cross-validation.
–Each time the cost-complexity pruning algorithm is run, the folds used in the cross-validation are randomly assigned, which makes the optimal complexity parameter dependent on the seed selected. Therefore, changing the seed can result in different pruned trees.

While all four of the listed stopping criteria can be cross-validated to determine their optimal values, pruning focuses on tuning the complexity parameter (cp). The cp is a value between 0 and 1, and pruning searches for a cp higher than the one initially set that produces a lower CV error.
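A hedged sketch of the workflow in R: rpart runs the cross-validation automatically (xval folds) and stores the results in cptable, and the fold assignment depends on the seed.

library(rpart)

set.seed(42)   # a different seed can lead to a different optimal cp and pruned tree

big_tree <- rpart(loss ~ ., data = dat, method = "anova",
                  control = rpart.control(cp = 0.001, xval = 10))

printcp(big_tree)   # cp values, number of splits, and cross-validation error (xerror)
plotcp(big_tree)    # plot of CV error against cp

# Select the cp with the lowest CV error and prune back the trained tree.
cp_tab  <- big_tree$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]
pruned  <- prune(big_tree, cp = best_cp)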

6
Q

Model Performance

A

Model performance is evaluated on the test set, using RMSE (for a regression tree) or another appropriate performance metric.
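For example, test-set RMSE for the pruned regression tree from the previous card (test_dat is an assumed held-out data frame with the same columns as dat):

pred_test <- predict(pruned, newdata = test_dat)
sqrt(mean((test_dat$loss - pred_test)^2))   # test-set RMSE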

7
Q

Other Concepts

A

The number of splits is always one less than the number of terminal nodes.

Factors do not need to be transformed into dummy variables.

Competitor splits are the best alternative splits: the next-best splits at a node, from predictors other than the one chosen for the best split.

Surrogate splits are alternative splits that mimic the partition of the best split; they are used to route observations that are missing the value of the splitting variable.
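A brief sketch of where rpart surfaces these items, assuming the pruned tree from the earlier cards:

n_leaves <- sum(pruned$frame$var == "<leaf>")
n_splits <- n_leaves - 1   # always one less than the number of terminal nodes

summary(pruned)   # for each split, also reports competitor and surrogate splits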

8
Q

Vs Ensemble Methods

A

Both methods can be used to build regression or classification models, determining splits based on impurity or information gain measures.
–A single decision tree partitions the data into a single set of non-overlapping regions. Ensemble methods (such as random forests and boosted trees) fit multiple decision trees and combine their predictions to determine the model’s prediction.
–Single trees are prone to high variance due to overfitting, while ensemble methods reduce overfitting.

-To use a continuous variable effectively, a tree needs to split on it repeatedly to capture its effect, whereas a GLM requires only a single coefficient to be estimated. If the tree has limited depth, such a predictor may not be used effectively. A boosted tree, by fitting many trees sequentially, may still uncover that variable’s importance.
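A rough sketch of the three model types side by side, assuming the hypothetical dat/loss setup and the randomForest and gbm packages:

library(rpart)
library(randomForest)
library(gbm)

single_tree <- rpart(loss ~ ., data = dat, method = "anova")

# Random forest: averages many de-correlated trees grown on bootstrap samples.
rf <- randomForest(loss ~ ., data = dat, ntree = 500)

# Boosted tree: fits many shallow trees sequentially, each to the remaining error.
bst <- gbm(loss ~ ., data = dat, distribution = "gaussian",
           n.trees = 500, interaction.depth = 2)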

9
Q

Vs Parametric models

A

Advantages:
-Do not require assumptions about the data
-Easy to interpret and explain
-Easy to present visually
-Easy to construct, i.e. factors do not need dummy variables
-Interactions captured automatically
-Insignificant predictors are removed automatically
-Can model non-linear relationships between variables without having to specify them in advance

Disadvantages:
-High risk of overfitting
-Not robust; highly sensitive to the training set
-Greedy -> unlikely to produce the globally optimal model

10
Q

Single node tree

A

A single node tree results when the model is unable to find any split, on any variable, that reduces the impurity of the parent (root) node by the minimum required amount. Because the dataset cannot be split on any variable, the average of the target over the entire training set is used as the prediction for all observations.
A single node tree can be caused by a poor choice of model parameters, such as too high a complexity parameter (cp) or minbucket value. It can also occur because none of the independent variables have explanatory value for the target variable.
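A small sketch of forcing this behaviour in R with a deliberately strict cp (hypothetical dat/loss as before):

library(rpart)

stump <- rpart(loss ~ ., data = dat, method = "anova",
               control = rpart.control(cp = 0.5))   # no candidate split improves fit by this much

nrow(stump$frame)                       # 1 row => a single (root-only) node
unique(predict(stump, newdata = dat))   # one value: the overall training mean of loss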
