Decision Trees Flashcards

1
Q

Briefly explain how a decision tree works

A

Decision trees divide the feature space into a finite set of non-overlapping regions containing relatively homogeneous observations that are more amenable to analysis and prediction
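A minimal sketch in R (assuming the rpart package; the iris data and chosen predictors are purely illustrative):

  library(rpart)
  # Fit a small regression tree; each terminal node is one non-overlapping region of the feature space
  fit <- rpart(Sepal.Length ~ Petal.Length + Petal.Width, data = iris, method = "anova")
  fit                  # the printed splits define the regions
  head(predict(fit))   # each prediction is the mean target value of the region an observation falls into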

2
Q

Explain why the classification error rate is generally less sensitive to node impurity than the Gini index and entropy

A

This is because the classification error rate depends on the class proportions only through the maximum proportion. If two nodes have the same maximum proportion, they will always share the same classification error rate, regardless of how the observations are distributed among the other classes, whereas the Gini index and entropy do respond to changes in those other proportions
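A concrete illustration in R (the class proportions are made up):

  # Two nodes with the same maximum class proportion (0.5) but different spreads over the other classes
  pA <- c(0.5, 0.3, 0.2)
  pB <- c(0.5, 0.25, 0.25)
  err  <- function(p) 1 - max(p)           # classification error rate
  gini <- function(p) 1 - sum(p^2)         # Gini index
  entr <- function(p) -sum(p * log2(p))    # entropy (base-2 logarithm)
  c(err(pA),  err(pB))    # identical: 0.5 and 0.5
  c(gini(pA), gini(pB))   # differ: 0.62 vs 0.625
  c(entr(pA), entr(pB))   # differ as well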

3
Q

Recursive binary splitting is a top-down, greedy algorithm. Explain the meaning of the terms “top-down” and “greedy”

A
  • “Top-down” means that we start from the top of the tree and go down, sequentially partitioning the feature space in a series of binary splits
  • “Greedy” means that we adopt the binary split that leads to the greatest reduction in impurity at that point, rather than looking ahead and selecting a split that results in a better tree in a future step
4
Q

Explain why it is not a good idea to grow a tree by specifying an impurity reduction threshold and making a split only when the reduction in impurity exceeds the threshold

A

This is because it overlooks the important fact that a “not-so-good” split early in the tree-building process may be followed by a good split later on. If we stop splitting as soon as the reduction in impurity falls below the threshold, we will never arrive at that later, better split

5
Q

Explain how cost-complexity pruning works

A

Cost-complexity pruning involves growing a large tree and then pruning it back by dropping splits whose reduction in model error does not justify the penalty set by the complexity parameter. We can use cross validation, the process of repeatedly training and testing models on different folds of the data, to optimize the complexity parameter: the procedure is run for a range of complexity parameter values, and the value with the lowest cross-validation error is selected as the optimal choice. We then prune back the trained tree using that complexity parameter.
Cost-complexity pruning trades off the goodness of fit of the tree to the training data against the size/complexity of the tree
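A minimal R sketch of this workflow (assuming the rpart package; the data and settings are illustrative only):

  library(rpart)
  set.seed(1)
  big <- rpart(Sepal.Length ~ ., data = iris, method = "anova",
               control = rpart.control(cp = 0, xval = 10))       # cp = 0 grows a large tree; xval = # of CV folds
  printcp(big)                                                   # cp table with cross-validation errors
  best_cp <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
  pruned  <- prune(big, cp = best_cp)                            # prune back using the selected cp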

6
Q

Explain how the complexity parameter affects the quality of a decision tree

A

The complexity parameter (cp) is a penalty term in place to prevent overfitting
* cp = 0: there is no penalty for a complex tree, and the fitted tree is identical to the large, overgrown tree without pruning
* As cp increases from 0 to 1, the penalty term increases and tree branches are snipped off to form new, larger terminal nodes. The squared bias of the tree predictions increases, but the variance drops
* cp = 1: the tree is reduced to the root node only (the analogue of an intercept-only linear model). In this case, the drop in relative training error can never compensate for the increase in the penalty, so no splits are made

7
Q

Explain how a monotone transformation on a numeric predictor and on the target variable affects a decision tree

A

Monotone transformation of a numeric predictor: produces the same decision tree, because the split points depend only on the ranks of the feature values, not on their actual values
Monotone transformation of the target variable: generally produces a different decision tree, because the impurity measure (e.g., RSS) is computed from the target values themselves, so different split points become optimal
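A quick R check of the first point (assuming rpart; iris used purely for illustration):

  library(rpart)
  t1 <- rpart(Sepal.Length ~ Petal.Length, data = iris, method = "anova")
  iris_log <- transform(iris, Petal.Length = log(Petal.Length))    # monotone transform of the predictor
  t2 <- rpart(Sepal.Length ~ Petal.Length, data = iris_log, method = "anova")
  identical(as.character(t1$frame$var), as.character(t2$frame$var))  # expected TRUE: same split structure
  # By contrast, refitting with log(Sepal.Length) as the target generally changes the splits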

8
Q

Explain why a decision tree tends to favor categorical predictors with many levels

A

This is because there are numerous ways to split its levels into two groups, making it easier to choose a seemingly good split based on a multi-level categorical predictor for the particular training data at hand, even though that predictor is not necessarily an important one. This will likely lead to overfitting.

9
Q

Explain how a random forest works

A
  1. A random forest works by first taking the training data and using the bootstrap to produce B bootstrapped training samples of the same size as the original training set (independently and randomly sampling with replacement)
  2. A tree is then fitted to each of the B bootstrapped training samples, with only a random subset of mtry features considered as split candidates at each split, producing B fitted trees
  3. The results of the B fitted trees are combined to form an overall prediction
    * Numeric target variables = average of the B predictions
    * Categorical target variables = majority vote

Summary = make independent bootstrapped samples, fit a decision tree to each bootstrapped sample separately, and combine the predictions of these trees to reduce variance
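A minimal R sketch (assuming the randomForest package; the data and settings are illustrative):

  library(randomForest)
  set.seed(1)
  rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)   # B = ntree bootstrapped trees
  head(rf$predicted)                   # out-of-bag majority-vote predictions (classification)
  predict(rf, newdata = iris[1:5, ])   # predictions for new observations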

10
Q

Explain how a boosted tree works

A

Boosting builds a sequence of interdependent trees using information from previously grown trees. In each iteration, a tree is fit to the residuals of the preceding model, and a scaled-down version of the current tree’s predictions is added to the previous predictions, focusing on the observations that were previously predicted poorly.
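A hand-rolled sketch of this mechanism for a numeric target with squared-error loss, using rpart for the base trees; eta and the number of rounds are illustrative choices, not recommendations:

  library(rpart)
  eta <- 0.1; n_rounds <- 50                 # illustrative values
  y    <- iris$Sepal.Length
  pred <- rep(mean(y), nrow(iris))           # start from the overall mean
  for (b in 1:n_rounds) {
    res <- y - pred                          # residuals of the current ensemble
    fit <- rpart(res ~ Petal.Length + Petal.Width, data = iris,
                 control = rpart.control(maxdepth = 2))   # fit a shallow tree to the residuals
    pred <- pred + eta * predict(fit)        # add a scaled-down version of its predictions
  }
  mean((y - pred)^2)                         # training MSE after boosting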

11
Q

Explain the differences between a random forest and a boosted tree

A
  • In parallel vs. in series: base trees in a random forest are fitted in parallel, independently of one another, whereas base trees in a boosted model are fitted in series
  • Bias vs. variance: random forests address model variance, while boosted trees focus on reducing model bias
12
Q

Explain the considerations that go into the choice of the mtry parameter of a random forest

A
  • The larger the value of mtry (the number of features considered as candidates at each split), the more similar the base trees are to one another and the smaller the variance reduction
  • If mtry is too small, each split has too little freedom in choosing the split variables, which can hurt predictive performance
13
Q

Explain what a partial dependence plot is and in what way it improves over a variable importance plot

A

Partial dependence plot = visualizes the association between a given variable and the model prediction after averaging out the values of the other variables. These plots give insight into how the target variable depends on that predictor
Variable importance plot = tells us which predictors are most influential, but does not shed light on the relationship between the predictors and the target variable (e.g., whether it is positive, negative, or more complex)
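An R sketch of both plots for a fitted random forest (assuming the randomForest package; the data are illustrative):

  library(randomForest)
  set.seed(1)
  rf <- randomForest(Sepal.Length ~ ., data = iris, importance = TRUE)
  varImpPlot(rf)                                             # which predictors are most influential
  partialPlot(rf, pred.data = iris, x.var = "Petal.Length")  # direction/shape of one predictor's effect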

14
Q

What is information gain?

A

Information gain is the decrease in impurity created by a decision tree split. At each node of the decision tree, the model selects the split that results in the most information gain. Therefore, the choice of impurity measure (e.g., Gini or entropy) directly impacts information gain calculations.
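A small worked example with made-up class proportions, using the Gini index as the impurity measure:

  gini <- function(p) 1 - sum(p^2)                # impurity measure (Gini index)
  parent <- c(0.5, 0.5)                           # parent node: two classes, 50/50
  left   <- c(0.8, 0.2); right <- c(0.2, 0.8)     # children, each receiving half the observations
  gini(parent) - (0.5 * gini(left) + 0.5 * gini(right))   # information gain = 0.5 - 0.32 = 0.18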

15
Q

Compare random forests and gradient boosting machines

A

Random forests tend to do well in reducing variance while having bias similar to that of a basic tree model.
Gradient boosting machines use the same underlying training data at each step. This is very effective in reducing bias but very sensitive to the training data (high variance)

16
Q

What is a node? Explain the different types of tree nodes

A

A node is a point on a decision tree that corresponds to a subset of the training data
* Root node: the node at the top of a decision tree representing the full dataset
* Terminal nodes (leaves): the nodes at the bottom of a tree which are not split further and constitute an end point

17
Q

Explain what the depth of a decision tree represents

A

The depth is the number of tree splits needed to go from the tree’s root node to the furthest terminal node. It is a measure of the complexity of a decision tree: the larger the depth, the more complex the tree

18
Q

What constitutes a balanced tree?

A

A tree is balanced if the left and right subtrees coming out of any node differ in depth by at most one

19
Q

What do we use as the prediction of a decision tree when it comes to a regression vs classification problem?

A

Regression = numerical target variable: use the average (mean) of the target variable in that region as the predicted value
Classification = categorical target variable: use the most common class (mode) of the target variable in that region as the predicted class

20
Q

What are the two inter-related decisions we have to make regarding binary splits?

A
  1. The predictor to split on
  2. Given the split predictor, the corresponding cutoff value
21
Q

What is node impurity and information gain?

A
  • Node impurity is how dissimilar the observations are in a node
  • Information gain is the decrease in impurity created by a tree split
22
Q

What is the goal of constructing a decision tree?

A

The goal is to create terminal nodes that contain similar observations that are easy to analyze and predict. At every split, we partition the observations using whichever predictor and cutoff result in the greatest reduction in impurity, i.e., the greatest information gain

23
Q

What are three commonly used node impurity measures for classification trees?

A
  1. Classification error rate
  2. Gini index
  3. Entropy
24
Q

What does recursive binary splitting mean?

A

Each split operates on the results of the previous splits until a stopping criterion is met

25
Q

What is the relative training error?

A

The training error of the tree scaled by the training error of the simplest tree, i.e., the tree with no splits
* Regression trees: RSS/TSS
* Classification trees: (# of misclassifications made by the tree) / (# of misclassifications made by predicting the majority class)
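Note that rpart reports this quantity directly; a quick illustration (data illustrative):

  library(rpart)
  fit <- rpart(Sepal.Length ~ ., data = iris, method = "anova")
  printcp(fit)   # the "rel error" column is the tree's training RSS divided by the RSS of the root-node tree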

26
Q

Explain how the complexity paramter produces a nested family of trees

A

By varying the complexity parameter, we can trace out the whole family of trees, ranging from the largest, most complex tree (cp = 0) down to the tree with only the root node (cp = 1). As the complexity parameter increases, branches are pruned from the previous optimal tree, so each tree in the sequence is a subtree of the ones before it, i.e., the trees are nested
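For reference, in the standard cost-complexity formulation (ISLR's notation, with alpha playing the role of the complexity parameter), the tree selected for a given alpha is the subtree T minimizing

  \sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2 + \alpha\,|T|

where |T| is the number of terminal nodes. As alpha increases, splits are collapsed one at a time and each optimal subtree is contained in the previous one, which is exactly the nested family described above.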

27
Q

What are other paramters at play when pruning decision trees?

A

In addition to the complexity parameter, other constraints can be imposed when growing and pruning a tree, such as the minimum number of observations in a terminal node (minbucket) and the maximum depth (maxdepth)
* The higher minbucket, the less complex the tree
* The higher maxdepth, the more complex the tree

28
Q

How do you tune cp, minbucket, and maxdepth?

A
  • cp can be tuned by cross validation
  • minbucket and maxdepth have to be tuned by trial and error
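An illustrative R sketch combining rpart's built-in cross-validation for cp with a trial-and-error loop over minbucket (the candidate values are arbitrary):

  library(rpart)
  set.seed(1)
  for (mb in c(5, 20, 50)) {      # candidate minbucket values to compare by trial and error
    fit <- rpart(Sepal.Length ~ ., data = iris, method = "anova",
                 control = rpart.control(cp = 0, minbucket = mb, xval = 10))
    cat("minbucket =", mb, " min xerror =", min(fit$cptable[, "xerror"]), "\n")
  }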
29
Q

What can we comment on when interpreting trees?

A
  • Number of tree splits
  • Split sequence, i.e., start with X1, further split the bucket by X2
  • Which are the most important predictors? (early splits)
  • Which terminal nodes have the most observations?
  • Any prominent interactions?
  • (Classification trees) Combinations leading to the positive event
30
Q

What is the one-standard-error (1-SE) rule?

A

We can select the smallest tree whose cross validation error is within one standard error of the minimum cross validation error. The rationale is to select a simpler and more interpretable tree with comparable prediction performance
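A sketch of applying the 1-SE rule to rpart's cp table (assuming rpart; the data are illustrative):

  library(rpart)
  set.seed(1)
  fit <- rpart(Sepal.Length ~ ., data = iris, method = "anova",
               control = rpart.control(cp = 0, xval = 10))
  tab    <- fit$cptable
  cutoff <- min(tab[, "xerror"]) + tab[which.min(tab[, "xerror"]), "xstd"]   # min CV error + 1 SE
  cp_1se <- max(tab[tab[, "xerror"] <= cutoff, "CP"])   # largest cp (smallest tree) within 1 SE
  pruned <- prune(fit, cp = cp_1se)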

31
Q

What are the key parameters of random forests?

A
  • mtry: the number of features sampled as candidates at each split. It can be tuned by cross validation, but common defaults are √p for classification and p/3 for regression
  • ntree: the number of trees to be grown. The higher ntree, the greater the variance reduction; a smaller ntree saves run time, and overfitting generally does not arise even if ntree is set to a large number
32
Q

What are the key parameters of boosted trees?

A
  • eta: the learning rate (or shrinkage) parameter. The higher eta, the faster the algorithm converges, but the more prone the model is to overfitting. A rule of thumb is to set eta to a relatively small value
  • nrounds: the maximum number of rounds (trees) in the tree construction process
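A minimal xgboost sketch showing where eta and nrounds enter (the values are illustrative, not recommendations):

  library(xgboost)
  X      <- as.matrix(iris[, c("Sepal.Width", "Petal.Length", "Petal.Width")])
  dtrain <- xgb.DMatrix(data = X, label = iris$Sepal.Length)
  bst <- xgb.train(params = list(eta = 0.1, max_depth = 3, objective = "reg:squarederror"),
                   data = dtrain, nrounds = 200)
  head(predict(bst, dtrain))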
33
Q

What is the importance score in variable importance plots?

A

The total drop in node impurity (RSS for regression trees and Gini index for classification trees) due to splits over a given predictor, averaged over all base trees

34
Q

What is the definition of partial dependence?

A

The model prediction obtained after averaging the values or levels of variables that are not of interest

35
Q

What are the pros and cons of GLMs?

A

Pros:
* (Target distribution) excel in accommodating a wide variety of distributions for the target variable
* (Interpretability) the model equation clearly shows how the target mean depends on features; coefficients = interpretable measures of directional effect of features
* (Implementation) simple to implement

Cons:
* (Complex relationships) unable to capture non-linear or non-additive relationships unless additional features are manually incorporated (such as polynomial terms or interactions)
* (Interpretability) for some link functions (inverse link), the coefficients may be difficult to interpret

36
Q

What are the pros and cons of regularized GLMs?

A

Pros:
* (Categorical predictors) via the use of model matrices, binarization of categorical variables is done automatically
* (Tuning) can be tuned by CV
* (Variable selection) for lasso and elastic nets, variable selection can be done by making lambda large enough

Cons:
* (Target distribution) limited/restricted model forms allowed by glmnet
* (Categorical predictors) possible to see some non-intuitive or nonsensical results when only a handful of the levels of a categorical predictor are selected
* (Interpretability) coefficient estimates are more difficult to interpret because the variables are standardized
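A minimal glmnet sketch of the tuning and variable-selection points above (the data are illustrative):

  library(glmnet)
  X <- model.matrix(Sepal.Length ~ ., data = iris)[, -1]   # binarization of Species handled by the model matrix
  cvfit <- cv.glmnet(X, iris$Sepal.Length, alpha = 1)      # alpha = 1: lasso; lambda tuned by cross validation
  coef(cvfit, s = "lambda.min")                            # coefficients shrunk to exactly zero = dropped variables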

37
Q

What are the pros and cons of single decision trees?

A

Pros:
* (Interpretability) if there are not too many buckets, trees are easy to interpret because of the if/then nature of the classification rules and their graphical representation
* (Complex relationships) excel in handling non-linear and non-additive relationships without the need to insert extra features manually
* (Categorical variables) categorical predictors are automatically handled without the need for binarization or selecting a baseline level
* (Variable selection) variables are automatically selected as part of the model-building process; the most important variables appear at the top of the tree

Cons:
* (Overfitting) strongly dependent on training data and prone to overfitting which leads to unstable predictions with a high variance and results in lower user confidence
* (Numeric variables) usually need to split a numeric predictor repeatedly to capture its effect effectively
* (Categorical variables) tend to favor categorical predictors with a large number of levels; because there are so many ways to split the levels, it is easy to find a split that looks good on the training data but does not reflect a real signal

38
Q

What are the pros and cons of ensemble decision trees?

A

Pros:
* Much more robust and predictive than base trees by combining the results of multiple trees

Cons:
* Opaque and difficult to interpret, because many base trees are used (variable importance and partial dependence plots can help)
* Difficult to implement, because of the huge computational burden of fitting multiple base trees

39
Q

List two ways impurity measures can be used for classification trees

A
  1. Construction: choose the predictor and cutoff that result in the greatest reduction in the chosen impurity measure (generally the Gini index or entropy)
  2. Pruning: the impurity measure makes up the objective function, which is penalized by tree complexity (generally the classification error rate)
40
Q

State the difference between dissimilarity and linkage

A

Dissimilarity measures the proximity of two observations, whereas linkage measures the proximity of two clusters

41
Q

What is the height of a dendrogram?

A

The height is the inter-cluster dissimilarity between the two closest clusters when they are fused in each step of the hierarchical clustering algorithm

42
Q

Interpret ROC (receiver operating characteristic) curve

A

The ROC curve plots the sensitivity and specificity of a classifier over the full range of cutoffs between 0 and 1. The AUC (area under the curve) combines these quantities across all cutoffs into a single metric measuring the classifier’s predictive ability
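A minimal R sketch (assuming the pROC package; the logistic model on mtcars is purely illustrative):

  library(pROC)
  fit   <- glm(am ~ mpg + wt, data = mtcars, family = binomial)   # any classifier producing probabilities
  probs <- predict(fit, type = "response")
  r <- roc(mtcars$am, probs)   # sensitivity and specificity across all cutoffs
  plot(r)                      # the ROC curve
  auc(r)                       # area under the curve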