2) Trees Flashcards

(16 cards)

1
Q

What are trees?

A

Structures that recursively split X into groups and use the most common level (when Y is a factor) or the group mean (when Y is continuous) as the prediction

2
Q

How are splits selected in trees?

A

Splits are made to minimise the residual sum of squares

This is equivalent to maximising the squared distance between the group means with a weighting based on the group sizes.

Ideally, the group means are as far apart as possible and the splits are approximately even to avoid peeling off outliers
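
This equivalence can be written out explicitly. For a split of a node of n points into groups of sizes n_L and n_R with means ybar_L and ybar_R, a standard identity gives the reduction in RSS:

    Delta_RSS = RSS_parent - (RSS_left + RSS_right) = (n_L n_R / n) (ybar_L - ybar_R)^2

The weight n_L n_R / n is largest when the split is even, which is why near-even splits are preferred over ones that peel off outliers.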

3
Q

What is the general algorithm for building a tree?

A
  1. Search over all variables and all cut points for the best split
  2. Divide the data at the best cut point
  3. Repeat 1-2 for each branch until they all end in a single data point
  4. Repeat 1-3 to do cross-validation
  5. Prune the tree based on a complexity penalty
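
A minimal R sketch of the exhaustive search in steps 1-2 for a single numeric predictor (the function name and simulated data are illustrative, not from any package):

    # Best single split of y on numeric x, minimising the residual sum of squares
    best_split <- function(x, y) {
      xs <- sort(unique(x))
      cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between observed values
      rss <- sapply(cuts, function(cc) {
        left <- y[x <= cc]; right <- y[x > cc]
        sum((left - mean(left))^2) + sum((right - mean(right))^2)
      })
      list(cut = cuts[which.min(rss)], rss = min(rss))
    }

    # Example: recover the jump point of a noisy step function
    set.seed(1)
    x <- runif(200)
    y <- ifelse(x < 0.6, 1, 3) + rnorm(200, sd = 0.2)
    best_split(x, y)$cut   # close to 0.6

A full tree repeats this search over every variable at every node and then recurses into each branch.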
4
Q

What are the types of splits?

A

The splits considered at each node are called “competing splits”; only the best one is chosen

Surrogate splits attempt to replicate the chosen split with another variable

5
Q

How is variable importance measured in trees?

A

It is based on the improvement in fit from the splits where the variable is used, plus a share of the improvement at splits where it serves as a surrogate, scaled by its accuracy as a surrogate
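
In R's rpart this is computed automatically; a quick sketch using the kyphosis data that ships with the package:

    library(rpart)

    fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

    summary(fit)              # lists competing and surrogate splits at each node
    fit$variable.importance   # improvement as primary split + adjusted surrogate credit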

6
Q

How is goodness of fit measured for trees?

A

Regression: 1 - R^2

Classification: Gini impurity

7
Q

What is Gini impurity?

A

The probability that two randomly-chosen elements will have different levels of a factor:

G = 1 - sum_{i in levels} p_i^2

where p_i is the proportion of elements with level i
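
As a sanity check on the formula, a two-line R version (the function name is illustrative):

    # Gini impurity of a vector of class labels
    gini <- function(labels) {
      p <- table(labels) / length(labels)   # p_i for each level
      1 - sum(p^2)
    }

    gini(c("a", "a", "a", "a"))   # 0   : pure node
    gini(c("a", "a", "b", "b"))   # 0.5 : maximally impure for two levels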

8
Q

How are trees handled in R?

A

plotcp() shows how the cross-validation error changes as the size/complexity of the tree increases

printcp() gives the same complexity table in text form, including the standard error (xstd) of the cross-validation error (xerror)

The MSPE is estimated by xerror x root node error

As a rule of thumb, take the smallest model within one standard error of the best model
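
A sketch of that workflow (the car.test.frame data ships with rpart; everything else is illustrative):

    library(rpart)

    fit <- rpart(Mileage ~ ., data = car.test.frame)

    printcp(fit)   # cp table: rel error, xerror (CV error), xstd (its SE)
    plotcp(fit)    # CV error against tree size/complexity

    # One-SE rule: smallest tree whose xerror is within one SE of the minimum
    tab <- fit$cptable
    best <- which.min(tab[, "xerror"])
    cp_pick <- tab[which(tab[, "xerror"] <= tab[best, "xerror"] + tab[best, "xstd"])[1], "CP"]
    pruned <- prune(fit, cp = cp_pick)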

9
Q

How can the performance of a tree be visualised?

A

With a confusion matrix
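
In R this is just a cross-tabulation of predicted against actual classes (reusing the kyphosis example):

    library(rpart)

    fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
    pred <- predict(fit, type = "class")

    table(Predicted = pred, Actual = kyphosis$Kyphosis)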

10
Q

What is notable about the structure of trees?

A

The structure of trees is much less stable than their accuracy; cross-validation creates very different trees with similar performance
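
One way to see this (an illustrative sketch, not from the cards): refit on bootstrap resamples and record which variable wins the root split; the accuracy barely moves but the structure does.

    library(rpart)

    set.seed(42)
    root_var <- replicate(20, {
      idx <- sample(nrow(kyphosis), replace = TRUE)
      fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis[idx, ])
      as.character(fit$frame$var[1])   # variable used at the root split
    })
    table(root_var)   # resamples often disagree about the very first split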

11
Q

What are the pros and cons of trees?

A

Interactions and non-linearities are built in: different branches have completely independent splits

They are not very smooth, so lots of splits are needed even to capture a straight line

They are greedy and so can get stuck

They are very interpretable, but this can be misleading due to the instability of their structure

12
Q

What are the basic issues with trees?

A
  • They are discontinuous and so not ideal for numeric responses (i.e. regression)
  • Only one variable is used at each node, so at most about log_2 n variables are used on the path to any one observation
  • Many different trees give similar predictions
13
Q

How can the issues with trees be addressed?

A

  • Bagging (bootstrap aggregation) – grow trees on bootstrap samples, evaluate the error on the left-out observations, and average the trees (sketched after this list)
  • Boosting – upweight the observations that the previous trees got wrong
  • Random forest – bootstrap the observations and take small random samples of the variables
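
A minimal hand-rolled bagging sketch with rpart (in practice the randomForest or ipred packages do this for you; the data and settings here are illustrative):

    library(rpart)

    set.seed(7)
    B <- 100
    preds <- replicate(B, {
      idx <- sample(nrow(mtcars), replace = TRUE)                # bootstrap sample
      fit <- rpart(mpg ~ ., data = mtcars[idx, ], minsplit = 5)  # grow a tree on it
      predict(fit, newdata = mtcars)                             # predict for every observation
    })

    bagged <- rowMeans(preds)   # average the B trees' predictions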
14
Q

What is the process of constructing a random forest?

A
  1. Select a bootstrap sample as training data; the rest (the out-of-bag data) is test data
  2. Grow a single tree from the training data:
     a. select a small number of potential predictors independently for each node
     b. don’t prune the tree - hence no cross-validation is required
  3. Repeat 1-2 for lots of trees and average the results
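
With the randomForest package, that whole procedure is one call (dataset and settings illustrative):

    library(randomForest)
    data(kyphosis, package = "rpart")

    set.seed(1)
    rf <- randomForest(Kyphosis ~ Age + Number + Start, data = kyphosis,
                       ntree = 500,   # step 3: lots of trees, averaged
                       mtry = 2)      # step 2a: variables sampled at each node
    rf   # prints the OOB error estimated from each tree's left-out data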
15
Q

Why do random forests avoid overfitting?

A

Instead of getting caught up in finding the single best tree, they find a lot of decent ones and average them

16
Q

How is variable importance measured in a random forest?

A

Impurity – how much improvement there is when splitting on that variable, weighted by the number of data points involved in those splits

Permutation – randomly shuffle the values of that variable and see how much worse the predictions get

Note: these metrics only apply to a particular tree/model; if two variables are highly correlated then only one may appear important
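
Both measures are available from randomForest when importance = TRUE is set at fit time (a sketch, reusing the kyphosis example):

    library(randomForest)
    data(kyphosis, package = "rpart")

    set.seed(1)
    rf <- randomForest(Kyphosis ~ Age + Number + Start, data = kyphosis,
                       importance = TRUE)

    importance(rf, type = 1)   # permutation: mean decrease in accuracy
    importance(rf, type = 2)   # impurity: mean decrease in Gini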