2) Trees Flashcards

(16 cards)

1
Q

What are trees?

A

Structures that recursively split X into groups and use the most common level (when Y is a factor) or the group mean (when Y is continuous) as the prediction

2
Q

How are splits selected in trees?

A

Splits are made to minimise the residual sum of squares

This is equivalent to maximising the squared distance between the group means with a weighting based on the group sizes.

Ideally, the group means are as far apart as possible and the splits are approximately even to avoid peeling off outliers
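
This equivalence can be written out explicitly. For a split of a node of n points into groups of sizes n_L and n_R with means ybar_L and ybar_R, a standard identity gives the reduction in RSS:

    Delta_RSS = RSS_parent - (RSS_left + RSS_right) = (n_L n_R / n) (ybar_L - ybar_R)^2

The weight n_L n_R / n is largest when the split is even, which is why near-even splits are preferred over ones that peel off outliers.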

3
Q

What is the general algorithm for building a tree?

A
  1. Search over all variables and all cut points for the best split
  2. Divide the data at the best cut point
  3. Repeat 1-2 for each branch until they all end in a single data point
  4. Repeat 1-3 to do cross-validation
  5. Prune the tree based on a complexity penalty
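
A minimal R sketch of the exhaustive search in steps 1-2 for a single numeric predictor (the function name and simulated data are illustrative, not from any package):

    # Best single split of y on numeric x, minimising the residual sum of squares
    best_split <- function(x, y) {
      xs <- sort(unique(x))
      cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between observed values
      rss <- sapply(cuts, function(cc) {
        left <- y[x <= cc]; right <- y[x > cc]
        sum((left - mean(left))^2) + sum((right - mean(right))^2)
      })
      list(cut = cuts[which.min(rss)], rss = min(rss))
    }

    # Example: recover the jump point of a noisy step function
    set.seed(1)
    x <- runif(200)
    y <- ifelse(x < 0.6, 1, 3) + rnorm(200, sd = 0.2)
    best_split(x, y)$cut   # close to 0.6

A full tree repeats this search over every variable at every node and then recurses into each branch.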
4
Q

What are the types of splits?

A

The splits considered at each node are called “competing splits”; only the best one is chosen

Surrogate splits attempt to replicate the chosen split with another variable

5
Q

How is variable importance measured in trees?

A

It is based on the improvement in fit from the splits where the variable is used, plus a share of the improvement at splits where it serves as a surrogate, scaled by its accuracy as a surrogate
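
In R's rpart this is computed automatically; a quick sketch using the kyphosis data that ships with the package:

    library(rpart)

    fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

    summary(fit)              # lists competing and surrogate splits at each node
    fit$variable.importance   # improvement as primary split + adjusted surrogate credit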

6
Q

How is goodness of fit measured for trees?

A

Regression: 1 - R^2

Classification: Gini impurity

7
Q

What is Gini impurity?

A

The probability that two randomly-chosen elements will have different levels of a factor:

G = 1 - sum_{i in levels} p_i^2

where p_i is the proportion of elements with level i
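
As a sanity check on the formula, a two-line R version (the function name is illustrative):

    # Gini impurity of a vector of class labels
    gini <- function(labels) {
      p <- table(labels) / length(labels)   # p_i for each level
      1 - sum(p^2)
    }

    gini(c("a", "a", "a", "a"))   # 0   : pure node
    gini(c("a", "a", "b", "b"))   # 0.5 : maximally impure for two levels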

8
Q

How are trees handled in R?

A

plotcp() shows how the cross-validation error changes as the size/complexity of the tree increases

printcp() gives the same complexity table in text form, including the standard error (xstd) of the cross-validation error (xerror)

The MSPE is estimated by xerror x root node error

As a rule of thumb, take the smallest model within one standard error of the best model
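
A sketch of that workflow (the car.test.frame data ships with rpart; everything else is illustrative):

    library(rpart)

    fit <- rpart(Mileage ~ ., data = car.test.frame)

    printcp(fit)   # cp table: rel error, xerror (CV error), xstd (its SE)
    plotcp(fit)    # CV error against tree size/complexity

    # One-SE rule: smallest tree whose xerror is within one SE of the minimum
    tab <- fit$cptable
    best <- which.min(tab[, "xerror"])
    cp_pick <- tab[which(tab[, "xerror"] <= tab[best, "xerror"] + tab[best, "xstd"])[1], "CP"]
    pruned <- prune(fit, cp = cp_pick)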

9
Q

How can the performance of a tree be visualised?

A

With a confusion matrix
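
In R this is just a cross-tabulation of predicted against actual classes (reusing the kyphosis example):

    library(rpart)

    fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
    pred <- predict(fit, type = "class")

    table(Predicted = pred, Actual = kyphosis$Kyphosis)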

10
Q

What is notable about the structure of trees?

A

The structure of trees is much less stable than their accuracy; cross-validation creates very different trees with similar performance
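
One way to see this (an illustrative sketch, not from the cards): refit on bootstrap resamples and record which variable wins the root split; the accuracy barely moves but the structure does.

    library(rpart)

    set.seed(42)
    root_var <- replicate(20, {
      idx <- sample(nrow(kyphosis), replace = TRUE)
      fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis[idx, ])
      as.character(fit$frame$var[1])   # variable used at the root split
    })
    table(root_var)   # resamples often disagree about the very first split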

11
Q

What are the pros and cons of trees?

A

Interactions and non-linearities are built in: different branches have completely independent splits

They are not very smooth, so lots of splits are needed even to capture a straight line

They are greedy and so can get stuck

They are very interpretable, but this can be misleading due to the instability of their structure

12
Q

What are the basic issues with trees?

A
  • They are discontinuous and so not ideal for numeric responses (i.e. regression)
  • Only one variable is used at each node, so at most about log_2 n variables are used on the path to any one observation
  • Many different trees give similar predictions
13
Q

How can the issues with trees be addressed?

A

  • Bagging (bootstrap aggregation) – grow trees on bootstrap samples, evaluate the error on the left-out observations, and average the trees (sketched after this list)
  • Boosting – upweight the observations that the previous trees got wrong
  • Random forest – bootstrap the observations and take small random samples of the variables
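
A minimal hand-rolled bagging sketch with rpart (in practice the randomForest or ipred packages do this for you; the data and settings here are illustrative):

    library(rpart)

    set.seed(7)
    B <- 100
    preds <- replicate(B, {
      idx <- sample(nrow(mtcars), replace = TRUE)                # bootstrap sample
      fit <- rpart(mpg ~ ., data = mtcars[idx, ], minsplit = 5)  # grow a tree on it
      predict(fit, newdata = mtcars)                             # predict for every observation
    })

    bagged <- rowMeans(preds)   # average the B trees' predictions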
14
Q

What is the process of constructing a random forest?

A
  1. Select a bootstrap sample as training data; the rest (the out-of-bag data) is test data
  2. Grow a single tree from the training data:
     a. select a small number of potential predictors independently for each node
     b. don’t prune the tree - hence no cross-validation is required
  3. Repeat 1-2 for lots of trees and average the results
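
With the randomForest package, that whole procedure is one call (dataset and settings illustrative):

    library(randomForest)
    data(kyphosis, package = "rpart")

    set.seed(1)
    rf <- randomForest(Kyphosis ~ Age + Number + Start, data = kyphosis,
                       ntree = 500,   # step 3: lots of trees, averaged
                       mtry = 2)      # step 2a: variables sampled at each node
    rf   # prints the OOB error estimated from each tree's left-out data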
15
Q

Why do random forests avoid overfitting?

A

Instead of getting caught up in finding the single best tree, they find a lot of decent ones and average them

16
Q

How is variable importance measured in a random forest?

A

Impurity – how much improvement there is when splitting on that variable, weighted by the number of data points involved in those splits

Permutation – randomly shuffle the values of that variable and see how much worse the predictions get

Note: these metrics only apply to a particular tree/model; if two variables are highly correlated then only one may appear important
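
Both measures are available from randomForest when importance = TRUE is set at fit time (a sketch, reusing the kyphosis example):

    library(randomForest)
    data(kyphosis, package = "rpart")

    set.seed(1)
    rf <- randomForest(Kyphosis ~ Age + Number + Start, data = kyphosis,
                       importance = TRUE)

    importance(rf, type = 1)   # permutation: mean decrease in accuracy
    importance(rf, type = 2)   # impurity: mean decrease in Gini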