Trees and Forest Flashcards

week five

1
Q

alternative names of regression trees

A
  • CART (classification and regression trees);
  • Recursive partitioning methods.
2
Q

Random forests are…

A

A random forest is a collection of decision trees (e.g. regression trees) generated by applying two separate
randomisation processes:
1. The observations (rows) are randomised through a bootstrap resample.
2. A random selection of the predictors (columns) is considered for each split, rather than all variables.

Random forests are examples of ensemble methods.
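
A minimal sketch of fitting a random forest in R with the randomForest package, assuming the same wage.train data used with rpart() later in the deck (object names are illustrative):

library(randomForest)

# Each tree is grown on a bootstrap resample of the rows (randomisation 1),
# and each split considers only a random subset of the predictors
# (randomisation 2, controlled by mtry).
wage.rf <- randomForest(WAGE ~ . , data = wage.train, ntree = 500)

# Printing the fit reports the out-of-bag estimate of the error.
print(wage.rf)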

3
Q

binary splits

A

Most commonly, the groups are formed by a sequence of binary splits.

3
Q

The basic idea of regression trees

A

The basic idea is to split the data into groups using the predictors, and then estimate the response within each group by a fixed value.

3
Q

binary tree

A

The resulting partition of the data can be described by a binary tree. A binary tree is a tree data structure in which each node has at most two children, referred to as the left child and the right child.

4
Q

the target value at each leaf

A

At each leaf the target is estimated by
the mean value of the y-variable for
all data at that leaf.
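
In symbols (notation introduced here only for illustration): if a leaf contains the $n_{\text{leaf}}$ training observations with indices in the set $I$, the fitted value at that leaf is

$\hat{y}_{\text{leaf}} = \frac{1}{n_{\text{leaf}}} \sum_{i \in I} y_i$,

i.e. the sample mean of the y-values falling in that leaf.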

4
Q

At each stage in tree growth, how do we select the best split?

A

Which node to split at? The goal is to find the node whose split best separates the data into more homogeneous groups with respect to the target variable.

Which variable to split with? At each node, the algorithm considers splitting the data on each of the available predictors and evaluates which one best separates the data into distinct groups. The variable that gives the greatest improvement in predictive accuracy is chosen for the split at that node.

What value of that variable to split at? Once the variable is chosen, the algorithm determines the best split point. This could be a threshold for a continuous variable (e.g., height > 170 cm) or a level of a categorical variable (e.g., color = “red”). The algorithm searches for the value that maximizes the improvement in predictive accuracy.

The best split is the one which results in the smallest residual sum of squares (RSS) on the training data.
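
In symbols (notation introduced here only for illustration): a candidate split dividing a node into a left group $L$ and a right group $R$ is scored by the residual sum of squares it leaves on the training data,

$\text{RSS} = \sum_{i \in L} (y_i - \bar{y}_L)^2 + \sum_{i \in R} (y_i - \bar{y}_R)^2$,

where $\bar{y}_L$ and $\bar{y}_R$ are the mean responses in the two groups; the split with the smallest RSS is chosen.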

4
Q

for a factor with many levels, how many possible splits are there to consider?

A

If a factor has k levels, then there are 2^(k−1) − 1 possible splits to consider (each split divides the levels into two non-empty groups). For example, a factor with 4 levels gives 2^3 − 1 = 7 possible splits.

4
Q

function which fits a regression tree

A

The rpart() command (from the rpart package) fits a regression tree, using a similar syntax to lm().

library(rpart)
wage.rp <- rpart(WAGE ~ . , data = wage.train)

4
Q

how to visualise a regression tree with R code

A

plot(wage.rp, compress = TRUE, margin = 0.1)
text(wage.rp)

The plot command visualises the tree, whilst the text command adorns it with labels.
Setting compress = TRUE tends to make the tree more visually appealing, while margin = 0.1 adds a bit of whitespace around it.

5
Q

surrogate splits.

A

Regression trees can handle missing data using surrogate splits: if the primary splitting variable is missing for an observation, a surrogate split on another variable that closely mimics the primary split is used to send the observation down the tree.

6
Q

Pruning trees

A

Pruning trees is a technique used to prevent overfitting and improve the generalization ability of decision trees. Overfitting occurs when a tree captures noise in the training data and performs poorly on unseen data. Pruning helps to simplify the tree by removing branches that do not provide significant improvement in predictive accuracy.

Consider the bias-variance trade off:

One observation per leaf implies lots of flexibility in the model (so low bias) but high variability.

Many observations per leaf reduce flexibility (introducing bias) but reduce variability.
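
A minimal sketch of pruning with rpart's prune() function, assuming the wage.rp tree fitted earlier in the deck (the cp value and object name are illustrative):

# Prune the fitted tree back to the sub-tree corresponding to a larger
# complexity parameter, removing splits that do not pay their way.
wage.rp.pruned <- prune(wage.rp, cp = 0.05)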

7
Q

complexity parameter

A

Model complexity:
  • Low complexity - high bias - low variability.
  • High complexity - low bias - high variability.

Complexity is specified by the cp argument of rpart(); the default is cp = 0.01. A larger value (e.g. cp = 0.1) gives a simpler tree, while a smaller value (e.g. cp = 0.0001) gives a more complex tree.

The default value of cp = 0.01 is only a rule-of-thumb.

Pick the value of cp that minimizes prediction error.
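
A small illustration of how cp changes the fitted tree (object names and cp values are illustrative, not from the notes):

# A large cp gives a simple tree with few splits; a tiny cp gives a
# complex tree with many splits.
wage.rp.simple  <- rpart(WAGE ~ . , data = wage.train, cp = 0.1)
wage.rp.complex <- rpart(WAGE ~ . , data = wage.train, cp = 0.0001)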

8
Q

Cross-Validation

A

The idea is that we split the data into equally sized blocks (subgroups). Each block in turn is set aside as the validation data, with the remaining blocks combined to form the training set.

printcp(wage.rp.3)

The xerror column contains cross validation estimates of the (relative) prediction error.

xstd is the standard error (i.e. an estimate of the uncertainty) of these cross-validation estimates.
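
A sketch of using the cp table to choose the complexity parameter, assuming wage.rp.3 is an rpart fit (the helper object names are illustrative):

# The cp table printed by printcp() is stored in the fitted object.
cp.table <- wage.rp.3$cptable

# Pick the cp value whose cross-validated error (xerror) is smallest.
best.cp <- cp.table[which.min(cp.table[, "xerror"]), "CP"]

# Prune the tree back to that complexity.
wage.rp.cv <- prune(wage.rp.3, cp = best.cp)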

9
Q

The forest benefits from the instability of the trees

A

Each bootstrap resample is likely to generate a different tree, because tree building is brittle: small changes in the data can produce quite different trees.

Considering only a subset of the available predictor variables for each split further helps ensure the trees are different.

10
Q

Random forest: Tuning

A

We can tune the number of predictors considered at each split using mtry.
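
A sketch of comparing a few mtry values by out-of-bag error with the randomForest package (the candidate values and object names are illustrative):

library(randomForest)

# Out-of-bag (OOB) mean squared error for a few values of mtry.
for (m in c(2, 4, 6)) {
  wage.rf <- randomForest(WAGE ~ . , data = wage.train, mtry = m)
  # For regression forests, mse holds the OOB MSE after each tree is
  # added; the final entry is the OOB error of the whole forest.
  cat("mtry =", m, "OOB MSE =", tail(wage.rf$mse, 1), "\n")
}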

11
Q

nodesize

A

Tree depth is controlled by nodesize, the minimum size of nodes before a split is allowed.
This defaults to 5 for regression.
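
A one-line illustration (the value 50 is arbitrary):

# A larger nodesize stops splitting earlier, giving smaller trees.
wage.rf.shallow <- randomForest(WAGE ~ . , data = wage.train, nodesize = 50)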