Regression Trees Flashcards

1
Q

Gini Index

A

Impurity measure of node t.
i(t) = 1 - Σ_j (p_jt)²

p_jt is the relative frequency of class j at node t.

The Gini index of a split is the sum of the i(t) of the resulting nodes, each weighted by the relative number of cases in that node.

We choose the attribute whose split gives the smallest Gini split measure.
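
A minimal sketch in plain Python (function names are illustrative, not from any particular library) of how the node impurity and the weighted Gini of a split could be computed:

  from collections import Counter

  def gini(labels):
      # i(t) = 1 - Σ_j (p_jt)², with p_jt the relative class frequencies at the node
      n = len(labels)
      return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

  def gini_split(child_label_lists):
      # Weighted sum of the child-node impurities (weights = node size / total cases)
      total = sum(len(labels) for labels in child_label_lists)
      return sum(len(labels) / total * gini(labels) for labels in child_label_lists)

  print(gini(["a", "a", "b", "b"]))                 # 0.5: maximally impure for 2 classes
  print(gini_split([["a", "a", "b"], ["b", "b"]]))  # ≈ 0.27: the split reduces impurity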

2
Q

Information gain

A

Measures the increase in homogeneity (reduction of entropy) achieved by a split.

Entropy:
i(t) = - Σ_j p_jt · log(p_jt)

The entropy gain of a split is the difference between the entropy before the split and the sum of the entropies of the nodes after the split, weighted by their relative frequencies.

We choose the split that achieves the greatest reduction in entropy, i.e. the one that maximises the gain.
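
A small companion sketch in plain Python (illustrative names) of entropy and the gain of a split:

  import math
  from collections import Counter

  def entropy(labels):
      # i(t) = - Σ_j p_jt · log2(p_jt)
      n = len(labels)
      return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

  def information_gain(parent_labels, child_label_lists):
      # Entropy before the split minus the weighted entropy of the nodes after the split
      n = len(parent_labels)
      after = sum(len(ch) / n * entropy(ch) for ch in child_label_lists)
      return entropy(parent_labels) - after

  parent = ["a", "a", "b", "b"]
  print(information_gain(parent, [["a", "a"], ["b", "b"]]))  # 1.0: a perfect split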

3
Q

Split Info and Gain Ratio

A

Split info is minus the sum, over the resulting nodes, of the relative number of cases in each node times the log of that relative number (the entropy of the partition itself).
Gain ratio is the entropy gain divided by the split info; it penalises splits that produce many small nodes.
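
A plain-Python sketch of these two quantities (illustrative names; the information gain itself would be computed as in the previous card):

  import math

  def split_info(child_label_lists):
      # - Σ_k (n_k / n) · log2(n_k / n), where n_k is the number of cases in child node k
      n = sum(len(ch) for ch in child_label_lists)
      return -sum(len(ch) / n * math.log2(len(ch) / n) for ch in child_label_lists)

  def gain_ratio(info_gain, child_label_lists):
      # Information gain normalised by the split info of the partition
      return info_gain / split_info(child_label_lists)

  print(split_info([["a", "a"], ["b", "b"]]))  # 1.0 for an even two-way partition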

4
Q

Stop criteria

A
  1. Minimum size of groups
  2. Minimum non-homogeneity of parent group
  3. Maximum number of iterations
  4. Minimum explanatory power
  5. Pruning
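
As a rough illustration (not the only possible mapping), most of these criteria correspond to stopping parameters of scikit-learn's DecisionTreeClassifier; the data set and parameter values below are only examples:

  from sklearn.datasets import load_iris
  from sklearn.tree import DecisionTreeClassifier

  X, y = load_iris(return_X_y=True)

  tree = DecisionTreeClassifier(
      min_samples_leaf=10,         # 1. minimum size of groups
      min_impurity_decrease=0.01,  # 2./4. split only if impurity drops enough
      max_depth=5,                 # 3. maximum number of levels
      ccp_alpha=0.01,              # 5. cost-complexity pruning
  )
  tree.fit(X, y)
  print(tree.get_depth(), tree.get_n_leaves())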
5
Q

CART

A

Characteristics:

  • Variables:
    1 dependent variable (quantitative or qualitative), mixed explanatory variables.
  • Split type: Binary.
  • Splitting rule: based on impurities.
    Makes all possible subdivisions of the explanatory variables and chooses the one that produces the maximum reduction of impurity.
  • Stop rule:
    based on the number of cases in the leaf nodes, with optimisation based on pruning.
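
scikit-learn's DecisionTreeClassifier follows the CART approach (binary, impurity-based splits); a hedged sketch of the pruning-based optimisation, with an example data set and placeholder settings:

  from sklearn.datasets import load_breast_cancer
  from sklearn.model_selection import train_test_split
  from sklearn.tree import DecisionTreeClassifier

  X, y = load_breast_cancer(return_X_y=True)
  X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

  # Candidate pruning levels (ccp_alpha values) from the cost-complexity pruning path
  path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
  scores = {a: DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr).score(X_te, y_te)
            for a in path.ccp_alphas}
  best_alpha = max(scores, key=scores.get)
  print(best_alpha, scores[best_alpha])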
6
Q

CHAID

A
- Variables: 
Qualitative dependent variables 
- Split type: 
can be on binary or multiple nodes
- Splitting rule: 
based on the chi-square test of the null hypothesis of statistical independence between the dependent variable and the explanatory variable.
- Stop rule: 
explicit, and relates to the maximum dimension of the tree, the maximum number of levels, or the minimum number of elements in a node.
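
There is no CHAID implementation in the mainstream Python libraries; purely to illustrate the splitting rule, the chi-square independence test between a candidate explanatory variable and the dependent variable can be run with scipy (the counts below are made up):

  import numpy as np
  from scipy.stats import chi2_contingency

  # Contingency table: rows = categories of the explanatory variable,
  # columns = classes of the dependent variable (illustrative counts)
  table = np.array([[30, 10],
                    [12, 28],
                    [20, 20]])

  chi2, p_value, dof, expected = chi2_contingency(table)
  # A small p-value rejects independence, making this variable a good split candidate
  print(chi2, p_value)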
7
Q

C4.5/C5.0

A

Similar to CART, but differs in the following respects:

  • The segmentation of the nodes is not necessarily binary (multi-way splits are allowed).
  • Predictors and their values are selected based on information gain.
  • The stop criterion is pruning, based on an expected error assigned to each leaf.
8
Q

QUEST

A

Quick, Unbiased, Efficient, Statistical Tree

  • Binary splits
  • The explanatory variable used for a split is chosen before the split point is searched.
  • The association between each predictor variable and the target variable is calculated using the ANOVA F or Levene test (for continuous or ordinal predictors) or the Pearson chi-square test (for nominal predictors).
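
QUEST itself is not available in the common Python libraries, but the association tests it relies on for variable selection are in scipy; the arrays below are placeholders:

  import numpy as np
  from scipy.stats import chi2_contingency, f_oneway, levene

  rng = np.random.default_rng(0)
  x_continuous = rng.normal(size=90)        # a continuous predictor
  target = np.repeat(["a", "b", "c"], 30)   # three target classes

  groups = [x_continuous[target == cls] for cls in ("a", "b", "c")]
  print(f_oneway(*groups))   # ANOVA F test of the predictor across target classes
  print(levene(*groups))     # Levene test (differences in spread)

  # For a nominal predictor, use the Pearson chi-square test on the contingency table
  nominal = rng.choice(["low", "high"], size=90)
  table = np.array([[np.sum((nominal == lv) & (target == cls)) for cls in ("a", "b", "c")]
                    for lv in ("low", "high")])
  print(chi2_contingency(table))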
9
Q

Random Forest

A

The Random Forest method employs, instead of a single tree, a set of decision trees.

Each tree is fitted on a bootstrap resample of the data (N samples drawn with replacement) and on a random subset of the predictor variables.

X trees are estimated in this way, and the final classification is the one suggested by the majority of the trees.
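
A minimal scikit-learn sketch (data set and parameter values are just examples):

  from sklearn.datasets import load_breast_cancer
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import train_test_split

  X, y = load_breast_cancer(return_X_y=True)
  X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

  forest = RandomForestClassifier(
      n_estimators=200,     # number of trees ("X" above)
      max_features="sqrt",  # random subset of predictors considered at each split
      bootstrap=True,       # each tree sees a bootstrap resample of the data
      random_state=0,
  )
  forest.fit(X_tr, y_tr)
  print(forest.score(X_te, y_te))  # final prediction is the majority vote of the trees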

10
Q

Pros and cons of Random Forest

A

Pros:

  • High predictive performance/Generalisation
  • Parameters easy to choose

Cons:

  • Complexity/Processing time
  • More difficult interpretation
11
Q

XGBoost

A

The forest of trees is estimated sequentially, such that each new tree takes into account the prediction errors of the previous tree.

Pros:

  • Computational speed
  • Reduction of overfitting problems
  • Ability to easily define custom objective functions

Objective function of XGBoost: Loss + Regularisation
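
A hedged sketch using the separately installed xgboost package; the data set and parameter values are only illustrative:

  from sklearn.datasets import load_breast_cancer
  from sklearn.model_selection import train_test_split
  from xgboost import XGBClassifier

  X, y = load_breast_cancer(return_X_y=True)
  X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

  model = XGBClassifier(
      n_estimators=300,   # trees are added sequentially
      learning_rate=0.1,  # shrinks each new tree's contribution
      max_depth=3,
      reg_lambda=1.0,     # L2 regularisation term in the objective (Loss + Regularisation)
  )
  model.fit(X_tr, y_tr)
  print(model.score(X_te, y_te))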

12
Q

Ensemble, Bagging, Boosting

A
  • Ensemble: collection of models that are combined (e.g. with some kind of mean) in order to improve the final accuracy.

Types of Ensembles:

  • Bagging: the predictors are built independently and combined by averaging or voting.
  • Boosting: each predictor is built to improve on the errors of the previous iteration.
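
A short scikit-learn illustration of the two ensemble types (models and data are only examples):

  from sklearn.datasets import load_breast_cancer
  from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
  from sklearn.model_selection import cross_val_score
  from sklearn.tree import DecisionTreeClassifier

  X, y = load_breast_cancer(return_X_y=True)

  # Bagging: independent trees on bootstrap resamples, combined by voting
  bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

  # Boosting: trees built sequentially, each one correcting the previous ones
  boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)

  for name, model in [("bagging", bagging), ("boosting", boosting)]:
      print(name, cross_val_score(model, X, y, cv=5).mean())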