Decision Trees Flashcards

1
Q

Decision Tree - decision making

A
  1. Start with the whole dataset and create every possible binary split based on each feature:
    - discrete feature: does the example belong to this category or not?
    - continuous feature: is the value below or above a threshold?
  2. Calculate the Gini impurity produced by every candidate split.
  3. Pick the split that reduces the impurity the most (sketched below), then repeat on each resulting subset.
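A minimal sketch of scoring one candidate split with Gini impurity (NumPy and the toy data are assumptions for illustration):

    import numpy as np

    def gini(labels):
        # Gini impurity = 1 - sum of squared class proportions
        _, counts = np.unique(labels, return_counts=True)
        p = counts / max(counts.sum(), 1)
        return 1.0 - np.sum(p ** 2)

    def split_impurity(x, y, threshold):
        # weighted Gini impurity of the two groups made by "x < threshold"
        left, right = y[x < threshold], y[x >= threshold]
        return len(left) / len(y) * gini(left) + len(right) / len(y) * gini(right)

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # one continuous feature
    y = np.array([0, 0, 1, 1, 1])            # class labels
    best_threshold = min(x, key=lambda t: split_impurity(x, y, t))  # -> 3.0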
2
Q

classification trees vs regression trees

A

classification: outputs are discrete. Leaf values are set to the most common outcome in the leaf.
regression: outputs are numerical. Leaf values are set to the mean of the outcomes in the leaf. Splits are scored with MSE or RSS instead of Gini impurity.
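A minimal sketch of the two variants, assuming scikit-learn (the toy data is made up):

    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

    X = [[1], [2], [3], [4]]

    # classification: leaves predict the majority class, splits scored by Gini
    clf = DecisionTreeClassifier(criterion="gini").fit(X, [0, 0, 1, 1])

    # regression: leaves predict the mean target, splits scored by squared error
    reg = DecisionTreeRegressor(criterion="squared_error").fit(X, [1.0, 1.5, 3.0, 3.2])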

3
Q

Decision tree - how to avoid overfitting

A

Prepruning (prune while you build the tree)

  • leaf size: stop splitting when a node has too few examples
  • depth: stop splitting at a certain depth
  • purity: stop splitting if a large enough fraction of the examples are the same class
  • gain threshold: stop splitting when the information gain becomes too small

Postpruning (prune after you’ve finished building the tree)
- merge leaves if doing so decreases error on a held-out validation set
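A minimal sketch of how these controls appear as scikit-learn decision-tree hyperparameters (the values are arbitrary examples):

    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier(
        min_samples_leaf=20,         # leaf size: no leaf smaller than 20 examples
        max_depth=5,                 # depth: stop splitting past depth 5
        min_impurity_decrease=0.01,  # gain threshold: skip splits with tiny improvement
        ccp_alpha=0.001,             # cost-complexity (post-)pruning strength
    )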

4
Q

ensemble methods

A

combining many weak models to form a strong model.

We train multiple models on the data, each one different: they could be trained on different subsets of the data, trained in different ways, or even be completely different types of models.

For an ensemble to work, each model has to capture something new and different so that it adds incremental insight.
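A minimal sketch of an ensemble built from different model types, assuming scikit-learn's VotingClassifier (the model choices are arbitrary examples):

    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    # three different model types vote on the final prediction
    ensemble = VotingClassifier(estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(max_depth=5)),
        ("nb", GaussianNB()),
    ])
    # ensemble.fit(X_train, y_train); ensemble.predict(X_test)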

5
Q

Decision Tree - bagging

A

Bootstrap aggregating: create each model from a bootstrap sample (rows drawn with replacement) and aggregate the results by voting or averaging. Can be used with any sort of model, but is generally used with decision trees.
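A minimal sketch assuming scikit-learn's BaggingClassifier (values are arbitrary; older scikit-learn versions name the first argument base_estimator):

    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    # 100 trees, each fit on a bootstrap sample of the rows; predictions are combined by vote
    bagged_trees = BaggingClassifier(
        estimator=DecisionTreeClassifier(),
        n_estimators=100,
        bootstrap=True,
    )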

6
Q

Random Forest

A

It takes bagging but doesn't just bootstrap rows: at each split, every tree only considers a random subset of the features. So some trees split on the most important features while others are forced to split on less important ones, which decorrelates the trees.
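A minimal sketch assuming scikit-learn's RandomForestClassifier (values are arbitrary examples):

    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(
        n_estimators=200,     # number of bootstrapped trees
        max_features="sqrt",  # random subset of features considered at each split
    )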

7
Q

Random Forest - pros and cons

A

pros

  • no feature scaling needed
  • good performance
  • models nonlinear relationships

cons

  • can be expensive to train
  • not interpretable (no inference)
8
Q

Gradient Boosting Regressor

A

Goal: minimize the sum of squared errors.
Start with the mean as the prediction, subtract it from y to get residuals, then fit a tree to the residuals (outcome = residuals, inputs = features). Add that tree's predictions, scaled by a learning rate, to the running prediction and repeat on the new residuals. The learning rate slows down the reduction in residuals so each step is small and the final prediction is more precise.

good for capturing non-linearity
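A minimal sketch of the loop described above, assuming scikit-learn's DecisionTreeRegressor as the weak learner (settings are arbitrary):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def boost(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
        prediction = np.full(len(y), y.mean())  # start with the mean
        trees = []
        for _ in range(n_trees):
            residual = y - prediction           # what the model still gets wrong
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
            prediction += learning_rate * tree.predict(X)  # small, shrunken step toward y
            trees.append(tree)
        return y.mean(), trees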

9
Q

Gradient Boosting Regressor Hyperparameters

A
  1. loss - controls the loss function to minimize
  2. n_estimators - how many decision trees to grow
  3. learning_rate - start with 0.1 and go down
  4. max_depth - how deep to grow each tree
  5. subsample - fraction of rows used to fit each tree, similar to bagging in random forest. 1 = use 100% of the data, 0.5 = 50%, etc. (all five are sketched below)
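A minimal sketch of these hyperparameters on scikit-learn's GradientBoostingRegressor (the values are arbitrary starting points):

    from sklearn.ensemble import GradientBoostingRegressor

    gbr = GradientBoostingRegressor(
        loss="squared_error",  # loss function to minimize
        n_estimators=500,      # number of trees to grow
        learning_rate=0.1,     # shrinkage applied to each tree's contribution
        max_depth=3,           # depth of each tree
        subsample=0.8,         # fraction of rows used per tree
    )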
10
Q

Gradient Boosting Classifier

A

Goal: minimize the residual between y and the predicted probability of class y (what predict_proba returns); each new tree is fit to those residuals.
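A minimal usage sketch assuming scikit-learn's GradientBoostingClassifier (data names are placeholders):

    from sklearn.ensemble import GradientBoostingClassifier

    gbc = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1)
    # gbc.fit(X_train, y_train)
    # gbc.predict_proba(X_test)  # the probabilities the residuals are computed against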

11
Q

Optimization

A

Throughout machine learning we have a constant goal: find the model that best predicts the target from the features. We generally define "best" as minimizing some cost function or maximizing a score function.

12
Q

derivative

A

Slope of the line. When our graph is non-linear and we want the slope at a specific point, we can find it by calculating the derivative at that point.
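A minimal sketch of approximating a derivative numerically (the function and step size are arbitrary examples):

    def derivative(f, x, h=1e-6):
        # slope at x: rise over run for a tiny step h
        return (f(x + h) - f(x)) / h

    derivative(lambda x: x ** 2, 3.0)  # ~6.0, since d/dx x^2 = 2x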

13
Q

gradient descent

A

The gradient gives us the direction of steepest increase, so its negative points in the direction of steepest decrease. Gradient descent repeatedly steps in that downhill direction and keeps following the decrease until we hit the bottom.

We can apply a learning rate to make the steps smaller.

If the learning rate is small enough, gradient descent should lead us to a minimum (the global minimum when the cost function is convex).
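A minimal sketch of gradient descent on a one-dimensional cost function (the function, learning rate, and step count are arbitrary examples):

    def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
        x = x0
        for _ in range(steps):
            x -= learning_rate * grad(x)  # step opposite the gradient (downhill)
        return x

    # minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3)
    gradient_descent(lambda x: 2 * (x - 3), x0=0.0)  # -> close to 3.0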

14
Q

Neural networks - forward propagation

A

We calculate the outcome from the feature values and weights by passing them through the layers (weighted sum, then activation, at each layer) until we arrive at the output neuron.
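A minimal sketch of a forward pass through one hidden layer (NumPy, sigmoid activations, and the weight names are assumptions for illustration):

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def forward(x, W1, b1, W2, b2):
        hidden = sigmoid(W1 @ x + b1)       # hidden layer: weighted sum + activation
        output = sigmoid(W2 @ hidden + b2)  # output neuron
        return output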

15
Q

Neural networks - backward propagation

A

Moving from the output back to the beginning: we work out how much each weight contributed to the error (via the chain rule) and update the weights with gradient descent to minimize that error.
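A minimal sketch of one backward-propagation update for the small network above (NumPy, sigmoid activations, squared-error loss; all names are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def backprop_step(x, y, W1, b1, W2, b2, lr=0.1):
        # forward pass, keeping the intermediate values
        h = sigmoid(W1 @ x + b1)
        y_hat = sigmoid(W2 @ h + b2)
        # backward pass: chain rule from the output error toward the inputs
        d_out = (y_hat - y) * y_hat * (1 - y_hat)  # error signal at the output
        d_hid = (W2.T @ d_out) * h * (1 - h)       # error signal at the hidden layer
        # gradient-descent updates
        W2 -= lr * np.outer(d_out, h)
        b2 -= lr * d_out
        W1 -= lr * np.outer(d_hid, x)
        b1 -= lr * d_hid
        return W1, b1, W2, b2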

16
Q

Epoch

A

one full pass through the training data, running forward and backward propagation on every example (or mini-batch)

17
Q

Stochastic gradient descent

A

updates the weights using a single example (or a small mini-batch) at a time instead of the whole dataset, so each update is cheap but noisy; it generally drops the error faster than full-batch gradient descent for the same amount of computation
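A minimal sketch of the mini-batch update loop (NumPy; grad_on_batch, the batch size, and the learning rate are illustrative assumptions):

    import numpy as np

    def sgd(grad_on_batch, w, X, y, lr=0.01, batch_size=32, epochs=10):
        n = len(X)
        for _ in range(epochs):
            order = np.random.permutation(n)  # shuffle rows each epoch
            for start in range(0, n, batch_size):
                idx = order[start:start + batch_size]
                w -= lr * grad_on_batch(w, X[idx], y[idx])  # update from one small batch
        return w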

18
Q

Neural networks - overfitting

A
  1. limit the number of hidden units
  2. limit the norm of the weights
  3. stop the learning before it has time to overfit (early stopping)
  4. dropout - randomly drop a certain percentage of neurons in each layer during training (all four are sketched below)
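A minimal sketch of the four controls, assuming Keras (layer sizes and rates are arbitrary examples):

    from tensorflow import keras

    model = keras.Sequential([
        keras.layers.Dense(32, activation="relu",                            # 1. few hidden units
                           kernel_regularizer=keras.regularizers.l2(0.01)),  # 2. penalize weight norms
        keras.layers.Dropout(0.5),                                           # 4. drop 50% of neurons
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    early_stop = keras.callbacks.EarlyStopping(patience=5)                   # 3. stop before overfitting
    # model.compile(loss="binary_crossentropy", optimizer="adam")
    # model.fit(X, y, validation_split=0.2, callbacks=[early_stop])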