Random Forest Flashcards

1
Q

What is random forest? 👶

A

Random Forest is a machine learning method for regression and classification which is composed of many decision trees. Random Forest belongs to a larger class of ML algorithms called ensemble methods (in other words, it involves the combination of several models to solve a single prediction problem).
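
A minimal sketch of the idea, assuming scikit-learn and its bundled iris dataset:

    # Fit a forest of decision trees and query its aggregated prediction.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 100 decision trees, each trained on a bootstrap sample of the data.
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X_train, y_train)

    # Classification: each tree votes and the forest aggregates the votes.
    print(forest.score(X_test, y_test))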

2
Q

Why do we need randomization in random forest? ‍⭐️

A

Random forest is an extension of the bagging algorithm: it draws random samples from the training dataset (with replacement), trains a separate model on each sample, and averages their predictions. In addition, each time a split in a tree is considered, random forest draws a random subset of m features from the full set of n features (without replacement) and uses only those features as candidates for the split (for example, m = sqrt(n)).

Training each tree on a different bootstrap sample of the data reduces variance. Sampling features at each split decorrelates the trees: without it, a few strong predictors would dominate every tree, the trees would make similar errors, and averaging them would reduce variance far less.
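
A schematic sketch of the two randomization steps; the helper names bootstrap_sample and split_candidates are illustrative, not part of any library:

    import numpy as np

    rng = np.random.default_rng(0)

    def bootstrap_sample(X, y):
        # Bagging step: sample rows with replacement.
        idx = rng.integers(0, len(X), size=len(X))
        return X[idx], y[idx]

    def split_candidates(n_features, m):
        # Feature-subsampling step: at each split, only m features drawn
        # without replacement are considered, e.g. m = sqrt(n).
        return rng.choice(n_features, size=m, replace=False)

    X = rng.normal(size=(100, 9))
    y = rng.integers(0, 2, size=100)

    Xb, yb = bootstrap_sample(X, y)          # per-tree training data
    print(split_candidates(X.shape[1], 3))   # per-split feature subset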

3
Q

What are the main parameters of the random forest model? ‍⭐️

A

max_depth: the longest path between the root node and a leaf; caps how deep each tree can grow
min_samples_split: the minimum number of observations a node must hold to be considered for splitting
max_leaf_nodes: caps the number of leaf nodes per tree and hence limits its growth
min_samples_leaf: the minimum number of samples required in a leaf node
n_estimators: the number of trees in the forest
max_samples: the fraction of the original dataset given to any individual tree
max_features: the maximum number of features considered as split candidates at each node
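
In scikit-learn these correspond to RandomForestClassifier / RandomForestRegressor arguments; a sketch with purely illustrative values:

    from sklearn.ensemble import RandomForestClassifier

    # Illustrative settings only; the right values depend on the data.
    forest = RandomForestClassifier(
        n_estimators=200,       # number of trees
        max_depth=10,           # longest root-to-leaf path per tree
        min_samples_split=5,    # observations needed to consider a split
        min_samples_leaf=2,     # minimum samples in a leaf
        max_leaf_nodes=100,     # caps the number of leaves per tree
        max_samples=0.8,        # fraction of the dataset given to each tree
        max_features="sqrt",    # features considered at each split
    )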

4
Q

How do we select the depth of the trees in random forest? ‍⭐️

A

The greater the depth, the more information a tree extracts from the data. There is a limit, though: even an algorithm that is fairly robust to overfitting may, with very deep trees, learn complex patterns in the noise and overfit to it. There is no hard rule for choosing the depth, but the literature suggests a few ways to limit tree growth and prevent overfitting (a tuning sketch follows this list):

limit the maximum depth of a tree
limit the number of test (internal) nodes
require a minimum number of samples at a node before it may be split
do not split a node when at least one of the resulting subsamples would fall below a given size threshold
stop developing a node if splitting it does not sufficiently improve the fit.
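
The tuning sketch promised above, assuming scikit-learn and synthetic stand-in data; it cross-validates a few growth limits and keeps the best combination:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=1000, random_state=0)  # stand-in data

    # Candidate growth limits; None lets trees grow until leaves are pure.
    param_grid = {
        "max_depth": [None, 5, 10, 20],
        "min_samples_split": [2, 10, 50],
    }

    search = GridSearchCV(
        RandomForestClassifier(n_estimators=100, random_state=0),
        param_grid,
        cv=5,
    )
    search.fit(X, y)
    print(search.best_params_)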

5
Q

How do we know how many trees we need in random forest? ‍⭐️

A

The number of trees in a random forest is controlled by n_estimators. Increasing the number of trees does not cause overfitting; it reduces the variance of the averaged prediction, with diminishing returns beyond some point. There is no fixed rule for choosing the number: it is tuned to the data, typically by starting from a moderate value (one suggested starting point is the square of the number of features n) and increasing it until validation or out-of-bag performance stops improving.
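
One common practical recipe, assuming scikit-learn: grow the forest incrementally and stop once the out-of-bag error flattens.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, random_state=0)  # stand-in data

    # warm_start reuses already-grown trees when n_estimators increases;
    # oob_score evaluates each tree on the samples it never saw.
    forest = RandomForestClassifier(warm_start=True, oob_score=True,
                                    random_state=0)

    for n in [25, 50, 100, 200, 400]:
        forest.set_params(n_estimators=n)
        forest.fit(X, y)
        print(n, round(1.0 - forest.oob_score_, 4))  # OOB error usually plateaus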

6
Q

Is it easy to parallelize training of a random forest model? How can we do it? ‍⭐️

A

Yes. Each tree in a random forest is trained independently of the others, so training is embarrassingly parallel: trees (or whole sub-forests) can be grown on separate cores or machines and then merged into a single forest. In scikit-learn this is exposed through the n_jobs parameter; in R, the foreach package (with its .multicombine option) can grow sub-forests in parallel and combine them with randomForest::combine.
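
A minimal scikit-learn sketch of the same idea, with tree construction spread over all available cores:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=10_000, random_state=0)  # stand-in data

    # n_jobs=-1 grows the independent trees on all cores;
    # the same flag also parallelizes prediction.
    forest = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
    forest.fit(X, y)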

7
Q

What are the potential problems with many large trees? ‍⭐️

A

Overfitting: Very deep, fully grown trees can make the model complex enough to memorize noise in the training data without generalizing well to new, unseen data; the size of the trees, more than their number, drives this risk.

Slow prediction time: As the number of trees in the forest increases, the prediction time for new data points can become quite slow. This can be a problem when you need to make predictions in real-time or on a large dataset.

Memory consumption: Random Forest models with many large trees can consume a lot of memory, which can be a problem when working with large datasets or on limited hardware.

Lack of interpretability: Random Forest models with many large trees can be difficult to interpret, making it harder to understand how the model is making predictions or what features are most important.

Difficulty in tuning: As the number and size of the trees increase, the tuning process becomes more complex and computationally expensive.

It’s important to keep in mind that the number of trees in a Random Forest should be chosen based on the specific problem and dataset, rather than using a large number of trees by default. In practice, the number of trees in a random forest is chosen based on the trade-off between the computational cost and the performance.
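
A rough sketch, assuming scikit-learn and synthetic data, of how prediction time and serialized model size grow with the number of trees:

    import pickle
    import time
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

    for n_trees in [10, 100, 1000]:
        forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
        forest.fit(X, y)
        start = time.perf_counter()
        forest.predict(X)
        elapsed = time.perf_counter() - start
        size_mb = len(pickle.dumps(forest)) / 1e6  # rough in-memory footprint
        print(n_trees, "trees:", round(elapsed, 3), "s,", round(size_mb, 1), "MB")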

8
Q

What happens when we have correlated features in our data? ‍⭐️

A

In random forest, since a subset of features is sampled for each split, information that is shared by two correlated features is roughly twice as likely to be available among the split candidates as information carried by a single feature, so the trees are biased toward that redundant signal.

In general, adding correlated features means they carry largely the same information, which reduces the robustness of your model. Each time you train the model, it may pick one feature or the other to “do the same job”, i.e. explain some variance, reduce some entropy, etc., so the importance attributed to that information is split unpredictably across the correlated features.
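
A small sketch, assuming scikit-learn, that makes the effect visible: duplicating a feature splits the importance it previously held alone.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=2000, n_features=5, n_informative=3,
                               n_redundant=0, random_state=0)

    # Append a near-copy of feature 0; columns 0 and 5 are now highly correlated.
    rng = np.random.default_rng(0)
    X_dup = np.hstack([X, X[:, [0]] + 0.01 * rng.normal(size=(len(X), 1))])

    forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_dup, y)
    # The importance feature 0 held alone is now shared with its duplicate.
    print(forest.feature_importances_.round(3))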
