Random Forest Flashcards

1
Q

Inductive Learning

A

Also known as discovery learning; a process where the learner discovers rules by observing examples. This is different from deductive learning, where students are given rules that they then need to apply.

2
Q

Decision Tree Structure

A

Consists of a root node (where the tree starts)
Branches (splits with children)
Leaf nodes (ends of the tree - represent possible outcomes)
Nodes (where a parent and child meet)
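
A minimal sketch of that structure in Python (the class and field names are illustrative, not from the course):

    class Node:
        # A decision-tree node: internal nodes test an attribute and branch on its
        # values; leaf nodes carry a predicted outcome; the root is simply the top node.
        def __init__(self, attribute=None, outcome=None):
            self.attribute = attribute   # attribute tested at this node (None for a leaf)
            self.outcome = outcome       # predicted class at a leaf (None for internal nodes)
            self.children = {}           # branches: attribute value -> child Node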

3
Q

Experience Table

A

A labeled data set with your target variable and all of the features for which data was collected
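
For example, a tiny experience table (values made up for illustration) with three features and a target column might look like:

    Outlook    Temperature    Windy    PlayTennis (target)
    sunny      hot            no       no
    overcast   mild           no       yes
    rain       cool           yes      no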

4
Q

What kind of algorithm will we use for our decision trees?

A

ID3

5
Q

Decision Tree Algorithm

A

(1) Choose the best attribute to split the remaining instances - that attribute becomes the root
(2) Repeat the process with the children
(3) Stop when all instances have the same target attribute value, there are no more attributes, or there are no more instances
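
A minimal Python sketch of this loop, assuming categorical attributes and rows stored as dicts (function names are illustrative; the information-gain measure it relies on is covered in later cards):

    import math
    from collections import Counter

    def entropy(labels):
        # Shannon entropy (in bits) of a list of class labels
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

    def best_attribute(rows, attributes, target):
        # The attribute whose split yields the highest information gain
        base = entropy([r[target] for r in rows])
        def gain(attr):
            remainder = 0.0
            for value in set(r[attr] for r in rows):
                subset = [r[target] for r in rows if r[attr] == value]
                remainder += len(subset) / len(rows) * entropy(subset)
            return base - remainder
        return max(attributes, key=gain)

    def id3(rows, attributes, target):
        labels = [r[target] for r in rows]
        if len(set(labels)) == 1:                     # stop: all instances agree on the target
            return labels[0]
        if not attributes:                            # stop: no attributes left -> majority class
            return Counter(labels).most_common(1)[0][0]
        attr = best_attribute(rows, attributes, target)
        tree = {attr: {}}
        for value in set(r[attr] for r in rows):      # repeat the process with the children
            subset = [r for r in rows if r[attr] == value]
            tree[attr][value] = id3(subset, [a for a in attributes if a != attr], target)
        return tree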

6
Q

How do you identify the best attribute to become the root of your decision tree?

A

Information gain

7
Q

What makes a good decision tree?

A

It must be small AND classify accurately.

Small trees are less susceptible to overfitting and are easier to understand.

8
Q

Information Gain and Impurity Levels

A

{xxxxxyxxxxyxxx} not pure
{xxxxxxxxxxxxxx} as pure as it gets
{xxxxxxxyyyyyyyy} least pure
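
As a quick worked check (assuming Shannon entropy in bits, H = -sum of p * log2(p) over the class proportions): the all-x set has entropy 0 (as pure as it gets), the set with 2 y's out of 14 has entropy of roughly 0.59, and the roughly half-x, half-y set has entropy close to 1.0, the maximum for two classes (least pure).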

9
Q

Information Gain

A

We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between classes to be learned.

Information gain tells us how important a given attribute of the feature vectors is

We use it to decide the order of attributes in the nodes of a decision tree
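
A small Python sketch of the idea, assuming a split already expressed as groups of child labels (function names are illustrative):

    import math
    from collections import Counter

    def entropy(labels):
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

    def information_gain(parent_labels, child_label_groups):
        # Gain = entropy before the split minus the weighted entropy of the children
        n = len(parent_labels)
        remainder = sum(len(g) / n * entropy(g) for g in child_label_groups)
        return entropy(parent_labels) - remainder

    # A split that separates 4 x's and 4 y's into two pure children gains the full 1 bit
    print(information_gain(['x'] * 4 + ['y'] * 4, [['x'] * 4, ['y'] * 4]))  # 1.0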

10
Q

Decision Tree CONS

A

Suffer from errors propagating throughout the tree (this becomes more of an issue as the number of classes increases)

11
Q

Error Propagation

A

Since decision trees work by a series of local decisions, what happens when one of these local decisions is wrong? Everything beyond that point is incorrect, and we may never return to the right path

12
Q

Noisy data in decision trees

A

When two instances have the same attribute/value pairs but different classifications

Some values of the attributes are incorrect because of errors in the data acquisition process or the preprocessing phase

Some attributes may be irrelevant to the decision-making process (e.g., the color of the die used in a roll)

13
Q

Overfitting in Decision Trees

A

Irrelevant attributes can VERY EASILY lead to overfitting

Too little training data can also lead to overfitting

14
Q

How to avoid overfitting in Decision Trees

A

Stop growing the tree when the data split is not statistically significant

Acquire more training data

Remove irrelevant attributes

Grow a full tree, then post-prune
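
A sketch of what those options look like with scikit-learn's DecisionTreeClassifier (the parameter values are arbitrary examples):

    from sklearn.tree import DecisionTreeClassifier

    # Stop growing early: require a minimum amount of data before/after each split
    early_stop = DecisionTreeClassifier(min_samples_split=20, min_samples_leaf=5)

    # Grow a full tree, then post-prune via cost-complexity pruning
    # (a larger ccp_alpha prunes more aggressively)
    post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)

    # early_stop.fit(X_train, y_train)   # X_train / y_train: your experience table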

15
Q

How to select the best decision tree

A

Measure performance over training data
Measure performance over separate validation sets
Add complexity penalty to performance measure
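
For instance, measuring performance on a held-out validation set with scikit-learn (the iris data here is just a stand-in):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
    print(accuracy_score(y_val, tree.predict(X_val)))   # performance on unseen validation data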

16
Q

Bootstrapping

A

Very important across all of statistics. You create new datasets by sampling with replacement from your original dataset. Some values may get repeated in your set.

The closer your bootstrap N is to the original N, the more overlap you will get
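
A quick illustration with NumPy (the toy dataset is made up):

    import numpy as np

    rng = np.random.default_rng(0)
    data = np.arange(1, 11)                              # original dataset, N = 10

    # Bootstrap sample: draw N values with replacement, so some repeat and some are left out
    bootstrap = rng.choice(data, size=len(data), replace=True)
    print(bootstrap)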

17
Q

Evaluating decision trees

A

Accuracy - how many things can it classify correctly?
Scalability - how well model generation and prediction scale to larger datasets, and how fast they run
Robustness - how well does it perform with missing or noisy data?
Intuitive appeal - results are easily understood, so decisions can be made from them

18
Q

Ensemble Learning

A

Combining weak classifiers in order to produce a strong classifier

19
Q

Random Forest

A

Solves the weaknesses of decision trees by introducing randomness into the equation.

A Random Forest is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes predicted by the individual trees.

Introduces the idea of bagging.

20
Q

Bagging

A

Bootstrap Aggregation. Used to avoid overfitting (important since RF trees are unpruned) and to improve accuracy / stability

It is broken into two steps - bootstrap a sample set and aggregate

3 variables:
n - number of data points in your original dataset
n' - number of data points you want in each bag
m - number of bags to create

Works best when n' < n (about 60% is a typical number)

bagging reduces the variance of the base learner but has a limited effect on the bias.

Strongest if you are using strong learners
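
A minimal sketch of the bootstrap step with those three variables, in NumPy (names and defaults are illustrative):

    import numpy as np

    def make_bags(n, m=25, frac=0.6, seed=0):
        # Bootstrap step: m bags, each holding n' = frac * n indices
        # drawn with replacement from the n original data points
        rng = np.random.default_rng(seed)
        n_prime = int(frac * n)
        return [rng.choice(n, size=n_prime, replace=True) for _ in range(m)]

    bags = make_bags(n=1000, m=25)
    # Aggregation step: fit one model per bag (e.g. model.fit(X[idx], y[idx]) for idx in bags),
    # then majority-vote (classification) or average (regression) their predictions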

21
Q

AdaBoost

A

A variation on bagging where points that are modeled poorly by your ensemble are weighted so they are more likely to be picked in the subsequent 'random' bags of data.
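
A simplified, resampling-style sketch of that weighting idea (real AdaBoost also weights each learner's vote by its accuracy; that part is omitted, and the doubling factor here is arbitrary):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def boosted_bags(X, y, rounds=10, frac=0.6, seed=0):
        rng = np.random.default_rng(seed)
        n = len(X)
        weights = np.full(n, 1.0 / n)                 # start with every point equally likely
        learners = []
        for _ in range(rounds):
            idx = rng.choice(n, size=int(frac * n), replace=True, p=weights)
            learner = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
            learners.append(learner)
            wrong = learner.predict(X) != y
            weights[wrong] *= 2.0                     # poorly modeled points get picked more often
            weights /= weights.sum()                  # keep the weights a probability distribution
        return learners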

22
Q

Random Forest Algorithm

A

1 - Assign variables: N = number of training cases, M = total number of features available for classification

2 - m = number of input variables to be used at each node of the tree (should be smaller than M)

3 - Choose a training set for the tree (either bagging or AdaBoost)

4 - At each node of the tree, randomly choose m variables to use. Calculate the best split on these m variables in the training set

5 - Grow each tree fully, and do not prune

For a new prediction, the new sample is pushed down all the trees and the majority vote (or average) of all the trees is the prediction it is given
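
A compact sketch of steps 1-5 in Python, leaning on scikit-learn's DecisionTreeClassifier for the per-node random choice of m features (its max_features parameter); the rest of the setup is illustrative:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def grow_forest(X, y, n_trees=100, m="sqrt", seed=0):
        rng = np.random.default_rng(seed)
        n = len(X)                                            # N training cases
        forest = []
        for _ in range(n_trees):
            idx = rng.choice(n, size=n, replace=True)         # bagged training set for this tree
            tree = DecisionTreeClassifier(max_features=m)     # m random features tried per node,
            forest.append(tree.fit(X[idx], y[idx]))           # grown fully, no pruning
        return forest

    def forest_predict(forest, X_new):
        # Push each new sample down every tree, then take the majority vote
        # (assumes integer class labels)
        votes = np.array([t.predict(X_new) for t in forest]).astype(int)
        return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)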

23
Q

Random forest algorithm simplified

A

Grow a forest of many trees. (R default is 500)

Grow each tree on an independent bootstrap sample* from the training data.

At each node:
Select m variables at random out of all M possible variables (independently for each node).

Find the best split on the selected m variables.

Grow the trees to maximum depth (classification).

Vote/average the trees to get predictions for new data.

*Sample N cases at random with replacement.
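
With scikit-learn, the same recipe looks roughly like this (500 trees to mirror the R default mentioned above; "sqrt" means m = sqrt(M) features are tried at each node; the iris data is just a placeholder):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)

    forest = RandomForestClassifier(n_estimators=500,     # number of trees in the forest
                                    max_features="sqrt",  # m variables considered at each node
                                    max_depth=None)       # grow each tree to maximum depth
    forest.fit(X, y)
    print(forest.predict(X[:3]))                          # vote of all 500 trees per sample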

24
Q

Random Forest Pros

A

-Can classify and regress
-Handles categorical predictors
-Computationally simple and quick
-No distribution assumptions
-Can handle highly non-linear data / classifications
-Automatic variable selection
-Resistant to overfitting
-Handles missing values (using proximities)

25
Q

Random forest cons

A

HARD TO UNDERSTAND - a forest of hundreds of trees is far less interpretable than a single decision tree

26
Q

How does Random Forest improve on decision trees?

A

Accuracy & stability (if you change the data a little, the individual trees may change but the forest remains stable)