07.a Decision Trees Flashcards

1
Q

What is another name for a Decision Tree?

A

Prediction Tree

2
Q

Is the Decision Tree supervised or unsupervised machine learning?

A

Supervised

3
Q

What is the primary task performed by Classifiers?

A

The primary task performed by classifiers is to assign class labels to new observations. The set of labels for classifiers is predetermined, unlike in clustering, which discovers the structure without a training set and allows the data scientist optionally to create and assign labels to the clusters.

4
Q

What are the two fundamental classification methods?

A

Decision Trees and Naive Bayes

5
Q

Must the input variables to the decision tree be continuous or categorical?

A

The input variables to a decision tree can be either continuous or categorical.
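
In practice, library support varies. As a minimal sketch (the data and column names here are made up, not from the source): scikit-learn's trees expect numeric arrays, so categorical inputs are typically encoded first.

  import pandas as pd
  from sklearn.tree import DecisionTreeClassifier

  # Hypothetical data: one continuous and one categorical input variable
  X = pd.DataFrame({"income": [30, 60, 90, 20],
                    "married": ["yes", "no", "yes", "no"]})
  X_enc = pd.get_dummies(X, columns=["married"])  # one-hot encode the categorical column
  clf = DecisionTreeClassifier().fit(X_enc, ["no", "yes", "yes", "no"])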

6
Q

What is the name of the shortest decision tree, one which has only a root node and leaf nodes?

A

A Decision Stump

7
Q

What are the names of the nodes beyond the root node in a decision tree?

A

Leaf nodes (also known as terminal nodes) are at the end of the last branches on the tree. They represent class labels—the outcome of all the prior decisions. The path from the root to a leaf node contains a series of decisions made at various internal (decision) nodes.

8
Q

What are the two types of decision tree?

A

Classification trees
Used when the output is discrete, e.g. binary class labels.
Regression trees
Used when the output is continuous, such as predicted prices and probabilities.
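
As a minimal illustration (the toy data is assumed, not from the source), scikit-learn provides both tree types:

  from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

  # Classification tree: discrete output (class labels)
  clf = DecisionTreeClassifier().fit([[0], [1], [2], [3]], ["no", "no", "yes", "yes"])
  print(clf.predict([[2.5]]))   # ['yes']

  # Regression tree: continuous output, e.g. a predicted price
  reg = DecisionTreeRegressor().fit([[0], [1], [2], [3]], [10.0, 12.0, 30.0, 33.0])
  print(reg.predict([[2.5]]))   # a numeric value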

9
Q

What is meant by the depth of a node in a decision tree?

A

The depth of a node is the number of steps required to reach the node from the root

10
Q

What does a Classification Tree do?

A

A Classification Tree determines a set of logical if-then conditions with which to classify observations, for example discriminating between three types of flower based on certain features.
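
A quick sketch of those if-then conditions, using scikit-learn's bundled iris data (three flower species); max_depth=2 is an arbitrary choice to keep the printout short:

  from sklearn.datasets import load_iris
  from sklearn.tree import DecisionTreeClassifier, export_text

  iris = load_iris()
  tree = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)
  # Prints nested "if feature <= threshold" rules separating the three species
  print(export_text(tree, feature_names=list(iris.feature_names)))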

11
Q

What does a Regression Tree do?

A

The Regression Tree is used when the target variable is numerical or continuous in nature. A regression model is fitted to the target variable using each of the independent variables, and each split is chosen to minimize the sum of squared errors.
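
A minimal sketch of how one candidate split is scored (the helper names and toy data are assumed): each side of the split is predicted by its mean, and the split with the lowest total squared error wins.

  def sse(values):
      # Sum of squared errors around the mean prediction for this side
      mean = sum(values) / len(values)
      return sum((v - mean) ** 2 for v in values)

  def split_cost(xs, ys, threshold):
      left = [y for x, y in zip(xs, ys) if x <= threshold]
      right = [y for x, y in zip(xs, ys) if x > threshold]
      return sse(left) + sse(right)

  xs = [1, 2, 3, 4]
  ys = [10.0, 12.0, 30.0, 33.0]
  # The tree keeps the threshold with the lowest summed squared error
  print(min([1.5, 2.5, 3.5], key=lambda t: split_cost(xs, ys, t)))  # 2.5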

12
Q

What does purity refer to in Decision Trees?

A

The purity of a node is the proportion (probability) of its records that belong to the corresponding class, i.e. a pure node is one in which 100% of the records meet the criterion. For example, a node in which 100% of the records are female.
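
A one-function sketch (the helper and example labels are assumed):

  def purity(labels, cls):
      # Proportion of the node's records that belong to the given class
      return sum(1 for label in labels if label == cls) / len(labels)

  node = ["female"] * 8 + ["male"] * 2
  print(purity(node, "female"))             # 0.8
  print(purity(["female"] * 10, "female"))  # 1.0: a pure node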

13
Q

How is a Decision Tree trained?

A

At each node, the algorithm looks for the split of the records that reached that node that is the most “informative”, i.e. the split that gives the greatest reduction in impurity.

14
Q

When does the Decision Tree algorithm stop?

A

The algorithm constructs subtrees until one of the following criteria is met:

  • All the leaf nodes in the tree satisfy the minimum purity threshold.
  • The tree cannot be further split with the pre-set minimum purity threshold.
  • Any other stopping criterion is satisfied (such as the maximum depth of the tree).
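
In scikit-learn these stopping criteria appear as hyperparameters; the parameter values below are illustrative only:

  from sklearn.tree import DecisionTreeClassifier

  tree = DecisionTreeClassifier(
      max_depth=5,                 # stop at a maximum tree depth
      min_samples_split=20,        # don't split nodes with fewer records
      min_impurity_decrease=0.01,  # require each split to reduce impurity
  )
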
15
Q

What are entropy methods in relation to Decision Trees?

A

The entropy methods select the most informative attribute based on two basic measures:

  • Entropy, which measures the impurity of an attribute
  • Information gain, which measures the purity of an attribute
16
Q

What is Information Gain?

A

Information gain is a measure of the purity of an attribute. It is the decrease in entropy after the dataset is split on that attribute.

17
Q

How do you calculate Entropy?

A

H(X) = Entropy of X = -Σ P(x) log2 P(x), summed over all values x in X
i.e. minus the sum, over each class value x in the dataset, of its probability times the base-2 log of that probability.
For example, with class proportions 3/8, 2/8 and 3/8:
H(X) = -(3/8 log2(3/8) + 2/8 log2(2/8) + 3/8 log2(3/8)) ≈ 1.561
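
As a runnable sketch (the entropy helper is assumed, not from the source):

  import math
  from collections import Counter

  def entropy(labels):
      # H(X): minus the sum of P(x) * log2 P(x) over every class value x
      n = len(labels)
      return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

  # The card's example: class proportions 3/8, 2/8 and 3/8
  print(entropy(["a"] * 3 + ["b"] * 2 + ["c"] * 3))  # ~1.561 bits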

18
Q

What is the Entropy of a dataset with a 50/50 split?

A

The Entropy is 1, the highest possible for two classes, from the equation:
H(X) = -((0.5 × log2(0.5)) + (0.5 × log2(0.5)))
H(X) = -((0.5 × -1) + (0.5 × -1))
H(X) = 1
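
Checked directly in Python (a trivial sketch):

  import math

  h = -((0.5 * math.log2(0.5)) + (0.5 * math.log2(0.5)))
  print(h)  # 1.0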

19
Q

How do you calculate the purity of a node?

A

Purity(class) = (records in that class) / (all records at the node)
e.g. P(subscribed=yes) = 211/2000 = 10.55%
i.e. it is the probability of that class at the node

20
Q

What is Conditional Entropy and how is it calculated?

A

The Conditional Entropy H(Y|X) is the entropy remaining in Y once X is known.
It has the same form as entropy, but the inner sum is computed within each value of X and the results are weighted by P(x) and summed:
H(Y|X) = -Σ P(x) Σ P(y|x) log2 P(y|x), with the outer sum over values x and the inner sum over values y
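
A self-contained sketch (the helper name and toy data are assumed): group the records by the value of X, compute the entropy of Y inside each group, and take the P(x)-weighted sum.

  import math
  from collections import Counter, defaultdict

  def conditional_entropy(xs, ys):
      # H(Y|X): remaining entropy of Y once X is known
      n = len(xs)
      groups = defaultdict(list)
      for x, y in zip(xs, ys):
          groups[x].append(y)
      h = 0.0
      for group in groups.values():
          p_x = len(group) / n
          counts = Counter(group)
          inner = -sum((c / len(group)) * math.log2(c / len(group))
                       for c in counts.values())
          h += p_x * inner
      return h

  # Example: knowing the outlook removes all uncertainty about play
  print(conditional_entropy(["sunny", "sunny", "rain", "rain"],
                            ["no", "no", "yes", "yes"]))  # 0.0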

21
Q

How do you calculate Information Gain?

A

IG = H(Y) - H(Y|X)

Information Gain = Entropy (of the target Y) - Conditional Entropy (of Y given the attribute X)
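
A self-contained sketch (helper names and toy data assumed) putting the two together:

  import math
  from collections import Counter

  def H(labels):
      n = len(labels)
      return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

  def info_gain(xs, ys):
      # IG = H(Y) - H(Y|X), with H(Y|X) as the P(x)-weighted entropy of each group
      n = len(ys)
      h_cond = sum(
          (sum(1 for x in xs if x == v) / n)
          * H([y for x, y in zip(xs, ys) if x == v])
          for v in set(xs)
      )
      return H(ys) - h_cond

  outlook = ["sunny", "sunny", "rain", "rain"]
  play    = ["no",    "no",    "yes",  "yes"]
  print(info_gain(outlook, play))  # 1.0: outlook fully determines play here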

22
Q

Name three popular decision tree algorithms

A

ID3, C4.5 and CART

23
Q

What is ID3?

A

ID3 stands for Iterative Dichotomiser 3 and is a Decision Tree Algorithm.

24
Q

What is C4.5?

A

C4.5 is a Decision Tree algorithm that improves on ID3 in that it can handle missing data. It can also handle both categorical and continuous data, and it employs pruning.

25
Q

What is CART?

A

CART (Classification and Regression Trees) is a Decision Tree algorithm that can handle both categorical and continuous variables. Where C4.5 employs entropy-based criteria, CART uses the Gini diversity index.

26
Q

What is the Gini diversity index and when is it used?

A

It is used in the CART decision tree algorithm and has the following form:
Gini(X) = 1 - Σ P(x)^2, summed over all class values x
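
A small sketch (the helper is assumed) showing the index's range:

  from collections import Counter

  def gini(labels):
      # Gini(X) = 1 - sum of P(x)^2 over every class value x
      n = len(labels)
      return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

  print(gini(["yes"] * 10))              # 0.0: a pure node
  print(gini(["yes"] * 5 + ["no"] * 5))  # 0.5: maximum impurity for two classes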

27
Q

What does “greedy” mean in the context of decision trees?

A

Decision trees use greedy algorithms: they always choose the split that looks best at that moment, without reconsidering earlier choices. This can lead to poor early decisions, which can be mitigated by using a random forest.
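
As a sketch (the parameter value is illustrative), a random forest trains many randomized trees and votes across them, softening the impact of any single tree's poor greedy split:

  from sklearn.datasets import load_iris
  from sklearn.ensemble import RandomForestClassifier

  X, y = load_iris(return_X_y=True)
  forest = RandomForestClassifier(n_estimators=100).fit(X, y)
  print(forest.predict(X[:1]))  # e.g. [0], the predicted class of the first record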

28
Q

How can you check to see if a decision tree makes sense?

A

Sanity-check the splits: confirm the decision rules make sense for the domain.

Look at the depth of the tree: too many layers with few members may imply over-fitting.
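
scikit-learn trees report these properties directly; a quick sketch:

  from sklearn.datasets import load_iris
  from sklearn.tree import DecisionTreeClassifier

  X, y = load_iris(return_X_y=True)
  tree = DecisionTreeClassifier().fit(X, y)
  # Unusually deep trees with very few records per leaf may be over-fitting
  print(tree.get_depth(), tree.get_n_leaves())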