Chp 5 Decision Tree Flashcards

(56 cards)

1
Q

Greedy Strategy

A

Split the records based on an attribute test that optimizes a certain criterion

2
Q

Binary Split

A

Divides values into two subsets

3
Q

Multi-way split

A

Use as many partitions as there are distinct values

4
Q

How to specify continuous attributes

A

Sort the attribute values
Create candidate split positions at the halfway points between adjacent values
Compute the Gini index for each candidate split and pick the best (see the sketch below)

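A minimal Python sketch of the procedure in the card above (the toy data, NumPy usage, and function names are illustrative assumptions, not from the cards):

```python
import numpy as np

def gini(labels):
    """Gini index of a set of class labels: 1 - sum(p_i^2)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_continuous_split(values, labels):
    """Sort the attribute, take halfway points as candidate splits,
    and pick the one with the lowest weighted Gini index."""
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    best = (None, float("inf"))
    for i in range(len(values) - 1):
        if values[i] == values[i + 1]:
            continue                                    # no split between equal values
        threshold = (values[i] + values[i + 1]) / 2     # halfway point
        left, right = labels[: i + 1], labels[i + 1 :]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best  # (threshold, weighted Gini) -- lower is better

# Example: a continuous attribute with binary class labels
print(best_continuous_split([60, 70, 75, 85, 90, 95, 100, 120, 125, 220],
                            [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]))
```
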
5
Q

The lower the gini index

A

the better it is

6
Q

Gini Index

A

Measures node impurity: Gini = 1 − Σ (p_i)², where p_i is the fraction of records of class i at the node; it is 0 for a pure node and largest when the classes are evenly mixed

7
Q

Entropy

A

Amount of uncertainty involved in a distribution

8
Q

In decision tree algorithms, entropy measures

A

Purity: entropy is 0 for a pure node and highest when the classes are evenly mixed

9
Q

Purity

A

The fraction of observations belonging to a particular class in a node

10
Q

Pure node

A

If all observations belong to the same class

11
Q

Information

A

Amount of certainty involved in a distribution

12
Q

We need to choose a split in a decision tree that maximizes

A

Information Gain

13
Q

Information Gain formula

A

Entropy before split - Entropy after split

14
Q

Steps to determine split?

A

Calculate the entropy at the root node
Calculate the information gain for each candidate attribute split
Pick the attribute split with the highest information gain (see the sketch below)

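A minimal Python sketch of the split-selection steps in the card above, for categorical attributes (the toy data and function names are illustrative assumptions):

```python
import numpy as np

def entropy(labels):
    """Entropy of a label array: -sum(p_i * log2 p_i)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(attribute, labels):
    """Entropy before the split minus weighted entropy after a multi-way split."""
    before = entropy(labels)
    after = 0.0
    for v in np.unique(attribute):
        subset = labels[attribute == v]
        after += len(subset) / len(labels) * entropy(subset)
    return before - after

# Pick the attribute with the highest information gain at the root
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])
attributes = {
    "outlook": np.array(["sun", "sun", "rain", "rain", "sun", "rain", "sun", "rain"]),
    "windy":   np.array([0, 1, 0, 1, 0, 1, 0, 1]),
}
gains = {name: information_gain(col, labels) for name, col in attributes.items()}
print(max(gains, key=gains.get), gains)
```
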
15
Q

Decision trees are non/parametric

A

Non-parametric, because no parametric form (fixed set of parameters) is specified in advance

16
Q

Time complexity of building a decision tree

A

O(log n) (on the order of the depth of a balanced tree)

17
Q

Does multicollinearity affect decision tree accuracy?

A

No, it just adds extra height to the tree (redundant splits)

18
Q

Finding the optimal decision tree is

A

Expensive: it is an NP-complete problem, which is why greedy heuristics are used in practice

19
Q

Decision trees can create

A

Rectilinear (axis-parallel) decision boundaries

20
Q

Decision trees cannot create

A

Oblique (diagonal) boundaries: its splits only produce straight, axis-parallel (up-down or left-right) lines

21
Q

Model =

A

Algorithm + hypothesis

22
Q

4 steps of selecting a model

A

Prepare the training data
Choose the hypothesis set and algorithm
Tune the algorithm
Train the model, fit it to out-of-sample data (the test set), and evaluate the results

23
Q

Goal of model selection

A

Select the best model from the training phase

24
Q

Two ways to evaluate models

A

Model Checking
Performance Estimation

25
Q

The best model is the one that gives you _ and _ well on testing set

A

Smallest prediction error
Generalizes

26
Q

Model Checking

A

Given a dataset, divide it into training and testing datasets

27
Q

What happens if you randomly select test points that are not representative of the population in general?

A

The performance estimate is misleading; cross-validation addresses this

28
Q

Cross Validation approaches (3)

A

Holdout
K-fold cross validation
Leave one out cross validation

29
Q

Cross validation

A

An approach to systematically create and evaluate multiple models on multiple subsets of the dataset

30
Q

Holdout method

A

Randomly split the data into 80% training data and 20% testing data

31
Q

k-fold cross validation

A

Split the data into k chunks, train on k-1 chunks, and test on the remaining chunk; repeat k times and average the error

32
Q

Leave one out cross-validation

A

Extreme version of k-fold where each chunk is a single observation (k = n chunks)

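A rough sketch of the three approaches from the cards above, assuming scikit-learn and synthetic data (neither is specified in the cards):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score, KFold, LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=0)
model = DecisionTreeClassifier(random_state=0)

# Holdout: one random 80/20 split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
holdout_acc = model.fit(X_tr, y_tr).score(X_te, y_te)

# k-fold: train on k-1 chunks, test on the remaining chunk, average over k runs
kfold_acc = cross_val_score(model, X, y,
                            cv=KFold(n_splits=5, shuffle=True, random_state=0)).mean()

# Leave-one-out: k-fold with one observation per chunk (k = n)
loo_acc = cross_val_score(model, X, y, cv=LeaveOneOut()).mean()

print(holdout_acc, kfold_acc, loo_acc)
```
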
33
Q

Use of k-fold or LOOCV resampling methods is more robust if

A

The data is split into training, validation, and testing sets

34
Q

A typical application of holdout methods is to

A

Determine a stopping point with respect to error: stop training when the test-set error starts increasing (this indicates overfitting)

35
Q

Why split data into three parts?

A

If your model has hyperparameters, you can tune them on the validation dataset

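A minimal sketch of tuning a hyperparameter (tree depth here, as an assumed example) on a validation set, using scikit-learn and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
# Three-way split: 60% train, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Tune the hyperparameter on the validation set...
best_depth = max(range(1, 11),
                 key=lambda d: DecisionTreeClassifier(max_depth=d, random_state=0)
                               .fit(X_train, y_train).score(X_val, y_val))

# ...then report final performance once on the untouched test set
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print(best_depth, final.score(X_test, y_test))
```
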
36
Q

Two ways to do performance evaluation of a model

A

Confusion Matrix
Receiver Operating Characteristic (ROC) curve

37
Q

Confusion Matrix

A

A table of predicted vs. actual classes from which numerous performance metrics can be computed

38
Q

ROC curve

A

Characterizes the trade-off between positive hits (true positives) and false alarms (false positives)

39
Q

Can you minimize both FP and FN?

A

No, there is a trade-off between them

40
Q

What does the ROC curve plot?

A

The true positive rate (y-axis) against the false positive rate (x-axis)

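A tiny worked example of the two rates the ROC curve plots, computed from a made-up 2x2 confusion matrix (the counts are illustrative assumptions):

```python
# Counts from a hypothetical confusion matrix
TP, FN = 80, 20   # actual positives: correctly vs. incorrectly classified
FP, TN = 30, 70   # actual negatives: incorrectly vs. correctly classified

tpr = TP / (TP + FN)   # true positive rate (hit rate), y-axis of the ROC curve
fpr = FP / (FP + TN)   # false positive rate (false alarms), x-axis of the ROC curve
print(tpr, fpr)        # lowering FN (by moving the threshold) generally raises FP, and vice versa
```
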
41
Q

What does the middle (diagonal) line on an ROC plot mean?

A

A random guess (AUC = 0.5)

42
Q

AUC standards

A

0.5-0.6: fail
0.6-0.7: worthless
0.7-0.8: poor
0.8-0.9: good
>0.9: excellent

43
Q

Overfitting

A

The model picks up nuances (noise) in the training data and matches the training data too closely

44
Q

Underfitting

A

The model is too simple to capture the patterns in the data

45
Q

Prepruning

A

Halt the growth of the tree early based on some constraint
Pro: produces shorter trees
Con: hard to know when to stop

46
Q

Post pruning

A

Grow the tree to maximum size, then trim it back
Pro: gives better results
Con: wastes compute cycles

47
Q

How to prune

A

Focus on the complexity parameter (cp): keep splitting until cp reaches a certain value

48
Q

How to find cp

A

Use the cross-validation error: find where it starts to rise again after decreasing, and use the cp at that point

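A rough scikit-learn sketch of this idea; cp comes from rpart in R, and mapping it onto scikit-learn's cost-complexity parameter ccp_alpha is an assumption/analogy, not the exact rpart procedure:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Candidate complexity values from the pruning path of a fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.unique(path.ccp_alphas)

# Cross-validation error for each candidate; pick the value with the lowest error
# (i.e. just before the error curve starts rising again)
cv_error = [1 - cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                                X, y, cv=5).mean() for a in alphas]
best_alpha = alphas[int(np.argmin(cv_error))]

pruned_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(best_alpha, pruned_tree.get_depth())
```
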
49
Q

What does pruning do

A

Gives a tree that generalizes better

50
Q

Most datasets have what type of class distribution?

A

Imbalanced

51
Q

Cost sensitive learning

A

Penalizes the model more heavily when it commits a false negative error (an error on the rare class)

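A minimal sketch of cost-sensitive learning via class weights in scikit-learn (the 10x weight and the synthetic data are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced data: class 1 is the rare class
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Make errors on the rare class (false negatives) 10x more costly than false positives
weighted = DecisionTreeClassifier(max_depth=3, class_weight={0: 1, 1: 10},
                                  random_state=0).fit(X_tr, y_tr)
plain = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# Recall on the rare class typically improves with the cost-sensitive weights
print(recall_score(y_te, plain.predict(X_te)), recall_score(y_te, weighted.predict(X_te)))
```
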
52
Q

Class-imbalance mitigation techniques (3)

A

Cost sensitive learning
Sampling techniques
Synthetic data

53
Q

Synthetic data

A

May be generated, if possible, to ensure that the class distribution is equivalent

54
Q

Sampling techniques

A

Modify the class distribution such that the rare class is well represented in the training set

55
Q

Undersampling + con

A

Uses fewer of the majority-class observations for training
Con: useful observations may not be part of the sample

56
Q

Oversampling + con

A

Gathers more of the minority-class observations for training
Con: if the training data is noisy, oversampling may amplify the noise

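A rough NumPy sketch of random under- and oversampling (synthetic labels only; real code would resample the feature rows together with the labels):

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 950 + [1] * 50)          # imbalanced: class 1 is rare
idx_major, idx_minor = np.where(y == 0)[0], np.where(y == 1)[0]

# Undersampling: keep only as many majority observations as there are minority ones
# (con: useful majority observations may be dropped)
under_idx = np.concatenate([rng.choice(idx_major, size=len(idx_minor), replace=False),
                            idx_minor])

# Oversampling: replicate minority observations until the classes match
# (con: noisy minority records get amplified)
over_idx = np.concatenate([idx_major,
                           rng.choice(idx_minor, size=len(idx_major), replace=True)])

print(np.bincount(y[under_idx]), np.bincount(y[over_idx]))
```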