Foundation Flashcards

1
Q

How does Predictive Modelling work?

A

It predicts an OUTCOME (target) based on a set of INPUTS

2
Q

What are the 2 types of Prediction?

A

Classification, Estimation

3
Q

What criterion must be met to use “Classification”?

A

Target must be CATEGORICAL
Hint: “Class” in “Classification” => Categorical

4
Q

What criterion must be met to use “Estimation”?

A

Target must be CONTINUOUS (numerical)
Hint: “Estimate” => numbers, hence continuous

5
Q

In what proportions should we split the data?

A

70% TRAINING, 30% TESTING

6
Q

Using what node in SPSS Modeler can we split the data?

A

Partition node

7
Q

Why is it necessary to SPLIT data?

A

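Aim of a predictive model: it should be accurate on UNSEEN data, not only on the data it was trained on

What is UNSEEN data? It is the TESTING data! The model is built on the TRAINING data and never sees the testing records until it is evaluated.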

8
Q

Why do we use a “Seed” in IBM SPSS?

A

It sets the starting point for RANDOMLY selecting records (data rows) to be either training or testing data.

9
Q

Why remember the exact seed number?

A

To ensure that the random split is REPEATABLE: with the same seed, each record is assigned to training or testing in the same way every time.

This in turn ensures that the model’s result is consistent, as the same records are chosen to be training and testing data respectively.
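
The effect of a fixed seed can be sketched in plain Python. This is only an illustration of the idea, not how the SPSS Modeler Partition node is implemented; the function name `partition`, the default seed value and the 100 dummy rows are all assumptions made for the example.

```python
import random

def partition(records, train_fraction=0.7, seed=12345):
    """Randomly split records into (training, testing), reproducibly.

    `seed` and `train_fraction` are illustrative defaults, not SPSS settings.
    """
    rng = random.Random(seed)      # same seed -> same sequence of random numbers
    shuffled = list(records)       # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))            # stand-in for 100 data rows

train_a, test_a = partition(rows, seed=42)
train_b, test_b = partition(rows, seed=42)   # same seed: identical split

print(len(train_a), len(test_a))
print(train_a == train_b and test_a == test_b)
```

Running the split twice with the same seed puts the same rows in each partition, which is why a model built on it gives the same result each time.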

10
Q

What does CART stand for?

A

Classification And Regression Tree

11
Q

When should a Regression Tree be used?

A

When the type of prediction is ESTIMATION, i.e. the target is numerical

12
Q

What are the 2 impurity measures for Classification?

A

Gini Index, Entropy
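
As a rough sketch of what these two measures compute, using the standard textbook definitions (Gini = 1 minus the sum of squared class proportions; entropy = minus the sum of p * log2(p)). The function names and the toy label lists are assumptions for illustration.

```python
from collections import Counter
from math import log2

def class_proportions(labels):
    """Share of each class among the records in a node."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]

def gini(labels):
    """Gini index: 0 for a pure node, larger when classes are mixed."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Entropy in bits: 0 for a pure node, larger when classes are mixed."""
    return sum(-p * log2(p) for p in class_proportions(labels))

pure = ["yes", "yes", "yes", "yes"]     # one class only
mixed = ["yes", "yes", "no", "no"]      # 50/50 split of two classes

print(gini(pure), entropy(pure))
print(gini(mixed), entropy(mixed))
```

Both measures are lowest for the pure node and highest for the evenly mixed one.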

13
Q

What is the impurity measure for Regression?

A

Sum of Squared Error (SSE)
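
A minimal sketch of SSE for the records in one node, assuming the node's prediction is the mean of its target values (the usual regression-tree convention). The function name and the sample numbers are illustrative.

```python
def sse(values):
    """Sum of squared differences between each target value and the node mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

tight = [10.0, 10.0, 10.0]    # identical targets: perfectly "pure" node
spread = [5.0, 10.0, 15.0]    # targets far from their mean

print(sse(tight))
print(sse(spread))
```

As with Gini, a lower value means the records in the node are more alike.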

14
Q

Is lower or higher GINI better?

A

Lower! Gini measures impurity: Gini = 1 - purity, so a lower Gini means a purer node.

15
Q

Why is lower GINI more desirable?

A

The nodes are more homogeneous = better prediction = better model

16
Q

State the 4 kinds of nodes in a decision tree

A

Parent node, Child Node, Root node, Leaf node

17
Q

What is the main difference between a Confusion Matrix and the Analysis node, which gives accuracy (correct/wrong)?

A

A Confusion Matrix breaks the HIT rate down into True Positive, False Positive, True Negative and False Negative, so it gives more in-depth detail.

The Analysis node, by contrast, only states Correct/Wrong.
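
The difference can be sketched in Python: the four counts carry more information than the single correct/wrong figure. The function name, the "sick"/"well" labels and the six sample records are made up for illustration.

```python
def confusion_counts(actual, predicted, positive):
    """Tally TP, FP, TN and FN for a two-class problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1        # predicted positive, really positive
            else:
                fp += 1        # predicted positive, really negative
        else:
            if a == positive:
                fn += 1        # predicted negative, really positive
            else:
                tn += 1        # predicted negative, really negative
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

# Illustrative records only
actual    = ["sick", "sick", "well", "well", "well", "sick"]
predicted = ["sick", "well", "well", "sick", "well", "sick"]

counts = confusion_counts(actual, predicted, positive="sick")
accuracy = (counts["TP"] + counts["TN"]) / len(actual)   # the correct/wrong view

print(counts)
print(accuracy)
```

The accuracy figure alone would not show that one of the two errors here is a missed "sick" case.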

18
Q

Why is Confusion Matrix useful?

A

By analysing the True/False Positives and Negatives, we can judge how severe each kind of error is in context. E.g. in a medical setting, we would rather have MORE “False Positives” than “False Negatives”.

Why? The consequence of a False Negative (a missed illness) is severe! So the hint is to look at the consequences in context.

19
Q

Formula for False Positive Hit Rate

A

FP / (FP + TN)

20
Q

Formula for False Negative Hit Rate

A

FN / (FN + TP)
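
Both rate formulas can be written as small Python functions. The counts below are hypothetical, chosen only to make the arithmetic easy to follow. Note that the denominator must be grouped in brackets.

```python
def false_positive_rate(fp, tn):
    """Share of actual negatives that were wrongly predicted positive."""
    return fp / (fp + tn)

def false_negative_rate(fn, tp):
    """Share of actual positives that were wrongly predicted negative."""
    return fn / (fn + tp)

# Hypothetical confusion-matrix counts
tp, fn, fp, tn = 40, 10, 5, 45

fpr = false_positive_rate(fp, tn)    # 5 / (5 + 45)
fnr = false_negative_rate(fn, tp)    # 10 / (10 + 40)

# Without brackets, division happens first: fp / fp + tn = 1 + 45
unbracketed = fp / fp + tn

print(fpr, fnr, unbracketed)
```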

21
Q

What do we have to take EXTRA note of in decision trees?

A

Overfitting

22
Q

State the 2 signs of overfitting

A
  1. Training accuracy is noticeably higher than testing accuracy
  2. A leaf node has only 1 sample -> Gini = 0 -> 100% pure (suspicious, because the tree is TOO specialized to the training data)
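
The two signs above can be sketched as simple checks. The 10-point gap threshold, the function names and the sample figures are assumptions for illustration, not rules from SPSS Modeler.

```python
def accuracy_gap_suggests_overfit(train_accuracy, test_accuracy, max_gap=0.10):
    """Sign 1: training accuracy exceeds testing accuracy by more than max_gap.

    max_gap is an arbitrary illustrative threshold.
    """
    return (train_accuracy - test_accuracy) > max_gap

def has_single_record_leaf(leaf_sizes):
    """Sign 2: at least one leaf node holds only one training record."""
    return any(size == 1 for size in leaf_sizes)

# Hypothetical model results
print(accuracy_gap_suggests_overfit(0.99, 0.72))   # large gap
print(accuracy_gap_suggests_overfit(0.85, 0.82))   # small gap
print(has_single_record_leaf([12, 1, 30, 7]))      # one leaf with a single record
```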
23
Q

How can we prevent Overfitting?

A
  1. Set stopping rules in IBM SPSS Modeler to limit tree growth (“Overfitting Prevention”)
  2. Prune the tree until it no longer overfits (choose the pruning level that gives the MINIMUM ERROR RATE)
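
Choosing the pruning level with the minimum error rate can be sketched as below. The candidate tree sizes and error rates are invented for illustration, and the sketch assumes the error rate is measured on testing data rather than training data.

```python
# (number of leaf nodes, error rate on testing data) -- hypothetical values
candidates = [
    (2, 0.30),    # too simple: underfits
    (4, 0.22),
    (8, 0.18),
    (16, 0.21),
    (32, 0.27),   # too detailed: overfits
]

# Pick the tree size whose testing error rate is lowest
best_leaves, best_error = min(candidates, key=lambda c: c[1])

print(best_leaves, best_error)
```

In this made-up example the error rate falls and then rises again as the tree grows, so the middle-sized tree is kept.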