quiz 4 Flashcards

(32 cards)

1
Q

What is Predictive Data Mining?

A

A form of supervised machine learning that uses input data to predict an outcome variable whose values are known (labelled) in the training data.

2
Q

What are the steps in the CRISP-DM process?

A
  • Business Understanding
  • Data Understanding
  • Data Sampling
  • Data Preparation
  • Data Partitioning
  • Model Construction
  • Model Evaluation
  • Deployment
3
Q

What does a Confusion Matrix represent?

A

A table that cross-tabulates actual classes against predicted classes, showing how well the model is classifying outcomes.

4
Q

What are the components of a Confusion Matrix?

A
  • True Positive (TP)
  • False Negative (FN)
  • False Positive (FP)
  • True Negative (TN)
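A minimal sketch of how the four cells could be counted from actual and predicted labels. The function name `confusion_counts` and the sample labels are hypothetical, and class 1 is assumed to be the positive class.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FN, FP, TN) for two equal-length label sequences."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1   # actual positive, predicted positive
        elif a == positive:
            fn += 1   # actual positive, predicted negative
        elif p == positive:
            fp += 1   # actual negative, predicted positive
        else:
            tn += 1   # actual negative, predicted negative
    return tp, fn, fp, tn

# Hypothetical labels, for illustration only.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fn, fp, tn = confusion_counts(actual, predicted)
print(tp, fn, fp, tn)
```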
5
Q

What is the formula for Accuracy?

A

Accuracy = (TP + TN) / (TP + TN + FP + FN)

6
Q

What does Precision measure?

A

The proportion of predicted positives that are actually positive: Precision = TP / (TP + FP)

7
Q

What does Recall / Sensitivity measure?

A

The proportion of actual positives that the model correctly identifies: Recall = TP / (TP + FN)

8
Q

What is Specificity?

A

The proportion of actual negatives that the model correctly identifies: Specificity = TN / (TN + FP)
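The Accuracy, Precision, Recall and Specificity formulas from the cards above can be checked with a short sketch. The function name `classification_metrics` and the counts are hypothetical, and the code assumes every denominator is non-zero.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the four ratios from confusion-matrix counts.

    Assumes no denominator is zero; a real implementation would guard for that.
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fn=10, fp=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```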

9
Q

What is the F1 Score?

A

The harmonic mean of Precision and Recall: F1 = 2 × Precision × Recall / (Precision + Recall)
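A small sketch of why the harmonic mean is used: it stays low unless both precision and recall are reasonably high. The function name `f1_score` and the input values are hypothetical, and returning 0.0 when both inputs are zero is a convention assumed here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical values: a lopsided model versus a balanced one.
lopsided = f1_score(0.9, 0.1)   # arithmetic mean would be 0.5
balanced = f1_score(0.5, 0.5)
print(f"lopsided: {lopsided:.2f}, balanced: {balanced:.2f}")
```

The lopsided model scores well below the balanced one even though both have the same arithmetic mean of precision and recall.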

10
Q

What is the default cutoff value in classification?

A

0.5

11
Q

What happens when the threshold is lowered?

A

More ‘yes’ predictions: recall rises (or stays the same) and precision usually falls.
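The effect can be illustrated with a small sketch. The scores and labels are made up, `precision_recall` is a hypothetical helper, and a record is assumed to be classified ‘yes’ when its score is at or above the cutoff.

```python
def precision_recall(scores, labels, cutoff):
    """Return (number of 'yes' predictions, precision, recall) at a cutoff."""
    predicted = [1 if s >= cutoff else 0 for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return sum(predicted), precision, recall

# Hypothetical predicted probabilities and true classes.
scores = [0.9, 0.8, 0.6, 0.45, 0.4, 0.35, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

default_cut = precision_recall(scores, labels, 0.5)
lowered_cut = precision_recall(scores, labels, 0.3)
print("cutoff 0.5:", default_cut)
print("cutoff 0.3:", lowered_cut)
```

On this data the lower cutoff doubles the number of ‘yes’ predictions, raises recall and lowers precision. Other data sets may show a smaller or no drop in precision.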

12
Q

What is a Lift Chart?

A

A visual tool that compares a model’s ability to rank positives vs. a random guess.
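One common way to compute the numbers behind a lift chart is sketched below: rank records by predicted score, then compare the positive rate in the top group with the overall positive rate (what random selection would give on average). The helper `lift_at` and the data are hypothetical.

```python
def lift_at(scores, labels, top_n):
    """Lift of the top_n highest-scored records over the overall positive rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate

# Hypothetical predicted probabilities and true classes.
scores = [0.9, 0.8, 0.6, 0.45, 0.4, 0.35, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

for n in (2, 4, 8):
    print(f"top {n}: lift = {lift_at(scores, labels, n):.2f}")
```

A lift above 1 means the model's top-ranked records contain more positives than a random sample of the same size would be expected to. Lift falls to 1 once every record is included.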

13
Q

What does an ROC Curve plot?

A

True Positive Rate (Recall) vs False Positive Rate (1 – Specificity).
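A sketch of how the points on the curve could be generated: sweep the cutoff from high to low and record (False Positive Rate, True Positive Rate) at each step. The helper `roc_points` and the data are hypothetical, and a record is assumed positive when its score is at or above the cutoff.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, starting at (0, 0), as the cutoff is lowered."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cutoff above every score: nothing predicted positive
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical predicted probabilities and true classes.
scores = [0.9, 0.8, 0.6, 0.45, 0.4, 0.35, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```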

14
Q

Why not use linear regression for classification?

A

Linear regression produces continuous outputs and doesn’t bound predictions between 0 and 1.

15
Q

What transformation does logistic regression use?

A

The logit transformation.

16
Q

What is the logistic model equation?

A

ln(p / (1 - p)) = β0 + β1x1 + β2x2 + …
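Solving the equation above for p gives p = 1 / (1 + e^-(β0 + β1x1 + …)), which always lies between 0 and 1. A single-predictor sketch, with hypothetical coefficients chosen only for illustration:

```python
import math

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def predict_probability(x, b0, b1):
    """Invert the logit: map the linear predictor b0 + b1*x to a probability."""
    z = b0 + b1 * x
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients and input value.
b0, b1 = -1.5, 0.8
x = 2.0
p = predict_probability(x, b0, b1)
print(f"p = {p:.4f}, logit(p) = {logit(p):.4f}, b0 + b1*x = {b0 + b1 * x:.4f}")
```

Applying `logit` to the predicted probability recovers the linear predictor, which is the relationship the card's equation states.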

17
Q

What is k-Nearest Neighbours (k-NN)?

A

A method that classifies new observations based on the closest training points.

18
Q

What distance metric is commonly used in k-NN?

A

Euclidean distance.

19
Q

What is the effect of choosing k = 1 in k-NN?

A

Each prediction depends on a single nearest neighbour, so the model is highly sensitive to noise and may overfit.
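A minimal k-NN sketch using Euclidean distance and a majority vote among the k closest training points. The helper `knn_predict` and the data are hypothetical; the last training point is a deliberately mislabelled "b" sitting inside the "a" cluster to show the k = 1 effect.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify query by majority vote among its k nearest training points."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),  # Euclidean distance
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data; the final point is a noisy "b" near the "a" cluster.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7), (2.2, 2.2)]
train_labels = ["a", "a", "a", "b", "b", "b", "b"]
query = (2, 2)

with_k1 = knn_predict(train_points, train_labels, query, k=1)
with_k3 = knn_predict(train_points, train_labels, query, k=3)
print("k=1:", with_k1, " k=3:", with_k3)
```

With k = 1 the single noisy neighbour decides the class; with k = 3 the two genuine "a" neighbours outvote it.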

20
Q

What are key features of Classification and Regression Trees (CART)?

A
  • Non-parametric model
  • Tree-based structure
  • Splits based on informative features
21
Q

What is overfitting in the context of CART?

A

The tree fits the training data too closely, including its noise, resulting in high variance and poor performance on new data.

22
Q

What is the purpose of pruning in CART?

A

To improve generalisability by cutting back branches that do not reduce validation error.

23
Q

What is an Ensemble Method?

A

Combines multiple weak models to create a stronger model.
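Majority voting is one simple way to combine models; the source does not say which combination rule is intended, so this is only an illustrative sketch. The three prediction lists stand in for hypothetical classifiers and `majority_vote` is a made-up helper.

```python
from collections import Counter

def majority_vote(*model_predictions):
    """Combine per-record predictions from several models by majority vote."""
    combined = []
    for votes in zip(*model_predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical predictions from three weak classifiers on five records.
model_1 = [1, 0, 1, 1, 0]
model_2 = [1, 1, 0, 1, 0]
model_3 = [0, 0, 1, 1, 1]

ensemble = majority_vote(model_1, model_2, model_3)
print(ensemble)
```

An odd number of binary classifiers is used here so that no vote can tie.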

24
Q

What is Monte Carlo Simulation?

A

A technique to model uncertainty using random values from probability distributions.
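A minimal sketch of the idea: draw uncertain inputs from probability distributions, compute the outcome, and repeat many times. The profit model and every figure in it (price, cost range, demand parameters, fixed cost) are hypothetical and not taken from any case in this deck.

```python
import random
import statistics

def simulate_profit(rng):
    """One trial of a hypothetical profit model (all figures are made up)."""
    unit_cost = rng.uniform(20.0, 30.0)          # uniformly distributed input
    demand = max(0.0, rng.gauss(1000.0, 200.0))  # normal input, floored at zero
    price = 50.0
    fixed_cost = 20_000.0
    return (price - unit_cost) * demand - fixed_cost

rng = random.Random(0)  # seeded so repeated runs give the same results
profits = [simulate_profit(rng) for _ in range(10_000)]

mean_profit = statistics.mean(profits)
loss_probability = sum(p < 0 for p in profits) / len(profits)
print(f"mean profit: {mean_profit:,.0f}")
print(f"range: {min(profits):,.0f} to {max(profits):,.0f}")
print(f"estimated P(loss): {loss_probability:.1%}")
```

The result is a spread of outcomes with an estimated probability of loss, not a single predicted figure.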

25
Q

What does the output of a Monte Carlo Simulation represent?

A

A range of possible outcomes.

26
Q

What function is used to generate uniform random numbers in Excel?

A

RAND()

27
Q

What is the Excel function to simulate a normal distribution?

A

NORM.INV(RAND(), μ, σ)

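The same inverse-transform idea can be sketched in Python: a uniform random number is passed through the inverse of the normal cumulative distribution function. The helper name `norm_inv_rand` and the values μ = 100, σ = 15 are hypothetical; `NormalDist.inv_cdf` comes from the standard library's `statistics` module (Python 3.8+).

```python
import random
import statistics
from statistics import NormalDist

def norm_inv_rand(mu, sigma, rng=random):
    """Draw one normal value, mirroring NORM.INV(RAND(), mu, sigma)."""
    u = rng.random()      # uniform on [0, 1), analogous to RAND()
    while u == 0.0:       # inv_cdf requires 0 < u < 1, so redraw an exact zero
        u = rng.random()
    return NormalDist(mu, sigma).inv_cdf(u)

rng = random.Random(42)   # seeded so repeated runs give the same results
draws = [norm_inv_rand(100.0, 15.0, rng) for _ in range(20_000)]

sample_mean = statistics.mean(draws)
sample_sd = statistics.stdev(draws)
print(f"sample mean: {sample_mean:.2f}, sample sd: {sample_sd:.2f}")
```

With enough draws the sample mean and standard deviation should land close to the chosen μ and σ.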
28
Q

What are the key variables in the Sanotronics case?

A
  • Labour cost
  • Parts cost
  • Demand
29
Q

What type of distribution is used for Parts cost in the Sanotronics case?

A

Uniform distribution.

30
Q

What is the goal of the Land Shark case?

A

Simulate whether a bid will win a property auction.

31
Q

What are the inputs for the Land Shark case simulation?

A
  • Estimated property value
  • Random number of competing bidders
  • Competitor bid values
32
Q

What is a limitation of simulation?

A

Results are estimates, not guarantees.