Classification Flashcards

1
Q

impute/imputation

A

In statistics, imputation is the process of replacing missing data with substituted values.
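
As a minimal sketch of one common strategy (mean imputation), assuming missing entries are marked with None. The helper name impute_mean is hypothetical, not from the card:

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

ages = [20, None, 40, 30, None]
print(impute_mean(ages))  # the two None entries become 30
```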

2
Q

confusion matrix

A
  • A cross-tabulation of our model’s predictions against actual values
  • A matrix (table) used to measure the performance of a machine learning algorithm
  • Rows: actual classes (Ci)
  • Columns: predicted classes (Cj)
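
A rough sketch of how such a table could be built by hand, following the rows = actual, columns = predicted layout on this card. The function name and the sample labels are illustrative assumptions:

```python
def confusion_matrix(actual, predicted, labels):
    """Return a matrix where entry [i][j] counts actual class i predicted as class j."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

actual    = ["cat", "cat", "dog", "dog", "dog"]
predicted = ["cat", "dog", "dog", "dog", "cat"]
cm = confusion_matrix(actual, predicted, labels=["cat", "dog"])
print(cm)  # [[1, 1], [1, 2]]
```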
3
Q

What are the 4 possible outcomes of a binary classification task?

A

True Positive
False Positive
False Negative
True Negative
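
The four outcomes can be counted from paired labels. This is a small sketch assuming 1 marks the positive class; count_outcomes is a hypothetical helper:

```python
def count_outcomes(actual, predicted, positive=1):
    """Tally TP, FP, FN and TN for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted positive, actually positive
        elif p == positive:
            fp += 1   # predicted positive, actually negative
        elif a == positive:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1   # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

actual    = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
outcomes = count_outcomes(actual, predicted)
print(outcomes)
```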

4
Q

What is the common choice for the baseline model for a classification problem?

A

a model that simply predicts the most common class every single time
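
A minimal sketch of such a baseline, assuming the training labels are in a plain list. The names baseline_predictions and y_train are illustrative:

```python
from collections import Counter

def baseline_predictions(y_train, n):
    """Predict the most common training class for each of n observations."""
    most_common_class, _ = Counter(y_train).most_common(1)[0]
    return [most_common_class] * n

y_train = ["no", "no", "yes", "no", "yes"]
print(baseline_predictions(y_train, 3))  # ['no', 'no', 'no']
```

The accuracy of this baseline equals the share of the most common class, which gives a floor that a real model should beat.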

5
Q

What are the common evaluation metrics for a classification problem/model?

A
  • Accuracy
  • Precision
  • Recall
  • Specificity
  • F1 score
  • ROC curve
6
Q

What is accuracy?

A

the number of times we predicted correctly divided by the total number of observations

7
Q

What is precision / positive predictive value?

A

the percentage of our positive predictions that are correct: TP / (TP + FP).

8
Q

What is recall / true positive rate / sensitivity?

A

the percentage of actual positive cases that we correctly predicted: TP / (TP + FN).

9
Q

What is specificity / true negative rate?

A

the percentage of actual negative cases that we correctly predicted as negative: TN / (TN + FP).
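
Accuracy, precision, recall and specificity can all be derived from the four outcome counts. The sketch below assumes no denominator is zero; the function name and the sample counts are made up for illustration:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute four common metrics from outcome counts (assumes non-zero denominators)."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),      # correct share of positive predictions
        "recall": tp / (tp + fn),         # share of actual positives found
        "specificity": tn / (tn + fp),    # share of actual negatives found
    }

m = classification_metrics(tp=8, fp=2, fn=4, tn=6)
print(m)
```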

10
Q

logistic regression

A
  • A classification algorithm built on a regression model, despite the name
  • Training finds the values of the coefficients that weight each input variable
  • Assigns observations to a discrete set of classes
  • Predicts discrete outcomes
  • Can be binomial (two classes) or multinomial (more than two classes)
  • In the binomial case, the output is a value between 0 and 1 that represents the probability of one class over the other.
11
Q

regularized least squares

A
  • A way of solving least squares regression problems
  • An extra constraint on the solution, which is called regularization
  • It adds a penalty term to the error.
  • An argument in LogisticRegression (in scikit-learn, the penalty and C parameters)
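
One common form of the penalized objective is ridge (L2) regularization. The symbols here are assumptions, not from the card: X is the feature matrix, y the targets, w the coefficients, and λ ≥ 0 the penalty strength.

```latex
\min_{w} \; \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_2^2
```

With λ = 0 this reduces to ordinary least squares; larger λ shrinks the coefficients toward zero.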
12
Q

What are the components of a decision tree?

A
  • root
  • condition/internal node
  • branches/edges
  • decision/leaf
13
Q

What is a classification tree?

A

A decision tree used to classify a categorical outcome variable

14
Q

What is a regression tree?

A

A decision tree used to predict continuous values, like the price of a house

15
Q

CART = ?

A

classification and regression trees

16
Q

Recursive binary splitting / greedy algorithm

A
  1. Consider all the features
  2. Use a cost function to test all candidate split points
  3. Select the split with the best (lowest) cost
  4. Make the best split the root node, then repeat the procedure on each resulting branch
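
A sketch of one greedy step on a single numeric feature, assuming Gini impurity as the cost function (the card does not name one). The helpers gini and best_split are hypothetical:

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """Try every candidate threshold; return (threshold, weighted cost) of the cheapest."""
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        if not left or not right:
            continue  # a split must send rows to both sides
        cost = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if cost < best[1]:
            best = (t, cost)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = ["a", "a", "a", "b", "b", "b"]
threshold, cost = best_split(xs, ys)
print(threshold, cost)  # 10 0.0
```

A full tree builder would run this over every feature and then recurse into the left and right subsets until a stopping rule is met.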
17
Q

What are Decision Trees?

A
  1. A supervised machine learning method, trained on labeled data.
  2. The training data is used to fit a decision boundary in the form of a sequence of rules.
  3. The boundary serves as a decision rule to classify observations into 2 or more classes.
18
Q

What does each node represent?

A
  • A splitting point in the decision tree.
  • Represents a single input variable (x).
  • Holds a split point or class of that variable.
19
Q

What are the pros of decision trees?

A
  1. Simple to understand
  2. Simple to visualize
  3. Simple to explain the output
  4. Requires little data preparation
  5. Don’t need to encode our target variable
  6. Perform well for a broad range of problems
20
Q

What is the F1 score?

A
  1. The harmonic mean of Recall and Precision
    - giving both metrics equal weight.
  2. Used when you are looking to optimize for both Recall and Precision.
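
The harmonic mean can be written out directly. This is a small sketch; returning 0.0 when both inputs are zero is a convention assumed here:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention for the undefined case
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.5, 1.0))
```

Because it is a harmonic mean, the score is pulled toward the lower of the two values, so a model cannot score well by maximizing only one of them.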
21
Q

What is support?

A

the number of occurrences of each class in the true labels (y_true).

22
Q

What is overfitting?

A

The model fits the training data too closely and does not generalize well to unseen data.

23
Q

How to avoid overfitting in a decision tree?

A

Mechanisms such as

  1. Pruning.
  2. Set the minimum number of samples required at a leaf node
  3. Set the maximum depth
24
Q

How to handle overfitting?

A
  1. Obtain more training data
  2. Feature engineering
25
Q

What is a Logistic Regression model?

A
  1. Maps any real value into a number between 0 and 1, representing the probability that an observation is in the positive class.
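
The mapping is usually done with the logistic (sigmoid) function. A minimal sketch, where z stands for the weighted sum of the inputs (a naming assumption):

```python
import math

def sigmoid(z):
    """Map any real value to the open interval (0, 1)."""
    # Simple form; very large negative z would overflow math.exp.
    return 1.0 / (1.0 + math.exp(-z))

for z in (-5, 0, 5):
    print(z, sigmoid(z))
```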
26
Q

How does the threshold in a logistic regression (LR) model affect the metrics?

A
  1. Decrease the threshold: Recall increases (or stays the same).
  2. Increase the threshold: Precision typically increases, though this is not guaranteed.
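
The effect can be checked on a toy example. The probabilities and labels below are invented for illustration, and precision_recall is a hypothetical helper:

```python
def precision_recall(actual, probs, threshold):
    """Classify as positive when prob >= threshold, then score the result."""
    predicted = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

actual = [1, 1, 0, 1, 0, 0]
probs  = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
low  = precision_recall(actual, probs, threshold=0.35)  # more positives predicted
high = precision_recall(actual, probs, threshold=0.7)   # fewer positives predicted
print(low, high)
```

On this data the lower threshold reaches full recall at the cost of precision, and the higher threshold does the reverse.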

27
Q

What is a ROC curve?

A
  1. Receiver Operating Characteristic Curve
  2. Summarizes the trade-off between the true positive rate (TPR) and the false positive rate (FPR)
    - for a predictive model
    - using different probability thresholds
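
A sketch of how the points of the curve could be computed, one (FPR, TPR) pair per threshold. The data is invented, and the helper assumes both classes are present:

```python
def roc_points(actual, probs, thresholds):
    """Return one (FPR, TPR) pair per threshold; assumes both classes occur."""
    positives = sum(1 for a in actual if a == 1)
    negatives = len(actual) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for a, p in zip(actual, probs) if p >= t and a == 1)
        fp = sum(1 for a, p in zip(actual, probs) if p >= t and a == 0)
        points.append((fp / negatives, tp / positives))
    return points

actual = [1, 1, 0, 1, 0, 0]
probs  = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
points = roc_points(actual, probs, thresholds=[0.0, 0.5, 1.0])
print(points)
```

A threshold of 0 labels everything positive, giving the point (1, 1), and a threshold above every score gives (0, 0); the curve connects the points in between.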
28
Q

How to calculate baseline prediction?

A
  1. For regression: the average value of the dependent variable
  2. For classification: the most common class
29
Q

What is encoding?

A

Transform categorical variables to binary or numeric counterparts
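
One common scheme is one-hot encoding, sketched below. The function name is hypothetical and the column order (sorted categories) is a choice made here:

```python
def one_hot(values):
    """Return (rows, categories): one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

rows, categories = one_hot(["red", "blue", "red"])
print(categories)  # ['blue', 'red']
print(rows)        # [[0, 1], [1, 0], [0, 1]]
```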

30
Q

What is the purpose of splitting data?

A

to evaluate the model on data it was not trained on, so that overfitting to the training sample can be detected
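
A minimal sketch of a random train/test split, using only the standard library. The name split_rows, the 25% test share and the fixed seed are all assumptions for illustration:

```python
import random

def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and return (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

train, test = split_rows(range(8))
print(len(train), len(test))  # 6 2
```

The model is fit on train only; its score on test then estimates how it performs on unseen data.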