Classification Flashcards
(30 cards)
impute/imputation
In statistics, imputation is the process of replacing missing data with substituted values.
confusion matrix
- A cross-tabulation of our model’s predictions against actual values
- A matrix (table) used to measure the performance of a machine learning algorithm
- Rows: actual classes (Ci)
- Columns: predicted classes (Cj)
What are the 4 possible outcomes of a classification task?
True Positive
False Positive
False Negative
True Negative
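The four outcomes can be counted directly from paired actual and predicted labels. This is a minimal sketch for the binary case; the function name confusion_counts, the choice of 1 as the positive class, and the sample labels are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count the four outcomes of a binary classification task."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # predicted positive, actually positive
        elif predicted == positive:
            fp += 1  # predicted positive, actually negative
        elif actual == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)
```

Arranged as a 2x2 table with actual classes as rows and predicted classes as columns, these four counts form the confusion matrix.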
What is the common choice for the baseline model for a classification problem?
a model that simply predicts the most common class every single time
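A baseline of this kind can be sketched in a few lines. The helper name baseline_predict and the toy labels are hypothetical; the point is only that any real model should beat the accuracy this produces.

```python
from collections import Counter


def baseline_predict(y_train, n_predictions):
    """Predict the most common training class for every observation."""
    most_common_class = Counter(y_train).most_common(1)[0][0]
    return [most_common_class] * n_predictions


# Made-up labels for illustration only.
y_train = ["no", "no", "yes", "no", "yes"]
y_test = ["yes", "no", "no", "no"]

predictions = baseline_predict(y_train, len(y_test))
baseline_accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
print(predictions, baseline_accuracy)
```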
What are the common evaluation metrics for a classification problem/model?
- Accuracy
- Precision
- Recall
- Specificity
- F1 score
- ROC curve
What is accuracy?
the number of times we predicted correctly divided by the total number of observations
What is precision / positive predictive value?
the percentage of positive predictions that we made that are correct.
What is recall / true positive rate / sensitivity?
the percentage of actual positive cases that we correctly predicted as positive.
What is specificity / true negative rate?
the percentage of actual negative cases that we correctly predicted as negative, that is, true negatives out of all actual negatives.
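Accuracy, precision, recall, and specificity can all be written in terms of the four outcome counts. The sketch below assumes the counts are already known; the function name classification_metrics and the example counts are made up, and it does not guard against a zero denominator.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute four common metrics from outcome counts.

    Assumes every denominator is non-zero.
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # correct predictions / all observations
        "precision": tp / (tp + fp),     # correct positives / predicted positives
        "recall": tp / (tp + fn),        # correct positives / actual positives
        "specificity": tn / (tn + fp),   # correct negatives / actual negatives
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```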
logistic regression
- A regression algorithm that, despite its name, is used for classification
- Finds the values of the coefficients that weight each input variable
- Assigns observations to a discrete set of classes
- Predicts discrete outcomes
- Can be binomial (two classes) or multinomial (more than two classes)
- The output is a value between 0 and 1 that represents the probability of one class over the other.
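How weighted inputs become a probability between 0 and 1 can be illustrated with the logistic (sigmoid) function. This is a sketch of the binomial case only; the coefficients, intercept, and the 0.5 threshold are hypothetical values chosen for the example, not fitted ones.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, coefficients, intercept):
    """Probability of the positive class for one observation."""
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return sigmoid(z)


# Hypothetical coefficients; a real model would learn these from data.
coefficients = [0.8, -0.5]
intercept = 0.1

probability = predict_proba([2.0, 1.0], coefficients, intercept)
predicted_class = 1 if probability >= 0.5 else 0  # assumed 0.5 threshold
print(round(probability, 4), predicted_class)
```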
regularized least squares
- A way of solving least squares regression problems
- An extra constraint on the solution, which is called regularization
- It adds a penalty term to the error.
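- Available as an argument in LogisticRegression (for example, the C parameter in scikit-learn, which is the inverse of the regularization strength)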
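The effect of the penalty term is easiest to see in the simplest case: one feature, no intercept, and an L2 (ridge) penalty. Minimizing sum((y - w*x)^2) + penalty * w^2 gives w = sum(x*y) / (sum(x*x) + penalty). The function name ridge_slope and the data are assumptions for this sketch.

```python
def ridge_slope(xs, ys, penalty):
    """Closed-form slope for one-feature least squares with an L2 penalty.

    Minimizes sum((y - w*x)**2) + penalty * w**2, with no intercept.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


# Made-up data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

unregularized = ridge_slope(xs, ys, penalty=0.0)
regularized = ridge_slope(xs, ys, penalty=14.0)
print(unregularized, regularized)
```

A larger penalty shrinks the coefficient toward zero, trading some fit on the training data for a simpler model.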
What are the components of a decision tree?
- root
- condition/internal node
- branches/edges
- decision/leaf
What is a classification tree?
A decision tree used to classify the outcome variable, that is, to predict a discrete class.
What is a regression tree?
A decision tree used to predict continuous values, such as the price of a house.
CART = ?
classification and regression trees
Recursive binary splitting / greedy algorithm
- Consider all the features
- Use a cost function to try and test all different (candidate) split points
- Select the split with the best/lowest cost.
- Make the root node the best predictor/classifier
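One step of this greedy search can be sketched as follows. The sketch assumes Gini impurity as the cost function (other costs are possible) and numeric features; the helper names gini, split_cost, and best_split and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a group of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_cost(left, right):
    """Size-weighted Gini impurity of the two groups a split produces."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_split(rows, labels):
    """Try every feature and candidate threshold; keep the lowest-cost split."""
    best = None  # (cost, feature_index, threshold)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] < threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] >= threshold]
            if not left or not right:
                continue  # a split must send data to both sides
            cost = split_cost(left, right)
            if best is None or cost < best[0]:
                best = (cost, feature, threshold)
    return best


# Made-up data: two features per row, two classes.
rows = [[2.0, 7.0], [3.0, 1.0], [6.0, 8.0], [7.0, 2.0]]
labels = ["a", "a", "b", "b"]
cost, feature, threshold = best_split(rows, labels)
print(cost, feature, threshold)
```

Applying the same search recursively to the left and right groups grows the rest of the tree; the first split found becomes the root node.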
What are Decision Trees?
- A supervised machine learning method, trained on labeled data.
- The training data is used to fit the tree, which finds a decision boundary expressed as a sequence of rules.
- The boundary is then used as a decision rule to classify observations into 2 or more classes.
What does each node represent?
- A splitting point in the decision tree.
- Represents a single input variable (x).
- Holds a split point or class of that variable.
What are the pros of a decision tree?
- Simple to understand
- Simple to visualize
- Simple to explain the output
- Requires little data preparation
- Does not require encoding the target variable
- Performs well for a broad range of problems
What is the F1 score?
- The harmonic mean of Recall and Precision
- Gives both metrics equal weight.
- Used when you are looking to optimize for both Recall and Precision.
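The harmonic mean can be computed directly from precision and recall. A minimal sketch follows; the function name f1_from_precision_recall and the input values are made up, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention to avoid division by zero
    return 2 * precision * recall / (precision + recall)


# Made-up values for illustration only.
f1 = f1_from_precision_recall(precision=0.8, recall=0.5)
print(round(f1, 4))
```

Because it is a harmonic mean, the result sits closer to the lower of the two inputs than a simple average would.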
What is support?
The number of occurrences of each class in the true labels (y_true).
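Support amounts to a per-class count of the true labels, which can be sketched with a Counter. The labels below are made up for illustration.

```python
from collections import Counter

# Made-up true labels for illustration only.
y_true = ["cat", "dog", "cat", "bird", "cat", "dog"]

# Support: how many times each class appears among the true labels.
support = Counter(y_true)
for label, count in support.items():
    print(label, count)
```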
What is overfitting?
When a model fits the training data too closely and does not generalize well to new, unseen data.
How to avoid overfitting in a decision tree?
Mechanisms such as:
- Pruning.
- Set the minimum number of samples required at a leaf node
- Set the maximum depth
How to handle overfitting?
- Obtain more training data
- Feature engineering
3.