Linear Regression Flashcards
How do we evaluate a regression model?
• Given N examples, i.e., pairs (x_i, y_i), linear regression computes a model with weights \mathbf{w} and bias b
• So that for each point, \hat{y}_i = \mathbf{w}^T\mathbf{x}_i + b \approx y_i
• We evaluate the model by computing the Residual Sum of Squares (RSS): \text{RSS} = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
The goal of linear regression is thus to find the weights that minimize the RSS
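A minimal sketch of fitting a linear model and computing the RSS, assuming NumPy and scikit-learn are available (the toy data is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: N examples with a single feature (illustrative values only)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

model = LinearRegression().fit(X, y)  # finds the weights that minimize RSS
y_hat = model.predict(X)

# Residual Sum of Squares: sum of squared differences between targets and predictions
rss = np.sum((y - y_hat) ** 2)
print(f"RSS = {rss:.4f}")
```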
What are the assumptions made for linear regression?
• Linearity
– When applying linear regression, the prediction is a linear combination of the inputs
• Normality
–The target outcome follows a normal distribution
• Homoscedasticity
–The variance of the error terms is assumed to be constant over the entire feature space
• Independence
– Each instance is independent from one another
• Absence of Multicollinearity
–There are no strongly correlated features
What is the coefficient of determination (R2) in linear regression? What does it indicate?
• Total sum of squares: \text{TSS} = \sum_{i=1}^{N} (y_i - \bar{y})^2
• Coefficient of determination: R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}
• R2 measures how well the regression line approximates the real data points. When R2 is 1, the regression line perfectly fits the data.
• R2 increases with the number of features even if they do not convey any information about the target
• Therefore, it is usually better to use the adjusted R2
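A minimal sketch computing R2 and adjusted R2, assuming NumPy and scikit-learn (the synthetic data and the standard adjusted-R2 formula with N samples and p features are the only assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Illustrative data: N examples, p features (values are made up for the sketch)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=50)

model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))

# Adjusted R2 penalizes the number of features p relative to the sample size N
N, p = X.shape
adjusted_r2 = 1 - (1 - r2) * (N - 1) / (N - p - 1)
print(f"R2 = {r2:.3f}, adjusted R2 = {adjusted_r2:.3f}")
```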
How do we evaluate a model?
• Models should be evaluated using data that have not been used to build the model itself
• Example: would it be feasible to evaluate students using exactly the same problems solved in class?
• The available data must be split between training and test
–Training data will be used to build the model
–Test data will be used to evaluate the model performance
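A minimal sketch of such a train/test split, assuming scikit-learn (the diabetes dataset and the 70/30 split are illustrative choices, not prescribed by the flashcard):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Training data builds the model, test data evaluates it
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("Test R2:", r2_score(y_test, model.predict(X_test)))
```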
What is cross-validation?
• First step
– Data is split into k subsets of equal size
• Second step
–Each subset in turn is used for testing and the remainder for training
• This is called k-fold cross-validation and avoids overlapping test sets
• Often the subsets are stratified before cross-validation is performed
• The error estimates are averaged to yield an overall error estimate
• Standard method for evaluation: stratified ten-fold cross-validation
• Why ten? Experiments have shown that this is the best choice to get an accurate estimate
• Stratification reduces the estimate’s variance
• Even better : repeated stratified cross-validation
• Ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance)
• Other approaches also appear to be robust, e.g., 5x2 cross-validation
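A minimal sketch of repeated stratified ten-fold cross-validation with scikit-learn (the breast-cancer dataset and logistic-regression classifier are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Stratified ten-fold cross-validation repeated ten times; scores are averaged to reduce variance
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```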
What is overfitting?
Very good performance on the training set (model fits precisely patterns present in training data)
Terrible performance on the test set (patterns were just noise and are no longer present)
Why and how do regularizations such as Ridge and Lasso work?
• Ridge and Lasso add a penalty on the weights to the RSS cost: Ridge adds an L2 term (\lambda \sum w_i^2), Lasso adds an L1 term (\lambda \sum |w_i|)
• The penalty shrinks the weights (Lasso can drive some of them exactly to zero), limiting the model's ability to fit noise in the training data and therefore reducing overfitting
How can we analyze the effect of regularizations on models?
We could plot the weight values before and after the application of the regularizations
We could also analyze the effect of the regularizations as the alpha value changes, plotting the weight values against the variation of alpha.
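A minimal sketch of such a plot, assuming scikit-learn and matplotlib (the diabetes dataset and the alpha grid are illustrative choices):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge

X, y = load_diabetes(return_X_y=True)
alphas = np.logspace(-3, 3, 50)

# Collect the learned weights for each alpha value
ridge_coefs = [Ridge(alpha=a).fit(X, y).coef_ for a in alphas]
lasso_coefs = [Lasso(alpha=a, max_iter=10000).fit(X, y).coef_ for a in alphas]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(alphas, ridge_coefs)
ax1.set(xscale="log", xlabel="alpha", ylabel="weight value", title="Ridge coefficient paths")
ax2.plot(alphas, lasso_coefs)
ax2.set(xscale="log", xlabel="alpha", title="Lasso coefficient paths")
plt.tight_layout()
plt.show()
```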
What are the strategies to evaluate the best alpha value?
• To select the best value of α we cannot use the test set since it is going to be used for evaluating the final model (which uses α)
• Need to reserve part of the training data to evaluate possible candidate values of α and to select the best one
• If we have enough data, we can extract a validation set from the training data which will be used to select α
• If we don’t have enough data, we should select α by applying k-fold cross-validation over the training data choosing the α corresponding to the lowest average cost over the k folds
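A minimal sketch of selecting alpha by k-fold cross-validation over the training data, assuming scikit-learn (GridSearchCV with a Ridge model and a log-spaced alpha grid is one possible setup):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 5-fold cross-validation over the training data only; the test set stays untouched
param_grid = {"alpha": np.logspace(-3, 3, 20)}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

print("Best alpha:", search.best_params_["alpha"])
print("Test-set score of the refit model:", search.score(X_test, y_test))
```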
What are some of the metrics used to evaluate classification models
• Accuracy
– Classifier accuracy in predicting the correct class labels
• Speed
–Time to construct the model (training time)
–Time to use the model to label unseen data
• Other Criteria
–Robustness in handling noise
– Scalability
– Interpretability
What are linear classifiers? How do they work?
Linear classifiers are algorithms used in machine learning to classify data points by separating them into different classes using a linear decision boundary. They work by finding a hyperplane (a line in 2D, a plane in 3D, or a higher-dimensional equivalent) that best divides the data points of different classes.
Key Components:
1. Linear Decision Boundary: The boundary is defined by a linear equation of the form:
f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b
where:
• \mathbf{x} is the input feature vector.
• \mathbf{w} is the weight vector that defines the orientation of the hyperplane.
• b is the bias term that shifts the hyperplane.
2. Classification Rule:
• A data point is classified based on which side of the hyperplane it lies. For binary classification:
\text{Class 1 if } f(\mathbf{x}) \geq 0, \text{ otherwise Class 2}.
How Linear Classifiers Work:
1. Training: The algorithm adjusts the weights ( \mathbf{w} ) and bias ( b ) during training using labeled data so that the hyperplane best separates the classes.
• Algorithms like the Perceptron, Support Vector Machines (SVM), or optimization techniques like Gradient Descent are used for this purpose.
2. Prediction: For a new input, the model calculates f(\mathbf{x}) and determines the class based on the sign or value of f(\mathbf{x}).
3. Evaluation: The performance of the classifier is measured using metrics like accuracy, precision, recall, and others.
Common Examples of Linear Classifiers:
1. Logistic Regression: Models the probability of a binary outcome and uses a logistic function.
2. Support Vector Machines (Linear Kernel): Maximizes the margin between classes while finding the optimal hyperplane.
3. Perceptron Algorithm: A simple linear classifier that adjusts weights iteratively.
Limitations:
• Not Suitable for Non-linear Data: Linear classifiers cannot model complex relationships or datasets where classes are not linearly separable.
• Sensitive to Feature Scaling: The performance depends heavily on how features are scaled.
Extensions for Non-linear Data:
• Kernel methods (e.g., in SVMs) or feature transformations (e.g., polynomial features) can help handle non-linear data while still using a linear classifier approach.
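A minimal sketch of the decision rule f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b, assuming scikit-learn (the synthetic 2D data and the linear SVM are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Two roughly linearly separable classes in 2D (synthetic data for illustration)
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1, random_state=0)

clf = LinearSVC(C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Classification rule: class 1 if f(x) = w^T x + b >= 0, otherwise class 0
x_new = np.array([0.5, -1.0])
f_x = w @ x_new + b
print("Predicted class:", int(f_x >= 0), "| sklearn predicts:", clf.predict([x_new])[0])
```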
Detail the logistic regression technique
Logistic regression is a supervised learning technique used for binary classification problems, where the output variable can take one of two possible values (e.g., yes/no, 0/1, spam/not spam). Unlike linear regression, logistic regression predicts the probability that a given input belongs to a particular class, mapping the output to a range between 0 and 1 using a sigmoid function.
- Key Concepts
Model Equation
Logistic regression uses the following model:
P(y=1 | \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b)
where:
• \mathbf{x} : Input feature vector.
• \mathbf{w} : Weight vector (coefficients).
• b : Bias (intercept).
• \sigma(z) : Sigmoid function defined as:
\sigma(z) = \frac{1}{1 + e^{-z}}
The sigmoid function maps any real-valued number to the range [0, 1].
Decision Boundary
To classify data points, logistic regression uses a threshold (e.g., 0.5):
• If P(y=1 | \mathbf{x}) \geq 0.5 , classify as class 1.
• Otherwise, classify as class 0.
The decision boundary is a linear hyperplane, defined by:
\mathbf{w}^T\mathbf{x} + b = 0
- Training Process
Log-Likelihood Function
The model is trained by maximizing the likelihood of the observed data. The likelihood for a dataset with n samples is:
L(\mathbf{w}, b) = \prod_{i=1}^n P(y_i | \mathbf{x}_i)
Taking the logarithm (log-likelihood) simplifies computation:
\log L(\mathbf{w}, b) = \sum_{i=1}^n \Big[ y_i \log P(y_i = 1 | \mathbf{x}_i) + (1 - y_i) \log (1 - P(y_i = 1 | \mathbf{x}_i)) \Big]
Optimization
The log-likelihood function is maximized to find the optimal weights ( \mathbf{w} ) and bias ( b ):
1. Gradient Descent or variants like Stochastic Gradient Descent (SGD) are commonly used to optimize the parameters.
2. The gradients of the log-likelihood with respect to the parameters are computed to update them iteratively:
\mathbf{w} \gets \mathbf{w} + \eta \nabla_{\mathbf{w}} \log L
where \eta is the learning rate.
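A minimal NumPy sketch of this training loop (plain batch gradient ascent on the log-likelihood; the learning rate, iteration count, and synthetic data are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, eta=0.1, n_iters=1000):
    """Gradient ascent on the log-likelihood for binary labels y in {0, 1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)      # P(y=1 | x) for every sample
        grad_w = X.T @ (y - p)      # gradient of the log-likelihood w.r.t. w
        grad_b = np.sum(y - p)      # gradient w.r.t. b
        w += eta * grad_w / n
        b += eta * grad_b / n
    return w, b

# Tiny synthetic example
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
w, b = fit_logistic_regression(X, y)
preds = (sigmoid(X @ w + b) >= 0.5).astype(float)
print("Training accuracy:", (preds == y).mean())
```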
- Advantages
• Probabilistic Output: Predicts probabilities, which makes it interpretable and useful in risk-based decision-making.
• Efficient: Works well for linearly separable datasets and is computationally efficient.
• Feature Importance: The learned weights ( \mathbf{w} ) provide insights into feature importance.
- Limitations
• Linear Decision Boundary: Cannot handle non-linear relationships unless features are transformed.
• Imbalanced Data: Can perform poorly if one class dominates the dataset. Techniques like class weighting or oversampling are needed.
• Outliers: Sensitive to outliers, which can significantly affect the decision boundary.
- Extensions
Multinomial Logistic Regression:
For multi-class classification, logistic regression can be extended using the softmax function, which generalizes the sigmoid function to multiple classes.
P(y = k | \mathbf{x}) = \frac{e^{\mathbf{w}_k^T \mathbf{x}}}{\sum_{j=1}^K e^{\mathbf{w}_j^T \mathbf{x}}}
Regularized Logistic Regression:
Adding regularization terms helps prevent overfitting:
• L1 Regularization: Adds \lambda \sum |w_i| (LASSO).
• L2 Regularization: Adds \lambda \sum w_i^2 (Ridge).
- Applications
• Medical Diagnosis: Predicting the presence of a disease (e.g., diabetes).
• Spam Filtering: Classifying emails as spam or not spam.
• Customer Churn Prediction: Identifying customers likely to leave a service.
• Credit Scoring: Determining the likelihood of loan default.
By mapping probabilities to binary outcomes with a linear decision boundary, logistic regression is a simple yet powerful classification tool.
Define the one versus the rest multi class classification technique
• For each class, it creates one classifier that predicts the target class against all the others
• Given three classes A, B, C, it computes three models
– One that predicts A against B and C
– One that predicts B against A and C, and
– One that predicts C against A and B
• Then, given an example, all three classifiers are applied and the label with the highest probability is returned
• Alternative approaches include the minimization of loss based on the multinomial loss fit across the entire probability distribution
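A minimal sketch of one-vs-rest, assuming scikit-learn (OneVsRestClassifier wrapping a binary logistic regression; the iris dataset is an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # three classes, like A, B, C in the flashcard

# One binary classifier per class: class k against all the others
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# Each inner model scores its own class; the label with the highest probability wins
probs = ovr.predict_proba(X[:3])
print("Per-class probabilities:\n", probs.round(3))
print("Predicted labels:", ovr.predict(X[:3]))
```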
How can we use logistic regression for multiclass classification?
Logistic regression can be extended to handle multiclass classification problems (where the output has more than two classes) using two main approaches: One-vs-Rest (OvR) and Multinomial Logistic Regression (Softmax Regression). Here’s how they work:
- One-vs-Rest (OvR) Approach
In this method, logistic regression is applied multiple times, once for each class. For a problem with K classes, the approach works as follows:
1. Binary Classifiers: Train K binary logistic regression classifiers, where each classifier distinguishes one class from the rest (e.g., “Class 1 vs. Not Class 1,” “Class 2 vs. Not Class 2,” and so on).
2. Prediction:
• For a new input, each classifier predicts a probability for its respective class.
• The class with the highest probability is assigned as the final prediction:
\hat{y} = \arg\max_{k \in \{1, 2, \dots, K\}} P(y=k | \mathbf{x})
Advantages of OvR:
• Simple to implement using binary logistic regression.
• Efficient for problems with a small number of classes.
Limitations of OvR:
• Can be computationally expensive for large numbers of classes (since K models are trained).
• May not perform as well if the classes are highly imbalanced.
- Multinomial Logistic Regression (Softmax Regression)
This is the direct extension of logistic regression for multiclass classification, where a single model predicts the probabilities for all K classes simultaneously. It uses the softmax function to ensure the output probabilities for all classes sum to 1.
Model
For a dataset with K classes, the probability of a data point \mathbf{x} belonging to class k is given by:
P(y = k | \mathbf{x}) = \frac{\exp(\mathbf{w}_k^T \mathbf{x} + b_k)}{\sum_{j=1}^K \exp(\mathbf{w}_j^T \mathbf{x} + b_j)}
where:
• \mathbf{w}_k and b_k are the weight vector and bias for class k .
• The denominator normalizes the probabilities.
Decision Rule
The predicted class is the one with the highest probability:
\hat{y} = \arg\max_{k \in \{1, 2, \dots, K\}} P(y = k | \mathbf{x})
Training
The model is trained by maximizing the log-likelihood for all classes. For n samples, the log-likelihood is:
\log L = \sum_{i=1}^n \sum_{k=1}^K \mathbf{1}(y_i = k) \log P(y_i = k | \mathbf{x}_i)
where \mathbf{1}(y_i = k) is an indicator function (1 if y_i = k , 0 otherwise).
Optimization is done using methods like Gradient Descent or Stochastic Gradient Descent.
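A minimal sketch of softmax regression, assuming scikit-learn (with the default lbfgs solver, LogisticRegression fits the multinomial loss jointly over all classes; the iris dataset is an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# A single model trained jointly over all K classes with the multinomial (softmax) loss
softmax_clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X, y)

# The probabilities over all classes sum to 1 for each sample
probs = softmax_clf.predict_proba(X[:2])
print(probs.round(3), probs.sum(axis=1))
print("Predicted classes:", softmax_clf.predict(X[:2]))
```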
Advantages of Softmax Regression:
• Single model handles all classes.
• Provides probabilistic outputs for all classes.
• Works well for balanced and separable datasets.
Limitations of Softmax Regression:
• Computationally expensive for datasets with a large number of classes.
• Assumes linear separability in the feature space.
- Regularization
To prevent overfitting, regularization can be applied to both approaches:
• L1 Regularization (LASSO): Encourages sparsity in weights.
• L2 Regularization (Ridge): Penalizes large weights to improve generalization.
- Comparison of OvR and Softmax
| Feature | OvR | Softmax Regression |
| --- | --- | --- |
| Number of Models | K binary models | 1 multinomial model |
| Training Complexity | Linear in K | More complex (joint training for all classes) |
| Output | Class probabilities for each binary model | Probabilities for all classes in one step |
| Use Case | Few classes, simpler datasets | Balanced and larger datasets |
- Applications
• Image Classification: Recognizing objects (e.g., dog, cat, car) in images.
• Document Classification: Classifying documents into categories (e.g., sports, technology, politics).
• Medical Diagnosis: Predicting types of diseases or conditions.
By choosing between OvR and Softmax Regression based on the dataset and problem requirements, logistic regression becomes a versatile tool for multiclass classification tasks.
Define the confusion matrix and its attributes. What is the importance of distinguishing the different types of errors?
Confusion Matrix: Definition
A confusion matrix is a tool used to evaluate the performance of a classification model. It provides a summary of the predictions made by the model compared to the actual labels in the dataset. It breaks down the outcomes into four categories: True Positives, True Negatives, False Positives, and False Negatives, which give insight into the types of errors the model makes.
Attributes of the Confusion Matrix
1. True Positives (TP):
• Instances where the model correctly predicts the positive class.
• For example, the model predicts “disease present” when the disease is indeed present.
2. True Negatives (TN):
• Instances where the model correctly predicts the negative class.
• For example, the model predicts “no disease” when there is no disease.
3. False Positives (FP):
• Instances where the model incorrectly predicts the positive class.
• For example, the model predicts “disease present” when there is no disease.
• This is also known as a Type I error or a “false alarm.”
4. False Negatives (FN):
• Instances where the model incorrectly predicts the negative class.
• For example, the model predicts “no disease” when the disease is present.
• This is also known as a Type II error or a “miss.”
Importance of Distinguishing Different Types of Errors
1. Context-Specific Impact:
• The severity of false positives and false negatives depends on the application.
• In medical diagnosis, a false negative (missing a disease) may be life-threatening, while a false positive (incorrectly diagnosing a disease) may cause unnecessary anxiety and tests.
2. Decision-Making:
• By understanding the types of errors, we can adjust the model to minimize the more critical error type. For example, in fraud detection, reducing false negatives (undetected fraud) is often more important than reducing false positives (flagging legitimate transactions as fraud).
3. Model Evaluation:
• Metrics like precision, recall, and F1-score depend on these error types. For instance, precision focuses on minimizing false positives, while recall emphasizes reducing false negatives.
4. Imbalanced Datasets:
• In datasets with imbalanced classes (e.g., rare diseases), accuracy alone can be misleading. Distinguishing errors helps ensure the model is evaluated based on how well it handles the minority class.
5. Real-World Implications:
• Understanding and balancing the trade-offs between false positives and false negatives ensures the model’s outputs align with the desired outcomes in practical scenarios.
By analyzing the confusion matrix, we can fine-tune a model to achieve a balance that best fits the specific goals and constraints of the application.
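A minimal sketch of deriving TP, TN, FP, and FN with scikit-learn (the labels and predictions are made-up values for illustration):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# scikit-learn orders the matrix as [[TN, FP], [FN, TP]] for labels [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```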
How can we use the confusion matrix to calculate the accuracy of a model? Why is that not enough? In which situations is accuracy not effective/useful?
Using the Confusion Matrix to Calculate Accuracy
The accuracy of a model measures the proportion of correct predictions (both true positives and true negatives) out of the total predictions. It can be calculated from the confusion matrix using the formula:
\text{Accuracy} = \frac{\text{True Positives (TP)} + \text{True Negatives (TN)}}{\text{Total Predictions (TP + TN + FP + FN)}}
In simple terms, it is the ratio of correctly classified instances (both positive and negative) to the total number of instances in the dataset.
Why Accuracy Is Not Always Enough
Although accuracy is intuitive and easy to calculate, it does not always provide a complete picture of model performance. This is because:
1. Class Imbalance:
• In datasets where one class dominates (e.g., 95% of samples belong to Class A and only 5% to Class B), a model that always predicts Class A will achieve 95% accuracy but will fail completely at identifying Class B.
• In such cases, accuracy is misleading because it does not account for the model’s ability to correctly classify the minority class.
2. No Insight into Error Types:
• Accuracy does not distinguish between false positives (Type I errors) and false negatives (Type II errors). For certain applications, one type of error might be far more critical than the other.
• Example: In cancer detection, missing a cancer case (false negative) is more serious than falsely diagnosing cancer (false positive).
3. Lack of Granularity:
• Accuracy is a single metric and does not provide insights into specific aspects of the model’s performance, such as precision, recall, or the trade-offs between them.
4. Overfitting and Bias:
• High accuracy might indicate overfitting to the training data or bias in the dataset, where the model memorizes patterns instead of generalizing well.
Situations Where Accuracy Is Not Effective
1. Imbalanced Datasets:
• Example: In fraud detection, where only 1% of transactions are fraudulent, a model predicting all transactions as “non-fraudulent” will have 99% accuracy but will fail to detect any fraud cases.
2. High Cost of Specific Errors:
• Example: In medical diagnosis, missing a disease (false negative) might have serious consequences, even if the model achieves high accuracy overall.
3. Multi-Class Problems:
• In multi-class classification, accuracy alone does not reveal which classes are being misclassified and whether certain classes are disproportionately affected.
4. Anomalies and Rare Events:
• Example: In cybersecurity, detecting rare attacks is crucial, and a high accuracy model might fail to identify these rare cases effectively.
Better Alternatives to Accuracy
When accuracy is not effective, other metrics derived from the confusion matrix are more informative:
1. Precision:
• Focuses on the reliability of positive predictions ( \frac{TP}{TP + FP} ).
• Useful when false positives are costly (e.g., spam filtering).
2. Recall (Sensitivity):
• Measures the ability to identify all actual positives ( \frac{TP}{TP + FN} ).
• Useful when false negatives are costly (e.g., medical diagnosis).
3. F1-Score:
• Combines precision and recall into a single metric ( 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} ).
• Useful for imbalanced datasets.
4. Specificity:
• Measures the ability to identify actual negatives ( \frac{TN}{TN + FP} ).
• Important when false positives need to be minimized.
5. ROC-AUC (Receiver Operating Characteristic - Area Under Curve):
• Evaluates the trade-off between true positive and false positive rates across different thresholds.
By considering these metrics alongside accuracy, we gain a more comprehensive understanding of model performance, especially in critical or imbalanced scenarios.
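A minimal sketch computing these alternatives with scikit-learn (the labels, predictions, and scores are made-up values for an imbalanced toy example):

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical imbalanced data: few positives, plus predicted labels and scores
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]
y_score = [0.1, 0.2, 0.15, 0.3, 0.05, 0.6, 0.4, 0.9, 0.8, 0.45]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_score))
```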
How can we use a cost matrix along with the confusion matrix to better evaluate a model?
Using a Cost Matrix with a Confusion Matrix
A cost matrix is a tool used to quantify the cost or impact of different types of errors (false positives and false negatives) and correct predictions (true positives and true negatives). By combining it with a confusion matrix, we can evaluate a model’s performance more realistically, considering the actual costs or consequences of its predictions.
How a Cost Matrix Works
A cost matrix assigns a numerical value (cost) to each outcome in the confusion matrix:
• True Positives (TP): Often assigned a reward or zero cost.
• True Negatives (TN): Often assigned a reward or zero cost.
• False Positives (FP): Associated with the cost of a Type I error.
• False Negatives (FN): Associated with the cost of a Type II error.
Example Cost Matrix for Binary Classification:
| Actual \ Predicted | Positive | Negative |
| --- | --- | --- |
| Positive (Actual) | 0 (Reward) | FN Cost |
| Negative (Actual) | FP Cost | 0 (Reward) |
Steps to Use a Cost Matrix with a Confusion Matrix
1. Calculate the Confusion Matrix:
• Derive the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) for the model’s predictions.
2. Define the Cost Matrix:
• Assign costs based on the problem’s context. For example:
• In fraud detection: Cost of missing a fraud (FN) is much higher than falsely flagging a legitimate transaction (FP).
• In medical diagnosis: Missing a disease (FN) may have life-threatening consequences, while misdiagnosing a healthy patient (FP) may result in unnecessary tests.
3. Compute the Total Cost:
• Multiply the confusion matrix values by the corresponding costs from the cost matrix.
• Calculate the total cost using the formula:
\text{Total Cost} = (TP \cdot C_{TP}) + (FP \cdot C_{FP}) + (FN \cdot C_{FN}) + (TN \cdot C_{TN})
• Where C_{TP}, C_{FP}, C_{FN}, C_{TN} are the costs from the cost matrix.
4. Evaluate the Model:
• Compare the total costs of different models to identify the one that minimizes the overall cost, rather than solely relying on metrics like accuracy.
Why a Cost Matrix Improves Model Evaluation
1. Realistic Decision-Making:
• Incorporates the real-world consequences of errors, making the evaluation more aligned with the application’s requirements.
• Example: In fraud detection, the cost of missing fraud is higher than the cost of flagging a legitimate transaction.
2. Prioritization of Errors:
• Helps prioritize reducing specific errors (false positives or false negatives) based on their impact.
3. Balancing Class Imbalances:
• Adjusts for the unequal importance of classes, especially in datasets with rare but critical events (e.g., fraud, diseases).
4. Guides Threshold Selection:
• A cost-sensitive approach can help choose an optimal decision threshold to minimize overall costs.
Example
Scenario: Medical Diagnosis
• TP (Correctly detects disease): Cost = $0 (Reward for correct detection).
• FP (Healthy person diagnosed with disease): Cost = $100 (Cost of unnecessary tests).
• FN (Missed disease): Cost = $10,000 (Cost of untreated disease).
• TN (Correctly identifies healthy): Cost = $0.
Suppose the confusion matrix for a model is:
• TP = 90, FP = 10, FN = 5, TN = 95.
Cost Computation:
\text{Total Cost} = (90 \cdot 0) + (10 \cdot 100) + (5 \cdot 10,000) + (95 \cdot 0)
\text{Total Cost} = 0 + 1,000 + 50,000 + 0 = 51,000
This cost-driven evaluation reveals the high penalty for false negatives, emphasizing the need for a model with higher recall.
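A minimal sketch of the cost computation above (counts and costs copied from the example):

```python
# Confusion-matrix counts from the example
tp, fp, fn, tn = 90, 10, 5, 95

# Cost matrix from the medical-diagnosis scenario (dollars)
cost_tp, cost_fp, cost_fn, cost_tn = 0, 100, 10_000, 0

total_cost = tp * cost_tp + fp * cost_fp + fn * cost_fn + tn * cost_tn
print(f"Total cost: ${total_cost:,}")  # -> Total cost: $51,000
```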
When to Use a Cost Matrix
1. Applications with High-Stakes Errors:
• Fraud detection, medical diagnosis, credit risk analysis, cybersecurity.
2. Imbalanced Datasets:
• When class distribution is skewed, and some errors (e.g., false negatives) are more critical than others.
3. Cost-Sensitive Decision Making:
• In scenarios where the focus is on minimizing the overall cost rather than maximizing general metrics like accuracy.
By using a cost matrix, we shift from generic model evaluation to cost-sensitive optimization, enabling better alignment with real-world objectives.
Define the precision and recall metrics
• Alternatives to accuracy, introduced in the area of information retrieval and search engines
• Precision
– In the information retrieval context, precision represents the percentage of documents shown as results that are actually good (relevant)
–Percentage of items classified as positive that are actually positive
• Recall
–Percentage of positive examples that are classified as positive
– In the information retrieval context, recall represents the percentage of good documents shown with respect to the existing ones.
What is the F1 metric? How can we calculate it?
F1 Metric: Definition
The F1 metric (or F1-score) is a measure of a model’s accuracy that considers both precision and recall. It is the harmonic mean of precision and recall, providing a single metric that balances the trade-off between these two measures. The F1-score is especially useful in cases of imbalanced datasets, where accuracy might be misleading.
• Precision: The proportion of true positive predictions out of all positive predictions made by the model.
• Recall (Sensitivity): The proportion of true positive predictions out of all actual positive instances.
The F1-score is calculated to give equal weight to precision and recall, making it effective when both metrics are important.
Formula to Calculate F1-Score
The F1-score is computed using the formula:
F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
Where:
• \text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
• \text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}
Steps to Calculate F1-Score
1. Determine Precision:
• Count the true positives (TP) and false positives (FP).
• Calculate precision using the formula: \text{Precision} = \frac{TP}{TP + FP}.
2. Determine Recall:
• Count the true positives (TP) and false negatives (FN).
• Calculate recall using the formula: \text{Recall} = \frac{TP}{TP + FN}.
3. Calculate the F1-Score:
• Use the precision and recall values in the F1 formula:
F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
Why Use the F1-Score?
1. Balances Precision and Recall:
• When both false positives and false negatives are critical, the F1-score provides a balanced evaluation.
2. Useful for Imbalanced Datasets:
• Accuracy may appear high if the majority class dominates, but the F1-score reflects the model’s performance for minority classes by focusing on TP, FP, and FN.
3. Handles Trade-Offs:
• High precision and low recall (or vice versa) result in a low F1-score, emphasizing the importance of balancing the two.
Example Scenario
Consider a binary classification model:
• The model predicts 50 positives.
• Of these, 30 are true positives (TP) and 20 are false positives (FP).
• There are 70 actual positives in the dataset, so there are 40 false negatives (FN).
Step 1: Calculate Precision
\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} = \frac{30}{30 + 20} = 0.6
Step 2: Calculate Recall
\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} = \frac{30}{30 + 40} = 0.4286
Step 3: Calculate F1-Score
F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = 2 \cdot \frac{0.6 \cdot 0.4286}{0.6 + 0.4286} \approx 0.5
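A minimal check of this worked example (counts copied from the scenario above):

```python
tp, fp, fn = 30, 20, 40  # from the example: 50 predicted positives, 70 actual positives

precision = tp / (tp + fp)   # 0.6
recall = tp / (tp + fn)      # ~0.4286
f1 = 2 * precision * recall / (precision + recall)
print(f"Precision={precision:.3f}, Recall={recall:.4f}, F1={f1:.3f}")  # F1 = 0.5
```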
Limitations of F1-Score
• It does not account for true negatives (TN), so it might not fully reflect the model’s overall performance, particularly when TN is important.
• The F1-score assumes equal importance for precision and recall. If one is more critical, other metrics like the weighted F1-score or a customized cost function might be more suitable.
When to Use the F1-Score
• When dealing with imbalanced datasets.
• When both false positives and false negatives are significant but need to be balanced.
• In applications like fraud detection, medical diagnosis, or spam filtering, where one type of error might dominate but balancing both errors is essential.
What are the interpretations for the weights according to the attribute type?
• Numerical variables
– Increasing the numerical feature by one unit changes the estimated outcome by its weight
• Binary variables
– Changing the variable’s value modifies the outcome by the variable’s weight
• Nominal variables
– They are generally transformed using one-hot-encoding, thus the values are mapped into binary variables
• Intercept
– The interpretation of this weight makes the most sense when the values have been normalized (standardized)
– In this case, the intercept reflects the predicted outcome when all the variables are at mean value
What is class imbalance? How could we solve it?
Class Imbalance
In many data sets there are a disproportionate number of instances that belong to different classes
In health-care applications, we expect to observe a smaller number of subjects who are positively diagnosed.
In credit card fraud detection, fraudulent transactions are greatly outnumbered by legitimate transactions.
Strategies for Imbalance Datasets
• A basic approach for creating balanced training sets is to generate a sample of training instances where the rare class has adequate representation.
• Two types of sampling methods to enhance the representation of the minority class: undersampling and oversampling
• Undersampling
–The frequency of the majority class is reduced to match the frequency of the minority class
– However, some of the useful negative examples may not be chosen for training, therefore, resulting in an inferior classification model.
Oversampling:
• Examples of the minority class are artificially created to make them equal in proportion to the number of negative instances (e.g., by duplicating existing examples or creating new ones)
• Duplicating a positive instance is analogous to doubling its weight during the training stage. The same effect can be achieved by assigning higher weights to positive instances than negative instances (an approach that can be used, for example, with logistic regression, ANN, and SVM).
• Duplicated examples have an artificially lower variance compared with their true distribution in the overall data. This can bias the classifier to the specific distribution of training instances, which may not be representative of the distribution of test instances, leading to poor generalizability.
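A minimal sketch of the weighting alternative mentioned above, assuming scikit-learn (class_weight="balanced" gives higher weight to the minority class; the synthetic imbalanced data is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data: roughly 5% positive instances
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# 'balanced' assigns higher weights to the minority (positive) class,
# which has an effect similar to oversampling by duplication
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
print("Positive-class recall on training data:",
      (clf.predict(X[y == 1]) == 1).mean().round(3))
```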
What is the SMOTE technique?
• To overcome the limitations of oversampling by duplication, we can generate synthetic positive instances in the neighborhood of existing positive instances.
• Synthetic Minority Oversampling Technique (SMOTE)
– First determine the k-nearest positive neighbors of every positive instance x
– Then generate a synthetic positive instance at some intermediate point along the line segment joining x to one of its randomly chosen k-nearest neighbors, xk.
– Repeat the process until the desired number of positive instances is reached
• SMOTE generates new positive instances in the convex hull of the existing positive class. Hence, it does not improve the representation of the positive class outside the boundary of existing positive instances
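A minimal sketch, assuming the imbalanced-learn library is installed, of applying its SMOTE implementation to synthetic imbalanced data:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced data: roughly 5% positive instances
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print("Before SMOTE:", Counter(y))

# k_neighbors controls how many nearest positive neighbors are considered
# when interpolating new synthetic positive instances
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("After SMOTE: ", Counter(y_res))
```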
How to compare the relative performance among competing models?
• Suppose we have two models
– Model MA with an accuracy = 82% computed using 10-fold cross-validation
– Model MB with an accuracy = 80% computed using 10-fold cross-validation
• How much confidence can we place on accuracy of MA and MB?
• Can we say MA is better than MB?
• Can the performance difference be the result of random fluctuations in the test set?
How do we know that the difference in performance is not just due to chance?
We compute the odds of it: apply the t-test and compute the p-value
The p-value represents the probability that the reported difference is due to chance
What is the general idea when applying student-t test to two models?
• First decide on a confidence level, for example, 95%
– Corresponds to false discovery (false positive) rate: 𝛂 = 5%
– How frequently you are willing to declare difference when there is none
• Apply k-fold cross-validation to each model
– Obtaining k evaluations for each algorithm over same folds
• Apply Student’s t-test and compute p-value to determine whether reported difference is statistically significant
– If p-value>𝛂 then difference is not significant (can claim nothing)
– If p-value<𝛂 then difference is significant (claim one better than the other)
– Note that the t-test can be paired or unpaired
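A minimal sketch of a paired t-test over the k per-fold scores, assuming scikit-learn and SciPy (the two models and the dataset are illustrative choices):

```python
from scipy.stats import ttest_rel
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Evaluate both models on the same 10 folds so a paired test applies
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores_a = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv)
scores_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

alpha = 0.05  # 95% confidence level
t_stat, p_value = ttest_rel(scores_a, scores_b)
print(f"p-value = {p_value:.4f}")
print("Significant difference" if p_value < alpha else "Cannot claim a difference")
```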