Data Science Flashcards
(34 cards)
What is accuracy in classification metrics?
Percentage of correct predictions overall.
Good for balanced datasets but can be misleading if classes are imbalanced.
Define precision in classification metrics.
Proportion of positive predictions that are actually correct (TP / (TP + FP)).
Important when false positives are costly, e.g. a cancer screening model where a false positive means unnecessary follow-up procedures.
What is recall (sensitivity) in classification metrics?
Proportion of actual positives correctly identified (TP / (TP + FN)).
Important when missing a positive is costly, e.g. fraud detection.
What does the F1 score represent?
Harmonic mean of precision and recall.
Useful when you want a balance and classes are imbalanced.
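A minimal sketch of the four metrics above (accuracy, precision, recall, F1) using scikit-learn; the label vectors are invented purely for illustration.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical ground-truth labels and model predictions (1 = positive class).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))   # fraction of correct predictions
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1       :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```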
What is the Area Under the ROC Curve (AUC-ROC)?
Measures ability to distinguish between classes across all thresholds.
Higher AUC means better model discrimination.
What does Precision-Recall AUC measure?
Measures the area under the curve plotting precision versus recall at different classification thresholds.
Useful for evaluating models on imbalanced datasets, where the positive class is rare.
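A short sketch of both AUC variants, assuming scikit-learn; roc_auc_score and average_precision_score (a common summary of the precision-recall curve) take predicted probabilities rather than hard labels, and the scores below are made up.

```python
from sklearn.metrics import roc_auc_score, average_precision_score

# Hypothetical true labels and predicted probabilities for the positive class.
y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.05, 0.6]

print("ROC AUC:", roc_auc_score(y_true, y_score))            # threshold-free class separation
print("PR AUC :", average_precision_score(y_true, y_score))  # area under the precision-recall curve
```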
What is a confusion matrix?
Shows TP, TN, FP, FN counts for detailed error analysis.
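A minimal example with scikit-learn; for binary labels, confusion_matrix lays the counts out as [[TN, FP], [FN, TP]] (rows are actual classes, columns are predicted classes).

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")
```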
Define Mean Absolute Error (MAE).
Average absolute difference between predicted and actual values.
Easy to interpret and less sensitive to outliers than MSE.
What is Mean Squared Error (MSE)?
Average squared difference between predicted and actual values.
Penalizes larger errors more heavily.
Explain Root Mean Squared Error (RMSE).
Square root of MSE, same units as target variable.
Commonly used, sensitive to outliers.
What is R-squared (Coefficient of Determination)?
Proportion of variance in the target explained by the model.
Typically ranges from 0 to 1 (it can be negative for a model that fits worse than predicting the mean); higher is better.
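A compact sketch of the four regression metrics above, assuming scikit-learn and NumPy; the target and prediction arrays are invented.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual values and model predictions.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mae  = mean_absolute_error(y_true, y_pred)
mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)              # same units as the target
r2   = r2_score(y_true, y_pred)  # fraction of variance explained

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R^2={r2:.3f}")
```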
What does Log Loss (Cross-Entropy Loss) measure?
Measures uncertainty of classification predictions, penalizing wrong confident predictions.
Lower is better.
Log Loss = -(y * log(p) + (1 - y) * log(1 - p)), averaged over all samples (binary case).
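The formula can be checked directly against scikit-learn's log_loss; a minimal sketch with made-up probabilities:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
p      = np.array([0.9, 0.2, 0.6, 0.95])  # predicted probability of the positive class

# Per-sample binary cross-entropy, averaged, following the card's formula.
manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print(manual, log_loss(y_true, p))  # the two values should agree
```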
What are Lift and Gain used for?
Measure how much better a model targets positives than random selection; widely used in marketing and business to prioritize outreach.
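A rough sketch of top-decile lift with NumPy, under the assumption that lift is the positive rate among the highest-scored 10% of cases divided by the overall positive rate; the labels and scores are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true  = rng.integers(0, 2, size=1000)          # hypothetical outcomes (1 = responded)
y_score = 0.3 * y_true + 0.7 * rng.random(1000)  # hypothetical model scores, correlated with outcome

top_k   = int(0.10 * len(y_score))               # top decile by score
top_idx = np.argsort(y_score)[::-1][:top_k]

lift = y_true[top_idx].mean() / y_true.mean()    # > 1 means better than random targeting
print(f"Top-decile lift: {lift:.2f}")
```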
What metrics should be focused on for imbalanced classification?
Precision, recall, F1, or AUC rather than accuracy.
What are common metrics for regression?
RMSE and MAE are common; use R-squared to understand variance explained.
What is supervised learning?
Uses labeled data to train models that predict outputs from inputs.
Define unsupervised learning.
Finds patterns or groupings in unlabeled data without predefined answers.
What is reinforcement learning?
Trains an agent to make decisions by maximizing rewards through trial and error interactions with an environment.
What is a p-value?
The probability of observing data as extreme as (or more extreme than) what you have, assuming the null hypothesis is true.
Its null hypothesis is the assumption that there is no effect, difference, or association between the variables being tested.
Why is 0.05 commonly used as a p-value threshold?
It is a conventional threshold chosen as a balance between being too strict and too lenient.
What does a p-value below 0.05 indicate?
Results are considered statistically significant, meaning data this extreme would be unlikely if the null hypothesis were true.
Define the null hypothesis.
A default assumption that there is no effect or no difference in the population.
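A hedged illustration with SciPy: a two-sample t-test whose null hypothesis is that two groups share the same mean; the samples are simulated, not real data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=0.0, scale=1.0, size=50)  # simulated control group
group_b = rng.normal(loc=0.5, scale=1.0, size=50)  # simulated treatment group with a shifted mean

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis of equal means at the 0.05 level.")
else:
    print("Fail to reject the null hypothesis.")
```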
What is overfitting and how do you prevent it?
Overfitting happens when a model performs well on training data but poorly on new, unseen data. It can be reduced with cross-validation, regularization (like L1/L2), pruning, or simpler models.
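A small sketch of two of the remedies listed above, assuming scikit-learn: 5-fold cross-validation to estimate performance on unseen data, and ridge (L2) regularization to keep the model simple; the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                   # synthetic features
y = 2.0 * X[:, 0] + 0.5 * rng.normal(size=200)   # target depends on one feature plus noise

# L2 regularization: larger alpha means a stronger penalty and a simpler model.
model = Ridge(alpha=1.0)

# Cross-validation scores come from held-out folds, so they expose overfitting.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("Cross-validated R^2:", scores.mean())
```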
What is specificity?
The proportion of actual negatives correctly identified by the test.
TN / (TN + FP)
A high specificity means the test rarely gives false positives (i.e., it doesn’t wrongly label healthy people as sick).
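Scikit-learn's metrics module has no dedicated specificity function, so a common approach is to derive it from the confusion matrix; a minimal sketch with invented labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 1, 0, 0, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)  # proportion of actual negatives correctly identified
print(f"Specificity = {specificity:.2f}")
```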