Data Science Flashcards

(34 cards)

1
Q

What is accuracy in classification metrics?

A

Percentage of correct predictions overall.

Good for balanced datasets but can be misleading if classes are imbalanced.
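A quick sketch (with invented labels) of how accuracy can mislead when classes are imbalanced:

```python
# Invented example: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class still looks accurate.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.95, despite catching zero positives
```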

2
Q

Define precision in classification metrics.

A

Proportion of positive predictions that are actually correct (TP / (TP + FP)).

Important when false positives are costly, e.g. a cancer-screening model where a false positive sends a healthy patient for unnecessary invasive follow-up tests.

3
Q

What is recall (sensitivity) in classification metrics?

A

Proportion of actual positives correctly identified (TP / (TP + FN)).

Important when missing positives is costly, e.g. fraud detection, where an undetected fraudulent transaction is expensive.

4
Q

What does the F1 score represent?

A

Harmonic mean of precision and recall.

Useful when you want a balance and classes are imbalanced.
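The precision, recall, and F1 cards above can be sketched from hypothetical confusion-matrix counts:

```python
# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives.
tp, fp, fn = 40, 10, 20

precision = tp / (tp + fp)                          # 0.8
recall = tp / (tp + fn)                             # ~0.667
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean, ~0.727

print(precision, recall, f1)
```

The harmonic mean sits closer to the lower of the two, so F1 only scores well when precision and recall are both good.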

5
Q

What is the Area Under the ROC Curve (AUC-ROC)?

A

Measures ability to distinguish between classes across all thresholds.

Higher AUC means better model discrimination.
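A minimal sketch with scikit-learn (labels and scores invented); AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 positive/negative pairs are ranked correctly.
auc = roc_auc_score(y_true, y_scores)
print(auc)  # 0.75
```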

6
Q

What does Precision-Recall AUC measure?

A

Measures the area under the curve plotting precision versus recall at different classification thresholds.

Useful for evaluating models on imbalanced datasets, where the positive class is rare.
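A toy sketch (invented scores) using scikit-learn's average_precision_score, a common single-number summary of the precision-recall curve:

```python
from sklearn.metrics import average_precision_score

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

ap = average_precision_score(y_true, y_scores)
print(ap)  # ~0.83
```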

7
Q

What is a confusion matrix?

A

Shows TP, TN, FP, FN counts for detailed error analysis.
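A minimal sketch with scikit-learn (invented labels); ravel() flattens the 2x2 matrix into the four counts:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, sklearn orders the matrix [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```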

8
Q

Define Mean Absolute Error (MAE).

A

Average absolute difference between predicted and actual values.

Easy to interpret and less sensitive to outliers than squared-error metrics such as MSE.

9
Q

What is Mean Squared Error (MSE)?

A

Average squared difference between predicted and actual values.

Penalizes larger errors more heavily.

10
Q

Explain Root Mean Squared Error (RMSE).

A

Square root of MSE, same units as target variable.

Commonly used, sensitive to outliers.

11
Q

What is R-squared (Coefficient of Determination)?

A

Proportion of variance explained by the model.

Usually ranges from 0 to 1 (it can be negative for models that fit worse than predicting the mean); higher is better.
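The regression metrics from the last few cards, sketched with numpy on invented numbers:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))   # average absolute error
mse = np.mean((y_true - y_pred) ** 2)    # squares punish large errors
rmse = np.sqrt(mse)                      # back in the target's units
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, rmse, r2)  # 0.875  1.3125  ~1.146  ~0.644
```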

12
Q

What does Log Loss (Cross-Entropy Loss) measure?

A

Measures uncertainty of classification predictions, penalizing wrong confident predictions.

Lower is better.

Log Loss = - (y * log(p) + (1 - y) * log(1 - p)) for a single binary example; average over all samples.
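The card's formula applied per example and averaged over an invented batch:

```python
import numpy as np

y = np.array([1, 0, 1, 1])           # true labels
p = np.array([0.9, 0.1, 0.8, 0.4])   # predicted P(y = 1)

loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(loss)  # ~0.338; the worst prediction (p=0.4 for y=1) dominates
```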

13
Q

What are Lift and Gain used for?

A

Used in marketing and business targeting to measure how much better a model's ranked selections perform than random selection (e.g. how many more responders are captured in the top decile).

14
Q

What metrics should be focused on for imbalanced classification?

A

Precision, recall, F1, or AUC rather than accuracy.

15
Q

What are common metrics for regression?

A

RMSE and MAE are common; use R-squared to understand variance explained.

16
Q

What is supervised learning?

A

Uses labeled data to train models that predict outputs from inputs.

17
Q

Define unsupervised learning.

A

Finds patterns or groupings in unlabeled data without predefined answers.

18
Q

What is reinforcement learning?

A

Trains an agent to make decisions by maximizing rewards through trial and error interactions with an environment.

19
Q

What is a p-value?

A

The probability of observing data as extreme as (or more extreme than) what you have, assuming the null hypothesis is true.

Here the null hypothesis is the default assumption that there is no effect, no difference, or no association between the variables being tested.
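A sketch with scipy (invented samples): a two-sample t-test where the groups really do differ, so the p-value comes out small.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=10, scale=2, size=50)  # e.g. control group
b = rng.normal(loc=13, scale=2, size=50)  # e.g. treatment group, truly higher

t_stat, p_value = stats.ttest_ind(a, b)
print(p_value < 0.05)  # True: we reject the null of equal means
```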

20
Q

Why is 0.05 commonly used as a p-value threshold?

A

It is a conventional threshold chosen as a balance between being too strict and too lenient.

21
Q

What does a p-value below 0.05 indicate?

A

Results are considered statistically significant, meaning the observed effect would be unlikely to arise by chance alone if the null hypothesis were true.

22
Q

Define the null hypothesis.

A

A default assumption that there is no effect or no difference in the population.

23
Q

What is overfitting and how do you prevent it?

A

Overfitting happens when a model performs well on training data but poorly on new, unseen data. It can be reduced with cross-validation, regularization (like L1/L2), pruning, or simpler models.
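A sketch of two remedies named on the card, assuming scikit-learn: 5-fold cross-validation to measure generalization, and L2 (Ridge) regularization to constrain the model.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))                       # invented features
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=100)  # only one feature matters

# Each fold is scored on data the model never saw during fitting.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print(scores.mean())
```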

24
Q

What is specificity?

A

Specificity is the proportion of true negatives that are correctly identified by the test.

TN / (TN + FP)

A high specificity means the test rarely gives false positives (i.e., it doesn’t wrongly label healthy people as sick).
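The card's formula on invented screening counts:

```python
# Invented counts: 90 healthy people tested, 81 correct negatives, 9 false alarms.
tn, fp = 81, 9

specificity = tn / (tn + fp)
print(specificity)  # 0.9: only 10% of healthy people are wrongly flagged
```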

25
Q

What is multicollinearity and why is it a problem?

A

Multicollinearity occurs when features are highly correlated, which can make model coefficients unstable and inflate variance.

26
Q

What’s the difference between bagging and boosting?

A

Bagging reduces variance by averaging many models trained independently; boosting reduces bias by sequentially training models to correct previous errors.

27
Q

What is the difference between L1 and L2 regularization?

A

L1 (Lasso) encourages sparsity: it can shrink some coefficients exactly to zero. L2 (Ridge) penalizes large weights but doesn’t zero them out.

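A sketch of the sparsity difference, assuming scikit-learn, on invented data where only the first two of ten features carry signal:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print((lasso.coef_ == 0).sum())  # L1 zeroes the irrelevant coefficients
print((ridge.coef_ == 0).sum())  # L2 shrinks them, but not to exactly zero
```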
28
Q

What is the central limit theorem (CLT)?

A

The CLT states that the sampling distribution of the sample mean of independent, identically distributed variables (with finite variance) approaches a normal distribution as the sample size increases, regardless of the shape of the underlying distribution.

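A quick simulation (assuming numpy) of the CLT using a skewed exponential distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
# 10,000 samples of size 50 from a skewed (exponential) distribution.
sample_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)

# The means cluster normally around the population mean (1.0)
# with standard error ~ 1/sqrt(50) ~ 0.141.
print(sample_means.mean(), sample_means.std())
```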
29
Q

What is the difference between Type I and Type II errors?

A

Type I: false positive (rejecting a true null hypothesis). Type II: false negative (failing to reject a false null hypothesis).

30
Q

What is PCA?

A

PCA is a dimensionality reduction technique that transforms high-dimensional data into a smaller number of uncorrelated variables called principal components, which capture the maximum variance in the data.

It's used to simplify models, reduce noise, and visualize data. PCA is unsupervised and assumes linear relationships.

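A sketch assuming scikit-learn: invented 5-D data whose variance lies almost entirely along one direction, which the first principal component recovers.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
direction = np.array([[2.0, 1.0, 0.5, 0.0, 0.0]])
X = rng.normal(size=(100, 1)) @ direction + 0.1 * rng.normal(size=(100, 5))

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # first component dominates
```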
31
Q

What is a Kernel in ML?

A

A kernel is a function that computes similarity between data points in a higher-dimensional feature space without explicitly transforming the data, known as the “kernel trick.”

Used in algorithms like Support Vector Machines (SVMs) and Kernel PCA to capture non-linear relationships. Common kernels: linear, polynomial, RBF (Gaussian).

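A sketch assuming scikit-learn: the XOR pattern is not linearly separable, but an RBF-kernel SVM separates it by working in an implicit higher-dimensional space.

```python
import numpy as np
from sklearn.svm import SVC

# XOR: no straight line separates the two classes in 2-D.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

clf = SVC(kernel="rbf", gamma=2.0, C=10.0).fit(X, y)
print(clf.predict(X))  # [0 1 1 0]
```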
32
Q

What is a random forest?

A

An ensemble of decision trees trained on bootstrapped samples and random subsets of features to improve generalization.

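A sketch assuming scikit-learn, on invented data where the class is determined by the first feature's sign:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# 100 trees, each fit on a bootstrap sample with random feature subsets.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.score(X, y))  # training accuracy; evaluate on held-out data in practice
```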
33
Q

What is gradient boosting?

A

An ensemble method where trees are added sequentially to correct errors made by previous trees (e.g., XGBoost, LightGBM).

34
Q

What is heteroscedasticity?

A

When the variance of the errors is not constant across observations, which violates a standard assumption of linear regression.