Data Science Flashcards

(34 cards)

1
Q

What is accuracy in classification metrics?

A

Percentage of correct predictions overall.

Good for balanced datasets but can be misleading if classes are imbalanced.
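A quick sketch (with invented labels) of how accuracy can mislead when classes are imbalanced:

```python
# Invented example: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class still looks accurate.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.95, despite catching zero positives
```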

2
Q

Define precision in classification metrics.

A

Proportion of positive predictions that are actually correct (TP / (TP + FP)).

Important when false positives are costly, e.g. a cancer-screening model where a false positive sends a healthy patient for unnecessary invasive follow-up tests.

3
Q

What is recall (sensitivity) in classification metrics?

A

Proportion of actual positives correctly identified (TP / (TP + FN)).

Important when missing positives is costly, e.g. fraud detection, where an undetected fraudulent transaction is expensive.

4
Q

What does the F1 score represent?

A

Harmonic mean of precision and recall.

Useful when you want a balance and classes are imbalanced.
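The precision, recall, and F1 cards above can be sketched from hypothetical confusion-matrix counts:

```python
# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives.
tp, fp, fn = 40, 10, 20

precision = tp / (tp + fp)                          # 0.8
recall = tp / (tp + fn)                             # ~0.667
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean, ~0.727

print(precision, recall, f1)
```

The harmonic mean sits closer to the lower of the two, so F1 only scores well when precision and recall are both good.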

5
Q

What is the Area Under the ROC Curve (AUC-ROC)?

A

Measures ability to distinguish between classes across all thresholds.

Higher AUC means better model discrimination.
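A minimal sketch with scikit-learn (labels and scores invented); AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 positive/negative pairs are ranked correctly.
auc = roc_auc_score(y_true, y_scores)
print(auc)  # 0.75
```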

6
Q

What does Precision-Recall AUC measure?

A

Measures the area under the curve plotting precision versus recall at different classification thresholds.

Useful for evaluating models on imbalanced datasets, where the positive class is rare.
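A toy sketch (invented scores) using scikit-learn's average_precision_score, a common single-number summary of the precision-recall curve:

```python
from sklearn.metrics import average_precision_score

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

ap = average_precision_score(y_true, y_scores)
print(ap)  # ~0.83
```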

7
Q

What is a confusion matrix?

A

Shows TP, TN, FP, FN counts for detailed error analysis.
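A minimal sketch with scikit-learn (invented labels); ravel() flattens the 2x2 matrix into the four counts:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, sklearn orders the matrix [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```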

8
Q

Define Mean Absolute Error (MAE).

A

Average absolute difference between predicted and actual values.

Easy to interpret and less sensitive to outliers than squared-error metrics such as MSE.

9
Q

What is Mean Squared Error (MSE)?

A

Average squared difference between predicted and actual values.

Penalizes larger errors more heavily.

10
Q

Explain Root Mean Squared Error (RMSE).

A

Square root of MSE, same units as target variable.

Commonly used, sensitive to outliers.

11
Q

What is R-squared (Coefficient of Determination)?

A

Proportion of variance explained by the model.

Usually ranges from 0 to 1 (it can be negative for models that fit worse than predicting the mean); higher is better.
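The regression metrics from the last few cards, sketched with numpy on invented numbers:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))   # average absolute error
mse = np.mean((y_true - y_pred) ** 2)    # squares punish large errors
rmse = np.sqrt(mse)                      # back in the target's units
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, rmse, r2)  # 0.875  1.3125  ~1.146  ~0.644
```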

12
Q

What does Log Loss (Cross-Entropy Loss) measure?

A

Measures uncertainty of classification predictions, penalizing wrong confident predictions.

Lower is better.

Log Loss = - (y * log(p) + (1 - y) * log(1 - p)) for a single binary example; average over all samples.
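The card's formula applied per example and averaged over an invented batch:

```python
import numpy as np

y = np.array([1, 0, 1, 1])           # true labels
p = np.array([0.9, 0.1, 0.8, 0.4])   # predicted P(y = 1)

loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(loss)  # ~0.338; the worst prediction (p=0.4 for y=1) dominates
```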

13
Q

What are Lift and Gain used for?

A

Used in marketing and business targeting to measure how much better a model's ranked selections perform than random selection (e.g. how many more responders are captured in the top decile).

14
Q

What metrics should be focused on for imbalanced classification?

A

Precision, recall, F1, or AUC rather than accuracy.

15
Q

What are common metrics for regression?

A

RMSE and MAE are common; use R-squared to understand variance explained.

16
Q

What is supervised learning?

A

Uses labeled data to train models that predict outputs from inputs.

17
Q

Define unsupervised learning.

A

Finds patterns or groupings in unlabeled data without predefined answers.

18
Q

What is reinforcement learning?

A

Trains an agent to make decisions by maximizing rewards through trial and error interactions with an environment.

19
Q

What is a p-value?

A

The probability of observing data as extreme as (or more extreme than) what you have, assuming the null hypothesis is true.

Here the null hypothesis is the default assumption that there is no effect, no difference, or no association between the variables being tested.
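A sketch with scipy (invented samples): a two-sample t-test where the groups really do differ, so the p-value comes out small.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=10, scale=2, size=50)  # e.g. control group
b = rng.normal(loc=13, scale=2, size=50)  # e.g. treatment group, truly higher

t_stat, p_value = stats.ttest_ind(a, b)
print(p_value < 0.05)  # True: we reject the null of equal means
```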

20
Q

Why is 0.05 commonly used as a p-value threshold?

A

It is a conventional threshold chosen as a balance between being too strict and too lenient.

21
Q

What does a p-value below 0.05 indicate?

A

Results are considered statistically significant, meaning the observed effect would be unlikely to arise by chance alone if the null hypothesis were true.

22
Q

Define the null hypothesis.

A

A default assumption that there is no effect or no difference in the population.

23
Q

What is overfitting and how do you prevent it?

A

Overfitting happens when a model performs well on training data but poorly on new, unseen data. It can be reduced with cross-validation, regularization (like L1/L2), pruning, or simpler models.
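A sketch of two remedies named on the card, assuming scikit-learn: 5-fold cross-validation to measure generalization, and L2 (Ridge) regularization to constrain the model.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))                       # invented features
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=100)  # only one feature matters

# Each fold is scored on data the model never saw during fitting.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print(scores.mean())
```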

24
Q

What is specificity?

A

Specificity is the proportion of true negatives that are correctly identified by the test.

TN / (TN + FP)

A high specificity means the test rarely gives false positives (i.e., it doesn’t wrongly label healthy people as sick).
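The card's formula on invented screening counts:

```python
# Invented counts: 90 healthy people tested, 81 correct negatives, 9 false alarms.
tn, fp = 81, 9

specificity = tn / (tn + fp)
print(specificity)  # 0.9: only 10% of healthy people are wrongly flagged
```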

25
Q

What is multicollinearity and why is it a problem?

A

Multicollinearity occurs when features are highly correlated, which can make model coefficients unstable and inflate variance.

26
Q

What’s the difference between bagging and boosting?

A

Bagging reduces variance by averaging many models trained independently; boosting reduces bias by sequentially training models to correct previous errors.

27
Q

What is the difference between L1 and L2 regularization?

A

L1 (Lasso) encourages sparsity: it can shrink some coefficients exactly to zero. L2 (Ridge) penalizes large weights but doesn’t zero them out.

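A sketch of the sparsity difference, assuming scikit-learn, on invented data where only the first two of ten features carry signal:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print((lasso.coef_ == 0).sum())  # L1 zeroes the irrelevant coefficients
print((ridge.coef_ == 0).sum())  # L2 shrinks them, but not to exactly zero
```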
28
Q

What is the central limit theorem (CLT)?

A

The CLT states that the sampling distribution of the sample mean of independent, identically distributed variables (with finite variance) approaches a normal distribution as the sample size increases, regardless of the shape of the underlying distribution.

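A quick simulation (assuming numpy) of the CLT using a skewed exponential distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
# 10,000 samples of size 50 from a skewed (exponential) distribution.
sample_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)

# The means cluster normally around the population mean (1.0)
# with standard error ~ 1/sqrt(50) ~ 0.141.
print(sample_means.mean(), sample_means.std())
```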
29
Q

What is the difference between Type I and Type II errors?

A

Type I: false positive (rejecting a true null hypothesis). Type II: false negative (failing to reject a false null hypothesis).

30
Q

What is PCA?

A

PCA is a dimensionality reduction technique that transforms high-dimensional data into a smaller number of uncorrelated variables called principal components, which capture the maximum variance in the data.

It's used to simplify models, reduce noise, and visualize data. PCA is unsupervised and assumes linear relationships.

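A sketch assuming scikit-learn: invented 5-D data whose variance lies almost entirely along one direction, which the first principal component recovers.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
direction = np.array([[2.0, 1.0, 0.5, 0.0, 0.0]])
X = rng.normal(size=(100, 1)) @ direction + 0.1 * rng.normal(size=(100, 5))

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # first component dominates
```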
31
Q

What is a Kernel in ML?

A

A kernel is a function that computes similarity between data points in a higher-dimensional feature space without explicitly transforming the data, known as the “kernel trick.”

Used in algorithms like Support Vector Machines (SVMs) and Kernel PCA to capture non-linear relationships. Common kernels: linear, polynomial, RBF (Gaussian).

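A sketch assuming scikit-learn: the XOR pattern is not linearly separable, but an RBF-kernel SVM separates it by working in an implicit higher-dimensional space.

```python
import numpy as np
from sklearn.svm import SVC

# XOR: no straight line separates the two classes in 2-D.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

clf = SVC(kernel="rbf", gamma=2.0, C=10.0).fit(X, y)
print(clf.predict(X))  # [0 1 1 0]
```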
32
Q

What is a random forest?

A

An ensemble of decision trees trained on bootstrapped samples and random subsets of features to improve generalization.

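A sketch assuming scikit-learn, on invented data where the class is determined by the first feature's sign:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# 100 trees, each fit on a bootstrap sample with random feature subsets.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.score(X, y))  # training accuracy; evaluate on held-out data in practice
```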
33
Q

What is gradient boosting?

A

An ensemble method where trees are added sequentially to correct errors made by previous trees (e.g., XGBoost, LightGBM).

34
Q

What is heteroscedasticity?

A

When the variance of the errors is not constant across observations, which violates a standard assumption of linear regression.