#11 - Machine Learning Flashcards

(33 cards)

1
Q

What does it mean when the p-values are high and low?

A

A p-value is the probability of observing results at least as extreme as those measured, assuming the null hypothesis is true. A low p-value (typically < 0.05) suggests the null hypothesis can be rejected; a high p-value (> 0.05) means there is not enough evidence to reject it. A value near 0.05 is a borderline case.

2
Q

When is resampling done?

A

Resampling is used to improve model accuracy and estimate uncertainty. It helps validate models on different data patterns, often by creating subsets, shuffling labels, or performing cross-validation.
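
As a sketch of one such technique, the bootstrap (resampling with replacement) can estimate the uncertainty of a statistic; the data values below are made up for illustration:

```python
import random

def bootstrap_mean_ci(data, n_boot=2000, seed=0):
    """Estimate a 95% confidence interval for the mean by resampling
    the data with replacement n_boot times."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in range(len(data))]
        means.append(sum(sample) / len(sample))
    means.sort()
    # Take the 2.5th and 97.5th percentiles of the bootstrap means.
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]

data = [4.1, 5.0, 3.8, 6.2, 5.5, 4.9, 5.1, 4.4]  # hypothetical measurements
lo, hi = bootstrap_mean_ci(data)
```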

3
Q

What do you understand by imbalanced data?

A

Imbalanced data means that one class is much more frequent than others. This can lead to poor model performance because the model may ignore the minority class.

4
Q

Are there any differences between expected value and mean value?

A

Both expected value and mean are similar mathematically, but the mean is used for data samples, while expected value is used for random variables in probability theory.

5
Q

What do you understand by survivorship bias?

A

Survivorship bias is the error of focusing only on successes (those that ‘survived’) while ignoring failures. This can lead to incorrect conclusions or overestimating success.

6
Q

What are confounding variables?

A

Confounding variables (confounders) are extra variables that affect both the independent and dependent variables, creating a false or misleading relationship between them.

7
Q

How are time series problems different from regular regression problems?

A

Time series problems involve predicting future values from historical patterns, using techniques such as autocorrelation or moving averages. The key difference from regular regression is the dependency on time: observations are ordered, and observations close in time are more related than distant ones, so the data cannot be shuffled freely. If the target does not depend on time, it is not a time series problem.
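
A minimal illustration of that time dependency: lag-1 autocorrelation measures how strongly each observation relates to the previous one (pure-Python sketch over a hypothetical series):

```python
def lag1_autocorr(x):
    """Pearson correlation between the series and itself shifted by one step."""
    a, b = x[:-1], x[1:]
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sd_a = sum((u - ma) ** 2 for u in a) ** 0.5
    sd_b = sum((v - mb) ** 2 for v in b) ** 0.5
    return cov / (sd_a * sd_b)

trend = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # strongly time-dependent series
r = lag1_autocorr(trend)                  # close to 1.0 for a steady trend
```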

8
Q

How do you handle a dataset where variables have more than 30% missing values?

A

If the dataset is small, missing values can be imputed with the mean or median (e.g., df.fillna(df.mean()) in pandas). For large datasets, it is usually better to drop the rows, or the variable itself, when too many values are missing, to maintain data quality.
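
Assuming pandas is available, the mean imputation mentioned above might look like this (the column names are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 28],
                   "income": [40000, 52000, np.nan, 47000]})

# Fill numeric NaNs with each column's mean; numeric_only guards
# against non-numeric columns in mixed-type frames.
df_filled = df.fillna(df.mean(numeric_only=True))
```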

9
Q

What is cross-validation and why is it used?

A

Cross-validation is a technique to test model performance by splitting training data into parts and rotating them for training and validation. It helps ensure the model generalizes well to unseen data. Common methods: K-Fold, Leave-One-Out, Holdout.
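
A sketch of how K-Fold index rotation works, without relying on any ML library:

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k consecutive folds; each fold serves
    once as the validation set while the rest form the training set."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for i in range(k):
        val = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        splits.append((train, val))
    return splits

splits = kfold_indices(10, 5)  # 5 (train, validation) index pairs
```

Every observation appears in exactly one validation fold, which is what lets the model be scored on all the data without ever validating on points it was trained on.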

10
Q

What are the main differences between correlation and covariance?

A

Correlation measures the strength and direction of a relationship between variables (dimensionless), while covariance measures how two variables change together (has units). Correlation is normalized; covariance is not.
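
A small pure-Python sketch of the relationship (correlation is covariance normalized by both standard deviations, which makes it dimensionless and bounded in [-1, 1]); the height/weight values are made up:

```python
def covariance(x, y):
    """Sample covariance (n - 1 denominator)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    # Normalize covariance by both standard deviations.
    sx = covariance(x, x) ** 0.5
    sy = covariance(y, y) ** 0.5
    return covariance(x, y) / (sx * sy)

heights_cm = [150, 160, 170, 180]
weights_kg = [50, 60, 70, 80]
cov = covariance(heights_cm, weights_kg)   # in cm·kg units
corr = correlation(heights_cm, weights_kg) # dimensionless, here 1.0
```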

11
Q

How do you approach solving a data analytics project?

A

1. Understand the business problem.
2. Explore and analyze the data.
3. Clean and prepare the data.
4. Build and evaluate models, using cross-validation to check generalization.
5. Visualize and communicate the results.
6. Deploy and monitor the model.

12
Q

What is selection bias?

A

Selection bias occurs when the dataset used for analysis is not representative of the population, often due to non-random sampling. This can lead to misleading results.

13
Q

Why is data cleaning crucial?

A

Data cleaning ensures accuracy, consistency, and reliability of data. It prevents errors in analysis and predictions, improves model performance, and is often the most time-consuming step in data science.

14
Q

What are the main types of feature selection methods?

A

Feature selection methods include:
- Filter Methods: Use statistical tests (e.g., Chi-Square, Correlation).
- Wrapper Methods: Use model performance (e.g., Forward/Backward Selection, RFE).
- Embedded Methods: Built into model training (e.g., LASSO, Random Forest Importance).
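
As an illustrative sketch of a filter method, features can be ranked by absolute correlation with the target (the dataset and feature names below are invented):

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical dataset: two candidate features and a target.
features = {
    "useful": [1.0, 2.0, 3.0, 4.0, 5.0],  # tracks the target closely
    "noise":  [2.0, 1.0, 2.0, 1.0, 2.0],  # unrelated to the target
}
target = [1.1, 2.0, 2.9, 4.2, 5.0]

# Filter method: rank features by |correlation| with the target,
# without training any model.
ranked = sorted(features, key=lambda f: abs(pearson(features[f], target)),
                reverse=True)
```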

15
Q

When is it acceptable to treat categorical variables as continuous?

A

If the variable is ordinal (has a clear order), treating it as continuous, for example by encoding the ordered levels as integers, preserves the ordering information and can improve model performance.

16
Q

How do you treat missing values during data analysis?

A

Use mean or median imputation for numerical data and a default or most-frequent value for categorical data. If 80%+ of a variable is missing, drop it or impute, depending on dataset size and context.

17
Q

What does the ROC Curve represent?

A

It shows the trade-off between true positive rate (TPR) and false positive rate (FPR) at different thresholds. Used to evaluate classifier performance.
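
A minimal sketch of how ROC points arise, computing TPR and FPR at several thresholds (the labels and classifier scores are hypothetical):

```python
def tpr_fpr(labels, scores, threshold):
    """True/false positive rates when predicting positive for
    scores >= threshold."""
    tp = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= threshold)
    fp = sum(1 for l, s in zip(labels, scores) if l == 0 and s >= threshold)
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg

labels = [1, 1, 0, 0]
scores = [0.9, 0.6, 0.4, 0.2]  # hypothetical classifier outputs

# Sweeping the threshold from high to low traces out the ROC curve.
points = [tpr_fpr(labels, scores, t) for t in (0.95, 0.5, 0.3, 0.0)]
```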

18
Q

What is the difference between univariate, bivariate, and multivariate analysis?

A

Univariate: one variable (e.g. sales chart), Bivariate: two variables (e.g. scatterplot), Multivariate: multiple variables (e.g. social media use vs self-esteem).

19
Q

What is the difference between test set and validation set?

A

The validation set is used to tune model hyperparameters during development; the test set is held out and used only once, to evaluate final model performance.

20
Q

What is the kernel trick?

A

A technique used in SVMs for non-linearly separable data: a kernel function computes inner products as if the data had been mapped into a higher-dimensional space, without ever computing that mapping explicitly, so a linear separator can be found there cheaply.
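
A toy sketch of the idea: the degree-2 polynomial kernel K(x, y) = (x · y)² equals an inner product under an explicit feature map φ, so the kernel delivers the higher-dimensional inner product without ever computing φ:

```python
import math

def poly2_kernel(x, y):
    """Degree-2 polynomial kernel on 2-D inputs: K(x, y) = (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit map to 3-D space whose inner product equals the kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, y = (1.0, 2.0), (3.0, 0.5)  # arbitrary example points
lhs = poly2_kernel(x, y)                          # kernel value
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))  # inner product after mapping
```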

21
Q

How do box plots and histograms differ?

A

Histograms show frequency distribution with bars; box plots summarize data spread, median, and outliers—useful for comparing datasets quickly.

22
Q

How do you balance or correct imbalanced data?

A

Balance the data using under-sampling or over-sampling (e.g., SMOTE), and evaluate with metrics suited to imbalance such as F1 score, MCC, or AUC. When combining resampling with cross-validation, resample only the training folds to avoid data leakage.
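
A sketch of random over-sampling in pure Python (toy data; libraries such as imbalanced-learn provide more sophisticated methods like SMOTE):

```python
import random

def oversample_minority(X, y, seed=0):
    """Randomly duplicate minority-class rows until all classes match
    the size of the largest class."""
    rng = random.Random(seed)
    counts = {c: y.count(c) for c in set(y)}
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for c in counts:
        rows = [x for x, label in zip(X, y) if label == c]
        for _ in range(target - counts[c]):
            X_out.append(rng.choice(rows))
            y_out.append(c)
    return X_out, y_out

X = [[0.1], [0.2], [0.3], [0.4], [0.9]]  # hypothetical features
y = [0, 0, 0, 0, 1]                      # heavily imbalanced labels
X_bal, y_bal = oversample_minority(X, y)
```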

23
Q

Why is random forest better than multiple decision trees?

A

Random forests are ensembles of decision trees trained on bootstrapped samples with random feature subsets; aggregating their predictions reduces overfitting and improves accuracy and robustness compared with individual trees.

24
Q

How to evaluate the probability of seeing a shooting star in one hour given a 30% chance in 15 minutes?

A

P(no shooting star in 15 minutes) = 0.7, and an hour contains four independent 15-minute intervals, so P(at least one star in an hour) = 1 − 0.7⁴ = 1 − 0.2401 = 0.7599, or 75.99%.
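
The same computation in Python:

```python
p_15min = 0.30                  # chance of at least one star in 15 minutes
p_none_15min = 1 - p_15min      # 0.70: chance of seeing none in 15 minutes
p_hour = 1 - p_none_15min ** 4  # four independent 15-minute windows
```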

25
Q

What is the probability of heads on the next toss after observing 10 heads in a row, given a jar with 999 fair coins and 1 double-headed coin?

A

By Bayes' Theorem, P(double-headed | 10 heads) = (1/1000) / (1/1000 + (999/1000)(1/2)¹⁰) ≈ 0.5062. The probability of heads on the next toss is then 0.5062 × 1 + 0.4938 × 0.5 ≈ 0.7531, or 75.31%.

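The same Bayes computation, step by step:

```python
# Prior: 999 fair coins, 1 double-headed coin in the jar.
p_dh = 1 / 1000
p_fair = 999 / 1000

# Likelihood of 10 heads in a row under each coin.
like_dh = 1.0
like_fair = 0.5 ** 10

# Posterior probability the drawn coin is double-headed.
post_dh = (p_dh * like_dh) / (p_dh * like_dh + p_fair * like_fair)

# Probability the next toss is heads.
p_heads = post_dh * 1.0 + (1 - post_dh) * 0.5
```
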
26
Q

Give examples where false positives are more important than false negatives.

A

In medicine (e.g., wrongly diagnosing cancer) or e-commerce (e.g., offering rewards to non-qualified users).

27
Q

Give an example where both false positives and false negatives matter equally.

A

In banking, both false approvals and rejections of loan applicants can have serious consequences.

28
Q

Is dimensionality reduction useful before fitting an SVM?

A

Yes, especially when the number of features is greater than the number of observations. It helps improve SVM performance and avoid overfitting.

29
Q

What are the assumptions of linear regression and what if they are violated?

A

Linear regression assumes linearity, independence of errors, homoscedasticity, and normally distributed errors. Violations can bias estimates, inflate variance, or invalidate hypothesis tests and confidence intervals.

30
Q

How does feature selection work using regularization?

A

L1 (Lasso) regularization shrinks coefficients; some become exactly zero, and those features are excluded from the model.

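A sketch of the soft-thresholding operator behind this behavior (for orthonormal features, the Lasso solution is exactly the soft-thresholded OLS coefficient):

```python
def soft_threshold(beta_ols, lam):
    """Soft-thresholding used by L1 (Lasso) regularization: coefficients
    smaller than lam in magnitude are set exactly to zero, larger ones
    are shrunk toward zero by lam."""
    if beta_ols > lam:
        return beta_ols - lam
    if beta_ols < -lam:
        return beta_ols + lam
    return 0.0

coeffs = [2.5, -0.3, 0.1, -1.8]                     # hypothetical OLS coefficients
shrunk = [soft_threshold(b, lam=0.5) for b in coeffs]
# small coefficients (-0.3, 0.1) become exactly 0.0 and drop out
```
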
31
Q

How to test if a coin is biased?

A

Use hypothesis testing: take the null hypothesis that the coin is fair (P(heads) = 0.5), flip it many times, compute the p-value of the observed counts, and compare it with a significance level alpha (e.g., 0.05).

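A sketch of an exact two-sided binomial test in pure Python (the 60-heads-in-100-flips observation is hypothetical):

```python
from math import comb

def two_sided_binom_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of all outcomes
    at most as likely as the observed count k under the null."""
    pmf = lambda i: comb(n, i) * p**i * (1 - p)**(n - i)
    p_obs = pmf(k)
    # Sum the probabilities of every outcome no more likely than k
    # (small tolerance handles floating-point ties).
    return sum(pmf(i) for i in range(n + 1) if pmf(i) <= p_obs + 1e-12)

# 60 heads in 100 flips of a supposedly fair coin:
p_value = two_sided_binom_p(60, 100)
# p_value is just above 0.05, so fairness is not rejected at alpha = 0.05
```
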
32
Q

Why is dimensionality reduction important?

A

It reduces overfitting, improves model speed, and helps with visualization and interpretation by removing redundant features.

33
Q

What is the difference between grid search and random search for hyperparameter tuning?

A

Grid search tests all parameter combinations (slow when there are many parameters); random search samples random combinations (faster and often more efficient in high dimensions).
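
A sketch contrasting the two search strategies over a hypothetical hyperparameter space:

```python
import itertools
import random

# Hypothetical hyperparameter space.
space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "depth": [3, 5, 7, 9],
    "subsample": [0.6, 0.8, 1.0],
}

# Grid search: every combination (3 * 4 * 3 = 36 trials).
grid = list(itertools.product(*space.values()))

# Random search: a fixed budget of trials sampled from the same space.
rng = random.Random(0)
budget = 10
random_trials = [tuple(rng.choice(values) for values in space.values())
                 for _ in range(budget)]
```

The grid grows multiplicatively with each added parameter, while the random budget stays fixed, which is why random search scales better when there are many hyperparameters.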