#11 - Machine Learning Flashcards
(33 cards)
What does it mean when the p-values are high and low?
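A p-value is the probability of seeing results at least as extreme as those observed, assuming the null hypothesis is true. A low p-value (below the chosen significance level, commonly 0.05) is evidence against the null hypothesis, so it can be rejected. A high p-value (above 0.05) means the data are consistent with the null hypothesis, so we fail to reject it; it does not prove the null hypothesis is true. A value of 0.05 is a borderline case.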
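As an illustration, a p-value can be approximated with a permutation test: shuffle the group labels many times and count how often the shuffled difference in means is at least as large as the observed one. This is only a sketch; the function name `permutation_p_value` and the measurements are made up for the example.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations

# Hypothetical measurements, invented for illustration only.
treated = [5.1, 5.8, 6.2, 5.9, 6.4, 6.0]
control = [4.2, 4.8, 4.5, 5.0, 4.4, 4.7]
p = permutation_p_value(treated, control)
print(f"p-value: {p:.4f}")
```

Because every treated value is larger than every control value, very few shuffles reproduce such a gap and the p-value comes out well below 0.05.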
When is resampling done?
Resampling is used to improve model accuracy and to estimate uncertainty. It checks how a model or statistic behaves on different draws of the data, typically by drawing repeated subsets (bootstrapping), shuffling labels (permutation tests), or performing cross-validation.
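One common form of resampling is the bootstrap. The sketch below estimates a 95% percentile interval for a sample mean by resampling with replacement; the function name `bootstrap_mean_ci` and the data are assumptions for illustration.

```python
import random
import statistics

def bootstrap_mean_ci(data, n_resamples=5_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `data`."""
    rng = random.Random(seed)
    means = sorted(
        # Each resample draws len(data) values with replacement.
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical sample, invented for illustration only.
sample = [2.1, 2.5, 1.9, 3.0, 2.7, 2.2, 2.8, 2.4]
low, high = bootstrap_mean_ci(sample)
print(f"mean = {statistics.fmean(sample):.2f}, 95% interval = ({low:.2f}, {high:.2f})")
```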
What do you understand by imbalanced data?
Imbalanced data means that one class is much more frequent than others. This can lead to poor model performance because the model may ignore the minority class.
Are there any differences between expected value and mean value?
The two are computed in the same way mathematically. "Mean" usually refers to the average of an observed data sample, while "expected value" refers to the probability-weighted average of a random variable in probability theory.
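A small sketch of the distinction, using a fair six-sided die as an assumed example: the expected value comes from the probability distribution, while the mean comes from simulated rolls.

```python
import random
from fractions import Fraction

# Expected value: probability-weighted average over the distribution.
faces = range(1, 7)
expected_value = sum(face * Fraction(1, 6) for face in faces)

# Sample mean: plain average of observed data (simulated rolls here).
rng = random.Random(0)
rolls = [rng.randint(1, 6) for _ in range(10_000)]
sample_mean = sum(rolls) / len(rolls)

print(f"expected value = {float(expected_value)}, sample mean = {sample_mean:.3f}")
```

With more rolls the sample mean tends toward the expected value of 3.5.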
What do you understand by survivorship bias?
Survivorship bias is the error of focusing only on successes (those that ‘survived’) while ignoring failures. This can lead to incorrect conclusions or overestimating success.
What are confounding variables?
Confounding variables (confounders) are extra variables that affect both the independent and dependent variables, creating a false or misleading relationship between them.
How are time series problems different from regular regression problems?
Time series problems involve predicting future values from historical patterns, using tools such as autocorrelation or moving averages. The key difference is the dependency on time: observations close in time are more related than distant ones. It’s not a time series problem unless the target depends on time.
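The time dependency can be measured with autocorrelation. The sketch below computes the sample autocorrelation at a given lag; the function name `autocorrelation` and the trending series are assumptions for illustration.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    # Compare each observation with the one `lag` steps later.
    covariance = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance

# Hypothetical upward-trending series, invented for illustration only.
trend = list(range(1, 11))
print(f"lag-1 autocorrelation: {autocorrelation(trend, 1):.2f}")
```

A value well above zero at small lags signals that neighbouring observations are related, which is what separates this setting from regression on independent rows.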
How do you handle a dataset where variables have more than 30% missing values?
If the dataset is small, missing values can be filled using the mean (e.g., with df.fillna(df.mean()) in pandas). For large datasets, it’s better to remove rows with too many missing values to maintain data quality.
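To show what mean imputation does to one numeric column, here is a plain-Python sketch that does not use pandas. The helper names `missing_fraction` and `fill_missing_with_mean` and the data are hypothetical; missing entries are represented as `None`.

```python
import statistics

def missing_fraction(column):
    """Share of entries in `column` that are missing (None)."""
    return sum(value is None for value in column) / len(column)

def fill_missing_with_mean(column):
    """Replace None with the mean of the observed values."""
    observed = [value for value in column if value is not None]
    mean = statistics.fmean(observed)
    return [mean if value is None else value for value in column]

# Hypothetical column with roughly a third of its values missing.
ages = [34, None, 29, 41, None, 36]
filled = fill_missing_with_mean(ages)
print(f"missing: {missing_fraction(ages):.0%}, filled: {filled}")
```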
What is cross-validation and why is it used?
Cross-validation is a technique to test model performance by splitting training data into parts and rotating them for training and validation. It helps ensure the model generalizes well to unseen data. Common methods: K-Fold, Leave-One-Out, Holdout.
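The rotation in K-Fold can be sketched as index bookkeeping. The generator below is a minimal illustration (the name `k_fold_indices` is an assumption) and does not shuffle; in practice rows are usually shuffled first unless order matters.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print(f"train={train_idx} validate={val_idx}")
```

Each sample lands in the validation part exactly once, so every row is used for both training and validation across the folds.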
What are the main differences between correlation and covariance?
Correlation measures the strength and direction of a relationship between variables (dimensionless), while covariance measures how two variables change together (has units). Correlation is normalized; covariance is not.
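A short sketch of the units point: changing a variable's units rescales the covariance but leaves the correlation unchanged. The helper functions and the height/weight values are made up for illustration.

```python
import math

def covariance(x, y):
    """Sample covariance (divides by n - 1)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Pearson correlation: covariance normalized by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical measurements, invented for illustration only.
height_m = [1.60, 1.70, 1.75, 1.80, 1.90]
weight_kg = [55.0, 65.0, 68.0, 77.0, 85.0]
height_cm = [h * 100 for h in height_m]

print(covariance(height_m, weight_kg), covariance(height_cm, weight_kg))
print(correlation(height_m, weight_kg), correlation(height_cm, weight_kg))
```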
How do you approach solving a data analytics project?
1. Understand the business problem.
2. Explore and analyze data.
3. Clean and prepare data.
4. Build models and evaluate them, using cross-validation.
5. Visualize results.
6. Deploy and monitor the model.
What is selection bias?
Selection bias occurs when the dataset used for analysis is not representative of the population, often due to non-random sampling. This can lead to misleading results.
Why is data cleaning crucial?
Data cleaning ensures accuracy, consistency, and reliability of data. It prevents errors in analysis and predictions, improves model performance, and is often the most time-consuming step in data science.
What are the main types of feature selection methods?
Feature selection methods include:
- Filter Methods: Use statistical tests (e.g., Chi-Square, Correlation).
- Wrapper Methods: Use model performance (e.g., Forward/Backward Selection, RFE).
- Embedded Methods: Built into model training (e.g., LASSO, Random Forest Importance).
When is it acceptable to treat categorical variables as continuous?
If the variable is ordinal (has a clear order) and the gaps between its levels are roughly even, treating it as continuous is acceptable and can help improve model performance.
How do you treat missing values during data analysis?
Impute numerical columns with the mean or median and categorical columns with a default value. If 80% or more of a variable is missing, either drop it or impute it, depending on dataset size and context.
What does the ROC Curve represent?
It shows the trade-off between true positive rate (TPR) and false positive rate (FPR) at different thresholds. Used to evaluate classifier performance.
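The curve can be traced by sweeping the threshold over the predicted scores and recording (FPR, TPR) at each step. This is a minimal sketch with made-up labels and scores; `roc_points` and `auc` are hypothetical helper names.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, starting at (0, 0), one per distinct threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve using the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical classifier output, invented for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
print(points)
print(f"AUC = {auc(points):.3f}")
```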
What is the difference between univariate, bivariate, and multivariate analysis?
Univariate: one variable (e.g. a sales chart). Bivariate: two variables (e.g. a scatterplot of social media use vs self-esteem). Multivariate: three or more variables analyzed together (e.g. social media use vs self-esteem while accounting for other factors).
What is the difference between test set and validation set?
The validation set is used to tune hyperparameters and choose between models during development. The test set is held back and used only once, to evaluate the final model's performance.
What is the kernel trick?
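A technique in SVMs where a kernel function computes the inner products of data points as if they had been mapped into a higher-dimensional space, without ever performing that mapping explicitly. Data that is not linearly separable in the original space can become linearly separable in the higher-dimensional one.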
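A small numerical check of the idea, assuming the degree-2 polynomial kernel k(x, z) = (x · z)^2 on 2-D inputs: the kernel value equals a dot product in a 3-D feature space, yet the kernel never builds those features. The names `phi`, `dot`, and `poly_kernel` are chosen for this sketch.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(a, b):
    """Same inner product, computed directly in the original 2-D space."""
    return dot(a, b) ** 2

x = (1.0, 2.0)
z = (3.0, -1.0)
explicit = dot(phi(x), phi(z))   # map to 3-D first, then take the dot product
implicit = poly_kernel(x, z)     # kernel shortcut, no mapping needed
print(explicit, implicit)
```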
How do box plots and histograms differ?
Histograms show the frequency distribution with bars. Box plots summarize the spread, median, and outliers of the data, which makes them useful for comparing datasets quickly.
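The numbers behind a box plot can be computed directly. The sketch below uses `statistics.quantiles` with the inclusive method and the common 1.5 × IQR rule for outliers (one convention among several); the data are invented for illustration.

```python
import statistics

# Hypothetical data with one unusually large value.
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]

q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, median={median}, Q3={q3}, outliers={outliers}")
```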
How do you balance or correct imbalanced data?
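Balance the data using under-sampling or over-sampling, and evaluate with metrics such as F1 score, MCC, or AUC rather than plain accuracy. Resample only the training portion inside each cross-validation fold, not the whole dataset beforehand; otherwise copies of the same rows leak into the validation folds and the scores look better than they are.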
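A minimal sketch of random over-sampling, assuming a small made-up training split. In this sketch it is applied to the training rows only, so duplicated rows cannot show up in held-out data. The function name `oversample_minority` is hypothetical.

```python
import random

def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate rows of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw the missing number of rows with replacement from the same class.
        extra = rng.choices(members, k=target - len(members))
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Hypothetical training split: five rows of class 0, one row of class 1.
train_rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
train_labels = [0, 0, 0, 0, 0, 1]
balanced_rows, balanced_labels = oversample_minority(train_rows, train_labels)
print(balanced_labels)
```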
Why is random forest better than multiple decision trees?
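Random forests are ensemble models that train each tree on a bootstrap sample of the data with a random subset of features, then average or vote over the trees. Because the trees are less correlated with each other, the combined prediction overfits less and is more accurate and robust than a single tree or a set of identically trained trees.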
How to evaluate the probability of seeing a shooting star in one hour given a 30% chance in 15 minutes?
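Assuming the four 15-minute intervals are independent, the chance of seeing none in an hour is (0.7)^4 = 0.2401. So the answer is 1 - (0.7)^4 = 0.7599, or about 76%.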
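The complement rule can be checked in a few lines, assuming the four 15-minute intervals are independent. The function name `prob_at_least_once` is chosen for this sketch.

```python
def prob_at_least_once(p_interval, n_intervals):
    """P(at least one event) = 1 - P(no event in any interval)."""
    return 1 - (1 - p_interval) ** n_intervals

# 30% chance per 15 minutes, four such intervals in one hour.
p_hour = prob_at_least_once(0.30, 4)
print(f"{p_hour:.4f}")
```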