L2 Flashcards
(33 cards)
What are missing values in datasets?
Missing values are NaN, null, ‘’, or placeholder values in datasets.
Missing values can affect data analysis and model performance.
What is the purpose of missing values imputation?
To fill in missing data so that models can handle complete datasets.
Models cannot directly process missing data.
What method replaces missing values with the mean or median?
Mean/Median imputation.
- Good for numerical data; fast
It is good for numerical data and is fast.
When is KNN imputation used?
When the dataset is small to medium-sized.
- Find ‘k’ nearest neighbors & average their values
KNN finds ‘k’ nearest neighbors and averages their values.
What does model-based imputation involve?
Training a regression/classification model to predict missing values.
- When relationships exist in data
This method is suitable when relationships exist in the data.
What is iterative imputation?
A method that repeatedly models each feature with missing values using other features.
- More accurate but computationally heavy
It is more accurate but computationally heavy.
What is the purpose of feature selection?
Reduces overfitting, speeds up training, improves interpretability, and reduces storage.
Effective feature selection is crucial for model performance.
What does univariate statistics do in feature selection?
Tests each feature individually for relevance to the target.
Tools include f_classif, f_regression, chi2, and mutual_info_classif.
What is mutual information used for in feature selection?
Part of univariate statistics - To assess relevance without assuming linearity, making it suitable for non-linear relationships.
It helps identify important features in complex datasets.
What is model-based feature selection?
Using trained models to determine feature importance.
- Get best fit for a particular model
- Example models: Lasso regression (L1 penalty) / Tree-based models (feature importance from splits)
- Can be single-pass or iterative
(single fit) - Build a model, select features most impt to model
- Lasso, other linear models, tree-based models
- Multivariate – linear models assume linear relation
Example models include Lasso regression and tree-based models.
What is forward selection in iterative model-based methods?
Starting with no features and adding the best one at each step.
It helps build a model gradually based on feature importance.
What does backward elimination involve?
Starting with all features and removing the least useful one at each step.
This method aims to simplify the model by eliminating unnecessary features.
What is Recursive Feature Elimination (RFE)?
An automated method for backwards removal using model feedback.
It is considered an expensive feature selection method.
What is categorical data?
Data that falls into categories, such as color, country, or product type.
measurement levels: categorical – ordinal – interval – ratio
Categorical data is crucial for various types of analysis.
What is One-Hot Encoding?
Adds one binary column per category to avoid ordinal assumptions.
- Avoids ordinal assumptions
Works well for low-cardinality data (few unique values)
**Avoid using raw labels like 1, 2, 3—they introduce false order.
It is effective for low-cardinality data.
What is Count-Based Encoding?
Replaces category with mean target value for that category.
- For regression: mean target value
- For classification: class probabilities (“people in this state have likelihood p for class 1”)
- Useful for high-cardinality features (e.g., ZIP codes, countries)
Useful for high-cardinality features like ZIP codes.
What are digital images made of?
Grids (matrices) of pixels.
- Pixel = smallest unit; stores brightness or color
Each pixel is the smallest unit that stores brightness or color.
What is the storage requirement for binary images?
1 bit per pixel.
- Each pixel is 0 / 1 (black/white)
Each pixel is either 0 (black) or 1 (white).
What is the storage requirement for grayscale images?
1 byte (8 bits) per pixel.
- Pixel value: 0–255 (brightness scale)
Pixel values range from 0 to 255, representing brightness.
What is the storage requirement for color (RGB) images?
3 bytes per pixel.
- 3 values per pixel (Red, Green, Blue)
Each pixel contains three values for Red, Green, and Blue.
What are the types of image tasks in machine learning?
- Classification: Label whole image (e.g., Fracture / No fracture)
- Detection: Locate object within image
- Segmentation: Classify each pixel (e.g., tumor pixels vs normal)
An example is classifying an image as ‘Fracture’ or ‘No fracture’.
What is the definition of precision in evaluation metrics?
Precision = TP / (TP + FP)
Precision measures the accuracy of positive predictions.
What is the definition of recall in evaluation metrics?
Recall = TP / (TP + FN).
Recall measures the ability to find all relevant instances.
What does F1 Score represent?
F1 score = harmonic mean of precision and recall.
F = 2 x (precision x recall) / (precision + recall)
It balances both metrics for better evaluation.