#10 - Machine Learning Flashcards
(29 cards)
What is Lift in model evaluation?
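Lift measures how much better a model performs than random selection. It is typically the ratio of the positive rate among the cases the model targets to the overall positive rate, so a lift of 1 means the model is no better than random.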
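Example (a minimal sketch, not a standard library function): lift computed as the positive rate among the top-scored cases divided by the overall positive rate. The name `lift_at_fraction`, the toy scores and labels, and the 20% cutoff are assumptions made for illustration.

```python
def lift_at_fraction(scores, labels, fraction=0.2):
    """Lift = positive rate in the top-scored fraction / overall positive rate."""
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and the same length")
    # Rank cases from highest to lowest model score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, int(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no positives")
    return top_rate / base_rate


# Hypothetical toy data: 10 cases, 3 positives overall (base rate 0.3).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
lift = lift_at_fraction(scores, labels, fraction=0.2)
```

Here both of the two top-scored cases are positives, so the top rate is 1.0 against a base rate of 0.3, which gives a lift of roughly 3.3.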
What is model fitting?
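Model fitting is the process of adjusting a model's parameters to the given observations. The quality of the fit indicates how well the model captures those observations.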
What is sampling and its main advantage?
Sampling involves selecting a smaller, representative subset of a dataset to perform analysis. It saves time and resources while still providing meaningful insights, especially with large datasets.
What is probability sampling?
A method where each member of the population has a known, non-zero chance of being selected. Use when statistical accuracy is important. Avoid if you have limited access to the full population.
What is simple random sampling?
Each item has an equal chance of being selected, like drawing names from a hat. Use when the population is homogeneous. Avoid with very large populations, where compiling a complete list to draw from is impractical.
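Example (a minimal sketch using Python's standard `random` module): drawing 10 items without replacement so that every item is equally likely to be picked. The population of 100 IDs and the fixed seed are assumptions chosen only to make the sketch repeatable.

```python
import random

population = list(range(1, 101))       # hypothetical population of 100 item IDs
rng = random.Random(42)                # fixed seed, only for repeatability
sample = rng.sample(population, k=10)  # 10 distinct items, each equally likely
```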
What is stratified sampling?
The population is divided into subgroups (strata) and samples are taken from each. Use when key groups must be represented. Avoid if strata are not well defined.
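Example (a minimal sketch of proportional stratified sampling): the same fraction is drawn at random from every stratum, so each subgroup is represented. The helper `stratified_sample`, the urban/rural records, and the 10% fraction are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, fraction, seed=0):
    """Draw the same fraction from every stratum (at least one item each)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)  # group items by stratum
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))  # random draw inside the stratum
    return sample


# Hypothetical population: 80 "urban" and 20 "rural" records.
people = [("urban", i) for i in range(80)] + [("rural", i) for i in range(20)]
sample = stratified_sample(people, key=lambda person: person[0], fraction=0.1)
```

With these inputs the sample keeps the 80/20 split of the population: 8 urban records and 2 rural ones.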
What is cluster sampling?
The population is divided into clusters, a few clusters are randomly chosen, and the members of the chosen clusters are sampled. Use for geographically spread data or when a full list of individuals is hard to obtain. Avoid if clusters are not internally diverse.
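Example (a minimal sketch of the one-stage variant, in which every member of a chosen cluster is kept): whole clusters are picked at random instead of individuals. The school names, student IDs, and seed are hypothetical.

```python
import random

# Hypothetical clusters: school name -> student IDs.
clusters = {
    "school_a": [1, 2, 3],
    "school_b": [4, 5],
    "school_c": [6, 7, 8, 9],
    "school_d": [10],
}
rng = random.Random(7)                      # fixed seed, only for repeatability
chosen = rng.sample(sorted(clusters), k=2)  # randomly pick 2 whole clusters
sample = [student for name in chosen for student in clusters[name]]
```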
What is non-probability sampling?
Members are selected by non-random means, so their chances of being selected are unknown. Use in exploratory research. Avoid when statistical generalization is needed.
What is convenience sampling?
Samples are taken from an easily accessible group. Use for quick insights or pilot testing. Avoid if you want unbiased, generalizable results.
What is quota sampling?
Samples are selected to match specific proportions of characteristics. Use when you need a balanced sample but can’t do random selection. Avoid if sampling bias is a concern.
What is snowball sampling?
Existing subjects recruit future subjects from among their contacts. Use in social research on hidden, rare, or hard-to-reach populations. Avoid if your study needs a representative, unbiased sample.
What is supervised learning?
A type of ML where the model learns from labeled data. Use when you have input-output pairs. Avoid when labels are not available.
What is unsupervised learning?
The model learns from unlabeled data to find patterns or structure. Use for clustering or dimensionality reduction. Avoid if the task requires specific output predictions.
What is semi-supervised learning?
Combines a small amount of labeled data with a large amount of unlabeled data. Use when labeling data is expensive. Avoid if you have plenty of labeled data.
What is reinforcement learning?
The model learns by interacting with an environment and receiving rewards or penalties. Use in robotics, gaming, or real-time decision making. Avoid for static datasets without feedback loops.
What is classification in ML?
A supervised learning task where the output is a category or label. Use for tasks like spam detection or image labeling. Avoid if the output is a continuous numeric value.
What is regression in ML?
A supervised learning task where the output is a continuous value. Use for predicting prices or trends. Avoid if outputs are discrete classes.
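Example (a minimal sketch of one regression method, ordinary least squares for a single input): the model learns a straight line from input-output pairs and then predicts a continuous value. The helper `fit_line` and the size/price figures are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data in which price is exactly 3 * size.
sizes = [50, 60, 80, 100]
prices = [150, 180, 240, 300]
slope, intercept = fit_line(sizes, prices)
predicted = slope * 70 + intercept  # continuous prediction for an unseen size
```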
What is clustering in ML?
An unsupervised learning task to group similar items together. Use when exploring structure in data. Avoid if specific labels or targets are needed.
What is dimensionality reduction?
Reduces the number of input features while preserving key information. Use to speed up models or for visualization. Avoid if interpretability of original features is critical.
What is data cleaning?
Removing or correcting wrong, incomplete, or inconsistent data. Use before training. Avoid skipping, as dirty data leads to poor models.
What is normalization?
Rescaling values to a fixed range, usually [0, 1] (min-max scaling). Use with distance-based algorithms. Avoid if the algorithm is scale-invariant.
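Example (a minimal sketch of min-max scaling): each value is shifted by the minimum and divided by the range. The helper `min_max_scale` and the input values are assumptions made for illustration.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min, the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no spread; map everything to new_min.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


scaled = min_max_scale([10, 20, 30, 50])  # [0.0, 0.25, 0.5, 1.0]
```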
What is standardization (Z-score scaling)?
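Rescales data to mean 0 and unit variance by subtracting the mean and dividing by the standard deviation. Use for algorithms that are sensitive to feature scale or that assume roughly Gaussian inputs. Avoid if interpretability in the original units is needed.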
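Example (a minimal sketch of z-score scaling with Python's standard `statistics` module): z = (x - mean) / standard deviation. The helper `standardize` and the input values are assumptions; the population standard deviation is used here, though the sample version is also common.

```python
import statistics


def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant feature has no spread; every z-score is 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Mean is 5 and population standard deviation is 2 for these values.
z = standardize([2, 4, 4, 4, 5, 5, 7, 9])
```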
What is encoding categorical variables?
Converting categories into numeric form using label or one-hot encoding. Use when an algorithm requires numeric inputs. Avoid one-hot encoding on high-cardinality features, where it creates one column per category.
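Example (a minimal sketch of both encodings in plain Python): label encoding maps each category to one integer, while one-hot encoding creates one 0/1 column per category. The helpers `label_encode` and `one_hot_encode` and the color values are assumptions made for illustration.

```python
def label_encode(values):
    """Map each category to an integer, with categories ordered alphabetically."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Return one 0/1 column per category, with columns ordered alphabetically."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return rows, categories


colors = ["red", "green", "blue", "green"]
labels, mapping = label_encode(colors)     # labels == [2, 1, 0, 1]
rows, categories = one_hot_encode(colors)  # columns: blue, green, red
```

The one-hot output has as many columns as there are distinct categories, which is why it becomes unwieldy on high-cardinality features.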
What is handling missing data?
Techniques include removing rows, filling with mean/median, or model-based imputation. Use with care depending on how much is missing. Avoid deleting data blindly.
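Example (a minimal sketch of mean/median imputation): missing entries, represented here as `None`, are filled with a statistic of the observed values. The helper `impute` and the age values are assumptions made for illustration.

```python
import statistics


def impute(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing entries and one outlier (100).
ages = [25, None, 31, 40, None, 100]
filled_median = impute(ages, "median")  # gaps filled with 35.5
filled_mean = impute(ages, "mean")      # gaps filled with 49
```

The outlier pulls the mean (49) well above the median (35.5), which is one reason the median is often preferred for skewed data.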