Lecture 2 - End-to-End ML Project Flashcards
(31 cards)
What are the 8 steps of a complete ML project?
1. Big picture, 2. Get data, 3. Discover/visualize, 4. Prepare data, 5. Select/train model, 6. Fine-tune, 7. Present solution, 8. Maintain system.
If one had to summarize an ML project into 3 steps, what would they be?
Preparation, Training, Deployment.
What is the Big Picture step in ML project preparation?
It involves understanding the real-world problem, defining the mechanism, and identifying the learning problem.
What is emphasized in the Big Picture step of ML?
The goal is not to build a model, but to understand and define the real-world problem.
What does problem classification mean in the Big Picture step?
Identifying the type of learning problem, e.g. supervised vs. unsupervised, classification vs. regression.
Why is measurement important in the Big Picture step?
Because it determines how outcomes are quantified and learned.
What are the 4 typical issues in the Get Data step?
1. Nonrepresentative data, 2. Poor-quality data, 3. Irrelevant features, 4. Overfitting/Underfitting.
What can be done to address poor-quality data?
Remove outliers, fill in missing values (imputation), or remove features/instances.
Which 3 steps make up the ML pipeline?
Prepare data, Train model, Fine-tune model.
Which 3 steps make up the preparation part of an ML project?
Big picture, Get data, Discover/visualize.
What does deployment mean in ML?
Making the model available for use and monitoring its performance.
Which 2 steps make up the deployment phase?
Present solution and Maintain system.
What is a pipeline in ML?
A sequence of automated steps that includes preprocessing, training, and evaluation.
What is standardization?
Scaling data so that it has zero mean and unit variance.
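A minimal sketch of standardization, using only the Python standard library. The function name `standardize` and the sample values are illustrative assumptions, and the population standard deviation is one possible convention.

```python
import statistics

def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed convention)
    if std == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(x - mean) / std for x in values]

# Hypothetical feature column
scaled = standardize([2.0, 4.0, 6.0, 8.0])
```

After this step, `scaled` has mean 0 and variance 1 (up to floating-point error).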
What is normalization?
Scaling data to fit within a specific range, usually [0,1].
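A small sketch of min-max normalization, assuming the target range defaults to [0, 1]. The name `min_max_scale` and the sample values are illustrative, not taken from the lecture.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Linearly map values so the minimum lands on `low` and the maximum on `high`."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant feature")
    return [low + (x - lo) * (high - low) / (hi - lo) for x in values]

# Hypothetical feature column
normalized = min_max_scale([10.0, 20.0, 30.0, 50.0])
```

Because the observed minimum and maximum define the mapping, every output depends on the two most extreme values in the column.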
What is a problem with normalization?
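It is sensitive to outliers: the minimum and maximum set the scale, so a single extreme value compresses all other values into a narrow part of the range.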
What is an alternative to normalization that is robust to outliers?
Robust scaling.
What is robust scaling?
A method that centers each feature on its median and divides by the interquartile range (IQR), reducing sensitivity to outliers.
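A hedged sketch of robust scaling, assuming the common formulation (x - median) / IQR. The function name `robust_scale`, the sample data, and the "inclusive" quartile convention are assumptions; other quantile conventions give slightly different results.

```python
import statistics

def robust_scale(values):
    """Center on the median and divide by the interquartile range (Q3 - Q1)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("interquartile range is zero")
    median = statistics.median(values)
    return [(x - median) / iqr for x in values]

# Hypothetical column with one extreme value (100.0)
robust = robust_scale([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
```

The five ordinary values stay spread over roughly [-1, 0.6], while the extreme value simply maps to a large number instead of squeezing the rest together.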
What is categorical feature encoding?
Transforming categorical variables into numerical format for ML models.
What is one-hot encoding?
Encoding categorical variables as binary vectors with a single high bit.
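A minimal sketch of one-hot encoding in plain Python. The function name `one_hot_encode`, the alphabetical ordering of categories, and the color labels are illustrative assumptions.

```python
def one_hot_encode(labels):
    """Map each label to a binary vector with exactly one 1."""
    categories = sorted(set(labels))  # fixed column order (assumed: alphabetical)
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1
        vectors.append(row)
    return categories, vectors

# Hypothetical categorical feature
categories, encoded = one_hot_encode(["red", "green", "blue", "green"])
```

Each row of `encoded` has one column per category, and only the column of the row's own category is set.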
What is imputation?
The process of filling in missing values based on known data.
What are types of imputation algorithms?
Mean/median imputation and kNN imputation.
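A small sketch of mean/median imputation, assuming missing values are represented as `None`. The function name `impute` and the sample column are illustrative. kNN imputation is not shown here.

```python
import statistics

def impute(values, strategy="median"):
    """Replace None entries with the mean or median of the known values."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("no known values to impute from")
    if strategy == "median":
        fill = statistics.median(known)
    elif strategy == "mean":
        fill = statistics.fmean(known)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Hypothetical column with one missing entry and one extreme value
filled = impute([1.0, None, 3.0, 100.0])
```

With the extreme value present, the median fill stays near the typical values, whereas the mean fill is pulled toward the extreme.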
What is oversampling/undersampling?
Techniques to handle imbalanced data by adjusting the class distributions.
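A hedged sketch of random oversampling, one simple way to adjust class distributions: minority samples are duplicated at random until every class matches the largest one. The function name `random_oversample` and the toy data are assumptions. Undersampling would instead drop majority samples.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until all classes have equal counts."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == label]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(label)
    return out_samples, out_labels

# Hypothetical imbalanced dataset: 4 negatives, 1 positive
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = ["neg", "neg", "neg", "neg", "pos"]
X_bal, y_bal = random_oversample(X, y)
```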
What are SMOTE and ADASYN?
Oversampling techniques that generate synthetic data points between existing minority samples.
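A simplified sketch of the interpolation idea behind SMOTE: pick a minority sample, pick one of its nearest minority neighbours, and place a synthetic point somewhere on the segment between them. This is not a full SMOTE implementation, and ADASYN's adaptive weighting is not shown. The function name `smote_like`, the parameter `k`, and the toy points are assumptions.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample, excluding itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        t = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class points in 2D
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = smote_like(minority, n_new=5)
```

Every synthetic point lies between two existing minority samples, so it stays inside the region the minority class already occupies.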