Developing Machine Learning Solutions Flashcards
ML Lifecycle
Business goal identification
ML problem framing
Data processing (data collection, data preprocessing, and feature engineering)
Model development (training, tuning, and evaluation)
Model deployment (inference and prediction)
Model monitoring
Model retraining
Feature engineering is the process of creating, transforming, extracting, and selecting variables from data.
Model Development
Initially, upon training, the model typically will not yield the expected results. Therefore, developers do additional feature engineering and tune the model’s hyperparameters before retraining.
Amazon SageMaker Data Wrangler is a
low-code/no-code (LCNC) tool. It provides an end-to-end solution to import, prepare, transform, featurize, and analyze data by using a web interface. Customers can add their own Python scripts and transformations to customize workflows.
For more advanced users and data preparation at scale,
Amazon SageMaker Studio Classic comes with built-in integration of Amazon EMR and AWS Glue interactive sessions to handle large-scale interactive data preparation and machine learning workflows within your SageMaker Studio Classic notebook.
Finally, by using the SageMaker Processing API, customers can run scripts for data preprocessing, feature engineering, and model evaluation as fully managed processing jobs.
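A minimal sketch of such a Processing job with the SageMaker Python SDK's SKLearnProcessor; the role ARN, S3 paths, and preprocess.py script are placeholder assumptions.

```python
# Sketch: run a preprocessing script as a fully managed SageMaker Processing job.
# Role ARN, S3 paths, and preprocess.py are hypothetical placeholders.
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

processor = SKLearnProcessor(
    framework_version="1.2-1",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

processor.run(
    code="preprocess.py",  # your preprocessing script
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/processed/")],
)
```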
Amazon SageMaker Feature Store helps data scientists, machine learning engineers, and general practitioners to
create, share, and manage features for ML development.
Features stored in the feature store can be retrieved and enriched before being served to ML models for inference.
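As a rough illustration, here is a sketch of creating a feature group and ingesting features with the SageMaker Python SDK; the feature group name, schema, role ARN, and S3 location are assumptions.

```python
# Sketch: define a feature group from a pandas DataFrame and ingest records.
# All names, the schema, and the role/S3 values are illustrative assumptions.
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
df = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "avg_order_value": [42.5, 17.0],
    "event_time": [1700000000.0, 1700000000.0],
})

feature_group = FeatureGroup(name="customer-features", sagemaker_session=session)
feature_group.load_feature_definitions(data_frame=df)  # infer feature types from the DataFrame
feature_group.create(
    s3_uri="s3://my-bucket/feature-store/",             # offline store location
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    enable_online_store=True,
)
# In practice, wait for the feature group to reach the Created status before ingesting.
feature_group.ingest(data_frame=df, max_workers=1, wait=True)
```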
Customers aiming for an LCNC option can use Amazon SageMaker Canvas.
With SageMaker Canvas, they can use machine learning to generate predictions without needing to write any code.
Amazon SageMaker JumpStart provides
pretrained, open source models that customers can use for a wide range of problem types.
Customers can use Amazon SageMaker Experiments to
experiment with multiple combinations of data, algorithms, and parameters, all while observing the impact of incremental changes on model accuracy.
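A minimal sketch of tracking one run with the SageMaker Experiments SDK; the experiment name, run name, and logged values are illustrative.

```python
# Sketch: log parameters and metrics for a single run in SageMaker Experiments.
# Experiment/run names and the logged values are illustrative assumptions.
from sagemaker.experiments.run import Run

with Run(experiment_name="churn-model", run_name="xgboost-depth-5") as run:
    run.log_parameter("max_depth", 5)
    run.log_parameter("eta", 0.2)
    # ... train and evaluate the model here ...
    run.log_metric(name="validation:accuracy", value=0.91)
```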
Amazon SageMaker Automatic Model Tuning
Hyperparameter tuning is a way to find the best version of your model. Automatic model tuning does that by running many training jobs with different hyperparameter combinations and measuring each one with a metric that you choose.
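A sketch of launching a tuning job with the SageMaker Python SDK, assuming the built-in XGBoost algorithm; the role ARN, S3 locations, objective metric, and search ranges are placeholders.

```python
# Sketch: automatic model tuning over the built-in XGBoost algorithm.
# Role ARN, S3 locations, ranges, and the objective metric are placeholders.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

session = sagemaker.Session()
xgb = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1"),
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/model-output/",
)
xgb.set_hyperparameters(objective="binary:logistic", num_round=100)

tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name="validation:auc",    # metric used to compare jobs
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=20,            # total training jobs to run
    max_parallel_jobs=4,    # jobs running at the same time
)
tuner.fit({"train": "s3://my-bucket/train/", "validation": "s3://my-bucket/validation/"})
```

The tuner selects the best training job according to the chosen objective metric; the estimator and ranges would change with your algorithm.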
With Amazon SageMaker Model Monitor, customers can
observe the quality of SageMaker ML models in production. They can set up continuous monitoring or on-schedule monitoring. SageMaker Model Monitor helps maintain model quality by detecting violations of user-defined thresholds for data quality, model quality, bias drift, and feature attribution drift.
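One possible setup, sketched with the SageMaker Python SDK and placeholder names, is a data-quality baseline plus an hourly monitoring schedule.

```python
# Sketch: baseline the training data, then monitor an endpoint against it hourly.
# Role ARN, S3 paths, and the endpoint name are hypothetical placeholders.
from sagemaker.model_monitor import DefaultModelMonitor, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Compute statistics and constraints from the training data to compare against.
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/monitoring/baseline/",
)

# Check captured endpoint traffic against the baseline on a schedule.
monitor.create_monitoring_schedule(
    monitor_schedule_name="churn-data-quality",
    endpoint_input="my-endpoint",
    output_s3_uri="s3://my-bucket/monitoring/reports/",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```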
SageMaker JumpStart provides
pretrained, open source models for a range of problem types to help you get started with machine learning. These models are ready to deploy or to fine-tune.
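A minimal deployment sketch with the SageMaker Python SDK; the model ID, instance type, and prompt are illustrative assumptions and depend on what JumpStart offers in your region.

```python
# Sketch: deploy a pretrained JumpStart model and invoke it.
# The model_id, instance type, and prompt are illustrative assumptions.
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="huggingface-llm-falcon-7b-instruct-bf16")
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")

response = predictor.predict({"inputs": "Summarize the ML lifecycle in one sentence."})
print(response)

predictor.delete_endpoint()  # clean up to stop incurring charges
```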
AutoML is available in SageMaker Canvas. It
simplifies ML development by automating the process of building and deploying machine learning models.
Built-in models available in SageMaker require more
effort and scale if the dataset is large and significant resources are needed to train and deploy the model
If there is no built-in solution that works, try to develop one that uses
pre-made Docker images for supported machine learning and deep learning frameworks such as scikit-learn, TensorFlow, PyTorch, MXNet, or Chainer.
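For example, a sketch using the prebuilt scikit-learn image in script mode; train.py, the role ARN, and the S3 path are hypothetical placeholders.

```python
# Sketch: train with a prebuilt framework container ("script mode") for scikit-learn.
# train.py, the role ARN, and the S3 path are hypothetical placeholders.
from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(
    entry_point="train.py",          # your training script
    framework_version="1.2-1",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
)
estimator.fit({"train": "s3://my-bucket/train/"})
```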
You can build your own custom Docker image that is configured to
install the necessary packages or software.
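Once such an image is pushed to Amazon ECR, training with it might look like the following sketch; the image URI, role ARN, and S3 paths are placeholders.

```python
# Sketch: train with a custom container image stored in Amazon ECR.
# Image URI, role ARN, and S3 paths are hypothetical placeholders.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/model-output/",
)
estimator.fit({"train": "s3://my-bucket/train/"})
```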
Think about bias as the gap between your predicted value and the actual value, whereas variance describes how dispersed your predicted values are.
Classification Metrics
Accuracy
Precision
Recall
F1
AUC-ROC
Regression Metrics
Mean squared error
R squared
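The regression metrics above can be computed with scikit-learn, for example on illustrative values:

```python
# Sketch: regression metrics with scikit-learn on illustrative predictions.
from sklearn.metrics import mean_squared_error, r2_score

y_true = [3.0, 5.0, 7.5, 10.0]
y_pred = [2.8, 5.4, 7.0, 9.5]

print("MSE:", mean_squared_error(y_true, y_pred))
print("R^2:", r2_score(y_true, y_pred))
```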
Accuracy
To calculate the model’s accuracy, also known as its score, add up the correct predictions and then divide that number by the total number of predictions.
Problem with Accuracy
Although accuracy is a widely used metric for classification problems, it has limitations. This metric is less effective when there are a lot of true negative cases in your dataset. This is why two other metrics are often used in these situations: precision and recall.
Precision
Precision removes the negative predictions from the picture. Precision is the proportion of positive predictions that are actually correct. You can calculate it by taking the true positive count and dividing it by the total number of positive predictions (true positives plus false positives).
When the cost of false positives is high in your particular business situation,
precision can be a good metric. Think about a classification model that identifies emails as spam or not. In this case, you do not want your model labeling a legitimate email as spam and preventing your users from seeing that email.
Recall
Recall (or sensitivity) looks at the proportion of actual positives that are correctly identified as positive. Recall is calculated by dividing the true positive count by the sum of the true positives and false negatives. By looking at that ratio, you get an idea of how good the algorithm is at detecting, for example, cats.
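The classification metrics above can likewise be computed with scikit-learn, for example on illustrative labels and scores:

```python
# Sketch: classification metrics with scikit-learn on illustrative labels
# (1 = positive, for example spam; 0 = negative).
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]                    # hard class predictions
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]    # predicted probabilities

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_score))
```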