Domain 3: Modeling Flashcards by Natasha WrightPope

This is an application-agnostic standard that can be used as a baseline to understand the various phases of the ML workflow.

The Cross Industry Standard Process for Data Mining (CRISP-DM)

ML lifecycle phases

Identify Problem
Collect and QC
Prepare
Visualization
Feature Engineering
Model Training
Model Evaluation
Business Workflow Integration

T/F: Business problem identification requires senior leadership buy-in.

True

What is the goal of ML?

To predict the value or class of an unknown quantity using a mathematical model.

Data that the model can use to “learn” from, which consists of independent variables and a dependent variable.

Training data

What makes a good model?

Should be able to generalize what it has learned to unseen data, namely data where the dependent variable is unknown.

What makes a poor model?

One that has simply memorized the training data will have poor generalization performance and therefore will not be usable in a business process.

When a model is shown labeled examples of ground truth values and learns to predict the label based on the input data or features.

Supervised learning

When you do not have labeled data available and you want the model to discover patterns in the unlabeled data.

Unsupervised learning

When a model or agent learns by interacting with its environment - similar to trial-and-error learning, where an agent is given rewards and penalties for actions taken and its aim is to maximize the long-term rewards.

Reinforcement learning

T/F: The data type (whether it is structured or unstructured) does not dictate whether learning is supervised.

True

A type of supervised learning where the label is binary, such as fraud/not fraud, cat/dog, spam/not spam

Binary classification

A type of supervised learning where the label can have more than two classes

Multiclass classification

A type of supervised learning where the label is a continuous number such as a house price

Regression

A form of supervised machine learning where a model predicts a linear relationship between the data and the labels.

Linear models

Used when you have a continuous label (regression task), where the assumption is made that the label is linearly related to the data.

Linear regression

An idea that the label is a linear combination of the input data or feature vectors.

Linearity

Three assumptions that need to be tested before a linear model can be accurately fit to the data.

Linearity, constant variance, and no strong correlation among features.

This is where one feature can be linearly derived from another; in the most trivial example, the two are related by a constant.

Multicollinearity

What is often used in machine learning to penalize the model for learning weights that do not generalize well to unseen data, reducing overall model complexity and helping prevent overfitting?

Regularization

This tends to reduce the values of weights that are unimportant in predicting the labels, where you add an L2 penalty or quadratic penalty to the weights.

Ridge

This tends to shrink the weights to zero, where you add an L1 penalty or absolute value penalty to the weights. It also eliminates unimportant features.

Lasso

This combines ridge and lasso regularization.

Elastic net

Lasso regression is a form of _____: the name stands for least absolute _____ and selection operator.

Shrinkage

T/F: Often in machine learning, it is not the model but how you engineer features that determines model performance and ultimately business value.

True

The application of linear regression to binary or multiclass classification problems using the logit function.

Logistic regression

T/F: Logistic regression can apply to both binary and multiclass classification problems.

True

The _____ is one of the most common loss functions for classification problems in machine learning irrespective of the underlying algorithm.

Cross-entropy loss
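
As a rough illustration of this card, the binary case of cross-entropy loss can be sketched in a few lines of plain Python. The function names (`sigmoid`, `binary_cross_entropy`) and the clipping constant `eps` are assumptions made for this example, not part of any particular library.

```python
import math

def sigmoid(z):
    """Map a raw score (logit) to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip the probability so log(0) is never evaluated.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]
confident_wrong = [0.1, 0.9, 0.2, 0.8]
print(binary_cross_entropy(labels, confident_right))
print(binary_cross_entropy(labels, confident_wrong))
```

Predictions that are confidently wrong are penalized much more heavily than predictions that are confidently right, which is the behavior a classifier's loss should have.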

Are logistic regression models large?

No, they only store coefficients and can thus be quite small.

Logistic regression often serves as a _____ for model performance.

Benchmark

_____ is used to solve classification problems; _____ is used for regression problems.

Logistic regression/linear regression

What is the built-in algorithm SageMaker has that covers both linear and logistic regression use cases?

Linear learner

What data format does Linear Learner use?

RecordIO or CSV. The algorithm is built using the MXNet framework, which recognizes the RecordIO data format, and it also recognizes CSV data.

_____ can be used for supervised learning in both classification and regression tasks, and they account for cases where the label may also be proportional to interaction terms between different independent variables.

Factorization Machines

What do you use when dealing with large sparse data?

Factorization Machines

What method do Factorization Machines work by?

Matrix factorization. The algorithm is also built on top of the MXNet framework and accepts the RecordIO format, but not CSV.

Which built-in algorithm should you use for recommender systems or item recommendation use cases?

AWS recommends using Factorization Machines for such large sparse matrix use cases

A supervised learning algorithm on structured data that works by first building an index consisting of the distance between any two data points in your dataset; and then, when a new point whose label is unknown is provided, this algorithm calculates the nearest neighbors to that point based on a specified distance metric, and either averages the label values for those k-points in the case of regression or uses the most frequently returned label as the label for classification.

k-Nearest Neighbors
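
The lookup described on this card can be sketched as follows. This is a minimal brute-force version that computes distances at query time instead of building an index first, and it assumes Euclidean distance; the helper names (`k_nearest`, `knn_classify`, `knn_regress`) are made up for the example.

```python
import math
from collections import Counter

def k_nearest(train_points, query, k):
    """Return the indices of the k training points closest to the query."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    return order[:k]

def knn_classify(train_points, labels, query, k=3):
    """Classification: use the most frequent label among the k neighbors."""
    votes = Counter(labels[i] for i in k_nearest(train_points, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_points, targets, query, k=3):
    """Regression: average the target values of the k neighbors."""
    idx = k_nearest(train_points, query, k)
    return sum(targets[i] for i in idx) / len(idx)

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
targets = [1.0, 1.0, 1.0, 10.0, 10.0, 10.0]
print(knn_classify(points, labels, (0.5, 0.5)))
print(knn_regress(points, targets, (5.5, 5.5)))
```

A production implementation would normally precompute an index so that lookups stay fast on large datasets; the sorting step here is only practical for small examples.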

How do you train for k-nearest neighbor?

Build an index

_____ corresponds to performing fast lookups against that index.

Inference

1. Sample dataset 2. Reduce dimensionality 3. Assign each vector a cluster

Steps to train a k-nearest neighbors model

Logistic regression solves binary classification problems using a loss function known as _____.

Cross-entropy loss

This algorithm is particularly popular in biological fields, and it aims to find the separating hyper-plane that separates two classes by the widest so-called margin. The wider the margin, the better the quality of the algorithm and its ability to generalize.

Support vector machines

How do you generalize to nonlinear situations where the separating boundary may not be linear with support vector machines?

By introducing the kernel trick

Lets the tree learn when to spawn off new nodes based on the input data.

decision tree learning

Consists of a root node or parent node and spawns off child or leaf nodes based on certain criteria.

decision tree

Uses a metric to decide when it is appropriate to split a parent node into child nodes. Then this rule is recursively applied to the child nodes. Splitting stops when no further gains can be made or some other condition is met.

Classification and Regression Trees (CART)

How does the CART algorithm decide when to split a parent node?

Gini impurity or the entropy metric

A measure of the probability of incorrectly classifying a data point with a particular label.

Gini impurity

_____ operates by using a greedy algorithm to select which input variables to split on and for that input variable, all different split points are evaluated for the Gini impurity.

CART
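
To make the Gini-based split search concrete, here is a small sketch for a single numeric feature. It assumes the usual definition of Gini impurity (one minus the sum of squared class proportions) and scores each candidate split by the size-weighted impurity of the two children; the function names are illustrative only.

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0 means a pure node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

def best_split(values, labels):
    """Greedily evaluate every observed value as a threshold (value <= threshold)."""
    best_threshold, best_score = None, float("inf")
    for threshold in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        if not left or not right:
            continue  # a split must send data to both children
        score = weighted_gini(left, right)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

feature = [1, 2, 3, 10, 11, 12]
classes = ["a", "a", "a", "b", "b", "b"]
print(best_split(feature, classes))
```

A full CART implementation would repeat this search over every input variable and then recurse into the child nodes.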

_____ takes the same ideas behind decision trees, namely the CART algorithm, but instead of bagging, uses a technique called boosting.

XGBoost

_____ refers to sequential learning where each subsequent tree aims to correct the errors made by its predecessor, which can also prevent overfitting, as each individual tree can be a so-called weak learner or a shallow tree, but collectively, they can become a strong learner.

Boosting

Popular boosting algorithms

AdaBoost and LogitBoost, both of which can be viewed as forms of gradient boosting

_____ refers to the ability to treat the error terms as continuous variables and to use Taylor's expansion to expand them in terms of their gradients or derivatives.

Gradient boosting

A key benefit of XGBoost is its ability to _____, which can occur for common machine learning problems such as fraud detection.

scale to very large datasets

T/F: SageMaker offers a built-in XGBoost algorithm

True

The _____ is a popular and efficient open-source implementation of the gradient boosted trees algorithm

XGBoost (eXtreme Gradient Boosting)

Use XGBoost as a _____ to run your customized training scripts that can incorporate additional data processing into your training jobs.

framework

Use the XGBoost built-in algorithm to _____.

build an XGBoost training container

T/F: Gradient boosting operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features.

True

_____ can be used when your data does not have labels but when you are looking to cluster your data points into “similar” groups.

k-means clustering

What are these steps?
1. Identify a random set of k points as the cluster centers.
2. For each of the k centers, find the subset of points from the data that are closest to this center using a distance metric such as Euclidean distance.
3. Define the new centroid as the mean vector of all these points.
4. Repeatedly perform these steps until the algorithm converges, that is, the cluster centers do not move past a certain threshold.

how k-means training works
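
The steps on this card can be sketched in plain Python as shown here. This is a minimal version that assumes Euclidean distance, picks the initial centers by sampling k data points, and keeps an empty cluster's old center; the function name `kmeans` and its parameters (`max_iters`, `tol`, `seed`) are assumptions for illustration.

```python
import math
import random

def kmeans(points, k, max_iters=100, tol=1e-6, seed=0):
    rng = random.Random(seed)
    # Step 1: choose k random data points as the initial cluster centers.
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(max_iters):
        # Step 2: assign every point to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3: move each center to the mean vector of its assigned points.
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centers.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centers.append(centers[i])  # keep the center of an empty cluster
        # Step 4: stop once no center moved more than the tolerance.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, clusters

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(data, k=2)
print(sorted(centers))
```

On well-separated data like this, the two centers settle on the means of the two groups; on harder data the result can depend on the random starting points.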

_____ is used when you have a continuous label (regression task), where the assumption is made that the label is linearly related to the data.

Linear regression

_____ models are powerful because they are easy to interpret, but the model makes multiple assumptions that need to be tested before a linear model can be accurately fit to the data.

Linear regression

1. Linearity - the label is a linear combination of the input data or feature vectors.
2. Constant variance - the statistical variance in the label is identical, regardless of the value of the input data.
3. Features cannot be strongly correlated with one another.

Linear regression assumptions

_____ are designed to reduce decision tree overfitting by creating a collection of decision trees.

Random forests

A _____ works by building many trees, but each tree is trained on only a sample of the training data (and typically a random subset of the input features) using a method known as bootstrap aggregation or bagging, which essentially refers to sampling with replacement.

random forest

Increase the minimum samples per leaf and decrease the maximum depth of the trees.

How to avoid overfitting

Pro: training multiple trees in parallel. Con: different trees do not work together to reduce the overall errors.

Random forests

_____ are deep learning algorithms consisting of alternating convolutional layers, which apply various filters on the input data to capture different information at different scales, followed by pooling layers, which reduce the number of parameters in the network and also the spatial size of the representation.

CNNs

_____ have the ability to retain a user's session history information as part of the model training.

Recurrent neural networks (RNNs)

_____ refers to taking a model that was pretrained on one dataset, freezing the initial layers, and letting it relearn the last few layers of the model on a different dataset.

Transfer learning

T/F: It is hard for an ML model to understand contextual information

True

_____ is a service that you can use to label your image, text, audio, or even tabular data; and it lets you outsource the labeling task to a public workforce (via Amazon Mechanical Turk) or a private workforce (either a third-party labeling company or your own private workforce within your organization) to label data.

Amazon SageMaker Ground Truth

How do you determine if the model is overfitting/underfitting your data?

Comparing the performance of your model against the training/validation datasets

An _____ is a function or an algorithm that adjusts the attributes of the neural network, such as weights and learning rates. Thus, it helps in reducing the overall loss and improving accuracy.

optimizer

_____ is an optimization algorithm for finding a local minimum of a differentiable function, and in machine learning, it is simply used to find the values of a function's parameters (coefficients) that minimize a cost function as far as possible.

Gradient Descent
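
A one-dimensional sketch shows the basic loop: repeatedly step against the gradient until the steps become negligible. The function name `gradient_descent`, the stopping rule, and the example cost function f(x) = (x - 3)^2 are all assumptions chosen for illustration.

```python
def gradient_descent(grad, start, learning_rate=0.1, max_iters=1000, tol=1e-8):
    """Follow the negative gradient until the update is smaller than tol."""
    x = start
    for _ in range(max_iters):
        step = learning_rate * grad(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Cost function f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)
```

The learning rate controls the step size: too small and convergence is slow, too large and the iterates can overshoot the minimum.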

In machine learning (ML), a _____ is used to measure model performance by calculating the deviation of a model's predictions from the correct, “ground truth” predictions.

loss function

_____ are places where the function attains its smallest value in a neighborhood of a point.

Local minima

_____ in machine learning refers to the point where the model's predictions stop improving, or the error rate becomes constant

Convergence

The _____ is a hyperparameter that defines the number of samples to work through before updating the internal model parameters.

batch size

_____ is a measure of the likelihood of an event to occur.

Probability

Which is better for handling the massively parallel computation required for ML: GPU or CPU?

GPU

Which system is better for ML? Distributed or Non-Distributed?

Distributed, because it handles large volumes of data and provides fault tolerance (if one node goes down, the others are still up)

Which is best for ML, Spark or Non-Spark?

Spark, because it allows us to analyze and understand complex data sets that were previously considered too difficult to work with.

What is a trigger for model retraining?

Drift

_____ is fundamental to ensure that a machine learning model is constantly providing the most up-to-date predictions, while minimizing manual interventions and optimizing for monitoring and reliability. Can happen on a schedule or be triggered by an event.

Retraining

_____ involves lifting and shifting the batch training code defined at development time into an automated workflow.

Model retraining

T/F: You should abstract feature selection, model parameters, and other configurable pipeline parameters as input variables of the retraining pipeline.

True

When you have highly correlated features in your data, to prevent linear regression models from becoming unusable, use this to penalize the model for learning weights that do not generalize well to unseen data.

Regularization

What are three common forms of regularization?

Ridge (add L2 penalty or quadratic penalty to weights)
Lasso (aka shrinkage: add L1 penalty or absolute value penalty to weights)
Elastic net (combines the two)
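
The three penalties can be written out directly. This sketch assumes a regularization strength `lam` and, for elastic net, a mixing parameter `l1_ratio` between 0 and 1; both names are illustrative, and real libraries may scale the terms differently.

```python
def ridge_penalty(weights, lam):
    """L2 (quadratic) penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def lasso_penalty(weights, lam):
    """L1 (absolute value) penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def elastic_net_penalty(weights, lam, l1_ratio=0.5):
    """Weighted mix of the L1 and L2 penalties."""
    return (l1_ratio * lasso_penalty(weights, lam)
            + (1.0 - l1_ratio) * ridge_penalty(weights, lam))

weights = [0.5, -2.0, 0.0]
print(ridge_penalty(weights, lam=0.1))
print(lasso_penalty(weights, lam=0.1))
print(elastic_net_penalty(weights, lam=0.1, l1_ratio=0.5))
```

In training, the chosen penalty is added to the loss, so larger weights cost more and the optimizer is pushed toward simpler models.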

_____ is a technique used in machine learning to evaluate the performance of a model on unseen data. It involves dividing the available data into multiple folds or subsets, using one of these folds as a validation set, and training the model on the remaining folds.

Cross validation
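
The fold rotation can be sketched as an index generator. This is a simplified version without shuffling, and the function name `k_fold_indices` is an assumption for this example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

for train, validation in k_fold_indices(n_samples=5, k=2):
    print(train, validation)
```

Each sample lands in the validation set exactly once, and the model's scores across the folds are usually averaged.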

The main purpose of cross validation is to detect and help prevent _____, which occurs when a model fits the training data too closely and performs poorly on new, unseen data.

Overfitting

_____ is a procedure to set the weights of a neural network to small random values that define the starting point for the optimization (learning or training) of the neural network model.

Weight initialization

What does every neural network consist of?

Layers of nodes (artificial neurons):
Input layer
One or more hidden layers
Output layer

_____ allow us to classify and cluster data at a high velocity

Neural networks

The _____ is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function.

learning rate

_____ influences to what extent newly acquired information overrides old information and metaphorically represents the speed at which a machine learning model "learns".

Learning rate

_____ decides whether a neuron should be activated or not, which means that it will decide whether the neuron's input to the network is important or not in the process of prediction using simpler mathematical operations.

Activation function

_____ use a decision tree to represent how different input variables can be used to predict a target value, and they're used for both classification and regression problems.

Tree-based models

This is the building block for many complex machine learning algorithms, including deep neural networks, and it predicts the target variable using a linear function of the input features.

Linear models

What techniques help avoid overfitting and underfitting?

Feature engineering, regularization, ensemble learning, and cross-validation

The _____ represents the probability that the model, if given a randomly chosen positive and negative example, will rank the positive higher than the negative.

area under the ROC curve (AUC)
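
The ranking interpretation on this card can be computed directly by comparing every positive example with every negative one. This brute-force sketch assumes 0/1 labels and counts tied scores as half a win; the function name `auc_by_pairs` is made up for the example.

```python
def auc_by_pairs(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))

y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc_by_pairs(y_true, scores))  # 3 of the 4 pairs are ranked correctly
```

A value of 1.0 means every positive outranks every negative, while 0.5 is what random scores would give on average.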

_____ is the proportion of all classifications that were correct, whether positive or negative. It is mathematically defined as correct classifications / total classifications.

Accuracy

_____ is the proportion of all the model's positive classifications that are actually positive. It is mathematically defined as correctly classified actual positives / everything classified as positive.

Precision

_____ improves as false positives decrease, while recall improves when false negatives decrease.

Precision

The true positive rate (TPR), or the proportion of all actual positives that were classified correctly as positives, is also known as _____, which is defined as correctly classified actual positives / all actual positives.

Recall

_____ is commonly used in machine learning as it gives a relatively high weight to large errors, which means it should be more useful when large errors are particularly undesirable. It is also valuable because it retains the same units as the quantity being predicted, making it easier to interpret.

RMSE
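
A short comparison with mean absolute error (MAE) shows why RMSE weights large errors more heavily. MAE is brought in here only as a point of comparison, and the function names are illustrative.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square the errors, average, take the square root."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error, shown for comparison."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

actual = [3.0, 5.0, 7.0, 9.0]
predicted = [3.0, 5.0, 7.0, 13.0]  # one large miss of 4 units
print(rmse(actual, predicted))  # 2.0
print(mae(actual, predicted))   # 1.0
```

The single large miss doubles the RMSE relative to the MAE, and both values stay in the units of the quantity being predicted.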

The proportion of actual negatives that were incorrectly classified as positive, i.e., FP / (FP + TN).

False Positive Rate (FPR)

The harmonic mean of precision and recall

F1 Score

A _____ is used to measure the performance of a classifier in depth and the accuracy of a classification model.

confusion matrix
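
The metrics on the preceding cards all follow from the four cells of a binary confusion matrix. The sketch here assumes 0/1 labels with 1 as the positive class, and returns 0.0 when a denominator is zero; the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, true negatives, false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, tn, fn

def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(tp, fp, tn, fn):
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,  # also the true positive rate
        "fpr": safe_div(fp, fp + tn),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts)
print(classification_metrics(*counts))
```

Reading the counts first makes it easier to see which kind of error (false positives or false negatives) is driving a weak precision or recall score.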

_____ is the process of measuring the quality and effectiveness of a machine learning model based on its interaction with real users and data in a live system.

Online evaluation

_____ is the process of measuring the quality and effectiveness of a machine learning model based on historical or simulated data and metrics.

Offline evaluation

T/F: Offline evaluation is usually faster, cheaper, and easier to perform than online evaluation, but it may not capture the true behavior and preferences of the users, the dynamics of the data, or the impact of the model on the system. Online evaluation can provide more realistic and actionable feedback, but it may also be more costly, risky, and complex to conduct.

True

_____ is an optimization technique often used to understand how an altered variable affects audience or user engagement. It's a common method used in marketing, web design, product development, and user experience design to improve campaigns and goal conversion rates.

A/B testing

What are some metrics used to compare models?

Time to train, quality, and engineering costs

(116 cards)