Machine Learning Glossary Flashcards

(171 cards)

1
Q

ablation

A

A technique for evaluating the importance of a feature or component by temporarily removing it from a model. You then retrain the model without that feature or component, and if the retrained model performs significantly worse, then the removed feature or component was likely important.

For example, suppose you train a classification model on 10 features and achieve 88% precision on the test set. To check the importance of the first feature, you can retrain the model using only the nine other features. If the retrained model performs significantly worse (for instance, 55% precision), then the removed feature was probably important. Conversely, if the retrained model performs equally well, then that feature was probably not that important.

Ablation can also help determine the importance of:

Larger components, such as an entire subsystem of a larger ML system
Processes or techniques, such as a data preprocessing step

In both cases, you would observe how the system's performance changes (or doesn't change) after you've removed the component.

2
Q

A/B testing

A

A statistical way of comparing two (or more) techniques—the A and the B. Typically, the A is an existing technique, and the B is a new technique. A/B testing not only determines which technique performs better but also whether the difference is statistically significant.

A/B testing usually compares a single metric on two techniques; for example, how does model accuracy compare for two techniques? However, A/B testing can also compare any finite number of metrics.
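
For illustration, here is a minimal sketch of one common way to test whether a difference between two techniques is statistically significant: a two-proportion z-test on conversion rates, using only the Python standard library. The sample counts are made up.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)   # pooled proportion
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))         # two-sided p-value
    return z, p_value

# Technique A: 480 conversions out of 10,000; technique B: 560 out of 10,000.
z, p = two_proportion_z_test(480, 10_000, 560, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # a small p suggests a significant difference
```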

3
Q

accelerator chip

A

A category of specialized hardware components designed to perform key computations needed for deep learning algorithms.

Accelerator chips (or just accelerators, for short) can significantly increase the speed and efficiency of training and inference tasks compared to a general-purpose CPU. They are ideal for training neural networks and similar computationally intensive tasks.

Examples of accelerator chips include:

Google’s Tensor Processing Units (TPUs) with dedicated hardware for deep learning.
NVIDIA’s GPUs, which were initially designed for graphics processing but whose highly parallel architecture can significantly increase processing speed.

4
Q

accuracy

A

The number of correct classification predictions divided by the total number of predictions. That is:

Accuracy = correct predictions / total predictions

For example, a model that made 40 correct predictions and 10 incorrect predictions would have an accuracy of:

Accuracy = 40 / (40 + 10) = 0.8 = 80%

Binary classification provides specific names for the different categories of correct predictions and incorrect predictions. So, the accuracy formula for binary classification is as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where:

TP is the number of true positives (correct predictions).
TN is the number of true negatives (correct predictions).
FP is the number of false positives (incorrect predictions).
FN is the number of false negatives (incorrect predictions).

Compare and contrast accuracy with precision and recall.

See Classification: Accuracy, recall, precision and related metrics in Machine Learning Crash Course for more information.
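
A tiny sketch of the binary classification formula above; the split of the 40 correct predictions into true positives and true negatives is hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = correct predictions / total predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# The example above: 40 correct predictions (say, 25 TP + 15 TN) and
# 10 incorrect predictions (say, 6 FP + 4 FN).
print(accuracy(tp=25, tn=15, fp=6, fn=4))  # 0.8
```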

5
Q

action

A

In reinforcement learning, the mechanism by which the agent transitions between states of the environment. The agent chooses the action by using a policy.

6
Q

activation function

A

A function that enables neural networks to learn nonlinear (complex) relationships between features and the label.

Popular activation functions include:

ReLU
Sigmoid

The plots of activation functions are never single straight lines. For example, the plot of the ReLU activation function consists of two straight line segments: y = 0 for negative inputs, and y = x for inputs of zero or greater.
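
A minimal sketch of the two activation functions listed above:

```python
import math

def relu(x):
    """ReLU: 0 for negative inputs, the input itself otherwise."""
    return max(0.0, x)

def sigmoid(x):
    """Sigmoid: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), sigmoid(x))
```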

7
Q

active learning

A

A training approach in which the algorithm chooses some of the data it learns from. Active learning is particularly valuable when labeled examples are scarce or expensive to obtain. Instead of blindly seeking a diverse range of labeled examples, an active learning algorithm selectively seeks the particular range of examples it needs for learning.

8
Q

AdaGrad

A

A sophisticated gradient descent algorithm that rescales the gradients of each parameter, effectively giving each parameter an independent learning rate. For a full explanation, see Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.
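
For illustration only, a minimal sketch of the core AdaGrad update; the paper linked above is the authoritative source, and real implementations differ in details such as initial accumulator values.

```python
import numpy as np

def adagrad_update(weights, grads, cache, lr=0.1, eps=1e-8):
    """One AdaGrad step: accumulate squared gradients, then scale each
    parameter's step by 1/sqrt(accumulated squared gradients), giving each
    parameter an effectively independent learning rate."""
    cache += grads ** 2
    weights -= lr * grads / (np.sqrt(cache) + eps)
    return weights, cache

w = np.array([0.5, -0.3])
cache = np.zeros_like(w)
g = np.array([0.2, -0.8])
w, cache = adagrad_update(w, g, cache)
print(w, cache)  # the larger gradient gets a proportionally damped step
```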

9
Q

agent

A

In reinforcement learning, the entity that uses a policy to maximize the expected return gained from transitioning between states of the environment.

More generally, an agent is software that autonomously plans and executes a series of actions in pursuit of a goal, with the ability to adapt to changes in its environment. For example, an LLM-based agent might use an LLM to generate a plan, rather than applying a reinforcement learning policy.

10
Q

agglomerative clustering

A

A type of hierarchical clustering. Hierarchical clustering algorithms create a tree of clusters and are well-suited to hierarchical data, such as botanical taxonomies. There are two types of hierarchical clustering algorithms:

Agglomerative clustering first assigns every example to its own cluster, and iteratively merges the closest clusters to create a hierarchical tree.
Divisive clustering first groups all examples into one cluster and then iteratively divides the cluster into a hierarchical tree.

Contrast with centroid-based clustering.

See Clustering algorithms in the Clustering course for more information.

11
Q

anomaly detection

A

The process of identifying outliers. For example, if the mean for a certain feature is 100 with a standard deviation of 10, then anomaly detection should flag a value of 200 as suspicious.
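
A minimal sketch of the example above, flagging values by z-score; the 3-standard-deviation cutoff is an illustrative choice, not part of the definition.

```python
def is_anomalous(value, mean=100.0, stddev=10.0, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    z_score = abs(value - mean) / stddev
    return z_score > threshold

print(is_anomalous(200))  # True: 10 standard deviations from the mean
print(is_anomalous(105))  # False
```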

12
Q

AR

A

Abbreviation for augmented reality.

13
Q

area under the PR curve

A

Area under the interpolated precision-recall curve, obtained by plotting (recall, precision) points for different values of the classification threshold.

14
Q

area under the ROC curve

A

A number between 0.0 and 1.0 representing a binary classification model’s ability to separate positive classes from negative classes. The closer the AUC is to 1.0, the better the model’s ability to separate classes from each other.

For example, the following illustration shows a classification model that separates positive classes (green ovals) from negative classes (purple rectangles) perfectly. This unrealistically perfect model has an AUC of 1.0:

[Figure: a number line with 8 positive examples on one side and 9 negative examples on the other side.]

Conversely, the following illustration shows the results for a classification model that generated random results. This model has an AUC of 0.5:

[Figure: a number line with 6 positive examples and 6 negative examples in the sequence: positive, negative, positive, negative, positive, negative, positive, negative, positive, negative, positive, negative.]

Yes, the preceding model has an AUC of 0.5, not 0.0.

Most models are somewhere between the two extremes. For instance, the following model separates positives from negatives somewhat, and therefore has an AUC somewhere between 0.5 and 1.0:

[Figure: a number line with 6 positive examples and 6 negative examples in the sequence: negative, negative, negative, negative, positive, negative, positive, positive, negative, positive, positive, positive.]

AUC ignores any value you set for classification threshold. Instead, AUC considers all possible classification thresholds.
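
If you use scikit-learn, roc_auc_score computes AUC directly from labels and raw scores; the toy labels and scores below are made up.

```python
# Requires scikit-learn: pip install scikit-learn
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                     # ground-truth labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.3]   # model's raw scores

# roc_auc_score sweeps over all possible classification thresholds,
# so no threshold needs to be chosen beforehand.
print(roc_auc_score(y_true, y_score))  # 0.9375 for this toy data
```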

15
Q

artificial general intelligence

A

A non-human mechanism that demonstrates a broad range of problem solving, creativity, and adaptability. For example, a program demonstrating artificial general intelligence could translate text, compose symphonies, and excel at games that have not yet been invented.

16
Q

artificial intelligence

A

A non-human program or model that can solve sophisticated tasks. For example, a program or model that translates text or a program or model that identifies diseases from radiologic images both exhibit artificial intelligence.

Formally, machine learning is a sub-field of artificial intelligence. However, in recent years, some organizations have begun using the terms artificial intelligence and machine learning interchangeably.

17
Q

attention

A

A mechanism used in a neural network that indicates the importance of a particular word or part of a word. Attention compresses the amount of information a model needs to predict the next token/word. A typical attention mechanism might consist of a weighted sum over a set of inputs, where the weight for each input is computed by another part of the neural network.

Refer also to self-attention and multi-head self-attention, which are the building blocks of Transformers.

18
Q

attribute

A

Synonym for feature.

In machine learning fairness, attributes often refer to characteristics pertaining to individuals.

19
Q

attribute sampling

A

A tactic for training a decision forest in which each decision tree considers only a random subset of possible features when learning the condition. Generally, a different subset of features is sampled for each node. In contrast, when training a decision tree without attribute sampling, all possible features are considered for each node.

20
Q

AUC (Area under the ROC curve)

A

A number between 0.0 and 1.0 representing a binary classification model’s ability to separate positive classes from negative classes. The closer the AUC is to 1.0, the better the model’s ability to separate classes from each other.

For example, the following illustration shows a classification model that separates positive classes (green ovals) from negative classes (purple rectangles) perfectly. This unrealistically perfect model has an AUC of 1.0:

[Figure: a number line with 8 positive examples on one side and 9 negative examples on the other side.]

Conversely, the following illustration shows the results for a classification model that generated random results. This model has an AUC of 0.5:

[Figure: a number line with 6 positive examples and 6 negative examples in the sequence: positive, negative, positive, negative, positive, negative, positive, negative, positive, negative, positive, negative.]

Yes, the preceding model has an AUC of 0.5, not 0.0.

Most models are somewhere between the two extremes. For instance, the following model separates positives from negatives somewhat, and therefore has an AUC somewhere between 0.5 and 1.0:

[Figure: a number line with 6 positive examples and 6 negative examples in the sequence: negative, negative, negative, negative, positive, negative, positive, positive, negative, positive, positive, positive.]

AUC ignores any value you set for classification threshold. Instead, AUC considers all possible classification thresholds.

21
Q

augmented reality

A

A technology that superimposes a computer-generated image on a user’s view of the real world, thus providing a composite view.

22
Q

autoencoder

A

A system that learns to extract the most important information from the input. Autoencoders are a combination of an encoder and decoder. Autoencoders rely on the following two-step process:

The encoder maps the input to a (typically) lossy lower-dimensional (intermediate) format.
The decoder builds a lossy version of the original input by mapping the lower-dimensional format to the original higher-dimensional input format.

Autoencoders are trained end-to-end by having the decoder attempt to reconstruct the original input from the encoder’s intermediate format as closely as possible. Because the intermediate format is smaller (lower-dimensional) than the original format, the autoencoder is forced to learn what information in the input is essential, and the output won’t be perfectly identical to the input.

For example:

If the input data is a graphic, the non-exact copy would be similar to the original graphic, but somewhat modified. Perhaps the non-exact copy removes noise from the original graphic or fills in some missing pixels.
If the input data is text, an autoencoder would generate new text that mimics (but is not identical to) the original text.

23
Q

automatic evaluation

A

Using software to judge the quality of a model’s output.

When model output is relatively straightforward, a script or program can compare the model’s output to a golden response. This type of automatic evaluation is sometimes called programmatic evaluation. Metrics such as ROUGE or BLEU are often useful for programmatic evaluation.

When model output is complex or has no one right answer, a separate ML program called an autorater sometimes performs the automatic evaluation.

Contrast with human evaluation.

24
Q

automation bias

A

When a human decision maker favors recommendations made by an automated decision-making system over information made without automation, even when the automated decision-making system makes errors.

See Fairness: Types of bias in Machine Learning Crash Course for more information.

25
AutoML
Any automated process for building machine learning models. AutoML can automatically do tasks such as the following:

Search for the most appropriate model.
Tune hyperparameters.
Prepare data (including performing feature engineering).
Deploy the resulting model.

AutoML is useful for data scientists because it can save them time and effort in developing machine learning pipelines and improve prediction accuracy. It is also useful to non-experts, by making complicated machine learning tasks more accessible to them.

See Automated Machine Learning (AutoML) in Machine Learning Crash Course for more information.
26
autorater evaluation
A hybrid mechanism for judging the quality of a generative AI model's output that combines human evaluation with automatic evaluation. An autorater is an ML model trained on data created by human evaluation. Ideally, an autorater learns to mimic a human evaluator. Prebuilt autoraters are available, but the best autoraters are fine-tuned specifically to the task you are evaluating.
27
auto-regressive model
A model that infers a prediction based on its own previous predictions. For example, auto-regressive language models predict the next token based on the previously predicted tokens. All Transformer-based large language models are auto-regressive. In contrast, GAN-based image models are usually not auto-regressive since they generate an image in a single forward-pass and not iteratively in steps. However, certain image generation models are auto-regressive because they generate an image in steps.
28
auxiliary loss
A loss function—used in conjunction with a neural network model's main loss function—that helps accelerate training during the early iterations when weights are randomly initialized. Auxiliary loss functions push effective gradients to the earlier layers. This facilitates convergence during training by combating the vanishing gradient problem.
29
average precision at k
A metric for summarizing a model's performance on a single prompt that generates ranked results, such as a numbered list of book recommendations. Average precision at k is, well, the average of the precision at k values for each relevant result. The formula for average precision at k is therefore:

average precision at k = (1/n) · (sum of the precision at k values for each relevant item)

where:

n is the number of relevant items in the list.

Contrast with recall at k.
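
A minimal sketch of the formula above; the relevance pattern is made up for illustration.

```python
def average_precision_at_k(relevances):
    """relevances[i] is True if the item at rank i+1 is relevant.
    Averages precision at k over the ranks k where a relevant item appears."""
    precisions = []
    relevant_seen = 0
    for k, is_relevant in enumerate(relevances, start=1):
        if is_relevant:
            relevant_seen += 1
            precisions.append(relevant_seen / k)  # precision at this cutoff
    return sum(precisions) / len(precisions) if precisions else 0.0

# Ranked list where the items at positions 1, 2, and 5 are relevant:
print(average_precision_at_k([True, True, False, False, True]))
# (1/1 + 2/2 + 3/5) / 3 = 0.8667
```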
30
axis-aligned condition
In a decision tree, a condition that involves only a single feature. For example, if area is a feature, then the following is an axis-aligned condition:

area > 200

Contrast with oblique condition.
31
backpropagation
The algorithm that implements gradient descent in neural networks.

Training a neural network involves many iterations of the following two-pass cycle:

During the forward pass, the system processes a batch of examples to yield prediction(s). The system compares each prediction to each label value. The difference between the prediction and the label value is the loss for that example. The system aggregates the losses for all the examples to compute the total loss for the current batch.
During the backward pass (backpropagation), the system reduces loss by adjusting the weights of all the neurons in all the hidden layer(s).

Neural networks often contain many neurons across many hidden layers. Each of those neurons contributes to the overall loss in different ways. Backpropagation determines whether to increase or decrease the weights applied to particular neurons.

The learning rate is a multiplier that controls the degree to which each backward pass increases or decreases each weight. A large learning rate will increase or decrease each weight more than a small learning rate.

In calculus terms, backpropagation implements the chain rule from calculus. That is, backpropagation calculates the partial derivative of the error with respect to each parameter.

Years ago, ML practitioners had to write code to implement backpropagation. Modern ML APIs like Keras now implement backpropagation for you. Phew!

See Neural networks in Machine Learning Crash Course for more information.
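
A minimal sketch of one forward/backward cycle for a single linear neuron with a squared-error loss; real frameworks such as Keras automate this, and all the numbers below are illustrative.

```python
w, b = 0.5, 0.0          # parameters (values stand in for random initialization)
x, y_true = 2.0, 3.0     # one training example
learning_rate = 0.1

# Forward pass: prediction and loss.
y_pred = w * x + b
loss = (y_pred - y_true) ** 2

# Backward pass: the chain rule gives the partial derivative of the loss
# with respect to each parameter.
dloss_dpred = 2 * (y_pred - y_true)
dloss_dw = dloss_dpred * x   # because d(y_pred)/dw = x
dloss_db = dloss_dpred * 1   # because d(y_pred)/db = 1

# Gradient descent step, scaled by the learning rate.
w -= learning_rate * dloss_dw
b -= learning_rate * dloss_db
print(loss, w, b)  # 4.0 1.3 0.4
```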
32
bagging
A method to train an ensemble where each constituent model trains on a random subset of training examples sampled with replacement. For example, a random forest is a collection of decision trees trained with bagging. The term bagging is short for bootstrap aggregating. See Random forests in the Decision Forests course for more information.
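
A minimal sketch of the sampling step that gives bagging its "bootstrap" name; the toy data is for illustration.

```python
import random

def bootstrap_sample(examples):
    """Sample len(examples) items with replacement: some examples appear
    more than once, others not at all, which is the sampling bagging uses."""
    return random.choices(examples, k=len(examples))

training_set = list(range(10))
# Each constituent model of the ensemble trains on its own bootstrap sample.
samples = [bootstrap_sample(training_set) for _ in range(3)]
for s in samples:
    print(s)
```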
33
bag of words
A representation of the words in a phrase or passage, irrespective of order. For example, bag of words represents the following three phrases identically:

the dog jumps
jumps the dog
dog jumps the

Each word is mapped to an index in a sparse vector, where the vector has an index for every word in the vocabulary. For example, the phrase the dog jumps is mapped into a feature vector with non-zero values at the three indexes corresponding to the words the, dog, and jumps. The non-zero value can be any of the following:

A 1 to indicate the presence of a word.
A count of the number of times a word appears in the bag. For example, if the phrase were the maroon dog is a dog with maroon fur, then both maroon and dog would be represented as 2, while the other words would be represented as 1.
Some other value, such as the logarithm of the count of the number of times a word appears in the bag.
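
A minimal count-based sketch using the maroon-dog example above; the helper name and the fixed vocabulary are illustrative.

```python
from collections import Counter

def bag_of_words(phrase, vocabulary):
    """Map a phrase to a count vector with one slot per vocabulary word."""
    counts = Counter(phrase.split())
    return [counts[word] for word in vocabulary]

vocabulary = ["a", "dog", "fur", "is", "maroon", "the", "with"]
# Word order is ignored; only counts survive.
print(bag_of_words("the maroon dog is a dog with maroon fur", vocabulary))
# [1, 2, 1, 1, 2, 1, 1]  ("dog" and "maroon" each appear twice)
```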
34
baseline
A model used as a reference point for comparing how well another model (typically, a more complex one) is performing. For example, a logistic regression model might serve as a good baseline for a deep model. For a particular problem, the baseline helps model developers quantify the minimal expected performance that a new model must achieve for the new model to be useful.
35
batch
The set of examples used in one training iteration. The batch size determines the number of examples in a batch. See epoch for an explanation of how a batch relates to an epoch. See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.
36
batch inference
The process of inferring predictions on multiple unlabeled examples divided into smaller subsets ("batches"). Batch inference can take advantage of the parallelization features of accelerator chips. That is, multiple accelerators can simultaneously infer predictions on different batches of unlabeled examples, dramatically increasing the number of inferences per second. See Production ML systems: Static versus dynamic inference in Machine Learning Crash Course for more information.
37
batch normalization
Normalizing the input or output of the activation functions in a hidden layer. Batch normalization can provide the following benefits:

Make neural networks more stable by protecting against outlier weights.
Enable higher learning rates, which can speed training.
Reduce overfitting.
38
batch size
The number of examples in a batch. For instance, if the batch size is 100, then the model processes 100 examples per iteration.

The following are popular batch size strategies:

Stochastic Gradient Descent (SGD), in which the batch size is 1.
Full batch, in which the batch size is the number of examples in the entire training set. For instance, if the training set contains a million examples, then the batch size would be a million examples. Full batch is usually an inefficient strategy.
Mini-batch, in which the batch size is usually between 10 and 1000. Mini-batch is usually the most efficient strategy.

See the following for more information:

Production ML systems: Static versus dynamic inference in Machine Learning Crash Course.
Deep Learning Tuning Playbook.
39
Bayesian neural network
A probabilistic neural network that accounts for uncertainty in weights and outputs. A standard neural network regression model typically predicts a scalar value; for example, a standard model predicts a house price of 853,000. In contrast, a Bayesian neural network predicts a distribution of values; for example, a Bayesian model predicts a house price of 853,000 with a standard deviation of 67,200. A Bayesian neural network relies on Bayes' Theorem to calculate uncertainties in weights and predictions. A Bayesian neural network can be useful when it is important to quantify uncertainty, such as in models related to pharmaceuticals. Bayesian neural networks can also help prevent overfitting.
40
Bayesian optimization
A probabilistic regression model technique for optimizing computationally expensive objective functions by instead optimizing a surrogate that quantifies the uncertainty using a Bayesian learning technique. Since Bayesian optimization is itself very expensive, it is usually used to optimize expensive-to-evaluate tasks that have a small number of parameters, such as selecting hyperparameters.
41
Bellman equation
In reinforcement learning, the following identity satisfied by the optimal Q-function:

Q(s, a) = r(s, a) + γ · E[max over a' of Q(s', a')]

Reinforcement learning algorithms apply this identity to create Q-learning using the following update rule:

Q(s, a) ← Q(s, a) + α · [r(s, a) + γ · max over a' of Q(s', a') − Q(s, a)]

Beyond reinforcement learning, the Bellman equation has applications to dynamic programming. See the Wikipedia entry for Bellman equation.
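
A minimal plain-Python sketch of the Q-learning update rule above; the tiny Q-table and the constants are made up for illustration.

```python
def q_learning_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step derived from the Bellman equation:
    move Q(s, a) toward reward + gamma * max over a' of Q(s', a')."""
    best_next = max(q[next_state].values())   # max over next-state actions
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])

# Tiny two-state, two-action Q-table.
q = {"s0": {"left": 0.0, "right": 0.0}, "s1": {"left": 0.0, "right": 1.0}}
q_learning_update(q, state="s0", action="right", reward=0.5, next_state="s1")
print(q["s0"]["right"])  # 0.1 * (0.5 + 0.9 * 1.0) = 0.14
```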
42
BERT (Bidirectional Encoder Representations from Transformers)
A model architecture for text representation. A trained BERT model can act as part of a larger model for text classification or other ML tasks.

BERT has the following characteristics:

Uses the Transformer architecture, and therefore relies on self-attention.
Uses the encoder part of the Transformer. The encoder's job is to produce good text representations, rather than to perform a specific task like classification.
Is bidirectional.
Uses masking for unsupervised training.

BERT's variants include:

ALBERT, which is an acronym for A Light BERT.
LaBSE.

See Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing for an overview of BERT.
43
bias (ethics/fairness)
1. Stereotyping, prejudice or favoritism towards some things, people, or groups over others. These biases can affect collection and interpretation of data, the design of a system, and how users interact with a system. Forms of this type of bias include:

automation bias
confirmation bias
experimenter's bias
group attribution bias
implicit bias
in-group bias
out-group homogeneity bias

2. Systematic error introduced by a sampling or reporting procedure. Forms of this type of bias include:

coverage bias
non-response bias
participation bias
reporting bias
sampling bias
selection bias

Not to be confused with the bias term in machine learning models or prediction bias.

See Fairness: Types of bias in Machine Learning Crash Course for more information.
44
bias (math) or bias term
An intercept or offset from an origin. Bias is a parameter in machine learning models, which is symbolized by either of the following:

b
w0

For example, bias is the b in the following formula:

y' = b + w1x1 + w2x2 + ... + wnxn

In a simple two-dimensional line, bias just means "y-intercept." For example, the bias of the line in the following illustration is 2.

[Figure: the plot of a line with a slope of 0.5 and a bias (y-intercept) of 2.]

Bias exists because not all models start from the origin (0,0). For example, suppose an amusement park costs 2 Euros to enter and an additional 0.5 Euro for every hour a customer stays. Therefore, a model mapping the total cost has a bias of 2 because the lowest cost is 2 Euros.

Bias is not to be confused with bias in ethics and fairness or prediction bias.

See Linear Regression in Machine Learning Crash Course for more information.
45
bidirectional
A term used to describe a system that evaluates the text that both precedes and follows a target section of text. In contrast, a unidirectional system only evaluates the text that precedes a target section of text.

For example, consider a masked language model that must determine probabilities for the word or words representing the underline in the following question:

What is the _____ with you?

A unidirectional language model would have to base its probabilities only on the context provided by the words "What", "is", and "the". In contrast, a bidirectional language model could also gain context from "with" and "you", which might help the model generate better predictions.
46
bidirectional language model
A language model that determines the probability that a given token is present at a given location in an excerpt of text based on the preceding and following text.
47
bigram
An N-gram in which N=2.
48
binary classification
A type of classification task that predicts one of two mutually exclusive classes:

the positive class
the negative class

For example, the following two machine learning models each perform binary classification:

A model that determines whether email messages are spam (the positive class) or not spam (the negative class).
A model that evaluates medical symptoms to determine whether a person has a particular disease (the positive class) or doesn't have that disease (the negative class).

Contrast with multi-class classification.

See also logistic regression and classification threshold.

See Classification in Machine Learning Crash Course for more information.
49
binary condition
In a decision tree, a condition that has only two possible outcomes, typically yes or no. For example, the following is a binary condition:

temperature >= 100

Contrast with non-binary condition.

See Types of conditions in the Decision Forests course for more information.
50
binning
Synonym for bucketing.
51
BLEU (Bilingual Evaluation Understudy)
A metric between 0.0 and 1.0 for evaluating machine translations, for example, from Spanish to Japanese. To calculate a score, BLEU typically compares an ML model's translation (generated text) to a human expert's translation (reference text). The degree to which N-grams in the generated text and reference text match determines the BLEU score. The original paper on this metric is BLEU: a Method for Automatic Evaluation of Machine Translation. See also BLEURT.
52
BLEURT (Bilingual Evaluation Understudy from Transformers)
A metric for evaluating machine translations from one language to another, particularly to and from English. For translations to and from English, BLEURT aligns more closely to human ratings than BLEU. Unlike BLEU, BLEURT emphasizes semantic (meaning) similarities and can accommodate paraphrasing. BLEURT relies on a pre-trained large language model (BERT to be exact) that is then fine-tuned on text from human translators. The original paper on this metric is BLEURT: Learning Robust Metrics for Text Generation.
53
boosting
A machine learning technique that iteratively combines a set of simple and not very accurate classifiers (referred to as "weak" classifiers) into a classifier with high accuracy (a "strong" classifier) by upweighting the examples that the model is currently misclassifying.

See Gradient Boosted Decision Trees in the Decision Forests course for more information.
54
bounding box
In an image, the (x, y) coordinates of a rectangle around an area of interest, such as a dog in a photograph.
55
broadcasting
Expanding the shape of an operand in a matrix math operation to dimensions compatible for that operation. For example, linear algebra requires that the two operands in a matrix addition operation must have the same dimensions. Consequently, you can't add a matrix of shape (m, n) to a vector of length n. Broadcasting enables this operation by virtually expanding the vector of length n to a matrix of shape (m, n) by replicating the same values down each column. See the following description of broadcasting in NumPy for more details.
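
In NumPy, the (m, n) plus length-n addition described above works directly; a minimal sketch:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])      # shape (2, 3)
vector = np.array([10, 20, 30])     # shape (3,)

# NumPy virtually expands the vector to shape (2, 3) by repeating it
# for each row, then adds element-wise.
print(matrix + vector)
# [[11 22 33]
#  [14 25 36]]
```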
56
bucketing
Converting a single feature into multiple binary features called buckets or bins, typically based on a value range. The chopped feature is typically a continuous feature.

For example, instead of representing temperature as a single continuous floating-point feature, you could chop ranges of temperatures into discrete buckets, such as:

<= 10 degrees Celsius would be the "cold" bucket.
11 - 24 degrees Celsius would be the "temperate" bucket.
>= 25 degrees Celsius would be the "warm" bucket.

The model will treat every value in the same bucket identically. For example, the values 13 and 22 are both in the temperate bucket, so the model treats the two values identically.

See Numerical data: Binning in Machine Learning Crash Course for more information.
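
A minimal sketch of the three temperature buckets above; the helper name is illustrative.

```python
def temperature_bucket(celsius):
    """Bucket a continuous temperature into the three ranges above."""
    if celsius <= 10:
        return "cold"
    elif celsius <= 24:          # 11-24 degrees
        return "temperate"
    else:                        # >= 25 degrees
        return "warm"

# The model treats every value in a bucket identically:
print(temperature_bucket(13), temperature_bucket(22))  # temperate temperate
```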
57
calibration layer
A post-prediction adjustment, typically to account for prediction bias. The adjusted predictions and probabilities should match the distribution of an observed set of labels.
58
candidate generation
The initial set of recommendations chosen by a recommendation system. For example, consider a bookstore that offers 100,000 titles. The candidate generation phase creates a much smaller list of suitable books for a particular user, say 500. But even 500 books is way too many to recommend to a user. Subsequent, more expensive, phases of a recommendation system (such as scoring and re-ranking) reduce those 500 to a much smaller, more useful set of recommendations. See Candidate generation overview in the Recommendation Systems course for more information.
59
candidate sampling
A training-time optimization that calculates a probability for all the positive labels, using, for example, softmax, but only for a random sample of negative labels. For instance, given an example labeled beagle and dog, candidate sampling computes the predicted probabilities and corresponding loss terms for:

beagle
dog
a random subset of the remaining negative classes (for example, cat, lollipop, fence).

The idea is that the negative classes can learn from less frequent negative reinforcement as long as positive classes always get proper positive reinforcement, and this is indeed observed empirically.

Candidate sampling is more computationally efficient than training algorithms that compute predictions for all negative classes, particularly when the number of negative classes is very large.
60
categorical data
Features having a specific set of possible values. For example, consider a categorical feature named traffic-light-state, which can only have one of the following three possible values:

red
yellow
green

By representing traffic-light-state as a categorical feature, a model can learn the differing impacts of red, green, and yellow on driver behavior.

Categorical features are sometimes called discrete features. Contrast with numerical data.

See Working with categorical data in Machine Learning Crash Course for more information.
61
causal language model
Synonym for unidirectional language model. See bidirectional language model to contrast different directional approaches in language modeling.
62
centroid
The center of a cluster as determined by a k-means or k-median algorithm. For example, if k is 3, then the k-means or k-median algorithm finds 3 centroids. See Clustering algorithms in the Clustering course for more information.
63
centroid-based clustering
A category of clustering algorithms that organizes data into nonhierarchical clusters. k-means is the most widely used centroid-based clustering algorithm. Contrast with hierarchical clustering algorithms. See Clustering algorithms in the Clustering course for more information.
64
chain-of-thought prompting
A prompt engineering technique that encourages a large language model (LLM) to explain its reasoning, step by step. For example, consider the following prompt, paying particular attention to the second sentence:

"How many g forces would a driver experience in a car that goes from 0 to 60 miles per hour in 7 seconds? In the answer, show all relevant calculations."

The LLM's response would likely:

Show a sequence of physics formulas, plugging in the values 0, 60, and 7 in appropriate places.
Explain why it chose those formulas and what the various variables mean.

Chain-of-thought prompting forces the LLM to perform all the calculations, which might lead to a more correct answer. In addition, chain-of-thought prompting enables the user to examine the LLM's steps to determine whether or not the answer makes sense.
65
chat
The contents of a back-and-forth dialogue with an ML system, typically a large language model. The previous interaction in a chat (what you typed and how the large language model responded) becomes the context for subsequent parts of the chat. A chatbot is an application of a large language model.
66
checkpoint
Data that captures the state of a model's parameters either during training or after training is completed. For example, during training, you can:

1. Stop training, perhaps intentionally or perhaps as the result of certain errors.
2. Capture the checkpoint.
3. Later, reload the checkpoint, possibly on different hardware.
4. Restart training.
67
class
A category that a label can belong to. For example:

In a binary classification model that detects spam, the two classes might be spam and not spam.
In a multi-class classification model that identifies dog breeds, the classes might be poodle, beagle, pug, and so on.

A classification model predicts a class. In contrast, a regression model predicts a number rather than a class.

See Classification in Machine Learning Crash Course for more information.
68
classification model
A model whose prediction is a class. For example, the following are all classification models:

A model that predicts an input sentence's language (French? Spanish? Italian?).
A model that predicts tree species (Maple? Oak? Baobab?).
A model that predicts the positive or negative class for a particular medical condition.

In contrast, regression models predict numbers rather than classes.

Two common types of classification models are:

binary classification
multi-class classification
69
classification threshold
In a binary classification, a number between 0 and 1 that converts the raw output of a logistic regression model into a prediction of either the positive class or the negative class. Note that the classification threshold is a value that a human chooses, not a value chosen by model training.

A logistic regression model outputs a raw value between 0 and 1. Then:

If this raw value is greater than the classification threshold, then the positive class is predicted.
If this raw value is less than the classification threshold, then the negative class is predicted.

For example, suppose the classification threshold is 0.8. If the raw value is 0.9, then the model predicts the positive class. If the raw value is 0.7, then the model predicts the negative class.

The choice of classification threshold strongly influences the number of false positives and false negatives.

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.
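
A minimal sketch of the thresholding rule, reproducing the 0.8/0.9/0.7 example above:

```python
def classify(raw_value, threshold=0.8):
    """Convert a logistic regression model's raw output into a class."""
    return "positive" if raw_value > threshold else "negative"

print(classify(0.9))  # positive
print(classify(0.7))  # negative
```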
70
classifier
A casual term for a classification model.
71
class-imbalanced dataset
A dataset for a classification problem in which the total number of labels of each class differs significantly. For example, consider a binary classification dataset whose two labels are divided as follows:

1,000,000 negative labels
10 positive labels

The ratio of negative to positive labels is 100,000 to 1, so this is a class-imbalanced dataset.

In contrast, the following dataset is not class-imbalanced because the ratio of negative labels to positive labels is relatively close to 1:

517 negative labels
483 positive labels

Multi-class datasets can also be class-imbalanced. For example, the following multi-class classification dataset is also class-imbalanced because one label has far more examples than the other two:

1,000,000 labels with class "green"
200 labels with class "purple"
350 labels with class "orange"

See also entropy, majority class, and minority class.
72
clipping
A technique for handling outliers by doing either or both of the following:

Reducing feature values that are greater than a maximum threshold down to that maximum threshold.
Increasing feature values that are less than a minimum threshold up to that minimum threshold.

For example, suppose that <0.5% of values for a particular feature fall outside the range 40–60. In this case, you could do the following:

Clip all values over 60 (the maximum threshold) to be exactly 60.
Clip all values under 40 (the minimum threshold) to be exactly 40.

Outliers can damage models, sometimes causing weights to overflow during training. Some outliers can also dramatically spoil metrics like accuracy. Clipping is a common technique to limit the damage.

Gradient clipping forces gradient values within a designated range during training.

See Numerical data: Normalization in Machine Learning Crash Course for more information.
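
With NumPy, clipping to the 40-60 range above is a one-liner; the sample values are made up.

```python
import numpy as np

values = np.array([38.0, 41.5, 55.0, 59.9, 63.2, 112.0])
# Clip to the 40-60 range from the example above.
print(np.clip(values, 40, 60))
# [40.  41.5  55.  59.9  60.  60.]
```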
73
Cloud TPU
A specialized hardware accelerator designed to speed up machine learning workloads on Google Cloud.
74
clustering
Grouping related examples, particularly during unsupervised learning. Once all the examples are grouped, a human can optionally supply meaning to each cluster.

Many clustering algorithms exist. For example, the k-means algorithm clusters examples based on their proximity to a centroid. A human researcher could then review the clusters and, for example, label cluster 1 as "dwarf trees" and cluster 2 as "full-size trees."

As another example, consider a clustering algorithm based on an example's distance from a center point:

[Figure: dozens of data points arranged in concentric circles, almost like holes around the center of a dart board. The innermost ring of data points is categorized as cluster 1, the middle ring as cluster 2, and the outermost ring as cluster 3.]

See the Clustering course for more information.
75
co-adaptation
When neurons predict patterns in training data by relying almost exclusively on outputs of specific other neurons instead of relying on the network's behavior as a whole. When the patterns that cause co-adaptation are not present in validation data, then co-adaptation causes overfitting. Dropout regularization reduces co-adaptation because dropout ensures neurons cannot rely solely on specific other neurons.
76
collaborative filtering
Making predictions about the interests of one user based on the interests of many other users. Collaborative filtering is often used in recommendation systems. See Collaborative filtering in the Recommendation Systems course for more information.
77
concept drift
A shift in the relationship between features and the label. Over time, concept drift reduces a model's quality.

During training, the model learns the relationship between the features and their labels in the training set. If the labels in the training set are good proxies for the real-world, then the model should make good real world predictions. However, due to concept drift, the model's predictions tend to degrade over time.

For example, consider a binary classification model that predicts whether or not a certain car model is "fuel efficient." That is, the features could be:

car weight
engine compression
transmission type

while the label is either:

fuel efficient
not fuel efficient

However, the concept of "fuel efficient car" keeps changing. A car model labeled fuel efficient in 1994 would almost certainly be labeled not fuel efficient in 2024. A model suffering from concept drift tends to make less and less useful predictions over time.

Compare and contrast with nonstationarity.
78
condition
In a decision tree, any node that evaluates an expression. For example, the following portion of a decision tree contains two conditions:

[Figure: a decision tree consisting of two conditions: (x > 0) and (y > 0).]

A condition is also called a split or a test.

Contrast condition with leaf.

See also:

binary condition
non-binary condition
axis-aligned condition
oblique condition

See Types of conditions in the Decision Forests course for more information.
79
confabulation
Synonym for hallucination. Confabulation is probably a more technically accurate term than hallucination. However, hallucination became popular first.
80
configuration
The process of assigning the initial property values used to train a model, including:

the model's composing layers
the location of the data
hyperparameters such as: learning rate, iterations, optimizer, and loss function

In machine learning projects, configuration can be done through a special configuration file or using configuration libraries such as the following:

HParam
Gin
Fiddle
81
confirmation bias
The tendency to search for, interpret, favor, and recall information in a way that confirms one's pre-existing beliefs or hypotheses. Machine learning developers may inadvertently collect or label data in ways that influence an outcome supporting their existing beliefs. Confirmation bias is a form of implicit bias. Experimenter's bias is a form of confirmation bias in which an experimenter continues training models until a pre-existing hypothesis is confirmed.
82
confusion matrix
An NxN table that summarizes the number of correct and incorrect predictions that a classification model made. For example, in a binary classification problem, the rows and columns tabulate true positives, false negatives, false positives, and true negatives:

                   Predicted positive     Predicted negative
Actual positive    true positives (TP)    false negatives (FN)
Actual negative    false positives (FP)   true negatives (TN)

The confusion matrix for a multi-class classification problem can help you identify patterns of mistakes. For example, consider a confusion matrix for a 3-class multi-class classification model that categorizes three different iris types (Virginica, Versicolor, and Setosa). When the ground truth was Virginica, the confusion matrix might show that the model was far more likely to mistakenly predict Versicolor than Setosa.

As yet another example, a confusion matrix could reveal that a model trained to recognize handwritten digits tends to mistakenly predict 9 instead of 4, or mistakenly predict 1 instead of 7.

Confusion matrixes contain sufficient information to calculate a variety of performance metrics, including precision and recall.
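
A minimal sketch that tallies a confusion matrix from parallel lists of actual and predicted labels; the toy data is made up, and libraries such as scikit-learn provide a ready-made confusion_matrix function too.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred):
    """Count (actual, predicted) pairs; works for any number of classes."""
    return Counter(zip(y_true, y_pred))

y_true = ["spam", "spam", "not spam", "not spam", "spam"]
y_pred = ["spam", "not spam", "not spam", "not spam", "spam"]
for (actual, predicted), count in sorted(confusion_matrix(y_true, y_pred).items()):
    print(f"actual={actual:9} predicted={predicted:9} count={count}")
```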
83
constituency parsing
Dividing a sentence into smaller grammatical structures ("constituents"). A later part of the ML system, such as a natural language understanding model, can parse the constituents more easily than the original sentence. For example, consider the following sentence:

My friend adopted two cats.

A constituency parser can divide this sentence into the following two constituents:

My friend is a noun phrase.
adopted two cats is a verb phrase.

These constituents can be further subdivided into smaller constituents. For example, the verb phrase adopted two cats could be further subdivided into:

adopted is a verb.
two cats is another noun phrase.
84
contextualized language embedding
An embedding that comes close to "understanding" words and phrases in ways that fluent human speakers can. Contextualized language embeddings can understand complex syntax, semantics, and context. For example, consider embeddings of the English word cow. Older embeddings such as word2vec can represent English words such that the distance in the embedding space from cow to bull is similar to the distance from ewe (female sheep) to ram (male sheep) or from female to male. Contextualized language embeddings can go a step further by recognizing that English speakers sometimes casually use the word cow to mean either cow or bull.
85
context window
The number of tokens a model can process in a given prompt. The larger the context window, the more information the model can use to provide coherent and consistent responses to the prompt.
86
continuous feature
A floating-point feature with an infinite range of possible values, such as temperature or weight. Contrast with discrete feature.
87
convenience sampling
Using a dataset not gathered scientifically in order to run quick experiments. Later on, it's essential to switch to a scientifically gathered dataset.
88
convergence
A state reached when loss values change very little or not at all with each iteration. For example, the following loss curve suggests convergence at around 700 iterations:

[Figure: a loss curve that descends steeply, then flattens out at around 700 iterations.]

A model converges when additional training won't improve the model.

In deep learning, loss values sometimes stay constant or nearly so for many iterations before finally descending. During a long period of constant loss values, you may temporarily get a false sense of convergence.

See also early stopping.

See Model convergence and loss curves in Machine Learning Crash Course for more information.
89
convex function
A function in which the region above the graph of the function is a convex set. The prototypical convex function is shaped something like the letter U. For example, the following are all convex functions:

[Figure: U-shaped curves, each with a single minimum point.]

In contrast, the following function is not convex. Notice how the region above the graph is not a convex set:

[Figure: a W-shaped curve with two different local minimum points.]

A strictly convex function has exactly one local minimum point, which is also the global minimum point. The classic U-shaped functions are strictly convex functions. However, some convex functions (for example, straight lines) are not U-shaped.
90
convex optimization
The process of using mathematical techniques such as gradient descent to find the minimum of a convex function. A great deal of research in machine learning has focused on formulating various problems as convex optimization problems and in solving those problems more efficiently. For complete details, see Boyd and Vandenberghe, Convex Optimization.
91
convex set
A subset of Euclidean space such that a line drawn between any two points in the subset remains completely within the subset. For instance, the following two shapes are convex sets:

[Figure: one illustration of a rectangle and another of an oval.]

In contrast, the following two shapes are not convex sets:

[Figure: one illustration of a pie-chart with a missing slice and another of a wildly irregular polygon.]
92
convolution
In mathematics, casually speaking, a mixture of two functions. In machine learning, a convolution mixes the convolutional filter and the input matrix in order to train weights. The term "convolution" in machine learning is often a shorthand way of referring to either convolutional operation or convolutional layer. Without convolutions, a machine learning algorithm would have to learn a separate weight for every cell in a large tensor. For example, a machine learning algorithm training on 2K x 2K images would be forced to find 4M separate weights. Thanks to convolutions, a machine learning algorithm only has to find weights for every cell in the convolutional filter, dramatically reducing the memory needed to train the model. When the convolutional filter is applied, it is simply replicated across cells such that each is multiplied by the filter. See Introducing Convolutional Neural Networks in the Image Classification course for more information.
93
convolutional filter
One of the two actors in a convolutional operation. (The other actor is a slice of an input matrix.) A convolutional filter is a matrix having the same rank as the input matrix, but a smaller shape. For example, given a 28x28 input matrix, the filter could be any 2D matrix smaller than 28x28. In photographic manipulation, all the cells in a convolutional filter are typically set to a constant pattern of ones and zeroes. In machine learning, convolutional filters are typically seeded with random numbers and then the network trains the ideal values. See Convolution in the Image Classification course for more information.
94
convolutional layer
A layer of a deep neural network in which a convolutional filter passes along an input matrix. For example, consider the following 3x3 convolutional filter:

[[0, 1, 0],
 [1, 0, 1],
 [0, 1, 0]]

The following animation shows a convolutional layer consisting of 9 convolutional operations involving the 5x5 input matrix. Notice that each convolutional operation works on a different 3x3 slice of the input matrix. The resulting 3x3 matrix (on the right) consists of the results of the 9 convolutional operations:

[Figure: an animation showing two matrixes. The first matrix is the 5x5 matrix [[128, 97, 53, 201, 198], [35, 22, 25, 200, 195], [37, 24, 28, 197, 182], [33, 28, 92, 195, 179], [31, 40, 100, 192, 177]]. The second matrix is the 3x3 matrix [[181, 303, 618], [115, 338, 605], [169, 351, 560]], calculated by applying the convolutional filter [[0, 1, 0], [1, 0, 1], [0, 1, 0]] across different 3x3 subsets of the 5x5 matrix.]

See Fully Connected Layers in the Image Classification course for more information.
95
convolutional neural network
A neural network in which at least one layer is a convolutional layer. A typical convolutional neural network consists of some combination of the following layers:

convolutional layers
pooling layers
dense layers

Convolutional neural networks have had great success in certain kinds of problems, such as image recognition.
96
convolutional operation
The following two-step mathematical operation:

1. Element-wise multiplication of the convolutional filter and a slice of an input matrix. (The slice of the input matrix has the same rank and size as the convolutional filter.)
2. Summation of all the values in the resulting product matrix.

For example, consider the following 5x5 input matrix:

[[128,  97,  53, 201, 198],
 [ 35,  22,  25, 200, 195],
 [ 37,  24,  28, 197, 182],
 [ 33,  28,  92, 195, 179],
 [ 31,  40, 100, 192, 177]]

Now imagine the following 2x2 convolutional filter:

[[1, 0],
 [0, 1]]

Each convolutional operation involves a single 2x2 slice of the input matrix. For example, suppose we use the 2x2 slice at the top-left of the input matrix, [[128, 97], [35, 22]]. The convolutional filter leaves the 128 and 22 intact, but zeroes out the 97 and 35. Consequently, the convolution operation yields the value 150 (128 + 22).

A convolutional layer consists of a series of convolutional operations, each acting on a different slice of the input matrix.
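
A minimal NumPy sketch reproducing the top-left operation above; it prints 150, matching the worked example.

```python
import numpy as np

input_matrix = np.array([[128,  97,  53, 201, 198],
                         [ 35,  22,  25, 200, 195],
                         [ 37,  24,  28, 197, 182],
                         [ 33,  28,  92, 195, 179],
                         [ 31,  40, 100, 192, 177]])
kernel = np.array([[1, 0],
                   [0, 1]])

# One convolutional operation on the top-left 2x2 slice:
# element-wise multiply, then sum.
top_left = input_matrix[0:2, 0:2]
print((top_left * kernel).sum())  # 128 + 22 = 150
```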
97
cost
Synonym for loss.
98
co-training
A semi-supervised learning approach particularly useful when all of the following conditions are true:

The ratio of unlabeled examples to labeled examples in the dataset is high.
This is a classification problem (binary or multi-class).
The dataset contains two different sets of predictive features that are independent of each other and complementary.

Co-training essentially amplifies independent signals into a stronger signal. For example, consider a classification model that categorizes individual used cars as either Good or Bad. One set of predictive features might focus on aggregate characteristics such as the year, make, and model of the car; another set of predictive features might focus on the previous owner's driving record and the car's maintenance history.

The seminal paper on co-training is Combining Labeled and Unlabeled Data with Co-Training by Blum and Mitchell.
99
counterfactual fairness
A fairness metric that checks whether a classification model produces the same result for one individual as it does for another individual who is identical to the first, except with respect to one or more sensitive attributes. Evaluating a classification model for counterfactual fairness is one method for surfacing potential sources of bias in a model.

See either of the following for more information:

Fairness: Counterfactual fairness in Machine Learning Crash Course.
When Worlds Collide: Integrating Different Counterfactual Assumptions in Fairness.
100
coverage bias
Errors in conclusions drawn from sampled data due to a selection process that generates systematic differences between samples observed in the data and those not observed. The following forms of selection bias exist:

coverage bias: The population represented in the dataset doesn't match the population that the machine learning model is making predictions about.
sampling bias: Data is not collected randomly from the target group.
non-response bias (also called participation bias): Users from certain groups opt-out of surveys at different rates than users from other groups.

For example, suppose you are creating a machine learning model that predicts people's enjoyment of a movie. To collect training data, you hand out a survey to everyone in the front row of a theater showing the movie. Offhand, this may sound like a reasonable way to gather a dataset; however, this form of data collection may introduce the following forms of selection bias:

coverage bias: By sampling from a population who chose to see the movie, your model's predictions may not generalize to people who did not already express that level of interest in the movie.
sampling bias: Rather than randomly sampling from the intended population (all the people at the movie), you sampled only the people in the front row. It is possible that the people sitting in the front row were more interested in the movie than those in other rows.
non-response bias: In general, people with strong opinions tend to respond to optional surveys more frequently than people with mild opinions. Since the movie survey is optional, the responses are more likely to form a bimodal distribution than a normal (bell-shaped) distribution.
101
crash blossom
A sentence or phrase with an ambiguous meaning. Crash blossoms present a significant problem in natural language understanding. For example, the headline Red Tape Holds Up Skyscraper is a crash blossom because an NLU model could interpret the headline literally or figuratively.
102
critic
Synonym for Deep Q-Network.
103
cross-entropy
A generalization of Log Loss to multi-class classification problems. Cross-entropy quantifies the difference between two probability distributions. See also perplexity.
104
cross-validation
A mechanism for estimating how well a model would generalize to new data by testing the model against one or more non-overlapping data subsets withheld from the training set.
105
cumulative distribution function (CDF)
A function that defines the frequency of samples less than or equal to a target value. For example, consider a normal distribution of continuous values. A CDF tells you that approximately 50% of samples should be less than or equal to the mean and that approximately 84% of samples should be less than or equal to one standard deviation above the mean.
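
Python's statistics.NormalDist exposes a CDF directly; a minimal sketch reproducing the two figures above:

```python
from statistics import NormalDist

dist = NormalDist(mu=0, sigma=1)
print(dist.cdf(0))  # 0.5: half the samples fall at or below the mean
print(dist.cdf(1))  # ~0.8413: at or below one standard deviation above the mean
```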
106
data analysis
Obtaining an understanding of data by considering samples, measurement, and visualization. Data analysis can be particularly useful when a dataset is first received, before one builds the first model. It is also crucial in understanding experiments and debugging problems with the system.
107
data augmentation
Artificially boosting the range and number of training examples by transforming existing examples to create additional examples. For example, suppose images are one of your features, but your dataset doesn't contain enough image examples for the model to learn useful associations. Ideally, you'd add enough labeled images to your dataset to enable your model to train properly. If that's not possible, data augmentation can rotate, stretch, and reflect each image to produce many variants of the original picture, possibly yielding enough labeled data to enable excellent training.
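A minimal sketch with plain NumPy, treating an image as an HxWxC array. Reflection, rotation, and mild noise stand in for the full set of transforms (stretching and so on) mentioned above:

    import numpy as np

    def augment(image):
        # Produce several variants of one labeled image; each variant keeps
        # the original label, multiplying the effective training data.
        return [image,
                np.fliplr(image),    # horizontal reflection
                np.rot90(image),     # 90-degree rotation
                np.clip(image + np.random.normal(0, 5, image.shape), 0, 255)]  # mild noise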
108
DataFrame
A popular pandas data type for representing datasets in memory. A DataFrame is analogous to a table or a spreadsheet. Each column of a DataFrame has a name (a header), and each row is identified by a unique number. A DataFrame is structured like a 2D array, except that each column can be assigned its own data type. See also the official pandas.DataFrame reference page.
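For example (the column names are invented for illustration):

    import pandas as pd

    df = pd.DataFrame({
        "species": ["baobab", "sequoia", "redwood"],   # string column
        "height_m": [18.5, 85.0, 96.0],                # float column
    })
    print(df.dtypes)   # each column has its own data type
    print(df.loc[1])   # each row is identified by a unique index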
109
data parallelism
A way of scaling training or inference that replicates an entire model onto multiple devices and then passes a subset of the input data to each device. Data parallelism can enable training and inference on very large batch sizes; however, data parallelism requires that the model be small enough to fit on all devices. Data parallelism typically speeds training and inference. See also model parallelism.
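A conceptual sketch in plain NumPy, with Python lists standing in for devices (real frameworks, such as TensorFlow's tf.distribute, handle the replication and gradient aggregation):

    import numpy as np

    weights = np.random.rand(4, 1)     # the full model, replicated to each "device"
    batch = np.random.rand(8, 4)
    shards = np.split(batch, 2)        # each device receives a subset of the input

    # On real hardware these per-shard computations run in parallel, one per device.
    outputs = [shard @ weights for shard in shards]
    predictions = np.concatenate(outputs)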
110
Dataset API (tf.data)
A high-level TensorFlow API for reading data and transforming it into a form that a machine learning algorithm requires. A tf.data.Dataset object represents a sequence of elements, in which each element contains one or more Tensors. A tf.data.Iterator object provides access to the elements of a Dataset.
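A minimal sketch in TensorFlow 2 style, where iterating the Dataset directly takes the place of an explicit tf.data.Iterator:

    import tensorflow as tf

    features = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    labels = tf.constant([0, 1, 0])

    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    dataset = dataset.shuffle(buffer_size=3).batch(2)

    for batch_features, batch_labels in dataset:   # each element holds Tensors
        print(batch_features.shape, batch_labels.shape)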
111
data set or dataset
A collection of raw data, commonly (but not exclusively) organized in one of the following formats:
- a spreadsheet
- a file in CSV (comma-separated values) format
112
decision boundary
The separator between classes learned by a model in binary-class or multi-class classification problems. For example, in a scatter plot of a binary classification problem with one orange class and one blue class, the decision boundary is the frontier between the orange region and the blue region.
113
decision forest
A model created from multiple decision trees. A decision forest makes a prediction by aggregating the predictions of its decision trees. Popular types of decision forests include random forests and gradient boosted trees. See the Decision Forests section in the Decision Forests course for more information.
114
decision threshold
In binary classification, a number between 0 and 1 that converts the raw output of a logistic regression model into a prediction of either the positive class or the negative class. Note that the classification threshold is a value that a human chooses, not a value chosen by model training. A logistic regression model outputs a raw value between 0 and 1. Then:
- If this raw value is greater than the classification threshold, then the positive class is predicted.
- If this raw value is less than the classification threshold, then the negative class is predicted.

For example, suppose the classification threshold is 0.8. If the raw value is 0.9, then the model predicts the positive class. If the raw value is 0.7, then the model predicts the negative class. The choice of classification threshold strongly influences the number of false positives and false negatives.
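The worked example above in code form:

    import numpy as np

    raw = np.array([0.9, 0.7, 0.85, 0.3])   # raw logistic regression outputs
    threshold = 0.8                          # chosen by a human, not by training
    predictions = (raw > threshold).astype(int)
    print(predictions)   # [1 0 1 0]: positive, negative, positive, negative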
115
decision tree
A supervised learning model composed of a set of conditions and leaves organized hierarchically. For example, the sketch after this entry expresses a small decision tree as code.
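The original card displayed a tree diagram; as a stand-in, here is a hypothetical tree (the animal features are invented for illustration) written as nested conditions, where each if/else is a condition and each returned value is a leaf:

    def predict_animal(num_legs, num_eyes):
        if num_legs >= 3:
            if num_eyes >= 3:
                return "spider"
            return "dog"
        if num_legs == 2:
            return "penguin"
        return "snake"

    print(predict_animal(num_legs=4, num_eyes=2))   # dog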
116
decoder
In general, any ML system that converts from a processed, dense, or internal representation to a more raw, sparse, or external representation. Decoders are often a component of a larger model, where they are frequently paired with an encoder. In sequence-to-sequence tasks, a decoder starts with the internal state generated by the encoder to predict the next sequence. Refer to Transformer for the definition of a decoder within the Transformer architecture. See Large language models in Machine Learning Crash Course for more information.
117
deep model
A neural network containing more than one hidden layer. A deep model is also called a deep neural network. Contrast with wide model.
118
deep neural network
A neural network containing more than one hidden layer. A deep model is also called a deep neural network. Contrast with wide model.
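For example, a minimal Keras sketch of a deep model (layer sizes are arbitrary):

    import tensorflow as tf

    # Two hidden layers make this a deep model.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),    # hidden layer 1
        tf.keras.layers.Dense(32, activation="relu"),    # hidden layer 2
        tf.keras.layers.Dense(1, activation="sigmoid"),  # output layer
    ])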
119
Deep Q-Network (DQN)
In Q-learning, a deep neural network that predicts Q-functions. Critic is a synonym for Deep Q-Network.
120
demographic parity
A fairness metric that is satisfied if the results of a model's classification are not dependent on a given sensitive attribute. For example, if both Lilliputians and Brobdingnagians apply to Glubbdubdrib University, demographic parity is achieved if the percentage of Lilliputians admitted is the same as the percentage of Brobdingnagians admitted, irrespective of whether one group is on average more qualified than the other. Contrast with equalized odds and equality of opportunity, which permit classification results in aggregate to depend on sensitive attributes, but don't permit classification results for certain specified ground truth labels to depend on sensitive attributes. See "Attacking discrimination with smarter machine learning" for a visualization exploring the tradeoffs when optimizing for demographic parity. See Fairness: demographic parity in Machine Learning Crash Course for more information.
121
denoising
A common approach to self-supervised learning in which:
1. Noise is artificially added to the dataset.
2. The model tries to remove the noise.

Denoising enables learning from unlabeled examples. The original dataset serves as the target or label and the noisy data as the input. Some masked language models use denoising as follows:
1. Noise is artificially added to an unlabeled sentence by masking some of the tokens.
2. The model tries to predict the original tokens.
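A minimal sketch of the masking step for a masked language model (the mask rate and mask token are illustrative assumptions):

    import random

    def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
        # Adds "noise" by masking tokens; the original tokens become the labels.
        inputs, labels = [], []
        for tok in tokens:
            if random.random() < mask_rate:
                inputs.append(mask_token)
                labels.append(tok)     # the model must reconstruct this token
            else:
                inputs.append(tok)
                labels.append(None)    # no loss computed at this position
        return inputs, labels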
122
dense feature
A feature in which most or all values are nonzero, typically a Tensor of floating-point values. For example, the following 10-element Tensor is dense because 9 of its values are nonzero: [8, 3, 7, 5, 2, 4, 0, 4, 9, 6]. Contrast with sparse feature.
123
dense layer
A hidden layer in which each node is connected to every node in the subsequent hidden layer. A fully connected layer is also known as a dense layer.
124
depth
The sum of the following in a neural network:
- the number of hidden layers
- the number of output layers, which is typically 1
- the number of any embedding layers

For example, a neural network with five hidden layers and one output layer has a depth of 6. Notice that the input layer doesn't influence depth.
125
depthwise separable convolutional neural network (sepCNN)
A convolutional neural network architecture based on Inception, but where Inception modules are replaced with depthwise separable convolutions. Also known as Xception. A depthwise separable convolution (also abbreviated as separable convolution) factors a standard 3D convolution into two separate convolution operations that are more computationally efficient: first, a depthwise convolution, with a depth of 1 (n ✕ n ✕ 1), and then second, a pointwise convolution, with length and width of 1 (1 ✕ 1 ✕ n). To learn more, see Xception: Deep Learning with Depthwise Separable Convolutions.
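A sketch assuming Keras, whose SeparableConv2D layer performs exactly this factoring (depthwise convolution followed by the 1 ✕ 1 pointwise convolution):

    import tensorflow as tf

    layer = tf.keras.layers.SeparableConv2D(filters=64, kernel_size=3,
                                            activation="relu")
    out = layer(tf.random.normal([1, 32, 32, 3]))   # shape (1, 30, 30, 64)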
126
derived label
Synonym for proxy label. Data used to approximate labels not directly available in a dataset.

For example, suppose you must train a model to predict employee stress level. Your dataset contains a lot of predictive features but doesn't contain a label named stress level. Undaunted, you pick "workplace accidents" as a proxy label for stress level. After all, employees under high stress get into more accidents than calm employees. Or do they? Maybe workplace accidents actually rise and fall for multiple reasons.

As a second example, suppose you want is it raining? to be a Boolean label for your dataset, but your dataset doesn't contain rain data. If photographs are available, you might establish pictures of people carrying umbrellas as a proxy label for is it raining? Is that a good proxy label? Possibly, but people in some cultures may be more likely to carry umbrellas to protect against sun rather than rain.

Proxy labels are often imperfect. When possible, choose actual labels over proxy labels. That said, when an actual label is absent, pick the proxy label very carefully, choosing the least horrible proxy label candidate.

See Datasets: Labels in Machine Learning Crash Course for more information.
127
device
An overloaded term with the following two possible definitions: A category of hardware that can run a TensorFlow session, including CPUs, GPUs, and TPUs. When training an ML model on accelerator chips (GPUs or TPUs), the part of the system that actually manipulates tensors and embeddings. The device runs on accelerator chips. In contrast, the host typically runs on a CPU.
128
differential privacy
In machine learning, an anonymization approach to protect any sensitive data (for example, an individual's personal information) included in a model's training set from being exposed. This approach ensures that the model doesn't learn or remember much about a specific individual. This is accomplished by sampling and adding noise during model training to obscure individual data points, mitigating the risk of exposing sensitive training data. Differential privacy is also used outside of machine learning. For example, data scientists sometimes use differential privacy to protect individual privacy when computing product usage statistics for different demographics.
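In ML practice this usually means noisy training (for example, DP-SGD). The sketch below shows only the core idea of noise calibrated to a query's sensitivity, applied to a simple count rather than to model training; the function name and parameters are illustrative:

    import numpy as np

    def private_count(records, epsilon=1.0):
        # Laplace mechanism: a count has sensitivity 1 (one person changes the
        # count by at most 1), so noise with scale 1/epsilon yields
        # epsilon-differential privacy for this single query.
        noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
        return len(records) + noise

    print(private_count(["alice", "bob", "carol"], epsilon=0.5))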
129
dimension reduction
Decreasing the number of dimensions used to represent a particular feature in a feature vector, typically by converting to an embedding vector.
130
dimensions
Overloaded term having any of the following definitions:
- The number of levels of coordinates in a Tensor. For example:
  - A scalar has zero dimensions; for example, 7 or "Hello".
  - A vector has one dimension; for example, [3, 5, 7, 11].
  - A matrix has two dimensions; for example, [[2, 4, 18], [5, 7, 14]].
  You can uniquely specify a particular cell in a one-dimensional vector with one coordinate; you need two coordinates to uniquely specify a particular cell in a two-dimensional matrix.
- The number of entries in a feature vector.
- The number of elements in an embedding layer.
131
direct prompting
Synonym for zero-shot prompting. A prompt that does not provide an example of how you want the large language model to respond. For example, the following prompt has two parts:
- "What is the official currency of the specified country?" (the question you want the LLM to answer)
- "India:" (the actual query)
132
discrete feature
A feature with a finite set of possible values. For example, a feature whose values may only be animal, vegetable, or mineral is a discrete (or categorical) feature. Contrast with continuous feature.
133
discriminative model
A model that predicts labels from a set of one or more features. More formally, discriminative models define the conditional probability of an output given the features and weights; that is: p(output | features, weights) For example, a model that predicts whether an email is spam from features and weights is a discriminative model. The vast majority of supervised learning models, including classification and regression models, are discriminative models. Contrast with generative model.
134
discriminator
A system that determines whether examples are real or fake. Alternatively, the subsystem within a generative adversarial network that determines whether the examples created by the generator are real or fake. See The discriminator in the GAN course for more information.
135
disparate impact
Making decisions about people that impact different population subgroups disproportionately. This usually refers to situations where an algorithmic decision-making process harms or benefits some subgroups more than others. For example, suppose an algorithm that determines a Lilliputian's eligibility for a miniature-home loan is more likely to classify them as "ineligible" if their mailing address contains a certain postal code. If Big-Endian Lilliputians are more likely to have mailing addresses with this postal code than Little-Endian Lilliputians, then this algorithm may result in disparate impact. Contrast with disparate treatment, which focuses on disparities that result when subgroup characteristics are explicit inputs to an algorithmic decision-making process.
136
disparate treatment
Factoring subjects' sensitive attributes into an algorithmic decision-making process such that different subgroups of people are treated differently. For example, consider an algorithm that determines Lilliputians' eligibility for a miniature-home loan based on the data they provide in their loan application. If the algorithm uses a Lilliputian's affiliation as Big-Endian or Little-Endian as an input, it is enacting disparate treatment along that dimension. Contrast with disparate impact, which focuses on disparities in the societal impacts of algorithmic decisions on subgroups, irrespective of whether those subgroups are inputs to the model.
137
distillation
The process of reducing the size of one model (known as the teacher) into a smaller model (known as the student) that emulates the original model's predictions as faithfully as possible. Distillation is useful because the smaller model has two key benefits over the larger model (the teacher):
- Faster inference time
- Reduced memory and energy usage

However, the student's predictions are typically not as good as the teacher's predictions. Distillation trains the student model to minimize a loss function based on the difference between the predictions of the student and teacher models.

Compare and contrast distillation with the following terms:
- fine-tuning
- prompt-based learning

See LLMs: Fine-tuning, distillation, and prompt engineering in Machine Learning Crash Course for more information.
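A sketch of one common form of the distillation loss (temperature-scaled soft targets, following Hinton et al.; the temperature term is an assumption beyond this card):

    import tensorflow as tf

    def distillation_loss(teacher_logits, student_logits, temperature=2.0):
        # Soften both distributions with a temperature, then penalize the
        # student for diverging from the teacher (cross-entropy form).
        t = tf.nn.softmax(teacher_logits / temperature)
        log_s = tf.nn.log_softmax(student_logits / temperature)
        return -tf.reduce_mean(tf.reduce_sum(t * log_s, axis=-1)) * temperature ** 2

    teacher = tf.constant([[2.0, 0.5, 0.1]])
    student = tf.constant([[1.5, 0.8, 0.2]])
    print(distillation_loss(teacher, student))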
138
distribution
The frequency and range of different values for a given feature or label. A distribution captures how likely a particular value is. For example, compare histograms of two different distributions: a power law distribution of wealth versus the number of people possessing that wealth, and a normal distribution of height versus the number of people possessing that height.
139
divisive clustering
See hierarchical clustering. A category of clustering algorithms that create a tree of clusters. Hierarchical clustering is well-suited to hierarchical data, such as botanical taxonomies. There are two types of hierarchical clustering algorithms:
- Agglomerative clustering first assigns every example to its own cluster, and iteratively merges the closest clusters to create a hierarchical tree.
- Divisive clustering first groups all examples into one cluster and then iteratively divides the cluster into a hierarchical tree.

Contrast with centroid-based clustering. See Clustering algorithms in the Clustering course for more information.
140
downsampling
Overloaded term that can mean either of the following:
- Reducing the amount of information in a feature in order to train a model more efficiently. For example, before training an image recognition model, downsampling high-resolution images to a lower-resolution format.
- Training on a disproportionately low percentage of over-represented class examples in order to improve model training on under-represented classes. For example, in a class-imbalanced dataset, models tend to learn a lot about the majority class and not enough about the minority class. Downsampling helps balance the amount of training on the majority and minority classes.

See Datasets: Imbalanced datasets in Machine Learning Crash Course for more information.
141
DQN
Abbreviation for Deep Q-Network: in Q-learning, a deep neural network that predicts Q-functions. (In reinforcement learning, a Q-function predicts the expected return from taking an action in a state and then following a given policy; the Q-function is also known as the state-action value function.) Critic is a synonym for Deep Q-Network.
142
dropout regularization
A form of regularization useful in training neural networks. Dropout regularization removes a random selection of a fixed number of the units in a network layer for a single gradient step. The more units dropped out, the stronger the regularization. This is analogous to training the network to emulate an exponentially large ensemble of smaller networks. For full details, see Dropout: A Simple Way to Prevent Neural Networks from Overfitting.
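A minimal sketch of one gradient step's dropout mask, using the common "inverted dropout" scaling (frameworks provide this as a built-in layer):

    import numpy as np

    def dropout(activations, rate=0.5):
        # Drop a random selection of units for this step; scale the survivors
        # so the expected value of the layer's output is unchanged.
        mask = np.random.rand(*activations.shape) >= rate
        return activations * mask / (1.0 - rate)

    layer_output = np.ones((2, 4))
    print(dropout(layer_output, rate=0.5))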
143
dynamic
Something done frequently or continuously. The terms dynamic and online are synonyms in machine learning. The following are common uses of dynamic and online in machine learning: - A dynamic model (or online model) is a model that is retrained frequently or continuously. - Dynamic training (or online training) is the process of training frequently or continuously. - Dynamic inference (or online inference) is the process of generating predictions on demand.
144
dynamic model
A model that is frequently (maybe even continuously) retrained. A dynamic model is a "lifelong learner" that constantly adapts to evolving data. A dynamic model is also known as an online model. Contrast with static model.
145
eager execution
A TensorFlow programming environment in which operations run immediately. In contrast, operations called in graph execution don't run until they are explicitly evaluated. Eager execution is an imperative interface, much like the code in most programming languages. Eager execution programs are generally far easier to debug than graph execution programs.
146
early stopping
A method for regularization that involves ending training before training loss finishes decreasing. In early stopping, you intentionally stop training the model when the loss on a validation dataset starts to increase; that is, when generalization performance worsens.
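For example, with the Keras EarlyStopping callback (the patience value is an illustrative choice):

    import tensorflow as tf

    callback = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss",          # watch loss on the validation set
        patience=3,                  # tolerate 3 epochs without improvement
        restore_best_weights=True)
    # model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[callback])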
147
earth mover's distance (EMD)
A measure of the relative similarity of two distributions. The lower the earth mover's distance, the more similar the distributions.
148
edit distance
A measurement of how similar two text strings are to each other. In machine learning, edit distance is useful for the following reasons:
- Edit distance is easy to compute.
- Edit distance can compare two strings known to be similar to each other.
- Edit distance can determine the degree to which different strings are similar to a given string.

There are several definitions of edit distance, each using different string operations. See Levenshtein distance for an example. A sketch of Levenshtein distance follows this entry.
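Levenshtein distance (insertions, deletions, and substitutions) via dynamic programming:

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("kitten", "sitting"))   # 3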
149
Einsum notation
An efficient notation for describing how two tensors are to be combined. The tensors are combined by multiplying the elements of one tensor by the elements of the other tensor and then summing the products. Einsum notation uses symbols to identify the axes of each tensor, and those same symbols are rearranged to specify the shape of the new resulting tensor. NumPy provides a common Einsum implementation.
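For example, with NumPy's implementation:

    import numpy as np

    A = np.arange(6).reshape(2, 3)
    B = np.arange(12).reshape(3, 4)

    # "ij,jk->ik": multiply matching elements along the shared axis j, then
    # sum over j; the output symbols "ik" give the resulting shape (2, 4).
    C = np.einsum("ij,jk->ik", A, B)
    assert np.allclose(C, A @ B)             # same as matrix multiplication

    trace = np.einsum("ii->", np.eye(3))     # a repeated symbol sums the diagonal: 3.0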
150
embedding layer
A special hidden layer that trains on a high-dimensional categorical feature to gradually learn a lower dimension embedding vector. An embedding layer enables a neural network to train far more efficiently than training just on the high-dimensional categorical feature.

For example, Earth currently supports about 73,000 tree species. Suppose tree species is a feature in your model, so your model's input layer includes a one-hot vector 73,000 elements long. Perhaps baobab would be represented as an array of 73,000 elements in which the first 6,232 elements hold the value 0, the next element holds the value 1, and the final 66,767 elements hold the value 0.

A 73,000-element array is very long. If you don't add an embedding layer to the model, training is going to be very time consuming due to multiplying 72,999 zeros. Perhaps you pick the embedding layer to consist of 12 dimensions. Consequently, the embedding layer will gradually learn a new embedding vector for each tree species.

In certain situations, hashing is a reasonable alternative to an embedding layer.

See Embeddings in Machine Learning Crash Course for more information.
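A minimal Keras sketch of that example (the integer ID for baobab is hypothetical); the layer looks up a learned vector directly, avoiding the 73,000-element one-hot multiplication:

    import tensorflow as tf

    # Maps integer species IDs in [0, 73000) to learned 12-dimensional vectors.
    embedding = tf.keras.layers.Embedding(input_dim=73000, output_dim=12)
    vector = embedding(tf.constant([6232]))   # 6232: a hypothetical baobab ID
    print(vector.shape)                        # (1, 12)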
151
embedding space
The d-dimensional vector space that features from a higher-dimensional vector space are mapped to. Embedding space is trained to capture structure that is meaningful for the intended application. The dot product of two embeddings is a measure of their similarity.
152
embedding vector
Broadly speaking, an array of floating-point numbers taken from any hidden layer that describe the inputs to that hidden layer. Often, an embedding vector is the array of floating-point numbers trained in an embedding layer. For example, suppose an embedding layer must learn an embedding vector for each of the 73,000 tree species on Earth; perhaps the embedding vector for a baobab tree is an array of 12 elements, each holding a floating-point number between 0.0 and 1.0.

An embedding vector is not a bunch of random numbers. An embedding layer determines these values through training, similar to the way a neural network learns other weights during training. Each element of the array is a rating along some characteristic of a tree species. Which element represents which tree species' characteristic? That's very hard for humans to determine.

The mathematically remarkable part of an embedding vector is that similar items have similar sets of floating-point numbers. For example, similar tree species have a more similar set of floating-point numbers than dissimilar tree species. Redwoods and sequoias are related tree species, so they'll have a more similar set of floating-point numbers than redwoods and coconut palms. The numbers in the embedding vector will change each time you retrain the model, even if you retrain the model with identical input.
153
empirical cumulative distribution function (eCDF or EDF)
A cumulative distribution function based on empirical measurements from a real dataset. The value of the function at any point along the x-axis is the fraction of observations in the dataset that are less than or equal to the specified value.
154
empirical risk minimization (ERM)
Choosing the function that minimizes loss on the training set. Contrast with structural risk minimization.
155
encoder
In general, any ML system that converts from a raw, sparse, or external representation into a more processed, denser, or more internal representation. Encoders are often a component of a larger model, where they are frequently paired with a decoder. Some Transformers pair encoders with decoders, though other Transformers use only the encoder or only the decoder. Some systems use the encoder's output as the input to a classification or regression network. In sequence-to-sequence tasks, an encoder takes an input sequence and returns an internal state (a vector). Then, the decoder uses that internal state to predict the next sequence. Refer to Transformer for the definition of an encoder in the Transformer architecture. See LLMs: What's a large language model in Machine Learning Crash Course for more information.
156
ensemble
A collection of models trained independently whose predictions are averaged or aggregated. In many cases, an ensemble produces better predictions than a single model. For example, a random forest is an ensemble built from multiple decision trees. Note that not all decision forests are ensembles. See Random Forest in Machine Learning Crash Course for more information.
157
entropy
In information theory, a description of how unpredictable a probability distribution is. Alternatively, entropy is also defined as how much information each example contains. A distribution has the highest possible entropy when all values of a random variable are equally likely.

The entropy of a set with two possible values "0" and "1" (for example, the labels in a binary classification problem) has the following formula:

    H = -p log p - q log q = -p log p - (1-p) log (1-p)

where:
- H is the entropy.
- p is the fraction of "1" examples.
- q is the fraction of "0" examples. Note that q = (1 - p).
- log is generally log2. In this case, the entropy unit is a bit.

For example, suppose the following:
- 100 examples contain the value "1"
- 300 examples contain the value "0"

Therefore, the entropy value is:
- p = 0.25
- q = 0.75
- H = (-0.25)log2(0.25) - (0.75)log2(0.75) = 0.81 bits per example

A set that is perfectly balanced (for example, 200 "0"s and 200 "1"s) would have an entropy of 1.0 bit per example. As a set becomes more imbalanced, its entropy moves towards 0.0.

In decision trees, entropy helps formulate information gain to help the splitter select the conditions during the growth of a classification decision tree.

Compare entropy with:
- gini impurity
- cross-entropy loss function

Entropy is often called Shannon's entropy. See Exact splitter for binary classification with numerical features in the Decision Forests course for more information.
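The worked example above in code form:

    import numpy as np

    def binary_entropy(p):
        # H = -p*log2(p) - (1-p)*log2(1-p), in bits (assumes 0 < p < 1)
        q = 1 - p
        return -(p * np.log2(p) + q * np.log2(q))

    print(binary_entropy(0.25))   # ~0.81 bits per example (100 "1"s, 300 "0"s)
    print(binary_entropy(0.5))    # 1.0 bit per example (perfectly balanced set)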
158
environment
In reinforcement learning, the world that contains the agent and allows the agent to observe that world's state. For example, the represented world can be a game like chess, or a physical world like a maze. When the agent applies an action to the environment, then the environment transitions between states.
159
episode
In reinforcement learning, each of the repeated attempts by the agent to learn an environment.
160
epoch
A full training pass over the entire training set such that each example has been processed once. An epoch represents N/batch size training iterations, where N is the total number of examples. For instance, suppose the following: - The dataset consists of 1,000 examples. - The batch size is 50 examples. Therefore, a single epoch requires 20 iterations: 1 epoch = (N/batch size) = (1,000 / 50) = 20 iterations See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.
161
epsilon greedy policy
In reinforcement learning, a policy that either follows a random policy with epsilon probability or a greedy policy otherwise. For example, if epsilon is 0.9, then the policy follows a random policy 90% of the time and a greedy policy 10% of the time. Over successive episodes, the algorithm reduces epsilon's value in order to shift from following a random policy to following a greedy policy. By shifting the policy, the agent first randomly explores the environment and then greedily exploits the results of random exploration.
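A minimal sketch of epsilon-greedy action selection over estimated Q-values:

    import numpy as np

    def epsilon_greedy(q_values, epsilon):
        if np.random.rand() < epsilon:
            return np.random.randint(len(q_values))   # explore: random action
        return int(np.argmax(q_values))               # exploit: greedy action

    # With epsilon = 0.9, roughly 90% of calls pick a random action.
    action = epsilon_greedy(np.array([0.1, 0.5, 0.2]), epsilon=0.9)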
162
equality of opportunity
A fairness metric to assess whether a model is predicting the desirable outcome equally well for all values of a sensitive attribute. In other words, if the desirable outcome for a model is the positive class, the goal would be to have the true positive rate be the same for all groups. Equality of opportunity is related to equalized odds, which requires that both the true positive rates and false positive rates are the same for all groups.

Suppose Glubbdubdrib University admits both Lilliputians and Brobdingnagians to a rigorous mathematics program. Lilliputians' secondary schools offer a robust curriculum of math classes, and the vast majority of students are qualified for the university program. Brobdingnagians' secondary schools don't offer math classes at all, and as a result, far fewer of their students are qualified. Equality of opportunity is satisfied for the preferred label of "admitted" with respect to nationality (Lilliputian or Brobdingnagian) if qualified students are equally likely to be admitted irrespective of whether they're a Lilliputian or a Brobdingnagian.

Suppose 100 Lilliputians and 100 Brobdingnagians apply, and admissions decisions are made as follows:

Lilliputians (90 qualified, 10 unqualified):
- Qualified: 45 admitted, 45 rejected
- Unqualified: 3 admitted, 7 rejected

Brobdingnagians (10 qualified, 90 unqualified):
- Qualified: 5 admitted, 5 rejected
- Unqualified: 9 admitted, 81 rejected

These admissions satisfy equality of opportunity for acceptance of qualified students because qualified Lilliputians and Brobdingnagians both have a 50% chance of being admitted. While equality of opportunity is satisfied, the following two fairness metrics are not satisfied:
- demographic parity: Lilliputians and Brobdingnagians are admitted to the university at different rates; 48% of Lilliputian students are admitted, but only 14% of Brobdingnagian students are admitted.
- equalized odds: While qualified Lilliputian and Brobdingnagian students both have the same chance of being admitted, the additional constraint that unqualified Lilliputians and Brobdingnagians both have the same chance of being rejected is not satisfied. Unqualified Lilliputians have a 70% rejection rate, whereas unqualified Brobdingnagians have a 90% rejection rate.

See Fairness: Equality of opportunity in Machine Learning Crash Course for more information.
163
equalized odds
A fairness metric to assess whether a model is predicting outcomes equally well for all values of a sensitive attribute with respect to both the positive class and negative class, not just one class or the other exclusively. In other words, both the true positive rate and the false positive rate should be the same for all groups. Equalized odds is related to equality of opportunity, which only focuses on error rates for a single class (positive or negative).

For example, suppose Glubbdubdrib University admits both Lilliputians and Brobdingnagians to a rigorous mathematics program. Lilliputians' secondary schools offer a robust curriculum of math classes, and the vast majority of students are qualified for the university program. Brobdingnagians' secondary schools don't offer math classes at all, and as a result, far fewer of their students are qualified. Equalized odds is satisfied provided that no matter whether an applicant is a Lilliputian or a Brobdingnagian, if they are qualified, they are equally as likely to get admitted to the program, and if they are not qualified, they are equally as likely to get rejected.

Suppose 100 Lilliputians and 100 Brobdingnagians apply to Glubbdubdrib University, and admissions decisions are made as follows:

Lilliputians (90 qualified, 10 unqualified):
- Qualified: 45 admitted, 45 rejected
- Unqualified: 2 admitted, 8 rejected

Brobdingnagians (10 qualified, 90 unqualified):
- Qualified: 5 admitted, 5 rejected
- Unqualified: 18 admitted, 72 rejected

Equalized odds is satisfied because qualified Lilliputian and Brobdingnagian students both have a 50% chance of being admitted, and unqualified Lilliputian and Brobdingnagian students both have an 80% chance of being rejected.

Equalized odds is formally defined in "Equality of Opportunity in Supervised Learning" as follows: "predictor Ŷ satisfies equalized odds with respect to protected attribute A and outcome Y if Ŷ and A are independent, conditional on Y."
164
Estimator
A deprecated TensorFlow API. Use tf.keras instead of Estimators.
165
evals
Primarily used as an abbreviation for LLM evaluations. More broadly, evals is an abbreviation for any form of evaluation.
166
evaluation
The process of measuring a model's quality or comparing different models against each other. To evaluate a supervised machine learning model, you typically judge it against a validation set and a test set. Evaluating an LLM typically involves broader quality and safety assessments.
167
example
The values of one row of features and possibly a label. Examples in supervised learning fall into two general categories:
- A labeled example consists of one or more features and a label. Labeled examples are used during training.
- An unlabeled example consists of one or more features but no label. Unlabeled examples are used during inference.
168
experience replay
In reinforcement learning, a DQN technique used to reduce temporal correlations in training data. The agent stores state transitions in a replay buffer, and then samples transitions from the replay buffer to create training data.
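A minimal sketch of a replay buffer (the buffer size and batch size are illustrative choices):

    import random
    from collections import deque

    replay_buffer = deque(maxlen=10_000)   # holds (state, action, reward, next_state)

    def store(transition):
        replay_buffer.append(transition)

    def sample_batch(batch_size=32):
        # Random sampling breaks up the temporal correlations between
        # consecutive steps; requires len(replay_buffer) >= batch_size.
        return random.sample(replay_buffer, batch_size)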
169
experimenter's bias
See confirmation bias. The tendency to search for, interpret, favor, and recall information in a way that confirms one's pre-existing beliefs or hypotheses. Machine learning developers may inadvertently collect or label data in ways that influence an outcome supporting their existing beliefs. Confirmation bias is a form of implicit bias. Experimenter's bias is a form of confirmation bias in which an experimenter continues training models until a pre-existing hypothesis is confirmed.
170
exploding gradient problem
The tendency for gradients in deep neural networks (especially recurrent neural networks) to become surprisingly steep (high). Steep gradients often cause very large updates to the weights of each node in a deep neural network. Models suffering from the exploding gradient problem become difficult or impossible to train. Gradient clipping can mitigate this problem. Compare to vanishing gradient problem.
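For example, gradient clipping can be enabled in one line with a Keras optimizer (the norm limit is an illustrative choice):

    import tensorflow as tf

    # clipnorm rescales any gradient whose L2 norm exceeds 1.0 before the
    # weight update, bounding the size of each training step.
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)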
171