Machine Learning Flashcards
Terms and concepts related to machine learning (250 cards)
ablation
A technique for evaluating the importance of a feature or component by temporarily removing it from a model. You then retrain the model without that feature or component, and if the retrained model performs significantly worse, then the removed feature or component was likely important.
For example, suppose you train a classification model on 10 features and achieve 88% precision on the test set. To check the importance of the first feature, you can retrain the model using only the nine other features. If the retrained model performs significantly worse (for instance, 55% precision), then the removed feature was probably important. Conversely, if the retrained model performs equally well, then that feature was probably not that important.
Ablation can also help determine the importance of:
- Larger components, such as an entire subsystem of a larger ML system
- Processes or techniques, such as a data preprocessing step
In both cases, you would observe how the system’s performance changes (or doesn’t change) after you’ve removed the component.
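The feature-ablation procedure described above can be sketched in a few lines. This is a minimal illustration, not a recommended workflow: the nearest-centroid classifier, the helper names (`train_centroids`, `predict`, `evaluate`), and the tiny hand-made dataset are all assumptions chosen so that feature 0 is informative and feature 1 is not.

```python
def train_centroids(rows, labels):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    centroids = {}
    for label in sorted(set(labels)):
        members = [r for r, t in zip(rows, labels) if t == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def predict(centroids, row):
    """Return the class whose centroid is closest to the row."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(row, centroids[label]))
    return min(centroids, key=sq_dist)


def evaluate(train, test, keep):
    """Retrain using only the feature indices in `keep`; return test accuracy."""
    (train_x, train_y), (test_x, test_y) = train, test

    def select(rows):
        return [[r[i] for i in keep] for r in rows]

    model = train_centroids(select(train_x), train_y)
    correct = sum(predict(model, r) == t for r, t in zip(select(test_x), test_y))
    return correct / len(test_y)


# Hypothetical data: feature 0 separates the classes, feature 1 does not.
train = ([[0.0, 5.0], [0.2, 1.0], [0.1, 3.0], [1.0, 5.0], [0.9, 1.0], [1.1, 3.0]],
         [0, 0, 0, 1, 1, 1])
test = ([[0.15, 4.0], [0.05, 2.0], [0.95, 4.0], [1.05, 2.0]],
        [0, 0, 1, 1])

baseline = evaluate(train, test, keep=[0, 1])   # all features
without_f0 = evaluate(train, test, keep=[1])    # ablate feature 0
without_f1 = evaluate(train, test, keep=[0])    # ablate feature 1
print(baseline, without_f0, without_f1)
```

In this toy setup, removing feature 0 drops test accuracy to chance while removing feature 1 changes nothing, which is the pattern the definition describes for an important and an unimportant feature.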
A/B testing
A statistical way of comparing two (or more) techniques—the A and the B. Typically, the A is an existing technique, and the B is a new technique. A/B testing not only determines which technique performs better but also whether the difference is statistically significant.
A/B testing usually compares a single metric on two techniques; for example, how does model accuracy compare for two techniques? However, A/B testing can also compare any finite number of metrics.
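One way to check whether an accuracy difference between A and B is statistically significant is a two-proportion z-test. The sketch below is one possible choice of test, not the only one; the function name, the sample counts, and the 0.05 significance level are assumptions for illustration.

```python
import math


def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Return (z, two-sided p-value) for the difference in two accuracies."""
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    # Pooled proportion under the null hypothesis that A and B perform equally.
    pooled = (correct_a + correct_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: A is correct on 800 of 1000 examples, B on 850 of 1000.
z, p_value = two_proportion_z_test(800, 1000, 850, 1000)
significant = p_value < 0.05  # assumed significance level
print(round(z, 2), significant)
```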
accelerator chip
A category of specialized hardware components designed to perform key computations needed for deep learning algorithms.
Accelerator chips (or just accelerators, for short) can significantly increase the speed and efficiency of training and inference tasks compared to a general-purpose CPU. They are ideal for training neural networks and similar computationally intensive tasks.
Examples of accelerator chips include:
- Google’s Tensor Processing Units (TPUs) with dedicated hardware for deep learning.
- NVIDIA’s GPUs which, though initially designed for graphics processing, support massively parallel computation, which can significantly increase processing speed.
accuracy
The number of correct classification predictions divided by the total number of predictions. That is:
Accuracy = correct predictions ÷ (correct predictions + incorrect predictions)
For example, a model that made 40 correct predictions and 10 incorrect predictions would have an accuracy of:
Accuracy = 40 ÷ (40 + 10) = 80%
Binary classification provides specific names for the different categories of correct predictions and incorrect predictions. So, the accuracy formula for binary classification is as follows:
Accuracy = (TP + TN) ÷ (TP + TN + FP + FN)
where:
- TP is the number of true positives (correct predictions).
- TN is the number of true negatives (correct predictions).
- FP is the number of false positives (incorrect predictions).
- FN is the number of false negatives (incorrect predictions).
Compare and contrast accuracy with precision and recall.
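As a quick check of the binary-classification form, the following sketch computes accuracy from the four counts. The split of the 40 correct and 10 incorrect predictions into TP, TN, FP, and FN is a made-up assumption; only the totals come from the example above.

```python
def accuracy(tp, tn, fp, fn):
    """Correct predictions (TP + TN) divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical breakdown: 25 + 15 = 40 correct, 6 + 4 = 10 incorrect.
result = accuracy(tp=25, tn=15, fp=6, fn=4)
print(result)  # 0.8
```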
action
In reinforcement learning, the mechanism by which the agent transitions between states of the environment. The agent chooses the action by using a policy.
activation function
A function that enables neural networks to learn nonlinear (complex) relationships between features and the label. The plots of activation functions are never single straight lines. Popular activation functions include:
- ReLU
- Sigmoid
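The two functions listed above can be written directly from their standard definitions. A minimal sketch for scalar inputs:

```python
import math


def relu(x):
    """Rectified linear unit: 0 for negative inputs, x otherwise."""
    return max(0.0, x)


def sigmoid(x):
    """Squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


print(relu(-2.0), relu(3.0), sigmoid(0.0))  # 0.0 3.0 0.5
```

Neither plot is a single straight line: ReLU bends at 0, and sigmoid is S-shaped.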
active learning
A training approach in which the algorithm chooses some of the data it learns from. Active learning is particularly valuable when labeled examples are scarce or expensive to obtain. Instead of blindly seeking a diverse range of labeled examples, an active learning algorithm selectively seeks the particular range of examples it needs for learning.
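One common selection strategy is uncertainty sampling, in which the algorithm asks for labels on the examples its current model is least sure about. The sketch below assumes a binary classifier that outputs a probability for the positive class; the function name and the probabilities are hypothetical.

```python
def most_uncertain(probabilities, k):
    """Return indices of the k examples whose P(positive) is closest to 0.5."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]


# Hypothetical model outputs for five unlabeled examples.
unlabeled_probs = [0.98, 0.51, 0.03, 0.40, 0.75]
to_label = most_uncertain(unlabeled_probs, k=2)
print(to_label)  # [1, 3]
```

The selected examples would then be sent to a human labeler and added to the training set before the next training round.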
AdaGrad
A sophisticated gradient descent algorithm that rescales the gradients of each parameter, effectively giving each parameter an independent learning rate. AdaGrad was one of the first algorithms to use adaptive learning rates and set the stage for further development in this area.
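A single AdaGrad update can be sketched as follows. Each parameter keeps a running sum of its squared gradients, and its step is divided by the square root of that sum. The function name, the default `learning_rate`, and the small constant `eps` (added to avoid division by zero) are assumptions for illustration.

```python
import math


def adagrad_step(params, grads, accum, learning_rate=0.1, eps=1e-8):
    """Apply one AdaGrad update; return the new parameters and accumulators."""
    new_params, new_accum = [], []
    for p, g, a in zip(params, grads, accum):
        a = a + g * g  # running sum of squared gradients for this parameter
        new_accum.append(a)
        # Parameters with a large gradient history take smaller effective steps.
        new_params.append(p - learning_rate * g / (math.sqrt(a) + eps))
    return new_params, new_accum


# Two parameters whose gradients differ by a factor of 100.
params, accum = adagrad_step([1.0, 1.0], grads=[10.0, 0.1], accum=[0.0, 0.0])
print(params)
```

Although the raw gradients differ greatly, both parameters move by roughly `learning_rate` on this first step, which illustrates the per-parameter rescaling.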
agent
In reinforcement learning, the entity that uses a policy to maximize the expected return gained from transitioning between states of the environment.
agglomerative clustering
A form of hierarchical clustering that first assigns every example to its own cluster and then iteratively merges the closest clusters to create a hierarchical tree.
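The merge loop can be sketched for one-dimensional points. This version assumes single linkage (the distance between two clusters is the distance between their closest members) and stops at a requested number of clusters; both choices, and the function name, are assumptions rather than part of the definition.

```python
def agglomerative(points, num_clusters):
    """Merge the two closest clusters until `num_clusters` remain."""
    clusters = [[p] for p in points]  # every example starts in its own cluster
    merges = []                       # merge history, i.e. the tree
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges


clusters, merges = agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], num_clusters=3)
print(sorted(sorted(c) for c in clusters))
```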
anomaly detection
The process of identifying outliers. For example, if the mean for a certain feature is 100 with a standard deviation of 10, then anomaly detection should flag a value of 200 as suspicious.
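A simple way to express the example above is a z-score rule: flag a value when it lies more than some number of standard deviations from the mean. The threshold of 3 standard deviations is an assumed convention, not part of the definition.

```python
def is_anomaly(value, mean, std, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    z_score = abs(value - mean) / std
    return z_score > threshold


print(is_anomaly(200, mean=100, std=10))  # True: 10 standard deviations away
print(is_anomaly(105, mean=100, std=10))  # False: 0.5 standard deviations away
```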
artificial general intelligence
A non-human mechanism that demonstrates a broad range of problem solving, creativity, and adaptability. For example, a program demonstrating artificial general intelligence could translate text, compose symphonies, and excel at games that have not yet been invented.
artificial intelligence
A non-human program or model that can solve sophisticated tasks. For example, a program or model that translates text and a program or model that identifies diseases from radiologic images both exhibit artificial intelligence.
Formally, machine learning is a sub-field of artificial intelligence. However, in recent years, some organizations have begun using the terms artificial intelligence and machine learning interchangeably.
attention
A mechanism used in a neural network that indicates the importance of a particular word or part of a word. Attention compresses the amount of information a model needs to predict the next token/word. A typical attention mechanism might consist of a weighted sum over a set of inputs, where the weight for each input is computed by another part of the neural network.
Refer also to self-attention and multi-head self-attention, which are the building blocks of Transformers.
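The weighted sum described above can be sketched with plain lists. Here the weights come from a softmax over dot products between a query vector and a set of key vectors; that scoring rule, the function names, and the toy vectors are assumptions, since the definition only says the weights are computed by another part of the network.

```python
import math


def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Return (weights, weighted sum of values) for one query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Hypothetical 2-dimensional query, keys, and values.
weights, output = attend(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    values=[[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]],
)
print(weights, output)
```

The input whose key best matches the query receives the largest weight, so it contributes most to the output.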
attribute
Synonym for feature.
In machine learning fairness, attributes often refer to characteristics pertaining to individuals.
attribute sampling
A tactic for training a decision forest in which each decision tree considers only a random subset of possible features when learning the condition. Generally, a different subset of features is sampled for each node. In contrast, when training a decision tree without attribute sampling, all possible features are considered for each node.
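The per-node sampling step might look like the sketch below. The function name, the feature count, and the subset size are hypothetical; real decision forest libraries choose the subset size through their own hyperparameters.

```python
import random


def sample_attributes(num_features, num_sampled, rng):
    """Pick a random subset of feature indices to consider at one node."""
    return sorted(rng.sample(range(num_features), num_sampled))


rng = random.Random(0)  # seeded only to make the sketch reproducible
# A fresh subset is drawn for each of three hypothetical nodes.
subsets = [sample_attributes(num_features=10, num_sampled=3, rng=rng)
           for _ in range(3)]
print(subsets)
```

Only the sampled features would be evaluated as candidates for that node's condition.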
AUC (Area under the ROC curve)
Area under the receiver operating characteristic curve. A number between 0.0 and 1.0 representing a binary classification model’s ability to separate positive classes from negative classes. The closer the AUC is to 1.0, the better the model’s ability to separate classes from each other. AUC ignores any value you set for classification threshold. Instead, AUC considers all possible classification thresholds.
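AUC can be computed without choosing any threshold, because it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example (ties count as half). The sketch below uses that pairwise formulation; the labels and scores are made up.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and model scores.
value = auc(labels=[0, 0, 1, 1], scores=[0.1, 0.4, 0.35, 0.8])
print(value)  # 0.75
```

This pairwise loop is quadratic in the number of examples, so it is suitable only for small illustrations.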
augmented reality
A technology that superimposes a computer-generated image on a user’s view of the real world, thus providing a composite view.
autoencoder
A system that learns to extract the most important information from the input. Autoencoders are a combination of an encoder and decoder. Autoencoders rely on the following two-step process:
- The encoder maps the input to a (typically) lossy lower-dimensional (intermediate) format.
- The decoder builds a lossy version of the original input by mapping the lower-dimensional format to the original higher-dimensional input format.
Autoencoders are trained end-to-end by having the decoder attempt to reconstruct the original input from the encoder’s intermediate format as closely as possible. Because the intermediate format is smaller (lower-dimensional) than the original format, the autoencoder is forced to learn what information in the input is essential, and the output won’t be perfectly identical to the input.
For example:
- If the input data is a graphic, the non-exact copy would be similar to the original graphic, but somewhat modified. Perhaps the non-exact copy removes noise from the original graphic or fills in some missing pixels.
- If the input data is text, an autoencoder would generate new text that mimics (but is not identical to) the original text.
See also variational autoencoders.
automation bias
When a human decision maker favors recommendations made by an automated decision-making system over information produced without automation, even when the automated decision-making system makes errors.
AutoML
Any automated process for building machine learning models. AutoML can automatically do tasks such as the following:
- Search for the most appropriate model.
- Tune hyperparameters.
- Prepare data (including performing feature engineering).
- Deploy the resulting model.
AutoML is useful for data scientists because it can save them time and effort in developing machine learning pipelines and improve prediction accuracy. It is also useful to non-experts, by making complicated machine learning tasks more accessible to them.
auto-regressive model
A model that infers a prediction based on its own previous predictions. For example, auto-regressive language models predict the next token based on the previously predicted tokens. Many Transformer-based large language models, notably decoder-only models, are auto-regressive.
In contrast, GAN-based image models are usually not auto-regressive since they generate an image in a single forward-pass and not iteratively in steps. However, certain image generation models are auto-regressive because they generate an image in steps.
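The step-by-step nature of auto-regressive generation can be shown with a toy loop in which each new token is predicted from the tokens generated so far. The lookup table standing in for a model, and all names below, are hypothetical.

```python
def generate(next_token, prompt, max_new_tokens):
    """Repeatedly append the model's next prediction to the sequence."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)  # the prediction depends on prior output
        if token is None:           # the stand-in model has nothing to add
            break
        tokens.append(token)
    return tokens


# Hypothetical stand-in for a language model: predict from the last token only.
BIGRAMS = {"the": "cat", "cat": "sat", "sat": "down"}


def toy_model(tokens):
    return BIGRAMS.get(tokens[-1])


sequence = generate(toy_model, prompt=["the"], max_new_tokens=5)
print(sequence)  # ['the', 'cat', 'sat', 'down']
```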
auxiliary loss
A loss function—used in conjunction with a neural network model’s main loss function—that helps accelerate training during the early iterations when weights are randomly initialized.
Auxiliary loss functions push effective gradients to the earlier layers. This facilitates convergence during training by combating the vanishing gradient problem.
average precision
A metric for summarizing the performance of a ranked sequence of results. Average precision is calculated by taking the average of the precision values for each relevant result (each result in the ranked list where the recall increases relative to the previous result).
See also Area under the PR Curve.
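The calculation can be sketched for a ranked list given as relevance flags. This version averages precision over the relevant results that appear in the list, following the definition above; the function name and the example ranking are assumptions.

```python
def average_precision(relevance):
    """Average the precision measured at the rank of each relevant result."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0


# Hypothetical ranking: relevant results at ranks 1 and 3.
ap = average_precision([True, False, True, False, False])
print(ap)  # (1/1 + 2/3) / 2
```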