The Hundred-Page Machine Learning (Book) Flashcards

(96 cards)

1
Q

What is Machine Learning?

A

Machine learning is a subfield of computer science that is concerned with building algorithms
which, to be useful, rely on a collection of examples of some phenomenon. These examples
can come from nature, be handcrafted by humans or generated by another algorithm.
Machine learning can also be defined as the process of solving a practical problem by 1)
gathering a dataset, and 2) algorithmically building a statistical model based on that dataset.

2
Q

Types of Learning

A

Learning can be supervised, semi-supervised, unsupervised and reinforcement.

3
Q

Supervised Learning

A

The goal of a supervised learning algorithm is to use the dataset to produce a model
that takes a feature vector x as input and outputs information that allows deducing the label
for this feature vector. For instance, the model created using the dataset of people could
take as input a feature vector describing a person and output a probability that the person
has cancer.

4
Q

Unsupervised Learning

A

In unsupervised learning, the dataset is a collection of unlabeled examples $\{\mathbf{x}_i\}_{i=1}^{N}$.
Again, x is a feature vector, and the goal of an unsupervised learning algorithm is
to create a model that takes a feature vector x as input and either transforms it into
another vector or into a value that can be used to solve a practical problem. For example,
in clustering, the model returns the id of the cluster for each feature vector in the dataset.
In dimensionality reduction, the output of the model is a feature vector that has fewer
features than the input x; in outlier detection, the output is a real number that indicates
how x is different from a “typical” example in the dataset.

5
Q

Semi-Supervised Learning

A

In semi-supervised learning, the dataset contains both labeled and unlabeled examples.
Usually, the quantity of unlabeled examples is much higher than the number of labeled
examples. The goal of a semi-supervised learning algorithm is the same as the goal of
the supervised learning algorithm. The hope here is that using many unlabeled examples can
help the learning algorithm to find (we might say “produce” or “compute”) a better model.

6
Q

Reinforcement Learning

A

Reinforcement learning is a subfield of machine learning where the machine “lives” in an
environment and is capable of perceiving the state of that environment as a vector of
features. The machine can execute actions in every state. Different actions bring different rewards and could also move the machine to another state of the environment. The goal
of a reinforcement learning algorithm is to learn a policy. A policy is a function f (similar
to the model in supervised learning) that takes the feature vector of a state as input and
outputs an optimal action to execute in that state. The action is optimal if it maximizes the
expected average reward.

7
Q

bag of words?

A

In a bag-of-words representation of an email message, with a dictionary of 20,000 alphabetically sorted words, the feature vector is built as follows:
* the first feature is equal to 1 if the email message contains the word “a”; otherwise,
this feature is 0;
* the second feature is equal to 1 if the email message contains the word “aaron”; otherwise,
this feature equals 0;
* …
* the feature at position 20,000 is equal to 1 if the email message contains the word
“zulu”; otherwise, this feature is equal to 0.

Now you have machine-readable input data, but the output labels are still in the form of
human-readable text.
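
A minimal sketch of the bag-of-words features described above. The four-word `VOCABULARY` and the function name `bag_of_words` are hypothetical stand-ins; the card's dictionary would have 20,000 entries.

```python
# Hypothetical toy vocabulary, sorted alphabetically (assumption for illustration).
VOCABULARY = ["a", "aaron", "meeting", "zulu"]

def bag_of_words(message, vocabulary=VOCABULARY):
    """Return one binary feature per vocabulary word: 1 if present, else 0."""
    words = set(message.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

features = bag_of_words("Aaron scheduled a meeting")
print(features)  # [1, 1, 1, 0]
```

Splitting on whitespace is a simplification; a real tokenizer would also strip punctuation.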

8
Q

scalar

A

A scalar is a simple numerical value, like 15 or -3.25. Variables or constants that take scalar
values are denoted by an italic letter, like x or a.

9
Q

vector

A

A vector is an ordered list of scalar values, called attributes. We denote a vector as a bold
character, for example, x or w. Vectors can be visualized as arrows that point in some
direction as well as points in a multi-dimensional space. Illustrations of three two-dimensional
vectors, a = [2, 3], b = [-2, 5], and c = [1, 0], are given in fig. 1. We denote an attribute of a
vector as an italic value with an index, like this: $w^{(j)}$ or $x^{(j)}$. The index j denotes a specific
dimension of the vector, the position of an attribute in the list. For instance, in the vector a
shown in red in fig. 1, $a^{(1)} = 2$ and $a^{(2)} = 3$.

10
Q

set

A

A set is an unordered collection of unique elements. We denote a set as a calligraphic capital character, for example, S.

11
Q

The sum of two vectors

A

The sum of two vectors x + z is defined as the vector $[x^{(1)} + z^{(1)}, x^{(2)} + z^{(2)}, \ldots, x^{(m)} + z^{(m)}]$.

12
Q

vector multiplied by a scalar

A

A vector multiplied by a scalar is a vector. For example, $\mathbf{x}c = [cx^{(1)}, cx^{(2)}, \ldots, cx^{(m)}]$.

13
Q

dot-product of two vectors

A

A dot-product of two vectors is a scalar. For example, $\mathbf{w}\mathbf{x} \stackrel{\text{def}}{=} \sum_{i=1}^{m} w^{(i)} x^{(i)}$. In some books,
the dot-product is denoted as w · x. The two vectors must be of the same dimensionality.
Otherwise, the dot-product is undefined.
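
The three vector operations (sum, multiplication by a scalar, dot-product) can be sketched with plain Python lists. The function names are hypothetical; the example reuses the vectors a = [2, 3] and b = [-2, 5] from the vector card.

```python
def vector_sum(x, z):
    """Element-wise sum of two vectors of the same dimensionality."""
    if len(x) != len(z):
        raise ValueError("vectors must have the same dimensionality")
    return [xi + zi for xi, zi in zip(x, z)]

def scale(x, c):
    """Multiply every attribute of vector x by the scalar c."""
    return [c * xi for xi in x]

def dot(w, x):
    """Dot-product: sum of pairwise products; undefined for mismatched sizes."""
    if len(w) != len(x):
        raise ValueError("dot-product is undefined for different dimensionalities")
    return sum(wi * xi for wi, xi in zip(w, x))

a = [2, 3]
b = [-2, 5]
print(vector_sum(a, b))  # [0, 8]
print(scale(a, 3))       # [6, 9]
print(dot(a, b))         # 2*(-2) + 3*5 = 11
```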

14
Q

Does (0, 1) contain 0 and 1? What about [0, 1]?

A

The open interval (0, 1) excludes both 0 and 1; the closed interval [0, 1] includes both.

15
Q

Derivative and Gradient

A

A derivative f′ of a function f is a function or a value that describes how fast f grows (or
decreases). If the derivative is a constant value, like 5 or -3, then the function grows (or
decreases) constantly at any point x of its domain. If the derivative f′ is a function, then the
function f can grow at a different pace in different regions of its domain. If the derivative f′
is positive at some point x, then the function f grows at this point. If the derivative of f is
negative at some x, then the function decreases at this point. The derivative of zero at x
means that the function’s slope at x is horizontal.

16
Q

probability distribution

A

The probability distribution of a discrete random variable is described by a list of probabilities
associated with each of its possible values. This list of probabilities is called a probability mass
function (pmf). For example: Pr(X = red) = 0.3, Pr(X = yellow) = 0.45, Pr(X = blue) =
0.25. Each probability in a probability mass function is a value greater than or equal to 0.
The sum of probabilities equals 1.

17
Q

continuous random variable

A

A continuous random variable takes an infinite number of possible values in some interval.
Examples include height, weight, and time. Because the number of values of a continuous
random variable X is infinite, the probability Pr(X = c) for any c is 0. Therefore, instead
of the list of probabilities, the probability distribution of a continuous random variable (a
continuous probability distribution) is described by a probability density function (pdf). The
pdf is a function whose codomain is nonnegative and the area under the curve is equal to 1.

18
Q

Bayes’ Rule

A

The conditional probability Pr(X = x|Y = y) is the probability of the random variable X to
have a specific value x given that another random variable Y has a specific value of y. The
Bayes’ Rule (also known as the Bayes’ Theorem) stipulates that:
$$\Pr(X = x \mid Y = y) = \frac{\Pr(Y = y \mid X = x)\,\Pr(X = x)}{\Pr(Y = y)}$$

19
Q

Model-Based vs. Instance-Based Learning

A

Most supervised learning algorithms are model-based. We have already seen one such
algorithm: SVM. Model-based learning algorithms use the training data to create a model
that has parameters learned from the training data. In SVM, the two parameters we saw
were w* and b*. After the model was built, the training data can be discarded.
Instance-based learning algorithms use the whole dataset as the model. One instance-based
algorithm frequently used in practice is k-Nearest Neighbors (kNN). In classification, to
predict a label for an input example the kNN algorithm looks at the close neighborhood of
the input example in the space of feature vectors and outputs the label that it saw the most
often in this close neighborhood.

20
Q

Shallow vs. Deep Learning

A

A shallow learning algorithm learns the parameters of the model directly from the features
of the training examples. Most supervised learning algorithms are shallow. The notorious
exceptions are neural network learning algorithms, specifically those that build neural
networks with more than one layer between input and output. Such neural networks are
called deep neural networks. In deep neural network learning (or, simply, deep learning),
contrary to shallow learning, most model parameters are learned not directly from the features
of the training examples, but from the outputs of the preceding layers.

21
Q

Decision Tree Learning

A

A decision tree is an acyclic graph that can be used to make decisions. In each branching
node of the graph, a specific feature j of the feature vector is examined. If the value of the
feature is below a specific threshold, then the left branch is followed; otherwise, the right
branch is followed. As the leaf node is reached, the decision is made about the class to which
the example belongs.
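
The traversal described above can be sketched as follows. The nested-dictionary layout, the hand-built `TREE`, and its thresholds and labels are all hypothetical; a learning algorithm such as ID3 would build the tree from data.

```python
# Hypothetical hand-built tree: branching nodes test one feature against a
# threshold, leaf nodes carry a class label.
TREE = {
    "feature": 0, "threshold": 5.0,
    "left": {"label": "small"},
    "right": {
        "feature": 1, "threshold": 2.5,
        "left": {"label": "medium"},
        "right": {"label": "large"},
    },
}

def predict(node, x):
    """Follow the left branch when the feature is below the threshold, else right."""
    while "label" not in node:
        go_left = x[node["feature"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

print(predict(TREE, [7.0, 1.0]))  # medium
```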

22
Q

k-Nearest Neighbors

A

k-Nearest Neighbors (kNN) is a non-parametric learning algorithm. Contrary to other
learning algorithms that allow discarding the training data after the model is built, kNN
keeps all training examples in memory. Once a new, previously unseen example x comes in,
the kNN algorithm finds k training examples closest to x and returns the majority label (in
case of classification) or the average label (in case of regression).
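
A minimal kNN sketch under a few assumptions the card does not state: Euclidean distance as the closeness measure, a brute-force scan over all training examples, and the hypothetical function name `knn_predict`.

```python
import math
from collections import Counter

def knn_predict(train, x, k=3, task="classification"):
    """train is a list of (feature_vector, label) pairs kept in memory."""
    # Sort all training examples by Euclidean distance to x and keep the k closest.
    neighbors = sorted(train, key=lambda example: math.dist(example[0], x))[:k]
    labels = [label for _, label in neighbors]
    if task == "regression":
        return sum(labels) / len(labels)          # average label
    return Counter(labels).most_common(1)[0][0]   # majority label

train = [
    ([1.0, 1.0], "a"), ([1.2, 0.8], "a"), ([0.9, 1.1], "a"),
    ([5.0, 5.0], "b"), ([5.5, 4.5], "b"),
]
print(knn_predict(train, [1.1, 1.0]))  # a
```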

23
Q

Building Blocks of a Learning Algorithm

A

1) a loss function;
2) an optimization criterion based on the loss function (a cost function, for example); and
3) an optimization routine that leverages training data to find a solution to the optimization
criterion.

24
Q

Gradient descent

A

Gradient descent is an iterative optimization algorithm for finding the minimum of a function.
To find a local minimum of a function using gradient descent, one starts at some random
point and takes steps proportional to the negative of the gradient (or approximate gradient)
of the function at the current point.
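
A minimal sketch of the update rule for a one-dimensional function. The function f(x) = (x - 3)^2, the learning rate, and the step count are assumptions chosen for illustration; the start point is fixed rather than random so the run is reproducible.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

# f(x) = (x - 3)^2 has derivative f'(x) = 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(round(minimum, 4))  # 3.0
```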

25
Stochastic gradient descent
Stochastic gradient descent (SGD) is a version of gradient descent that speeds up the computation by approximating the gradient using smaller batches (subsets) of the training data. SGD itself has various “upgrades”. Adagrad is a version of SGD that scales the learning rate α for each parameter according to the history of gradients. As a result, α is reduced for very large gradients and vice versa. Momentum is a method that helps accelerate SGD by orienting the gradient descent in the relevant direction and reducing oscillations. In neural network training, variants of SGD such as RMSprop and Adam are most frequently used.
26
Feature Engineering
The problem of transforming raw data into a dataset is called feature engineering. For most practical problems, feature engineering is a labor-intensive process that demands from the data analyst a lot of creativity and, preferably, domain knowledge.
27
Bias and variability
Google it punk!
28
One-Hot Encoding
Some learning algorithms only work with numerical feature vectors. When some feature in your dataset is categorical, like “colors” or “days of the week,” you can transform such a categorical feature into several binary ones.
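A minimal sketch of the transformation, assuming a hypothetical three-value "colors" feature; the category list and the function name `one_hot` are illustrative only.

```python
def one_hot(value, categories):
    """Turn one categorical value into len(categories) binary features."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]

COLORS = ["red", "yellow", "green"]  # hypothetical category list
print(one_hot("yellow", COLORS))  # [0, 1, 0]
```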
29
Binning
Binning applies when you have a numerical feature that you want to convert into a categorical one. Binning (also called bucketing) is the process of converting a continuous feature into multiple binary features called bins or buckets, typically based on value range. In some cases, a carefully designed binning can help the learning algorithm to learn using fewer examples. This happens because we give a “hint” to the learning algorithm that if the value of a feature falls within a specific range, the exact value of the feature doesn’t matter.
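A minimal sketch of range-based binning. The age feature, the bin edges in `AGE_EDGES`, and the function name `bin_feature` are hypothetical choices for illustration.

```python
import bisect

def bin_feature(value, edges):
    """Return one binary feature per bin; exactly one of them is 1."""
    # bisect_right gives the index of the bin the value falls into.
    index = bisect.bisect_right(edges, value)
    return [1 if i == index else 0 for i in range(len(edges) + 1)]

AGE_EDGES = [18, 40, 65]  # bins: below 18, [18, 40), [40, 65), 65 and above
print(bin_feature(30, AGE_EDGES))  # [0, 1, 0, 0]
```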
30
Normalization
Normalization is the process of converting an actual range of values which a numerical feature can take, into a standard range of values, typically in the interval [-1, 1] or [0, 1]. For example, suppose the natural range of a particular feature is 350 to 1450. By subtracting 350 from every value of the feature, and dividing the result by 1100, one can normalize those values into the range [0, 1].
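The 350-to-1450 example can be sketched as below. The function name `normalize` and the sample values are hypothetical; when the natural range is not passed in, the sketch assumes the observed minimum and maximum.

```python
def normalize(values, low=None, high=None):
    """Rescale values into [0, 1] using (value - low) / (high - low)."""
    low = min(values) if low is None else low
    high = max(values) if high is None else high
    return [(v - low) / (high - low) for v in values]

# Subtract 350, then divide by 1450 - 350 = 1100.
print(normalize([350, 900, 1450], low=350, high=1450))  # [0.0, 0.5, 1.0]
```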
31
Why do we normalize?
Normalizing the data is not a strict requirement. However, in practice, it can lead to an increased speed of learning. Remember the gradient descent example from the previous chapter. Imagine you have a two-dimensional feature vector. When you update the parameters $w^{(1)}$ and $w^{(2)}$, you use partial derivatives of the average squared error with respect to $w^{(1)}$ and $w^{(2)}$. If $x^{(1)}$ is in the range [0, 1000] and $x^{(2)}$ is in the range [0, 0.0001], then the derivative with respect to a larger feature will dominate the update.
32
Standardization
Standardization (or z-score normalization) is the procedure during which the feature values are rescaled so that they have the properties of a standard normal distribution with μ = 0 and σ = 1, where μ is the mean (the average value of the feature, averaged over all examples in the dataset) and σ is the standard deviation from the mean.
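A minimal sketch of z-score standardization, assuming the population standard deviation (computed over all examples) is the intended one; the function name `standardize` and the sample values are hypothetical.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1: (value - mu) / sigma."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

# This sample has mean 5 and population standard deviation 2.
z = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(z)
```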
33
You may ask when you should use normalization and when standardization.
Usually, if your dataset is not too big and you have time, you can try both and see which one performs better for your task.
* unsupervised learning algorithms, in practice, more often benefit from standardization than from normalization;
* standardization is also preferred for a feature if the values this feature takes are distributed close to a normal distribution (so-called bell curve);
* again, standardization is preferred for a feature if it can sometimes have extremely high or low values (outliers); this is because normalization will “squeeze” the normal values into a very small range;
* in all other cases, normalization is preferable.
34
Dealing with Missing Features
* Removing the examples with missing features from the dataset. That can be done if your dataset is big enough so you can sacrifice some training examples.
* Using a learning algorithm that can deal with missing feature values (depends on the library and a specific implementation of the algorithm).
* Using a data imputation technique.
35
Data Imputation Techniques
One technique consists in replacing the missing value of a feature by an average value of this feature in the dataset.
Another technique is to replace the missing value by a value outside the normal range of values. For example, if the normal range is [0, 1], then you can set the missing value equal to 2 or -1. The idea is that the learning algorithm will learn what is best to do when the feature has a value significantly different from other values.
Alternatively, you can replace the missing value by a value in the middle of the range. For example, if the range for a feature is [-1, 1], you can set the missing value to be equal to 0. Here, the idea is that if we use the value in the middle of the range to replace missing features, such a value will not significantly affect the prediction.
A more advanced technique is to use the missing value as the target variable for a regression problem. You can use all remaining features $[x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(j-1)}, x_i^{(j+1)}, \ldots, x_i^{(D)}]$ to form a feature vector $\hat{\mathbf{x}}_i$, and set $\hat{y}_i = x_i^{(j)}$, where j is the feature with a missing value. Now we can build a regression model to predict $\hat{y}$ from the feature vectors $\hat{\mathbf{x}}$. Of course, to build training examples $(\hat{\mathbf{x}}, \hat{y})$, you only use those examples from the original dataset in which the value of feature j is present.
Finally, if you have a significantly large dataset and just a few features with missing values, you can increase the dimensionality of your feature vectors by adding a binary indicator feature for each feature with missing values. Let’s say feature j = 12 in your D-dimensional dataset has missing values. For each feature vector x, you then add the feature j = D + 1 which is equal to 1 if the value of feature 12 is present in x and 0 otherwise. The missing feature value then can be replaced by 0 or any number of your choice.
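A minimal sketch combining two of the techniques above for a single feature column: mean imputation plus a binary indicator feature. Representing a missing value as `None` and the function name `impute_mean_with_indicator` are assumptions for illustration.

```python
def impute_mean_with_indicator(column):
    """column is a list of numbers where None marks a missing value."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    # Replace each missing value by the average of the values that are present.
    filled = [mean if v is None else v for v in column]
    # Indicator feature: 1 if the original value was present, 0 otherwise.
    indicator = [0 if v is None else 1 for v in column]
    return filled, indicator

filled, indicator = impute_mean_with_indicator([1.0, None, 3.0])
print(filled)     # [1.0, 2.0, 3.0]
print(indicator)  # [1, 0, 1]
```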
36
Learning Algorithm Selection
* Explainability
* In-memory vs. out-of-memory
* Number of features and examples
* Categorical vs. numerical features
* Nonlinearity of the data
* Training speed
* Prediction speed
37
Three sets when working with data
1) training set, 2) validation set, and 3) test set.
38
Underfit - high bias
* your model is too simple for the data (for example, a linear model can often underfit);
* the features you engineered are not informative enough.
39
Overfitting - high variance
* your model is too complex for the data (for example, a very tall decision tree or a very deep or wide neural network often overfits);
* you have too many features but a small number of training examples.
40
Regularization
Regularization is an umbrella term that encompasses methods that force the learning algorithm to build a less complex model. In practice, that often leads to slightly higher bias but significantly reduces the variance. This problem is known in the literature as the bias-variance tradeoff.
41
elastic net regularization
L1 and L2 regularization methods are also combined in what is called elastic net regularization, with L1 and L2 regularizations being special cases. You can find in the literature the name ridge regularization for L2 and lasso for L1.
42
For classification, things are a little bit more complicated. The most widely used metrics and tools to assess the classification model are:
* confusion matrix,
* accuracy,
* cost-sensitive accuracy,
* precision/recall, and
* area under the ROC curve.
43
Confusion Matrix
The confusion matrix is a table that summarizes how successful the classification model is at predicting examples belonging to various classes. One axis of the confusion matrix is the label that the model predicted, and the other axis is the actual label. In a binary classification problem, there are two classes.
44
Precision/Recall
Precision is the ratio of correct positive predictions to the overall number of positive predictions: $\text{precision} = \frac{TP}{TP + FP}$. Recall is the ratio of correct positive predictions to the overall number of positive examples in the test set: $\text{recall} = \frac{TP}{TP + FN}$.
45
Accuracy
Accuracy is given by the number of correctly classified examples divided by the total number of classified examples. In terms of the confusion matrix, it is given by: $\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$.
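Precision, recall, and accuracy can all be computed from the four cells of a binary confusion matrix. The counts below are hypothetical, and the function name `classification_metrics` is illustrative.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (precision, recall, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp)  # correct positives / all positive predictions
    recall = tp / (tp + fn)     # correct positives / all positive examples
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, accuracy

# Hypothetical counts for a test set of 100 examples.
precision, recall, accuracy = classification_metrics(tp=8, fp=2, tn=85, fn=5)
print(precision, accuracy)  # 0.8 0.93
```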
46
Cost-Sensitive Accuracy
For dealing with the situation in which different classes have different importance, a useful metric is cost-sensitive accuracy. To compute a cost-sensitive accuracy, you first assign a cost (a positive number) to both types of mistakes: FP and FN. You then compute the counts TP, TN, FP, FN as usual and multiply the counts for FP and FN by the corresponding cost before calculating the accuracy using eq. 5.
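One way to read the procedure above is sketched below: the FP and FN counts are multiplied by their costs before being plugged into the usual accuracy formula. The counts and costs are hypothetical, and the function name is illustrative.

```python
def cost_sensitive_accuracy(tp, tn, fp, fn, fp_cost=1.0, fn_cost=1.0):
    """Accuracy with FP and FN counts weighted by their assigned costs."""
    weighted_fp = fp * fp_cost
    weighted_fn = fn * fn_cost
    return (tp + tn) / (tp + tn + weighted_fp + weighted_fn)

# Hypothetical counts; a false negative is treated as three times as costly.
plain = cost_sensitive_accuracy(tp=8, tn=85, fp=2, fn=5)
weighted = cost_sensitive_accuracy(tp=8, tn=85, fp=2, fn=5, fn_cost=3.0)
print(plain, round(weighted, 4))  # 0.93 0.8455
```

With both costs equal to 1, the value reduces to ordinary accuracy.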
47
ROC curves
Google/GPT
48
Hyperparameter Tuning
As you already know, hyperparameters aren’t optimized by the learning algorithm itself. The data analyst has to “tune” hyperparameters by experimentally finding the best combination of values, one per hyperparameter.
49
grid search?
Grid search is the simplest hyperparameter tuning strategy. Let’s say you train an SVM and you have two hyperparameters to tune: the penalty parameter C (a positive real number) and the kernel (either “linear” or “rbf”). If it’s the first time you are working with this dataset, you don’t know what the possible range of values for C is. The most common trick is to use a logarithmic scale. For example, for C you can try the following values: [0.001, 0.01, 0.1, 1.0, 10, 100, 1000]. In this case you have 14 combinations of hyperparameters to try: [(0.001, “linear”), (0.01, “linear”), (0.1, “linear”), (1.0, “linear”), (10, “linear”), (100, “linear”), (1000, “linear”), (0.001, “rbf”), (0.01, “rbf”), (0.1, “rbf”), (1.0, “rbf”), (10, “rbf”), (100, “rbf”), (1000, “rbf”)].
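The 14 combinations can be enumerated with `itertools.product`. In this sketch, `toy_score` is a hypothetical stand-in for training an SVM and measuring its performance on the validation set.

```python
from itertools import product

C_VALUES = [0.001, 0.01, 0.1, 1.0, 10, 100, 1000]
KERNELS = ["linear", "rbf"]

def grid_search(evaluate, c_values=C_VALUES, kernels=KERNELS):
    """Try every (C, kernel) pair and return the best pair plus the full grid."""
    # Kernels form the outer loop so the order matches the list in the card.
    combinations = [(c, kernel) for kernel, c in product(kernels, c_values)]
    best = max(combinations, key=lambda combo: evaluate(*combo))
    return best, combinations

def toy_score(c, kernel):
    """Hypothetical validation score (assumption): peaks at C = 10 with rbf."""
    return -abs(c - 10) + (0.5 if kernel == "rbf" else 0.0)

best, combinations = grid_search(toy_score)
print(len(combinations))  # 14
print(best)               # (10, 'rbf')
```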
50
random search
Random search differs from grid search in that you no longer provide a discrete set of values to explore for each hyperparameter; instead, you provide a statistical distribution for each hyperparameter from which values are randomly sampled and set the total number of combinations you want to try.
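A minimal sketch of random search for the same two SVM hyperparameters. The log-uniform distribution for C over [0.001, 1000], the trial count, and the `toy_score` stand-in for a real validation score are all assumptions.

```python
import random

def random_search(evaluate, n_trials=20, seed=0):
    """Sample n_trials random (C, kernel) pairs and keep the best-scoring one."""
    rng = random.Random(seed)
    best_combo, best_score = None, float("-inf")
    for _ in range(n_trials):
        c = 10 ** rng.uniform(-3, 3)            # log-uniform sample for C
        kernel = rng.choice(["linear", "rbf"])  # uniform over the two kernels
        score = evaluate(c, kernel)
        if score > best_score:
            best_combo, best_score = (c, kernel), score
    return best_combo, best_score

def toy_score(c, kernel):
    """Hypothetical validation score (assumption): peaks at C = 10 with rbf."""
    return -abs(c - 10) + (0.5 if kernel == "rbf" else 0.0)

best_combo, best_score = random_search(toy_score)
print(best_combo, best_score)
```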
51
Bayesian hyperparameter optimization.
Bayesian techniques differ from random or grid search in that they use past evaluation results to choose the next values to evaluate. The idea is to limit expensive optimization of the objective function by choosing the next hyperparameter values based on those that have done well in the past.
52
Cross-Validation
When you don’t have a decent validation set to tune your hyperparameters on, the common technique that can help you is called cross-validation. When you have few training examples, it could be prohibitive to have both a validation and a test set. You would prefer to use more data to train the model. In such a case, you only split your data into a training and a test set. Then you use cross-validation on the training set to simulate a validation set.
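The card does not spell out the mechanics. A common variant is k-fold cross-validation, sketched below under that assumption: the training set is split into k folds, and each fold serves once as the simulated validation set. The function name `k_fold_splits` is hypothetical.

```python
def k_fold_splits(n_examples, k=5):
    """Yield (train_indices, validation_indices) once per fold."""
    indices = list(range(n_examples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

splits = list(k_fold_splits(10, k=5))
print(splits[0])  # ([2, 3, 4, 5, 6, 7, 8, 9], [0, 1])
```

In practice the examples are usually shuffled before splitting; the sketch keeps them in order for readability.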
53
Deep Learning
Deep learning refers to training neural networks with more than two non-output layers. In the past, it became more difficult to train such networks as the number of layers grew. The two biggest challenges were referred to as the problems of exploding gradient and vanishing gradient as gradient descent was used to train the network parameters.
54
backpropagation
Backpropagation is an efficient algorithm for computing gradients on neural networks using the chain rule.
55
exploding gradient and vanishing gradient
google/gpt
56
hidden layers.
The layers that are neither input nor output are often called hidden layers.
57
CNN, RNN
google-gpt
58
Multiclass Classification
In multiclass classification, the label can be one of the C classes: y ∈ {1, ..., C}. Many machine learning algorithms are binary; SVM is an example. Some algorithms can naturally be extended to handle multiclass problems. ID3 and other decision tree learning algorithms can be simply changed to do so. Logistic regression can be naturally extended to multiclass learning problems by replacing the sigmoid function with the softmax function which we already saw in Chapter 6.
59
One vs rest idea to transform multiclass problem?
The idea is to transform a multiclass problem into C binary classification problems and build C binary classifiers. For example, if we have three classes, y ∈ {1, 2, 3}, we create copies of the original dataset and modify them. In the first copy, we replace all labels not equal to 1 by 0. In the second copy, we replace all labels not equal to 2 by 0. In the third copy, we replace all labels not equal to 3 by 0. Now we have three binary classification problems where we have to learn to distinguish between labels 1 and 0, 2 and 0, and between labels 3 and 0. Once we have the three models and we need to classify the new input feature vector x, we apply the three models to the input, and we get three predictions. We then pick the prediction of a non-zero class which is the most certain. Remember that in logistic regression, the model returns not a label but a score in the interval (0, 1) that can be interpreted as the probability that the label is positive.
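The relabeling and the final "most certain" pick can be sketched as follows. The constant-score lambdas stand in for trained binary classifiers and are purely hypothetical, as are the function names.

```python
def one_vs_rest_datasets(examples, classes):
    """For each class c, copy the dataset and replace labels not equal to c by 0."""
    return {c: [(x, y if y == c else 0) for x, y in examples] for c in classes}

def predict_one_vs_rest(models, x):
    """models maps class -> scoring function; return the most certain class."""
    return max(models, key=lambda c: models[c](x))

examples = [([0.1], 1), ([0.5], 2), ([0.9], 3)]
datasets = one_vs_rest_datasets(examples, classes=[1, 2, 3])
print(datasets[2])  # [([0.1], 0), ([0.5], 2), ([0.9], 0)]

# Hypothetical trained models (assumption): each returns a score in (0, 1).
models = {1: lambda x: 0.2, 2: lambda x: 0.7, 3: lambda x: 0.4}
print(predict_one_vs_rest(models, [0.5]))  # 2
```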
60
One-class classification
One-class classification, also known as unary classification or class modeling, tries to identify objects of a specific class among all objects, by learning from a training set containing only the objects of that class. That is different from and more difficult than the traditional classification problem, which tries to distinguish between two or more classes with the training set containing objects from all classes. A typical one-class classification problem is the classification of the traffic in a secure network as normal. In this scenario, there are few, if any, examples of the traffic under an attack or during an intrusion. However, the examples of normal traffic are often in abundance. One-class classification learning algorithms are used for outlier detection, anomaly detection, and novelty detection. There are several one-class learning algorithms. The most widely used in practice are one-class Gaussian, one-class k-means, one-class kNN, and one-class SVM.
61
one-class gaussian
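The idea behind the one-class Gaussian is that we model our data as if it came from a Gaussian distribution, more precisely a multivariate normal distribution (MND). The probability density function (pdf) for MND is given by the following equation:
$$f_{\boldsymbol{\mu}, \boldsymbol{\Sigma}}(\mathbf{x}) = \frac{\exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)}{\sqrt{(2\pi)^{D} |\boldsymbol{\Sigma}|}}$$
where $f_{\boldsymbol{\mu}, \boldsymbol{\Sigma}}(\mathbf{x})$ returns the probability density corresponding to the input feature vector x, μ is the mean vector, Σ is the covariance matrix, and D is the dimensionality of x. Probability density can be interpreted as the likelihood that example x was drawn from the probability distribution we model as an MND.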
62
Multi-Label Classification
In multi-label classification, each training example doesn’t just have one label, but several of them. For instance, if we want to describe an image, we could assign several labels to it: “people,” “concert,” “nature,” all three at the same time.
63
Ensemble Learning
Ensemble learning is a learning paradigm that, instead of trying to learn one super-accurate model, focuses on training a large number of low-accuracy models and then combining the predictions given by those weak models to obtain a high-accuracy meta-model. Two most widely used and effective ensemble learning algorithms are random forest and gradient boosting.
64
What Is Bagging in Machine Learning?
Bagging, also known as Bootstrap aggregating, is an ensemble learning technique that helps to improve the performance and accuracy of machine learning algorithms. It is used to deal with bias-variance trade-offs and reduces the variance of a prediction model. Bagging avoids overfitting of data and is used for both regression and classification models, specifically for decision tree algorithms.
65
Random Forest/Gradient boosting
google
66
Sequence-to-sequence learning
Sequence-to-sequence learning (often abbreviated as seq2seq learning) is a generalization of the sequence labeling problem. In seq2seq, $X_i$ and $Y_i$ can have different lengths. seq2seq models have found application in machine translation (where, for example, the input is an English sentence, and the output is the corresponding French sentence), conversational interfaces (where the input is a question typed by the user, and the output is the answer from the machine), text summarization, spelling correction, and many others.
67
embedding
The role of the encoder is to read the input and generate some sort of state (similar to the state in RNN) that can be seen as a numerical representation of the meaning of the input the machine can work with. The meaning of some entity, whether it be an image, a text or a video, is usually a vector or a matrix that contains real numbers. This vector (or matrix) is called in the machine learning jargon the embedding of the input.
68
architecture with attention.
Attention mechanism is implemented by an additional set of parameters that combine some information from the encoder (in RNNs, this information is the list of state vectors of the last recurrent layer from all encoder time steps) and the current state of the decoder to generate the label. That allows for even better retention of long-term dependencies than provided by gated units and bidirectional RNN.
69
Active learning
Active learning is an interesting supervised learning paradigm. It is usually applied when obtaining labeled examples is costly. That is often the case in the medical or financial domains, where the opinion of an expert may be required to annotate patients’ or customers’ data. The idea is that we start the learning with relatively few labeled examples, and a large number of unlabeled ones, and then add labels only to those examples that contribute the most to the model quality.
70
semi-supervised learning
In semi-supervised learning (SSL) we also have labeled a small fraction of the dataset; most of the remaining examples are unlabeled. Our goal is to leverage a large number of unlabeled examples to improve the model performance without asking an expert for additional labeled examples. For example, it was shown that for some datasets, such as MNIST (a frequent testbench in computer vision that consists of labeled images of handwritten digits from 0 to 9) the model trained in a semi-supervised way has an almost perfect performance with just 10 labeled examples per class (100 labeled examples overall). For comparison, MNIST contains 70,000 labeled examples (60,000 for training and 10,000 for test). The neural network architecture that attained such a remarkable performance is called a ladder network.
71
autoencoder
An autoencoder is a feed-forward neural network with an encoder-decoder architecture. It is trained to reconstruct its input. So the training example is a pair (x, x). We want the output xˆ of the model f(x) to be as similar to the input x as possible.
72
One-Shot Learning
In one-shot learning, typically applied in face recognition, we want to build a model that can recognize that two photos of the same person represent that same person. If we present to the model two photos of two different people, we expect the model to recognize that the two people are different. One way to build such a model is to train a siamese neural network (SNN). An SNN can be implemented as any kind of neural network, a CNN, an RNN, or an MLP. What matters is how we train the network. To train an SNN, we use the triplet loss function. For example, let us have three images of a face: the image A (for anchor), the image P (for positive) and the image N (for negative). A and P are two different pictures of the same person; N is a picture of another person. Each training example i is now a triplet $(A_i, P_i, N_i)$.
73
Zero-Shot Learning
We finish this chapter with zero-shot learning. It is a relatively new research area, so there are no algorithms that proved to have a significant practical utility yet. Therefore, I only outline here the basic idea and leave the details of various algorithms for further reading. In zero-shot learning (ZSL) we want to train a model to assign labels to objects. The most frequent application is to learn to assign labels to images.
74
Handling Imbalanced Datasets
If you set the cost of misclassification of examples of the minority class higher, then the model will try harder to avoid misclassifying those examples, at the cost of misclassifying some examples of the majority class, as illustrated in Figure 1b. Some SVM implementations (including SVC in scikit-learn) allow you to provide weights for every class. The learning algorithm takes this information into account when looking for the best hyperplane. If your learning algorithm doesn’t allow weighting classes, you can try to increase the importance of examples of some class by making multiple copies of the examples of this class (this is called oversampling). An opposite approach is to randomly remove from the training set some examples of the majority class (undersampling).
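Oversampling by copying and undersampling by random removal can be sketched as below. The toy dataset, the function names, and the `copies` and `keep` parameters are hypothetical.

```python
import random

def oversample(examples, minority_label, copies=2):
    """Keep all examples and repeat each minority example `copies` times in total."""
    minority = [ex for ex in examples if ex[1] == minority_label]
    return examples + minority * (copies - 1)

def undersample(examples, majority_label, keep, seed=0):
    """Randomly keep only `keep` examples of the majority class."""
    rng = random.Random(seed)
    majority = [ex for ex in examples if ex[1] == majority_label]
    rest = [ex for ex in examples if ex[1] != majority_label]
    return rest + rng.sample(majority, keep)

# Hypothetical imbalanced dataset: eight examples of class 0, two of class 1.
data = [([i], 0) for i in range(8)] + [([100], 1), ([101], 1)]
balanced_up = oversample(data, minority_label=1, copies=4)
balanced_down = undersample(data, majority_label=0, keep=2)
print(len(balanced_up), len(balanced_down))  # 16 4
```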
75
popular algorithms that oversample the minority class
There are two popular algorithms that oversample the minority class by creating synthetic examples: the synthetic minority oversampling technique (SMOTE) and the adaptive synthetic sampling method (ADASYN).
76
Combining Models
In practice, we can sometimes get an additional performance gain by combining strong models made with different learning algorithms. In this case, we usually use only two or three models. There are three typical ways to combine models: 1) averaging, 2) majority vote, and 3) stacking. Averaging works for regression as well as those classification models that return classification scores. You simply apply all your models, let’s call them base models, to the input x and then average the predictions. To see if the averaged model works better than each individual algorithm, you test it on the validation set using a metric of your choice. Majority vote works for classification models. You apply all your base models to the input x and then return the majority class among all predictions. In the case of a tie, you either randomly pick one of the classes, or you return an error message (if misclassifying would incur a significant cost). Stacking consists of building a meta-model that takes the output of your base models as input. Let’s say you want to combine a classifier f1 and a classifier f2, both predicting the same set of classes. To create a training example (x̂i, ŷi) for the stacked model, you set x̂i = [f1(xi), f2(xi)] and ŷi = yi.
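The three combination strategies can be sketched as follows. The base models here are hypothetical toy functions rather than trained models, and the tie handling shows only the "return an error" option.

```python
from collections import Counter


def average_predictions(models, x):
    """Averaging: mean of the scores returned by the base models."""
    scores = [m(x) for m in models]
    return sum(scores) / len(scores)


def majority_vote(models, x):
    """Majority vote: most frequent class; raises ValueError on a tie."""
    ranked = Counter(m(x) for m in models).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        raise ValueError("tie between classes")
    return ranked[0][0]


def stacking_example(base_models, x, y):
    """Stacking: build one training example for the meta-model."""
    return [m(x) for m in base_models], y


# Hypothetical base models returning scores.
s1 = lambda x: 0.8 if x > 0 else 0.2
s2 = lambda x: 0.6 if x > 0 else 0.1
s3 = lambda x: 0.7 if x > 0 else 0.3

# Hypothetical base models returning class labels.
c1 = lambda x: "pos" if x > 0 else "neg"
c2 = lambda x: "pos" if x > 1 else "neg"
c3 = lambda x: "pos" if x > -1 else "neg"

avg_score = average_predictions([s1, s2, s3], 0.5)
voted = majority_vote([c1, c2, c3], 0.5)
meta_x, meta_y = stacking_example([c1, c2], 0.5, "pos")
```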
77
Advanced Regularization
In neural networks, besides L1 and L2 regularization, you can use neural network specific regularizers: dropout, batch normalization, and early stopping. Batch normalization is technically not a regularization technique, but it often has a regularization effect on the model.
78
dropout
The concept of dropout is very simple. Each time you run a training example through the network, you temporarily exclude at random some units from the computation. The higher the percentage of units excluded, the higher the regularization effect. Neural network libraries allow you to add a dropout layer between two successive layers, or you can specify the dropout parameter for the layer. The dropout parameter is in the range [0, 1] and it has to be found experimentally by tuning it on the validation data.
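A small sketch of what a dropout layer does to one layer's outputs during training. Scaling the surviving units by 1 / (1 - rate), so-called inverted dropout, is an assumption added here; the card only describes the random exclusion.

```python
import random


def dropout(activations, rate, rng):
    """Zero each unit with probability `rate` (training time only).

    Surviving units are scaled by 1 / (1 - rate) so the expected sum of the
    layer output is unchanged (inverted dropout, an assumed convention).
    A rate of exactly 1 would drop every unit, so it is rejected here.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(42)
layer_output = [1.0] * 1000
dropped = dropout(layer_output, rate=0.5, rng=rng)
fraction_zeroed = dropped.count(0.0) / len(dropped)  # close to the rate
```

At prediction time the layer would be left untouched, which corresponds to rate=0.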
79
Batch normalization
Batch normalization (which would more accurately be called batch standardization) is a technique that consists of standardizing the outputs of each layer before the units of the subsequent layer receive them as input. In practice, batch normalization results in faster and more stable training, as well as some regularization effect.
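The standardization step can be sketched like this for one mini-batch. The learnable scale and shift parameters that real batch normalization layers apply afterwards are left out, and `eps` is an assumed small constant for numerical stability.

```python
import math


def batch_standardize(batch, eps=1e-5):
    """Standardize each feature (column) across the mini-batch.

    `batch` is a list of rows, one row of layer outputs per example.
    """
    n = len(batch)
    n_features = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(n_features)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n
        for j in range(n_features)
    ]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps)
         for j in range(n_features)]
        for row in batch
    ]


# Toy layer outputs for a batch of three examples, on very different scales.
layer_outputs = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
standardized = batch_standardize(layer_outputs)
```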
80
Early stopping
Early stopping is the way to train a neural network by saving the preliminary model after every epoch and assessing the performance of the preliminary model on the validation set. As you remember from the section about gradient descent in Chapter 4, as the number of epochs increases, the cost decreases. The decreased cost means that the model fits the training data well. However, at some point, after some epoch e, the model can start overfitting: the cost keeps decreasing, but the performance of the model on the validation data deteriorates. If you keep, in a file, the version of the model after each epoch, you can stop the training once you start observing a decreased performance on the validation set. Alternatively, you can keep running the training process for a fixed number of epochs and then, in the end, you pick the best model. Models saved after each epoch are called checkpoints. Some machine learning practitioners rely on this technique very often; others try to properly regularize the model to avoid such undesirable behavior.
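A sketch of the stopping rule applied to a list of per-epoch validation scores. The `patience` parameter (how many epochs without improvement to tolerate) is an assumption added for illustration, and the scores are made up.

```python
def early_stopping(validation_scores, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of validation scores.

    Higher scores are better. Training stops once `patience` epochs have
    passed since the best checkpoint; the checkpoint saved at `best_epoch`
    is the model to keep.
    """
    best_epoch, best_score = 0, validation_scores[0]
    for epoch, score in enumerate(validation_scores):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(validation_scores) - 1


# Hypothetical validation accuracy after each epoch.
val_accuracy = [0.70, 0.78, 0.83, 0.85, 0.84, 0.82, 0.80]
best_epoch, stopped_at = early_stopping(val_accuracy, patience=2)
```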
81
data augmentation.
Another regularization technique that can be applied not just to neural networks, but to virtually any learning algorithm, is called data augmentation. This technique is often used to regularize models that work with images. Once you have your original labeled training set, you can create a synthetic example from an original example by applying various transformations to the original image: zooming it slightly, rotating, flipping, darkening, and so on. You keep the original label in these synthetic examples. In practice, this often results in increased performance of the model.
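A toy illustration of creating synthetic examples from one labeled image, with the image represented as a nested list of grayscale pixel values. The particular transformations and the darkening factor are assumptions.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def darken(image, factor=0.5):
    """Scale every pixel value down by `factor`."""
    return [[int(p * factor) for p in row] for row in image]


def augment(image, label):
    """Return synthetic (image, label) pairs; the label is kept unchanged."""
    return [(t(image), label) for t in (flip_horizontal, rotate_90, darken)]


image = [[0, 50],
         [100, 200]]
synthetic = augment(image, "cat")
```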
82
Handling Multiple Inputs
In many of your practical problems, you will work with multimodal data. For example, your input could be an image and text and the binary output could indicate whether the text describes this image or not. With neural networks, you have more flexibility. You can build two subnetworks, one for each type of input. For example, a CNN subnetwork would read the image while an RNN subnetwork would read the text. Both subnetworks have as their last layer an embedding: the CNN has an embedding for the image, while the RNN has an embedding for the text. You can then concatenate the two embeddings and add a classification layer, such as softmax or sigmoid, on top of the concatenated embeddings. Neural network libraries provide simple-to-use tools that allow concatenating or averaging layers from several subnetworks.
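The final step, concatenating two embeddings and applying a sigmoid classification layer, can be sketched as below. The embeddings, weights and bias are hypothetical numbers standing in for the outputs of trained CNN and RNN subnetworks.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def classify_pair(image_embedding, text_embedding, weights, bias):
    """Concatenate both embeddings and apply one sigmoid unit on top."""
    combined = image_embedding + text_embedding  # list concatenation
    if len(weights) != len(combined):
        raise ValueError("one weight is needed per concatenated feature")
    z = sum(w * v for w, v in zip(weights, combined)) + bias
    return sigmoid(z)


# Hypothetical subnetwork outputs and classification-layer parameters.
image_embedding = [0.5, -1.0]
text_embedding = [2.0]
probability = classify_pair(image_embedding, text_embedding,
                            weights=[1.0, 1.0, 0.5], bias=-0.5)
```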
83
Transfer learning
Transfer learning is probably where neural networks have a unique advantage over the shallow models. In transfer learning, you pick an existing model trained on some dataset, and you adapt this model to predict examples from another dataset, different from the one the model was built on. With neural networks, the situation is much more favorable. Transfer learning in neural networks works like this.
1. You build a deep model on the original big dataset (wild animals).
2. You compile a much smaller labeled dataset for your second model (domestic animals).
3. You remove the last one or several layers from the first model. Usually, these are layers responsible for the classification or regression; they usually follow the embedding layer.
4. You replace the removed layers with new layers adapted for your new problem.
5. You “freeze” the parameters of the layers remaining from the first model.
6. You use your smaller labeled dataset and gradient descent to train the parameters of only the new layers.
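Steps 3 to 6 can be illustrated with a deliberately tiny stand-in: a fixed "embedding" layer plays the role of the frozen layers, and only a new logistic output unit is trained. The weights, dataset, learning rate and epoch count are all hypothetical; a real setup would use a neural network library and a pretrained deep model.

```python
import math

# Stand-in for the layers kept from the first model. They are frozen:
# nothing in the training loop below ever modifies these values.
FROZEN_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]


def frozen_embedding(x):
    """Output of the frozen layers (one linear layer followed by ReLU)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)))
            for row in FROZEN_WEIGHTS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_new_head(examples, labels, epochs=1000, lr=0.1):
    """Train only the new output unit with stochastic gradient descent."""
    w, b = [0.0] * len(FROZEN_WEIGHTS), 0.0
    embeddings = [frozen_embedding(x) for x in examples]
    for _ in range(epochs):
        for e, y in zip(embeddings, labels):
            error = sigmoid(sum(wi * ei for wi, ei in zip(w, e)) + b) - y
            w = [wi - lr * error * ei for wi, ei in zip(w, e)]
            b -= lr * error
    return w, b


def predict(x, w, b):
    e = frozen_embedding(x)
    return sigmoid(sum(wi * ei for wi, ei in zip(w, e)) + b)


# Small labeled dataset for the new problem.
X_small = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
y_small = [1, 1, 0, 0]
w_head, b_head = train_new_head(X_small, y_small)
```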
84
For faster computation? cProfile package?
Use the cProfile package in Python to find inefficiencies in your code. Finally, when nothing can be improved in your code from the algorithmic perspective, you can further boost the speed of your code by using:
* the multiprocessing package to run computations in parallel, and
* PyPy, Numba or similar tools to compile your Python code into fast, optimized machine code.
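One way to profile a single function call with cProfile and read the result with pstats, both from the standard library. The function being profiled and the helper name `profile_call` are hypothetical.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately naive loop, used only as something to profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)  # top 5 entries
    return result, stream.getvalue()


result, report = profile_call(slow_sum_of_squares, 100_000)
```

Printing `report` shows call counts and cumulative times per function, which points to where optimization effort should go first.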
85
Unsupervised Learning
Unsupervised learning deals with problems in which your dataset doesn’t have labels. This property is what makes it very problematic for many practical applications. The absence of labels which represent the desired behavior for your model means the absence of a solid reference point to judge the quality of your model. In this book, I only present unsupervised learning methods that allow building models that can be evaluated based on data as opposed to human judgment.
86
Clustering
Clustering is a problem of learning to assign a label to examples by leveraging an unlabeled dataset. Because the dataset is completely unlabeled, deciding on whether the learned model is optimal is much more complicated than in supervised learning.
87
K-Means
The k-means clustering algorithm works as follows. First, the analyst has to choose k, the number of classes (or clusters). Then we randomly put k feature vectors, called centroids, into the feature space. We then compute the distance from each example x to each centroid c using some metric, like the Euclidean distance. Then we assign the closest centroid to each example (as if we labeled each example with a centroid id as the label). For each centroid, we calculate the average feature vector of the examples labeled with it. These average feature vectors become the new locations of the centroids. The assignment and update steps are then repeated until the assignments stop changing.
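A compact sketch of the procedure in plain Python. Initializing the centroids from k randomly chosen examples is an assumption (one common choice); the card only says the centroids are placed randomly. The toy data is made up.

```python
import math
import random


def kmeans(examples, k, iterations=100, seed=0):
    """Return (centroids, assignments) for a list of feature vectors."""
    rng = random.Random(seed)
    # Assumed initialization: start from k distinct examples.
    centroids = [list(c) for c in rng.sample(examples, k)]
    assignments = None
    for _ in range(iterations):
        # Assign each example to its closest centroid (Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda j: math.dist(x, centroids[j]))
            for x in examples
        ]
        if new_assignments == assignments:
            break  # nothing changed, so the algorithm has converged
        assignments = new_assignments
        # Move each centroid to the average of the examples assigned to it.
        for j in range(k):
            members = [x for x, a in zip(examples, assignments) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, assignments


# Two well-separated groups of points.
points = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0],
          [10.0, 10.0], [11.0, 11.0], [12.0, 12.0]]
centroids, assignments = kmeans(points, k=2)
```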
88
DBSCAN and HDBSCAN
Density-based clustering methods; see the text for details.
89
Dimensionality Reduction
Many modern machine learning algorithms, such as ensemble algorithms and neural networks, handle very high-dimensional examples well, up to millions of features. With modern computers and graphical processing units (GPUs), dimensionality reduction techniques are used much less in practice than in the past. The most frequent use case for dimensionality reduction is data visualization: humans can interpret at most three dimensions on a plot. Another situation in which you could benefit from dimensionality reduction is when you have to build an interpretable model and to do so you are limited in your choice of learning algorithms. For example, you can only use decision tree learning or linear regression. By reducing your data to lower dimensionality and by figuring out which quality of the original example each new feature in the reduced feature space reflects, one can use simpler algorithms. Dimensionality reduction removes redundant or highly correlated features; it also reduces the noise in the data. All of this contributes to the interpretability of the model. The three most widely used techniques of dimensionality reduction are principal component analysis (PCA), uniform manifold approximation and projection (UMAP), and autoencoders.
90
Learning to rank
Learning to rank is a supervised learning problem. Among others, one frequent problem solved using learning to rank is the optimization of search results returned by a search engine for a query. In search result ranking optimization, a labeled example Xi in the training set of size N is a ranked collection of documents of size ri (labels are ranks of documents). A feature vector represents each document in the collection. The goal of the learning is to find a ranking function f which outputs values that can be used to rank documents.
91
mean average precision (MAP).
Mean average precision (mAP) is a metric used to evaluate object detection models such as Fast R-CNN, YOLO, Mask R-CNN, etc. The mean of the average precision (AP) values is calculated over recall values from 0 to 1.
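For the ranking setting of the neighboring cards, average precision is often computed from a ranked list of binary relevance labels, and MAP is its mean over queries. The sketch below assumes that information-retrieval form, with AP averaged over the relevant documents that appear in the list; the relevance lists are made up.

```python
def average_precision(relevance):
    """AP for one ranked list; relevance[i] is 1 if the item at rank i+1
    is relevant and 0 otherwise."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(rankings):
    """MAP: mean of AP over several queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Hypothetical relevance of the top-4 results for two queries.
query_results = [[1, 0, 1, 0],
                 [0, 1, 0, 0]]
ap_first = average_precision(query_results[0])
map_score = mean_average_precision(query_results)
```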
92
LambdaMART
LambdaMART is a technique where ranking is transformed into a pairwise classification or regression problem. The algorithm considers one pair of items at a time, coming up with a viable ordering of those items before producing the final order of the entire list.
93
Learning to Recommend
Learning to recommend is an approach to building recommender systems. Usually, we have a user who consumes some content. We have the history of consumption, and we want to suggest to this user new content that the user would like. It could be a movie on Netflix or a book on Amazon. Traditionally, two approaches were used to give recommendations: content-based filtering and collaborative filtering.
94
Content-based filtering
Content-based filtering is based on learning what users like from the description of the content they consume. For example, if the user of a news site often reads news articles on science and technology, then we would suggest to this user more documents on science and technology. More generally, we could create one training set per user and add news articles to this dataset as a feature vector x and whether the user recently read this news article as a label y. Then we build the model of each user and can regularly examine each new piece of content to determine whether a specific user would read it or not. The content-based approach has many limitations. For example, the user can be trapped in the so-called filter bubble: the system will always suggest to that user the information that looks very similar to what the user already consumed. That could result in complete isolation of the user from information that disagrees with their viewpoints or expands them. On a more practical side, the users might just get recommendations of items they already know about, which is undesirable.
95
Collaborative filtering
Collaborative filtering has a significant advantage over content-based filtering: the recommendations to one user are computed based on what other users consume or rate. For instance, if two users gave high ratings to the same ten movies, then it’s more likely that user 1 will appreciate new movies recommended based on the tastes of user 2 and vice versa. The drawback of this approach is that the content of the recommended items is ignored.
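A toy user-based version of this idea: find the user whose ratings are most similar to the target user's, then suggest items that neighbor rated highly and the target has not seen. Cosine similarity over co-rated items, the `min_rating` threshold and the ratings table are all assumptions for illustration.

```python
import math


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity computed over the items both users rated."""
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    dot = sum(ratings_a[m] * ratings_b[m] for m in shared)
    norm_a = math.sqrt(sum(ratings_a[m] ** 2 for m in shared))
    norm_b = math.sqrt(sum(ratings_b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)


def recommend(target, ratings, min_rating=4):
    """Recommend unseen items rated highly by the most similar user."""
    others = [u for u in ratings if u != target]
    neighbor = max(
        others,
        key=lambda u: cosine_similarity(ratings[target], ratings[u]),
    )
    return sorted(
        item for item, r in ratings[neighbor].items()
        if r >= min_rating and item not in ratings[target]
    )


# Hypothetical ratings on a 1 to 5 scale.
ratings = {
    "user1": {"A": 5, "B": 4, "C": 1},
    "user2": {"A": 5, "B": 5, "C": 1, "D": 5},
    "user3": {"A": 1, "B": 2, "C": 5, "E": 5},
}
suggestions = recommend("user1", ratings)
```

Note that nothing about the items themselves is used, only the ratings, which is exactly the drawback mentioned above.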
96
Word Embeddings
Google - find in book