The Hundred- Page Machine Learning (Book) Flashcards
(96 cards)
What is Machine Learning
Machine learning is a subfield of computer science that is concerned with building algorithms
which, to be useful, rely on a collection of examples of some phenomenon. These examples
can come from nature, be handcrafted by humans or generated by another algorithm.
Machine learning can also be defined as the process of solving a practical problem by 1)
gathering a dataset, and 2) algorithmically building a statistical model based on that dataset.
Types of Learning
Learning can be supervised, semi-supervised, unsupervised and reinforcement.
Supervised Learning
The goal of a supervised learning algorithm is to use the dataset to produce a model
that takes a feature vector x as input and outputs information that allows deducing the label
for this feature vector. For instance, the model created using the dataset of people could
take as input a feature vector describing a person and output a probability that the person
has cancer.
Unsupervised Learning
In unsupervised learning, the dataset is a collection of unlabeled examples {xi}N
i=1.
Again, x is a feature vector, and the goal of an unsupervised learning algorithm is
to create a model that takes a feature vector x as input and either transforms it into
another vector or into a value that can be used to solve a practical problem. For example,
in clustering, the model returns the id of the cluster for each feature vector in the dataset.
In dimensionality reduction, the output of the model is a feature vector that has fewer
features than the input x; in outlier detection, the output is a real number that indicates
how x is di
erent from a “typical” example in the dataset.
Semi-Supervised Learning
In semi-supervised learning, the dataset contains both labeled and unlabeled examples.
Usually, the quantity of unlabeled examples is much higher than the number of labeled
examples. The goal of a semi-supervised learning algorithm is the same as the goal of
the supervised learning algorithm. The hope here is that using many unlabeled examples can
help the learning algorithm to find (we might say “produce” or “compute”) a better model2.
Reinforcement Learning
Reinforcement learning is a subfield of machine learning where the machine “lives” in an
environment and is capable of perceiving the state of that environment as a vector of
features. The machine can execute actions in every state. Different actions bring different rewards and could also move the machine to another state of the environment. The goal
of a reinforcement learning algorithm is to learn a policy. A policy is a function f (similar
to the model in supervised learning) that takes the feature vector of a state as input and
outputs an optimal action to execute in that state. The action is optimal if it maximizes the
expected average reward.
bag of words?
-the first feature is equal to 1 if the email message contains the word “a”; otherwise,
this feature is 0;
* the second feature is equal to 1 if the email message contains the word “aaron”; otherwise,
this feature equals 0;
* …
* the feature at position 20,000 is equal to 1 if the email message contains the word
“zulu”; otherwise, this feature is equal to 0.
Now you have a machine-readable input data, but the output labels are still in the form of
human-readable text.
scalar
A scalar is a simple numerical value, like 15 or ≠3.25. Variables or constants that take scalar
values are denoted by an italic letter, like x or a.
vector
A vector is an ordered list of scalar values, called attributes. We denote a vector as a bold
character, for example, x or w. Vectors can be visualized as arrows that point to some
directions as well as points in a multi-dimensional space. Illustrations of three two-dimensional
vectors, a = [2, 3], b = [≠2, 5], and c = [1, 0] is given in fig. 1. We denote an attribute of a
vector as an italic value with an index, like this: w(j) or x(j)
. The index j denotes a specific
dimension of the vector, the position of an attribute in the list. For instance, in the vector a
shown in red in fig. 1, a(1) = 2 and a(2) = 3.
set
A set is an unordered collection of unique elements. We denote a set as a calligraphic capital character, for example, S.
The sum of two vectors
The sum of two vectors x + z is defined as the vector [x(1) + z(1), x(2) + z(2),…,x(m) + z(m)].
vector multiplied by a scalar
A vector multiplied by a scalar is a vector. For example xc = [cx(1), cx(2), . . . , cx(m)].
dot-product of two vectors
A dot-product of two vectors is a scalar. For example, wx def
= qm
i=1 w(i)
x(i)
. In some books,
the dot-product is denoted as w · x. The two vectors must be of the same dimensionality.
Otherwise, the dot-product is undefined.
(0, 1) contains 0 and 1? what about [0,1]?
() no [] yes.
Derivative and Gradient
A derivative fÕ of a function f is a function or a value that describes how fast f grows (or
decreases). If the derivative is a constant value, like 5 or ≠3, then the function grows (or
decreases) constantly at any point x of its domain. If the derivative fÕ is a function, then the
function f can grow at a different pace in different regions of its domain. If the derivative fÕ
is positive at some point x, then the function f grows at this point. If the derivative of f is
negative at some x, then the function decreases at this point. The derivative of zero at x
means that the function’s slope at x is horizontal.
probability distribution
The probability distribution of a discrete random variable is described by a list of probabilities
associated with each of its possible values. This list of probabilities is called probability mass
function (pmf). For example: Pr(X = red)=0.3, Pr(X = yellow)=0.45, Pr(X = blue) =
0.25. Each probability in a probability mass function is a value greater than or equal to 0.
The sum of probabilities equals 1
continuous random variable
A continuous random variable takes an infinite number of possible values in some interval.
Examples include height, weight, and time. Because the number of values of a continuous
random variable X is infinite, the probability Pr(X = c) for any c is 0. Therefore, instead
of the list of probabilities, the probability distribution of a continuous random variable (a
continuous probability distribution) is described by a probability density function (pdf). The
pdf is a function whose codomain is nonnegative and the area under the curve is equal to 1
Bayes’ Rule
The conditional probability Pr(X = x|Y = y) is the probability of the random variable X to
have a specific value x given that another random variable Y has a specific value of y. The
Bayes’ Rule (also known as the Bayes’ Theorem) stipulates that:
Model-Based vs. Instance-Based Learning
Most supervised learning algorithms are model-based. We have already seen one such
algorithm: SVM. Model-based learning algorithms use the training data to create a model
that has parameters learned from the training data. In SVM, the two parameters we saw
were wú and bú. After the model was built, the training data can be discarded.
Instance-based learning algorithms use the whole dataset as the model. One instance-based
algorithm frequently used in practice is k-Nearest Neighbors (kNN). In classification, to
predict a label for an input example the kNN algorithm looks at the close neighborhood of
the input example in the space of feature vectors and outputs the label that it saw the most
often in this close neighborhood.
Shallow vs. Deep Learning
A shallow learning algorithm learns the parameters of the model directly from the features
of the training examples. Most supervised learning algorithms are shallow. The notorious
exceptions are neural network learning algorithms, specifically those that build neural
networks with more than one layer between input and output. Such neural networks are
called deep neural networks. In deep neural network learning (or, simply, deep learning),
contrary to shallow learning, most model parameters are learned not directly from the features
of the training examples, but from the outputs of the preceding layers.
Decision Tree Learning
A decision tree is an acyclic graph that can be used to make decisions. In each branching
node of the graph, a specific feature j of the feature vector is examined. If the value of the
feature is below a specific threshold, then the left branch is followed; otherwise, the right
branch is followed. As the leaf node is reached, the decision is made about the class to which
the example belongs.
k-Nearest Neighbors
k-Nearest Neighbors (kNN) is a non-parametric learning algorithm. Contrary to other
learning algorithms that allow discarding the training data after the model is built, kNN
keeps all training examples in memory. Once a new, previously unseen example x comes in,
the kNN algorithm finds k training examples closest to x and returns the majority label (in
case of classification) or the average label (in case of regression).
Building Blocks of a Learning Algorithm
1) a loss function;
2) an optimization criterion based on the loss function (a cost function, for example); and
3) an optimization routine that leverages training data to find a solution to the optimization
criterion.
Gradient descent
Gradient descent is an iterative optimization algorithm for finding the minimum of a function.
To find a local minimum of a function using gradient descent, one starts at some random
point and takes steps proportional to the negative of the gradient (or approximate gradient)
of the function at the current point.