CH1: The Machine Learning Landscape Flashcards

Question

What is a example of semisupervised learning algorithms?

Answer 1

Most semisupervised learning algorithms are combinations of unsupervised and supervised algorithms. For example, deep belief networks (DBNs) are based on unsu‐ pervised components called restricted Boltzmann machines (RBMs) stacked on top of one another.

Answer 2

RBMs are trained sequentially in an unsupervised manner, and then the whole system is fine-tuned using supervised learning techniques.

Answer 3

The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards, as in Figure 1-12). It must then learn by itself what is the best strategy, called a policy, to get the most reward over time.

Answer 4

A policy defines what action the agent should choose when it is in a given situation.

Answer 5

For example, many robots implement Reinforcement Learning algorithms to learn how to walk. DeepMind’s AlphaGo program is also a good example of Reinforcement Learning: It learned its winning policy by analyzing millions of games, and then playing many games against itself. Note that learning was turned off during the games against the champion; AlphaGo was just applying the policy it had learned.

Answer 6

Another criterion used to classify Machine Learning systems is whether or not the system can learn incrementally from a stream of incoming data.

Answer 7

In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data.

Answer 8

This will generally take a lot of time and computing resources, so it is typically done offline. First the system is trained, and then it is launched into production and runs without learning anymore; it just applies what it has learned. This is called offline learning

Answer 9

If you want a batch learning system to know about new data you need to train a new version of the system from scratch on the full dataset then stop the old system and replace it with the new one This solution is simple and often works fine, but training using the full set of data can take many hours, Also, training on the full set of data requires a lot of computing resources (CPU, memory space, disk space, disk I/O, network I/O, etc.). If you have a lot of data and you automate your system to train from scratch every day, it will end up costing you a lot of money. If the amount of data is huge, it may even be impossible to use a batch learning algorithm. Finally, if your system needs to be able to learn autonomously and it has limited resources then carrying around large amounts of training data and taking up a lot of resources to train for hours every day is a showstopper.

Answer 10

In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or by small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives

Answer 11

Online learning is great for systems that receive data as a continuous flow (e.g., stock prices) and need to adapt to change rapidly or autonomously. It is also a good option if you have limited computing resources: once an online learning system has learned about new data instances, it does not need them anymore, so you can discard them (unless you want to be able to roll back to a previous state and “replay” the data). This can save a huge amount of space.

Answer 12

Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine’s main memory (this is called out-of-core learning). The algorithm loads part of the data, runs a training step on that data, and repeats the process until it has run on all of the data

Answer 13

Out-of-core learning is usually done offline (i.e., not on the live system), so online learning can be a confusing name. Think of it as incremental learning.

Answer 14

One important parameter of online learning systems is how fast they should adapt to changing data: this is called the learning rate.

Answer 15

If you set a high learning rate, then your system will rapidly adapt to new data, but it will also tend to quickly forget the old data

Answer 16

if you set a low learning rate, the system will have more inertia; that is, it will learn more slowly, but it will also be less sensitive to noise in the new data or to sequences of nonrepresentative data points (outliers).

Answer 17

A big challenge with online learning is that if bad data is fed to the system, the sys‐ tem’s performance will gradually decline. If we are talking about a live system, your clients will notice.

Answer 18

To reduce this risk, you need to monitor your system closely and promptly switch learning off (and possibly revert to a previously working state) if you detect a drop in performance. You may also want to monitor the input data and react to abnormal data (e.g., using an anomaly detection algorithm).

Answer 19

Most Machine Learning tasks are about making predictions. This means that given a number of training examples, the system needs to be able to generalize to examples it has never seen before. Having a good performance measure on the training data is good, but insufficient; the true goal is to perform well on new instances.

Answer 20

this is called instance-based learning: the system learns the examples by heart, then generalizes to new cases by comparing them to the learned examples (or a subset of them), using a similarity measure.

Answer 21

Another way to generalize from a set of examples is to build a model of these exam‐ ples, then use that model to make predictions. This is called model-based learning

Answer 22

model selection: you selected a linear model of life satisfac‐ tion with just one attribute

Answer 23

To answer this question, you need to specify a performance measure. You can either define a utility function (or fitness function) that measures how good your model is, or you can define a cost function that measures how bad it is.

Answer 24

For linear regression problems, people typically use a cost function that measures the distance between the linear model’s predictions and the training examples; the objective is to minimize this distance.

Answer 25

This is where the Linear Regression algorithm comes in: you feed it your training examples and it finds the parameters that make the linear model fit best to your data. This is called training the model.

Answer 26

You trained it on the training data (i.e., the learning algorithm searched for the model parameter values that minimize a cost function)

Answer 27

In summary: * You studied the data. * You selected a model. * You trained it on the training data (i.e., the learning algorithm searched for the model parameter values that minimize a cost function). * Finally, you applied the model to make predictions on new cases (this is called inference), hoping that this model will generalize well.

Answer 28

In short, since your main task is to select a learning algorithm and train it on some data, the two things that can go wrong are “bad algorithm” and “bad data.

Answer 29

1. Insufficient Quantity of Training Data 2. nonrepresentative training data 3. poor-quality data 4. irrelevant features

Answer 30

1. overfitting the training data 2. underfitting the triaing data

Answer 31

The idea that data matters more than algorithms for complex problems

Answer 32

In order to generalize well, it is crucial that your training data be representative of the new cases you want to generalize to. This is true whether you use instance-based learning or model-based learning.

Answer 33

if the sample is too small, you will have sampling noise (i.e., nonrepresentative data as a result of chance)

Answer 34

very large samples can be nonrepresentative if the sampling method is flawed. This is called sampling bias

Answer 35

Nonresponse bias occurs when survey participants are unwilling or unable to respond to a survey question or an entire survey

Answer 36

Obviously, if your training data is full of errors, outliers, and noise (e.g., due to poorquality measurements)W

Answer 37

A critical part of the success of a Machine Learning project is coming up with a good set of features to train on

Answer 38

* Feature selection: selecting the most useful features to train on among existing features. * Feature extraction: combining existing features to produce a more useful one (as we saw earlier, dimensionality reduction algorithms can help). * Creating new features by gathering new data.

Answer 39

it means that the model performs well on the training data, but it does not generalize well

Answer 40

Overfitting happens when the model is too complex relative to the amount and noisiness of the training data.

Answer 41

* To simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data or by constraining the model * To gather more training data * To reduce the noise in the training data (e.g., fix data errors and remove outliers)

Answer 42

Constraining a model to make it simpler and reduce the risk of overfitting is called regularization.

Answer 43

For example, the linear model we defined earlier has two parameters, θ0 and θ1 . This gives the learning algorithm two degrees of freedom to adapt the model to the training data: it can tweak both the height (θ0 we forced θ1) and the slope (θ1) of the line.

Answer 44

You want to find the right balance between fitting the training data perfectly and keeping the model simple enough to ensure that it will generalize well.

Answer 45

A hyperparameter is a parameter of a learning algorithm (not of the model). As such, it is not affected by the learning algorithm itself; it must be set prior to training and remains constant during training.

Answer 46

The amount of regularization to apply during learning can be controlled by a hyper‐ parameter

Answer 47

If you set the regularization hyper‐ parameter to a very large value, you will get an almost flat model (a slope close to zero); the learning algorithm will almost certainly not overfit the training data, but it will be less likely to find a good solution.

Answer 48

it occurs when your model is too simple to learn the underlying structure of the data.

Answer 49

* Selecting a more powerful model, with more parameters * Feeding better features to the learning algorithm (feature engineering) * Reducing the constraints on the model (e.g., reducing the regularization hyper‐ parameter)

Answer 50

The only way to know how well a model will generalize to new cases is to actually try it out on new cases. \this is done by spplitting the data into a training and test set

Answer 51

The error rate on new cases is called the generalization error (or out-ofsample error), and by evaluating your model on the test set, you get an estimate of this error. This value tells you how well your model will perform on instances it has never seen before.

Answer 52

If the training error is low (i.e., your model makes few mistakes on the training set) but the generalization error is high, it means that your model is overfitting the train‐ ing data.

Answer 53

One option is to train both and compare how well they generalize using the test set.

Answer 54

One option is to train 100 different models using 100 different values for this hyperparameter. Suppose you find the best hyperparame‐ ter value that produces a model with the lowest generalization error, say just 5% error.

Answer 55

holdout validation: you simply hold out part of the training set to evaluate several candidate models and select the best one. More specifically, you train multiple models with various hyperparameters on the reduced training set (i.e., the full training set minus the validation set), and you select the model that performs best on the validation set. After this holdout vali‐ dation process, you train the best model on the full training set (including the valida‐ tion set), and this gives you the final model. Lastly, you evaluate this final model on the test set to get an estimate of the generalization error.

Answer 56

However, if the validation set is too small, then model evaluations will be imprecise: you may end up selecting a suboptimal model by mistake. Conversely, if the validation set is too large, then the remaining training set will be much smaller than the full training set.

Answer 57

One way to solve this problem is to perform repeated cross-validation, using many small validation sets. Each model is evaluated once per validation set, after it is trained on the rest of the data. By averaging out all the evaluations of a model, we get a much more accurate measure of its performance.

Answer 58

the training time is multiplied by the number of valida‐ tion sets

Answer 59

the most important rule to remember is that the validation set and the test must be as representative as possible of the data you expect to use in production,

Answer 60

One sol‐ ution is to hold out part of the training pictures (from the web) in yet another set that Andrew Ng calls the train-dev set. After the model is trained (on the training set, not on the train-dev set), you can evaluate it on the train-dev set: if it performs well, then the model is not overfitting the training set, so if performs poorly on the validation set, the problem must come from the data mismatch. Conversely, if the model performs poorly on the train-dev set, then the model must have overfit the training set, so you should try to simplify or regularize the model, get more training data and clean up the training data

Answer 61

if you make absolutely no assumption about the data, then there is no reason to prefer one model over any other There is no model that is a priori guaranteed to work better (hence the name of the theorem). The only way to know for sure which model is best is to evaluate them all. Since this is not possible, in practice you make some reasonable assumptions about the data and you evaluate only a few reasonable models

CH1: The Machine Learning Landscape Flashcards

(86 cards)