General Flashcards
What are “labels”?
The thing we’re predicting. e.g. the “y” variable in a simple linear regression.
What is a “feature”?
An input variable, e.g. the “x” in a simple linear regression. There may be multiple input variables, indicated as {x1, x2, … xN}.
What is an “example”?
An instance of data, either “labeled” or “unlabeled”. Labeled examples are used to train the model, which can then predict the labels on unlabeled examples.
What is “training” and “inference”?
Training is using labeled data to create, or “learn”, the model. Inference is using the model to make predictions on unlabeled data.
What is “regression” and “classification”?
A regression model predicts continuous values, e.g. prices. A classification model predicts discrete values, e.g. spam or not spam, or the animal in an image.
What is “linear regression”?
Linear regression is a method for finding the straight line or hyperplane that best fits a set of points.
What is “L2 Loss”?
Also known as “squared error”, this is simply the sum of “(prediction − actual)²” over the examples.
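A minimal sketch of L2 loss as a function (the name `l2_loss` is my own, not from the source):

```python
def l2_loss(predictions, actuals):
    """Sum of squared differences between predicted and actual labels."""
    return sum((p - a) ** 2 for p, a in zip(predictions, actuals))

# e.g. l2_loss([3, 5], [1, 5]) -> (3-1)^2 + (5-5)^2 = 4
```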
How would you write the equation for a simple linear regression model?
y’ = b + w1 x1, where y’ is the predicted label, b is the bias (y intercept), w1 is weight of feature 1, and x1 is a feature (a known input). With more features, “b + w1 x1 + w2 x2…”
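The equation above can be sketched as a small function (names here are illustrative, not from the source):

```python
def predict(bias, weights, features):
    """y' = b + w1*x1 + w2*x2 + ... for a single example."""
    return bias + sum(w * x for w, x in zip(weights, features))

# e.g. predict(1.0, [2.0, 0.5], [3.0, 4.0]) -> 1.0 + 6.0 + 2.0 = 9.0
```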
What is the process of trying many example models to find one that minimizes loss called?
Empirical risk minimization.
What is MSE?
Mean squared error. The sum of the squared losses for each example, divided by the number of examples. It is commonly used, but it is neither the only loss function nor the best one for all cases.
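A sketch of MSE as a function (the name `mse` is my own):

```python
def mse(predictions, actuals):
    """Mean squared error: average of the squared per-example losses."""
    return sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / len(predictions)

# e.g. mse([3, 5], [1, 5]) -> (4 + 0) / 2 = 2.0
```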
What does it mean that a model has “converged”?
In training the model, the loss has reached a point where each iteration gives little or no improvement.
What is the shape of the “loss” vs “weight” plot for regression problems?
Convex (i.e. bowl-shaped). Such plots have only one minimum (i.e. one place where the slope is exactly 0).
What is the symbol “Δ”?
The uppercase form of the Greek letter delta, it is used to represent “change in”. e.g. in the “slope-intercept” equation “y = mx + b”, “Δy = m Δx”.
What is “∂f / ∂x”?
The partial derivative of f with respect to x (the derivative of f considered as a function of x alone).
What is “∇f”?
The gradient of a function, denoted as shown, is the vector of partial derivatives with respect to all of the independent variables.
What is a “gradient”?
The gradient is a multi-variable generalization of the derivative. Like the derivative, the gradient represents the slope of the tangent of the graph of the function. More precisely, the gradient points in the direction of the greatest rate of increase of the function, and its magnitude is the slope of the graph in that direction.
What is the gradient descent algorithm?
The gradient descent algorithm takes a step in the direction of the negative gradient in order to reduce loss as quickly as possible. To determine the next point along the loss function curve, the algorithm subtracts some fraction of the gradient from the starting point.
What is the “learning rate”?
Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point.
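The two cards above can be sketched as a simple loop on a one-variable function (the example function and all names are my own):

```python
def gradient_descent(grad, start, learning_rate, steps):
    """Repeatedly step against the gradient: w <- w - learning_rate * grad(w)."""
    w = start
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

# Minimize f(w) = (w - 3)^2; its derivative is 2*(w - 3), so the minimum is at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), start=0.0, learning_rate=0.1, steps=100)
```

With too large a learning rate (here, anything at or above 1.0), the updates would overshoot and diverge instead of converging toward 3.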
What are hyperparameters?
Hyperparameters are the knobs that programmers tweak in machine learning algorithms. Most machine learning programmers spend a fair amount of time tuning the learning rate. e.g. if you pick a learning rate that is too small, learning will take too long; if it is too large, the updates may overshoot the minimum.
In gradient descent, what is a “batch”?
The total number of examples you use to calculate the gradient in a single iteration.
What is SGD and “mini-batch SGD”?
Stochastic Gradient Descent uses one random example for each iteration of training. This is fast but can be noisy. “Mini-batch” SGD chooses a number of random examples (typically 10–1000), which has less noise and is still much more efficient than “full batch”.
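A rough sketch of mini-batch SGD for a one-feature linear model y’ = b + w·x with squared-error loss (the data, names, and settings below are illustrative):

```python
import random

def sgd_step(w, b, batch, learning_rate):
    """One mini-batch update: gradients of MSE with respect to w and b."""
    n = len(batch)
    grad_w = sum(2 * (b + w * x - y) * x for x, y in batch) / n
    grad_b = sum(2 * (b + w * x - y) for x, y in batch) / n
    return w - learning_rate * grad_w, b - learning_rate * grad_b

# Noiseless data drawn from y = 2x + 1; each step uses a random mini-batch of 2.
random.seed(0)
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = 0.0, 0.0
for _ in range(5000):
    w, b = sgd_step(w, b, random.sample(data, 2), learning_rate=0.05)
```

After the loop, w and b should be close to the true values 2 and 1; with batch size 1 the trajectory would be noisier, and with the full batch each step would be smoother but more expensive.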
What is root mean squared error? What is nice about it?
It is just the square root of MSE (mean squared error). A nice property of RMSE is that it can be interpreted on the same scale as the original targets.
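As a sketch (the name `rmse` is my own):

```python
import math

def rmse(predictions, actuals):
    """Root mean squared error: back on the same scale as the targets."""
    mean_sq = sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / len(predictions)
    return math.sqrt(mean_sq)

# e.g. rmse([3, 5], [1, 5]) -> sqrt(2.0)
```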
What two sets are the data typically divided into?
The training set, used to build the model, and the test set, used to validate the accuracy of the model. The subsets should stay consistent throughout training, and you should never train on the test data.
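One way to keep the split consistent across runs is to shuffle with a fixed seed (a sketch; the function name and defaults are my own):

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=42):
    """Shuffle with a fixed seed so the same split is reproduced every run."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(range(10))  # 8 training examples, 2 test examples
```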
What other subset might be added to the training and test sets, and why?
A validation set can be used to avoid overfitting. Tweak the hyperparameters until the validation set does well, then verify against the test set that it holds.