MT1 Flashcards
(47 cards)
Prediction
Given input X, we are interested in predicting the output, Y.
Complicated models are good at prediction, but hard to understand.
100% Prediction: We care more about prediction accuracy, and will sacrifice interpretability for it.
Inference
Given input X, we are interested in understanding its relationship with Y.
100% Inference: We care more about interpretability, and will sacrifice accuracy for it.
Estimating f
- Gather data from a subset of the population of interest (through experimentation, observation, etc.), because it is usually impossible to sample the entire true population.
We now have a set of TRAINING data where predictor X and response Y are BOTH known. The true relationship f between X and Y will never be known, but we want to get as close as possible.
- We want to predict what future unknown Y values will be based on given X values.
- Using the gathered data, we can try out different models to see which minimizes the residual error, refining the fit through testing, and use that model to predict future values.
- We can split the original data set into training and testing sets, and evaluate the chosen model on the testing set.
Parameters
Quantities such as the mean, standard deviation, and proportions are important values called the “parameters” of the TRUE population.
Since we will never know these true parameters, we calculate estimates of them from the sample data (subset) taken from the population. These estimations are called “Statistics”.
Statistics are estimates of the parameters.
Parametric vs Non-parametric
Parametric: procedures rely on assumptions about the shape of the distribution of the underlying population from which the sample was taken. Most common assumption is that population is normally distributed. Generally better at inference.
Non-parametric procedures make no assumptions about the underlying population. The model structure is determined by the data. Generally better at prediction.
CAVEAT: “Connect the dots” is a perfect non-parametric fit to the training data, but gives horrible predictions.
Response variable
Response variable Y will generally be in the form of categorical (color, shape etc…) or numeric.
MSE
MEAN squared error: MSE is the average of the squared distances from each response value Y in the training data to the predicted response value (on the prediction line) at the given X value.
We want to find a line that minimizes the MSE for FUTURE predictions. THIS IS WHAT MAKES A GOOD MODEL: minimize the mean squared error for FUTURE observations.
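As a quick sketch, the MSE can be computed directly for a fitted line. The data points and the coefficients below are made up purely for illustration:

```python
# Hypothetical training data and a hypothetical fitted line y_hat = b0 + b1*x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]             # observed responses
b0, b1 = 0.2, 1.9                    # made-up fitted intercept and slope

# Squared distance from each observed Y to the line, averaged over the data.
y_hat = [b0 + b1 * xi for xi in x]
mse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) / len(y)
print(mse)   # ~0.025
```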
Overfitting
Adding flexibility to the model (e.g. moving from linear to quadratic regression) will always decrease the MSE on the training data, but not necessarily the TESTING MSE.
e.g. Connect the dots fits the training data perfectly (0 MSE), but does horribly on future observations.
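This can be sketched numerically (all data made up): fitting polynomials of increasing degree to the same points never increases the training MSE, even though the true relationship is linear.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = 2 * x + rng.normal(0, 0.3, size=10)    # truth is linear, plus noise

train_mses = []
for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)      # least-squares polynomial fit
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    train_mses.append(mse)
    print(degree, mse)

# A degree-9 polynomial through 10 points interpolates them: training MSE
# near zero, yet its predictions between the points can be wild.
```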
Irreducible error
The inherent natural variability of the true population of interest.
Error due to squared bias
This is a REducible error.
The inability of a statistical method to capture the TRUE relationship of the data.
If the average of a model’s predictions across different testing data is substantially different from the TRUE response values, that model is said to have high bias.
e.g. If we fit a linear model to data whose true relationship is quadratic, it will have a higher MSE; it has high bias.
Error due to variance
This is a REducible error.
The amount by which the MSE of a model fit varies across data sets.
- We have a set of training data.
- We choose a statistical method, and apply it to that training data, which generates a model fit representing a relationship (hopefully the true relationship), and a resulting MSE from that fit.
- We then apply that model to a new set of testing data, which results in a new MSE for the predictions.
- The amount by which the MSE changes between the training fit and new data sets reflects the variance.
- If the MSE varies a lot across data sets, the model has high variance; if it varies very little, the model has low variance.
Variance is only concerned with how much the MSE of our chosen model fit varies between different data sets, NOT with how accurate its predictions are.
If we fit a highly flexible model (e.g. a high-degree polynomial) to data whose true relationship is linear or close to linear, it will fit the training data very well: the prediction line will pass through, or very close to, the true response values (MSE ~0, aka low bias). But once we apply that model to a new data set, sometimes its predictions may be good (low MSE), and sometimes the true response values will not fall close to the line anymore (since the line was so specific to the training data), resulting in a much larger MSE. The MSE varies a lot, meaning it is hard to predict how well this model will fit future data sets.
Quick Bias vs Variance
Bias: The difference between the model fit and the true relationship. Concerned with the accuracy of the model.
Variance: The difference in the model fit’s MSE across different data sets. Concerned with the consistency of the MSE across data sets.
Overfitting
Overfitting is when a highly flexible model (e.g. a high-degree polynomial) is chosen to fit training data whose true relationship is much simpler (e.g. linear). This results in low bias, high variance.
Underfitting
Underfitting is when a low-flexibility model (e.g. linear or a low-degree polynomial) is chosen to fit data whose true relationship is highly flexible. This results in high bias, low variance.
Classification
When Y is a categorical variable, then we must use classification techniques. Mean squared error no longer applies, so we are concerned with error rates.
The error rate is the fraction of observations our model misclassifies. We are more interested in the error rate on the testing set than on the training set.
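A sketch of the error-rate computation, with made-up labels:

```python
# Hypothetical true labels and model predictions.
y_true = ["red", "blue", "red", "red", "blue"]
y_pred = ["red", "red", "red", "blue", "blue"]

# Error rate = number of misclassifications / total observations.
errors = sum(t != p for t, p in zip(y_true, y_pred))
error_rate = errors / len(y_true)
print(error_rate)   # 2 of 5 misclassified -> 0.4
```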
Bayes Classifier
The Bayes classifier is the true relationship f of the data when the response variable is categorical: it assigns each observation to its most likely class given X.
It is the f that we are attempting to estimate, and has no reducible error, only irreducible error.
K-nearest Neighbors
This is a simple, non-parametric (no assumptions on underlying data) and lazy (minimal or no training phase) classification algorithm that attempts to estimate the Bayes classifier.
When a new data point arrives, the algorithm looks at the K nearest training points around the new point. The majority class among those K neighbors wins, and the new point is predicted as that class.
KNN predicts discrete values in classification.
It can also be used for regression, by finding the K nearest neighbors of a new continuous data point, and outputting the average of those K points.
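A minimal sketch of KNN classification in plain Python (the points and labels are invented for illustration; `math.dist` requires Python 3.8+):

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """train: list of ((x1, x2), label) pairs. Majority vote of the k nearest."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Two made-up clusters of 2-D points.
train = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
         ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
print(knn_classify(train, (0.5, 0.5), k=3))   # -> A
print(knn_classify(train, (5.5, 5.5), k=3))   # -> B
```

For KNN regression, the majority vote would be replaced by the average of the k nearest responses.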
Regression
Regression (analysis) is a SET of statistical methods used to estimate the relationship between a response variable and one or more predictor variables. More specifically, regression estimates the average value of response Y, when one predictor varies, and all other predictors are held constant.
It is primarily used for prediction or forecasting, but can also show which predictors have the greatest influence on the response variable, as well as probability distributions.
KNN K=1
With K=1, each training data point’s closest neighbor is itself, which results in overfitting. The training data will have ZERO misclassifications because the classification boundary separates all classes from each other (low bias), but the misclassification rate on testing data will vary widely (high variance).
Small values of K tend to favor the classes in a point’s immediate area.
KNN K=n
The majority class from the training data will always be predicted, because there won’t be any classification boundaries.
All testing data will be predicted as the majority class from the training data (High bias), but we will always know what the prediction will be (low variance).
Large values of K tend to favor majority classes.
P-value
The p-value is a measure of the strength of evidence against the null hypothesis.
The probability of observing a result at least as extreme as the one observed, if the NULL hypothesis were true.
Coin Flip:
Ho = Fair Coin
Ha = the coin is not fair (e.g. a two-tailed coin).
We get 6T in a row, which has about a 1.6% probability of happening with a fair coin (0.5^6 = 0.0156). That is very rare! The p-value is 0.0156.
For this example, it is so unlikely to have happened randomly, that we DON’T THINK IT WAS RANDOM. Reject the NULL.
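The arithmetic behind that p-value can be checked in one line:

```python
# P(6 tails in a row) under Ho (fair coin): each flip is tails with prob 0.5.
p_value = 0.5 ** 6
print(p_value)   # -> 0.015625, i.e. about 1.6%
```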
Linear (and multi) regression assumption
1–There exists some approximately linear relationship between Y and X. (no other kind of relationship)
2–The distribution of error has constant variance.
3–Error is normally distributed.
4–Errors are independent of each other, thus not affecting each other.
Hypothesis testing steps.
1– Choose the null and alternative hypothesis
2– Decide on assumptions, e.g. the significance level.
3– Compute the test statistic and/or p-value.
4– Decision: Reject or fail to reject the null hypothesis.
5– Interpretation back to the original context. What do we do with the decision? Plain English.
How to interpret ^B2
This is the estimated slope (coefficient) of the second predictor variable in a multiple linear regression (a regression with more than one predictor).
It is interpreted as the expected change in the response for a one-unit increase in that predictor, holding all other predictors constant.
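A sketch with made-up, noise-free data where the truth is y = 1 + 2*x1 + 3*x2, so the fitted B2_hat should recover 3: a one-unit increase in x2, holding x1 constant, raises y by 3.

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.uniform(0, 1, 20)
x2 = rng.uniform(0, 1, 20)
y = 1 + 2 * x1 + 3 * x2                 # hypothetical true model, no noise

X = np.column_stack([np.ones_like(x1), x1, x2])   # design matrix w/ intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # least-squares fit
b0_hat, b1_hat, b2_hat = beta
print(b2_hat)   # -> 3.0 (up to floating-point error)
```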