Regression Flashcards
KNN Classification - training and predicting
Training: store all the data
Prediction:
1. calculate the distance from x to all points in your dataset
2. sort the points in your dataset by increasing distance from x
3. predict the majority label of the k closest points
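A minimal numpy sketch of those three steps (X_train, y_train, the query point x, and k are assumed to exist; the names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    # 1. distance from x to every training point (Euclidean here)
    dists = np.linalg.norm(X_train - x, axis=1)
    # 2. sort training points by increasing distance, keep the k closest
    nearest = np.argsort(dists)[:k]
    # 3. predict the majority label among those k points
    return Counter(y_train[nearest]).most_common(1)[0][0]
```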
Distance
- euclidean distance, manhattan distance, cosine distance = 1 - cosine similarity
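Quick sketches of the three metrics with numpy/scipy (scipy's cosine already returns 1 - cosine similarity):

```python
import numpy as np
from scipy.spatial.distance import cosine

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 1.0])

euclidean = np.linalg.norm(a - b)   # sqrt of the sum of squared differences
manhattan = np.abs(a - b).sum()     # sum of absolute differences
cosine_dist = cosine(a, b)          # 1 - cosine similarity
```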
KNN Regression
We take the average of the target values of the k nearest neighbors
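Same idea as the classification sketch above, but averaging the neighbors' target values instead of voting (illustrative names):

```python
import numpy as np

def knn_regress(X_train, y_train, x, k):
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training point
    nearest = np.argsort(dists)[:k]              # indices of the k closest points
    return y_train[nearest].mean()               # predict the average of their targets
```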
KNN Hyperparameters
K
- how many neighbors to use. Rule of thumb: start with k = sqrt(n), then grid search around that value (see the sketch below).
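A sketch of that rule of thumb with scikit-learn's GridSearchCV (X and y are assumed to exist; the grid width around sqrt(n) is illustrative):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

start = int(np.sqrt(len(X)))  # rule of thumb: k ~ sqrt(n)
param_grid = {"n_neighbors": list(range(max(1, start - 10), start + 11))}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```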
KNN - noise vs signal
lower k tends to overfit - captures more noise along with the signal
higher k tends to underfit - captures less noise, but also less signal
standardization
use standardization when your features are on different scales
standardized value = (data point - mean) / standard deviation
this rescales each feature to mean 0 and standard deviation 1
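Sketch with scikit-learn's StandardScaler, which applies (x - mean) / std per feature (X_train and X_test are assumed to exist):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean/std on training data only
X_test_std = scaler.transform(X_test)        # reuse the training mean/std on test data
```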
KNN (pros, cons)
pros:
- super simple
- training is trivial
- easy to add more data
- few hyperparameters
cons:
- if you have a lot of features, you need a lot more data, and gathering more data can be costly
- high prediction cost
- bad with high dimensions - anything much beyond ~5 features tends to hurt
- categorical features don’t work well
Mean Squared Error (MSE)
expected value of the square of the error
MSE = (1/n) * Σ (predicted_i - actual_i)^2
- the average squared difference between the estimated values and the actual values
- mean of all the square errors in your model and data
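Computing it directly and with scikit-learn (y_actual and y_predicted are assumed arrays):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

mse_manual = np.mean((y_predicted - y_actual) ** 2)      # (1/n) * sum of squared errors
mse_sklearn = mean_squared_error(y_actual, y_predicted)  # same value
```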
Irreducible error
Error that we can't do anything about. Even if we had all possible data and could build a perfect model, we still couldn't predict values exactly.
Bias and variance
- errors that we can control
bias = failing to capture some of the signal (underfitting) - errors that are consistently off in the same direction
variance = error from being too sensitive to the particular training data we saw (overfitting) - errors that scatter a lot from sample to sample
When you capture more signal, you naturally capture more noise too, so variance goes up.
k fold cross validation
train/test split - reserve data for the final test set
then do k-fold cross-validation on the rest: each fold takes a turn as the validation set while the remaining folds are the training set
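A minimal sketch of that workflow: hold out a final test set, then cross-validate on the rest (model choice and names are illustrative):

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# reserve the ultimate test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# k-fold CV on the training portion: each fold takes a turn as the validation set
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_train, y_train, cv=5)
print(scores.mean(), scores.std())
```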
churn
decision rules
Linear Regression - scatterplot
It's good practice to plot a scatter plot of the dependent variable against each independent variable; if the relationship looks linear, that's a good hint that linear regression is a suitable learning algorithm.
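A quick matplotlib sketch (the column name "feature" is a placeholder):

```python
import matplotlib.pyplot as plt

plt.scatter(X["feature"], y)  # "feature" is a hypothetical column name
plt.xlabel("feature")
plt.ylabel("target")
plt.show()
```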
Feature Engineering
anytime you use your current features to create new features.
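A tiny example with hypothetical pandas columns (weight_kg and height_m are made up for illustration):

```python
# derive a new feature (BMI) from two existing columns
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
```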
Linear Regression with single feature
y = mx + b
Linear Regression - how to pick the best line
Residual = the distance between our predicted value and the actual value
Find the line that minimizes the total sum of squared residuals.
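A sketch of finding that line with numpy's least-squares fit (x and y are assumed to be 1-D arrays):

```python
import numpy as np

# polyfit with deg=1 returns the slope m and intercept b that minimize
# the sum of squared residuals
m, b = np.polyfit(x, y, deg=1)
residuals = y - (m * x + b)
print("sum of squared residuals:", (residuals ** 2).sum())
```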
Linear Regression with multiple features
the linear relationship is:
y = B0 + B1x1 + B2x2 + ... + Bpxp
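Fitting the multi-feature version with scikit-learn (X is assumed to have p feature columns):

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X, y)
print(model.intercept_)  # B0
print(model.coef_)       # B1 ... Bp, one coefficient per feature
```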
Linear Regression - R-squared
gives you a sense of how well your model performs - how much of the variance in the data you're capturing. It tells you how much better your model does than the dumb model (always predicting the mean). Close to 0 = no better than the dumb model; close to 1 = capturing almost all the variance (which can also be a sign of overfitting).
Something to watch out for: adding more features never lowers r-squared, so people use adjusted r-squared as a better measure.
Linear Regression - Adjusted r-squared
penalizes the score for the number of features. It's a better measure of model fit because plain r-squared only goes up as you add features (see the sketch below).
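statsmodels reports both; adjusted R-squared applies the penalty 1 - (1 - R^2)(n - 1)/(n - p - 1), where p is the number of features (X and y are assumed to exist):

```python
import statsmodels.api as sm

results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.rsquared)      # R-squared
print(results.rsquared_adj)  # adjusted R-squared, penalized for the number of features
```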
How do you know if a feature and dependent variable has a linear relationship?
You can plot the residuals (y_predicted - y_actual). We want to see the residuals scattered around 0, roughly normally distributed, and showing no trend.
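A sketch of that residual plot with matplotlib (y_predicted and y_actual are assumed to exist):

```python
import matplotlib.pyplot as plt

residuals = y_predicted - y_actual
plt.scatter(y_predicted, residuals)
plt.axhline(0, color="red")  # residuals should scatter randomly around this line
plt.xlabel("predicted value")
plt.ylabel("residual")
plt.show()
```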
studentized residuals
divide residual by the estimate of standard deviation of the residuals.
It helps you find outliers. You can remove a suspect point, refit, and plot the residuals again to check.
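A sketch with statsmodels (assumes a fitted OLS model on X and y; the |value| > 3 cutoff is a common rule of thumb, not a hard rule):

```python
import statsmodels.api as sm

model = sm.OLS(y, sm.add_constant(X)).fit()
influence = model.get_influence()
student_resid = influence.resid_studentized_internal  # residual / estimated std dev
outliers = abs(student_resid) > 3                     # rule-of-thumb outlier flag
```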
homoscedasticity vs. heteroscedasticity
homoscedasticity = when the variance of your residuals is constant
how to test?
- divide your data into two parts; if the residual variance on the left roughly equals the right, it's homoscedastic, otherwise heteroscedastic
- sm.stats.diagnostic.het_goldfeldquandt, then look at the p-value. Null hypothesis = homoscedasticity (see the sketch below)
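Sketch of the Goldfeld-Quandt test with statsmodels (null hypothesis: homoscedasticity; X and y are assumed to exist):

```python
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_goldfeldquandt

results = sm.OLS(y, sm.add_constant(X)).fit()
f_stat, p_value, _ = het_goldfeldquandt(results.resid, results.model.exog)
print(p_value)  # a small p-value (e.g. < 0.05) suggests heteroscedasticity
```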
Normality
residuals are normally distributed
To see if residuals are normally distributed, look at a QQ plot - if the points follow the reference line, they're approximately normal.
We can also use the Jarque-Bera (JB) test. Look at the p-value.
Null: it follows normality
Alt: doesn’t follow normality
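Sketch of both checks with statsmodels (assumes a fitted OLS results object as above):

```python
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.stats.stattools import jarque_bera

sm.qqplot(results.resid, line="s")  # points near the line suggest normal residuals
plt.show()

jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(results.resid)
print(jb_pvalue)  # small p-value => reject the null of normality
```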
Multicollinearity
When two or more features depend on each other, e.g. height and weight.
- It means you can't be confident about the individual coefficient estimates, so you may have to remove one or more features.
- Variance inflation factor (VIF): if VIF > 10, it's a good indication that you have multicollinearity (see the sketch below).
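Sketch with statsmodels' variance_inflation_factor (assumes X is a pandas DataFrame of features):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif)  # VIF > 10 for a feature is a common red flag for multicollinearity
```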
Linear Regression (prediction vs inferential)
- when there’s a linear relation between outcome and features
- Goal: find the coefficients of the features
- what coefficients should I choose to minimize my sum of squared residuals? - the difference between my observations and predictions
When I'm just making predictions and don't care about interpreting the coefficients, I build a linear regression model and predict.
If you want to do inference - to see how confident you are in each coefficient - then the following conditions must hold:
- Linearity
- Homoscedasticity
- No multicollinearity
- Normality (of residuals)
- All data is i.i.d. (independent and identically distributed)