Regression Flashcards
KNN Classification - training and predicting
Training: store all the data
Prediction:
1. calculate the distance from x to all points in your dataset
2. sort the points in your dataset by increasing distance from x
3. predict the majority label of the k closest points
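A minimal numpy sketch of those three steps (X_train, y_train, the query point x, and k are assumed to exist; the names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    # 1. distance from x to every training point (Euclidean here)
    dists = np.linalg.norm(X_train - x, axis=1)
    # 2. sort training points by increasing distance, keep the k closest
    nearest = np.argsort(dists)[:k]
    # 3. predict the majority label among those k points
    return Counter(y_train[nearest]).most_common(1)[0][0]
```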
Distance
- euclidean distance, manhattan distance, cosine distance = 1 - cosine similarity
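Quick sketches of the three metrics with numpy/scipy (scipy's cosine already returns 1 - cosine similarity):

```python
import numpy as np
from scipy.spatial.distance import cosine

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 1.0])

euclidean = np.linalg.norm(a - b)   # sqrt of the sum of squared differences
manhattan = np.abs(a - b).sum()     # sum of absolute differences
cosine_dist = cosine(a, b)          # 1 - cosine similarity
```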
KNN Regression
We take the average of the target values of the k nearest neighbors
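Same idea as the classification sketch above, but averaging the neighbors' target values instead of voting (illustrative names):

```python
import numpy as np

def knn_regress(X_train, y_train, x, k):
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training point
    nearest = np.argsort(dists)[:k]              # indices of the k closest points
    return y_train[nearest].mean()               # predict the average of their targets
```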
KNN Hyperparameters
K
- how many neighbors to use. Rule of thumb: start with k = sqrt(n), then grid search around that value (see the sketch below).
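A sketch of that rule of thumb with scikit-learn's GridSearchCV (X and y are assumed to exist; the grid width around sqrt(n) is illustrative):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

start = int(np.sqrt(len(X)))  # rule of thumb: k ~ sqrt(n)
param_grid = {"n_neighbors": list(range(max(1, start - 10), start + 11))}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```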
KNN - noise vs signal
lower k tends to overfit - captures more noise along with the signal
higher k tends to underfit - captures less noise, but also less signal
standardization
use standardization when your features are on different scales
standardized value = (data point - mean) / standard deviation
this rescales each feature to mean 0 and standard deviation 1
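Sketch with scikit-learn's StandardScaler, which applies (x - mean) / std per feature (X_train and X_test are assumed to exist):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean/std on training data only
X_test_std = scaler.transform(X_test)        # reuse the training mean/std on test data
```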
KNN (pros, cons)
pros:
- super simple
- training is trivial
- easy to add more data
- few hyperparameters
cons:
- if you have a lot of features, you need a lot more data, and gathering more data can be costly
- high prediction cost
- bad with high dimensions - anything much beyond ~5 features tends to hurt
- categorical features don’t work well
Mean Squared Error (MSE)
expected value of the square of the error
MSE = (1/n) * Σ (predicted_i - actual_i)^2
- the average squared difference between the estimated values and the actual values
- mean of all the square errors in your model and data
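Computing it directly and with scikit-learn (y_actual and y_predicted are assumed arrays):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

mse_manual = np.mean((y_predicted - y_actual) ** 2)      # (1/n) * sum of squared errors
mse_sklearn = mean_squared_error(y_actual, y_predicted)  # same value
```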
Irreducible error
Error that we can't do anything about. Even if we had all possible data and could build a perfect model, we still couldn't predict values exactly.
Bias and variance
- errors that we can control
bias = failing to capture some of the signal (underfitting) - errors that are consistently off in the same direction
variance = error from being too sensitive to the particular training data we saw (overfitting) - errors that scatter a lot from sample to sample
When you capture more signal, you naturally capture more noise too, so variance goes up.
k fold cross validation
train/test split - reserve data for the final test set
then do k-fold cross-validation on the rest: each fold takes a turn as the validation set while the remaining folds are the training set
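A minimal sketch of that workflow: hold out a final test set, then cross-validate on the rest (model choice and names are illustrative):

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# reserve the ultimate test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# k-fold CV on the training portion: each fold takes a turn as the validation set
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_train, y_train, cv=5)
print(scores.mean(), scores.std())
```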
churn
decision rules
Linear Regression - scatterplot
It's good practice to plot a scatter plot of the dependent variable against each independent variable; if the relationship looks linear, that's a good hint that linear regression is a suitable learning algorithm.
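A quick matplotlib sketch (the column name "feature" is a placeholder):

```python
import matplotlib.pyplot as plt

plt.scatter(X["feature"], y)  # "feature" is a hypothetical column name
plt.xlabel("feature")
plt.ylabel("target")
plt.show()
```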
Feature Engineering
anytime you use your current features to create new features.
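A tiny example with hypothetical pandas columns (weight_kg and height_m are made up for illustration):

```python
# derive a new feature (BMI) from two existing columns
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
```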
Linear Regression with single feature
y = mx + b
Linear Regression - how to pick the best line
Residual = the distance between our predicted value and the actual value
Find the line that minimizes the total sum of squared residuals.
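A sketch of finding that line with numpy's least-squares fit (x and y are assumed to be 1-D arrays):

```python
import numpy as np

# polyfit with deg=1 returns the slope m and intercept b that minimize
# the sum of squared residuals
m, b = np.polyfit(x, y, deg=1)
residuals = y - (m * x + b)
print("sum of squared residuals:", (residuals ** 2).sum())
```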
Linear Regression with multiple features
the linear relationship is:
y = B0 + B1x1 + B2x2 + ... + Bpxp
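Fitting the multi-feature version with scikit-learn (X is assumed to have p feature columns):

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X, y)
print(model.intercept_)  # B0
print(model.coef_)       # B1 ... Bp, one coefficient per feature
```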
Linear Regression - R-squared
gives you a sense of how well your model performs - how much of the variance in the data you're capturing. It tells you how much better your model does than the dumb model (always predicting the mean). Close to 0 = no better than the dumb model; close to 1 = capturing almost all the variance (which can also be a sign of overfitting).
Something to watch out for: adding more features never lowers r-squared, so people use adjusted r-squared as a better measure.
Linear Regression - Adjusted r-squared
penalizes the score for the number of features. It's a better measure of model fit because plain r-squared only goes up as you add features (see the sketch below).
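statsmodels reports both; adjusted R-squared applies the penalty 1 - (1 - R^2)(n - 1)/(n - p - 1), where p is the number of features (X and y are assumed to exist):

```python
import statsmodels.api as sm

results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.rsquared)      # R-squared
print(results.rsquared_adj)  # adjusted R-squared, penalized for the number of features
```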
How do you know if a feature and dependent variable has a linear relationship?
You can plot the residuals (y_predicted - y_actual). We want to see the residuals scattered around 0, roughly normally distributed, and showing no trend.
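A sketch of that residual plot with matplotlib (y_predicted and y_actual are assumed to exist):

```python
import matplotlib.pyplot as plt

residuals = y_predicted - y_actual
plt.scatter(y_predicted, residuals)
plt.axhline(0, color="red")  # residuals should scatter randomly around this line
plt.xlabel("predicted value")
plt.ylabel("residual")
plt.show()
```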
studentized residuals
divide residual by the estimate of standard deviation of the residuals.
It helps you find outliers. You can remove a suspect point, refit, and plot the residuals again to check.
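A sketch with statsmodels (assumes a fitted OLS model on X and y; the |value| > 3 cutoff is a common rule of thumb, not a hard rule):

```python
import statsmodels.api as sm

model = sm.OLS(y, sm.add_constant(X)).fit()
influence = model.get_influence()
student_resid = influence.resid_studentized_internal  # residual / estimated std dev
outliers = abs(student_resid) > 3                     # rule-of-thumb outlier flag
```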
homoscedasticity vs. heteroscedasticity
homoscedasticity = when the variance of your residuals is constant
how to test?
- divide your data into two parts; if the residual variance on the left roughly equals the right, it's homoscedastic, otherwise heteroscedastic
- sm.stats.diagnostic.het_goldfeldquandt, then look at the p-value. Null hypothesis = homoscedasticity (see the sketch below)
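Sketch of the Goldfeld-Quandt test with statsmodels (null hypothesis: homoscedasticity; X and y are assumed to exist):

```python
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_goldfeldquandt

results = sm.OLS(y, sm.add_constant(X)).fit()
f_stat, p_value, _ = het_goldfeldquandt(results.resid, results.model.exog)
print(p_value)  # a small p-value (e.g. < 0.05) suggests heteroscedasticity
```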
Normality
residuals are normally distributed
To see if residuals are normally distributed, look at a QQ plot - if the points follow the reference line, they're approximately normal.
We can also use the Jarque-Bera (JB) test. Look at the p-value.
Null: it follows normality
Alt: doesn’t follow normality
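Sketch of both checks with statsmodels (assumes a fitted OLS results object as above):

```python
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.stats.stattools import jarque_bera

sm.qqplot(results.resid, line="s")  # points near the line suggest normal residuals
plt.show()

jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(results.resid)
print(jb_pvalue)  # small p-value => reject the null of normality
```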
Multicollinearity
When two or more features depend on each other, e.g. height and weight.
- It means you can't be confident about the individual coefficient estimates, so you may have to remove one or more features.
- Variance inflation factor (VIF): if VIF > 10, it's a good indication that you have multicollinearity (see the sketch below).
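Sketch with statsmodels' variance_inflation_factor (assumes X is a pandas DataFrame of features):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif)  # VIF > 10 for a feature is a common red flag for multicollinearity
```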
Linear Regression (prediction vs inferential)
- when there’s a linear relation between outcome and features
- Goal: find the coefficients of the features
- what coefficients should I choose to minimize my sum of squared residuals? - the difference between my observations and predictions
When I'm just making predictions and don't care about interpreting the coefficients, I build a linear regression model and predict.
If you want to do inference - to see how confident you are in each coefficient - then the following conditions must hold:
- Linearity
- Homoscedasticity
- No multicollinearity
- Normality (of residuals)
- All data is i.i.d. (independent and identically distributed)