Machine learning for Regression Flashcards

1
Q

1) Data Preparation

A

You can view the data and perform preparation steps, e.g. cleaning up columns (selecting, renaming), cleaning up values, and updating or creating new values.

2
Q

2) Exploratory Data Analysis

A

Look at each column for its unique values and the number of unique values.
Visualize the column you want to predict (matplotlib and seaborn for visualization).
Look at the distribution of the target variable and transform it if needed to make it suitable for ML models, e.g. a normal distribution is ideal for most models.
Check for missing values.

3
Q

Long Tail Distribution

A

The majority of the data is concentrated at lower values, with a small number of observations at much higher values. This type of distribution is not good for machine learning algorithms because the long tail confuses the model.

4
Q

How to get rid of long tail distribution problem for ML?

A

Apply a logarithmic transformation to get more compact values: even very large values have moderate logarithms.
The log of zero doesn't exist; we can solve that by adding 1 to the data.
In numpy, np.log1p computes log(1 + x), i.e. it adds the 1 for us. After the transformation the tail is gone and the shape resembles a normal distribution with a clear centre and approximate symmetry. If the target variable looks like that, the models do a lot better at prediction.
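
A minimal sketch of the transformation and how to undo it, using a made-up skewed array of prices:

import numpy as np

prices = np.array([10_000, 12_500, 15_000, 20_000, 250_000])  # long right tail
log_prices = np.log1p(prices)        # log(1 + x), safe even if a value is 0
original = np.expm1(log_prices)      # exp(x) - 1 undoes the transformation
print(log_prices)
print(original)                      # ~ the original prices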

5
Q

Types of Distribution

A

+ Symmetric Distributions
Mirrored around the mean
+ Left-Skewed Distributions
Long tail on the left
+ Right-Skewed Distributions
Long tail on the right
+ Bimodal Distributions
Two distinct peaks, e.g. one on the left and one on the right
+ Uniform Distribution
All values occur with roughly equal frequency

6
Q

Types of Probability Distributions

A

+ Discrete Distributions
Finite number of outcomes
+ Continuous Distributions
Infinite number of outcomes, e.g. time and distance
Notation: X ~ N(mean, variance), where X is the variable, N the type of distribution, and (mean, variance) its characteristics; the characteristics vary with the type of distribution.
Distributions are usually defined on the outcome variable.

Discrete distributions:
Uniform distribution
+ Outcomes are equally likely (equiprobable)

Bernoulli distribution
+ Events with only two possible outcomes, e.g. true or false
+ Regardless of whether one outcome is more likely than the other
+ Any event with two outcomes can be transformed into a Bernoulli event

Binomial distribution
+ Carrying out multiple iterations of a two-outcome event
+ e.g. flip a coin 3 times and ask for the likelihood of getting heads twice

Poisson distribution:
Tests how unusual an event frequency is for a given interval
If the frequency changes, so does the expectation of the outcome

Continuous distributions:
The probability distribution is a curve rather than unconnected individual bars.

Normal distribution
+ Often observed in nature
+ Symmetric around the mean
+ Values at the extreme left or right are outliers
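
A small sketch with numpy's random generators, just to connect the names above to code (all parameter values are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
uniform   = rng.integers(1, 7, size=1000)      # discrete uniform: fair die
bernoulli = rng.binomial(1, 0.3, size=1000)    # Bernoulli: one trial, p = 0.3
binomial  = rng.binomial(3, 0.5, size=1000)    # binomial: 3 coin flips, count heads
poisson   = rng.poisson(2.0, size=1000)        # Poisson: event counts per interval
normal    = rng.normal(0.0, 1.0, size=1000)    # continuous: mean 0, std 1
print(np.mean(binomial == 2))                  # ~ P(exactly 2 heads in 3 flips) = 0.375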

7
Q

3) Setting up Validation Framework

A

Split data into 3 parts
1) Training dataset (60%)
2) Validation dataset (20%)
3) Test dataset (20%)

It may happen that the total number of records is not exactly equal to n_train + n_val + n_test because of integer rounding.
To handle this, we compute n_val and n_test first; all the remaining records go to the training set.

Make sure the data is shuffled before splitting, so that it is not in its original order and all kinds of values appear in all three datasets. We can build an index and shuffle it with numpy:
idx = np.arange(n)
np.random.shuffle(idx)
df_shuffle = df.iloc[idx]
df_train = df_shuffle.iloc[:n_train].copy()

Delete the target variable from the feature dataframes, i.e. df_train, df_val and df_test.

Apply the log transformation (np.log1p) to the target variable y.
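
A minimal sketch of the whole split, assuming df is the cars dataframe and msrp is the target column (both names are assumptions):

import numpy as np

n = len(df)
n_val = int(n * 0.2)
n_test = int(n * 0.2)
n_train = n - n_val - n_test               # remaining records go to training

np.random.seed(2)
idx = np.arange(n)
np.random.shuffle(idx)
df_shuffle = df.iloc[idx]

df_train = df_shuffle.iloc[:n_train].copy()
df_val = df_shuffle.iloc[n_train:n_train + n_val].copy()
df_test = df_shuffle.iloc[n_train + n_val:].copy()

y_train = np.log1p(df_train.msrp.values)   # log-transform the target
y_val = np.log1p(df_val.msrp.values)
y_test = np.log1p(df_test.msrp.values)

del df_train['msrp']                       # remove the target from the features
del df_val['msrp']
del df_test['msrp']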

8
Q

Linear Regression (formula)

A

The model outputs a number.
g(xi) = w0 + w1*xi1 + w2*xi2 + w3*xi3 + …
We combine the features / observations / characteristics in such a way that the result is close to the outcome / target variable.
We don't take the features as they are; we multiply each one by a weight.
w0 is the bias term, i.e. the prediction when we don't know anything about the features.
More compact formula:
g(xi) = w0 + sum_j(wj * xij)
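
A tiny sketch of the formula for one observation, with made-up weights and feature values:

import numpy as np

w0 = 7.0                               # bias term
w = np.array([0.01, 0.04, 0.002])      # one weight per feature
xi = np.array([148, 24, 1385])         # one observation's features (made up)

g_xi = w0 + (w * xi).sum()             # w0 + sum_j(wj * xij)
print(g_xi)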

9
Q

Undo log(y+1) (Getting original prediction values)

A

Since we applied log(y + 1) in the previous steps to make the target distribution more normal, the model still outputs values on that logarithmic scale; we need to undo the transformation to get the real price. The way to undo a log is the exponent, e.g. np.exp(x).
np.expm1(x) computes exp(x) - 1, i.e. it also subtracts the 1 we added.

10
Q

Compact Linear Regression Formula

A

g(xi) = w0 + xi^T w
Even more compact:
w = [w0, w1, w2, w3, …, wn]
xi = [1, xi1, xi2, xi3, …, xin]
w^T xi = xi^T w

We just prepend a 1 to xi (and keep w0 as the first element of w), so the single dot product xi^T w gives the same result as before.
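
A quick check of the prepend-1 trick, with made-up numbers:

import numpy as np

w0 = 7.0
w = np.array([0.01, 0.04, 0.002])
xi = np.array([148.0, 24.0, 1385.0])

w_full = np.concatenate([[w0], w])     # [w0, w1, ..., wn]
xi_full = np.concatenate([[1.0], xi])  # [1, xi1, ..., xin]

print(w0 + xi.dot(w))                  # original form
print(xi_full.dot(w_full))             # compact form: same result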

11
Q

Linear Regression Formula

A

Stack the observations xi as the rows of a matrix X and multiply it by the vector w: each row is multiplied (dot product) with w.
The result is
Xw = [x1^T w,
      x2^T w,
      …,
      xn^T w]

12
Q

4) Training Linear Regression: Normal Equation

A

g(X) = Xw ≈ y
We need to find a way to get w.
If X had an inverse we could write X^-1 X w = X^-1 y, but in general X is not square, so its inverse does not exist and there is no exact solution.
X^T X is called the Gram matrix; it is square, so its inverse (usually) exists.
X^T X w = X^T y
(X^T X)^-1 X^T X w = (X^T X)^-1 X^T y
I w = (X^T X)^-1 X^T y
I w = w
So w = (X^T X)^-1 X^T y

w0 = bias term
w = the remaining elements are the weights

We should add the bias term: it tells the model what to predict when there is no information about the car.
A negative coefficient means the outcome goes down as that feature grows, e.g. the age of the car.
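
A minimal sketch of training with the normal equation on random data (the helper name train_linear_regression is just a convention used here, not a library function):

import numpy as np

def train_linear_regression(X, y):
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])     # prepend the bias column
    XTX = X.T.dot(X)                   # Gram matrix
    XTX_inv = np.linalg.inv(XTX)
    w_full = XTX_inv.dot(X.T).dot(y)   # w = (X^T X)^-1 X^T y
    return w_full[0], w_full[1:]       # bias, weights

X = np.random.rand(100, 3)             # made-up feature matrix
y = 5 + X.dot([1.0, -2.0, 0.5])        # made-up target
w0, w = train_linear_regression(X, y)
print(w0, w)                           # ~ 5, [1, -2, 0.5]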

13
Q

Numpy: np.column_stack()

A

Use np.column_stack([ones, X]) to stack arrays as the columns of a new matrix, e.g. to prepend a column of ones (the bias column) to an existing feature matrix X.
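
A tiny sketch with made-up values:

import numpy as np

X = np.array([[148, 24], [132, 25], [453, 11]])
ones = np.ones(X.shape[0])
print(np.column_stack([ones, X]))   # 3x3 matrix with a leading column of ones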

14
Q

5) Baseline Model

A

1) Use all the numerical columns that best describe an observation.
2) Create a numpy matrix out of the dataframe, e.g. df.values or df.to_numpy().
3) Check for missing values, e.g. df.isnull().sum(); we can use fillna(0). Filling missing values with 0 makes the model ignore those features, e.g.
g(xi) = w0 + xi1*w1 + xi2*w2 = w0 + xi2*w2 when xi1 = 0. If xi1 is horsepower, a car with 0 horsepower doesn't make sense, so we could replace missing values with the mean instead.
4) Use the learned weights for predictions: y_pred = w0 + X_train.dot(w)
5) Plot the predictions against the target variable to compare them.
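
A minimal sketch of the baseline, assuming df_train and y_train from the split, the train_linear_regression helper from the normal-equation card, and a list of numerical column names (the names in base are assumptions):

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

base = ['engine_hp', 'engine_cylinders', 'highway_mpg']   # assumed column names

def prepare_X(df):
    df_num = df[base]
    df_num = df_num.fillna(0)          # missing values -> 0 (the model ignores them)
    return df_num.values               # numpy matrix out of the dataframe

X_train = prepare_X(df_train)
w0, w = train_linear_regression(X_train, y_train)
y_pred = w0 + X_train.dot(w)

sns.histplot(y_pred, color='red', alpha=0.5)    # predictions vs. target
sns.histplot(y_train, color='blue', alpha=0.5)
plt.show()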

15
Q

RMSE

A

How do we objectively assess the performance of the model?
RMSE stands for root mean squared error.
g(xi) = prediction for xi
yi = actual value

RMSE = sqrt( sum over i=1..m of (g(xi) - yi)^2 / m )
We take the difference between the prediction and the actual target value, square it, sum over all observations and divide by the number of observations m; that gives the mean squared error. Then we take its square root.

The lower the RMSE, the better the model's performance.
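
A minimal implementation:

import numpy as np

def rmse(y, y_pred):
    error = y_pred - y                 # difference per observation
    mse = (error ** 2).mean()          # mean squared error
    return np.sqrt(mse)                # root of the mean squared error

print(rmse(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.2])))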

16
Q

6) Model Validation

A

We prepare the validation dataset in the same way as the training dataset and generate predictions for it.
We then compute the RMSE between the predicted values and the actual target values of the validation set.
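
A minimal sketch, reusing prepare_X, train_linear_regression and rmse from the earlier cards (all assumed to be defined):

X_train = prepare_X(df_train)
w0, w = train_linear_regression(X_train, y_train)

X_val = prepare_X(df_val)              # same preparation as for training
y_pred = w0 + X_val.dot(w)
print(rmse(y_val, y_pred))             # validation RMSE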

17
Q

7) Simple Feature Engineering discussion

A

We can generate new features from existing ones, e.g. calculate the car's age from its year (see the sketch below).
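
A minimal sketch, reusing base from the baseline card and assuming the dataframe has a year column and 2017 is the dataset's reference year (both assumptions):

def prepare_X(df):
    df = df.copy()                     # don't modify the original dataframe
    df['age'] = 2017 - df.year         # new feature from an existing one
    features = base + ['age']
    df_num = df[features].fillna(0)
    return df_num.values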

18
Q

Why use df.copy?

A

So that our original dataframe is not modified and we are making all the changes on the copied dataframe.

19
Q

8) Feature Engineering: Categorical Variables

A

Categorical variables are columns/features that contain strings, e.g. the car make, or numerical columns that really encode categories, e.g. the number of doors (2, 3, 4).
The way to encode these variables is to create binary columns: each categorical value translates into a new column that holds 1 or 0 depending on whether the observation has that value.

After adding the feature, we check the RMSE again to see the impact of the feature.

e.g. for the car make, we can add binary columns for the 5 most popular makes (see the sketch below).
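
A minimal sketch for the car make, assuming a make column and reusing the prepare_X pattern from the earlier cards (the column name and the list of top makes are assumptions):

makes = ['chevrolet', 'ford', 'volkswagen', 'toyota', 'dodge']   # assumed top-5 makes

def prepare_X(df):
    df = df.copy()
    df['age'] = 2017 - df.year
    features = base + ['age']

    for make in makes:
        feature = 'make_%s' % make
        df[feature] = (df.make == make).astype(int)   # binary column: 1 or 0
        features.append(feature)

    df_num = df[features].fillna(0)
    return df_num.values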

20
Q

9) Regularization

A

Sometimes the bias and the weights become very large and the RMSE increases significantly, which means something went wrong. Looking at our weight equation w = (X^T X)^-1 X^T y, the problem is the inverse: sometimes the inverse of X^T X does not exist. This can happen when there are duplicate features, i.e. columns with exactly the same values.
Numpy then complains that the matrix is singular and it cannot compute the inverse.
If there is only a slight difference between two columns (e.g. because of noise in the data), they are no longer identical, so numpy computes an inverse even though it really shouldn't exist, and the resulting values blow up.
This problem can be resolved by adding a small number to the diagonal of the matrix.
The larger the number we add to the diagonal, the more controlled the values in the inverse matrix are.
In code:
XTX = XTX + 0.01 * np.eye(3)   # 3 = size of XTX in this example
This is regularization, which means controlling the size of the numbers. The larger the regularization parameter, the more controlled the values, and the inverse becomes computable.
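
A minimal sketch of the regularized version of the normal-equation helper (r is the regularization parameter; the function name is just a convention used here):

import numpy as np

def train_linear_regression_reg(X, y, r=0.001):
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])
    XTX = X.T.dot(X)
    XTX = XTX + r * np.eye(XTX.shape[0])   # add r to the diagonal
    XTX_inv = np.linalg.inv(XTX)
    w_full = XTX_inv.dot(X.T).dot(y)
    return w_full[0], w_full[1:]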

21
Q

10) Tuning the model

A

We need to find the best regularization parameter for our model.
We use the validation dataset: for each candidate value in a list, we train the model with that parameter, then look at the bias term and the RMSE on the validation set, and pick the value with the lowest RMSE.
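
A minimal sketch of the tuning loop, reusing the helpers and data from the earlier cards (all assumed to be defined):

for r in [0.0, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10]:
    X_train = prepare_X(df_train)
    w0, w = train_linear_regression_reg(X_train, y_train, r=r)

    X_val = prepare_X(df_val)
    y_pred = w0 + X_val.dot(w)
    score = rmse(y_val, y_pred)
    print(r, w0, score)                # pick the r with the lowest validation RMSE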

22
Q

11) Using the model

A

Now we train the model on the combined train + validation dataset and apply it to the test dataset for the final predictions.

Then we can take the model, feed in any new car (e.g. from the test dataset) and predict its price.
In a real scenario we might get a dictionary with all the values the user entered; we send it to the model and the model returns the price.
Turn this dictionary into the format the model expects (the same feature preparation as during training).
Convert the prediction back into a price by undoing the log, i.e. np.expm1(y_pred).
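
A minimal end-to-end sketch, reusing the helpers above; the dictionary keys are assumptions about the dataset's columns:

import numpy as np
import pandas as pd

df_full_train = pd.concat([df_train, df_val])          # train + validation
y_full_train = np.concatenate([y_train, y_val])

X_full_train = prepare_X(df_full_train)
w0, w = train_linear_regression_reg(X_full_train, y_full_train, r=0.001)

car = {'make': 'toyota', 'year': 2015,                 # "user input" (assumed keys)
       'engine_hp': 268, 'engine_cylinders': 6, 'highway_mpg': 31}
df_small = pd.DataFrame([car])                         # dict -> one-row dataframe

X_small = prepare_X(df_small)
y_pred = w0 + X_small.dot(w)
price = np.expm1(y_pred[0])                            # undo log(y + 1)
print(price)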

23
Q

The formula for training the linear regression model, w = (X^T X)^-1 X^T y, is also called

A

Normal equation

24
Q

The binary columns created for categorical values are an encoding also called

A

One-hot encoding