Data Mining - Chapter 6 (Regression) Flashcards

1
Q

What is the formula of a multiple linear regression model?

A

y = b0 + b1x1 + b2x2 + … + bpxp + e
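
A minimal sketch of fitting this model in Python with scikit-learn; the data and coefficient values are synthetic, chosen only for illustration.

  # Fit y = b0 + b1*x1 + b2*x2 + e on made-up data (illustration only).
  import numpy as np
  from sklearn.linear_model import LinearRegression

  rng = np.random.default_rng(0)
  X = rng.normal(size=(100, 2))               # two predictors x1, x2
  e = rng.normal(scale=0.5, size=100)         # noise term e
  y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + e

  model = LinearRegression().fit(X, y)
  print(model.intercept_, model.coef_)        # estimates of b0 and (b1, b2)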

2
Q

What is the multiple linear regression model?

A

It is a model used to fit a relationship between a numerical outcome variable (Y) and a set of predictors (X1, X2, etc.).

3
Q

What are the two objectives for using multiple linear regression?

A
  1. Explanatory
    Explaining or quantifying the average effect of the inputs on the outcome. (Mostly used in statistics; the goal is to find the model that best explains the underlying relationship in the population.)
  2. Predictive
    Predicting the outcome value for new records, given their input values; the goal is to find the model that best predicts new individual records. (Useful for decision-making.)
4
Q

What are the four main characteristics of an explanatory multiple regression model?

A
  • A good model is one that fits the data closely.
  • The entire dataset is used for estimating the best-fit model, to maximize the amount of information we have about the population.
  • Performance measures assess how closely the data fit the model and how strong the average relationship is.
  • Focus is on the coefficients (b).
5
Q

What are the four main characteristics of a predictive multiple regression model?

A
  • A good model is one that predicts new records accurately.
  • The dataset is split into a training set and a validation/test set.
  • Performance measures for this model measure the predictive accuracy.
  • Focus is on the predictions (y-hat); see the sketch below.
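
A rough sketch of this predictive workflow with scikit-learn: split the data, fit on the training set, and judge the model by its accuracy (here RMSE) on the validation set. The data are synthetic placeholders.

  import numpy as np
  from sklearn.linear_model import LinearRegression
  from sklearn.model_selection import train_test_split

  rng = np.random.default_rng(1)
  X = rng.normal(size=(200, 3))                                  # synthetic predictors
  y = 2.0 + X @ np.array([1.0, -0.5, 0.3]) + rng.normal(scale=0.4, size=200)

  X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.4, random_state=1)
  model = LinearRegression().fit(X_train, y_train)               # fit on training data only
  y_hat = model.predict(X_valid)                                 # focus is on the predictions
  rmse = np.sqrt(np.mean((y_valid - y_hat) ** 2))                # predictive accuracy on held-out data
  print(rmse)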
6
Q

Which method can be used to estimate the unknown parameters in a regression model?

A

Ordinary Least Squares (OLS)

-> It minimizes the sum of squared errors between the observed and predicted values of the dependent variable.
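
A numpy-only sketch of the idea, assuming the standard normal-equations form of OLS: the coefficients that minimize the sum of squared errors solve (X'X)b = X'y. The data are synthetic.

  import numpy as np

  rng = np.random.default_rng(2)
  X = rng.normal(size=(50, 2))
  y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.3, size=50)

  X1 = np.column_stack([np.ones(len(X)), X])     # add a column of 1s for the intercept b0
  b = np.linalg.solve(X1.T @ X1, X1.T @ y)       # OLS estimates (b0, b1, b2)
  print(b)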

7
Q

Why does OLS use a least squares criterion?

A

You are looking at the deviations between the observed and the predicted values. If you do not square those deviations, the positive and negative deviations cancel each other out.
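
A tiny numeric illustration with made-up deviations: summed as-is they cancel to zero, while the squared (least squares) criterion does not.

  import numpy as np

  deviations = np.array([2.0, -2.0, 3.0, -3.0])    # observed minus predicted (invented values)
  print(deviations.sum())                          # 0.0  -> cancellation hides the error
  print((deviations ** 2).sum())                   # 26.0 -> squared deviations accumulate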

8
Q

How do you estimate the error for a single outcome?

A

ei = Yi - Yhat_i (the actual observation minus the predicted outcome)

9
Q

How do you estimate the total error of a multiple regression model?

A

SSE = sum((Yi - Yhat_i)^2), the sum of squared errors over all records
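
A short sketch of computing this sum of squared errors with numpy; the observed and predicted values are made up.

  import numpy as np

  y     = np.array([10.0, 12.0,  9.0, 15.0])   # observed Yi (invented)
  y_hat = np.array([11.0, 11.5, 10.0, 14.0])   # predicted Yhat_i (invented)
  sse = np.sum((y - y_hat) ** 2)               # sum of squared errors
  print(sse)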

10
Q

Which 4 assumptions do we make when using a multiple linear regression model for prediction?

A
  1. The noise e follows a normal distribution
  2. The choice of predictors and their form is correct (linearity)
  3. The records are independent of each other
  4. The variability in the outcome values for a given set of predictors is the same, regardless of the values of the predictors (homoskedasticity); see the sketch below.
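
A rough sketch of eyeballing assumptions 1 and 4 through the residuals of a fitted model; the data and model below are synthetic placeholders, and the plots are only informal checks.

  import numpy as np
  import matplotlib.pyplot as plt
  from sklearn.linear_model import LinearRegression

  rng = np.random.default_rng(3)
  X = rng.normal(size=(150, 2))
  y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=150)

  model = LinearRegression().fit(X, y)
  fitted = model.predict(X)
  residuals = y - fitted

  plt.hist(residuals, bins=20)        # assumption 1: residuals should look roughly bell-shaped
  plt.show()

  plt.scatter(fitted, residuals)      # assumption 4: no funnel shape in residuals vs. fitted values
  plt.show()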
11
Q

What are reasons to reduce the number of predictors in your model?

A
  1. It may be expensive or not feasible to collect data for all of those predictors.
  2. We might be unable to measure all of these predictors accurately.
  3. Less parsimony: with more predictors we get less insight into the influence of the individual predictors.
  4. Multicollinearity among the predictors (see the sketch below).
  5. Using predictors that are uncorrelated with the outcome variable increases the variance of the predictions.
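
A sketch of flagging multicollinearity (reason 4) with variance inflation factors from statsmodels; the predictors are synthetic, with x2 built to be nearly collinear with x1.

  import numpy as np
  import statsmodels.api as sm
  from statsmodels.stats.outliers_influence import variance_inflation_factor

  rng = np.random.default_rng(4)
  x1 = rng.normal(size=100)
  x2 = 0.9 * x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
  x3 = rng.normal(size=100)

  X = sm.add_constant(np.column_stack([x1, x2, x3]))
  vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
  print(vifs)   # large VIFs (rule of thumb: > 10) flag multicollinear predictors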