Topic 5: Linear Model Flashcards

1
Q

Define bivariate data and the variables involved

A

Bivariate data consists of pairs of observations (xi, yi) with i = 1, 2, 3, …, n

x: independent variable
y: dependent variable

2
Q

What does a scatter plot summarize?

A

A scatter plot is a graphical summary of the relationship between 2 variables: each pair (x, y) is plotted on the same 2D plane, creating a cloud of data points.

3
Q

Describe linear association

A

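Linear association describes how tightly the points cluster around a line (strength) and which way the line slopes (direction).

Strong association: tightly clustered
Weak association: not tightly clustered
Positive/negative association: the cloud slopes upward/downward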

4
Q

How many numerical summaries are used to describe a scatter plot?

A

5 numerical summaries:
- mean & SD of x
- mean & SD of y
- correlation coefficient (r)

5
Q

Describe the center and spread of the data cloud in a scatter plot

A
  • Center: point of average (mean of x, mean of y)
  • Horizontal spread: measured by SD of x (most points lie within 2 SDs of the mean of x)
  • Vertical spread: measured by SD of y (most points lie within 2 SDs of the mean of y)
6
Q

Describe the correlation coefficient and its features

A

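The correlation coefficient (r) is a numerical summary that measures how tightly the points cluster around a line, showing the sign and strength of the linear association.

  • Pure number with no unit
  • Lies between -1 and 1
  • (+) r: upward slope
  • (-) r: downward slope
  • r = +/-1: perfect correlation; the closer r is to -1 or +1, the more tightly the points cluster around the line
  • r = 0: no linear association (the points do not cluster around any line)
  • Symmetry: r is not affected by interchanging the variables
  • Scaling: shift & scale invariant (r is unchanged when a variable is shifted by a constant or multiplied by a positive constant)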
7
Q

How do you calculate the population/sample correlation coefficient?

A

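Mean of the product of the variables in standard units

  • Convert each variable to z scores: (data point - mean) / SD, using the population SD (popsd) or the sample SD (sd)
  • Multiply the 2 z scores of the x & y variables for each data point
  • Mean of the products = r (divide the sum by n with the population SD, by n - 1 with the sample SD)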
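
The three steps above can be sketched in a few lines of Python. This is a minimal illustration using the population SD, not a reference implementation; the function name `correlation` and the small data set are made up for the example.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """Population r: the mean of the products of the z scores."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)          # population SDs
    zx = [(x - mx) / sx for x in xs]         # x in standard units
    zy = [(y - my) / sy for y in ys]         # y in standard units
    return mean([a * b for a, b in zip(zx, zy)])

# Made-up illustrative data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(round(r, 4))  # 0.7746
```

Swapping `xs` and `ys` gives the same value, which is the symmetry property of r.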
8
Q

What is the SD line?

A

The SD line connects the point of average to the point 1 SD away from the mean in both the x & y directions.

9
Q

What are some features and limitations of the SD line?

A

Features: it goes through the point of average and captures the exact relationship if there is one

Limitations: it does not use r, so it
- cannot distinguish between clouds with different amounts of clustering
- can over- or underestimate y

10
Q

What are some warnings regarding the correlation coefficient?

A
  • Outliers can overly influence r
  • r cannot detect nonlinear association
  • The same r value can come from very different data sets
  • Correlations based on rates or averages can inflate r
  • Association doesn’t mean causation
  • Small SDs can make r look bigger
11
Q

What is the regression line and its equation?

A

The regression line takes into account all 5 numerical summaries.
It connects the point of average to (mean of x + SDx, mean of y + r * SDy)
y = intercept + slope * x
slope = r * SDy / SDx, intercept = mean of y - slope * mean of x
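
As a rough sketch of how the line follows from the 5 numerical summaries, the snippet below computes the slope and intercept from the means, SDs and r, then checks that the line passes through the two points named in this card. The helper name `regression_line` and the data are illustrative assumptions.

```python
from statistics import mean, pstdev

def regression_line(xs, ys):
    """Return (intercept, slope) built from the 5 numerical summaries."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    # r as the mean of the products of the z scores
    r = mean([((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)])
    slope = r * sy / sx
    intercept = my - slope * mx
    return intercept, slope

# Made-up illustrative data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
intercept, slope = regression_line(xs, ys)

def predict(x):
    """Predicted y on the regression line for a given x."""
    return intercept + slope * x

print(round(intercept, 2), round(slope, 2))  # 2.2 0.6
```

Moving 1 SD to the right of the mean of x raises the prediction by only r * SDy, which is what distinguishes this line from the SD line.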

12
Q

What is the graph of averages?

A

The graph of averages plots the average y for each x.
If these points follow a straight line, that line is the regression line.

13
Q

What are some ways to predict a y value from a given x value?

A
  • Baseline prediction: y prediction = average of y values over all x values
  • Prediction in a strip: y prediction = average of all y values associated with the given x
  • Based on regression line: use the line equation to predict y
  • Predicting percentiles: x is given as a percentile -> predict the percentile of y
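
The first two approaches can be sketched as below. The function names, the `half_width` parameter that defines the strip, and the data are all hypothetical choices made for this illustration.

```python
from statistics import mean

def baseline_prediction(ys):
    """Ignore x entirely: predict the overall average of y."""
    return mean(ys)

def strip_prediction(xs, ys, x_given, half_width=0.5):
    """Average the y values whose x lies within half_width of x_given.

    half_width is a hypothetical tuning choice; an empty strip makes
    mean() raise StatisticsError.
    """
    in_strip = [y for x, y in zip(xs, ys) if abs(x - x_given) <= half_width]
    return mean(in_strip)

# Made-up illustrative data
xs = [1, 1, 2, 2, 3, 3]
ys = [2, 4, 3, 5, 6, 8]
print(baseline_prediction(ys))       # average of all y values
print(strip_prediction(xs, ys, 2))   # average of y for points near x = 2
```

The strip prediction uses only nearby points, so it needs enough data in each strip to be reliable.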
14
Q

What steps are taken to predict y value based on a given x percentile?

A
  • Convert the x percentile to a z score in the x direction: Zx (qnorm)
  • Zy = r * Zx
  • Convert Zy back into a percentile in the y direction (pnorm)
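
A minimal Python sketch of these three steps follows, assuming both variables are roughly normal. `NormalDist().inv_cdf` and `NormalDist().cdf` from the standard library play the roles of `qnorm` and `pnorm`; the function name `predict_y_percentile` is made up for the example.

```python
from statistics import NormalDist

def predict_y_percentile(x_percentile, r):
    """Predict the y percentile from an x percentile and correlation r."""
    std_normal = NormalDist()                        # mean 0, SD 1
    zx = std_normal.inv_cdf(x_percentile / 100)      # like qnorm
    zy = r * zx                                      # regression in standard units
    return 100 * std_normal.cdf(zy)                  # like pnorm

# Someone at the 90th percentile in x, with r = 0.5
print(round(predict_y_percentile(90, 0.5), 1))
```

Because |r| is below 1 here, the predicted y percentile is pulled toward the 50th, which is the regression effect in percentile terms.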
15
Q

What are residuals?

A

Residuals are vertical distances of data points above or below the regression line, representing errors between actual value and prediction.

16
Q

Describe population RMS error and its equation

A

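Root mean square (RMS) of the residuals, like an “SD for the line” (the plain average of the residuals is 0, so they are squared before averaging)

RMS error pop = RMS of the gaps (residuals) between the data points and the line
Baseline prediction: RMS error = SDy
Regression line: RMS error pop = sqrt(1 - r^2) * SDy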
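
As a quick numerical check, the sketch below computes the RMS of the residuals directly and compares it with sqrt(1 - r^2) * SDy. It uses population SDs throughout and a small made-up data set; variable names are illustrative.

```python
from math import sqrt
from statistics import mean, pstdev

# Made-up illustrative data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

mx, my = mean(xs), mean(ys)
sx, sy = pstdev(xs), pstdev(ys)
r = mean([((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)])

slope = r * sy / sx
intercept = my - slope * mx

# Residual = actual y minus the y predicted by the regression line
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

rms_direct = sqrt(mean([e ** 2 for e in residuals]))   # RMS of the residuals
rms_formula = sqrt(1 - r ** 2) * sy                    # shortcut formula

print(round(rms_direct, 4), round(rms_formula, 4))     # the two values agree
```

For comparison, the baseline prediction has RMS error equal to `sy`, so the factor sqrt(1 - r^2) measures how much the regression line improves on it.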

17
Q

What is the RMS error when r=+/-1 and when r=0?

A

r=+/-1: RMS error = 0
r=0: RMS error = SDy

18
Q

What is a residual plot and what do you look for in this plot?

A

Residual plot: residuals plotted against x
We look for randomness in the residual plot (if the residuals show no pattern, a linear model is appropriate)

19
Q

What are vertical strips for?

A

Vertical strips on a scatter plot are used to check whether the spread of y is the same across different values of x.

If within vertical strips, there is equal spread in y direction -> homoscedastic data -> RMS error can be used as the measure of spread for individual strips

If within vertical strips, there is unequal spread in y direction -> heteroscedastic data -> RMS error CANNOT be used

20
Q

How can the normal approximation be used in vertical strips?

A

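Within the strip at a given x (with Zx the z score of that x):
Mean = mean of y + Zx * r * SDy
SD = RMS error
Compute the z score for the threshold: (threshold - Mean) / SD
Use pnorm() to get the proportion of the strip below the threshold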
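
These steps can be sketched in Python as follows, assuming homoscedastic data and roughly normal y values within the strip. `NormalDist.cdf` stands in for `pnorm()`; the function name `proportion_below` and all numbers in the usage lines are made up for illustration.

```python
from statistics import NormalDist

def proportion_below(threshold, x, mean_x, sd_x, mean_y, sd_y, r):
    """Approximate the share of y values below threshold in the strip at x."""
    zx = (x - mean_x) / sd_x                   # x in standard units
    strip_mean = mean_y + zx * r * sd_y        # regression prediction at x
    strip_sd = (1 - r ** 2) ** 0.5 * sd_y      # RMS error of the line
    # strip_sd is 0 when r = +/-1; NormalDist.cdf raises an error in that case
    return NormalDist(strip_mean, strip_sd).cdf(threshold)   # like pnorm

# Hypothetical summaries: mean_x = 70, SDx = 10, mean_y = 60, SDy = 15, r = 0.6
p = proportion_below(threshold=69, x=80, mean_x=70, sd_x=10,
                     mean_y=60, sd_y=15, r=0.6)
print(p)  # 0.5, because the threshold equals the strip mean (60 + 1 * 0.6 * 15)
```

Subtracting the result from 1 gives the proportion above the threshold instead.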