Module 6 Flashcards

(65 cards)

1
Q

scatterplots

A

Plots that graph bivariate quantitative data. The horizontal axis shows the explanatory variable and the vertical axis shows the response variable. Each observation is plotted as a point. These plots help us visualize the relationship between two variables.

2
Q

Association

A

Describes a relationship between two variables. An association can be positive, negative, or absent (no association).

3
Q

correlation

A

A measure of the strength and direction of the linear association between two quantitative variables.

4
Q

What does a curve in a scatterplot indicate?

A

The relationship is not linear, so measuring correlation is not appropriate.

5
Q

Properties of r

A
  1. Independent of units
  2. Symmetric: r is the same no matter which variable is treated as explanatory and which as response
  3. Magnitude determines the strength of the relationship
  4. Falls between -1 and 1
  5. Sign determines the type of relationship (positive or negative)
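
The unit and symmetry properties above can be checked numerically. This is a minimal sketch using only the standard library; the height and weight numbers are made-up illustration data, and `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient r for two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: heights in inches and weights in pounds.
height_in = [60, 62, 65, 68, 70]
weight_lb = [115, 120, 140, 155, 160]

r = pearson_r(height_in, weight_lb)

# Property 1: changing units (inches to centimeters) leaves r unchanged.
height_cm = [h * 2.54 for h in height_in]
r_cm = pearson_r(height_cm, weight_lb)

# Property 2: swapping explanatory and response gives the same r.
r_swapped = pearson_r(weight_lb, height_in)

print(round(r, 4), round(r_cm, 4), round(r_swapped, 4))
```

All three printed values should agree up to rounding, and each falls between -1 and 1.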
6
Q

What plot must be analyzed before looking at a linear relationship?

A

A scatterplot must be analyzed to see whether it makes sense to look for a linear relationship and to make sure there are no curves present.

7
Q

What do r values of -1, 0 and 1 indicate?

A
  -1: perfect negative LINEAR correlation
   0: no linear correlation
   1: perfect positive LINEAR correlation
8
Q

what do different ranges of r indicate?

A

0 to 0.4 or -0.4 to 0: weak positive/negative correlation

0.4 to 0.8 or -0.8 to -0.4: moderate positive/negative correlation

0.8 to 1 or -1 to -0.8: strong positive/negative correlation
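
As a quick self-check, the cutoffs of 0.4 and 0.8 can be turned into a small labeling function. This is only a sketch: `describe_r` is a hypothetical helper, and how to treat a value exactly on a cutoff is an assumption, since the card does not say.

```python
def describe_r(r):
    """Label r using the weak/moderate/strong cutoffs of 0.4 and 0.8."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must fall between -1 and 1")
    if r == 0:
        return "no linear correlation"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    # Assumption: a value exactly on a cutoff (0.4 or 0.8) is placed in
    # the stronger category; the card leaves this open.
    if size < 0.4:
        strength = "weak"
    elif size < 0.8:
        strength = "moderate"
    else:
        strength = "strong"
    return f"{strength} {direction} correlation"

for value in (0.25, -0.55, 0.93):
    print(value, "->", describe_r(value))
```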
9
Q

regression equation and variable meanings

A
yhat=a+bx
a=y-intercept
b=slope
x=explanatory variable
yhat=predicted mean value of response variable
10
Q

How to interpret slope

A

The slope is the amount by which the predicted mean value of the response variable (y) changes when the explanatory variable (x) increases by one unit.

11
Q

residual definition

A

difference between observed and predicted values. Residual=observed-predicted

12
Q

What do residuals represent graphically?

A

the vertical distance each point lies from the regression line

13
Q

How many residuals does each observation have?

A

1

14
Q

What are properties of residuals if the observation is larger and if it is smaller than the predicted value?

A

Larger: positive residual, so the point lies above the line

Smaller: negative residual, so the point lies below the line
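
The sign rule can be seen with a tiny worked example of residual = observed - predicted. The line yhat = 2 + 3x and the three observations are made up for illustration; they are not fitted to any real data.

```python
def predict(x, a=2.0, b=3.0):
    """Predicted value from a hypothetical line yhat = a + b*x."""
    return a + b * x

# (x, observed y) pairs, made up for illustration.
observations = [(1, 6.0), (2, 7.5), (3, 11.0)]

# One residual per observation: observed minus predicted.
residuals = [y - predict(x) for x, y in observations]

for (x, y), res in zip(observations, residuals):
    if res > 0:
        position = "above the line"
    elif res < 0:
        position = "below the line"
    else:
        position = "on the line"
    print(f"x={x}, observed={y}, predicted={predict(x)}, residual={res} ({position})")
```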

15
Q

Which regression line fits a data set best?

A

The line that minimizes the sum of squared residuals (also called the sum of squared errors, SSE).
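
A short sketch of the least-squares idea, assuming the usual closed-form formulas b = Sxy/Sxx and a = ybar - b*xbar (the cards do not state them). The data are made up, and `fit_line` and `sse` are hypothetical helper names.

```python
def fit_line(x, y):
    """Return (a, b) for the least-squares line yhat = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def sse(x, y, a, b):
    """Sum of squared residuals for the line yhat = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Made-up data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)
best = sse(x, y, a, b)

# Nudging the intercept or slope away from the fitted values makes SSE larger.
shifted = sse(x, y, a + 0.5, b)
tilted = sse(x, y, a, b - 0.2)

print(f"a={a:.2f}, b={b:.2f}, SSE={best:.2f}")
print(f"SSE with shifted intercept={shifted:.2f}, with tilted slope={tilted:.2f}")
```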

16
Q

Coefficient of determination

A

r^2, lies between 0 and 1. Gives the percentage of variation in the observed values of the response variable that is explained by the regression line. It is a measure of how useful the line is for making predictions.

17
Q

extrapolation

A

Using a regression line to make predictions for the response variable outside the observed range of the explanatory variable. It may give incorrect predictions if the linear relationship does not hold beyond that range.

18
Q

regression outlier

A

A data point that falls far from the regression line relative to the other data points. Such points are removed if they are the result of a measurement or recording error.

19
Q

influential observation

A

An observation that, when removed, causes the regression line and equation to change considerably. It does not have to be a regression outlier; it is typically a data point separated in the x-direction from the other data points. Removed if it is the result of a measurement or recording error.

20
Q

Simpson’s paradox

A

The direction of an association between two variables can change after a third variable is taken into account.
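
A small numeric illustration of the paradox, using made-up data in which a third variable splits the observations into two groups. Within each group the slope is positive, but pooling the groups gives a negative slope. `slope` is a hypothetical helper that uses the usual least-squares formula.

```python
def slope(x, y):
    """Least-squares slope b = Sxy / Sxx."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

# Made-up data: the third variable is group membership (A or B).
x_a, y_a = [1, 2, 3], [10, 11, 12]
x_b, y_b = [6, 7, 8], [3, 4, 5]

slope_a = slope(x_a, y_a)                   # positive within group A
slope_b = slope(x_b, y_b)                   # positive within group B
slope_all = slope(x_a + x_b, y_a + y_b)     # negative when groups are pooled

print(f"group A: {slope_a:.2f}, group B: {slope_b:.2f}, pooled: {slope_all:.2f}")
```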

21
Q

deterministic relationship

A

a relationship in which y is completely determined by the value of x, not a regression model

22
Q

probabilistic relationship

A

a relationship in which the value of y is related to the value of x, but not all variation in y is explained by the x value.

23
Q

population regression line

A

µy=a+bx
a=population y-intercept
b=population slope

Describes how the population mean of each conditional distribution of the response variable (y) depends on the value of the explanatory variable (x). The model also describes the variability of the y observations.

24
Q

Sample regression line

A

The usual fitted regression line, which describes the relationship between x and the estimated means of y at various values of x. It is different for each sample.

25
Purpose of sum of squares in regression models
quantify explained and unexplained variability in regression
26
total sum of squares
shows total variation
27
regression sum of squares
shows explained variation
28
error sum of squares
shows unexplained variation
29
regression identity
SST=SSR+SSE | r^2=SSR/SST
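
The identity can be verified numerically on a small made-up data set. This sketch fits a simple least-squares line with the usual closed-form formulas (an assumption, since the cards do not give them) and then computes the three sums of squares.

```python
# Made-up data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
b = sxy / sxx
a = mean_y - b * mean_x
y_hat = [a + b * xi for xi in x]

sst = sum((yi - mean_y) ** 2 for yi in y)                 # total variation
ssr = sum((yh - mean_y) ** 2 for yh in y_hat)             # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))     # unexplained variation

r_squared = ssr / sst

print(f"SST={sst:.2f}, SSR={ssr:.2f}, SSE={sse:.2f}, r^2={r_squared:.2f}")
```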
30
what type of variation does a residual quantify?
unexplained
31
What information can we get from the distance between ybar, the regression line and the point on a graph
if the distance from ybar to the regression line is greater than the distance from the point to the regression line, we have more explained variability and r^2 will be larger and vice versa
32
purpose of a regression t-test
used to determine if x is useful in predicting y
33
What are the options for the null hypotheses in a regression t-test?
1. B=0 2. Model is not useful in predicting y 3. variables are independent (no linear relationship)
34
What are you testing in a regression t-test?
whether the regression equation or the mean of y gives a better prediction for y given a value of x
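
A sketch of how the test statistic is usually computed, assuming the standard textbook formula t = b / SE(b) with SE(b) = s / sqrt(Sxx) and n - 2 degrees of freedom; the cards do not state this formula. The data are made up. The standard library has no t distribution, so the result would be compared against a t table or statistical software.

```python
import math

# Made-up data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b = sxy / sxx               # sample slope
a = mean_y - b * mean_x     # sample intercept

# Residual standard deviation s = sqrt(SSE / (n - 2)).
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

se_b = s / math.sqrt(sxx)   # standard error of the slope
t_stat = b / se_b           # tests the null hypothesis B = 0
df = n - 2

print(f"b={b:.3f}, SE(b)={se_b:.3f}, t={t_stat:.3f}, df={df}")
```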
35
what does B=0 indicate conceptually?
x is not useful for predicting y; ybar is the best prediction of y for any value of x
36
What does B=0 indicate graphically?
if we graph the line yhat = ybar, it is a horizontal line that crosses the y-axis at the mean of y
37
standardized/studentized residuals
measures how many standard errors each observation is from the regression line
38
how do you use standardized/studentized residuals to find outliers?
make histograms and boxplots of them to find outliers; see which observations are more than 3 standard errors from the regression line
39
most important simple linear regression assumption
linear relationship between explanatory and response variables
40
residual standard deviation definition
aka standard error of the estimate; used to estimate the common standard deviation of the conditional distributions of y
41
residual std interpretation
indicates, on average, how much the predicted values of the response variable, yhat, differ from the observed values of the response variable, y
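
A minimal sketch of the calculation, assuming the usual simple-regression formula s = sqrt(SSE / (n - 2)). The observed and predicted values are made up for illustration.

```python
import math

# Made-up observed values and the predictions from a fitted line.
observed = [2, 4, 5, 4, 5]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]

n = len(observed)
residuals = [y - y_hat for y, y_hat in zip(observed, predicted)]
sse = sum(res ** 2 for res in residuals)

# Divide by n - 2 because two parameters (intercept and slope) were estimated.
s = math.sqrt(sse / (n - 2))

print(f"SSE={sse:.2f}, residual standard deviation s={s:.3f}")
```

Here s is read as the typical size of the gap between an observed y and its predicted value.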
42
confidence interval for population regression line
used to estimate the mean of y for all observations that have a particular value of x; gives a range of plausible values for the mean; same assumptions as simple linear regression
43
confidence interval for y
usually called a prediction interval; used to estimate the value of y for an individual who has a particular value of x; gives a range of plausible values for a randomly selected subject; same assumptions as simple linear regression
44
reasons to use multiple regression
1. make better predictions by using several explanatory variables at once 2. consider the simultaneous impact of the predictors of interest on the response 3. the effect of an explanatory variable can change after you account for potential lurking variables
45
how does multiple linear regression work?
you analyze the association between 2 variables while controlling/fixing the values of other variables
46
similarities between multiple and simple linear regression
1. Use least squares to estimate B 2. same calculations for residuals 3. same calculation of standard error estimate 4. same assumptions
47
differences between multiple and simple linear regression
1. multiple linear regression has more model flexibility 2. multiple regression does not fit a 2-D line to data 3. interpretation of B changes 4. multiple regression is more complex due to having multiple explanatory variables 5. can answer different types of questions
48
population multiple regression model
relates the mean value of a quantitative response variable y to a set of explanatory variables. µy=alpha+B1X1+B2X2+...+BnXn
49
sample multiple regression line
when we substitute values of x1, x2,...xn, the equation gives the estimated mean of y for all subjects with those values. yhat has the same form as the population multiple regression model equation, with coefficients estimated from the sample
50
multiple regression coefficient interpretation
holding all other variables constant, for every one-unit increase in xn, the predicted mean of y changes by ..... units
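
The "holding all other variables constant" reading can be checked with a tiny sketch. The intercept and coefficients below are hypothetical numbers chosen for illustration, not estimates from any data set.

```python
# Hypothetical fitted model: yhat = 10 + 2*x1 - 0.5*x2
ALPHA = 10.0
BETAS = [2.0, -0.5]

def predict(xs):
    """Predicted mean of y for explanatory values xs = [x1, x2]."""
    return ALPHA + sum(b * x for b, x in zip(BETAS, xs))

# Increase x1 by one unit while holding x2 fixed at 5.
before = predict([2, 5])
after = predict([3, 5])
change = after - before

print(f"before={before}, after={after}, change={change}")
```

The change in the prediction equals the coefficient on x1, whatever fixed value x2 takes.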
51
multiple correlation
R, describes the association between y and a set of explanatory variables, same weak/moderate/strong categories as before, ranges from 0 to 1. Its square, R^2, is the percentage of variability in y that is explained by the regression equation
52
R^2
R^2=SSR/SST; a larger value means more variability is explained by the model, meaning better predictions. Increases (or at least never decreases) as predictors are added to the model; does not depend on units
53
what do R^2 values of 0 and 1 indicate
0: all yhat=ybar 1: all y=yhat
54
how to solve for F in a multiple regression model
F=MSR/MSE
55
how to find degrees of freedom for a multiple regression model
df1=# of explanatory variables (k) | df2=n-(k+1), i.e. n minus the number of parameters in the model (including the intercept)
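
A sketch of the F calculation, assuming the usual definitions MSR = SSR/df1 and MSE = SSE/df2 with df1 = k and df2 = n - k - 1 (the intercept counts as one estimated parameter). The sums of squares below are made-up numbers, and `f_statistic` is a hypothetical helper name.

```python
def f_statistic(ssr, sse, n, k):
    """Return (F, df1, df2) for a model with k explanatory variables."""
    df1 = k              # numerator degrees of freedom
    df2 = n - k - 1      # denominator degrees of freedom
    msr = ssr / df1      # mean square for regression
    mse = sse / df2      # mean square error
    return msr / mse, df1, df2

# Made-up values: 20 observations, 3 explanatory variables.
f_value, df1, df2 = f_statistic(ssr=120.0, sse=30.0, n=20, k=3)

print(f"F={f_value:.2f} with df1={df1} and df2={df2}")
```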
56
purpose of confidence intervals in multiple regression
to estimate values of beta parameters and give plausible values for the parameter
57
what does 0 being in the confidence interval indicate during multiple regression?
the explanatory variable may have no effect on the response variable when other explanatory variables are held constant
58
multiple regression model assumptions
L.I.N.E. | Linearity: the form of the relationship in the population matches the model we fit to the data | Independence: observations are independent of one another | Normality: normal y distribution for each setting of the explanatory variables | Equality of variance: the distribution of y values has the same variance for every setting of the explanatory variables
59
how to check normality in the multiple regression model
Check normality of residuals: make sure the QQ plot is roughly linear and the histogram is roughly bell shaped
60
What should a scatterplot of the residuals look like when checking multiple regression assumptions?
fall roughly in a horizontal band that is centered about the x-axis and not exhibit any curvature or pattern
61
What do residual plots that only violate the linearity assumption look like?
the points follow a linear or curved trend instead of a horizontal band
62
What does a residual plot that only violates the equal variance assumption look like?
fanning shape
63
what does a residual plot that violates the linearity and equal variance assumption look like?
linear/curved WITH fanning
64
What must each graph look like to pass each assumption of multiple regression?
Linearity: scatterplot shows a linear relationship, residual plot shows no curve/pattern | Independence: assume this is not violated | Normality: normal probability plot shows a straight line | Equal variance: residual plot shows no fanning or pattern
65
How do you know if a new model is better in multiple regression?
1. Adjusted R^2 increases 2. F-stat increases 3. MSE decreases 4. Fewer variables that aren't useful
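
A short sketch of why adjusted R^2 is used for this comparison, assuming the common formula adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1), which the cards do not state. The R^2 values are made up: the larger model has a slightly higher R^2 but a lower adjusted R^2.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k explanatory variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Made-up comparison on the same 30 observations.
small_model = adjusted_r2(0.800, n=30, k=2)   # 2 predictors
large_model = adjusted_r2(0.805, n=30, k=5)   # 3 extra predictors, tiny R^2 gain

print(f"small model: {small_model:.4f}, large model: {large_model:.4f}")
```

Under these assumed numbers the extra predictors do not earn their place, because adjusted R^2 drops even though R^2 rose.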