Exam 2 Flashcards

(67 cards)

1
Q

What is sample correlation

A

Measures the strength of the linear relationship between two variables. It is denoted by R, and zero indicates no correlation

2
Q

Association vs correlation

A

Association is about general relatedness of two variables, while correlation is about linearity specifically

3
Q

rnorm() function

A

Generates a vector of random numbers with a normal distribution
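A minimal sketch of rnorm() in use (the mean and sd values here are illustrative):

```r
set.seed(1)
x <- rnorm(100, mean = 10, sd = 2)  # 100 draws from a normal with mean 10, sd 2
mean(x)   # near 10
sd(x)     # near 2
hist(x)   # roughly bell-shaped
```

With no mean/sd arguments, rnorm(n) draws from the standard normal.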

3
Q

myCor() function

A

Performs correlation and linear regression for all pairs of numeric columns in the input data frame.

4
Q

Do outliers have an effect on correlation

A

Yes, they can make the correlation artificially high or low

5
Q

What is jittering

A

Adding a small amount of random, normally distributed noise to both the x and y values. This allows us to see overlapping observations more clearly
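A hand-rolled jittering sketch, assuming discrete scores that overplot (the noise sd of 0.1 is an illustrative choice):

```r
set.seed(1)
x <- rep(1:5, times = 20)        # many points stack on top of each other
y <- rep(1:4, times = 25)
plot(x + rnorm(100, sd = 0.1),   # small normal noise separates
     y + rnorm(100, sd = 0.1))   # the overlapping observations
```

Base R also has jitter(), which adds uniform rather than normal noise.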

6
Q

What function allows us to calculate correlation

A

cor()
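A minimal cor() example, using the built-in mtcars data:

```r
cor(mtcars$wt, mtcars$mpg)           # correlation between two vectors
cor(mtcars[, c("mpg", "wt", "hp")])  # correlation matrix for several columns
```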

7
Q

What function allows us to fit a regression line to data

A

lm()
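A minimal lm() example, again on the built-in mtcars data:

```r
fit <- lm(mpg ~ wt, data = mtcars)  # regress mpg on weight
coef(fit)     # intercept and slope
summary(fit)  # coefficient tests, R-squared, residual standard error
```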

8
Q

What function allows us to calculate the confidence interval for the true correlation

A

cor.test()
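A sketch of cor.test() on the same built-in data:

```r
ct <- cor.test(mtcars$wt, mtcars$mpg)
ct$estimate  # sample correlation r
ct$conf.int  # 95% confidence interval for the true correlation
ct$p.value   # test of H0: true correlation = 0
```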

9
Q

What is bootstrapping

A

Allows us to estimate the distribution of an estimator by resampling the data. Bootstrap samples are drawn with replacement
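A minimal bootstrap sketch for a correlation; 1000 resamples is an arbitrary illustrative choice:

```r
set.seed(1)
n <- nrow(mtcars)
boot_r <- replicate(1000, {
  idx <- sample(1:n, n, replace = TRUE)  # resample rows WITH replacement
  cor(mtcars$wt[idx], mtcars$mpg[idx])   # recompute the estimator
})
quantile(boot_r, c(0.025, 0.975))        # bootstrap percentile 95% CI
```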

10
Q

Parametric vs non parametric tests

A

Parametric tests assume certain conditions of the data (usually assumptions about normality, variance, standard deviation, etc). Nonparametric tests make fewer assumptions

11
Q

What does it mean if the bootstrap confidence interval is wider than the theoretical ci

A

The underlying assumptions of the model may not be satisfied. Heteroskedasticity and outliers can contribute to this

12
Q

What is a permutation test / what does it do

A

Allows us to quantify the difference/relationship between groups/variables. Permuted data is essentially just reshuffled data
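A sketch of a permutation test for a difference in group means (here, mpg by transmission type in mtcars):

```r
set.seed(1)
obs <- mean(mtcars$mpg[mtcars$am == 1]) - mean(mtcars$mpg[mtcars$am == 0])
perm <- replicate(5000, {
  shuffled <- sample(mtcars$mpg)  # reshuffle WITHOUT replacement
  mean(shuffled[mtcars$am == 1]) - mean(shuffled[mtcars$am == 0])
})
mean(abs(perm) >= abs(obs))       # two-sided permutation p-value
```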

13
Q

sample() function

A

Takes a sample of data with or without replacement. If replace = TRUE, it is used for bootstrapping; if replace = FALSE, for a permutation test
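A quick illustration of the three modes:

```r
x <- 1:10
sample(x, 5)                   # 5 values, without replacement
sample(x, 10, replace = TRUE)  # bootstrap-style resample (duplicates possible)
sample(x)                      # a full permutation of x
```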

14
Q

rep()

A

rep(x, times)
Replicates the values in x a specified number of times.
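A few examples showing the main arguments:

```r
rep(1:3, times = 2)                # 1 2 3 1 2 3
rep(1:3, each = 2)                 # 1 1 2 2 3 3
rep(c("a", "b"), times = c(2, 3))  # "a" "a" "b" "b" "b"
```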

15
Q

corrplot()

A

Visual representation of correlations

16
Q

What are residuals?

A

Estimated errors of the regression (i.e., the differences between the actual values and the values estimated by the regression line)

17
Q

How do we estimate the standard deviation of the residuals

A

Use sample standard deviation of the residuals as our estimate

18
Q

What are the assumptions of a linear regression model

A

linearity and normal distribution of errors

19
Q

What is a good way to see if data is normally distributed

A

Make a normal quantile plot
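A minimal normal quantile plot sketch:

```r
x <- rnorm(100)
qqnorm(x)  # points close to a straight line suggest normality
qqline(x)  # reference line for comparison
```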

20
Q

If a linear model fits the data well, what should the residuals vs. fitted values plot look like?

A

A formless blob with no patterns

21
Q

After fitting a model, what do we need to do?

A

See if we’ve met the model assumptions

22
Q

What diagnostics do we perform to see if we’ve met model assumptions

A

A normal quantile plot to check for normality, and a plot of residuals vs. fitted values

23
Q

What does r squared do

A

Measures the percentage of variability in Y explained by the model (the X’s)

24
Influential point vs. outlier
Influential point: a point which, if removed, causes a large change in the fit of the model. Outlier: a point with a large residual
25
Steps of multiple linear regression
Identify variables (response and predictors), check relationships (plots), perform regression, identify significant predictors, check model assumptions
26
What is heteroskedasticity
A situation where the variance is not constant across all observations, e.g., residuals that get bigger as x increases
27
In simple linear regression, what is r squared
The square of the correlation
28
Explain multiple r squared
the amount of variability in the response variable Y that is explained by the regression model
29
For multiple regression, what is an ideal R squared
If a more complicated model has a much higher R squared, then it is probably a better model. If a more complicated model has only a slightly higher R squared, then the simpler model is better
30
When do you use adjusted r squared
To account for the number of terms used in a multiple regression model. It is smaller than R squared because it adds a penalty for the number of terms used in the model
31
What is the AIC and what are its assumptions
Measures the goodness of fit of a model. Assumes normally distributed errors with constant variance
32
When comparing models, is a larger or smaller AIC better
A smaller AIC is better
33
What is BIC and what are its assumptions
Similar to AIC (measures the goodness of fit of a model), but it gives a larger penalty for using more parameters to fit the model. Assumes normally distributed errors with constant variance. Again, smaller is better
34
What functions do you use to make a correlation plot
sigcorr <- cor.mtest(x, conf.level = 0.95), then corrplot.mixed(cor(x), p.mat = sigcorr$p), where x is a data frame of numeric columns
35
regsubsets()
Performs best subset selection (identifies best model based on # of predictors), where best is quantified using RSS
36
What is RSS
The sum of the squares of the residuals, i.e., the deviations of predicted values from the actual data
37
How to find best model according to R squared
which.max(x$rsq), where x is the summary of a regsubsets model object
38
What are indicator variables
Aka dummy variables. Binary variables that take the value 0 or 1. Each indicator variable is 1 if the observation is of the specified level, 0 otherwise.
39
When including categorical variables into the model, are all categorical variables entered?
No, one level of each categorical variable is not entered into the model. The omitted level becomes the reference level for the model
40
What do we call variance that is not constant
Heteroskedasticity
41
When do we use a box cox procedure
If there is heteroskedasticity and the variance is some function of the mean. Use boxcox(x), where x is a linear model (lm) of some data
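A sketch of the procedure, assuming the MASS package (where boxcox() lives) and the built-in cars data:

```r
library(MASS)
fit <- lm(dist ~ speed, data = cars)
bc <- boxcox(fit)      # plots profile log-likelihood against lambda
bc$x[which.max(bc$y)]  # lambda with the highest log-likelihood
```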
42
What is backwards stepwise regression
Variable selection method in multiple regression --> starts with all predictors and iteratively removes least significant ones until only statistically significant variables are left
43
What is best subset regression
A model selection technique that examines all possible combinations of predictor variables to find the best-fitting model for a given response variable (selecting the best SUBSET of predictors)
44
When can't you use best subset regression?
If we have categorical predictors
45
How do evaluate categorical variables
If any level (indicator variable) is significant, leave the entire categorical variable (all levels) in the model. If all levels are non-significant, remove the categorical predictor. Use an ANOVA table, which tests all levels simultaneously
46
What do interaction plots help us to do?
Look at how means change for various combinations of data & variables
48
Anova() function
Anova(x, type = 3), where x is a fitted model object (e.g., from lm() or aov()); performs type III tests
49
what are the assumptions of anova
The data have a normal distribution, the groups have the same standard deviation (but possibly different means), and all observations are independent. The conditions are considered met as long as the ratio of the largest to smallest group sample standard deviation is less than 2
50
how do we check anova assumptions
qq plot of residuals and fitted vs residuals plot
51
before you fit an anova model, what must you do
check the ratio of max/min standard deviations
52
which function(s) do we use to fit an anova model
aov() or lm()
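A minimal sketch of both fits on the same built-in data:

```r
fit1 <- aov(mpg ~ factor(cyl), data = mtcars)
summary(fit1)  # ANOVA table

fit2 <- lm(mpg ~ factor(cyl), data = mtcars)
anova(fit2)    # equivalent ANOVA table from the lm fit
```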
53
how do we fit a model without an intercept
Use “-1”, e.g., lm(y ~ x - 1)
54
Can we use confidence intervals to compare group means
No
55
How do we test for equality of variances between groups (assuming normality)? What r function? What assumptions does this test make?
Using Bartlett’s test: bartlett.test(data, groups). It assumes normality. A non-significant p-value means there is no evidence of a difference in variances
56
How do we test for equality of variances between groups (assuming no normality)? What r function? What assumptions does this test make?
Using the Levene test: leveneTest(data, groups). It does not require normality
57
When do we use Welch’s ANOVA (as opposed to a regular one-way ANOVA)? Which R function?
When we have unequal variances. oneway.test(x ~ y, data = df)
58
When do we use a Kruskal-Wallis test (as opposed to one-way ANOVA)? What R function?
A non-parametric test; we use it when the assumptions of ANOVA are not met (i.e., when variances are unequal or the data are not normally distributed). kruskal.test()
59
What is a box cox transformation? How do we interpret it
Makes the data more normally distributed. Transforms the data by applying a power (lambda) to it. Lambda is the x-axis of the Box-Cox plot
60
When do we use a two way anova? Which r function do we use?
When we have two independent variables (factors). Example: comparing average test scores of students at different schools AND different grade levels. aov()
61
When do we use summary.lm
To get regression-style summary information (coefficient estimates) from an ANOVA model
62
Why/when do we use tukey's comparison test? Which function
It tells us how different the group means are, pair by pair, adjusting for multiple comparisons. TukeyHSD(aov)
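A minimal sketch:

```r
fit <- aov(mpg ~ factor(cyl), data = mtcars)
TukeyHSD(fit)  # pairwise differences in group means with adjusted p-values and CIs
```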
63
If we use boxCox on data that is not normally distributed with unequal variance and lambda is zero, what should we do?
Log transformation
64
How do we use the Anova() function
Anova(lm, type = 3)
65
Explain covariate vs factor
A covariate is a continuous (numeric) predictor, while a factor is a categorical predictor
66
What does ancova do/mean
ANCOVA stands for analysis of covariance. We fit a separate slope between x and y for each level of some other categorical variable.
67
What is a GLM
Generalized linear model. An umbrella term for multiple kinds of linear models, including regression, one-way ANOVA, two-way ANOVA, and one- and two-sample tests