Correlation Flashcards

1
Q

What is correlation in statistics?

A

Correlation quantifies the extent of association between two continuous variables.

2
Q

How is correlation different from regression?

A

Correlation measures the strength of a relationship between two variables, while regression explains one variable in terms of another with an equation.

3
Q

What is Pearson’s correlation coefficient (𝑟)?

A

A measure of linear correlation between two variables, ranging from -1 (perfect negative) to +1 (perfect positive).

4
Q

What does a correlation coefficient of 0 mean?

A

It means there is no linear relationship between the two variables.

5
Q

What is a perfect correlation?

A

A perfect correlation occurs when all data points fall exactly on a straight line, with 𝑟 = ±1.

6
Q

What does concurvity refer to?

A

It describes a non-linear association between two continuous variables.

7
Q

In R, how can you compute the correlation coefficient between two variables, LLL and TotalHeight?

A

# deviations of each variable from its mean
diffx <- hgt$LLL - mean(hgt$LLL)

diffy <- hgt$TotalHeight - mean(hgt$TotalHeight)

# Pearson's r: sum of cross-products over the square root of
# the product of the sums of squares
r <- sum(diffx * diffy) / sqrt(sum(diffx^2) * sum(diffy^2))
print(r)

8
Q

What is a simpler way to compute correlation in R?

A

cor(x = hgt$TotalHeight, y = hgt$LLL)

9
Q

What is covariance?

A

The numerator of the correlation formula r = cov(x, y) / (sx · sy): the average cross-product of the two variables’ deviations from their means, measuring how they vary together.

10
Q

How is correlation different from covariance?

A

Correlation standardizes covariance to a range of -1 to +1, making it comparable across different units.

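The covariance–correlation relationship on these cards can be checked numerically in R; a minimal sketch with simulated data (variable names are illustrative):

```r
# Simulated data: y depends linearly on x, plus noise
set.seed(42)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# Correlation is covariance standardized by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))
isTRUE(all.equal(r_manual, cor(x, y)))  # TRUE
```

Unlike the covariance, r_manual is unit-free and always lies between -1 and +1.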
11
Q

What are the null (𝐻0) and alternative (𝐻1) hypotheses for testing correlation?

A

H0: ρ = 0 (there is no association between the variables)

H1: ρ ≠ 0 (there is an association)

12
Q

What test statistic is used to test correlation?

A

A 𝑡-test with 𝑛−2 degrees of freedom.

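A sketch of this test statistic in R, checked against cor.test() (the data here are simulated for illustration):

```r
# t statistic for a correlation: t = r * sqrt(n - 2) / sqrt(1 - r^2)
set.seed(1)
x <- rnorm(30)
y <- x + rnorm(30)
n <- length(x)
r <- cor(x, y)

t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
ct <- cor.test(x, y)
isTRUE(all.equal(unname(ct$statistic), t_stat))  # TRUE
ct$parameter  # df = n - 2 = 28
```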
13
Q

How do you compute a two-tailed 𝑝-value for correlation in R?

A

2 * pt(q = abs(t_stat), df = n - 2, lower.tail = FALSE)

(Taking abs(t_stat) ensures the two-tailed p-value is correct even when the t statistic is negative.)

14
Q

What function in R performs a correlation test?

A

cor.test(x = hgt$TotalHeight, y = hgt$LLL)

15
Q

What is the alternative hypothesis in a correlation test?

A

H1: ρ ≠ 0 (the true correlation is not zero, meaning an association exists).

16
Q

How do we interpret a very small 𝑝-value in a correlation test?

A

It suggests strong evidence against the null hypothesis, meaning there is likely an association between the variables.

17
Q

What does the confidence interval in a correlation test represent?

A

It provides a range within which the true population correlation coefficient (𝜌) is likely to lie.

18
Q

Why is correlation not the same as causation?

A

Correlation only shows an association, but a causal link requires further evidence, such as controlled experiments.

19
Q

What is a “spurious” or “nonsense” correlation?

A

A correlation between two variables that occurs due to chance or a hidden third variable rather than a causal relationship.

20
Q

What are three possible explanations for a correlation?

A

Chance (random coincidence)

A third variable affecting both

Genuine causation

21
Q

What is Anscombe’s quartet?

A

A set of four datasets that have the same correlation coefficient but different distributions, illustrating the limitations of correlation.

22
Q

What is the correlation coefficient (𝑟) for each pair in Anscombe’s quartet?

A

r=0.816 for all pairs, despite vastly different data patterns.

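Anscombe’s quartet ships with base R as the anscombe data frame, so this claim is easy to verify:

```r
# Correlation for each of the four x/y pairs in Anscombe's quartet
data(anscombe)
r_values <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
r_values  # all four values are approximately 0.816
```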
23
Q

What does Anscombe’s quartet demonstrate about correlation?

A

That correlation alone does not capture the true nature of relationships between variables; visualization is essential.

24
Q

What does the datasauRus dataset illustrate?

A

That datasets with different structures can have identical correlation values, emphasizing the need for visualization.

25
What are the three main reasons to study relationships between variables?
Description – To describe patterns in data.
Explanation – To understand causal relationships.
Prediction – To estimate unknown values.
26
What is simple linear regression?
A statistical method to model the relationship between one explanatory variable and one response variable.
27
What is the general form of a simple linear regression equation?
Y = β0 + β1X + ε, where Y is the response variable, X is the explanatory variable, β0 is the intercept, β1 is the slope, and ε is the error term.
28
What are the assumptions of simple linear regression?
Linearity – The relationship between X and Y is linear.
Independence – Observations are independent.
Homoscedasticity – Errors have constant variance.
Normality – Errors are normally distributed.
29
What is the dependent (response) variable in a regression model?
The variable we aim to explain or predict (denoted as 𝑌).
30
What is the independent (predictor) variable in a regression model?
The variable used to explain or predict the response variable (denoted as 𝑋)
31
What does the intercept (𝛽0) represent in a regression model?
It is the expected value of 𝑌 when 𝑋=0, or where the regression line crosses the y-axis.
32
When might the intercept (𝛽0) not be meaningful?
When the explanatory variable (𝑋) cannot realistically take a value of zero (e.g., age in a salary regression).
33
What does the slope (𝛽1) represent in a regression model?
It describes the expected change in 𝑌 for a one-unit increase in 𝑋.
34
What does the sign of the slope (𝛽1) indicate?
β1 > 0 → positive relationship
β1 = 0 → no relationship
β1 < 0 → negative relationship
35
What is the error term (𝜖𝑖) in a regression model?
It represents the difference between the observed and predicted values of 𝑌, accounting for variability not explained by 𝑋.
36
What assumption is made about the error term in simple linear regression?
Errors (εi) are assumed to be normally distributed with mean zero: εi ~ N(0, σ²).
37
What is the least squares (LS) criterion in regression?
It finds the line that minimizes the sum of squared residuals (SSR), ensuring the best fit for the data.
38
Why is the least squares method preferred?
It ensures that the fitted line has the smallest possible sum of squared differences between observed and predicted values, leading to optimal parameter estimates.
39
What does the least squares criterion minimize in regression?
The sum of squared residuals (SSR), ensuring the best-fitting line.
40
What are residuals in regression?
The vertical differences between observed data points and the fitted regression line.
41
What is the formula for estimating the intercept (𝛽0) in simple linear regression?
β̂0 = ȳ − β̂1x̄
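This estimate, together with the companion slope estimate β̂1 = Sxy/Sxx, can be checked against lm() on simulated data:

```r
# Least-squares estimates by hand, compared with lm()
set.seed(7)
x <- runif(50, 0, 10)
y <- 3 + 0.5 * x + rnorm(50)

b1 <- cov(x, y) / var(x)      # slope: Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)  # intercept: ybar - b1 * xbar

fit <- lm(y ~ x)
isTRUE(all.equal(unname(coef(fit)), c(b0, b1)))  # TRUE
```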
42
Why is least squares estimation preferred?
It provides optimal estimates and coincides with maximum likelihood estimation under normality assumptions.
43
What R function is used to fit a simple linear regression model?
lm(), which stands for linear model.
44
What is the basic syntax for fitting a linear model in R?
simple.lm <- lm(y ~ x, data=exData)
45
How can you retrieve the coefficients (𝛽0,𝛽1) from a fitted linear model in R?
With coef(simple.lm), or simply by printing the model object: simple.lm
46
Given the fitted model: ŷi = 12.426 + 1.902xi, what is the predicted value when x = 50?
ŷ = 12.426 + (1.902 × 50) = 107.526
47
What function in R gives predicted values for the fitted regression model?
fitted()
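fitted() returns predictions only at the observed x values; for new x values, predict() with a newdata argument is the usual tool. A sketch reusing the simple.lm and exData names from the earlier cards (the data here are made up):

```r
# Fit a simple linear model on toy data
set.seed(2)
exData <- data.frame(x = 1:10)
exData$y <- 12 + 2 * exData$x + rnorm(10)
simple.lm <- lm(y ~ x, data = exData)

fitted(simple.lm)                                  # predictions at the observed x
predict(simple.lm, newdata = data.frame(x = 5.5))  # prediction at a new x
```

Note that x = 5.5 lies inside the observed range of 1 to 10, which matters for the extrapolation warning on the next card.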
48
Why is it risky to use a regression model to make predictions outside the range of observed 𝑥 values?
The relationship may not remain linear beyond the observed data, leading to inaccurate predictions.
49
What does the Residual Standard Error (RSE) in an R regression summary represent?
The estimated standard deviation of residuals, which measures the spread of observed values around the fitted regression line.
50
What does the Adjusted R-squared value account for in regression?
It adjusts for the number of predictors, providing a more accurate measure of model fit when multiple explanatory variables are present.
51
What does the Multiple R-squared value in an R regression summary indicate?
The proportion of variance in the response variable explained by the explanatory variable(s).
52
How is the F-statistic in an R regression summary interpreted?
It tests whether at least one predictor variable is significantly associated with the response variable.
53
What is the null hypothesis for testing the significance of a regression coefficient?
H0: βi =0, meaning the predictor has no effect on the response variable.
54
What does a very small p-value (e.g., < 0.001) for a regression coefficient indicate?
Strong evidence against 𝐻0, suggesting the predictor is statistically significant.
55
In the R summary output, what does the Significance Codes section indicate?
It categorizes p-values using ***, **, *, and . to show different levels of statistical significance.
56
What does a Residuals section in an R regression summary show?
The distribution of residuals (errors), including minimum, 1st quartile, median, 3rd quartile, and maximum values.
57
Given the fitted regression equation WeightA = 102.18 + 1.72 × SST, what does the slope coefficient 1.72 mean?
For each 1-unit increase in SST, the predicted WeightA increases by 1.72 units on average.
58
What does the Residual Standard Error = 2.093 in an R summary output mean?
On average, the actual WeightA values deviate by about 2.093 units from the predicted values.
59
What is goodness of fit in regression?
It refers to how well the model explains the variability in the response variable.
60
How is R² interpreted in regression?
R² ranges from 0 to 1:
R² = 1 → perfect fit
R² = 0 → the model explains no variability
R² = 0.7954 → the model explains 79.54% of the response variable’s variability
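R² can be computed by hand as 1 − SSres/SStot and checked against summary(); a sketch on simulated data:

```r
# R-squared by hand versus summary(lm)
set.seed(3)
x <- rnorm(40)
y <- 1 + 2 * x + rnorm(40)
fit <- lm(y ~ x)

ss_res <- sum(residuals(fit)^2)  # unexplained variation
ss_tot <- sum((y - mean(y))^2)   # total variation
r2 <- 1 - ss_res / ss_tot
isTRUE(all.equal(r2, summary(fit)$r.squared))  # TRUE
```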
61
What does the ANOVA table show in regression analysis?
It decomposes total variability into:
Model Sum of Squares (SSModel) → explained variation
Residual Sum of Squares (SSRes) → unexplained variation
F-statistic → measures overall model significance
62
How is the F-statistic in ANOVA related to the t-statistic for a predictor?
F = t². Example: if t = 11.99, then F = 11.99² ≈ 143.8.
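The identity F = t² holds exactly in simple linear regression and can be verified from a summary() object (simulated data):

```r
# Overall F statistic equals the squared t statistic of the slope
set.seed(5)
x <- rnorm(30)
y <- 2 + x + rnorm(30)
fit <- lm(y ~ x)

t_slope <- summary(fit)$coefficients["x", "t value"]
f_stat  <- summary(fit)$fstatistic[["value"]]
isTRUE(all.equal(f_stat, t_slope^2))  # TRUE
```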
63
What does a high F-statistic and a small p-value indicate?
It suggests that at least one predictor variable significantly explains variation in the response.
64
What does Residual Standard Error (RSE) represent in the regression summary?
It is the standard deviation of residuals, measuring how much actual values deviate from predictions.
65
What are the key takeaways from regression analysis?
Regression models explain relationships between variables.
R² measures how much variation is explained.
ANOVA tests overall model significance.
The F-statistic determines whether predictors improve the model.
RSE shows the spread of residuals.
66
How is Mean Squared Error (MSE) related to RSE?
MSE = RSE². Example: if RSE = 2.093, then MSE = 2.093² ≈ 4.38.
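This relationship (the RSE reported by summary() is the square root of the residual mean square) can be verified on simulated data:

```r
# RSE reported by summary() is sqrt(MSE), with MSE = SSres / (n - 2)
set.seed(9)
x <- rnorm(25)
y <- 5 + x + rnorm(25)
fit <- lm(y ~ x)

mse <- sum(residuals(fit)^2) / df.residual(fit)   # residual mean square
isTRUE(all.equal(sqrt(mse), summary(fit)$sigma))  # TRUE
```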