6. correlations and bivariate regression Flashcards

(24 cards)

1
Q

def correlation test

A

measures the strength and direction of a linear association between two quantitative (numerical) variables
It tells us:

Direction: Are the variables increasing or decreasing together?

Strength: How closely do they follow a straight-line pattern?

2
Q

def linear relationship

A

one variable changes at a constant rate with respect to another. This can be positive or negative.

In simple terms: When you plot the data, the points follow a straight line (or close to it).

📈 Examples:
Positive linear: More hours studied → higher test scores.
Negative linear: More absences → lower grades.

3
Q

def correlation coefficient

A

a number, denoted r

*shows the strength and direction of the relationship.
r = 1 → Perfect positive correlation (as one variable increases, so does the other).
r = -1 → Perfect negative correlation (as one increases, the other decreases).
r = 0 → No linear correlation (no clear straight-line trend).
🔍 This value is called the Pearson correlation coefficient.

ex: Hours Studied vs Exam Score (r = +1.0)
-> Strength: Perfect linear relationship
Direction: Positive (more hours studied → higher exam scores).

*Measures how closely the data points cluster around the best-fitting line (regression line): When you draw a regression line (line of best fit) through a scatter plot, r tells you how tightly the data points “hug” that line.
-If r is close to ±1:
The points lie close to the line.
The relationship is strong and predictable.
-If r is close to 0:
The points are widely scattered (spread).
There is little or no linear relationship.
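As a quick sketch in Python (with made-up, perfectly linear data), NumPy's corrcoef gives r directly:

```python
import numpy as np

# Hypothetical data: hours studied vs. exam score, perfectly linear (score = 44 + 8*hours)
hours = np.array([1, 2, 3, 4, 5], dtype=float)
scores = np.array([52, 60, 68, 76, 84], dtype=float)

# np.corrcoef returns the 2x2 correlation matrix; r is the off-diagonal entry
r = np.corrcoef(hours, scores)[0, 1]
print(round(r, 4))  # → 1.0 (perfect positive correlation)
```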

4
Q

scatter plot def

A

helps you see the relationship between two variables.
Even if r = 0, there might still be a non-linear relationship — so always plot your data! (see graphs)
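A minimal illustration of why plotting matters, using invented data where y depends perfectly on x but the *linear* correlation is zero:

```python
import numpy as np

# y is a perfect (non-linear) function of x: y = x^2
x = np.array([-2, -1, 0, 1, 2], dtype=float)
y = x ** 2  # [4, 1, 0, 1, 4] — a parabola, not a line

# Pearson's r only detects straight-line trends, so it misses this relationship
r = np.corrcoef(x, y)[0, 1]
print(round(r, 4))  # → 0.0, despite y being fully determined by x
```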

5
Q

what about outliers in a correlation?

A

Outliers matter: One unusual data point can skew the correlation.

r = 0 doesn’t mean no relationship — just no linear one.

Correlation ≠ Causation!
Just because two things go up or down together doesn’t mean one causes the other.

Example: Ice cream sales and homicides both rise in summer — but eating ice cream doesn’t cause violence.

6
Q

def regression

A

statistical method used to predict or explain the relationship between two variables.

It finds the line that best fits the data in a scatter plot. This line helps you understand how changes in one variable (x) affect the other (y).

7
Q

linear regression

A
  1. linear regression: models the relationship between variables by fitting a line to the data. The general form is: Y=a+bX
    -> The sign (+ or –) shows the direction of the relationship and the size (absolute value of b) shows the magnitude or strength of the effect.
8
Q

linear regression equation

A

y = a + bx
y: Dependent variable (what you’re trying to predict or explain)

x: Independent variable (the one you think influences y)

b: Slope (shows how much y changes when x increases by 1 unit)

a: Intercept (the value of y when x = 0; where the line crosses the y-axis)
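A small sketch with invented data showing how a and b can be estimated (here via NumPy's polyfit, one of several ways to fit the line):

```python
import numpy as np

# Hypothetical data following y = 3 + 2x exactly
x = np.array([0, 1, 2, 3, 4], dtype=float)
y = 3 + 2 * x  # [3, 5, 7, 9, 11]

# np.polyfit with degree 1 returns [slope b, intercept a]
b, a = np.polyfit(x, y, 1)
print(round(a, 4), round(b, 4))  # intercept a ≈ 3.0, slope b ≈ 2.0
```

Here b = 2 means: each extra unit of x adds 2 to the predicted y; a = 3 is the predicted y when x = 0.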

9
Q

interpret the slope of a regression

A

If b > 0 → There’s a positive relationship (as x increases, y increases)

If b < 0 → There’s a negative relationship (as x increases, y decreases)

If b = 0 → No linear relationship (x has no linear effect on y)

The larger the absolute value of b (+ or -), the stronger the relationship between x and y.

10
Q

def of OLS (ordinary least squares regression)

A

This is the most common method for fitting the regression line.
OLS finds the line that minimizes the total squared distance (errors) between the actual data points and the line.

In simple terms: it makes the line fit the data as closely as possible.

A larger coefficient = a bigger effect: the regression coefficient (b) tells you how much Y changes for a 1-unit increase in X, so a larger coefficient means:
-A steeper slope of the regression line.
-A bigger effect of X on Y.

11
Q

def least squares line

A

= regression line / OLS line
-> It’s the best-fitting straight line through the data.
“Best-fitting” means it minimizes the errors between observed values and predicted values.
-> OLS squares these differences and adds them up:
Sum of Squared Residuals = Σ(Yᵢ − Ŷᵢ)²
The OLS line is the one that makes this sum as small as possible.
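A quick numerical check, on made-up noisy data, that the OLS line really does have a smaller sum of squared residuals than nearby alternative lines:

```python
import numpy as np

# Hypothetical noisy data roughly following y ≈ 2x
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b, a = np.polyfit(x, y, 1)  # OLS slope b and intercept a

def ssr(a_, b_):
    """Sum of squared residuals for the line y = a_ + b_*x."""
    return np.sum((y - (a_ + b_ * x)) ** 2)

# Any other line — e.g. a slightly different slope or intercept — does worse
print(ssr(a, b) < ssr(a, b + 0.1))  # → True
print(ssr(a, b) < ssr(a + 0.5, b))  # → True
```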

12
Q

what is Bivariate linear regression

A

“Bivariate” means it involves two variables: one independent (x) and one dependent (y).

It’s the simplest form of linear regression.

🔍 Example:
If you’re studying how hours studied (x) affect exam scores (y), a regression line can tell you how much score increases per extra hour studied.

see graphs

13
Q

difference between regression, linear regression, OLS and bivariate linear regression

A

*Regression= General term for modeling the relationship between variables
*Linear Regression= A specific type of regression where the relationship is linear (straight line)
*Bivariate Linear Regression= Linear regression with 1 independent and 1 dependent variable
-> use when you want to predict one variable from another, and you think the relationship is linear (with OLS)
*OLS (Ordinary Least Squares)= The method most commonly used to estimate the line of best fit in linear regression

14
Q

what are residuals in OLS?

A

In Ordinary Least Squares (OLS) regression, the goal is to draw a line through the scatterplot that best fits the data.

🔹 What are Residuals?
Residuals measure the difference between the actual values and the values predicted by the regression line:

Residual = yᵢ − ŷᵢ
with
yᵢ: actual value (observed data point)
ŷᵢ: predicted value (from the regression line)
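A short sketch computing residuals on hypothetical data; a handy property is that, with an intercept in the model, OLS residuals sum to (approximately) zero:

```python
import numpy as np

# Hypothetical data
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

b, a = np.polyfit(x, y, 1)   # fit the OLS line
y_hat = a + b * x            # predicted values (points on the line)
residuals = y - y_hat        # actual minus predicted

# Positive and negative residuals balance out around the OLS line
print(abs(residuals.sum()) < 1e-9)  # → True
```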

15
Q

so in a scatter plot, what is the difference between the line, the dots and the residuals?

A

The line = all the expected (predicted) values
The dots = all the actual values from your dataset
The residual = the vertical distance between a dot (actual Y) and the line (predicted Y)

16
Q

def R² (R-squared) in OLS

A

*it is the proportion of the total variation in Y that is explained by X (the predictor): How well your regression model explains the variation in the outcome (Y)
R² tells you how much of the variation in the dependent variable (Y) is explained by your independent variable(s) (X).

*It ranges from 0 to 1 (or 0% to 100%).
R² = 1 → perfect prediction (100% of the variation in Y is explained)
R² = 0 → the model explains none of the variation in Y
-> higher R² (all else equal) means a better fit

*Interpretation Example:
If R² = 0.08, this means:
“8% of the variability in Y is explained by X.”
So, example: 8% of the variation in how important people find migration is explained by the number of terrorist attacks
-> Knowing how many terrorist attacks occurred explains only 8% of why people think migration is important.
->That’s a low R² — meaning the predictor has limited explanatory power. There may be many other factors influencing how people perceive migration (e.g., media coverage, politics…)
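A small check, on invented data, that R² computed from the sums of squares matches r² in the bivariate case:

```python
import numpy as np

# Hypothetical data
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

ss_res = np.sum((y - y_hat) ** 2)      # unexplained variation (residuals)
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation in Y
r2 = 1 - ss_res / ss_tot

# With one predictor, R² equals the squared correlation coefficient
r = np.corrcoef(x, y)[0, 1]
print(abs(r2 - r ** 2) < 1e-9)  # → True
```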

17
Q

errors (ε) in OLS

A

The residuals (errors) should be:
*Randomly scattered above and below the regression line.
*Normally distributed, with constant variance (this is called homoscedasticity).
*This is important for your model to be reliable and valid.

18
Q

how to check whether the main assumptions of OLS hold on a scatter plot diagram

A

After running a regression, you should check a residuals plot to spot issues. Look out for:
*Linearity: do the points follow a straight line?
*Homoscedasticity (equal variance of residuals): if the spread of Y values gets wider or narrower across X, that may suggest heteroscedasticity

Attention: outliers or influential points can distort the regression line.

19
Q

main assumptions of OLS (ordinary least squares)

A

For OLS to be valid, several key assumptions must be met:

*Linearity: The relationship between X and Y is linear.
*Independence: Observations are independent of each other.
*Homoscedasticity: The variance of residuals is constant across all values of X.
*Normality of Errors: The residuals are normally distributed.
*Predictors (X variables) should not be too highly correlated with each other.

20
Q

common problems in OLS regression

A

*Heteroscedasticity: Residuals have unequal variance — leads to biased standard errors.
*Outliers: Can drag the regression line toward them.
*Influential points: Especially problematic if they’re extreme on both X and Y.
*Omitted variables: Important predictors left out can bias your results.
*Reverse causality: X and Y may influence each other.
*Correlation ≠ Causation: regression shows association, not proof of cause. It tells you how much Y changes on average when X changes (the strength and direction of the association).
-Correlation: two variables move together; as one changes, the other tends to change too (ex: height and shoe size).
-Causation: a change in one variable directly produces a change in another; just because two things happen together doesn’t mean one caused the other (ex: a virus causes illness).

21
Q

def spurious correlation

A

relationship in which two or more events or variables are associated but not causally related, due to either coincidence or the presence of a certain third, unseen factor

22
Q

correlation test, what do you look at

A
  1. first (after Stata) look at the coefficient (if positive, positive correlation between the 2 variables…)
  2. p-value (is this correlation statistically significant, and are we confident enough in this result?)
    if p < 0.05, then reject the null hypothesis (there is no correlation between the 2 variables), so there is a correlation!
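Outside Stata, the same two quantities can be read off with SciPy's pearsonr, here on made-up data:

```python
from scipy.stats import pearsonr

# Hypothetical sample: hours studied vs. exam score
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [50, 55, 61, 64, 70, 72, 79, 83]

# Step 1: the coefficient r (sign gives direction, size gives strength)
# Step 2: the p-value (is the correlation statistically significant?)
r, p = pearsonr(hours, scores)
print(round(r, 3))   # strongly positive
print(p < 0.05)      # → True: reject the null of no correlation
```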
23
Q

difference between regression and correlation test

A
  1. Correlation Test: Measures the strength and direction of a linear relationship between two continuous variables (X and Y).
    -Produces a single number — the correlation coefficient (r), ranging from -1 to +1.
    -Used when:
    You want a simple measure of association without implying causation.
    You want to test if two variables move together (are linearly related).
    You don’t necessarily have a “dependent” and “independent” variable — just interested in association.
    You want to check assumptions or explore data before modeling.
  2. Regression Analysis: Goes beyond correlation by:
    -Modeling how Y depends on X (predicting Y from X).
    -Estimating the size and direction of the effect (the slope).
    -Allowing for multiple predictors (in multiple regression).
    -Testing statistical significance of predictors controlling for others.
    Used when:
    You want to explain or predict one variable (Y) based on one or more others (X).
    You want to quantify the effect of predictors on the outcome.
    You want to control for confounding variables.

Example:
You want to see if hours studied and test scores are related? Use correlation to check if there’s a linear association.
You want to predict test scores based on hours studied, maybe controlling for age or class attendance? Use regression.

24
Q

difference between correlation test and chi-square test

A

*correlation test:
You have two continuous (numerical) variables.
You want to check whether they move together in a linear way.

*chi-square test:
You have two categorical variables (e.g., yes/no, male/female, party A/B/C).
You want to test whether the distribution of one variable depends on the other.
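A minimal sketch of a chi-square test on a hypothetical 2x2 contingency table, using SciPy's chi2_contingency:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = group A/B, columns = answered yes/no
table = np.array([[30, 10],
                  [10, 30]])

chi2, p, dof, expected = chi2_contingency(table)
print(dof)       # → 1 degree of freedom for a 2x2 table
print(p < 0.05)  # → True: the answer distribution depends on the group
```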