6. Correlations and Bivariate Regression Flashcards

(16 cards)

1
Q

def correlation

A

a way to measure how two quantitative (number-based) variables are related.
It tells us:

Direction: Do the variables move in the same direction or in opposite directions?

Strength: How closely do they follow a straight-line pattern?

2
Q

def correlation coefficient

A

a number (denoted as r) that shows the strength and direction of the relationship.

r = 1 → Perfect positive correlation (as one variable increases, so does the other).
r = -1 → Perfect negative correlation (as one increases, the other decreases).
r = 0 → No linear correlation (no clear straight-line trend).

🔍 This value is called the Pearson correlation coefficient.
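
A minimal sketch of computing r in Python, using invented example data (hours studied vs. exam scores):

```python
import numpy as np

# Hypothetical data: hours studied (x) and exam scores (y)
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([52, 58, 61, 67, 70, 78])

# np.corrcoef returns the correlation matrix; [0, 1] is r between x and y
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # close to +1: strong positive linear relationship
```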

3
Q

scatter plot def

A

a graph that helps you see the relationship between two variables.
Even if r = 0, there might still be a non-linear relationship, so always plot your data! (see graphs)
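
A quick sketch of this pitfall with invented data: a perfect U-shape has an obvious relationship, yet r comes out near 0:

```python
import numpy as np
import matplotlib.pyplot as plt

# A perfect parabola: a strong relationship, but not a straight-line one
x = np.linspace(-3, 3, 50)
y = x ** 2

print(round(np.corrcoef(x, y)[0, 1], 3))  # approximately 0

plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("r = 0 but clearly related: always plot your data")
plt.show()
```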

4
Q

what about outliers in a correlation?

A

Outliers matter: One unusual data point can skew the correlation.

r = 0 doesn’t mean no relationship — just no linear one.

Correlation ≠ Causation!
Just because two things go up or down together doesn’t mean one causes the other.

Example: Ice cream sales and homicides both rise in summer — but eating ice cream doesn’t cause violence.
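
A small sketch with made-up numbers showing how one outlier can manufacture a strong correlation:

```python
import numpy as np

# Five points with essentially no linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 2.0, 4.0, 3.0])
print(round(np.corrcoef(x, y)[0, 1], 2))  # about -0.14: almost no correlation

# Add one extreme point and a "relationship" appears out of nowhere
x_out = np.append(x, 20.0)
y_out = np.append(y, 25.0)
print(round(np.corrcoef(x_out, y_out)[0, 1], 2))  # about 0.97
```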

5
Q

def regression

A

statistical method used to predict or explain the relationship between two variables.

It finds the line that best fits the data in a scatter plot. This line helps you understand how changes in one variable (x) affect the other (y).

6
Q

regression equation

A

y = a + bx
y: Dependent variable (what you’re trying to predict or explain)

x: Independent variable (the one you think influences y)

b: Slope (shows how much y changes when x increases by 1 unit)

a: Intercept (the value of y when x = 0; where the line crosses the y-axis)
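
A minimal sketch of estimating a and b from invented data (note that np.polyfit returns the slope before the intercept):

```python
import numpy as np

# Hypothetical data: hours studied (x) and exam scores (y)
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([52, 58, 61, 67, 70, 78])

b, a = np.polyfit(x, y, deg=1)  # fit y = a + b*x by least squares
print(f"a (intercept) = {a:.2f}, b (slope) = {b:.2f}")

# Prediction: expected y when x = 4
print(f"predicted y at x = 4: {a + b * 4:.1f}")
```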

7
Q

interpret the slope of a regression

A

If b > 0 → There’s a positive relationship (as x increases, y increases)

If b < 0 → There’s a negative relationship (as x increases, y decreases)

If b = 0 → No linear relationship (x doesn’t help predict y)

The larger the absolute value of b, the more y changes for each one-unit increase in x.

8
Q

def of OLS (ordinary least squares regression)

A

This is the most common method for fitting the regression line.
OLS finds the line that minimizes the total squared distance (errors) between the actual data points and the line.

In simple terms: it makes the line fit the data as closely as possible.

9
Q

what is Bivariate linear regression (OLS)

A

“Bivariate” means it involves two variables: one independent (x) and one dependent (y).

It’s the simplest form of linear regression.

🔍 Example:
If you’re studying how hours studied (x) affect exam scores (y), the regression line tells you how much the score increases per extra hour studied.

(see graphs)
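
A sketch of that example with invented numbers, assuming scipy is available (scipy.stats.linregress fits exactly this bivariate case):

```python
from scipy.stats import linregress

hours = [1, 2, 3, 4, 5, 6]         # x: independent variable
scores = [52, 58, 61, 67, 70, 78]  # y: dependent variable

res = linregress(hours, scores)
# res.slope: points gained per extra hour studied
# res.intercept: predicted score at 0 hours
# res.rvalue: Pearson r between hours and scores
print(res.slope, res.intercept, res.rvalue)
```

One call gives you the slope, intercept, and r at once, which is handy for the bivariate case.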

10
Q

what are residuals in OLS?

A

In Ordinary Least Squares (OLS) regression, the goal is to draw a line through the scatterplot that best fits the data.

🔹 What are Residuals?
Residuals measure the difference between the actual values and the values predicted by the regression line:

Residual = yᵢ − ŷᵢ
with
yᵢ: the actual value (observed data point)
ŷᵢ: the predicted value (from the regression line)
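
A minimal sketch of computing residuals from a fitted line, with invented data:

```python
import numpy as np

# Invented data: hours studied (x) and exam scores (y)
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([52, 58, 61, 67, 70, 78])

b, a = np.polyfit(x, y, 1)  # slope and intercept of the OLS line
y_hat = a + b * x           # ŷᵢ: predicted values
residuals = y - y_hat       # yᵢ − ŷᵢ
print(residuals)
print(round(residuals.sum(), 10))  # OLS residuals sum to ~0
```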

11
Q

what is the least squares principle? (in OLS)

A

OLS finds the line that minimizes the sum of squared residuals:
∑(yᵢ − ŷᵢ)²

This means the regression line is chosen so that, overall, the vertical distances from the points to the line are as small as possible. The residuals are squared so that positive and negative errors don’t cancel out.

That’s why it’s called “least squares” — it minimizes the squared errors.
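
A sketch illustrating the principle with invented data: perturbing the OLS line in any direction only increases the sum of squared residuals:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([52, 58, 61, 67, 70, 78])

def ssr(a, b):
    """Sum of squared residuals for the candidate line y = a + b*x."""
    return float(np.sum((y - (a + b * x)) ** 2))

b_ols, a_ols = np.polyfit(x, y, 1)  # the OLS line
print(ssr(a_ols, b_ols))            # smallest achievable SSR
print(ssr(a_ols + 1.0, b_ols))      # shifting the line increases SSR
print(ssr(a_ols, b_ols + 0.5))      # so does tilting it
```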

12
Q

def R² in OLS

A

(how well the model fits)
R² tells you how much of the variation in the dependent variable (Y) is explained by your independent variable(s) (X).

It ranges from 0 to 1 (or 0% to 100%).

🧠 Interpretation Example:
If R² = 0.08, this means: “8% of the variability in Y is explained by X.”

For example: 8% of the variation in how salient people find migration is explained by the number of terrorist attacks.
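
A minimal sketch of computing R² from its definition, with invented data:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([52, 58, 61, 67, 70, 78])

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

# R² = 1 - unexplained variation / total variation
ss_res = np.sum((y - y_hat) ** 2)     # variation left in the residuals
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation in y
print(round(1 - ss_res / ss_tot, 3))  # share of y's variation explained by x
```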

13
Q

errors (ε) in OLS

A

The residuals (errors) should be:

Randomly scattered above and below the regression line.

Normally distributed, with constant variance (this is called homoscedasticity).

This is important for your model to be reliable and valid.

14
Q

residuals scatterplot diagram (what to check)

A

After running a regression, you should check a residuals plot to spot issues. Look out for:

Curved patterns: suggests a non-linear relationship — OLS may not be appropriate.

Changing variability (fan-shaped spread): suggests heteroscedasticity.

Outliers or influential points: these can distort the regression line.
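
A sketch of building such a residuals plot with invented data (matplotlib assumed):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([52, 58, 61, 67, 70, 78])

b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)

# A healthy plot: a random, even band of points around the zero line
plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("x")
plt.ylabel("residual")
plt.title("Residuals vs. x: look for curves, fans, and outliers")
plt.show()
```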

15
Q

main assumptions of OLS (ordinary least squares)

A

For OLS to be valid, several key assumptions must be met:

Linearity
The relationship between X and Y is linear.

Independence
Observations are independent of each other.

Homoscedasticity
The variance of residuals is constant across all values of X.

Normality of Errors
The residuals are normally distributed.

No Multicollinearity (only applies in multiple regression)
Predictors (X variables) should not be too highly correlated with each other.

16
Q

common problems in OLS regression

A

Non-linearity: Your model might miss curved patterns.

Heteroscedasticity: Residuals have unequal variance — leads to biased standard errors.

Outliers: Can drag the regression line toward them.

Influential points: Especially problematic if they’re extreme on both X and Y.

Omitted variables: Important predictors left out can bias your results.

Reverse causality: X and Y may influence each other.

Correlation ≠ Causation: Regression shows association, not proof of cause.