Week 9 Flashcards

1
Q

Correlation Matrix

A

A correlation matrix is a table showing correlation coefficients between sets of variables. Each random variable (Xi) in the table is correlated with each of the other variables in the table (Xj), which allows you to see which pairs have the highest correlation. Each cell displays the Pearson's r coefficient.

● Limitation: a correlation matrix only shows pairwise associations between variables and ignores all the others. A multivariate regression, by contrast, includes several variables at once and determines the impact of each one on the outcome, given the presence of the other variables, by controlling for them.
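A minimal sketch of computing one, assuming Python with pandas and a small made-up dataset (the variable names below are hypothetical, not from the course):

```python
import pandas as pd

# Hypothetical data: three variables we suspect are related
df = pd.DataFrame({
    "hdi": [0.52, 0.61, 0.70, 0.78, 0.85, 0.91],
    "youth_bulge": [38, 35, 30, 27, 22, 19],
    "conflict_score": [7.1, 6.4, 5.0, 4.2, 3.1, 2.5],
})

# Each cell is Pearson's r for one pair of variables;
# the diagonal is always 1 (each variable correlated with itself).
print(df.corr(method="pearson"))
```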

2
Q

Regression equation explained

A

● Regression equation: Y = a + b1X1 + b2X2 + … + bnXn + e
● Where
○ Y is the outcome we wish to explain (dependent variable)
○ The different X's are the independent variables we think are important
○ a is the value of the predicted outcome when all variables are 0 (the Y-intercept)
○ b1 is the regression coefficient for variable X1 (also called the partial slope coefficient); b2 is the partial slope of X2 on Y, and so on
■ The slope assigned to any given independent variable, holding all the other independent variables constant
○ n is the number of independent variables included in the model
○ e is the prediction error (the error for individual i)
■ The error recognizes that each prediction is likely to carry at least some error, meaning that the combination of variables will miss the actual score of the dependent variable by some margin. In other words, it is the gap between the predicted and observed values: the amount by which an estimate misses its mark.
■ Error is defined as e = Y − Ŷ, where Ŷ is the predicted value, i.e. the value on the regression line for a given set of X values

● So the model is actually a hypothesis:
○ We hypothesize that this exact list of variables is the best way to explain variation in the outcome (Y)
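As an illustration (a sketch using Python with statsmodels and simulated data, not the course's own software), here is how the pieces of the equation map onto a fitted model:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated data for Y = a + b1*X1 + b2*X2 + e
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 2.0 + 0.8 * df["x1"] - 0.5 * df["x2"] + rng.normal(size=100)

X = sm.add_constant(df[["x1", "x2"]])  # adds the intercept term a
model = sm.OLS(df["y"], X).fit()

print(model.params)     # a (const) plus the partial slope coefficients b1, b2
print(model.resid[:5])  # e for the first few cases: observed Y minus predicted Y-hat
```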

3
Q

Explanatory versus control variables

A

● Multivariate regressions allow for the isolation of each individual factor's impact on the DV:
○ Some of the Xs are explanatory variables: they are included because they are potentially important causes of the variation in Y
○ Some of the Xs are control variables: they are there to control for the influence of structural factors that are not the main part of our explanation and may not even be interesting to us, but whose impact on the outcome we want to isolate
■ Regression assesses the effect of each independent variable while holding the values of all the other independent variables in the model constant (see the sketch below)
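A brief sketch of what "controlling for" looks like in practice, assuming Python/statsmodels and invented variable names (education as the explanatory variable, age and income as controls):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated data: an outcome explained by education, controlling for age and income
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "education": rng.integers(8, 20, n),  # years of schooling
    "age": rng.integers(18, 80, n),
    "income": rng.normal(50, 15, n),      # thousands of dollars
})
df["outcome"] = 2 * df["education"] + 0.1 * df["age"] + 0.2 * df["income"] + rng.normal(0, 5, n)

# Explanatory and control variables go into the same model; the regression then
# estimates each coefficient with the other variables held constant.
X = sm.add_constant(df[["education", "age", "income"]])
model = sm.OLS(df["outcome"], X).fit()
print(model.params["education"])  # effect of education, net of age and income
```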

4
Q

Multiple R versus Adjusted R squared

A

1. Multiple R. This is the correlation coefficient: it tells you how strong the linear relationship is. For example, a value of 1 means a perfect positive relationship and a value of 0 means no relationship at all. It is the square root of R-squared (see #2). It is basically Pearson's r.

2. Multiple correlation coefficient (R²):
The R-squared of the regression is the fraction of the variation in your dependent variable that is accounted for (or predicted by) your independent variables. (In a regression with a single independent variable, it is the same as the square of the correlation between your dependent and independent variables.) The R-squared is generally of secondary importance, unless your main concern is using the regression equation to make accurate predictions.
● R² gives us a measure of how good the model as a whole is at explaining variation in the outcome:
○ It is a value between 0 and 1, a PRE (proportional reduction in error) measure: it corresponds to how much (%) of the variation in Y is explained
○ By convention, we often use the same interpretation as for bivariate measures of association: up to 0.30 is weak, 0.31–0.70 is moderate, and 0.71 and over is strong
○ One problem with R² is that it inevitably increases as we add more variables to the model; the adjusted R² compensates for this and is therefore preferable
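A quick sketch of where these numbers live in a fitted model (Python/statsmodels, simulated data; not part of the original card):

```python
import numpy as np
import statsmodels.api as sm

# Simulated data with two predictors
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = 1.0 + 0.6 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=100)

model = sm.OLS(y, sm.add_constant(X)).fit()

r_squared = model.rsquared          # share of variation in Y explained by the model
multiple_r = np.sqrt(r_squared)     # Multiple R = square root of R-squared
adj_r_squared = model.rsquared_adj  # penalized for the number of variables (preferable)
print(multiple_r, r_squared, adj_r_squared)
```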

5
Q

Statistical significance applies both to the entire model and to each individual variable - what are some implications?

A

● For multivariate regressions, we can test for statistical significance both at the model level and for individual coefficients.
● Regression analysis is a progressive process: we try a combination of variables (a model), analyze the results, then modify and improve the model.

● For example, given the absence of significance for b2 and b4, we could remove those variables and run the regression again, to see whether the relationships still hold for X1 (HDI) and X3 (youth bulge)
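A sketch of how the two levels of significance show up in output, assuming Python/statsmodels and simulated data in which two of the four predictors genuinely have no effect:

```python
import numpy as np
import statsmodels.api as sm

# Simulated model with four predictors; X2 and X4 have no real effect
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = 0.5 + 0.7 * X[:, 0] + 0.4 * X[:, 2] + rng.normal(size=200)

model = sm.OLS(y, sm.add_constant(X)).fit()

print(model.f_pvalue)  # significance of the model as a whole (F-test)
print(model.pvalues)   # significance of each individual coefficient (t-tests)
# If the coefficients on X2 and X4 are not significant, we might drop them
# and re-run the regression to see whether the other relationships hold.
```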

6
Q

Interpreting b coefficients and standardized coefficients

A

● As with bivariate regressions, the value of a given b tells us the rate of change in the outcome (DV) when that IV changes by one unit (everything else being equal).
● 1. Regression coefficients (b) tell us the impact of a one-unit increase in each variable X (b1 for X1, and so on) on the dependent variable Y
○ It is not enough to say that a one-unit increase in the IV produces some change in the DV; you also have to say what the unit is ($, km, for example)

● But what about units? It is difficult to compare these coefficients with each other…

β, or b*, tells us the impact of each X in standardized units on the dependent variable Y (b1* for X1, and so on): how much the standardized score of Y changes when the standardized score of each independent variable changes by one unit, controlling for the effects of all the other independent variables.
● For instance, which of the variables in the model has the strongest impact on Y?
○ Solution: standardized coefficients (also called standardized partial slopes or beta-weights), β or b*
○ There is one for each regression coefficient b, standardized for units
○ As with the Z scores from a few weeks ago, these measure the impact on Y in terms of standard deviations
○ Bottom line: they are all in the same units, so we can compare them (see the sketch below)
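A minimal sketch of the difference, assuming Python/statsmodels: the raw b coefficients depend on each variable's units, while z-scoring every variable first yields standardized (beta) coefficients that can be compared with each other:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated predictors in very different units (dollars vs. kilometres)
rng = np.random.default_rng(4)
df = pd.DataFrame({
    "income_dollars": rng.normal(50_000, 10_000, 150),
    "distance_km": rng.normal(12, 4, 150),
})
df["y"] = 3 + 0.0001 * df["income_dollars"] - 0.3 * df["distance_km"] + rng.normal(0, 1, 150)

# Raw b coefficients: change in Y per one-unit change in each X (one dollar, one km)
raw = sm.OLS(df["y"], sm.add_constant(df[["income_dollars", "distance_km"]])).fit()
print(raw.params)

# Standardized (beta) coefficients: z-score every variable, then each coefficient
# is the change in Y (in SDs) per one-SD change in X -- comparable across variables
z = (df - df.mean()) / df.std()
beta = sm.OLS(z["y"], z[["income_dollars", "distance_km"]]).fit()
print(beta.params)
```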

7
Q

Strengths of regression analysis

A

○ Very powerful tool that allows the inclusion of several explanatory factors, as well as the isolation of each of these factors' impact on the outcome variable
○ Probably the most universally used quantitative tool in social science research

8
Q

Limitations of regression analysis

A

○ Temptation to include too many variables: eventually the impact of each will be small just because of the sheer number of causes included in the model
○ Temptation to discuss results where statistical significance is far from conclusive, for instance when we have it at a reasonable level on only a few coefficients
○ Temptation to discuss results where R² is small: R² is rarely over 0.70 for models in the social sciences, but researchers often push it and discuss models with R² < 0.30
○ Several assumptions are difficult to verify (more on this next week):

1) The DV is interval/ratio and the IVs are interval/ratio, dichotomous, or dummy variables
2) All error terms are normally distributed
3) There is linearity between variables, meaning the relationship between the DV and each IV is the same across the range of both variables; for example, the effect of one extra year of schooling is assumed to be the same whether it comes after a high school diploma or after a master's degree
4) There is homoscedasticity: the variance in Y is consistent across values of X. Heteroscedasticity refers to a situation where the variance of Y differs by X value. For example, imagine two students in high school, one averaging 65% and the other 95%; both become interested in the class and study harder, and both grades improve, but the second student cannot raise their grade at the same rate as the first simply because of where they started. In OLS we need the variance to be homoscedastic. A quick visual check is sketched below.
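One common, informal check of assumption 4 is a residuals-versus-fitted plot; a sketch (Python with statsmodels and matplotlib, using simulated data deliberately built to be heteroscedastic):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Simulated data where the noise grows with X -> heteroscedastic on purpose
rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 200)
y = 2 + 1.5 * x + rng.normal(0, 1 + 0.5 * x)

model = sm.OLS(y, sm.add_constant(x)).fit()

# Residuals vs. fitted values: an even band suggests homoscedasticity;
# a funnel shape (spread widening with fitted values) suggests heteroscedasticity.
plt.scatter(model.fittedvalues, model.resid, s=10)
plt.axhline(0, color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```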

9
Q

How do you interpret dummy variables:

A

*** A note on dummy variables: we often measure nominal- and ordinal-level variables, so does this mean we can't use regression? We still can. The distance between response categories can be measured when there are only two response categories (0 = male, 1 = female); these are called dichotomous or dummy variables.

How about when there are more than two response categories? Well, you can cheat: for example, you can break them down into a series of dichotomous variables. This is referred to as creating dummy variables.

  • For example, suppose you ask people which of four categories they belong to: married, single, divorced, or widowed
  • To create a dummy variable for "married", you code everyone who is married as 1 and everyone in any other category (single, divorced, widowed) as 0
  • The same goes for the next category, e.g. "divorced": anyone who is divorced is coded 1 on that dummy, and anyone in any other category is coded 0

How do you interpret dummy variables? It is a little complicated: the coefficient on a dummy variable gives the effect of being in the "1" category compared to the reference category (whatever group was left out). For example, for the effect of career (IV) on family size (DV), we look at farmers (farmer: yes = 1, no = 0) and get a coefficient of 0.405; this means that farmers are, on average, expected to have 0.405 more children than non-farmers (the reference group).
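A small sketch of creating dummies from a multi-category variable, assuming Python with pandas and a made-up marital-status column:

```python
import pandas as pd

# Hypothetical marital-status variable with four categories
df = pd.DataFrame({
    "marital": ["married", "single", "divorced", "widowed", "married", "single"],
})

# One dummy per category, coded 1 if the person is in that category and 0 otherwise.
# drop_first=True leaves one category out to act as the reference group.
dummies = pd.get_dummies(df["marital"], prefix="marital", drop_first=True)
print(pd.concat([df, dummies], axis=1))
```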

10
Q

OLS versus logistic regression

A

● The key differences between logistic and OLS regression are the assumptions about the dependent variable and about the relationship between the independent variables and the dependent variable. Instead of predicting the score of Y (an observed variable), logistic regression predicts the probability of the occurrence of Y versus the non-occurrence of Y. Logistic regression is therefore the appropriate tool when the dependent variable is dichotomous.
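A sketch of the contrast in code (Python/statsmodels, simulated data): with a 0/1 outcome we fit a Logit model and get predicted probabilities rather than predicted scores:

```python
import numpy as np
import statsmodels.api as sm

# Simulated dichotomous outcome (0/1), e.g. an event occurring or not
rng = np.random.default_rng(6)
X = rng.normal(size=(300, 2))
p = 1 / (1 + np.exp(-(0.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1])))
y = rng.binomial(1, p)

# Logistic regression models the probability that Y occurs (Y = 1)
# rather than predicting an observed score of Y directly
logit = sm.Logit(y, sm.add_constant(X)).fit()
print(logit.params)         # coefficients on the log-odds scale
print(logit.predict()[:5])  # predicted probabilities for the first few cases
```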
