Week 11 Flashcards

1
Q

Regression diagnostics: basic overview

A
  1. All variables must be interval/ratio or dichotomous, and all X and Y variables must be correctly measured; everything we discussed about validity and reliability also applies here
  2. The regression model must be specified correctly (i.e., the model is theoretically sound and reflects the form of the relationship between the variables)
  3. Multicollinearity between IVs cannot be too high
  4. Homoscedasticity: the variance in Y must be consistent across all values of X (the spread of the errors above and below the regression line is uniform for all values of X)
  5. There must be no autocorrelation among the error terms: the error associated with one observation is not associated with the errors of any other observation
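
Most of these checks start from a fitted OLS model. A minimal sketch in Python with statsmodels (the dataset and variable names are hypothetical), showing the object the later diagnostics are run on:

```python
# Minimal sketch (Python / statsmodels); the data file and variable names are
# hypothetical, used only to show where the later diagnostic checks attach.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("survey.csv")                    # hypothetical dataset
X = sm.add_constant(df[["income", "education"]])  # IVs plus intercept
y = df["spending"]                                # DV

model = sm.OLS(y, X, missing="drop").fit()
print(model.summary())                            # coefficients, standard errors, R-squared
# The residuals (model.resid) and fitted values (model.fittedvalues) are the
# raw material for the homoscedasticity and autocorrelation checks below.
```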
2
Q

Dichotomization

A

● When we have more than two categories (e.g., religion), however, we cannot include the variable directly
○ Solution: transform each of the possible values into an individual dummy variable, in a process called dichotomization

For example, if we have a RELIGION variable, we could code it as:
● 1 for Buddhist – 2 for Muslim – 3 for Christian – 4 for Jewish – 5 for other
● Then we dichotomize this variable:
● Each possible value becomes an X: in our case, X1 to X5
● Each X is a dummy variable for a possible religion: X1 = 1 when the respondent is Buddhist, and X1 = 0 when the respondent is not, and so on
● If a person is Muslim, then X2 = 1 for them; respondents in every other category have X2 = 0
● Mutual exclusivity is met
● Redundancy: for a variable with N possible values, we only need N-1 dummy variables; if we know someone is not Buddhist, Muslim, Christian, or Jewish, we know they fall under "other", so we do not actually need X5
The coefficient b1 can then be interpreted as the impact on Y of being Buddhist (relative to the omitted reference category), the coefficient b2 as the impact on Y of being Muslim, etc.
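
A small sketch of this dichotomization in Python with pandas; the RELIGION coding follows the example above, and the DataFrame itself is made up:

```python
import pandas as pd

# Hypothetical data using the coding above: 1=Buddhist ... 5=Other
df = pd.DataFrame({"religion": [1, 2, 3, 4, 5, 2, 1]})
labels = {1: "buddhist", 2: "muslim", 3: "christian", 4: "jewish", 5: "other"}
df["religion"] = df["religion"].map(labels)

# drop_first=True keeps N-1 dummies; the omitted category becomes the reference
dummies = pd.get_dummies(df["religion"], prefix="rel", drop_first=True)
print(dummies.head())
```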

What about ordinal variables? Their use has been somewhat controversial, because the exact distance between the response options is not known, as it is with interval/ratio variables. Some researchers nonetheless include ordinal variables and treat them as interval/ratio for the purposes of the regression. To do this, they clearly lay out the options to respondents on the survey, and they resort to this only out of necessity.

● So far we have considered only ordinary least squares (OLS) regressions:
○ OLS supposes a linear relationship within the model: the same variation in the outcome (DV) is assumed across all ranges of values. Here we are talking about the outcome, whereas above we were talking about dichotomous IVs. OLS assumes the DV can be fit with a linear line of best fit, but this is not always the case, especially if the outcome can only be 0 or 1; the line clearly cannot pass through all the observations. Logistic regression instead predicts the chance (odds) of one possibility occurring over the other (see the sketch below)
■ If the outcome of interest can only take two values (onset/not of conflict, approval/not of a negotiation position, etc.), then the linearity assumption does not hold
■ When the outcome runs only from 0 to 1, there is no continuous scale of outcomes of the kind OLS assumes
○ One can instead think of a threshold below which the outcome is 0 and beyond which it is 1
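
A minimal sketch of such a logistic regression in Python with statsmodels; the dataset and variable names (conflict onset and two IVs) are hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("conflict.csv")                     # hypothetical dataset
X = sm.add_constant(df[["gdp_growth", "ethnic_frac"]])
y = df["conflict_onset"]                             # coded 0/1

logit = sm.Logit(y, X, missing="drop").fit()
print(logit.summary())
# Exponentiating a coefficient gives the odds ratio for a one-unit change in that IV
print(np.exp(logit.params))
```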

3
Q

The specification problem

A

Assumption #2 states that the model must be correctly specified (the variables included are a reasonably 'valid' set of causes of variation): the assumption is that the variables included in the model are reasonable explanations of the outcome in the DV
● Theoretical soundness: enough variables… but not too many. Too many variables dilute the results so that they are no longer meaningful; around 5-8 variables is the guideline, and this applies to control variables too
● Linear vs. non-linear relationships: in some scatterplots, relationships are clearly not linear, and it may even appear that there is no relationship because a straight line is not appropriate. However, researchers can apply transformations, such as taking the log of X or squaring X, that make the relationship approximately linear (see the sketch below), and the interpretation then proceeds as usual
○ The linearity issue has to do with functional form: if the functional form is incorrect, both the coefficients and the standard errors in your output are unreliable
○ Violating this can result in our regression slopes (b coefficients) being wrong (too steep, too flat, or even in the wrong direction)
OLS draws a “line of best fit” among the data points: if the points are not distributed in a linear pattern, then OLS will not be a good approximation of the pattern.
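
A sketch of the transformation idea from above (logging an IV before fitting OLS), in Python with hypothetical variable names:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("data.csv").dropna()     # hypothetical dataset
df["log_income"] = np.log(df["income"])   # log transform to approximate linearity

X = sm.add_constant(df[["log_income"]])
model = sm.OLS(df["spending"], X).fit()
print(model.summary())
# A scatterplot of spending vs. log_income should now look closer to linear
# than spending vs. raw income if the original relationship was curvilinear.
```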

4
Q

The problem of high multicollinearity

A

Multicollinearity arises when two (or more) of the independent variables in the model have a near-perfect relationship with one another: a relationship between two or more independent variables in a regression. Anything at or above 0.5 between two IVs on the correlation matrix requires further examination.

● The purpose of the regression model is to explain variation in an outcome
○ Using two IVs (potential causes for this variation) in the model which vary in a similar fashion represents "redundant" information
○ The more highly correlated the two IVs, the more overlap in the information that these variables contain about the behaviour of the DV
○ The regression technique cannot distinguish one from the other
● Perfect collinearity exists when one of the variables in a regression is perfectly related to one or more of the other independent variables
● Even when independent variables aren’t perfectly related, strong relationships between independent variables (multicollinearity) can cause problems in a regression
● The standard errors of the regression coefficients are larger when there is a lot of multicollinearity
● We can perform tests to see if multicollinearity is a problem (the tolerance statistic and the variance inflation factor)
● Tolerance is the share of an IV's variation that is not explained by the other IVs; its inverse, the variance inflation factor (VIF), indicates how much larger a coefficient's variance is than it would be if that particular variable were uncorrelated with the other independent variables in the equation

Technically, this is not a problem with our theory or our model!

How to detect it:
● Correlation matrix: look for very high r coefficients for pairs of IVs
● Variance inflation factor (VIF) test for each variable: detects very high correlation between a given IV and a combination of some of the other IVs; it is based on the multiple R-squared from regressing that IV on the other IVs. Typically a score of 5 or higher is deemed too high (multicollinearity is present), and anything around 4 warrants further examination

How to resolve it or plan ahead for it:
Technique #1: increase sample size
Technique #2: drop some IVs
● In the correlation matrix, if the correlation coefficient for a given pair of IVs is higher than 0.95, we should probably drop one of them from our model
● For more complex multicollinearity, run the VIF test and then drop IVs with VIF > 4
Technique #3: combine the problematic IVs in a composite measure

Tolerance (tol)
• A tolerance score of 1 indicates the independent variable is not at all correlated with other independent variables in the regression
• A tolerance score of zero indicates that the independent variable is perfectly correlated with one, or a combination of, the other independent variables in the regression.
• Tolerance scores above 0.20 are generally acceptable; tolerance scores below 0.20 suggest the regression model might need to be changed
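
A sketch of computing the correlation matrix, VIF and tolerance for each IV in Python with statsmodels; the dataset and IV names are hypothetical, and the cut-offs follow the notes above:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("data.csv")                      # hypothetical dataset
ivs = df[["income", "education", "age"]].dropna()

print(ivs.corr())                                 # correlation matrix: flag |r| >= 0.5

X = sm.add_constant(ivs)
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    tol = 1.0 / vif                               # tolerance is the inverse of VIF
    print(f"{name}: VIF={vif:.2f}, tolerance={tol:.2f}")
# Per the notes: VIF > 4-5 or tolerance < 0.20 suggests problematic multicollinearity.
```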

5
Q

The problem of heteroscedasticity

A

Another assumption for OLS regressions to work is that the variance of the population error terms is constant across the whole regression line, i.e. that the standard deviation of the estimation error is roughly similar for each value of X. Heteroscedasticity is unequal error variance: our regression line fits some kinds of cases better than others.

This quality is known as homoscedasticity

Example: examining income as a potential explanatory factor for the money that people spend on education
● Potential linear relationship: people with less income spend less on education, while people with more income spend more
● But: the variability in spending among people with low income is also going to be low
● For people with low income there is a cap on how much they can spend, whereas those with high income show much higher variability; OLS regression assumes that the variance in the population error terms is constant, which is clearly not the case here, since higher incomes come with higher variability and a larger error term
● First step to detect it: a scatterplot

● Looking at the graph, we see the ‘funneling out’ around the regression line.
● Several more advanced techniques to resolve (weighted least squares in particular)
● Again, the standard errors in your output are unreliable because the error variance differs depending on where you are on the plot; the software sometimes inflates the error term so that you are less confident in your line of best fit
● It is not simply that observations are far from the regression line; it is that the variation is systematic: as the regression line continues, observations move further and further away from it
● This can lead to Type I and Type II errors
● The key limitation is that the precision of the coefficients can be questioned when there is heteroscedasticity: the coefficient claims a single rate of change, but how well that rate fits varies depending on where you are on the graph / which observation you look at
Heteroscedasticity can thus skew the OLS results, and we will not know exactly how, making the results less reliable
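
A sketch of the first-step scatterplot check in Python, plus the Breusch-Pagan test (a common formal test not named in the notes); the income/education-spending variables are hypothetical:

```python
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

df = pd.read_csv("data.csv").dropna()              # hypothetical dataset
X = sm.add_constant(df[["income"]])
model = sm.OLS(df["edu_spending"], X).fit()

# Visual check: residuals vs. fitted values; a funnel shape suggests heteroscedasticity
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Breusch-Pagan test: a small p-value indicates heteroscedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan p-value:", lm_pvalue)
```

If heteroscedasticity is found, weighted least squares (mentioned above) is one of the more advanced remedies.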

6
Q

The problem of autocorrelation

A

Frequent problem arising especially with time series data:
● The error from one observation is related (auto-correlated) to the error of another

Essentially, the assumption of no autocorrelation means that there must be no systematic relationship among the error terms of individual cases (the amount by which our model is off in its prediction for a given case). Even if we know one period's error, it should tell us nothing about, and be independent of, any other period's error.

Suppose we have a model that evaluates a developing economy’s standing, and that we run a regression with data from 1980 to 2016:
● Suppose that an unexpected financial or economic crisis occurs in that economy
● The standing of the country's economy will be much lower than our linear model would have predicted (because the crisis was unexpected); some prediction error is normal, but the model will predict some years much worse than others when something unexpected happens
● This fall in standing will likely spread over several time periods (e.g., years)
● In other words, there would be a 'momentum' to the errors, a pattern
● (A chart of the error residuals was shown here)

Autocorrelation: errors follow a certain pattern

Certain types of pattern are commonly observed; the issue is not whether the sign is positive or negative but whether there is a pattern: if there is a pattern in the errors, then our assumption of no autocorrelation is violated
● One is that one period's error tends to have the same sign as the previous period's error (positive autocorrelation)
● Similarly, another is that one period's error tends to have the opposite sign of the previous period's error (negative autocorrelation)
● Visual analysis of the residuals (the errors for the data points used to run the regression): we can plot the residuals over time and see if there is a visible pattern (with Excel, SPSS, R or another software, for instance); see the sketch below

How to solve it if we detect it:
● This is actually a very difficult problem with small samples that contain time-series data for one case (such as the ones you are using in the country report)
● Simple tactics to consider are to use growth rates for the dependent variable instead of absolute values (for instance, GDP growth instead of GDP)
● More advanced procedures are often necessary (the Cochrane-Orcutt procedure, for instance)
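
A sketch of the visual residual check in Python, plus the Durbin-Watson statistic (a standard numerical check not named in the notes); the yearly data and variable names are hypothetical:

```python
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

df = pd.read_csv("economy.csv").dropna()        # hypothetical yearly data, 1980-2016
X = sm.add_constant(df[["investment"]])
model = sm.OLS(df["gdp_growth"], X).fit()

# Visual check: plot residuals over time and look for 'momentum' (a pattern)
plt.plot(df["year"], model.resid, marker="o")
plt.axhline(0)
plt.xlabel("Year")
plt.ylabel("Residual")
plt.show()

# Durbin-Watson: roughly 2 = no autocorrelation, toward 0 = positive, toward 4 = negative
print("Durbin-Watson:", durbin_watson(model.resid))
```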

7
Q

Handling missing data: before deleting

A

Before deleting, we must understand why the data is missing:
○ The overall representativeness of the sample
■ Just as with surveys, when you remove non-respondents you may be excluding a group of non-respondents with a shared bias, which would change your results. Look for patterns among the non-respondents, for example with a t-test (see the sketch below): there could be more missing data at particular values of an IV, or at particular points in the DV; alternatively, the missing data can be entirely random, in which case it is simply captured in the error term
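
One way to run that kind of t-test check, sketched in Python with hypothetical variable names: split the sample by whether a value is missing and test whether the two groups differ on another variable:

```python
import pandas as pd
from scipy.stats import ttest_ind

df = pd.read_csv("survey.csv")                 # hypothetical survey data

# Does income differ between respondents who did and did not answer the DV item?
missing_dv = df["attitude_score"].isna()
group_missing = df.loc[missing_dv, "income"].dropna()
group_present = df.loc[~missing_dv, "income"].dropna()

t_stat, p_value = ttest_ind(group_missing, group_present, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A small p-value suggests the missingness is not random with respect to income,
# i.e. deleting those cases could bias the sample.
```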

8
Q

Handling missing data: imputation versus census

A

Imputation and samples vs. census
■ When data is missing, researchers sometimes use imputation: the assignment of a value to something by inference from the values of the products or processes to which it contributes. But if you introduce a technique for making up your data, you are introducing, and potentially magnifying, a pattern of your own creation, so there will be errors in this
● Best-guess imputation: the researcher treats missing values in a quasi-subjective manner, using knowledge obtained from other variables, and assigns a best guess of what the value would have been if it were not missing
● Mean substitution: this replaces the missing data with the mean for interval/ratio data or with the mode for ordinal/nominal data (see the sketch after this list)
■ With a census, where the whole population is included, this issue is less prevalent
○ When doing imputation:
■ Is it worth it? If none of the worries above are present, it is better to delete the entry: if there are no patterns, just remove the data. It is really a matter of balancing the cost of deleting the data against the skewing that imputation introduces
■ The question to ask is: what is the imputation strategy that results in the least potential skew of the overall data?
■ Here, increasing the sample size only helps up to a point
● It can help if your sample is small and you have missing data
● But in most cases, if you have a small sample there are other considerations at play
● Increasing the sample size can also cause other problems: if there is a specific pattern among the non-respondents and you increase your sample size, you will reproduce what caused the problem in the first place, because the same non-respondent pattern will show up again
● What about when you want 1,500 respondents but only get 500, and there may be a pattern among the 1,000 people who did not participate that you have no way of discovering?
○ When designing the survey and its questions, you must facilitate the response rate you are aiming for, so that you have as little missing data as possible and the problem of patterns among non-respondents skewing the results is mitigated
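
A sketch of the mean/mode substitution described above, in Python with pandas; the variable names are hypothetical:

```python
import pandas as pd

df = pd.read_csv("survey.csv")                    # hypothetical survey data

# Mean substitution for an interval/ratio variable
df["income"] = df["income"].fillna(df["income"].mean())

# Mode substitution for a nominal/ordinal variable
df["religion"] = df["religion"].fillna(df["religion"].mode()[0])
# Caveat from the notes: any imputation can introduce or magnify patterns,
# so weigh this against simply deleting the incomplete cases.
```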

9
Q

Terms: Outliers and Influencers

A

• Outliers: an observation that lies outside the overall pattern of the other observations
• Influence: points that, when removed, would markedly change the result of your calculations
o Cook's Distance test determines whether removing an observation would change the coefficients. The further a Cook's D value is above 1, the stronger the case for deleting that particular observation
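
A sketch of computing Cook's D for every observation in Python with statsmodels; the dataset and model are hypothetical, and the greater-than-1 rule of thumb follows the notes:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("data.csv").dropna()             # hypothetical dataset
X = sm.add_constant(df[["income", "education"]])
model = sm.OLS(df["spending"], X).fit()

cooks_d, _ = model.get_influence().cooks_distance  # one value per observation
flagged = pd.Series(cooks_d, index=df.index)
print(flagged[flagged > 1])                        # candidates for removal per the >1 rule
```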
