Week 5 Flashcards

1
Q
Different causal relationships:
Direct causation
Reverse causation
Partial causation
Spurious causation
Time
Coincidence
Tautology
A

● Suppose you have evidence (after you have collected data in a research project) that clearly shows a pattern: variable A varies along with B.
● The correlation between A and B could be a result of a host of different reasons:
○ Direct causation: A and only A causes B
■ Note: even if we prove this ideal case, it does not mean that we understand how the causal relationship operates (the mechanism)
○ Reverse causation: A and B vary together, but it’s actually B that causes A
■ The more firemen fighting a fire, the bigger the fire is observed to be; in fact, the larger the fire, the more firemen are needed
○ Partial causation: A causes B, but only because of the presence of something else C. So it’s A + C that is causing B
■ A healthy diet decreases your chances of getting cancer, but is it only the healthy diet? Probably not: a healthy diet usually goes along with being more fit and with other variables that also affect your chances of getting cancer.
○ Spurious causation (confounding variables are present): A and B are both caused by a third, unidentified factor of C
■ A high grade in this course (A) correlates with higher grades in later courses (B), but consider a good, hard-working student (C). A hard-working student will tend to get good grades in later courses regardless of the grade in this course. Therefore, C can cause both A and B.
○ Time: the passage of time causes both A and B to vary independently
■ Tongue-in-cheek example: the claim that global warming and the increased number of earthquakes and other natural disasters are a direct effect of the shrinking number of pirates since the early 1800s.
■ The passage of time drives both A and B
■ When designing research take into consideration the effect of time
○ Coincidence: apparent relationship just due to random variation
■ Lincoln and Kennedy assassinations
■ Cherry-picking data that makes it look like you have an argument but you don’t
■ When it seems that variables are connected but there is a chance of coincidence, you need a theoretical framework that offers a coherent explanation
○ Tautology: A and B are actually the same variable (A/B measure the same concept)
■ Level of economic development and quality of judicial institutions: which comes first? Maybe you’re measuring the overall development of the society you’re looking at.
■ 100% of people who drink water die: if you’re alive you drink water; if you’re dead you can’t drink water
● These are all possibilities: we would like to eliminate as many of them as possible, in order to have more confidence in a hypothesis about the causal relationship
○ Ideally, we need a methodological solution (design) that allows us to isolate the impact of A on B, taking into account all of these other possible situations (the presence of C, time, etc.)
○ If, for instance, taking C into account (“controlling for” C) makes the original correlation disappear, then there wasn’t really a causal relationship between A and B.
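
Controlling for C can be illustrated with a small simulation. This is only a sketch with made-up data: A and B are both generated from a hypothetical common cause C plus noise, and the helper names (`pearson`, `partial_corr`) are assumptions for illustration, not part of the course material. The partial correlation is the correlation between A and B after C has been taken into account.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation between x and y after controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Simulated (invented) data: C causes both A and B; A never influences B.
rng = random.Random(0)
n = 5000
c = [rng.gauss(0, 1) for _ in range(n)]
a = [ci + rng.gauss(0, 1) for ci in c]
b = [ci + rng.gauss(0, 1) for ci in c]

r_ab = pearson(a, b)                 # clearly positive: spurious correlation
r_ab_given_c = partial_corr(a, b, c) # close to zero once C is controlled for

print(f"correlation of A and B:           {r_ab:.2f}")
print(f"correlation of A and B given C:   {r_ab_given_c:.2f}")
```

In this setup the raw correlation is substantial while the partial correlation is near zero, which is the "correlation disappears" pattern described in the notes.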

2
Q

Hypothesis testing and statistical significance

A

● When testing hypotheses about this reality, we begin with two questions:
○ Is there a relationship between two variables in the population?
■ For instance, can we claim that people with a higher level of education are generally more satisfied with the job they have?
■ Very typically we term the two variables the independent and the dependent variable. Notice that these labels are assigned by the researcher, not by the data: I choose them because I have a certain logic in mind
○ Could we determine this only by looking at a sample?
● Given that we rarely have information on the entire population of cases, hypothesis testing often begins with the observation that there is a trend, or relationship, or something interestingly odd in our sample.
○ Is this proof that one variable is linked to another (correlation)?
○ Is this what we would expect (from theory, past experience, other cases, etc.)?
● Tests of statistical significance are all intended to determine whether or not the data in the sample is indicative of a larger trend in a given population.
○ For example: is there a difference between the average income of adult men versus the rest of the population? Suppose we have data on:
■ The entire population - mean income for everyone in the country
■ A representative sample of 100 adult men, with a mean income that is different from the population’s
● There are at least two possible explanations:
○ The sample mean is not statistically significantly different from the population mean, i.e., the difference is trivial
○ The difference between the sample and the population is statistically significant: income for adult men is different from that of the whole population

3
Q

Null and research hypotheses

A

● We have two competing hypotheses
○ Null hypothesis: the difference is caused by random sampling error, and is not due to real differences between the two groups
■ The null always states that there is no statistically significant difference
○ The alternative hypothesis: the difference is “real” (i.e., the trend in the sample is also true in the population)
○ We can never infer with 100% confidence; there is always a chance that we are wrong. So rather than proving the relationship, we try to disprove the null hypothesis: if we can determine that it is reasonable to reject the null hypothesis, then we have support for the alternative hypothesis
● Results: suppose the results show the mean for the population is 4,000
○ The mean of the sample of men is 4,500, the standard deviation is 500 and the sample size is 100
○ So there is an observable difference between the parameter (4,000) and the statistic in the sample (4,500): it seems that adult men do have a higher income on average. Is this difference real, or is it caused by other factors (such as random sampling error)?
● If you can reject the null hypothesis, then you can say the difference is statistically significant. This is a very specific term that does not mean “important”: it means the difference we were seeing in the sample was not due to random sampling error. We can therefore conclude that adult men have a different income on average than the whole population
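
As a rough check on the income example, the test statistic can be computed from the figures given (population mean 4,000; sample mean 4,500; standard deviation 500; N = 100). This sketch assumes the standard error is estimated as SD divided by the square root of N; the variable names are illustrative only.

```python
import math

pop_mean = 4000.0     # parameter: mean income of the whole population
sample_mean = 4500.0  # statistic: mean income in the sample of men
sample_sd = 500.0     # standard deviation in the sample
n = 100               # sample size

# Standard error of the mean (assumed here to be SD / sqrt(N)).
standard_error = sample_sd / math.sqrt(n)

# How many standard errors the sample mean lies from the population mean.
test_stat = (sample_mean - pop_mean) / standard_error

print(f"standard error = {standard_error}")
print(f"test statistic = {test_stat}")
```

A statistic this far from zero lies well beyond the usual cutoffs of about 2, so under these assumptions the null hypothesis would be rejected.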

4
Q

5-step hypothesis testing model

A
  1. Make assumptions and meet test requirements
  2. State the null hypothesis and research hypothesis
  3. Select the sampling distribution and establish the critical region (i.e., the criteria to pass/fail the test)
  4. Compute the test statistic
  5. Make a decision and interpret results

Step 1: Assumptions and requirements

  • For one sample hypothesis testing with t-tests:
  • Sample must be randomly selected from the defined population
  • The sample is selected so that it is representative of a subgroup of the whole population, one with a specific characteristic
  • Level of measurement of the dependent variable must be interval/ratio
  • Level of measurement for the independent variable must be dichotomous (only two possible values), i.e., a nominal distinction between two categories
  • Sampling distribution of means must be normal in shape

Step 2: State hypotheses

  • Null: there is “no difference”, e.g., if we are comparing men’s income to the population’s, the mean of the population from which the sample comes is equal to the population mean we are comparing it with
  • Research hypothesis: there is a difference - perhaps focus on direction of difference, smaller or larger

Step 3: sampling distribution and critical region

  • Which distribution should we use?
  • The z-distribution if N is larger than 120 (normal curve appendix A)
  • T-distribution if N is 120 or smaller (appendix B)
  • Select the critical region:
  • Critical region is the area under the curve which includes all the unlikely sample outcomes if the null hypothesis were to be true - the region where you can reject the null hypothesis
  • Typically a 0.05, or 5%, alpha level (the proportion of area under the curve which falls in the critical region): we are 95% confident, leaving 5% of the area in the critical region
  • If we want a 99% confidence level, alpha would be 0.01 and 1% of the area falls in the critical region
  • The critical z or t score is the score corresponding to the selected alpha level
  • I.e., it is the score that corresponds to the threshold outside of which we fall into the critical region
  • E.g., for an alpha of 0.05 the critical z score would be ±1.96 for a two-tailed test
  • If instead we had a smaller sample of about 75 cases, then the critical t score would be ±1.993 for an alpha of 0.05
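
The critical z scores above can be reproduced without a printed table. A minimal sketch using Python's standard library; the helper name `critical_z` is an assumption for illustration. It covers only the z-distribution (the t-distribution is not in the standard library).

```python
from statistics import NormalDist

def critical_z(alpha, two_tailed=True):
    """Return the z score that cuts off the critical region for a given alpha."""
    # A two-tailed test splits alpha between both tails of the curve.
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)

print(f"alpha 0.05, two-tailed: {critical_z(0.05):.3f}")
print(f"alpha 0.05, one-tailed: {critical_z(0.05, two_tailed=False):.3f}")
print(f"alpha 0.01, two-tailed: {critical_z(0.01):.3f}")
```

The one-tailed value at alpha 0.05 comes out near 1.645, which printed tables commonly round to 1.65.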

Step 4: compute the test statistic

  • The test statistic is the z or t score of the sample outcome we are interested in
  • This score is referred to as z or t obtained

Step 5: decision and interpretation

  • If z obtained falls in the critical region, reject the null hypothesis
  • I.e., if the z obtained is beyond ±1.96 (two-tailed, alpha of 0.05), you can reject the null hypothesis
  • A larger standard error will result in a lower z figure
  • When you can reject the null hypothesis, you are able to state that it is reasonable to argue that there is a real difference (e.g., a real relationship between gender and age); we do not necessarily know why this correlation exists, only that it does exist
5
Q

Example of 5-step model in practice

A
  • The critical z to reject the null hypothesis (at an alpha of 0.05) is ±1.96 for a two-tailed test but 1.65 for a one-tailed test
  • Example: a sample of 152 felony cases tried in a local court has a mean sentence of 27.3 months. Is this significantly different from the average term for all felons across the nation, which is 28.7 months?
  • Population mean = 28.7 months
  • Sample:
  • Mean = 27.3 months
  • SD = 3.7 months
  • N = 152
  • Step 1: assumption and distribution
  • Random sample
  • Interval data for dependent
  • Dichotomous for the independent: whether they were tried in local court or not
  • Step 2: state hypotheses
  • Null: there is no difference between the local mean and the national mean
  • Alternative: there is a difference, and of a specific type: the national mean is larger than the local mean
  • We need a one-tailed test, because the critical region is only on one side of the mean; we do not even consider the region on the other side
  • Step 3: distribution and critical region
  • Sample is large enough for z distribution
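  • Alpha is 0.05 (confidence level of 95%), one-tailed because the hypothesis has one direction
  • The critical z score is 1.65 (-1.65 here, since the local mean is expected to be lower), which corresponds to an area of 0.45 between the mean and z on one side: 0.50 + 0.45 = 0.95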
  • Step 4: compute the statistic
  • The z obtained is -4.67 (4.67 in absolute value), which is beyond the critical 1.65
  • Step 5:
  • At the 95% confidence level: a difference this large would turn up in fewer than 5% of random samples if the null hypothesis were true
  • We reject the null hypothesis
  • The difference cannot be attributed to sampling error
  • The local court is handing down lower sentences for the same crime
  • Now what?
  • Does the 1.4-month difference between the two means really matter?
  • All the test does is tell us about statistical significance, but we need to consider theory, policy relevance and other data to reach a conclusion
  • If z obtained had not been in the critical region, we would have failed to reject the null hypothesis: no statistically significant difference between the local court and the national average
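
The felony-sentence example can be checked numerically. This sketch assumes the standard error is computed as SD divided by the square root of N and that alpha is 0.05, one-tailed, with the critical region in the lower tail; the variable names are illustrative. The result should land close to the 4.67 in the notes, with small differences coming from rounding the standard error.

```python
import math
from statistics import NormalDist

pop_mean = 28.7     # national average sentence, in months
sample_mean = 27.3  # local court sample mean, in months
sample_sd = 3.7     # sample standard deviation, in months
n = 152             # number of cases in the sample

# Step 4: z obtained = (sample mean - population mean) / standard error.
standard_error = sample_sd / math.sqrt(n)
z_obtained = (sample_mean - pop_mean) / standard_error

# Step 3: one-tailed critical value in the lower tail at alpha = 0.05.
z_critical = -NormalDist().inv_cdf(0.95)

# Step 5: reject the null if z obtained falls in the critical region.
reject_null = z_obtained < z_critical

print(f"z obtained = {z_obtained:.2f}")
print(f"z critical = {z_critical:.2f}")
print(f"reject null hypothesis: {reject_null}")
```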
6
Q

p-value - type I and type II errors

A

Making the decision with the p-value:

  • Based on the obtained z or t score from one sample mean
  • The p-value is the probability of obtaining a difference between the means at least as large as the one observed if the null hypothesis is true
  • A p-value less than 0.05 (the alpha level) lets you reject the null hypothesis
  • Therefore, comparing the p-value with alpha and comparing the obtained z or t score with the critical z or t score are, in practice, the same test
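
To see why the two decision rules agree, here is a small sketch that converts a z obtained into a p-value using the normal curve. The helper name `p_value` and the sample z scores are assumptions for illustration; it covers z scores only, not t scores.

```python
from statistics import NormalDist

def p_value(z, two_tailed=True):
    """Probability of a z score at least this extreme if the null is true."""
    tail_area = 1 - NormalDist().cdf(abs(z))
    return 2 * tail_area if two_tailed else tail_area

alpha = 0.05
z_critical = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed cutoff

for z in (1.0, 1.5, 1.96, 2.5):
    p = p_value(z)
    by_p = p < alpha              # decision using the p-value
    by_z = abs(z) > z_critical    # decision using the critical score
    print(f"z = {z:4.2f}  p = {p:.4f}  reject by p: {by_p}  reject by z: {by_z}")
```

For each z score the two columns give the same decision, since the p-value and the critical score are two readings of the same area under the curve.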

Type I error: we reject the null hypothesis when it is actually true
Type II error: we fail to reject the null hypothesis when it is actually false
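
A Type I error rate can be illustrated by simulation. In this sketch (all settings are assumptions chosen for illustration) the null hypothesis is true by construction: every sample is drawn from a population with mean 0 and known standard deviation 1. Any rejection is therefore a Type I error, and the share of rejections should sit near alpha.

```python
import math
import random
from statistics import NormalDist, fmean

def type_i_error_rate(trials=2000, n=30, alpha=0.05, seed=1):
    """Share of samples that wrongly reject a null hypothesis that is true."""
    rng = random.Random(seed)
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed cutoff
    standard_error = 1.0 / math.sqrt(n)               # population SD is 1
    rejections = 0
    for _ in range(trials):
        # Draw a sample from the null population (mean 0, SD 1).
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z_obtained = (fmean(sample) - 0.0) / standard_error
        if abs(z_obtained) > z_critical:
            rejections += 1  # rejected a true null: a Type I error
    return rejections / trials

rate = type_i_error_rate()
print(f"share of Type I errors: {rate:.3f}")
```

With an alpha of 0.05 the printed share should be roughly 5%, which is what choosing alpha means: it is the Type I error risk we agree to accept.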

7
Q

Correlation versus causation

A

● One frequent problem we have to deal with when trying to answer a research question is causality: can we really demonstrate a causal link between concepts by simply looking at a set of cases and comparing them?
● It is worth noting that, with only a few exceptions, all research methods and designs only help us establish a correlation: an independent variable (hypothesized cause) varies along with the dependent variable (outcome)
● To establish causality between two variables, we need both to observe correlation and to have a good theoretical explanation for the link
○ Example: age and political donations. We may find evidence illustrating correlation (i.e., as people get older, they tend to give more money to political parties). But can we explain it?
○ In other words, is the correlation causal? Does the variation in the independent variable cause the variation in the outcome?
