Week 8 Flashcards

1
Q

The trade-off between Type I and Type II errors

A

● As the likelihood of Type I error decreases (with a smaller α), the likelihood of Type II error increases, and vice versa. With α = 0.05, for instance, there is only a 5% chance of making a Type I error when the null hypothesis is true. So the question to ask is: what is the worst that could happen with each kind of error?

● Example: policy evaluation project testing the effectiveness of breakfast programs in inner-city schools:
○ Randomly sample children who are in breakfast programs
○ Compare their school performance with the overall school population
○ Set α = 0.05

● H0: There is no difference in performance between children in the program and all school kids, i.e., the program does not help with performance
○ Possible Type I error: 5% chance of rejecting the null when it is actually true
■ i.e., there is a 5% chance that you will find the program improves performance when in fact it does not
○ Possible Type II error: we don't reject the null, but it is actually false
■ i.e., we fail to detect a difference in performance even though there actually is one

● To reduce the risk of Type I error, we could set α = 0.01 instead
○ The risk of supporting an ineffective program goes down to 1 in 100
○ But this increases the risk of finding that the program does not improve school performance when in reality it does: lowering α raises the threshold the sample result must clear before we can reject the null hypothesis, so a real effect is more likely to be missed
○ So the odds of Type II error increase as we reduce the odds of Type I error
○ If a small Type I error is essential, increasing the sample size helps reduce Type II error: if you are not comfortable with the higher risk of Type II error that comes with a lower α, you can offset it by collecting a larger sample
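
The trade-off can be illustrated with a small calculation. This is only a sketch under assumptions that are not in the notes: a one-tailed z-test, a hypothetical true effect of 0.3 standard deviations, and hypothetical sample sizes of 50 and 100. The function name type_ii_error is made up for the example.

```python
from statistics import NormalDist

def type_ii_error(alpha, effect_size, n):
    """Probability of a Type II error (beta) for a one-tailed z-test.

    effect_size is the true difference in standard-deviation units
    (a hypothetical value chosen for illustration).
    """
    z = NormalDist()
    z_critical = z.inv_cdf(1 - alpha)   # threshold the test statistic must exceed to reject H0
    shift = effect_size * n ** 0.5      # center of the test statistic when H0 is false
    return z.cdf(z_critical - shift)    # chance of falling short of the threshold

# Same effect, three scenarios: baseline, stricter alpha, stricter alpha with a larger sample
for alpha, n in [(0.05, 50), (0.01, 50), (0.01, 100)]:
    print(f"alpha={alpha}, n={n}: beta={type_ii_error(alpha, 0.3, n):.3f}")
```

Under these assumptions, moving from α = 0.05 to α = 0.01 at the same sample size raises the Type II error rate, and doubling the sample size at α = 0.01 brings it back down.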

2
Q

Bivariate relationships with interval/ratio data

A

● In some situations we would like to know the rate of change: if X increases by a certain amount, can we predict (estimate) the increase in Y?

● For instance:
○ How many additional dollars in annual income can an individual expect from one additional year of education?
○ How many economic growth points can a government hope to add to its economy by implementing a growth plan?

Tools to investigate relationships between two interval/ratio variables:

  1. Scatterplots (easier than cross-tabs)
  2. Regression line/equation
  3. Correlation coefficients (r and r²)
3
Q

Scatterplots

A

● Unlike nominal and ordinal data (where cross-tabs are used), with I/R data it is possible to create a scatterplot: a graphical representation of where each case in a sample would fall along two dimensions that are continuous
● This presents a visual distribution of the cases along two dimensions: if there is a relationship, a visual pattern should show up in the distribution of the cases on the scatterplot

● The scatterplot gives us indications (hints) about the relationship:
○ A pattern indicates a relationship (there is a pattern in how the dots are distributed)
○ A direction (positive or negative)
○ A strength in the relationship, indicated by the concentration of the points around the pattern

● Visually:
○ The greater the extent to which dots follow a clearly defined pattern, the stronger the relationship
○ Positive relationship: we see a progression from the bottom left to the top right
○ Negative relationship: we see a progression from the top left to the bottom right
○ Linear vs. non-linear patterns: do the dots follow a straight line, or a curve?
○ But even then, it only helps us (compared to cross-tabs, for instance) up to a point: how do we qualify and quantify the strength of any relationship? Can we be more precise?

4
Q

Regression line

A

● Second tool: the line of best fit (‘regression’ line and equation):
○ Straight line that comes as close as possible to all data points
○ Summarizes the relationship between X and Y, and allows us to predict scores on Y from a score on X
○ The more clustered the dots are around the line, the stronger the relationship
○ When the regression line rises from left to right, the relationship is positive
○ When the regression line falls from left to right, the relationship is negative

● The line is not just a graphical tool:
○ The line's equation expresses the (approximate) relationship: Y = a + b*X, where
■ X and Y are the values of the two variables for a given case
■ a is the value of Y where the line crosses the Y-axis (i.e., at X = 0)
■ b is the slope of the line, i.e., the rate of change: when X increases by 1 unit, Y changes by b units
○ The equation can be used for predictions/estimations:
■ If we know X for a new case, we can use the equation to predict a value of Y
■ If the relationship is strong, our prediction from the regression equation will be close to the true value
■ The difference between the real value of Y for that case and the value we obtain from the equation is the prediction error: how far off we are when we predict Y for a new case from the regression equation and that case's value of X

HDI = 0.45 + 0.00574*INTERNET
For a country where 50% of the population has internet access, what prediction could we make regarding its HDI?
● HDI = 0.45 + 0.00574 * 50 = 0.737
● So we would predict an HDI of 0.737
● If, for instance, that country actually had an HDI score of 0.75, then our prediction error would be 0.75 - 0.737 = 0.013
For this equation we have:
● a = 0.45: the line crosses the Y-axis (X = 0) at Y = 0.45
● b = 0.00574: for every additional percentage point of the population with internet access, the predicted HDI increases by 0.00574 points
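
The same prediction and prediction error can be checked with a few lines of Python. This is just a sketch of the arithmetic above; the function name predict_hdi is made up for the example, and the coefficients are the ones given in the equation.

```python
INTERCEPT = 0.45    # a: predicted HDI when internet access is 0%
SLOPE = 0.00574     # b: change in predicted HDI per percentage point of internet access

def predict_hdi(internet_pct):
    """Predicted HDI from Y = a + b*X (hypothetical helper name)."""
    return INTERCEPT + SLOPE * internet_pct

predicted = predict_hdi(50)
prediction_error = 0.75 - predicted   # actual value minus predicted value

print(round(predicted, 3))         # 0.737
print(round(prediction_error, 3))  # 0.013
```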

5
Q

Measure of association for interval and ratio variables

A

Bivariate: Pearson's r - linear relationships, no PRE interpretation, ranges from -1 to 1. In absolute value: weak 0-0.30, moderate 0.31-0.70, strong 0.71-1.00

Bivariate: Coefficient of determination (r²) - linear relationships, has a PRE interpretation, ranges from 0 to 1. No weak, moderate, or strong scale.

Multivariate: Coefficient of multiple determination (R²) - has a PRE interpretation; scale: 0-0.30 weak, 0.31-0.70 moderate, 0.71 and over strong.
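
As a quick memory aid, the weak/moderate/strong scale for Pearson's r can be written as a tiny helper. This is only a sketch: the function name strength is made up, and it assumes the cut-offs apply to the absolute value of r, so that a negative r is rated by its size rather than its sign.

```python
def strength(r):
    """Label the strength of a correlation using the scale from the notes."""
    magnitude = abs(r)          # the sign gives direction, not strength
    if magnitude <= 0.30:
        return "weak"
    if magnitude <= 0.70:
        return "moderate"
    return "strong"

print(strength(0.25), strength(-0.55), strength(0.85))
```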

6
Q

Pearson’s r: the correlation coefficient

A

● Measures the strength and direction of a linear relationship; it can be read as the amount of change in Y associated with a one-unit change in X, where the units are expressed as standard deviations
○ E.g., H1: the age of a husband is related to the age of his wife
■ H0: there is no relationship between the age of a husband and the age of his wife
● We need to establish that r is statistically significant before drawing conclusions about our hypotheses
● In this example, r was 0.68, a moderate (bordering on strong) positive correlation, so we can say that, on average, as the husband's age increases, so does the age of his wife. If r had been negative instead, we would say that, on average, as the husband's age increases, the wife's age decreases.
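
To see what "units expressed as standard deviations" means in practice, r can be computed as the average product of the two variables' z-scores. The sketch below uses made-up ages (not the data behind the 0.68 in the example), and the function name pearson_r is hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical ages for six couples, for illustration only
husband_age = [25, 30, 35, 40, 50, 60]
wife_age = [23, 31, 30, 42, 45, 58]

def pearson_r(x, y):
    """Pearson's r as the mean product of z-scores."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)   # population standard deviations
    return mean(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y))

r = pearson_r(husband_age, wife_age)
print(f"r = {r:.2f}")  # positive: older husbands tend to have older wives in this made-up data
```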

7
Q

Coefficient of determination:

A

● Is the percentage of all variation in the dependent variable that can be explained by the values of the independent variable. In other words, r² is the percentage by which prediction errors are reduced when the information found in the independent variable is incorporated into the prediction of the dependent variable
● Explained variation is how much more accurate a prediction becomes when the independent variable is taken into account. Unexplained variation is the remaining prediction error, which could be due to variables that weren't included as predictors, measurement error, or random error.
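
The "reduction in error" reading of r² can be worked through by hand. The sketch below uses a small made-up data set and assumes the usual least-squares formulas for a and b (the notes do not spell these out): it compares the errors made when predicting Y from its mean alone with the errors made when predicting Y from the regression line.

```python
from statistics import mean

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

mx, my = mean(x), mean(y)

# Least-squares slope and intercept for Y = a + b*X
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx

# Errors when ignoring X (always predict the mean of Y)
total_variation = sum((yi - my) ** 2 for yi in y)
# Errors that remain when X is used (predict from the regression line)
unexplained_variation = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = (total_variation - unexplained_variation) / total_variation
print(f"r^2 = {r_squared:.3f}")  # share of prediction error removed by using X
```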

8
Q

Using a t-test to assess the significance of r

A

● When using r, it is necessary to determine whether the relationship we observe in the sample can be assumed to also exist in the population
● t-distributions are closely linked to z-distributions, but are used when the sample is small
● df = sample size - 2
● If, for our df, the critical t is 2.086 and we calculate an observed t of 1.101, the observed value is below the critical value, so we cannot reject the null hypothesis and cannot conclude that there is a correlation in the population. Remember that the critical value depends on the df, on whether the test is one- or two-tailed, and on the confidence level you choose
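
The comparison above can be sketched in code using the standard formula t = r * sqrt(n - 2) / sqrt(1 - r²). The values of r and n below are assumptions chosen so that the result lands near the observed t of 1.101; the notes do not give them. The critical value 2.086 corresponds to df = 20 in a two-tailed test at α = 0.05, which is also assumed here.

```python
from math import sqrt

def t_for_r(r, n):
    """Observed t statistic for a Pearson's r from a sample of size n (df = n - 2)."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

r, n = 0.239, 22      # hypothetical values; df = n - 2 = 20
t_critical = 2.086    # two-tailed, alpha = 0.05, df = 20 (from a t table)

t_observed = t_for_r(r, n)
significant = abs(t_observed) > t_critical

print(f"t observed = {t_observed:.3f}, significant: {significant}")
```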

9
Q

A stronger relationship will show:

A

○ A clear linear pattern on the scatterplot
○ A majority of the cases concentrated around the regression line
○ Higher r and r² values (in absolute terms, i.e., r closer to 1 or -1 and r² closer to 1)
○ Lower errors of prediction when we use the equation

10
Q

A weaker relationship will show:

A

○ A pattern difficult to detect on the scatterplot
○ Cases typically far from the regression line
○ r and r² values closer to 0
○ Larger errors of prediction when using the equation
