Week 1 - Sample size and power Flashcards
(22 cards)
What is the main principle behind null hypothesis significance testing (NHST)?
To proceed by falsification rather than proof: instead of trying to prove one's research hypothesis (that there is an effect) directly, we set up the null hypothesis and test whether the data allow us to reject it.
(BECAUSE: when a hypothesis is formed (based on research) it must be falsifiable - it must be possible to show that it is incorrect. This is a cornerstone of science.)
What is the null hypothesis?
The hypothesis that there is no effect (or no relation); i.e., that the study's proposed hypothesis is not true.
What does the p-value mean in NHST? What is another name for the p-value cut-off?
What determines the cut-off and what is the rationale behind choosing it?
The probability of observing results as extreme as (or more extreme than) those observed - e.g., in the magnitude of the t-statistic - if the null hypothesis were true (i.e., there were no effect).
The cut-off is known as ALPHA (the significance level). It is selected by the researcher to limit false positives (Type I errors): a LOW value (conventionally 0.05) is chosen so that, if the null were true, results would rarely be declared significant - this caps the risk of Type I/false-positive errors.
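A quick simulation sketch of this point (not from the original cards; assumes Python with numpy and scipy): when the null is really true, "significant" results still occur at about the alpha rate.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
alpha, hits, trials = 0.05, 0, 2000
for _ in range(trials):
    a, b = rng.normal(0, 1, 30), rng.normal(0, 1, 30)  # no true effect
    if ttest_ind(a, b).pvalue < alpha:
        hits += 1
print(hits / trials)  # ~0.05: false positives occur at roughly the alpha rate
```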
What is the power of a study? How does it relate to beta?
How is it defined algebraically?
Power is the ABILITY of the study to DETECT an effect if there REALLY IS an effect.
- Power is the probability (between 0 and 1) of NOT making a false negative.
It is the complement of BETA (β), the Type II error probability. Algebraically: Power = Pr(no Type II error) = 1 - β.
OR:
1 (certainty) - (the probability of inaccurately saying there is no effect when there is one) = (the probability that we will correctly detect an effect that is really there).
Could also be thought of as: POWER = the chance of detecting an effect when one truly exists (the ability to reject a false null hypothesis).
How do we control for Type I errors?
Type I errors are false positives. These are controlled directly through selection of the alpha level (the p-value cut-off).
The researcher decides the alpha in advance (conventionally 0.05).
What is a Type 1 Error?
a FALSE POSITIVE.
The incorrect rejection of a true null hypothesis
OR
when we reject H0 based on sample information when it is actually true in the population
OR
when we say there IS AN EFFECT based on sample information when THERE REALLY IS NO EFFECT in the population.
Pr(Type I error) = α (alpha), i.e., the significance level
What is a Type II Error?
a FALSE NEGATIVE.
The failure to reject the null hypothesis when it is false.
OR
When we accept H0 based on sample information when it is actually NOT TRUE in the population.
OR
when we say there IS NO EFFECT based on sample information when there REALLY IS AN EFFECT in the population
Pr(Type II error) = β (beta)
How do we control for type II errors?
Type II errors are false negatives. We control these indirectly through study design:
- sample size and other factors (e.g., choosing DV measures with a lower SD).
Increasing POWER (e.g., via a larger sample size) reduces the probability of making Type II errors.
Why is it important to conduct a power analysis and to consider power when conducting a research study?
If we do not have enough POWER, we cannot draw any conclusions from our results, as we won't know whether the observed result is likely to represent a true effect or just error variance/no effect (i.e., H0).
This means we would be putting our subjects through unnecessary effort in order to achieve nothing - this is unethical.
What is Cohen's d (and Glass's Δ and Hedges' g)?
How is it calculated?
A standardised effect size measure that represents the difference in means between two groups expressed in standard deviation units.
i.e., an estimate of the difference between two means expressed in standard deviation units.
d = (mean1 - mean2) / pooled SD
(Glass's Δ uses only the control group's SD as the denominator; Hedges' g applies a small-sample correction to d.)
When the group ns are equal, 'pooled SD' = (SD1 + SD2)/2 (more precisely, √((SD1² + SD2²)/2)).
When they are not equal, there is a weighted equation (see picture file 1).
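As a concrete illustration, a minimal Python sketch of Cohen's d using the standard weighted pooled-SD formula (presumably what picture file 1 shows; numpy assumed available):

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d with the weighted (pooled) SD for possibly unequal group sizes."""
    nx, ny = len(x), len(y)
    # Pooled variance: each group's variance weighted by its degrees of freedom,
    # so larger groups count proportionally more (see the next card).
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)
```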
What is pooled SD?
The weighted average of each group's standard deviation (the weighting gives larger groups a proportionally greater effect on the overall estimate).
What is the calculation of the t-statistic?
What does it mean? (i.e., deconstruct it)
See picture file 2.
The t-statistic is calculated by taking the unstandardised effect size (the DIFFERENCE BETWEEN GROUPS) and dividing it by (i.e., comparing it to) the variation within groups (the variance due to error, or ERROR VARIANCE).
i.e., BETWEEN-GROUP DIFFERENCE / ERROR VARIANCE (within-group variation).
It takes error variance into account and then asks: is there still much variation left over? If so, that must represent an effect!
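Picture file 2 is not reproduced here, but the standard independent-samples formula consistent with this description is:

t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}

where the numerator is the between-group difference and the denominator (the standard error, built from the pooled SD s_p) captures the within-group error variation.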
What does the size of the t-statistic affect?
AND
what affects the size of the t-statistic?
The t-statistic affects/determines the P-value.
BUT the t-statistic also depends on study power and design.
Power and design elements affect the denominator of the t-statistic equation (the within-group variance component, i.e., the pooled within-group variance), like this:
- Selecting measures with a smaller SD (i.e., less variance in the DV) decreases the pooled variance (less error variance), so less effect-related variance is needed to detect a significant effect (the unstandardised effect size is divided by a smaller number).
- Increasing the SAMPLE SIZE (power) also decreases the magnitude of the denominator: the more people we test, the smaller the standard error becomes; therefore, the easier it is to detect an effect.
Essentially, BY INCREASING THE SAMPLE SIZE and by SELECTING MEASURES WITH SMALL ERROR VARIANCE, we reduce the magnitude of the ERROR VARIANCE term. This makes it MORE LIKELY/PROBABLE that we will detect a true effect if there is one, because the true effect won't get lost in, or confused with, error variance.
OTHER STUFF:
The t-statistic is the (unstandardised) magnitude of the difference between two means divided by the error variation that occurred within groups [i.e., naturally occurring error variance, typically reflecting individual differences].
In other words, it COMPARES the difference observed between groups against the differences within groups, to determine whether the BETWEEN-GROUP difference is MORE THAN EXPECTED given the degree of naturally occurring error variance.
OR
it asks the question: 'what is the probability that the differences between groups could be explained by the differences within groups?' - i.e., might it just reflect individual differences?
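A small simulation sketch of this point (not from the cards; assumes numpy and scipy): larger n and smaller SD shrink the denominator, so t grows and p shrinks for the same true difference.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
true_diff = 5  # hypothetical true difference between group means

for n, sd in [(20, 15), (80, 15), (80, 5)]:
    control = rng.normal(0, sd, n)
    active = rng.normal(true_diff, sd, n)
    t, p = ttest_ind(active, control)
    print(f"n={n:3d}  SD={sd:2d}  t={t:5.2f}  p={p:.4f}")
```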
What are the five parameters associated with the power function?
(every hypothesis test has a power function)
- Effect size - the (unstandardised) difference between groups, set to the smallest difference of importance
- Variance - the variance of the DV (pooled SD)
- Sample size - n (and its allocation between groups)
- Statistical significance level - alpha, selected by the researcher
- Statistical power - desired power, selected by the researcher.
These can be algebraically rearranged to calculate a missing value given the other 4 parameters, e.g., to calculate the needed sample size for a study, or to calculate the power of a study.
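For example, a sketch using statsmodels (an alternative to the G*Power program mentioned later in this deck; the numbers are purely illustrative):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Leave one parameter unspecified and solve for it from the rest.
# Here: required sample size per group for d = 0.5, alpha = .05, power = .8.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n_per_group))  # about 64 per group for a medium effect (d = 0.5)
```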
What happens to type II error if you decrease type I error?
Type II error increases if Type I error decreases... and vice versa.
Look at the equation in picture file 3 - what does each component mean? What is this the equation for?
This is the power function equation, solved for the needed sample size (a prospective power calculation).
- N = the needed sample size; we are trying to calculate this.
- qe and qc are the PROPORTIONS (0-1) allocated to each group, i.e., if they are equal they will each = 0.5 (50% of participants in each group)
- the NUMERATOR (top part):
- σ² (sigma squared) represents the pooled SD squared (i.e., the pooled variance)
- z-alpha and z-beta are the Z-scores corresponding to the desired significance and power levels (Type I and Type II error management)
- the DENOMINATOR (bottom part) represents the unstandardised effect size - this should be SET to the smallest difference between groups that would have some clinical or social significance, e.g., a difference of 10 points on a measure between the two groups. NOT JUST WHAT YOU EXPECT TO HAPPEN!!
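Picture file 3 is not reproduced here, but based on the component descriptions above, a standard form of this equation (for comparing two means; total sample size N, allocation proportions q_e and q_c) is:

N = \frac{(z_{1-\alpha/2} + z_{1-\beta})^2 \, \sigma^2 \left( \frac{1}{q_e} + \frac{1}{q_c} \right)}{\delta^2}

where σ² is the pooled variance and δ is the smallest between-group difference of importance.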
What is an appropriate level of power?
> 0.8 is ok, >0.9 is better.
What information needs to be included in a ‘power statement’ in your research write up/proposal?
4 parameters: POWER, STATISTICAL SIGNIFICANCE, CHANGE (the smallest difference of importance), and SAMPLE SIZE (and allocation).
e.g.,
“To achieve a statistical power of 0.9 at the 0.05 level of statistical significance, where the difference in change in empathy scores between participants exposed to positive and negative media reports is at least 10 points, we require n = 46 participants allocated equally to each group.”
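A quick sanity check of this statement (a sketch, assuming a two-sided test and an SD of 10.5 - the SD quoted in the retrospective example later in this deck):

```python
from scipy.stats import norm

alpha, power = 0.05, 0.9
delta, sd = 10, 10.5   # smallest important difference; SD assumed (see later card)
qe = qc = 0.5          # equal allocation

z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
N = z**2 * sd**2 * (1 / qe + 1 / qc) / delta**2
print(N)  # ~46.3, consistent with the n = 46 quoted above
```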
Is it possible to calculate the power of a study retrospectively?
YES, but you will need the other 4 parameters.
- Alpha
- Sample size (and allocation)
- Effect size (unstandardised) - difference between groups
- Variance (pooled SD)
What if you don't know the unstandardised effect size that would be the smallest effect of importance (e.g., you are using a measure you have not used before)?
You can use Cohen's d instead - this is what the program G*Power does!
Select a Cohen's d that is a reasonable standardised effect size (e.g., 0.5).
How would you express a retrospective power analysis?
When would you use this?
The sample size obtained (n = 42; 20 controls and 22 active subjects) provided adequate statistical power (0.87) at the 0.05 level of statistical significance, provided the difference between groups was at least 10 points with an SD of 10.5.
Power is generally reported RETROSPECTIVELY (i.e., as one is writing up a study in a thesis or journal article).
- Remember to calculate power at the smallest effect size of importance.
- This helps to pre-empt criticism in negative studies that our results were negative simply because of inadequate power.
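The quoted power of 0.87 can be reproduced with a normal-approximation calculation (a sketch; the exact value depends on whether a z- or t-based formula is used):

```python
from math import sqrt
from scipy.stats import norm

n1, n2 = 20, 22        # controls, active subjects
delta, sd = 10, 10.5   # smallest important difference and its SD
alpha = 0.05

se = sd * sqrt(1 / n1 + 1 / n2)                  # standard error of the difference
power = norm.cdf(delta / se - norm.ppf(1 - alpha / 2))
print(round(power, 2))  # ~0.87, matching the statement above
```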
How does the prospective power calculation differ when you are using proportions (i.e., a yes/no response as the DV)?
Why do sample sizes need to be bigger to achieve good power in studies using a proportion-based DV?
Do only the differences between proportions matter, or do the proportions themselves matter too? Explain your answer.
As proportions follow a binomial distribution, there is no separate SD: the variance is determined by the proportion itself.
Sample sizes are bigger for DVs that are proportions (than for means) because a yes/no response carries less information than a continuous measure.
Both the effect size (the difference between the proportions) and each proportion value itself matter in calculating the needed sample size. This is because we are dealing with BINOMIAL PROPORTIONS, and the variance of a binomial random variable (p(1 - p)) is greatest at p = 0.5 (50%).
Even though the difference between two proportions (e.g., the % of people agreeing to let the boat people in, in group 1 vs group 2) may be the same size (e.g., 0.2, i.e., 20 percentage points), the sample size needed will also depend on the values of the proportions themselves...
e.g.,
- 0.9 vs 0.7 - needs n = 125
- 0.3 vs 0.1 - needs n = 125
BUT
- 0.5 vs 0.3 - needs n = 188
- 0.7 vs 0.5 - needs n = 188
When values are around 0.5, more participants are needed (as there is more variance).
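A sketch of the normal-approximation sample-size formula for two proportions (the exact figures above depend on the specific formula and corrections used, so this may not reproduce 125 and 188 exactly, but it shows the pattern - pairs nearer 0.5 need more participants):

```python
from scipy.stats import norm

def n_per_group(p1, p2, alpha=0.05, power=0.9):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)  # binomial variance, largest near 0.5
    return z**2 * variance / (p1 - p2)**2

for p1, p2 in [(0.9, 0.7), (0.3, 0.1), (0.5, 0.3), (0.7, 0.5)]:
    print(p1, p2, round(n_per_group(p1, p2)))
```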