lecture 9 - within subjects t-tests additional considerations Flashcards
Assumptions of a within-participant t-test - Random and independent samples
The assumptions need to be satisfied; otherwise the test is invalid and the conclusions drawn from it are invalidated.
Normally distributed “something or other” (distribution of sample means according to null hypothesis) (Field)
Formally it’s that the sampling distribution of the means (the mean difference scores) is approximately normal….
If n is large (n > 30 or so) this is very likely to be reasonably true (thanks to the central limit theorem): with a sample bigger than about 30, the central limit theorem means the normality assumption is likely to be satisfied.
If n is small, then look at the distribution of the data themselves (e.g. a histogram). If it looks fairly normal, you’re probably ok (unless people’s lives are at stake…). But if not (e.g. it’s strongly asymmetric or not very “mound-shaped”), worry, and worry more for really small n. So if the sample is small (less than about 20), worry about the assumption and do some checks on normality, e.g. plot a histogram to see whether the data look normally distributed; histograms also make it easy to spot outliers. If the data are skewed, you need to ask whether the normality assumption is satisfied, and outliers are a particular problem for t-tests because the mean used in a t-test is very sensitive to them.
Fortunately there are lots of checks you can do as well as different solutions to address this problem (more on those later).
A final worry/”assumption”:
Check your data for outliers, e.g. extreme data points that are a long way from most of the data. Think hard if you’ve got extreme outliers…. worry…. Talk to Field….
independence is important
If you look at someone else’s data in the experiment, e.g. a friend’s answers, your answers are now influenced by theirs, so the answers are not independent of each other.
random sampling
If you’re trying to draw a conclusion about a population, then at least conceptually you need to be able to sample every member of that population in some way. In practice this is rarely true, as the population is usually something like the whole human race.
Samples are more constrained in practice: they just need to be representative of the population.
practical benefit of the central-limit theorem
Means taken from skewed (or otherwise non-normally distributed) data are approximately normally distributed.
So, tests based on means are reasonably robust to departure from normality (as long as sample size is big enough).
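Not in the lecture, but a quick Python sketch of the central limit theorem in action, using a made-up, strongly skewed (exponential) “population”; all numbers are illustrative:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)

# 10,000 samples of n = 30 from a strongly positively skewed (exponential) "population"
samples = rng.exponential(scale=2.0, size=(10_000, 30))
sample_means = samples.mean(axis=1)

print("skew of the raw data:", round(float(skew(samples.ravel())), 2))  # ~2: very skewed
print("skew of the means:   ", round(float(skew(sample_means)), 2))     # much closer to 0: roughly normal
```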
Wilcoxon matched-pairs test assumptions
It doesn’t assume normality - it only assumes random and independent sampling. So it can be better to do a non-parametric test, because its assumptions are weaker and it doesn’t assume normality.
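As an illustration only (made-up data, not the lecture’s), here is how the two tests can be run side by side in Python with scipy:

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

cond_a = np.array([5.1, 6.3, 4.8, 7.0, 5.9, 6.4, 5.2, 6.8])
cond_b = np.array([4.2, 5.1, 4.9, 5.5, 5.0, 5.8, 4.4, 5.9])

t_stat, t_p = ttest_rel(cond_a, cond_b)   # parametric: assumes normality of the difference scores
w_stat, w_p = wilcoxon(cond_a, cond_b)    # non-parametric: the weaker assumptions described above

print(f"paired t:  t = {t_stat:.2f}, p = {t_p:.3f}")
print(f"Wilcoxon:  W = {w_stat:.1f}, p = {w_p:.3f}")
```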
assessing normality - QQ plots
examples in notes
Systematic deviation from the blue diagonal line suggests the data aren’t normally distributed.
A QQ plot plots the quantiles of the data themselves against those of a normal distribution, so you can form an impression of the ways in which the data are not normal.
You ask: do the data mainly fall on the diagonal line, or are there systematic deviations of the data from the diagonal line? Various kinds of non-normality correspond to various patterns on the QQ plot; a quick sketch of drawing one follows below.
positive skew = curved line falling below the diagonal line
negative skew = the mirror image, a curved line bowing above the diagonal
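Not from the lecture: a minimal Python sketch of drawing a QQ plot with scipy/matplotlib, using made-up, deliberately skewed “difference scores”:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
diffs = rng.exponential(scale=1.0, size=40)   # deliberately positively skewed "data"

stats.probplot(diffs, dist="norm", plot=plt)  # data quantiles vs. the normal diagonal line
plt.title("QQ plot of difference scores")
plt.show()
```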
what does field textbook say about normality
SPSS will do a test for normality called the Shapiro-Wilk test: if it is significant, that means the data are “significantly” non-normal…..
If the number of data points is large (> 30 or maybe > 60), a “significant violation” probably doesn’t matter as it will very likely be small….
If the number of data points is small (<30’ish), a significant Shapiro-Wilk test is a strong indication that the normality assumption probably is NOT satisfied…. so worry … as the central limit theorem doesn’t promise normality under those circumstances and the test itself says that you haven’t satisfied it.
What about Zhang et al.? Why didn’t they do Wilcoxon’s everywhere?
However, a nonsignificant Shapiro-Wilk test is not a good indication there isn’t a problem because the test is not very powerful for small n….. So worry…. Look at histograms, etc.
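Outside SPSS, a hedged sketch of running Shapiro-Wilk with scipy on made-up difference scores:

```python
import numpy as np
from scipy.stats import shapiro

diffs = np.array([0.4, 1.1, -0.2, 0.9, 3.8, 0.5, 0.7, -0.1, 4.2, 0.3])  # made-up difference scores

w, p = shapiro(diffs)
print(f"Shapiro-Wilk W = {w:.3f}, p = {p:.3f}")
if p < 0.05:
    print("Significant: normality looks violated - worry, especially with small n.")
else:
    print("Nonsignificant: but with small n the test has little power, so still look at histograms/QQ plots.")
```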
practical research advice for the normality assumption
Always look at histograms (and QQ plots) as they also help spot outliers
If you have a lot of data, you’re probably ok
If you have a little data, worry: maybe think about bootstrapping, consider a nonparametric test, maybe get more data….
bootstrapping
Bootstrapping is one possible solution to a normality problem, and Field spells out the details. The intuition is that instead of using off-the-shelf distributions like the z-distribution or the t-distribution, you pretend the shape of the population distribution is exactly the same as the sample distribution (e.g. the sample might be positively skewed). You randomly take a sample from that population (using a computer) and calculate a sample statistic. You then do this over and over and plot the distribution of the sample statistic. You can then assess your original sample statistic relative to this bootstrapped distribution without worrying about normality….
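A minimal sketch of that intuition in Python (my own illustration with made-up difference scores and a simple percentile interval, not necessarily the exact procedure Field describes):

```python
import numpy as np

rng = np.random.default_rng(42)
diffs = np.array([0.4, 1.1, -0.2, 0.9, 3.8, 0.5, 0.7, -0.1, 4.2, 0.3])  # made-up, skewed difference scores

# Pretend the population looks exactly like the sample: resample it with replacement,
# compute the mean each time, and build the bootstrapped distribution of the mean.
boot_means = np.array([
    rng.choice(diffs, size=len(diffs), replace=True).mean()
    for _ in range(10_000)
])

ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"observed mean difference = {diffs.mean():.2f}")
print(f"95% bootstrap CI = [{ci_low:.2f}, {ci_high:.2f}]")  # no normality assumption needed
```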
The shape of the t distribution is determined by this df.
For low df’s the t distribution is a bit like a “squashed” z distribution.
As the dfs get bigger the t distribution looks more and more like the z distribution….
When df is small, the critical value of t needs to be further out in the tails to cut off 5% of the distribution than the critical value of the normal distribution does. As the df gets larger, the t distribution becomes more and more like the normal distribution.
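A quick check of this (not from the slides) using scipy’s t and normal distributions:

```python
from scipy.stats import t, norm

# Two-tailed 5% critical values: t has to sit further out in the tails for small df,
# and approaches the z value (1.96) as df grows.
for df in (2, 5, 10, 30, 100):
    print(f"df = {df:>3}: critical t = {t.ppf(0.975, df):.3f}")
print(f"normal (z): critical value = {norm.ppf(0.975):.3f}")
```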
Null hypothesis for one- and two-tailed tests.
one-tailed / directional - Your hypothesis must be that there will be an effect in one particular direction only.
two-tailed - The critical value is larger, so it is harder to reach significance.
One-tailed vs. two-tailed tests
There are those who say never do a one-tailed test.
If you are in any doubt, follow that advice!
Avoid one-tailed tests unless -
You are worried that you will fail to get a significant result for a small, but real, effect (perhaps N is limited) and the result in the opposite tail you’re ignoring isn’t meaningful.
At minimum, you have decided in advance the expected direction of the effect (which mean will be higher), according to your hypothesis.
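If you do decide a one-tailed test is justified, scipy can compute it directly via the `alternative` argument (scipy >= 1.6); the data here are made up for illustration:

```python
import numpy as np
from scipy.stats import ttest_rel

cond_a = np.array([5.1, 6.3, 4.8, 7.0, 5.9, 6.4, 5.2, 6.8])
cond_b = np.array([4.2, 5.1, 4.9, 5.5, 5.0, 5.8, 4.4, 5.9])

t2, p_two = ttest_rel(cond_a, cond_b)                         # two-tailed (the default)
t1, p_one = ttest_rel(cond_a, cond_b, alternative="greater")  # one-tailed: A > B was predicted in advance

print(f"two-tailed p = {p_two:.4f}, one-tailed p = {p_one:.4f}")  # one-tailed p is half the two-tailed p here
```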
confidence intervals (CI)
A confidence interval gives a range of plausible mean differences
Even better, making the error bars a 95% confidence interval visually implies statistical significance:
The fact that the 95% CI does not include 0 here directly corresponds to a statistically significant difference, p < 0.05.
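A minimal sketch (made-up difference scores) of computing such a 95% CI by hand, as D̄ ± t_crit × sM:

```python
import numpy as np
from scipy import stats

diffs = np.array([1.2, 0.8, 1.9, 0.4, 1.1, 1.5, 0.9, 1.3])  # made-up difference scores
n = len(diffs)
d_bar = diffs.mean()
s_m = diffs.std(ddof=1) / np.sqrt(n)          # standard error of the mean difference, sM

t_crit = stats.t.ppf(0.975, df=n - 1)         # two-tailed 5% critical value
ci = (d_bar - t_crit * s_m, d_bar + t_crit * s_m)
print(f"mean difference = {d_bar:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
# If the CI excludes 0, the corresponding paired t-test is significant at p < .05 (two-tailed).
```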
some observations on the formulae
ways to make t big, i.e. increase power, so it is more likely to be significant…
t = D̄ / sM
Make the manipulation stronger: The bigger the difference, 𝐷̄, the larger t is.
Collect more and less noisy data: The smaller the standard error, sM, the larger t is.
sM = s / √N
So the smaller the standard deviation s and the bigger N, the smaller the standard error sM and the larger t.
The tables for t only cover positive values, but negative ones can be significant too. So, simply use the absolute value of t unless you are doing a 1-tailed test.
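A hedged check of these formulae in Python, computing t by hand and comparing with scipy’s paired t-test (made-up data):

```python
import numpy as np
from scipy.stats import ttest_rel

cond_a = np.array([5.1, 6.3, 4.8, 7.0, 5.9, 6.4, 5.2, 6.8])
cond_b = np.array([4.2, 5.1, 4.9, 5.5, 5.0, 5.8, 4.4, 5.9])

diffs = cond_a - cond_b
d_bar = diffs.mean()                              # D-bar: the mean difference
s_m = diffs.std(ddof=1) / np.sqrt(len(diffs))     # sM = s / sqrt(N)
t_by_hand = d_bar / s_m                           # t = D-bar / sM

t_scipy, p = ttest_rel(cond_a, cond_b)
print(f"by hand: t = {t_by_hand:.3f};  scipy: t = {t_scipy:.3f}, p = {p:.4f}")
```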
the effect of sample size on the standard error
on notes
effect size
How big is the difference, or how strong is the relationship, that I’ve observed relative to the noisy variability that I’ve got? An effect size is an objective and (usually) standardized measure of the magnitude of an observed effect. The fact that the measure is ‘standardized’ means that we can compare effect sizes across different studies that have measured different variables, or have used different scales of measurement.
A significant result (e.g. p < 0.05) isn’t necessarily interesting or practically useful, especially if the effect is tiny, because very large samples can make tiny effects significant.
effect size for a within-subjects t-test
How BIG? The mean difference between conditions.
relative to
How variable? The standard deviation of the differences
A large effect size: a large difference relative to small variability
A small effect size: a small difference relative to large variability
Cohen’s d̂ = d̅ (lowercase d-bar) / s
Cohen’s rules of thumb -
small effect size = 0.2
medium = 0.5
large = 0.8
output in notes
d̂ = d̅ (mean difference) / s (standard deviation of the differences)
Include the effect size in your description of the results: “… the effect size for the difference between tulips and roses was large, Cohen’s d̂ = 1.153.”
The “hat” on Cohen’s d, as per Field, is a reminder that we’re using properties of the sample to estimate something about the population. Using the rules of thumb doesn’t always answer the practical question of whether the effect size is big enough to be interesting or practically useful, as those conclusions tend to be context specific.
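A minimal sketch of computing d̂ for a within-subjects design; the tulips/roses ratings below are invented for illustration, not the lecture’s data:

```python
import numpy as np

tulips = np.array([7.2, 6.8, 8.1, 7.5, 6.9, 7.8, 8.3, 7.1])  # hypothetical ratings
roses  = np.array([6.1, 6.6, 7.0, 6.2, 7.1, 6.5, 7.1, 6.0])

diffs = tulips - roses
d_hat = diffs.mean() / diffs.std(ddof=1)   # d-hat = mean difference / SD of the differences
print(f"Cohen's d-hat = {d_hat:.3f}")      # compare against ~0.2 small, ~0.5 medium, ~0.8 large
```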
Pearson’s r
Cohen’s effect sizes for Pearson’s r -
r = 0.10 (small effect): In this case the effect explains 1% of the total variance. (You can convert r to the proportion of variance by squaring it – see Section 8.4.2.)
r = 0.30 (medium effect): The effect accounts for 9% of the total variance.
r = 0.50 (large effect): The effect accounts for 25% of the variance
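A tiny check of that conversion (squaring r gives the proportion of variance explained):

```python
for r in (0.10, 0.30, 0.50):
    print(f"r = {r:.2f} -> r^2 = {r**2:.2f} ({r**2:.0%} of the variance explained)")
```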
Why and how are statistical significance and effect size different concepts?
Significance is related to the number of data points for a given amount of noisy variability whereas effect size isn’t.
For a given amount of variability, i.e. a value of the sample standard deviation, t gets bigger as n increases but Cohen’s d̂ doesn’t.
t = D̄ / (s / √N)
Large samples give a lot of “power” to potentially detect small effects.
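An illustration of this point with assumed, fixed values of D̄ and s (so the numbers themselves are arbitrary):

```python
import numpy as np

d_bar, s = 0.5, 2.0   # assumed fixed mean difference and SD of the differences
for n in (10, 50, 200, 1000):
    t = d_bar / (s / np.sqrt(n))   # grows with N
    d_hat = d_bar / s              # doesn't depend on N
    print(f"N = {n:>4}: t = {t:5.2f}, Cohen's d-hat = {d_hat:.2f}")
```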
statistical power
The intuition: power is the potential to detect a particular effect, e.g. a difference, if it really exists
Say an effect definitely exists and it has a large effect size
Then that will be relatively easy to “detect” (e.g. get a statistically significant difference) in an experiment with relatively few participants. The experiment has high power.
In contrast, say an effect definitely exists but it has a small effect size
An experiment with only a few participants would be unlikely to detect this difference, e.g. find a statistically significant difference. The experiment has low power.
It’s not about knowing the effect definitely exists, but rather assuming it does, to see what you’d need to do to detect it if it actually exists.
why is calculating statistical power useful?
It helps answer a practical question - how many participants should I run in my experiment?
Formal power analysis will tell you how many participants you’ll need to have a high probability of detecting a given effect size if the effect really exists.
And being fairly sure you have high power may help to clarify the reasons for a nonsignificant result if you get one, i.e. maybe the effect really doesn’t exist…..
Current advice in psychology is to always calculate power before running an experiment, e.g. in SPSS, Gpower, etc.
But you need to specify an effect size. Where do you get one? Prior research is a good guess, as it gives you an idea of how big the effect you’re looking at is likely to be, but it is still an educated guess…. Alternatively, specify a minimum effect size that would be “interesting”….
If you’ve run the experiment and got a significant result, then a formal power calculation based on the effect size for that significant result tells you very little, if anything, beyond what the p value already told you. This is the infamous concept of “post-hoc power”.
To do a power calculation you need some kind of effect size (a sketch of such a calculation follows below).
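A hedged sketch of a formal power analysis in Python with statsmodels rather than SPSS/G*Power; the effect size of 0.5 is an assumed input, not a value from the lecture:

```python
from statsmodels.stats.power import TTestPower

# Within-subjects power works on the difference scores, i.e. a one-sample t-test.
analysis = TTestPower()
n_needed = analysis.solve_power(effect_size=0.5,        # assumed d (from prior research or "minimum interesting")
                                alpha=0.05,
                                power=0.80,
                                alternative="two-sided")
print(f"participants needed for 80% power at d = 0.5: {n_needed:.1f}")  # ~34 in this hypothetical case
```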
what is an assumption
An assumption is a condition that ensures that what you’re attempting to do works.
For example, when we assess a model using a test statistic, we have usually made some assumptions, and if these assumptions are true then we know that we can take the test statistic (and associated p-value) at face value and interpret it accordingly. Conversely, if any of the assumptions are not true (usually referred to as a violation), then the test statistic and p-value will be inaccurate and could lead us to the wrong conclusion.
the main assumptions are -
additivity and linearity;
normality of something or other;
homoscedasticity/homogeneity of variance;
independence
normally distributed something or other
It relates to the normal distribution, but the data don’t need to be normally distributed. It relates in different ways to things we want to do when fitting models and assessing them:
- parameter estimates - The mean is a parameter, and extreme scores can bias it. This illustrates that estimates of parameters are affected by non-normal distributions (such as those with outliers). Parameter estimates differ in how much they are biased in a non-normal distribution: the median, for example, is less biased by skewed distributions than the mean. We’ve also seen that any model we fit will include some error: it won’t predict the outcome variable perfectly for every case. Therefore, for each case there is an error term (the deviance or residual). If these residuals are normally distributed in the population, then using the method of least squares to estimate the parameters will produce better estimates than other methods.
For the estimates of the parameters that define a model to be optimal (to have the least possible error given the data) the residuals in the population must be normally distributed. This is true mainly if we use the method of least squares which we often do
confidence intervals - We use values of the standard normal distribution to compute the confidence interval around a parameter estimate. Using values of the standard normal distribution makes sense only if the parameter estimate comes from one.
For confidence intervals around a parameter estimate to be accurate, that estimate must have a normal sampling distribution.
null hypothesis significance testing -
If we want to test a hypothesis about a model (and, therefore, the parameter estimates within it) then we assume that the parameter estimates have a normal distribution. We assume this because the test statistics that we use (which we will learn about in due course) have distributions related to the normal distribution (such as the t-, F- and chi-square distributions), so if our parameter estimate is normally distributed then these test statistics and p-values will be accurate.
For significance tests of models (and the parameter estimates that define them) to be accurate, the sampling distribution of what’s being tested must be normal. For example, if testing whether two means are different, the data do not need to be normally distributed, but the sampling distribution of means (or differences between means) does. Similarly, if looking at relationships between variables, the significance tests of the parameter estimates that define those relationships will be accurate only when the sampling distribution of the estimate is normal.
when does the assumption of normality matter?
If all you want to do is estimate the parameters of your model, then normality matters mainly in deciding how best to estimate them. If you want to construct confidence intervals around those parameters, or compute significance tests relating to those parameters, then the assumption of normality matters in small samples, but because of the central limit theorem we don’t really need to worry about this assumption in larger samples. In practical terms, provided your sample is large, outliers are a more pressing concern than normality. Although we tend to think of outliers as isolated very extreme cases, you can have outliers that are less extreme but are not isolated cases. These outliers can dramatically reduce the power of significance tests.