lecture 9 - within subjects t-tests additional considerations Flashcards
Assumptions of a within-participant t-test - Random and independent samples
The assumptions need to be satisfied; otherwise the test is invalid and the conclusions drawn from it are invalidated.
Normally distributed “something or other” (distribution of sample means according to null hypothesis) (Field)
Formally it’s that the sampling distribution of the means (the mean difference scores) is approximately normal….
If n is large (n > 30 or so) this is very likely to be reasonably true (thanks to the central limit theorem): with a sample bigger than about 30, the central limit theorem means the normality assumption is likely to be satisfied.
If n is small, then look at the distribution of the data themselves (e.g. a histogram). If it looks fairly normal, you’re probably ok (unless people’s lives are at stake…). But if not (e.g. it’s strongly asymmetric or not very “mound-shaped”), worry, and worry more for really small n. So if the sample is small (less than about 20), worry about the assumption and do some checks on normality, e.g. plot a histogram to see whether the data look normally distributed; histograms also make it easy to spot outliers. If the data are skewed, you need to ask whether the normality assumption is satisfied, and outliers are a particular problem for t-tests because the mean used in a t-test is very sensitive to them.
Fortunately there are lots of checks you can do as well as different solutions to address this problem (more on those later).
A final worry/”assumption”:
Check your data for outliers, e.g. extreme data points that are a long way from most of the data. Think hard if you’ve got extreme outliers…. worry…. Talk to Field….
independence is important
If you look at someone else’s data in the experiment, e.g. a friend’s answers, your answers are now influenced by theirs, so the answers are not independent of each other.
random sampling
If you’re trying to draw a conclusion about a population, then at least conceptually you need to be able to sample every member of that population in some way. In practice this is rarely true, as the population is usually something like the whole human race.
Samples are more constrained in practice: they just need to be representative of the population.
practical benefit of the central-limit theorem
Means taken from skewed (or otherwise non-normally distributed) data are approximately normally distributed.
So, tests based on means are reasonably robust to departure from normality (as long as sample size is big enough).
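Not in the lecture, but a quick Python sketch of the central limit theorem in action, using a made-up, strongly skewed (exponential) “population”; all numbers are illustrative:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)

# 10,000 samples of n = 30 from a strongly positively skewed (exponential) "population"
samples = rng.exponential(scale=2.0, size=(10_000, 30))
sample_means = samples.mean(axis=1)

print("skew of the raw data:", round(float(skew(samples.ravel())), 2))  # ~2: very skewed
print("skew of the means:   ", round(float(skew(sample_means)), 2))     # much closer to 0: roughly normal
```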
Wilcoxon matched-pairs test assumptions
It doesn’t assume normality - it only assumes random and independent sampling. So it can be better to do a non-parametric test, because its assumptions are weaker and it doesn’t assume normality.
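As an illustration only (made-up data, not the lecture’s), here is how the two tests can be run side by side in Python with scipy:

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

cond_a = np.array([5.1, 6.3, 4.8, 7.0, 5.9, 6.4, 5.2, 6.8])
cond_b = np.array([4.2, 5.1, 4.9, 5.5, 5.0, 5.8, 4.4, 5.9])

t_stat, t_p = ttest_rel(cond_a, cond_b)   # parametric: assumes normality of the difference scores
w_stat, w_p = wilcoxon(cond_a, cond_b)    # non-parametric: the weaker assumptions described above

print(f"paired t:  t = {t_stat:.2f}, p = {t_p:.3f}")
print(f"Wilcoxon:  W = {w_stat:.1f}, p = {w_p:.3f}")
```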
assessing normality - QQ plots
examples in notes
Systematic deviation from the blue diagonal line suggests the data aren’t normally distributed.
A QQ plot plots the quantiles of the data themselves against those of a normal distribution, so you can form an impression of the ways in which the data are not normal.
You ask: do the data mainly fall on the diagonal line, or are there systematic deviations of the data from the diagonal line? Various kinds of non-normality correspond to various patterns on the QQ plot; a quick sketch of drawing one follows below.
positive skew = curved line falling below the diagonal line
negative skew = the mirror image, a curved line bowing above the diagonal
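Not from the lecture: a minimal Python sketch of drawing a QQ plot with scipy/matplotlib, using made-up, deliberately skewed “difference scores”:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
diffs = rng.exponential(scale=1.0, size=40)   # deliberately positively skewed "data"

stats.probplot(diffs, dist="norm", plot=plt)  # data quantiles vs. the normal diagonal line
plt.title("QQ plot of difference scores")
plt.show()
```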
what does field textbook say about normality
SPSS will do a test for normality called the Shapiro-Wilk test: if it is significant, that means the data are “significantly” non-normal…..
If the number of data points is large (> 30 or maybe > 60), a “significant violation” probably doesn’t matter as it will very likely be small….
If the number of data points is small (<30’ish), a significant Shapiro-Wilk test is a strong indication that the normality assumption probably is NOT satisfied…. so worry … as the central limit theorem doesn’t promise normality under those circumstances and the test itself says that you haven’t satisfied it.
What about Zhang et al.? Why didn’t they do Wilcoxon’s everywhere?
However, a nonsignificant Shapiro-Wilk test is not a good indication there isn’t a problem because the test is not very powerful for small n….. So worry…. Look at histograms, etc.
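Outside SPSS, a hedged sketch of running Shapiro-Wilk with scipy on made-up difference scores:

```python
import numpy as np
from scipy.stats import shapiro

diffs = np.array([0.4, 1.1, -0.2, 0.9, 3.8, 0.5, 0.7, -0.1, 4.2, 0.3])  # made-up difference scores

w, p = shapiro(diffs)
print(f"Shapiro-Wilk W = {w:.3f}, p = {p:.3f}")
if p < 0.05:
    print("Significant: normality looks violated - worry, especially with small n.")
else:
    print("Nonsignificant: but with small n the test has little power, so still look at histograms/QQ plots.")
```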
practical research advice for the normality assumption
Always look at histograms (and QQ plots) as they also help spot outliers
If you have a lot of data, you’re probably ok
If you have a little data, worry: maybe think about bootstrapping, consider a nonparametric test, maybe get more data….
bootstrapping
Bootstrapping is one possible solution to a normality problem, and Field spells out the details. The intuition is that instead of using off-the-shelf distributions like the z-distribution or the t-distribution, you pretend the shape of the population distribution is exactly the same as the sample distribution (e.g. the sample might be positively skewed). You randomly take a sample from that population (using a computer) and calculate a sample statistic. You then do this over and over and plot the distribution of the sample statistic. You can then assess your original sample statistic relative to this bootstrapped distribution without worrying about normality….
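A minimal sketch of that intuition in Python (my own illustration with made-up difference scores and a simple percentile interval, not necessarily the exact procedure Field describes):

```python
import numpy as np

rng = np.random.default_rng(42)
diffs = np.array([0.4, 1.1, -0.2, 0.9, 3.8, 0.5, 0.7, -0.1, 4.2, 0.3])  # made-up, skewed difference scores

# Pretend the population looks exactly like the sample: resample it with replacement,
# compute the mean each time, and build the bootstrapped distribution of the mean.
boot_means = np.array([
    rng.choice(diffs, size=len(diffs), replace=True).mean()
    for _ in range(10_000)
])

ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"observed mean difference = {diffs.mean():.2f}")
print(f"95% bootstrap CI = [{ci_low:.2f}, {ci_high:.2f}]")  # no normality assumption needed
```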
The shape of the t distribution is determined by this df.
For low df’s the t distribution is a bit like a “squashed” z distribution.
As the dfs get bigger the t distribution looks more and more like the z distribution….
When df is small, the critical value of t needs to be further out in the tails to cut off 5% of the distribution than the critical value of the normal distribution does. As the df gets larger, the t distribution becomes more and more like the normal distribution.
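A quick check of this (not from the slides) using scipy’s t and normal distributions:

```python
from scipy.stats import t, norm

# Two-tailed 5% critical values: t has to sit further out in the tails for small df,
# and approaches the z value (1.96) as df grows.
for df in (2, 5, 10, 30, 100):
    print(f"df = {df:>3}: critical t = {t.ppf(0.975, df):.3f}")
print(f"normal (z): critical value = {norm.ppf(0.975):.3f}")
```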
Null hypothesis for one- and two-tailed tests.
one-tailed / directional - Your hypothesis must be that there will be an effect in one particular direction only.
two-tailed - The critical value is larger, so it is harder to reach significance.
One-tailed vs. two-tailed tests
There are those who say never do a one-tailed test.
If you are in any doubt, follow that advice!
Avoid one-tailed tests unless -
You are worried that you will fail to get a significant result for a small, but real, effect (perhaps N is limited) and the result in the opposite tail you’re ignoring isn’t meaningful.
At minimum, you have decided in advance the expected direction of the effect (which mean will be higher), according to your hypothesis.
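If you do decide a one-tailed test is justified, scipy can compute it directly via the `alternative` argument (scipy >= 1.6); the data here are made up for illustration:

```python
import numpy as np
from scipy.stats import ttest_rel

cond_a = np.array([5.1, 6.3, 4.8, 7.0, 5.9, 6.4, 5.2, 6.8])
cond_b = np.array([4.2, 5.1, 4.9, 5.5, 5.0, 5.8, 4.4, 5.9])

t2, p_two = ttest_rel(cond_a, cond_b)                         # two-tailed (the default)
t1, p_one = ttest_rel(cond_a, cond_b, alternative="greater")  # one-tailed: A > B was predicted in advance

print(f"two-tailed p = {p_two:.4f}, one-tailed p = {p_one:.4f}")  # one-tailed p is half the two-tailed p here
```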
confidence intervals (CI)
A confidence interval gives a range of plausible mean differences
Even better, making the error bars a 95% confidence interval visually implies statistical significance:
The fact that the 95% CI does not include 0 here directly corresponds to a statistically significant difference, p < 0.05.
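A minimal sketch (made-up difference scores) of computing such a 95% CI by hand, as D̄ ± t_crit × sM:

```python
import numpy as np
from scipy import stats

diffs = np.array([1.2, 0.8, 1.9, 0.4, 1.1, 1.5, 0.9, 1.3])  # made-up difference scores
n = len(diffs)
d_bar = diffs.mean()
s_m = diffs.std(ddof=1) / np.sqrt(n)          # standard error of the mean difference, sM

t_crit = stats.t.ppf(0.975, df=n - 1)         # two-tailed 5% critical value
ci = (d_bar - t_crit * s_m, d_bar + t_crit * s_m)
print(f"mean difference = {d_bar:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
# If the CI excludes 0, the corresponding paired t-test is significant at p < .05 (two-tailed).
```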
some observations on the formulae
ways to make t big, i.e. increase power, so it is more likely to be significant…
t = D̄ / sM
Make the manipulation stronger: The bigger the difference, 𝐷̄, the larger t is.
Collect more and less noisy data: The smaller the standard error, sM, the larger t is.
sM = s / √N
So the smaller the standard deviation s and the bigger N, the smaller the standard error sM and the larger t.
The tables for t only cover positive values, but negative ones can be significant too. So, simply use the absolute value of t unless you are doing a 1-tailed test.
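A hedged check of these formulae in Python, computing t by hand and comparing with scipy’s paired t-test (made-up data):

```python
import numpy as np
from scipy.stats import ttest_rel

cond_a = np.array([5.1, 6.3, 4.8, 7.0, 5.9, 6.4, 5.2, 6.8])
cond_b = np.array([4.2, 5.1, 4.9, 5.5, 5.0, 5.8, 4.4, 5.9])

diffs = cond_a - cond_b
d_bar = diffs.mean()                              # D-bar: the mean difference
s_m = diffs.std(ddof=1) / np.sqrt(len(diffs))     # sM = s / sqrt(N)
t_by_hand = d_bar / s_m                           # t = D-bar / sM

t_scipy, p = ttest_rel(cond_a, cond_b)
print(f"by hand: t = {t_by_hand:.3f};  scipy: t = {t_scipy:.3f}, p = {p:.4f}")
```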
the effect of sample size on the standard error
on notes
effect size
How big is the difference, or how strong is the relationship, that I’ve observed relative to the noisy variability that I’ve got? An effect size is an objective and (usually) standardized measure of the magnitude of an observed effect. The fact that the measure is ‘standardized’ means that we can compare effect sizes across different studies that have measured different variables, or have used different scales of measurement.
A significant result (e.g. p < 0.05) isn’t necessarily interesting or practically useful, especially if the effect is tiny, because very large samples can make tiny effects significant.
effect size for a within-subjects t-test
How BIG? The mean difference between conditions.
relative to
How variable? The standard deviation of the differences
A large effect size: a large difference relative to small variability
A small effect size: a small difference relative to large variability
Cohen’s d̂ = d̅ (lowercase d-bar) / s
Cohen’s rules of thumb -
small effect size = 0.2
medium = 0.5
large = 0.8
output in notes
d̂ = d̅ (mean difference) / s (standard deviation of the differences)
Include the effect size in your description of the results: “… the effect size for the difference between tulips and roses was large, Cohen’s d̂ = 1.153.”
The “hat” on Cohen’s d, as per Field, is a reminder that we’re using properties of the sample to estimate something about the population. Using the rules of thumb doesn’t always answer the practical question of whether the effect size is big enough to be interesting or practically useful, as those conclusions tend to be context specific.
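A minimal sketch of computing d̂ for a within-subjects design; the tulips/roses ratings below are invented for illustration, not the lecture’s data:

```python
import numpy as np

tulips = np.array([7.2, 6.8, 8.1, 7.5, 6.9, 7.8, 8.3, 7.1])  # hypothetical ratings
roses  = np.array([6.1, 6.6, 7.0, 6.2, 7.1, 6.5, 7.1, 6.0])

diffs = tulips - roses
d_hat = diffs.mean() / diffs.std(ddof=1)   # d-hat = mean difference / SD of the differences
print(f"Cohen's d-hat = {d_hat:.3f}")      # compare against ~0.2 small, ~0.5 medium, ~0.8 large
```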
Pearson’s r
Cohen’s effect sizes for Pearson’s r -
r = 0.10 (small effect): In this case the effect explains 1% of the total variance. (You can convert r to the proportion of variance by squaring it – see Section 8.4.2.)
r = 0.30 (medium effect): The effect accounts for 9% of the total variance.
r = 0.50 (large effect): The effect accounts for 25% of the variance
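A tiny check of that conversion (squaring r gives the proportion of variance explained):

```python
for r in (0.10, 0.30, 0.50):
    print(f"r = {r:.2f} -> r^2 = {r**2:.2f} ({r**2:.0%} of the variance explained)")
```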
Why and how are statistical significance and effect size different concepts?
Significance is related to the number of data points for a given amount of noisy variability whereas effect size isn’t.
For a given amount of variability, i.e. a value of the sample standard deviation, t gets bigger as n increases but Cohen’s d̂ doesn’t.
t = D̄ / (s / √N)
Large samples give a lot of “power” to potentially detect small effects.
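An illustration of this point with assumed, fixed values of D̄ and s (so the numbers themselves are arbitrary):

```python
import numpy as np

d_bar, s = 0.5, 2.0   # assumed fixed mean difference and SD of the differences
for n in (10, 50, 200, 1000):
    t = d_bar / (s / np.sqrt(n))   # grows with N
    d_hat = d_bar / s              # doesn't depend on N
    print(f"N = {n:>4}: t = {t:5.2f}, Cohen's d-hat = {d_hat:.2f}")
```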
statistical power
The intuition: power is the potential to detect a particular effect, e.g. a difference, if it really exists
Say an effect definitely exists and it has a large effect size
Then that will be relatively easy to “detect” (e.g. get a statistically significant difference) in an experiment with relatively few participants. The experiment has high power.
In contrast, say an effect definitely exists but it has a small effect size
An experiment with only a few participants would be unlikely to detect this difference, e.g. find a statistically significant difference. The experiment has low power.
It’s not about knowing the effect definitely exists, but rather assuming it does, to see what you’d need to do to detect it if it actually exists.
why is calculating statistical power useful?
It helps answer a practical question - how many participants should I run in my experiment?
Formal power analysis will tell you how many participants you’ll need to have a high probability of detecting a given effect size if the effect really exists.
And being fairly sure you have high power may help to clarify the reasons for a nonsignificant result if you get one, i.e. maybe the effect really doesn’t exist…..
Current advice in psychology is to always calculate power before running an experiment, e.g. in SPSS, Gpower, etc.
But you need to specify an effect size. Where do you get one? Prior research is a good guess, as it gives you an idea of how big the effect you’re looking at is likely to be, but it is still an educated guess…. Alternatively, specify a minimum effect size that would be “interesting”….
If you’ve run the experiment and got a significant result, then a formal power calculation based on the effect size for that significant result tells you very little, if anything, beyond what the p value already told you. This is the infamous concept of “post-hoc power”.
To do a power calculation you need some kind of effect size (a sketch of such a calculation follows below).
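A hedged sketch of a formal power analysis in Python with statsmodels rather than SPSS/G*Power; the effect size of 0.5 is an assumed input, not a value from the lecture:

```python
from statsmodels.stats.power import TTestPower

# Within-subjects power works on the difference scores, i.e. a one-sample t-test.
analysis = TTestPower()
n_needed = analysis.solve_power(effect_size=0.5,        # assumed d (from prior research or "minimum interesting")
                                alpha=0.05,
                                power=0.80,
                                alternative="two-sided")
print(f"participants needed for 80% power at d = 0.5: {n_needed:.1f}")  # ~34 in this hypothetical case
```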
what is an assumption
An assumption is a condition that ensures that what you’re attempting to do works.
For example, when we assess a model using a test statistic, we have usually made some assumptions, and if these assumptions are true then we know that we can take the test statistic (and associated p-value) at face value and interpret it accordingly. Conversely, if any of the assumptions are not true (usually referred to as a violation), then the test statistic and p-value will be inaccurate and could lead us to the wrong conclusion.
the main assumptions are -
additivity and linearity;
normality of something or other;
homoscedasticity/homogeneity of variance;
independence
normally distributed something or other
It relates to the normal distribution, but the data don’t need to be normally distributed. It relates in different ways to things we want to do when fitting models and assessing them:
- parameter estimates - The mean is a parameter, and extreme scores can bias it. This illustrates that estimates of parameters are affected by non-normal distributions (such as those with outliers). Parameter estimates differ in how much they are biased in a non-normal distribution: the median, for example, is less biased by skewed distributions than the mean. We’ve also seen that any model we fit will include some error: it won’t predict the outcome variable perfectly for every case. Therefore, for each case there is an error term (the deviance or residual). If these residuals are normally distributed in the population, then using the method of least squares to estimate the parameters will produce better estimates than other methods.
For the estimates of the parameters that define a model to be optimal (to have the least possible error given the data) the residuals in the population must be normally distributed. This is true mainly if we use the method of least squares which we often do
confidence intervals - We use values of the standard normal distribution to compute the confidence interval around a parameter estimate. Using values of the standard normal distribution makes sense only if the parameter estimate comes from one.
For confidence intervals around a parameter estimate to be accurate, that estimate must have a normal sampling distribution.
null hypothesis significance testing -
If we want to test a hypothesis about a model (and, therefore, the parameter estimates within it) then we assume that the parameter estimates have a normal distribution. We assume this because the test statistics that we use (which we will learn about in due course) have distributions related to the normal distribution (such as the t-, F- and chi-square distributions), so if our parameter estimate is normally distributed then these test statistics and p-values will be accurate.
For significance tests of models (and the parameter estimates that define them) to be accurate, the sampling distribution of what’s being tested must be normal. For example, if testing whether two means are different, the data do not need to be normally distributed, but the sampling distribution of means (or differences between means) does. Similarly, if looking at relationships between variables, the significance tests of the parameter estimates that define those relationships will be accurate only when the sampling distribution of the estimate is normal.
when does the assumption of normality matter?
If all you want to do is estimate the parameters of your model, then normality matters mainly in deciding how best to estimate them. If you want to construct confidence intervals around those parameters, or compute significance tests relating to those parameters, then the assumption of normality matters in small samples, but because of the central limit theorem we don’t really need to worry about this assumption in larger samples. In practical terms, provided your sample is large, outliers are a more pressing concern than normality. Although we tend to think of outliers as isolated very extreme cases, you can have outliers that are less extreme but are not isolated cases. These outliers can dramatically reduce the power of significance tests.