Test 2 Flashcards

1
Q

Why do we use statistics?

A

We use statistics as a check on our own biases and to help us better answer our research question (RQ): to understand the shape of the data and to validate our intuitions about patterns within it.

2
Q
Do music lessons make kids smarter?
Causal claim (Mozart Effect)
A

Schellenberg (2004)
• Method:
o Over 36 weeks, four groups of 6-year-olds had lessons added to their coursework. Children were taught keyboard, voice, or drama, or received no lessons, by qualified instructors.
• Why use three different treatment groups?
o No lessons is a control group, but having three other experimental conditions helps us gain a better understanding of what it is about music lessons that causes this effect on IQ.
o Comparing drama to music lessons helps us understand whether it is something specific about music or just being creative.
o Comparing the two music groups tells us whether it is music in general or a specific type of music.
• Matching:
o He matched the four groups on extraneous variables: age, family income (SES), and IQ before lessons. This is essentially a pre-test/post-test design, which allows us to compare IQ before and after the experimental manipulation (lessons).
• Results:
o IQ gains were greater for the music lessons (keyboard and voice) relative to the control and drama lessons. This illustrates that it is something about music that increases IQ, not just creative classes.
o How do we know if these effects are meaningful? We need to use statistics to find out whether these between-group differences are statistically significant and not due to sampling error.

3
Q

What do we need random assignment for?

A

> to create equivalent groups
> to meet the assumptions of t-tests and ANOVA
> to rule out confounds (a condition of causation)

4
Q

If we are comparing three group means, why not run t-tests?

A

We would need multiple t-tests, and the more tests we run, the more we inflate our false positive rate.

5
Q

Analysis of Variance (ANOVA)

A

• F statistic = between-groups variance (how the groups differ from each other) / within-group variance (how people differ from others in their own group).
• It compares the effect due to the IV to the variance that naturally occurs in your population.
• If the null hypothesis is not true, the sample distributions for the groups should not overlap much; the means should be very different, giving us a big F statistic.
• We want the between-group variance to be high and the within-group differences (noise) to be small.

6
Q

How do we calculate variance (s²)?

A

• Null hypothesis: all kids are drawn from the same population (i.e., all 36 participants come from the same population and the IV has no effect on the groups).
• Variance: calculate each participant's distance from the mean. Since some distances will be + and some −, they cannot simply be added together because they would sum to zero. Instead, we square each distance (removing the − sign), add the squared distances together, and divide by n − 1. A bigger number = more spread around the mean; a small variance indicates that scores fall close to the mean.
• Total sums of squares (SStotal): the sum of these squared distances from the grand mean (see the formula sketch below).
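As a hedged sketch of these formulas in standard textbook notation (not copied from the lecture slides):

```latex
% Sample variance: squared deviations from the mean, summed, divided by n - 1
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}

% Total sums of squares: the same squared deviations from the grand mean, before dividing by df
SS_{total} = \sum_{i=1}^{n} (x_i - \bar{x}_{grand})^2
```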

7
Q

SSbetween: Total Sums of Squares Between Groups Variance

SSwithin groups: Sums of Squares within

A

SSbetween: Total Sums of Squares Between Groups Variance
• Take the mean for each group and compare it to the overall grand mean.
• Square each group mean's deviation from the grand mean and add these up to see how much the groups differ from the overall mean.
• If the groups are all different from one another, SSbetween will be large. If they are very similar, it will be small and no effect of the IV is present.

SSwithin groups: Sums of Squares within
• We want within-group variance (noise) to be small.
• How does each participant differ from their own group mean? (See the formula sketch below.)
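A hedged sketch of the two formulas in standard notation (the weighting of each group by its sample size n_j is the usual textbook form, not stated explicitly on the card):

```latex
% Between-groups SS: each group mean vs the grand mean, weighted by group size
SS_{between} = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x}_{grand})^2

% Within-groups SS: each participant vs their own group mean
SS_{within} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2
```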

8
Q

Sums of Squares and Mean Squares

A

Sums of Squares and Mean Squares
§ In ANOVA we calculate variance using a technique called sums of squares (SS).
o SStotal = how much each participant varies from the overall mean (squared).
o SSbetween = how much each group varies from the overall mean (squared).
o SSwithin = how much each person varies from their own group mean (squared); the spread of data within one group.
o SStotal = SSb + SSw
§ Mean Squares (MS) are adjusted for n by dividing SS by df:
o dfbetween = #Groups − 1
o dfwithin = #Participants − #Groups
§ MSbetween = SSb/dfb (mean square between groups = sums of squares between divided by degrees of freedom between).
§ MSwithin = SSw/dfw (mean square within groups = sums of squares within divided by degrees of freedom within).
§ F = MSb/MSw (F statistic = mean square between divided by mean square within).
§ WARNING: MSw is also called MSresidual or MSerror.

Mean Squares
• The more people we have in each group, the bigger the sums of squares will be.
• What we want to know is, on average, how far people are from the mean (the mean square).
• Mean Squares (MS) are adjusted for n by dividing SS by df:
o dfbetween = #Groups − 1
o dfwithin = #Participants − #Groups
• MSbetween = SSb/dfb
• MSwithin = SSw/dfw

*There are two degrees of freedom in ANOVA (between = numerator/top; within = denominator/bottom). A worked sketch of the whole calculation follows below.
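A minimal sketch of the full hand calculation, assuming three made-up groups of scores (the numbers and group names are invented for illustration, not Schellenberg's data):

```python
import numpy as np

# Hypothetical scores for three groups (values invented for illustration)
groups = {
    "keyboard": np.array([7, 6, 8, 5, 7, 6]),
    "drama":    np.array([4, 3, 5, 4, 2, 4]),
    "control":  np.array([3, 4, 2, 3, 5, 3]),
}

all_scores = np.concatenate(list(groups.values()))
grand_mean = all_scores.mean()

# SS_between: how much each group mean differs from the grand mean (weighted by n)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())
# SS_within: how much each participant differs from their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())

df_between = len(groups) - 1               # groups - 1
df_within = len(all_scores) - len(groups)  # participants - groups

ms_between = ss_between / df_between
ms_within = ss_within / df_within
F = ms_between / ms_within
print(f"F({df_between}, {df_within}) = {F:.2f}")
```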

9
Q

ANOVA F Statistic Calculation:

F Distribution

A

§ F = MSbetween/MSwithin.
§ We compare the F statistic to the F sampling distribution, which tells us how often the null hypothesis can produce an F that big.
§ All F values are positive; the distribution starts at 0 and sits entirely above it (one-tailed; it peaks just under 1).
§ The bigger the F statistic, the less likely it is to have been produced by the null hypothesis.
§ We want the between-group variance to be bigger than the within-group variance for the F statistic to be big.
§ An F of about 1 tells us that the between-group variance is not much bigger than the within-group variance (no effect of the IV; consistent with the null).
§ Critical region: if the null hypothesis is true, less than 5% of the time will it produce an F statistic of about 3.5 or bigger (for these df).

F Distribution
• F is a sampling distribution of the possible F values if the null hypothesis is true.
• The exact size/shape of the F distribution depends on the degrees of freedom.
• If the groups differ from each other a lot compared to how much people (or animals) differ from others in their condition, you get a large F.
• Reject the null if p < .05.
• F is always positive.
• There is no difference between a one-tailed and a two-tailed F.

Different df for different F distributions
§ df(x, y)
o x = number of groups − 1
o y = total N − number of groups
§ In our example, we have 3 groups of 12 = 36 participants, so the df would be (2, 33). (See the sketch below for reading off the p-value.)
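A minimal sketch of looking up the p-value for an F with these df, assuming scipy and an invented F value of 3.6:

```python
from scipy import stats

# df = (groups - 1, N - groups); the card's example: 3 groups of 12 -> (2, 33)
df_between, df_within = 2, 33
F = 3.6  # hypothetical F statistic, invented for illustration

# p = probability of an F this large or larger if the null hypothesis is true
p = stats.f.sf(F, df_between, df_within)
print(f"F({df_between}, {df_within}) = {F}, p = {p:.3f}")
```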

10
Q

How does Jamovi treat subject variables in a quasi-experimental factorial ANOVA?

A

o A One-Way ANOVA is used to compare three group means. Jamovi doesn't care whether the IV is manipulated or is a subject variable; statistically, all that matters is that you have three groups. In experimental studies, however, we need the IV to be manipulated (not a subject variable).
o ANOVA is not JUST useful for experimentalists; it is used anytime you want to compare three group means. Experimentalists and non-experimentalists both use it, and both use correlation or regression as well.

11
Q

Grand mean

A

The grand mean is useful because it corresponds to the null hypothesis: if the null is true, the 3+ groups are all sampled from the same population and the group means will all be close to the grand mean.

12
Q

Recap about F Ratios

A

o We calculate the F statistic, a ratio: between-group variance / within-group variance (how participants vary around their group mean). We want the between-group variance to be bigger than the within-group variance to get a big F statistic, which makes us more likely to reject the null hypothesis.
o We calculate F (= mean squares between / mean squares within) by first calculating the mean squares: MSbetween = SSbetween/dfbetween and MSwithin = SSwithin/dfwithin. Jamovi does all of this for you. We then use the F statistic to identify its corresponding p-value by comparing it to the F sampling distribution (which depends on the df, i.e., the sample size and number of groups; under the null hypothesis the groups come from the same population and differences are due to sampling error, not the IV; the 5% rejection region is where there is less than a 5% chance of making a false positive, so an effect of the IV is likely to be present).
o Unlike t-tests, where there is one df (N − #groups), ANOVA has two (one for the numerator and one for the denominator; df = x, y: number of groups minus 1, and total N minus number of groups; e.g., (2, 33) for 3 groups of 12 = 36 participants).

13
Q

A significant F tells me that my groups differ, but not how they differ.

A

• All the F tells me is that there is a group mean difference that is statistically significant. It doesn't tell me which groups are different or in what direction. We need to look at our descriptive statistics and each group's mean to find this out!
• Remember:
• I could do t-tests to compare each group to each other group.
o 3 t-tests
o Probability of a false positive for each one = .05
o Probability of a false positive in at least one of them ≈ .15
o i.e., roughly .05 × the number of tests you run (see the worked calculation below)
• Therefore, we use post hoc tests, which adjust for the number of tests you run.
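A worked version of that ~.15 figure, assuming the three tests are independent (a simplifying assumption; the exact familywise rate is slightly below 3 × .05):

```latex
P(\text{at least one false positive in 3 tests}) = 1 - (1 - .05)^3 \approx .14
```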

14
Q

Post-hoc tests

A

Post-hoc tests
• Comparisons of pairs of means after finding a significant F.
• Used when I have no hypothesis about how the means might differ from each other (2-tailed; no hypothesis about the direction of the mean group difference[s]).
o Post-hocs are like t-tests comparing 2 means, but they have been adjusted to correct for the increased chance of Type 1 error.
o They penalise you for running multiple tests by being stricter on the significance value, so that after all of them are done the collective false positive risk adds up to .05 (generally, .05 divided by the number of tests you run).
• Note: we can do contrasts instead of post hoc tests if we have a prediction about the direction of the mean group difference (one-tailed).
• Post-hoc on the IV (class):
o No correction = t-test comparisons without corrections.
o Tukey (a general correction that fits most situations; not too strict or too loose; see the hedged example below).
o Tick effect size (how big is the mean group difference? This cannot be answered with the p-value! Since each comparison is now a two-group mean difference test, the effect size we use is Cohen's d).

• The df does not change (unlike a t-test, where the df would be 22 = 12 × 2 − 2; in the ANOVA it stays 33 = 12 per group × 3 − 3). Ptukey tells us the significance of each group difference and Cohen's d tells us how big that difference is.
• These Cohen's d values are big effect sizes. We identify the direction of group differences by looking at the data (which group mean is higher?), not by the p-value or the +/− sign of Cohen's d!

• Only do post-hocs if the interaction is significant.
• Check the specific means you want to compare (use the graph to help you decide).
• No corrections are necessary as long as your comparisons are relevant to your hypothesis.
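A minimal Tukey post-hoc sketch, assuming statsmodels and the same invented scores used in the ANOVA sketch above (not Jamovi output):

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical scores and group labels (values invented for illustration)
scores = np.array([7, 6, 8, 5, 7, 6, 4, 3, 5, 4, 2, 4, 3, 4, 2, 3, 5, 3])
labels = ["keyboard"] * 6 + ["drama"] * 6 + ["control"] * 6

# Tukey HSD compares every pair of group means, correcting p for multiple tests
result = pairwise_tukeyhsd(scores, labels, alpha=0.05)
print(result.summary())
```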

15
Q

Planned Comparisons/Contrasts (instead of post hocs!)

A

• Sometimes, a certain comparison is
critical to testing my hypothesis.
• Just do it! (just don’t do too many of
them, and make sure they are justified by
the hypothesis).

16
Q

Factorial Designs

A

2 or more independent variables
• Each variable can be manipulated within- or
between-subjects
• Each variable can have 2 or more levels
(that’s what makes them categorical!)
• Some variables can be subject variables (quasi-experiment; e.g., male or female jurors; a good way to test the generalisability of the findings to other groups)

17
Q

Why add variables?

A

Why add variables?
i.e., factorial designs
1. It is efficient: assess 2 or more causes at once.
2. To refine a theory (because it depends… e.g., in the Stroop effect).
3. To isolate a particular process of interest.
4. To assess change over time (e.g., in a pre-test/post-test design; mindfulness vs Pilates).
5. To increase external validity (extend to other populations, stimuli, situations… e.g., the subject variables in the negotiation study).

Note: interactions and moderation are the same thing. Moderation analyses look at continuous predictors, whereas experimental ANOVA uses categorical predictors (levels of the IV).

18
Q

Hypothesis in Factorials

A

can be:
> interaction only (i.e., no main effect, but the effect of one IV is dependent on the level of the other IV)
> main effects & interaction

*We always want to know about an interaction.

19
Q

Variables plotted on graph

A

The DV always goes on the y-axis but either IV can go on the x-axis or the key.

We decide what IV goes on the X-axis by referring back to our research question to see what makes the most sense.

20
Q

Main effects are ___s and interactions are ___s of _____s

A

• Main effects are the averages.
• To look at the main effect of drug, we average both therapy groups within the placebo condition and average both therapy groups within the Prozac condition. Is one average higher than the other? Yes, Prozac appears to work better than placebo.
• To look at the main effect of therapy, we average the improvement in both CBT groups and compare it to the average improvement in both waitlist groups. Is one average higher than the other? Yes, the CBT groups improved more than the waitlist control groups.
• Interactions are differences between differences.
• We calculate the difference in means between the two groups at each level of one IV, and then calculate the difference between those differences (the difference of differences computed across Prozac vs placebo, and the difference of differences computed across CBT vs waitlist, equal the same value either way you do it; this tells us the magnitude of the interaction; see the worked example below).
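A minimal worked example of "difference of differences", using invented cell means for the drug × therapy design (the numbers are hypothetical):

```python
# Hypothetical cell means for improvement (values invented for illustration)
means = {("CBT", "Prozac"): 12, ("CBT", "placebo"): 6,
         ("waitlist", "Prozac"): 5, ("waitlist", "placebo"): 4}

# Drug effect at each level of therapy
drug_effect_cbt = means[("CBT", "Prozac")] - means[("CBT", "placebo")]             # 6
drug_effect_wait = means[("waitlist", "Prozac")] - means[("waitlist", "placebo")]  # 1

# Interaction = difference between the differences
# (computing it the other way, therapy effect under Prozac minus under placebo, gives the same value)
interaction = drug_effect_cbt - drug_effect_wait
print(interaction)  # 5
```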

21
Q

Difference between Oneway ANOVA and ANOVA?

A

• One-way ANOVA: 1 IV with three or more levels.
• Factorial ANOVA: 2 or more IVs, each with 2 or more levels.

22
Q

ANOVA with 3x groups have…

A

2× df and 3 F statistics.

23
Q

If a significant main effect has more than 2 levels, you need to do a…

A

post-hoc test to determine where the differences are (just like oneway ANOVA).

24
Q

What are the advantages of combining two IVs in the same study?

A

• Using the same data, running an ANOVA with 2 IVs rather than 1 gives different p-values and F ratios!
• "Drug" has a larger effect size and a lower p-value when Therapy is included in the model!
• Why?
o Look at the residuals.
o The residuals (denominator; unexplained variance) are larger when only one IV is included, which reduces the size of the F statistic, makes the p-value larger, and makes partial eta squared smaller.
o Some of that variability can be explained by the second IV if it is included in the model (it is moved from the denominator to the numerator).
o This only works if the added IV is related to the DV (it then increases power); if not, it will only make things worse.

• Adding (useful) factors decreases the residuals, increasing F and decreasing p. (A hedged two-way ANOVA sketch follows below.)
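A minimal sketch of the comparison, assuming statsmodels and invented improvement scores (not the lecture's data); the point is that the residual sum of squares shrinks once the second factor is modelled:

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Hypothetical improvement scores (values invented for illustration)
df = pd.DataFrame({
    "improvement": [12, 10, 11, 6, 7, 5, 5, 6, 4, 4, 3, 5],
    "drug":    ["prozac"] * 3 + ["placebo"] * 3 + ["prozac"] * 3 + ["placebo"] * 3,
    "therapy": ["cbt"] * 6 + ["waitlist"] * 6,
})

# One-IV model: therapy's variance stays in the residual (error) term
print(anova_lm(ols("improvement ~ drug", data=df).fit(), typ=2))

# Two-IV model: therapy and the interaction pull variance out of the residuals,
# which shrinks MS_error and can increase F and lower p for the drug effect
print(anova_lm(ols("improvement ~ drug * therapy", data=df).fit(), typ=2))
```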
25
Interaction Write Up: | & Main Effects
Describing a significant interaction:
1. Split one of the IVs into levels.
2. Compare the effect of the other IV at each level.
3. Explain how the differences are different.
• For participants in CBT, the drug produced improvement over placebo. However, in those on the waitlist, the drug was no more effective than placebo.
• For participants taking the drug, CBT was more effective than staying on the waitlist. For participants taking the placebo, CBT was no more effective than the waitlist.
• The effectiveness of the drug depended on therapy.
• The effectiveness of therapy depended on the drug.
What about the main effects?
• The main effect of drug isn't meaningful.
• The main effect of therapy isn't meaningful.
• These main effects are qualified by the interaction.
26
ANOVA with within-subject variables
In an ANOVA with within-subjects variables (both variables manipulated within-subjects), the only difference is the statistics we use (a repeated measures ANOVA; we do not care about overall differences between people, just differences within individuals between conditions). A 2 × 2 factorial (within-subjects) design uses a repeated measures ANOVA (we use it anytime there is a within-subjects variable; the only difference in naming is whether you call it a mixed or a within-subjects factorial). Jamovi doesn't have the theoretical variables in it; it only has the operationalised levels, so we need to tell Jamovi what we are measuring.
27
Assumption of sphericity only matters ... Homogeneity of variance ...
The assumption of sphericity only matters in within-subjects designs with three or more levels (with two levels and two means you only have one set of difference scores, so sphericity cannot be violated; it only becomes an issue when you are comparing 3+ pairs of means). Homogeneity of variance applies when the variance (SDs) within groups differs; unlike sphericity, this can be violated even in a design with only two levels.
28
Small-N designs Establishing causality when we do not have group means to compare When do we use small-N designs?
When do we use small-N designs?
• To establish a causal effect of the IV on the DV within a small number of participants:
- The research question concerns a very small sample (not enough people to sample from).
- Situations where we cannot recruit a sufficiently powered sample (too small for inferential statistics).
- When we expect substantial variability in individual responses (group means aren't useful for highly variable responses; the mean doesn't end up describing most participants!).
• A small-N design establishes causal relationships by replicating the effect of the IV on the DV (to demonstrate consistency):
- Consistent change in the DV as the IV is manipulated, with little variability (in level or trend).
- Direct replication of the IV's effect within the participant (exact same participants, conditions and context).
- Systematic replication of the IV's effect across participants or contexts (different participants, contexts or conditions).
• Control over other variables is achieved by:
- Establishing a baseline for the behaviour without intervention (this acts as a control condition).
- Collecting multiple observations until we see consistency in behaviour (more confidence in the IV's effect on the DV; helps establish that the DV is under the control of the IV and supports claims of causality!).
- Replicating the change in the DV with the introduction of the intervention (comparing the change in the DV from baseline to intervention).
• Is one AB relationship enough to demonstrate this? What is the problem with inferring causality from a single phase change?
• An AB design does not rule out history effects: extraneous variables that happen at the same time as the IV may have caused the change in behaviour and provide an alternative explanation for the results.
• Solution = reversal design.
29
Reversal Designs
• The phase change between A and B happens more than once (not a single-phase design!).
• ABA (baseline-intervention-baseline; the most common reversal design; does behaviour go back to baseline after the intervention is removed, i.e., is it under the control of the IV's presence or absence?) or ABAB (baseline-intervention-baseline-intervention).
• Meets the final replication criterion to demonstrate control by the IV (behaviour changing at each phase change supports the causal claim that the DV is under the control of the IV).
Example reversal design (Bicard et al., 2012):
• College athletes at risk of academic failure. The researchers propose that one causal factor for poor academic performance is being late to class.
• The intervention aims to get students to attend class on time.
• Minutes late to class (DV), weeks into trimester (timeline), texting (intervention; students have to text their student counsellor when they are coming to class), baseline (no texting).
• At baseline, they arrived 30+ minutes late to class.
• The intervention caused their behaviour to get better.
• The removal of the intervention caused the behaviour to go back to baseline.
• Re-introduction of the intervention improved behaviour again.
• An ABAB design is good for:
o Behaviours that can be unlearnt.
o It is not ethical to use ABAB designs when the behaviour is harmful; you cannot remove an intervention that is working (you can use methods other than reversals to prove causality; you still need to show cause and effect).
*Intervention behaviour differs from baseline; returning to baseline reverses the effect.
30
Multiple Baseline Designs
• A single AB phase change that is replicated across participants, behaviours or contexts (at least twice).
• Target outcomes must be independent (observations in one individual are independent from observations in another).
• Useful for effects that can't be reversed:
- The behaviour can't be un-learnt.
- It is not ethical to withdraw treatment.
Example: multiple baselines (same behaviour and context, but different participants)
• Notice in this graph:
- Multiple students are being tested with the same intervention at different times (the intervention doesn't start at the same time for every participant).
- Why do they start at different times? Staggered baseline switches are a common feature: the goal is to rule out history effects, where extraneous variables impact the DV at the same time as the IV and could provide alternative explanations for the change in the DV.
Example: multiple baseline design (same context and participants, but different behaviours)
- Pairs of students enrolled in a mathematics course. The experimenters want to test the effects of a home-based peer tutoring intervention on test scores (three different types of mathematics equations).
- Participants had a history of maltreatment, which is associated with poorer maths and literacy skills.
- Baseline (test scores without intervention; rewarded with money): no answers correct.
- Intervention (one member of each pair was taught how to solve the equations; the participants then taught the other member and test scores were calculated): scores got better.
- If children did not reach mastery with the intervention alone, they were provided with additional interventions.
- A staggered baseline is present (specific to multiple behaviours with the same participants) for: ruling out history effects, and giving us a chance to see whether the effect of the intervention for one maths equation (DV) influences the others. Causal claims are impaired if the intervention affects subsequent behaviours.
*The intervention effect is replicated across behaviours.
- What if I'm not happy with this intervention because it doesn't work very well and I want to compare the effectiveness of two different treatments, or I'm not sure which component of the intervention is causing the effect (i.e., I want to separate things out)? Use an alternating treatments design with the same participants.
31
Alternating Treatment Designs
• At least 2 interventions are tested, usually one at a time, and frequently alternated from session to session.
• Combinations of interventions can be used to test for interactions (e.g., A-B-C-B-C: baseline, intervention 1, interventions 1 + 2, then alternating between the two).
Example: alternating treatments design
- Kallie is 6, autistic, and has a high rate of PICA (compulsive consumption of inedible objects, e.g., Christmas decorations, or destroying them).
- The intervention(s) aim to reduce the frequency of this harmful behaviour.
- Baseline:
- Functional analysis context: baited PICA items on the table; safe to eat, but they look like the unsafe items she likes to eat.
- Holiday decoration context: attempts to eat or destroy holiday decorations were blocked by the experimenter.
- High (clinically significant) levels of PICA behaviour, warranting an intervention.
- B (DRA): differential reinforcement of alternative behaviour (block PICA attempts, and reward non-harmful interactions with the toys/objects with edible food items she likes).
- Not punishing bad behaviour, but rewarding good behaviour to reduce the frequency of the bad behaviour.
- We saw reductions in the target behaviour but not sufficient clinical significance (one PICA attempt per minute is still too high).
- C (DRA and facial screen): reinforcement of good behaviour, plus anytime she tries to eat an inedible object the experimenter places a hand over her hands and eyes for 30 seconds. This is a form of stimulus avoidance (not restraint); her attention is then redirected to objects that are safe to play with.
- Strong behaviour reduction, with instances of 0 PICA behaviours (clinically significant).
- At no point do we return to baseline: the phases run intervention, both interventions, intervention, both interventions.
- We can see that in the second DRA phase the behaviour gets worse, which illustrates that the intervention on its own doesn't have a lasting effect and performs better in combination with the facial screen (an interaction).
- Reversal between intervention phases (BCBC).
*Behaviour changes as the treatment is alternated (Mitter et al., 2015).
32
Changing Criterion Designs
- Used for behaviours where we expect gradual rather than immediate change (more resistant behaviours, such as smoking cessation).
• The level of required behaviour is varied across trials (the level of the intervention or its criterion changes; it gets harder, and we measure changes in behaviour in response to the level changes).
• A change in the criterion for the target behaviour is based on successfully maintaining the behaviour at the previous criterion.
• A causal relationship is demonstrated by successive replication of a change in behaviour with changes to the IV.
Example:
• A university pole vaulter.
• The student is not lifting their arms as high as they should be able to (technique) to clear the bar, so an intervention is needed.
• The current height they raise their hands to when vaulting is a long-term behaviour that is now a habit and harder to change (more resistant to change).
• Baseline (pole vault attempts without intervention): the typical height they keep their hands at and the height they can clear.
- Intervention: shout "reach" as they are about to jump, using a feedback pole so that if they raise their arms high enough they get instant auditory feedback.
• Levels: the baseline was 225, so the first criterion was to learn to reach 230, then 235, then 240, then 245, etc.
- They got better over time, and had to reach the stability criterion (consistency of correct arm extensions) before the criterion level was changed.
- At 252 they were not able to reach the stability criterion and the study was terminated.
• Tips for a good changing criterion study:
- Increases in the criterion need to be small enough to be reasonably met but big enough that you can see when behaviour changes (small changes are not good for highly variable behaviour!); criterion changes should not happen at predictable times (random or staggered, to rule out history or maturation effects); returning to baseline or to a previous criterion level supports causal claims about the IV-DV relationship.
*Successive demonstration of a change in behaviour with changes to the IV (level).
*We are more concerned with clinical significance than statistical significance (healthy behaviour levels matter more than having the best experiment).
33
Validity of small-N designs
1. Construct validity
• How well do the operational variables map onto the theoretical variables?
2. Internal validity
• Are there other possible explanations for the findings?
• Are there confounds?
• Is it the best design to address the question?
3. External validity
• Can the conclusions generalise to other people, other stimuli, other contexts?
4. Statistical validity
• How big is the effect?
• Does the study have sufficient power?
• Is the data treated appropriately?
• Are the statistical conclusions (e.g., significance) justified?
Sufficient power
• Does the study have a big enough sample size to identify the effect size we are interested in?
• Small sample sizes can only detect big effects.
• The bigger the sample size, the smaller the effect size you can identify.
34
Threats to Good Replicable Science:
- One problem is study design: low statistical power means small effects cannot be detected, producing unreplicable findings.
- Failure to control for bias.
- P-hacking (repeating statistical tests until one is significant).
- HARKing (hypothesising after the results are known).
- Publication bias (only publishing significant results that support the experimental hypothesis).
35
Step 1: Do good Science
1. Make clear predictions based on your hypothesis.
2. Ensure you have enough statistical power (the sample size is sufficient for the effect size you are interested in; big effects need smaller samples, but smaller effects need bigger samples).
3. Set a stopping rule.
4. Reduce flexibility in data analysis (predetermine the DVs, exclusion criteria, subgroups and covariate analyses).
5. Adjust for multiple comparisons when appropriate.
6. Upgrade statistical skills and understanding.
*How does this apply to small-N designs, which do not have samples of people? In small-N designs it could be argued that we exploit optional stopping: the researcher can extend the number of trials they run until they reach a consistent/stable pattern of behaviour. It is also a flexible approach in how many trials we do and at what point we change phases (baseline-intervention); these are dynamic changes we can make in small-N designs. Does this mean small-N designs are not doing good science?
36
Replication allows us to determine ____ in a small N design but... Optional Stopping Rules Inductive and Deductive Reasoning
- Replication allows us to determine causality in a small-N design, but one phase change is not sufficient to conclude that the IV has control over the DV: there are too many confounds.
- Unlike group-mean designs, small-N designs treat each participant as a unit of replication (this is an advantage of small-N designs: within one study with multiple participants you are replicating an effect; two participants = two replications). This counters the critique that small-N designs do not care about replication: they do, it is just built into the design itself.
- Is it okay to do research without a stopping rule? People who stop responding or do not complete all trials have their data removed for being incomplete. Is this a problem? To answer this we look at the inductive and deductive models of reasoning.
- Group-based studies use a deductive approach: theory, then hypothesis, then data collection.
- Small-N designs are inductive in nature: observe behaviour, find patterns, and use theory to explain the patterns found (being dynamic is necessary for clinical work, where the goal is to reduce negative behaviour and replace it with positive behaviours; a null effect cannot be found because we keep the study going until we see the behaviour change we want). Small-N designs can be deductive when they test the effectiveness of an intervention (testing a theory-informed intervention on a child and taking observations to see if it works).
- Most small-N designs are a mixture of inductive and deductive reasoning: a mixture of theory-driven and exploratory work. This is true of all psychology research (including group designs), where there is no clear distinction between inductive and deductive work.
37
example Small N design experiment abstract write up
1. Set up a goal with clinical significance: there is little research on PICA interventions.
2. Link the goal to previous literature.
3. Identify the purpose of the current study: identifying the function of someone's behaviour (functional analysis is an inductive goal! Observe the person's behaviour in different contexts to find out what is causing it, and keep the study going until it is fixed: flexible, dynamic, and no problem with publication bias because there are no null findings and no hypothesis/prediction being tested). But there is a deductive part as well (evaluating a treatment informed by research: theory-driven intervention selection, using data to see whether it is effective and supports our hypothesis; this can have null effects and publication bias). The deductive-reasoning issues are the same ones we identified for group-level analysis.
4. This example is a mixture of inductive and deductive approaches. The small-N design lets the researchers pursue their primary goal of identifying the function of the girl's behaviour and keeping it under control, while being flexible enough to test a secondary deductive hypothesis: which theory-driven treatment is most effective. This is not a clear distinction to find in studies!
38
Step 2: Transparency in Reporting
1. Report all measures, variables and conditions.
2. Clearly distinguish confirmatory (planned) from exploratory (unplanned) analyses.
3. Clearly document hypotheses, predictions, design decisions and procedures.
4. Share data in a public repository (if ethically possible).
5. Share analysis code and results.
6. Share research materials.
7. Improve journal standards.
*Once we have identified the best design for our RQ, we should declare as much as possible. In deductive studies we are committing to our design decisions, and there is no flexibility to change them mid-study if things are not panning out as expected. An inductive design is much more dynamic and changes are expected during the study, but we still need to declare the logic or reasoning we will use to make those changes (i.e., what will I use to decide when to change phases?).
- If our hypothesis is conceptualised at the individual level, then it should be tested at the individual level with a small-N design. If it is about groups of people performing the same way on average, then it should be tested with a group design. The design determines which statistical validity questions you need to ask.
- For example, one exploratory aim was that people who were distracted were expected to take longer to respond: not theory-driven, just an intuitive expectation with no specific hypothesis.
39
External Validity in Small N designs:
External validity in small-N designs:
• Each participant is a replication unit, which supports the external validity of the IV-DV effect to other people, contexts or behaviours (a specific advantage of small-N over group designs, which compare averages/means and ignore individual differences).
Example: concern about replication of an IV-DV effect (visitor contact impacts tuatara welfare; animal subjects). Does replicating the effect in 3 subjects mean it will generalise to other tuatara?
• Do tuatara get distressed by visitor handling?
• There is enough in the literature to justify studying this, because some animals are distressed when handled and others are not.
• The analysis can be made at the species level (group differences) or at the individual level (differences between individual members of a species, where a small-N design is appropriate).
• A small-N design meets animal ethics requirements (the 3 Rs: reduce, refine and replace; use the minimum number of animals you need to answer the question).
Results (van Heerbeek et al., 2021):
• Baseline: no handling; count and record target behaviours.
• Intervention: handling (held in the hand for 30 minutes and touched by visitors); count and record target behaviours.
• High variability in baseline visibility; when distressed they burrow (and cannot be seen). Do they burrow more when distressed?
• Visitation days are set and cannot be changed, which is a limitation because we cannot rule out extraneous variables that coincide with those days (there is no staggering of the baseline start!), though it does allow us to rule out some environmental factors.
• The dark line shows that tuatara burrowed more (were less visible) on visitor days when contact was high, which indicates the handling distressed the animals.
• External validity is high in this example because the tuatara and the ecological context it is meant to apply to are included in the study. We can be confident the behaviour will generalise outside of the study and be repeated in the real world.
• Small-N designs generally have high external validity! The subjects being studied are the ones the findings are meant to apply to. It is not clear whether the results generalise to others, but they definitely apply to the individuals in the study.
• Side note: a single study does not lead to changes in policy or interventions. Multiple studies providing converging evidence are needed!
40
Statistical Power | In general:
In general:
- The more power you have, the better.
- The goal: design a study to have the most power that you can.
41
We aim to collect as much data as we can to decide about the truth, but there is always room for error: what 2 errors can we make?
• True positive: reject the null and the null is false.
• True negative: fail to reject the null and the null is true.
• False positive (alpha):
- Type 1 error.
- Our tolerance for being wrong, which we arbitrarily set at .05, i.e., 5% of the time we reject the null when the null is true.
- In other words, we conclude that the between-group difference is due to the effect of the IV when it is actually due to sampling error.
• False negative (beta):
- Type 2 error.
- Failing to reject the null hypothesis when it is actually false.
- In other words, having insufficient evidence to reject the null hypothesis when an effect of the IV is present.
• These two errors are independent of one another.
42
Power = 1 – β
For example:
- If beta (the false negative rate) is set at .2 (20%), then your power is .8 (80%). 80% power means you have an 80% probability of rejecting the null hypothesis IF it is actually false; 90% power means a 90% probability of rejecting the null hypothesis IF it is actually false.
- Calculating power before you start the study is important for understanding the probability of finding a significant difference in the study.
- In other words, with 80% power, if we ran the study ten times we would expect to find a significant difference 8/10 times.
- Before the replication crisis, studies typically had only around 40% power, which is a waste of time and resources. We need to design better studies with more statistical power = better science.
43
What determines Beta? How do we get more power?
Cohen's d = difference between means / SD.
- The bigger the difference between the groups (the numerator), the bigger the effect size.
- Cohen's d uses the SD, NOT the SE, because the SD is not affected by sample size (n).
- The bigger the SD (the denominator), the smaller the effect size.
44
Cohen’s d & Distribution overlap: | *Measures the difference in distribution overlap between two groups
- Calculated as (M1 − M2)/SD. For example, 115 − 100 = 15, and 15/10 = 1.5, which is a big effect size.
- With the same mean difference but a larger SD, 15/60 = 0.25 (more overlap between the distributions and a smaller effect size).
*Two things determine effect sizes: the mean group difference and the SD (variability). (A hedged pooled-SD sketch follows below.)
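A minimal sketch of the calculation with a pooled SD (one common textbook form; the scores below are invented for illustration):

```python
import numpy as np

def cohens_d(group1, group2):
    """Cohen's d using a pooled standard deviation (a common textbook form)."""
    n1, n2 = len(group1), len(group2)
    pooled_sd = np.sqrt(((n1 - 1) * np.var(group1, ddof=1) +
                         (n2 - 1) * np.var(group2, ddof=1)) / (n1 + n2 - 2))
    return (np.mean(group1) - np.mean(group2)) / pooled_sd

# Hypothetical scores for two groups (values invented for illustration)
print(cohens_d([115, 118, 112, 120, 116], [100, 104, 98, 102, 101]))
```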
45
How to get more power: | *To get a bigger t and smaller p
Design a study to optimise power:
*This applies to any experimental or quasi-experimental design with 2+ groups.
(A) Increase the difference between groups
(B) Decrease SE
a. Decrease SD
b. Increase N
*Because SE = SD / √N: a smaller SD (numerator) and a bigger N (denominator) make the SE smaller, which makes the test statistic bigger.
(A) Increase the difference between groups/conditions
• Increase the strength of the manipulation (IV):
o e.g., more sessions of CBT
o e.g., a higher dosage of the drug
o You wouldn't want a very strong manipulation all the time, e.g., with negative-valence stimuli (which can be unethical) or in a within-subjects design (which would introduce demand characteristics).
• Sample from extreme groups/ends of the distribution:
o This has its own limitations: regression to the mean, and not knowing what happens on average.
*Always aim for the strongest manipulation you can, so that if no effect is found we can be confident the effect is truly absent rather than undetected due to insufficient power.
(B) Decrease SD
• Standardise measurement:
o High consistency, reliability and validity of the measure reduces variability.
• Homogeneous sample:
o Less variability for the IV to compete with, but this impairs external validity (generalising to other groups); if it doesn't work in a homogeneous sample, it is not likely to work in a heterogeneous sample.
• Matched pairs:
o Match people on important extraneous variables.
• Within-subjects designs:
o I don't care that people vary from one another.
(C) Increase N
• Incentives/rewards for participation.
• Online data collection rather than in person.
• Collaborative studies (the many-labs method: multiple labs run the same study and share data).
*Harder to do because of the cost in time and resources.
46
What about alpha?
• We set the alpha (significance level) at .05. If the p-value is less than .05 we have sufficient evidence to reject the null hypothesis, accepting that there is a 5% chance we are making a false positive error: that is, we reject the null hypothesis and conclude the IV caused the effect on the DV when it is actually due to sampling error.
• A one-tailed test has more power than a two-tailed test. In a two-tailed test we predict an effect will be present but we don't know in which direction, so the 5% is split between both tails and a larger t (+/−) is required to produce a significant p-value.
• In contrast, a one-tailed test is when we make a prediction about the direction of the effect and get to pool the 5% in one tail, which means we need a smaller t statistic for the p-value to be significant. One caveat is that if you predict the wrong direction, the result will be non-significant even if it would have been significant in the other direction.
47
Calculating Power *Use statistical software (G*Power) to calculate it for you
• Power is a function of N, the mean group difference and the SD.
• The software will ask:
o One tail or two?
o The predicted effect size
o Alpha
o How much power you want (.90 is best)
o Equal numbers of participants in each group?
• It then tells you how many participants you need to meet these criteria (decide before sampling; this acts as a stopping rule). (A hedged software sketch follows below.)
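The card names G*Power; as an illustrative alternative, a minimal sketch of the same kind of calculation with statsmodels, assuming a medium effect (d = .5), alpha = .05, 90% power and a two-tailed independent-groups t-test:

```python
from statsmodels.stats.power import TTestIndPower

# Hypothetical inputs (effect size, alpha and power chosen for illustration)
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.90,
                                   alternative="two-sided")
print(round(n_per_group))  # participants needed in each group
```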
48
Smallest Effect Size of Interest (SESOI)
• Detecting small effects requires precise measurement and a large N, but both are costly design elements.
- Solution: consider the theoretical and practical significance of finding an effect to justify the costs of detecting smaller effects.
- Use the literature to see what effect sizes others have found.
- Start with a medium effect size (d = .5) as a base.
- What is the smallest effect size you care about? Detecting small effect sizes is costly (time, money and effort).
- Theoretically, some fields may want evidence of an effect no matter the cost (in physics a tiny effect can really influence theory), but in psychology this is not always the case.
- In psychology, practical significance matters more: what effect size would be important enough to inform treatment design, or to improve daily functioning?
*Small effect sizes may be theoretically interesting but not practically significant.
49
Interpreting null effects
1. Your experiment didn’t fail! (significant effects are not the goal of research, we are aiming to find the truth!) 2. Do you have sufficient power? (not large enough sample size to have sufficient evidence to reject the null-hypothesis) 3. Did you effectively manipulate your IV? (construct validity; did the IV manipulation measure what we intended it to? Did we include a manipulation check?) 4. Do you replicate your null effects? (answers not found in a single study; same effect found in another study supports that its is a null effect and not issues with experimental design) 5. Did you preregister your hypotheses? (faith in null effect; because it requires you to do a power analysis).
50
Can we ever show that the null is true?
• Not with NHST. The logic of null hypothesis significance testing is that we assume the shape of the sampling distribution when the null hypothesis is true and then look for evidence that would allow us to reject the null. We have two options: reject the null hypothesis or fail to reject it. When we fail to reject the null, we are saying we have insufficient evidence to reject it; it does not mean we accept it. NHST is not designed to answer this question.
• Demonstrate high power.
• Replicate the null.
• Use Bayesian statistics (an alternative to NHST that weighs the evidence for the null against the evidence for the experimental hypothesis; it doesn't rely on the reject/fail-to-reject logic of NHST, so it can tell you there is more evidence for the null than for the experimental hypothesis, which would allow a claim that the null hypothesis is true, unlike NHST).
*Absence of evidence does not equal evidence of absence! Having insufficient evidence to reject the null hypothesis does not mean it is true. I don't have evidence that the groups are equivalent; I just don't have sufficient evidence to show that they are different.
51
Quasi-Experimental Designs *Designs that, like experiments, involve manipulation of the IV but do not have the same level of control as a true experiment. Three examples
Recap: criteria for causal inference
1. An association between the two variables.
2. The cause comes before the effect (temporal precedence): manipulate the IV, then measure the effect on the DV.
3. Alternative explanations are controlled (control of extraneous variables: random assignment, expectations, control groups, order, etc.; the only difference between the experimental group and the control group should be the IV).
*This is not always possible or desirable. Three quasi-experimental designs:
a) Non-equivalent groups
b) Pre-test/post-test
c) Interrupted time series
52
Non-equivalent Groups
Non-equivalent Groups
*When your variable of interest cannot be manipulated (no causal claim can be made; the direction of the effect and alternative explanations are not ruled out).
• Subject variables: culture, age, IQ, personality, performance, gender, income or education level.
• Ethical concerns: fear, anxiety, depression, pain, malnutrition.
Adding a participant variable
*Does your experimental effect generalise to other populations?
Example: cultural variation in anger in negotiations (European American and Asian American participants)
• Design:
- A 2 (culture; cannot be manipulated) × 2 (emotion) factorial design with a subject variable.
• Theory:
- Anger is a negative emotion, but it is effective in negotiations (instrumental anger).
• Hypothesis:
- Anger is an adaptive mechanism that demonstrates strength and encourages concessions.
• Prediction:
- If anger encourages concession, then participants in the anger condition will be more likely to offer the warranty.
• Method:
- Participants are given a scenario in which they are trying to sell a product to a client who wants a warranty thrown in before they accept the offer (the warranty is expensive, and you do not want to offer it).
- The IV: at the end of the script the client speaks in either an angry or a non-angry tone.
- At the end of the script participants were asked two questions:
- DV: What is the likelihood you will give the client the warranty? (1-7 Likert scale)
- Manipulation check: How angry do you think the client was? (Construct validity: did we actually make people think the client was angry?)
• Results:
- People in the anger condition concede more (give the warranty) than in the no-anger condition.
- Independent t-test (one variable, two levels, manipulated between groups). BUT:
• Samples:
o WEIRD: white, educated, industrialised, rich and democratic (a biased sample that reflects only a small proportion of the world; psychology undergraduate samples make up the majority of psychology samples and are not generalisable to other groups). Why does this matter? For theory.
• Anger is a good example of why WEIRD samples are a problem: cultures vary in their acceptance/tolerance of public displays of anger.
• The emotions-as-social-information model is the theory behind the study: cultures vary in which emotions are appropriate to display in public, and this influences their utility in acts such as negotiation. Collectivist cultures disapprove of public displays of anger; Westerners are more accepting of its instrumental value.
Important
- The anger condition is an independent variable, but culture is a subject variable. This does not affect the analysis, but it affects the interpretation of the results.
- Jamovi will treat culture as an IV in the analysis, but when interpreting it, it is up to us as researchers to recognise that no causal claims can be made about culture because we didn't manipulate it (this affects interpretation, not the stats).
Clustered bar graph:
- Replicates previous research showing that a client's anger in negotiations leads to more concessions than no anger in European Americans.
- However, for Asian Americans the effect was the opposite: more anger led to fewer concessions than no anger.
- = a cross-over interaction (no main effects of anger or culture, because the effect reverses at different levels of the other IV).
- Two categorical variables should be presented as a clustered bar graph.
Line graph: a cross pattern.
Write-up
• Concession making (the primary DV):
- Introduce the analysis and variables
- Main effect
- Main effect
- Interaction
- Post hocs
*The same write-up steps apply to quasi-experimental and true experimental designs.
53
Evaluating Validity (non-equivalent groups) Internal validity External validity
Internal validity
• Better than an association study (cross-sectional or correlational) because anger is manipulated; there is more control in the quasi-experiment and we can make causal claims about anger.
• But we cannot make causal claims about culture as the cause of the difference, because it was not manipulated (it is still practically useful to know that what works in one culture may not work in another, even if I do not know the causal mechanism).
External validity
• Better than a study in a homogeneous population (generalisability of the effect to other cultures; using student samples for convenience and power supports internal validity but sacrifices external validity; once an effect is found in a student sample it can be replicated in more heterogeneous or different samples to see if it generalises to other groups).
• Still constrained by the experimental methodology (manipulated anger, not a real-world situation; still a valuable extension of our knowledge).
• Other examples:
• Medical conditions
• Anything that cannot be randomly assigned with manipulation of the IV.
54
Pre-test/Post Designs
Pre-test/post-test designs
*A quasi-experimental design (e.g., studies for companies to improve their services, where practical or cost constraints make a quasi-experimental design the better option).
Used when you want to measure change within individuals but cannot have a control group or counterbalance order:
a) Cost/practical constraints (can you provide it to one group and not the other? can you afford to?)
b) Participants are in a cohort (a class, a programme, a neighbourhood = you have to apply it to everyone)
c) Carry-over concerns in a within-subjects design (when you can't counterbalance using a standard within-subjects design, use a quasi-experimental pre-post design to test one order, e.g., VR fear/neutral studies where the fear response would contaminate the neutral condition)
Note: earlier we looked at the true-experimental between-subjects pre-test/post-test design.
55
Evaluating Validity (pre-post test design) Internal validity External validity
Internal validity
o Better than an association study because the IV is manipulated (the direction of the effect is established).
o Risk of history, maturation, and regression to the mean (threats to internal validity without a control group).
o Reduce threats to internal validity by running a non-equivalent comparison group if possible (not a proper control group if not randomly assigned, and it can have self-selection effects, but it at least gives a comparison against which maturation, regression to the mean, and history effects can be assessed).
External validity
o Similar to an experiment (really only applicable to the sample I studied it in).
56
Interrupted Time Series (like a pre-post design, but better)
- We use it because it helps address random variation in scores that fluctuate erratically over time.
- For example, covid-19 cases can have an overall trend but go up and down erratically day to day. If we run a pre-post design on a variable that is highly unstable, the effect we see may not be due to the IV but just to random variation in the variable.
- If I looked at only two given time points, I could draw an inaccurate conclusion from the data. We need to look at the overall pattern before and after the intervention to determine its effectiveness.
Intervention research
*This is a common method used in these areas:
• Government policy (e.g., banning cell phones while driving)
• Organisational change (e.g., health care or education systems adopting new strategies and wanting to test their effectiveness)
• Management strategies
• Catastrophic events (e.g., comparing online and in-person teaching due to covid-19; natural disasters disrupting daily functioning psychologically and economically)
*It would be impractical, unethical or too expensive to test these with a true experiment, but we can still make strong causal claims using a quasi-experimental design.
Interrupted time series designs:
- We take a series of measurements (days, months in a row, or years of historical data), then introduce the IV, then take a post-intervention time series. Looking at the overall trend in the data pre-to-post lets us make stronger claims that the effect (IV-DV) is stable over time, as opposed to a normal pre-post design with one time point, the IV, then a second time point (DV).
Example: interrupted time series design, Spoelman et al. (2017)
- Goal: reducing the number of consultations people make with their physicians for easy-to-solve medical issues.
- They made a website for patients to get medical advice (FAQs) for minor, easy-to-solve medical issues.
- They measured the number of medical consultations (per 100 people) before and after the website was introduced.
- Post-intervention (2 years) there is a general decrease in the number of consultations (the trend reversed from increasing to decreasing).
- Collecting enough data pre and post washes out day-to-day variation, lets us see the overall trend in the data, and gives us more confidence that the IV caused the change in the DV.
*Useful where an experiment would have been unethical (withholding something from some patients when you believe it will be beneficial).
57
Evaluating Validity (Interrupted Time Series) Internal validity External validity
Internal validity
• Better than a single pre-test/post-test because it controls for random variation (the most internally valid of the quasi-experimental designs).
• The IV may be confounded with other factors, so alternative explanations are possible (because there is no control group, we do not know what it is about the website that causes the effect).
External validity
• Very high (these studies are almost always done in real-world situations, not in the lab, which is why a true experiment cannot be done).
• Still may not generalise to other contexts (it is specific to the real-world situation in which it was conducted; it would need to be replicated in other organisations or countries).
58
Assumptions of Parametric Tests
Assumptions of Parametric Tests
• Normality (normally distributed)
o The DV is normally distributed.
o The mean is a good estimate of the variable.
o If variables are not normally distributed, the means of the two groups are no longer good group estimates, which causes problems for the t or F test.
• Variance (equal SDs)
o The variance within each condition is similar (homogeneous).
o This affects the formula for the pooled variance (t) or the SS residual (ANOVA).
o When calculating the denominator, we assume the variances are roughly equal and can be pooled. If they are not equal, the pooled variance is not a good estimate of the average variance.
• Sphericity
o Only applies to repeated measures ANOVA (within-subjects variables with more than 2 levels).
o Like the variance assumption, but based on the variance of the difference scores between conditions, not the variance within each condition.
59
Violating homogeneity of variance
• The Student t-test has one measure of variance (s², the pooled variance from both groups). When the group variances are unequal, averaging them is inappropriate because the pooled value doesn't reflect either group well; it can make the estimate too small and not reflect the actual SE.
• Welch's t-test is more conservative; its formula uses the s² (variance) of each group separately rather than pooling them.
• Note: these assumptions only apply to between-subjects comparisons (two sets of variance). A paired t-test analyses the differences between conditions 1 and 2, so there is only one set of variance.
• In Jamovi, you can run both the Student t and Welch's t; if the assumption is violated, use the information from Welch's t. The key difference is that the df will be noticeably smaller (more conservative) with Welch's than with the Student t-test.
• The error in using a Student t (without correction) when the SDs are not equal is that the SE will be artificially small, the t artificially big, and the result more false positives. Violating assumptions increases the false positive rate, so we penalise ourselves by reducing the df, which reshapes the sampling distribution and the p-value. Some people think we should just use Welch's t all the time, since it doesn't affect the p-value much when the assumption holds and it keeps us safe.
• Welch's t is therefore more conservative, and less likely to produce a spurious significant effect.
*The t values and df differ between the two tests. (A hedged sketch follows below.)
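A minimal sketch of the two tests side by side, assuming scipy and invented scores with unequal spread:

```python
from scipy import stats

# Hypothetical scores with unequal variances (values invented for illustration)
group_a = [12, 14, 11, 13, 12, 15]
group_b = [10, 25, 3, 18, 7, 22]

# Student's t pools the two variances; Welch's t (equal_var=False) does not
print(stats.ttest_ind(group_a, group_b, equal_var=True))   # Student's t
print(stats.ttest_ind(group_a, group_b, equal_var=False))  # Welch's t (smaller df)
```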
60
Violating Sphericity
Repeated-measures ANOVA with 3+ levels
- In a paired t-test, we compare the difference between conditions A and B for each participant. We end up with one group of difference scores, from which we can calculate a mean difference and SD. This means we cannot violate homogeneity of variance, because there is only one set of difference scores and nothing for it to be homogeneous with.
- In a repeated-measures ANOVA with 3+ levels, we now have three sets of difference scores (A−B, B−C, A−C) with three different variances, which need to be roughly equal to one another (a "difference of differences" assumption). See the sketch below.
- In Jamovi we ask for the sphericity test, the Greenhouse-Geisser correction (applied if sphericity is violated), and the homogeneity test (for any between-subjects variables).
- The Greenhouse-Geisser correction makes the df smaller (more conservative): e.g., df = 2 becomes 1.76 (1.76/2 = a penalisation factor of .88).
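A sketch of what the sphericity assumption is looking at: the variances of the pairwise difference scores in a 3-level repeated-measures design. The data and the size of the epsilon value are invented for illustration.

```python
# Sphericity concerns the variances of the pairwise difference scores (A-B, B-C, A-C).
import numpy as np

rng = np.random.default_rng(2)
n = 20
a = rng.normal(10, 2, n)
b = a + rng.normal(1, 1, n)          # small, consistent change from A
c = a + rng.normal(2, 4, n)          # larger, noisier change from A

# Sphericity requires these three variances to be roughly equal.
print(np.var(a - b, ddof=1), np.var(b - c, ddof=1), np.var(a - c, ddof=1))

# A Greenhouse-Geisser correction multiplies both df terms by epsilon (<= 1),
# e.g. df = 2 becomes 2 * 0.88 = 1.76 when epsilon = .88.
```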
61
If we violate normality…
When our data are not normally distributed (e.g., skewed) we cannot run parametric tests (t/F tests). Why?
Normality assumptions
• Parametric tests are based on comparisons of means, which assumes the mean is a good estimate of the group average. If the data are skewed, the mean is pulled toward the tail (+/−), producing artificially big or small group differences; outliers warp the means and therefore the t and p-values.
• Using the mean to represent a group or condition assumes the mean is a good estimate.
Solution: use non-parametric (ranked) tests
• Based on medians, not means (the median is better because half of participants' scores fall above it and half below; it is based on participants, not on the size of their scores).
• Various ways of ranking individual data points, to determine whether high or low scores are more likely in one condition than another (a clear split in the data).
• More conservative than parametric tests (less likely to give a false positive).
• Less powerful than parametric tests (we are throwing out the individual scores and just focusing on high/low without looking at how different they are).

(A) Mann-Whitney U test
1. Non-parametric alternative to the independent t-test.
2. Rank-order all the participants' RTs.
3. Null hypothesis: all RTs are equally likely to come from either condition.
4. Research hypothesis: the faster RTs are more likely to come from one condition, and the slower RTs from the other (under the null, half the participants above the median come from group A and half below from group B; what is the probability of finding an 80-20 split?).
5. Use for non-normal data, especially if N is small.
*We lose power.
*There are no df for non-parametric tests, because df describe the sampling distribution, which non-parametric tests do not use (they rank the data and look at the probability that people fall in one group or the other).

(B) Wilcoxon Signed-Rank test
1. Non-parametric alternative to a paired t-test (e.g., comparing Condition A to Condition B within subjects).
2. For each participant, sort them into two groups based on whether they score higher in Condition A or Condition B (if half the participants are better at A and half at B, there is no difference between conditions; if we see an 80-20 split favouring A, it is highly unlikely that the between-condition difference is due to chance, and we reject the null hypothesis).
3. Null hypothesis: two equal groups (people are equally likely to be better in Condition A than Condition B).
4. Research hypothesis: unequal groups (more people with A > B than B > A).
*For a paired t-test the normality assumption applies to the difference scores, not the raw data (for an independent t-test it applies to the raw variables).
*No df for non-parametric tests because they do not use a sampling distribution; they rank the data.
*We would report it as, e.g., W = 3.50, p = .008, ηp² = .873.

(C) Kruskal-Wallis test (alternative to one-way ANOVA)
- Comparing 3+ means, between subjects.
- Run the normal ANOVA first.
- If we've violated the normality assumption (e.g., small n or skewed data), rerun the test as a Kruskal-Wallis non-parametric ANOVA.
- Uses a chi-square test statistic (χ²).
- Report it as, e.g., χ²(2) = 11.40, p = .003.

Post hocs
- A significant omnibus test tells us there is a difference somewhere, but not which group means differ, so we use post hocs (with or without correction, depending on whether we have a directional prediction).
- Non-parametric one-way ANOVAs use Dwass-Steel-Critchlow-Fligner pairwise comparisons.

(D) Friedman test (alternative to repeated-measures ANOVA)
- Uses a chi-square (χ²) test statistic.
- Pairwise comparisons (i.e., post hocs) use the Durbin-Conover procedure.
- If df = 2, there are 3 groups (df = number of groups − 1).
A sketch of running these four tests follows below.
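A sketch of the four non-parametric tests named above, using scipy. The data are simulated, skewed "reaction times" and the variable names are mine, purely for illustration.

```python
# The four non-parametric alternatives, run on simulated skewed RT data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
rt_a = rng.exponential(400, 25) + 200      # condition A (positively skewed RTs)
rt_b = rng.exponential(450, 25) + 200      # condition B
rt_c = rng.exponential(500, 25) + 200      # condition C

# (A) Two independent groups -> Mann-Whitney U
print(stats.mannwhitneyu(rt_a, rt_b))

# (B) Two paired conditions -> Wilcoxon signed-rank (same participants in A and B)
print(stats.wilcoxon(rt_a, rt_b))

# (C) 3+ independent groups -> Kruskal-Wallis (chi-square test statistic)
print(stats.kruskal(rt_a, rt_b, rt_c))

# (D) 3+ repeated measures -> Friedman
print(stats.friedmanchisquare(rt_a, rt_b, rt_c))
```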
62
Summary: 4 Parametric Tests and their Non-Parametric Substitutes
(A) Independent t-test → Mann-Whitney U test
(B) Paired t-test → Wilcoxon Signed-Rank test
(C) One-way ANOVA (between) → Kruskal-Wallis test
(D) One-way repeated-measures ANOVA → Friedman test
63
What about factorial ANOVA? We typically ignore violations of normality in these cases.
• ANOVA is robust to minor violations of normality (we can ignore them; it is not surprising that one group is slightly different from the others, and once the groups are averaged over it shouldn't matter much; a 2 × 2 design has 4 groups).
• Transformations: if there are big violations of normality we can transform the data (there is no non-parametric test for a factorial ANOVA).

Transformations
- Many options! Different transformations are possible, depending on the shape of your original distribution.
- A common option for a heavily positively skewed distribution is a logarithmic transformation. Take the log (base 10) of each value: 10 becomes 1 (10^1), 100 becomes 2 (10^2), 1000 becomes 3 (10^3), and so on. Extreme values in the tail get pulled in much closer to the middle of the distribution, making it look more like a normal distribution. The transformation acknowledges that these values are different, just not different enough to justify pulling the mean that far out. Note that the result is a transformed mean, not the actual mean; we care less about the specific value than about the effect of the IV on the DV between the two groups (the mean is pulled larger by a positive skew and smaller by a negative skew).
- We commonly use logarithmic transformations with EEG data in the lab. The extreme scores still sit in the high end of the distribution, just closer to the mean.
- Inverse transformation: put each individual score under 1 (e.g., 1/850) to scrunch the tail in even closer to the middle of the distribution.
- Exponents are another option.
- The type of transformation you use depends on the problem with your data! A logarithmic transformation is appropriate for positive (+) skew, not negative (−) skew; if the problem is a tail of low scores, a log will only make it worse.
*Find an expert source to help decide which transformation is best for your data. A small sketch follows below.
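A sketch of a log10 transformation applied to a simulated, positively skewed variable, showing how it reduces the skew. The lognormal data and the skew statistic are just one way to illustrate the idea.

```python
# Logarithmic (and inverse) transformation of a positively skewed variable.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
raw = rng.lognormal(mean=3, sigma=1, size=200)   # heavy positive skew

log_scores = np.log10(raw)                        # 10 -> 1, 100 -> 2, 1000 -> 3

print("skew before:", stats.skew(raw))
print("skew after: ", stats.skew(log_scores))

# An inverse (1/x) transformation is an even stronger option for extreme skew.
inverse_scores = 1 / raw
```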
64
Take-home messages
1. Check your assumptions.
2. Use corrections for variance- and sphericity-related violations.
3. Use non-parametric tests for violations of the normality assumption.
4. Use transformations when non-parametric tests aren't available (i.e., factorial designs).
5. Ignore minor violations in factorial and repeated-measures ANOVA (false-positive rates are not really affected).
6. Get help! (People specialise in understanding which correction is best for different data problems; go ask them.)
65
Types of scales (4) Types of variables (2)
Nominal (categorical)
- no link between categories
- cannot average, or say bigger or smaller = discrete
- e.g., gender, eye colour

Ordinal (ranked categorical)
- there is a natural, meaningful way to rank/order the categories
- e.g., position in a race; questionnaire items that gradually increase in extremity
- cannot average them

Interval (continuous)
- the numerical value is genuinely meaningful
- differences between intervals/scores are meaningful
- e.g., temperature in degrees
- addition, subtraction and averaging are meaningful, but ratios are not, because 0 is not meaningful

Ratio (interval with a true zero)
- 0 is meaningful = absence of the variable
- scores/numbers are meaningful
- can multiply and divide
- e.g., reaction time

Types of variables:
Discrete
- there is nothing in between two points/scores
- e.g., year you went to school
- can be nominal, ordinal, interval or ratio

Continuous
- interval and ratio scales
- a variable where, given any two values, it is logical for there to be a value in between
- e.g., RT; temperature in degrees (interval, e.g., Fahrenheit); number of true/false answers correct on a test (ratio); Likert scale (interval)
66
If we square the t-test statistic we get
If we square the t-test statistic, we get the F statistic for the ANOVA run on the same data (t² = F). See the sketch below.
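A quick check of the t² = F relationship on simulated two-group data; any two-group dataset would show the same thing.

```python
# Demonstrating that t^2 from a Student t-test equals F from a one-way ANOVA on two groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
g1 = rng.normal(10, 2, 15)
g2 = rng.normal(12, 2, 15)

t, _ = stats.ttest_ind(g1, g2)      # Student's t (equal variances assumed)
f, _ = stats.f_oneway(g1, g2)       # one-way ANOVA on the same two groups

print(t ** 2, f)                    # the two values match (up to rounding)
```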
67
Graphs have two primary functions:
Graphs have two primary functions: to help us understand our own data or to communicate our findings to the public
68
Reading: histogram and box plots
Histograms are the simplest graph; they work best with interval or ratio data and give you an overall impression of the variable. Their strength is that they show the entire spread of the data; their weaknesses are that their shape depends on the number of bins used and that they are not compact. They are not helpful for nominal data.

Boxplots (box-and-whisker plots) also work best for interval and ratio data. They include a visual presentation of the median, IQR and range of the data, and their shape is not influenced by a choice of bins (an advantage over histograms). They are compact, useful for exploratory analysis of your own data, and a good way to identify outliers. A plotting sketch follows below.
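A sketch of the two plot types side by side, using matplotlib on simulated scores; the data and figure layout are illustrative only.

```python
# Histogram vs boxplot of the same simulated scores.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
scores = rng.normal(70, 10, 200)

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(scores, bins=20)        # shape depends on the number of bins
ax1.set_title("Histogram")
ax2.boxplot(scores)              # median, IQR, range and outliers in one compact plot
ax2.set_title("Boxplot")
plt.show()
```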
69
Reading: Null hypothesis Testing
The goal of a hypothesis test is not to show that the alternative hypothesis is (probably) true; the goal is to show that the null hypothesis is (probably) false. Like a court trial, the null hypothesis is deemed true until we find sufficient evidence to prove beyond a reasonable doubt that it is false.

The goal behind statistical hypothesis testing is not to eliminate errors, but to minimise them; there will always be error. If we reject a null hypothesis that is actually true, we have made a Type I error. On the other hand, if we retain the null hypothesis when it is in fact false, we have made a Type II error.

The single most important design principle of the test is to control the probability of a Type I error, keeping it below some fixed probability. This probability, denoted α, is called the significance level of the test. To repeat, because it is central to the whole set-up: a hypothesis test is said to have significance level α if the Type I error rate is no larger than α.

What about the Type II error rate? We would like to keep that under control too, and we denote its probability by β. However, it is much more common to refer to the power of the test, the probability of rejecting the null hypothesis when it really is false, which is 1 − β. A "powerful" hypothesis test is one that has a small value of β while still keeping α fixed at some small desired level (e.g., .05).

An aside regarding the language you use to talk about hypothesis testing: one thing you really want to avoid is the word "prove". A statistical test doesn't prove that a hypothesis is true or false. Proof implies certainty and, as the saying goes, statistics means never having to say you're certain.
70
Reading: Test statistics and sampling distributions
We calculate a test statistic from our sample and compare it to its corresponding sampling distribution (the values we would expect if the null hypothesis were true). If our test statistic falls in the tail of the sampling distribution (within the rejection region), it is unlikely that the null hypothesis produced our results. The p-value is the probability, assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the one we observed if we replicated the study.
* This says nothing about proving the null wrong or the research hypothesis right.
* To fall in the tail, the test statistic has to be very big or very small (the rejection region is 5% in one tail for a one-tailed hypothesis, or 2.5% in each tail for a two-tailed hypothesis).
"Statistically significant" simply means we have enough evidence to reject the null and conclude there is a significant difference. It doesn't tell us how big or how important the finding is in practice. It doesn't tell us whether our study was "good". It doesn't tell us the probability that the null is true.
We don't usually talk in terms of minimising Type II errors. Instead, we talk about maximising the power of the test; since power is defined as 1 − β, this is the same thing.
71
Reading: Comparing two means
The +/− sign of the test statistic is arbitrary (it depends on which group is entered first). The main difference between the two forms of the test is that the standard error calculations are different: if the two populations have different standard deviations, it makes no sense to calculate a pooled standard deviation estimate, because you are averaging apples and oranges.

[Table 11.1 in the reading gives a (very) rough guide to interpreting Cohen's d.] The recommendation is not to use such benchmarks blindly. The d statistic has a natural interpretation in and of itself: it re-describes the difference in means as the number of standard deviations that separates those means, so it is generally a good idea to think about what that means in practical terms. In some contexts a "small" effect could be of big practical importance; in other situations a "large" effect may not be all that interesting.

In statistical jargon, tests that avoid the normality assumption are nonparametric tests. While avoiding the normality assumption is nice, there is a drawback: the Wilcoxon test is usually less powerful than the t-test (i.e., it has a higher Type II error rate).

An independent samples t-test is used to compare the means of two groups, and tests the null hypothesis that they have the same mean. It comes in two forms: the Student test (Section 11.3), which assumes the groups have the same standard deviation, and the Welch test (Section 11.4), which does not.
A paired samples t-test is used when you have two scores from each person and you want to test the null hypothesis that the two scores have the same mean. It is equivalent to taking the difference between the two scores for each person and then running a one-sample t-test on the difference scores (Section 11.5). A sketch of this equivalence follows below.
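A sketch verifying the equivalence just described: a paired t-test gives the same result as a one-sample t-test on the difference scores. The pre/post data are simulated for illustration.

```python
# Paired t-test == one-sample t-test on the difference scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
pre = rng.normal(100, 15, 30)
post = pre + rng.normal(3, 5, 30)    # each person shifts by ~3 points

paired = stats.ttest_rel(pre, post)
one_sample = stats.ttest_1samp(pre - post, 0)

print(paired)       # identical t and p to the line below
print(one_sample)
```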
72
What test do we do if we have a categorical DV?
A chi-square (χ²) test. See the sketch below.
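A minimal sketch of a chi-square test of independence on a 2×2 contingency table; the counts are made up (e.g., condition by pass/fail), purely for illustration.

```python
# Chi-square test of independence on a 2x2 table of made-up counts.
from scipy import stats

observed = [[30, 10],    # group A: 30 pass, 10 fail
            [18, 22]]    # group B: 18 pass, 22 fail

chi2, p, df, expected = stats.chi2_contingency(observed)
print(chi2, p, df)
```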