Test 2 Flashcards

1
Q

Why do we use statistics?

A

We use statistics as a check on our own biases and to help us better answer our research question (RQ): to understand the shape of the data and to validate our intuitions about patterns within it.

2
Q
Do music lessons make kids smarter?
Causal claim (Mozart Effect)
A

Schellenberg (2004)
• Method:
o Over 36 weeks, four groups of 6-year-olds had lessons added to their coursework. Children were taught keyboard, voice, or drama, or received no lessons, by qualified instructors.
• Why use three different treatment groups?
o No lessons is a control group, but having three other experimental conditions helps us gain a better understanding of what it is about music lessons that causes this effect on IQ.
o Comparing drama to music lessons helps us understand whether it is something specific about music or just being creative.
o Comparing the two music groups tells us whether it is music in general or a specific type of music.
• Matching:
o He matched the four groups on extraneous variables: age, family income (SES), and IQ before lessons. This is essentially a pre-test/post-test design, which allows us to compare IQ before and after the experimental manipulation (lessons).
• Results:
o IQ gains were greater for the music lessons (keyboard and voice) relative to the control and drama lessons. This illustrates that it is something about music that increases IQ, not just creative classes.
o How do we know if these effects are meaningful? We need to use statistics to find out whether these between-group differences are statistically significant and not due to sampling error.

3
Q

What do we need random assignment for?

A

> to create equivalent groups
> to meet the assumptions of t-tests and ANOVA
> to rule out confounds (a condition of causation)

4
Q

If we are comparing three group means, why not run t-tests?

A

We would need multiple t-tests, and the more tests we run, the more we inflate our false positive rate.

5
Q

Analysis of Variance (ANOVA)

A

• F statistic = between-groups variance (how the groups differ from each other) / within-group variance (how people differ from others in their own group).
• It compares the effect due to the IV to the variance that naturally occurs in your population.
• If the null hypothesis is not true, the sample distributions for the groups should not overlap much; the means should be very different, giving us a big F statistic.
• We want the between-group variance to be high and the within-group differences (noise) to be small.

6
Q

How do we calculate variance (s²)?

A

• Null hypothesis: all kids are drawn from the same population (i.e., all 36 participants come from the same population and the IV has no effect on the groups).
• Variance: calculate each participant's distance from the mean. Since some distances will be + and some −, they cannot simply be added together because they would sum to zero. Instead, we square each distance (removing the − sign), add the squared distances together, and divide by n − 1. A bigger number = more spread around the mean; a small variance indicates that scores fall close to the mean.
• Total sums of squares (SStotal): the sum of these squared distances from the grand mean (see the formula sketch below).
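As a hedged sketch of these formulas in standard textbook notation (not copied from the lecture slides):

```latex
% Sample variance: squared deviations from the mean, summed, divided by n - 1
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}

% Total sums of squares: the same squared deviations from the grand mean, before dividing by df
SS_{total} = \sum_{i=1}^{n} (x_i - \bar{x}_{grand})^2
```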

7
Q

SSbetween: Total Sums of Squares Between Groups Variance

SSwithin groups: Sums of Squares within

A

SSbetween: Total Sums of Squares Between Groups Variance
• Take the mean for each group and compare it to the overall grand mean.
• Square each group mean's deviation from the grand mean and add these up to see how much the groups differ from the overall mean.
• If the groups are all different from one another, SSbetween will be large. If they are very similar, it will be small and no effect of the IV is present.

SSwithin groups: Sums of Squares within
• We want within-group variance (noise) to be small.
• How does each participant differ from their own group mean? (See the formula sketch below.)
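A hedged sketch of the two formulas in standard notation (the weighting of each group by its sample size n_j is the usual textbook form, not stated explicitly on the card):

```latex
% Between-groups SS: each group mean vs the grand mean, weighted by group size
SS_{between} = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x}_{grand})^2

% Within-groups SS: each participant vs their own group mean
SS_{within} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2
```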

8
Q

Sums of Squares and Mean Squares

A

Sums of Squares and Mean Squares
§ In ANOVA we calculate variance using a technique called sums of squares (SS).
o SStotal = how much each participant varies from the overall mean (squared).
o SSbetween = how much each group varies from the overall mean (squared).
o SSwithin = how much each person varies from their own group mean (squared); the spread of data within one group.
o SStotal = SSb + SSw
§ Mean Squares (MS) are adjusted for n by dividing SS by df:
o dfbetween = #Groups − 1
o dfwithin = #Participants − #Groups
§ MSbetween = SSb/dfb (mean square between groups = sums of squares between divided by degrees of freedom between).
§ MSwithin = SSw/dfw (mean square within groups = sums of squares within divided by degrees of freedom within).
§ F = MSb/MSw (F statistic = mean square between divided by mean square within).
§ WARNING: MSw is also called MSresidual or MSerror.

Mean Squares
• The more people we have in each group, the bigger the sums of squares will be.
• What we want to know is, on average, how far people are from the mean (the mean square).
• Mean Squares (MS) are adjusted for n by dividing SS by df:
o dfbetween = #Groups − 1
o dfwithin = #Participants − #Groups
• MSbetween = SSb/dfb
• MSwithin = SSw/dfw

*There are two degrees of freedom in ANOVA (between = numerator/top; within = denominator/bottom). A worked sketch of the whole calculation follows below.
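A minimal sketch of the full hand calculation, assuming three made-up groups of scores (the numbers and group names are invented for illustration, not Schellenberg's data):

```python
import numpy as np

# Hypothetical scores for three groups (values invented for illustration)
groups = {
    "keyboard": np.array([7, 6, 8, 5, 7, 6]),
    "drama":    np.array([4, 3, 5, 4, 2, 4]),
    "control":  np.array([3, 4, 2, 3, 5, 3]),
}

all_scores = np.concatenate(list(groups.values()))
grand_mean = all_scores.mean()

# SS_between: how much each group mean differs from the grand mean (weighted by n)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())
# SS_within: how much each participant differs from their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())

df_between = len(groups) - 1               # groups - 1
df_within = len(all_scores) - len(groups)  # participants - groups

ms_between = ss_between / df_between
ms_within = ss_within / df_within
F = ms_between / ms_within
print(f"F({df_between}, {df_within}) = {F:.2f}")
```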

9
Q

ANOVA F Statistic Calculation:

F Distribution

A

§ F = MSbetween/MSwithin.
§ We compare the F statistic to the F sampling distribution, which tells us how often the null hypothesis can produce an F that big.
§ All F values are positive; the distribution starts at 0 and sits entirely above it (one-tailed; it peaks just under 1).
§ The bigger the F statistic, the less likely it is to have been produced by the null hypothesis.
§ We want the between-group variance to be bigger than the within-group variance for the F statistic to be big.
§ An F of about 1 tells us that the between-group variance is not much bigger than the within-group variance (no effect of the IV; consistent with the null).
§ Critical region: if the null hypothesis is true, less than 5% of the time will it produce an F statistic of about 3.5 or bigger (for these df).

F Distribution
• F is a sampling distribution of the possible F values if the null hypothesis is true.
• The exact size/shape of the F distribution depends on the degrees of freedom.
• If the groups differ from each other a lot compared to how much people (or animals) differ from others in their condition, you get a large F.
• Reject the null if p < .05.
• F is always positive.
• There is no difference between a one-tailed and a two-tailed F.

Different df for different F distributions
§ df(x, y)
o x = number of groups − 1
o y = total N − number of groups
§ In our example, we have 3 groups of 12 = 36 participants, so the df would be (2, 33). (See the sketch below for reading off the p-value.)
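A minimal sketch of looking up the p-value for an F with these df, assuming scipy and an invented F value of 3.6:

```python
from scipy import stats

# df = (groups - 1, N - groups); the card's example: 3 groups of 12 -> (2, 33)
df_between, df_within = 2, 33
F = 3.6  # hypothetical F statistic, invented for illustration

# p = probability of an F this large or larger if the null hypothesis is true
p = stats.f.sf(F, df_between, df_within)
print(f"F({df_between}, {df_within}) = {F}, p = {p:.3f}")
```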

10
Q

How does Jamovi treat subject variables in a quasi-experimental factorial ANOVA?

A

o A One-Way ANOVA is used to compare three group means. Jamovi doesn't care whether the IV is manipulated or is a subject variable; statistically, all that matters is that you have three groups. In experimental studies, however, we need the IV to be manipulated (not a subject variable).
o ANOVA is not JUST useful for experimentalists; it is used anytime you want to compare three group means. Experimentalists and non-experimentalists both use it, and both use correlation or regression as well.

11
Q

Grand mean

A

The grand mean is useful because it corresponds to the null hypothesis: if the null is true, the 3+ groups are all sampled from the same population and the group means will all be close to the grand mean.

12
Q

Recap about F Ratios

A

o We calculate the F statistic, a ratio: between-group variance / within-group variance (how participants vary around their group mean). We want the between-group variance to be bigger than the within-group variance to get a big F statistic, which makes us more likely to reject the null hypothesis.
o We calculate F (= mean squares between / mean squares within) by first calculating the mean squares: MSbetween = SSbetween/dfbetween and MSwithin = SSwithin/dfwithin. Jamovi does all of this for you. We then use the F statistic to identify its corresponding p-value by comparing it to the F sampling distribution (which depends on the df, i.e., the sample size and number of groups; under the null hypothesis the groups come from the same population and differences are due to sampling error, not the IV; the 5% rejection region is where there is less than a 5% chance of making a false positive, so an effect of the IV is likely to be present).
o Unlike t-tests, where there is one df (N − #groups), ANOVA has two (one for the numerator and one for the denominator; df = x, y: number of groups minus 1, and total N minus number of groups; e.g., (2, 33) for 3 groups of 12 = 36 participants).

13
Q

A significant F tells me that my groups differ, but not how they differ.

A

• All the F tells me is that there is a group mean difference that is statistically significant. It doesn't tell me which groups are different or in what direction. We need to look at our descriptive statistics and each group's mean to find this out!
• Remember:
• I could do t-tests to compare each group to each other group.
o 3 t-tests
o Probability of a false positive for each one = .05
o Probability of a false positive in at least one of them ≈ .15
o i.e., roughly .05 × the number of tests you run (see the worked calculation below)
• Therefore, we use post hoc tests, which adjust for the number of tests you run.
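A worked version of that ~.15 figure, assuming the three tests are independent (a simplifying assumption; the exact familywise rate is slightly below 3 × .05):

```latex
P(\text{at least one false positive in 3 tests}) = 1 - (1 - .05)^3 \approx .14
```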

14
Q

Post-hoc tests

A

Post-hoc tests
• Comparisons of pairs of means after finding a significant F.
• Used when I have no hypothesis about how the means might differ from each other (2-tailed; no hypothesis about the direction of the mean group difference[s]).
o Post-hocs are like t-tests comparing 2 means, but they have been adjusted to correct for the increased chance of Type 1 error.
o They penalise you for running multiple tests by being stricter on the significance value, so that after all of them are done the collective false positive risk adds up to .05 (generally, .05 divided by the number of tests you run).
• Note: we can do contrasts instead of post hoc tests if we have a prediction about the direction of the mean group difference (one-tailed).
• Post-hoc on the IV (class):
o No correction = t-test comparisons without corrections.
o Tukey (a general correction that fits most situations; not too strict or too loose; see the hedged example below).
o Tick effect size (how big is the mean group difference? This cannot be answered with the p-value! Since each comparison is now a two-group mean difference test, the effect size we use is Cohen's d).

• The df does not change (unlike a t-test, where the df would be 22 = 12 × 2 − 2; in the ANOVA it stays 33 = 12 per group × 3 − 3). Ptukey tells us the significance of each group difference and Cohen's d tells us how big that difference is.
• These Cohen's d values are big effect sizes. We identify the direction of group differences by looking at the data (which group mean is higher?), not by the p-value or the +/− sign of Cohen's d!

• Only do post-hocs if the interaction is significant.
• Check the specific means you want to compare (use the graph to help you decide).
• No corrections are necessary as long as your comparisons are relevant to your hypothesis.
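A minimal Tukey post-hoc sketch, assuming statsmodels and the same invented scores used in the ANOVA sketch above (not Jamovi output):

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical scores and group labels (values invented for illustration)
scores = np.array([7, 6, 8, 5, 7, 6, 4, 3, 5, 4, 2, 4, 3, 4, 2, 3, 5, 3])
labels = ["keyboard"] * 6 + ["drama"] * 6 + ["control"] * 6

# Tukey HSD compares every pair of group means, correcting p for multiple tests
result = pairwise_tukeyhsd(scores, labels, alpha=0.05)
print(result.summary())
```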

15
Q

Planned Comparisons/Contrasts (instead of post hocs!)

A

• Sometimes, a certain comparison is
critical to testing my hypothesis.
• Just do it! (just don’t do too many of
them, and make sure they are justified by
the hypothesis).

16
Q

Factorial Designs

A

2 or more independent variables
• Each variable can be manipulated within- or
between-subjects
• Each variable can have 2 or more levels
(that’s what makes them categorical!)
• Some variables can be subject variables (quasi-experiment; e.g., male or female jurors; a good way to test the generalisability of the findings to other groups)

17
Q

Why add variables?

A

Why add variables?
i.e., factorial designs
1. It is efficient: assess 2 or more causes at once.
2. To refine a theory (because it depends… e.g., in the Stroop effect).
3. To isolate a particular process of interest.
4. To assess change over time (e.g., in a pre-test/post-test design; mindfulness vs Pilates).
5. To increase external validity (extend to other populations, stimuli, situations… e.g., the subject variables in the negotiation study).

Note: interactions and moderation are the same thing. Moderation analyses look at continuous predictors, whereas experimental ANOVA uses categorical predictors (levels of the IV).

18
Q

Hypothesis in Factorials

A

can be:
> interaction only (i.e., no main effect, but the effect of one IV is dependent on the level of the other IV)
> main effects & interaction

*We always want to know about an interaction.

19
Q

Variables plotted on graph

A

The DV always goes on the y-axis but either IV can go on the x-axis or the key.

We decide what IV goes on the X-axis by referring back to our research question to see what makes the most sense.

20
Q

Main effects are ___s and interactions are ___s of _____s

A

• Main effects are the averages.
• To look at the main effect of drug, we average both therapy groups within the placebo condition and average both therapy groups within the Prozac condition. Is one average higher than the other? Yes, Prozac appears to work better than placebo.
• To look at the main effect of therapy, we average the improvement in both CBT groups and compare it to the average improvement in both waitlist groups. Is one average higher than the other? Yes, the CBT groups improved more than the waitlist control groups.
• Interactions are differences between differences.
• We calculate the difference in means between the two groups at each level of one IV, and then calculate the difference between those differences (the difference of differences computed across Prozac vs placebo, and the difference of differences computed across CBT vs waitlist, equal the same value either way you do it; this tells us the magnitude of the interaction; see the worked example below).
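A minimal worked example of "difference of differences", using invented cell means for the drug × therapy design (the numbers are hypothetical):

```python
# Hypothetical cell means for improvement (values invented for illustration)
means = {("CBT", "Prozac"): 12, ("CBT", "placebo"): 6,
         ("waitlist", "Prozac"): 5, ("waitlist", "placebo"): 4}

# Drug effect at each level of therapy
drug_effect_cbt = means[("CBT", "Prozac")] - means[("CBT", "placebo")]             # 6
drug_effect_wait = means[("waitlist", "Prozac")] - means[("waitlist", "placebo")]  # 1

# Interaction = difference between the differences
# (computing it the other way, therapy effect under Prozac minus under placebo, gives the same value)
interaction = drug_effect_cbt - drug_effect_wait
print(interaction)  # 5
```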

21
Q

Difference between Oneway ANOVA and ANOVA?

A

• One-way ANOVA: 1 IV with three or more levels.
• Factorial ANOVA: 2 or more IVs, each with 2 or more levels.

22
Q

ANOVA with 3x groups have…

A

2× df and 3 F statistics.

23
Q

If a significant main effect has more than 2 levels, you need to do a…

A

post-hoc test to determine where the differences are (just like oneway ANOVA).

24
Q

What are the advantages of combining two IVs in the same study?

A

• Using the same data, running an ANOVA with 2 IVs rather than 1 gives different p-values and F ratios!
• "Drug" has a larger effect size and a lower p-value when Therapy is included in the model!
• Why?
o Look at the residuals.
o The residuals (denominator; unexplained variance) are larger when only one IV is included, which reduces the size of the F statistic, makes the p-value larger, and makes partial eta squared smaller.
o Some of that variability can be explained by the second IV if it is included in the model (it is moved from the denominator to the numerator).
o This only works if the added IV is related to the DV (it then increases power); if not, it will only make things worse.

• Adding (useful) factors decreases the residuals, increasing F and decreasing p. (A hedged two-way ANOVA sketch follows below.)
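A minimal sketch of the comparison, assuming statsmodels and invented improvement scores (not the lecture's data); the point is that the residual sum of squares shrinks once the second factor is modelled:

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Hypothetical improvement scores (values invented for illustration)
df = pd.DataFrame({
    "improvement": [12, 10, 11, 6, 7, 5, 5, 6, 4, 4, 3, 5],
    "drug":    ["prozac"] * 3 + ["placebo"] * 3 + ["prozac"] * 3 + ["placebo"] * 3,
    "therapy": ["cbt"] * 6 + ["waitlist"] * 6,
})

# One-IV model: therapy's variance stays in the residual (error) term
print(anova_lm(ols("improvement ~ drug", data=df).fit(), typ=2))

# Two-IV model: therapy and the interaction pull variance out of the residuals,
# which shrinks MS_error and can increase F and lower p for the drug effect
print(anova_lm(ols("improvement ~ drug * therapy", data=df).fit(), typ=2))
```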
25
Interaction Write Up: | & Main Effects
Describing a significant interaction:
1. Split one of the IVs into levels.
2. Compare the effect of the other IV at each level.
3. Explain how the differences are different.
• For participants in CBT, the drug produced improvement over placebo. However, in those on the waitlist, the drug was no more effective than placebo.
• For participants taking the drug, CBT was more effective than staying on the waitlist. For participants taking the placebo, CBT was no more effective than the waitlist.
• The effectiveness of the drug depended on therapy.
• The effectiveness of therapy depended on the drug.
What about the main effects?
• The main effect of drug isn't meaningful.
• The main effect of therapy isn't meaningful.
• These main effects are qualified by the interaction.
26
ANOVA with within-subject variables
In an ANOVA with within-subjects variables (both variables manipulated within-subjects), the only difference is the statistics we use (a repeated measures ANOVA; we do not care about overall differences between people, just differences within individuals between conditions). A 2 × 2 factorial (within-subjects) design uses a repeated measures ANOVA (we use it anytime there is a within-subjects variable; the only difference in naming is whether you call it a mixed or a within-subjects factorial). Jamovi doesn't have the theoretical variables in it; it only has the operationalised levels, so we need to tell Jamovi what we are measuring.
27
Assumption of sphericity only matters ... Homogeneity of variance ...
The assumption of sphericity only matters in within-subjects designs with three or more levels (with two levels and two means you only have one set of difference scores, so sphericity cannot be violated; it only becomes an issue when you are comparing 3+ pairs of means). Homogeneity of variance applies when the variance (SDs) within groups differs; unlike sphericity, this can be violated even in a design with only two levels.
28
Small-N designs Establishing causality when we do not have group means to compare When do we use small-N designs?
When do we use small-N designs?
• To establish a causal effect of the IV on the DV within a small number of participants:
- The research question concerns a very small sample (not enough people to sample from).
- Situations where we cannot recruit a sufficiently powered sample (too small for inferential statistics).
- When we expect substantial variability in individual responses (group means aren't useful for highly variable responses; the mean doesn't end up describing most participants!).
• A small-N design establishes causal relationships by replicating the effect of the IV on the DV (to demonstrate consistency):
- Consistent change in the DV as the IV is manipulated, with little variability (in level or trend).
- Direct replication of the IV's effect within the participant (exact same participants, conditions and context).
- Systematic replication of the IV's effect across participants or contexts (different participants, contexts or conditions).
• Control over other variables is achieved by:
- Establishing a baseline for the behaviour without intervention (this acts as a control condition).
- Collecting multiple observations until we see consistency in behaviour (more confidence in the IV's effect on the DV; helps establish that the DV is under the control of the IV and supports claims of causality!).
- Replicating the change in the DV with the introduction of the intervention (comparing the change in the DV from baseline to intervention).
• Is one AB relationship enough to demonstrate this? What is the problem with inferring causality from a single phase change?
• An AB design does not rule out history effects: extraneous variables that happen at the same time as the IV may have caused the change in behaviour and provide an alternative explanation for the results.
• Solution = reversal design.
29
Reversal Designs
• The phase change between A and B happens more than once (not a single-phase design!).
• ABA (baseline-intervention-baseline; the most common reversal design; does behaviour go back to baseline after the intervention is removed, i.e., is it under the control of the IV's presence or absence?) or ABAB (baseline-intervention-baseline-intervention).
• Meets the final replication criterion to demonstrate control by the IV (behaviour changing at each phase change supports the causal claim that the DV is under the control of the IV).
Example reversal design (Bicard et al., 2012):
• College athletes at risk of academic failure. The researchers propose that one causal factor for poor academic performance is being late to class.
• The intervention aims to get students to attend class on time.
• Minutes late to class (DV), weeks into trimester (timeline), texting (intervention; students have to text their student counsellor when they are coming to class), baseline (no texting).
• At baseline, they arrived 30+ minutes late to class.
• The intervention caused their behaviour to get better.
• The removal of the intervention caused the behaviour to go back to baseline.
• Re-introduction of the intervention improved behaviour again.
• An ABAB design is good for:
o Behaviours that can be unlearnt.
o It is not ethical to use ABAB designs when the behaviour is harmful; you cannot remove an intervention that is working (you can use methods other than reversals to prove causality; you still need to show cause and effect).
*Intervention behaviour differs from baseline; returning to baseline reverses the effect.
30
Multiple Baseline Designs
• A single AB phase change that is replicated across participants, behaviours or contexts (at least twice).
• Target outcomes must be independent (observations in one individual are independent from observations in another).
• Useful for effects that can't be reversed:
- The behaviour can't be un-learnt.
- It is not ethical to withdraw treatment.
Example: multiple baselines (same behaviour and context, but different participants)
• Notice in this graph:
- Multiple students are being tested with the same intervention at different times (the intervention doesn't start at the same time for every participant).
- Why do they start at different times? Staggered baseline switches are a common feature: the goal is to rule out history effects, where extraneous variables impact the DV at the same time as the IV and could provide alternative explanations for the change in the DV.
Example: multiple baseline design (same context and participants, but different behaviours)
- Pairs of students enrolled in a mathematics course. The experimenters want to test the effects of a home-based peer tutoring intervention on test scores (three different types of mathematics equations).
- Participants had a history of maltreatment, which is associated with poorer maths and literacy skills.
- Baseline (test scores without intervention; rewarded with money): no answers correct.
- Intervention (one member of each pair was taught how to solve the equations; the participants then taught the other member and test scores were calculated): scores got better.
- If children did not reach mastery with the intervention alone, they were provided with additional interventions.
- A staggered baseline is present (specific to multiple behaviours with the same participants) for: ruling out history effects, and giving us a chance to see whether the effect of the intervention for one maths equation (DV) influences the others. Causal claims are impaired if the intervention affects subsequent behaviours.
*The intervention effect is replicated across behaviours.
- What if I'm not happy with this intervention because it doesn't work very well and I want to compare the effectiveness of two different treatments, or I'm not sure which component of the intervention is causing the effect (i.e., I want to separate things out)? Use an alternating treatments design with the same participants.
31
Alternating Treatment Designs
• At least 2 interventions are tested, usually one at a time, and frequently alternated from session to session.
• Combinations of interventions can be used to test for interactions (e.g., A-B-C-B-C: baseline, intervention 1, interventions 1 + 2, then alternating between the two).
Example: alternating treatments design
- Kallie is 6, autistic, and has a high rate of PICA (compulsive consumption of inedible objects, e.g., Christmas decorations, or destroying them).
- The intervention(s) aim to reduce the frequency of this harmful behaviour.
- Baseline:
- Functional analysis context: baited PICA items on the table; safe to eat, but they look like the unsafe items she likes to eat.
- Holiday decoration context: attempts to eat or destroy holiday decorations were blocked by the experimenter.
- High (clinically significant) levels of PICA behaviour, warranting an intervention.
- B (DRA): differential reinforcement of alternative behaviour (block PICA attempts, and reward non-harmful interactions with the toys/objects with edible food items she likes).
- Not punishing bad behaviour, but rewarding good behaviour to reduce the frequency of the bad behaviour.
- We saw reductions in the target behaviour but not sufficient clinical significance (one PICA attempt per minute is still too high).
- C (DRA and facial screen): reinforcement of good behaviour, plus anytime she tries to eat an inedible object the experimenter places a hand over her hands and eyes for 30 seconds. This is a form of stimulus avoidance (not restraint); her attention is then redirected to objects that are safe to play with.
- Strong behaviour reduction, with instances of 0 PICA behaviours (clinically significant).
- At no point do we return to baseline: the phases run intervention, both interventions, intervention, both interventions.
- We can see that in the second DRA phase the behaviour gets worse, which illustrates that the intervention on its own doesn't have a lasting effect and performs better in combination with the facial screen (an interaction).
- Reversal between intervention phases (BCBC).
*Behaviour changes as the treatment is alternated (Mitter et al., 2015).
32
Changing Criterion Designs
- Used for behaviours where we expect gradual rather than immediate change (more resistant behaviours, such as smoking cessation).
• The level of required behaviour is varied across trials (the level of the intervention or its criterion changes; it gets harder, and we measure changes in behaviour in response to the level changes).
• A change in the criterion for the target behaviour is based on successfully maintaining the behaviour at the previous criterion.
• A causal relationship is demonstrated by successive replication of a change in behaviour with changes to the IV.
Example:
• A university pole vaulter.
• The student is not lifting their arms as high as they should be able to (technique) to clear the bar, so an intervention is needed.
• The current height they raise their hands to when vaulting is a long-term behaviour that is now a habit and harder to change (more resistant to change).
• Baseline (pole vault attempts without intervention): the typical height they keep their hands at and the height they can clear.
- Intervention: shout "reach" as they are about to jump, using a feedback pole so that if they raise their arms high enough they get instant auditory feedback.
• Levels: the baseline was 225, so the first criterion was to learn to reach 230, then 235, then 240, then 245, etc.
- They got better over time, and had to reach the stability criterion (consistency of correct arm extensions) before the criterion level was changed.
- At 252 they were not able to reach the stability criterion and the study was terminated.
• Tips for a good changing criterion study:
- Increases in the criterion need to be small enough to be reasonably met but big enough that you can see when behaviour changes (small changes are not good for highly variable behaviour!); criterion changes should not happen at predictable times (random or staggered, to rule out history or maturation effects); returning to baseline or to a previous criterion level supports causal claims about the IV-DV relationship.
*Successive demonstration of a change in behaviour with changes to the IV (level).
*We are more concerned with clinical significance than statistical significance (healthy behaviour levels matter more than having the best experiment).
33
Validity of small-N designs
1. Construct validity
• How well do the operational variables map onto the theoretical variables?
2. Internal validity
• Are there other possible explanations for the findings?
• Are there confounds?
• Is it the best design to address the question?
3. External validity
• Can the conclusions generalise to other people, other stimuli, other contexts?
4. Statistical validity
• How big is the effect?
• Does the study have sufficient power?
• Is the data treated appropriately?
• Are the statistical conclusions (e.g., significance) justified?
Sufficient power
• Does the study have a big enough sample size to identify the effect size we are interested in?
• Small sample sizes can only detect big effects.
• The bigger the sample size, the smaller the effect size you can identify.
34
Threats to Good Replicable Science:
- One problem is study design: low statistical power means small effects cannot be detected, producing unreplicable findings.
- Failure to control for bias.
- P-hacking (repeating statistical tests until one is significant).
- HARKing (hypothesising after the results are known).
- Publication bias (only publishing significant results that support the experimental hypothesis).
35
Step 1: Do good Science
1. Make clear predictions based on your hypothesis.
2. Ensure you have enough statistical power (the sample size is sufficient for the effect size you are interested in; big effects need smaller samples, but smaller effects need bigger samples).
3. Set a stopping rule.
4. Reduce flexibility in data analysis (predetermine the DVs, exclusion criteria, subgroups and covariate analyses).
5. Adjust for multiple comparisons when appropriate.
6. Upgrade statistical skills and understanding.
*How does this apply to small-N designs, which do not have samples of people? In small-N designs it could be argued that we exploit optional stopping: the researcher can extend the number of trials they run until they reach a consistent/stable pattern of behaviour. It is also a flexible approach in how many trials we do and at what point we change phases (baseline-intervention); these are dynamic changes we can make in small-N designs. Does this mean small-N designs are not doing good science?
36
Replication allows us to determine ____ in a small N design but... Optional Stopping Rules Inductive and Deductive Reasoning
- Replication allows us to determine causality in a small-N design, but one phase change is not sufficient to conclude that the IV has control over the DV: there are too many confounds.
- Unlike group-mean designs, small-N designs treat each participant as a unit of replication (this is an advantage of small-N designs: within one study with multiple participants you are replicating an effect; two participants = two replications). This counters the critique that small-N designs do not care about replication: they do, it is just built into the design itself.
- Is it okay to do research without a stopping rule? People who stop responding or do not complete all trials have their data removed for being incomplete. Is this a problem? To answer this we look at the inductive and deductive models of reasoning.
- Group-based studies use a deductive approach: theory, then hypothesis, then data collection.
- Small-N designs are inductive in nature: observe behaviour, find patterns, and use theory to explain the patterns found (being dynamic is necessary for clinical work, where the goal is to reduce negative behaviour and replace it with positive behaviours; a null effect cannot be found because we keep the study going until we see the behaviour change we want). Small-N designs can be deductive when they test the effectiveness of an intervention (testing a theory-informed intervention on a child and taking observations to see if it works).
- Most small-N designs are a mixture of inductive and deductive reasoning: a mixture of theory-driven and exploratory work. This is true of all psychology research (including group designs), where there is no clear distinction between inductive and deductive work.
37
example Small N design experiment abstract write up
1. Set up a goal with clinical significance: there is little research on PICA interventions.
2. Link the goal to previous literature.
3. Identify the purpose of the current study: identifying the function of someone's behaviour (functional analysis is an inductive goal! Observe the person's behaviour in different contexts to find out what is causing it, and keep the study going until it is fixed: flexible, dynamic, and no problem with publication bias because there are no null findings and no hypothesis/prediction being tested). But there is a deductive part as well (evaluating a treatment informed by research: theory-driven intervention selection, using data to see whether it is effective and supports our hypothesis; this can have null effects and publication bias). The deductive-reasoning issues are the same ones we identified for group-level analysis.
4. This example is a mixture of inductive and deductive approaches. The small-N design lets the researchers pursue their primary goal of identifying the function of the girl's behaviour and keeping it under control, while being flexible enough to test a secondary deductive hypothesis: which theory-driven treatment is most effective. This is not a clear distinction to find in studies!
38
Step 2: Transparency in Reporting
1. Report all measures, variables and conditions.
2. Clearly distinguish confirmatory (planned) from exploratory (unplanned) analyses.
3. Clearly document hypotheses, predictions, design decisions and procedures.
4. Share data in a public repository (if ethically possible).
5. Share analysis code and results.
6. Share research materials.
7. Improve journal standards.
*Once we have identified the best design for our RQ, we should declare as much as possible. In deductive studies we are committing to our design decisions, and there is no flexibility to change them mid-study if things are not panning out as expected. An inductive design is much more dynamic and changes are expected during the study, but we still need to declare the logic or reasoning we will use to make those changes (i.e., what will I use to decide when to change phases?).
- If our hypothesis is conceptualised at the individual level, then it should be tested at the individual level with a small-N design. If it is about groups of people performing the same way on average, then it should be tested with a group design. The design determines which statistical validity questions you need to ask.
- For example, one exploratory aim was that people who were distracted were expected to take longer to respond: not theory-driven, just an intuitive expectation with no specific hypothesis.
39
External Validity in Small N designs:
External validity in small-N designs:
• Each participant is a replication unit, which supports the external validity of the IV-DV effect to other people, contexts or behaviours (a specific advantage of small-N over group designs, which compare averages/means and ignore individual differences).
Example: concern about replication of an IV-DV effect (visitor contact impacts tuatara welfare; animal subjects). Does replicating the effect in 3 subjects mean it will generalise to other tuatara?
• Do tuatara get distressed by visitor handling?
• There is enough in the literature to justify studying this, because some animals are distressed when handled and others are not.
• The analysis can be made at the species level (group differences) or at the individual level (differences between individual members of a species, where a small-N design is appropriate).
• A small-N design meets animal ethics requirements (the 3 Rs: reduce, refine and replace; use the minimum number of animals you need to answer the question).
Results (van Heerbeek et al., 2021):
• Baseline: no handling; count and record target behaviours.
• Intervention: handling (held in the hand for 30 minutes and touched by visitors); count and record target behaviours.
• High variability in baseline visibility; when distressed they burrow (and cannot be seen). Do they burrow more when distressed?
• Visitation days are set and cannot be changed, which is a limitation because we cannot rule out extraneous variables that coincide with those days (there is no staggering of the baseline start!), though it does allow us to rule out some environmental factors.
• The dark line shows that tuatara burrowed more (were less visible) on visitor days when contact was high, which indicates the handling distressed the animals.
• External validity is high in this example because the tuatara and the ecological context it is meant to apply to are included in the study. We can be confident the behaviour will generalise outside of the study and be repeated in the real world.
• Small-N designs generally have high external validity! The subjects being studied are the ones the findings are meant to apply to. It is not clear whether the results generalise to others, but they definitely apply to the individuals in the study.
• Side note: a single study does not lead to changes in policy or interventions. Multiple studies providing converging evidence are needed!
40
Statistical Power | In general:
In general:
- The more power you have, the better.
- The goal: design a study to have the most power that you can.
41
We aim to collect as much data as we can to decide about the truth, but there is always room for error: what 2 errors can we make?
• True positive: reject the null and the null is false.
• True negative: fail to reject the null and the null is true.
• False positive (alpha):
- Type 1 error.
- Our tolerance for being wrong, which we arbitrarily set at .05, i.e., 5% of the time we reject the null when the null is true.
- In other words, we conclude that the between-group difference is due to the effect of the IV when it is actually due to sampling error.
• False negative (beta):
- Type 2 error.
- Failing to reject the null hypothesis when it is actually false.
- In other words, having insufficient evidence to reject the null hypothesis when an effect of the IV is present.
• These two errors are independent of one another.
42
Power = 1 – β
For example:
- If beta (the false negative rate) is set at .2 (20%), then your power is .8 (80%). 80% power means you have an 80% probability of rejecting the null hypothesis IF it is actually false; 90% power means a 90% probability of rejecting the null hypothesis IF it is actually false.
- Calculating power before you start the study is important for understanding the probability of finding a significant difference in the study.
- In other words, with 80% power, if we ran the study ten times we would expect to find a significant difference 8/10 times.
- Before the replication crisis, studies typically had only around 40% power, which is a waste of time and resources. We need to design better studies with more statistical power = better science.
43
What determines Beta? How do we get more power?
Cohen's d = difference between means / SD.
- The bigger the difference between the groups (the numerator), the bigger the effect size.
- Cohen's d uses the SD, NOT the SE, because the SD is not affected by sample size (n).
- The bigger the SD (the denominator), the smaller the effect size.
44
Cohen’s d & Distribution overlap: | *Measures the difference in distribution overlap between two groups
- Calculated as (M1 − M2)/SD. For example, 115 − 100 = 15, and 15/10 = 1.5, which is a big effect size.
- With the same mean difference but a larger SD, 15/60 = 0.25 (more overlap between the distributions and a smaller effect size).
*Two things determine effect sizes: the mean group difference and the SD (variability). (A hedged pooled-SD sketch follows below.)
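A minimal sketch of the calculation with a pooled SD (one common textbook form; the scores below are invented for illustration):

```python
import numpy as np

def cohens_d(group1, group2):
    """Cohen's d using a pooled standard deviation (a common textbook form)."""
    n1, n2 = len(group1), len(group2)
    pooled_sd = np.sqrt(((n1 - 1) * np.var(group1, ddof=1) +
                         (n2 - 1) * np.var(group2, ddof=1)) / (n1 + n2 - 2))
    return (np.mean(group1) - np.mean(group2)) / pooled_sd

# Hypothetical scores for two groups (values invented for illustration)
print(cohens_d([115, 118, 112, 120, 116], [100, 104, 98, 102, 101]))
```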
45
How to get more power: | *To get a bigger t and smaller p
Design a study to optimise power:
*This applies to any experimental or quasi-experimental design with 2+ groups.
(A) Increase the difference between groups
(B) Decrease SE
a. Decrease SD
b. Increase N
*Because SE = SD / √N: a smaller SD (numerator) and a bigger N (denominator) make the SE smaller, which makes the test statistic bigger.
(A) Increase the difference between groups/conditions
• Increase the strength of the manipulation (IV):
o e.g., more sessions of CBT
o e.g., a higher dosage of the drug
o You wouldn't want a very strong manipulation all the time, e.g., with negative-valence stimuli (which can be unethical) or in a within-subjects design (which would introduce demand characteristics).
• Sample from extreme groups/ends of the distribution:
o This has its own limitations: regression to the mean, and not knowing what happens on average.
*Always aim for the strongest manipulation you can, so that if no effect is found we can be confident the effect is truly absent rather than undetected due to insufficient power.
(B) Decrease SD
• Standardise measurement:
o High consistency, reliability and validity of the measure reduces variability.
• Homogeneous sample:
o Less variability for the IV to compete with, but this impairs external validity (generalising to other groups); if it doesn't work in a homogeneous sample, it is not likely to work in a heterogeneous sample.
• Matched pairs:
o Match people on important extraneous variables.
• Within-subjects designs:
o I don't care that people vary from one another.
(C) Increase N
• Incentives/rewards for participation.
• Online data collection rather than in person.
• Collaborative studies (the many-labs method: multiple labs run the same study and share data).
*Harder to do because of the cost in time and resources.
46
What about alpha?
• We set the alpha (significance level) at .05. If the p-value is less than .05 we have sufficient evidence to reject the null hypothesis, accepting that there is a 5% chance we are making a false positive error: that is, we reject the null hypothesis and conclude the IV caused the effect on the DV when it is actually due to sampling error.
• A one-tailed test has more power than a two-tailed test. In a two-tailed test we predict an effect will be present but we don't know in which direction, so the 5% is split between both tails and a larger t (+/−) is required to produce a significant p-value.
• In contrast, a one-tailed test is when we make a prediction about the direction of the effect and get to pool the 5% in one tail, which means we need a smaller t statistic for the p-value to be significant. One caveat is that if you predict the wrong direction, the result will be non-significant even if it would have been significant in the other direction.
47
Calculating Power *Use statistical software (G*Power) to calculate it for you
• Power is a function of N, the mean group difference and the SD.
• The software will ask:
o One tail or two?
o The predicted effect size
o Alpha
o How much power you want (.90 is best)
o Equal numbers of participants in each group?
• It then tells you how many participants you need to meet these criteria (decide before sampling; this acts as a stopping rule). (A hedged software sketch follows below.)
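The card names G*Power; as an illustrative alternative, a minimal sketch of the same kind of calculation with statsmodels, assuming a medium effect (d = .5), alpha = .05, 90% power and a two-tailed independent-groups t-test:

```python
from statsmodels.stats.power import TTestIndPower

# Hypothetical inputs (effect size, alpha and power chosen for illustration)
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.90,
                                   alternative="two-sided")
print(round(n_per_group))  # participants needed in each group
```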
48
Smallest Effect Size of Interest (SESOI)
• Detecting small effects requires precise measurement and a large N, but both are costly design elements.
- Solution: consider the theoretical and practical significance of finding an effect to justify the costs of detecting smaller effects.
- Use the literature to see what effect sizes others have found.
- Start with a medium effect size (d = .5) as a base.
- What is the smallest effect size you care about? Detecting small effect sizes is costly (time, money and effort).
- Theoretically, some fields may want evidence of an effect no matter the cost (in physics a tiny effect can really influence theory), but in psychology this is not always the case.
- In psychology, practical significance matters more: what effect size would be important enough to inform treatment design, or to improve daily functioning?
*Small effect sizes may be theoretically interesting but not practically significant.
49
Interpreting null effects
1. Your experiment didn’t fail! (significant effects are not the goal of research, we are aiming to find the truth!) 2. Do you have sufficient power? (not large enough sample size to have sufficient evidence to reject the null-hypothesis) 3. Did you effectively manipulate your IV? (construct validity; did the IV manipulation measure what we intended it to? Did we include a manipulation check?) 4. Do you replicate your null effects? (answers not found in a single study; same effect found in another study supports that its is a null effect and not issues with experimental design) 5. Did you preregister your hypotheses? (faith in null effect; because it requires you to do a power analysis).
50
Can we ever show that the null is true?
• Not with NHST. The logic of null hypothesis significance testing is that we assume the shape of the sampling distribution when the null hypothesis is true and then look for evidence that would allow us to reject the null. We have two options: reject the null hypothesis or fail to reject it. When we fail to reject the null, we are saying we have insufficient evidence to reject it; it does not mean we accept it. NHST is not designed to answer this question.
• Demonstrate high power.
• Replicate the null.
• Use Bayesian statistics (an alternative to NHST that weighs the evidence for the null against the evidence for the experimental hypothesis; it doesn't rely on the reject/fail-to-reject logic of NHST, so it can tell you there is more evidence for the null than for the experimental hypothesis, which would allow a claim that the null hypothesis is true, unlike NHST).
*Absence of evidence does not equal evidence of absence! Having insufficient evidence to reject the null hypothesis does not mean it is true. I don't have evidence that the groups are equivalent; I just don't have sufficient evidence to show that they are different.
51
Quasi-Experimental Designs *Designs that, like experiments, involve manipulation of the IV but do not have the same level of control as a true experiment. Three examples
Recap: criteria for causal inference
1. An association between the two variables.
2. The cause comes before the effect (temporal precedence): manipulate the IV, then measure the effect on the DV.
3. Alternative explanations are controlled (control of extraneous variables: random assignment, expectations, control groups, order, etc.; the only difference between the experimental group and the control group should be the IV).
*This is not always possible or desirable. Three quasi-experimental designs:
a) Non-equivalent groups
b) Pre-test/post-test
c) Interrupted time series
52
Non-equivalent Groups
Non-equivalent Groups
*When your variable of interest cannot be manipulated (no causal claim can be made; the direction of the effect and alternative explanations are not ruled out).
• Subject variables: culture, age, IQ, personality, performance, gender, income or education level.
• Ethical concerns: fear, anxiety, depression, pain, malnutrition.
Adding a participant variable
*Does your experimental effect generalise to other populations?
Example: cultural variation in anger in negotiations (European American and Asian American participants)
• Design:
- A 2 (culture; cannot be manipulated) × 2 (emotion) factorial design with a subject variable.
• Theory:
- Anger is a negative emotion, but it is effective in negotiations (instrumental anger).
• Hypothesis:
- Anger is an adaptive mechanism that demonstrates strength and encourages concessions.
• Prediction:
- If anger encourages concession, then participants in the anger condition will be more likely to offer the warranty.
• Method:
- Participants are given a scenario in which they are trying to sell a product to a client who wants a warranty thrown in before they accept the offer (the warranty is expensive, and you do not want to offer it).
- The IV: at the end of the script the client speaks in either an angry or a non-angry tone.
- At the end of the script participants were asked two questions:
- DV: What is the likelihood you will give the client the warranty? (1-7 Likert scale)
- Manipulation check: How angry do you think the client was? (Construct validity: did we actually make people think the client was angry?)
• Results:
- People in the anger condition concede more (give the warranty) than in the no-anger condition.
- Independent t-test (one variable, two levels, manipulated between groups). BUT:
• Samples:
o WEIRD: white, educated, industrialised, rich and democratic (a biased sample that reflects only a small proportion of the world; psychology undergraduate samples make up the majority of psychology samples and are not generalisable to other groups). Why does this matter? For theory.
• Anger is a good example of why WEIRD samples are a problem: cultures vary in their acceptance/tolerance of public displays of anger.
• The emotions-as-social-information model is the theory behind the study: cultures vary in which emotions are appropriate to display in public, and this influences their utility in acts such as negotiation. Collectivist cultures disapprove of public displays of anger; Westerners are more accepting of its instrumental value.
Important
- The anger condition is an independent variable, but culture is a subject variable. This does not affect the analysis, but it affects the interpretation of the results.
- Jamovi will treat culture as an IV in the analysis, but when interpreting it, it is up to us as researchers to recognise that no causal claims can be made about culture because we didn't manipulate it (this affects interpretation, not the stats).
Clustered bar graph:
- Replicates previous research showing that a client's anger in negotiations leads to more concessions than no anger in European Americans.
- However, for Asian Americans the effect was the opposite: more anger led to fewer concessions than no anger.
- = a cross-over interaction (no main effects of anger or culture, because the effect reverses at different levels of the other IV).
- Two categorical variables should be presented as a clustered bar graph.
Line graph: a cross pattern.
Write-up
• Concession making (the primary DV):
- Introduce the analysis and variables
- Main effect
- Main effect
- Interaction
- Post hocs
*The same write-up steps apply to quasi-experimental and true experimental designs.
53
Evaluating Validity (non-equivalent groups) Internal validity External validity
Internal validity
• Better than an association study (cross-sectional or correlational) because anger is manipulated; there is more control in the quasi-experiment and we can make causal claims about anger.
• But we cannot make causal claims about culture as the cause of the difference, because it was not manipulated (it is still practically useful to know that what works in one culture may not work in another, even if I do not know the causal mechanism).
External validity
• Better than a study in a homogeneous population (generalisability of the effect to other cultures; using student samples for convenience and power supports internal validity but sacrifices external validity; once an effect is found in a student sample it can be replicated in more heterogeneous or different samples to see if it generalises to other groups).
• Still constrained by the experimental methodology (manipulated anger, not a real-world situation; still a valuable extension of our knowledge).
• Other examples:
• Medical conditions
• Anything that cannot be randomly assigned with manipulation of the IV.
54
Pre-test/Post Designs
Pre-test/post-test designs
*A quasi-experimental design (e.g., studies for companies to improve their services, where practical or cost constraints make a quasi-experimental design the better option).
Used when you want to measure change within individuals but cannot have a control group or counterbalance order:
a) Cost/practical constraints (can you provide it to one group and not the other? can you afford to?)
b) Participants are in a cohort (a class, a programme, a neighbourhood = you have to apply it to everyone)
c) Carry-over concerns in a within-subjects design (when you can't counterbalance using a standard within-subjects design, use a quasi-experimental pre-post design to test one order, e.g., VR fear/neutral studies where the fear response would contaminate the neutral condition)
Note: earlier we looked at the true-experimental between-subjects pre-test/post-test design.
55
Evaluating Validity (pre-post test design) Internal validity External validity
Internal validity
o Better than an association study because the IV is manipulated (the direction of the effect is established).
o Risk of history, maturation, and regression to the mean (threats to internal validity without a control group).
o Reduce threats to internal validity by running a non-equivalent comparison group if possible (not a proper control group if not randomly assigned, and it can have self-selection effects, but it at least gives a comparison against which maturation, regression to the mean, and history effects can be assessed).
External validity
o Similar to an experiment (really only applicable to the sample I studied it in).
56
Interrupted Time Series (like a pre-post design, but better)
- We use it because it helps address random variation in scores that fluctuate erratically over time.
- For example, covid-19 cases can have an overall trend but go up and down erratically day to day. If we run a pre-post design on a variable that is highly unstable, the effect we see may not be due to the IV but just to random variation in the variable.
- If I looked at only two given time points, I could draw an inaccurate conclusion from the data. We need to look at the overall pattern before and after the intervention to determine its effectiveness.
Intervention research
*This is a common method used in these areas:
• Government policy (e.g., banning cell phones while driving)
• Organisational change (e.g., health care or education systems adopting new strategies and wanting to test their effectiveness)
• Management strategies
• Catastrophic events (e.g., comparing online and in-person teaching due to covid-19; natural disasters disrupting daily functioning psychologically and economically)
*It would be impractical, unethical or too expensive to test these with a true experiment, but we can still make strong causal claims using a quasi-experimental design.
Interrupted time series designs:
- We take a series of measurements (days, months in a row, or years of historical data), then introduce the IV, then take a post-intervention time series. Looking at the overall trend in the data pre-to-post lets us make stronger claims that the effect (IV-DV) is stable over time, as opposed to a normal pre-post design with one time point, the IV, then a second time point (DV).
Example: interrupted time series design, Spoelman et al. (2017)
- Goal: reducing the number of consultations people make with their physicians for easy-to-solve medical issues.
- They made a website for patients to get medical advice (FAQs) for minor, easy-to-solve medical issues.
- They measured the number of medical consultations (per 100 people) before and after the website was introduced.
- Post-intervention (2 years) there is a general decrease in the number of consultations (the trend reversed from increasing to decreasing).
- Collecting enough data pre and post washes out day-to-day variation, lets us see the overall trend in the data, and gives us more confidence that the IV caused the change in the DV.
*Useful where an experiment would have been unethical (withholding something from some patients when you believe it will be beneficial).
57
Evaluating Validity (Interrupted Time Series) Internal validity External validity
Internal validity
• Better than a single pre-test/post-test because it controls for random variation (the most internally valid of the quasi-experimental designs).
• The IV may be confounded with other factors, so alternative explanations are possible (because there is no control group, we do not know what it is about the website that causes the effect).
External validity
• Very high (these studies are almost always done in real-world situations, not in the lab, which is why a true experiment cannot be done).
• Still may not generalise to other contexts (it is specific to the real-world situation in which it was conducted; it would need to be replicated in other organisations or countries).
58
Assumptions of Parametric Tests
Assumptions of Parametric Tests
• Normality (normally distributed)
o The DV is normally distributed.
o The mean is a good estimate of the variable.
o If variables are not normally distributed, the means of the two groups are no longer good group estimates, which causes problems for the t or F test.
• Variance (equal SDs)
o The variance within each condition is similar (homogeneous).
o This affects the formula for the pooled variance (t) or the SS residual (ANOVA).
o When calculating the denominator, we assume the variances are roughly equal and can be pooled. If they are not equal, the pooled variance is not a good estimate of the average variance.
• Sphericity
o Only applies to repeated measures ANOVA (within-subjects variables with more than 2 levels).
o Like the variance assumption, but based on the variance of the difference scores between conditions, not the variance within each condition.
59
Violating homogeneity of variance
• The Student t-test has one measure of variance (s², the pooled variance from both groups). When the group variances are unequal, averaging them is inappropriate because the pooled value doesn't reflect either group well; it can make the estimate too small and not reflect the actual SE.
• Welch's t-test is more conservative; its formula uses the s² (variance) of each group separately rather than pooling them.
• Note: these assumptions only apply to between-subjects comparisons (two sets of variance). A paired t-test analyses the differences between conditions 1 and 2, so there is only one set of variance.
• In Jamovi, you can run both the Student t and Welch's t; if the assumption is violated, use the information from Welch's t. The key difference is that the df will be noticeably smaller (more conservative) with Welch's than with the Student t-test.
• The error in using a Student t (without correction) when the SDs are not equal is that the SE will be artificially small, the t artificially big, and the result more false positives. Violating assumptions increases the false positive rate, so we penalise ourselves by reducing the df, which reshapes the sampling distribution and the p-value. Some people think we should just use Welch's t all the time, since it doesn't affect the p-value much when the assumption holds and it keeps us safe.
• Welch's t is therefore more conservative, and less likely to produce a spurious significant effect.
*The t values and df differ between the two tests. (A hedged sketch follows below.)
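A minimal sketch of the two tests side by side, assuming scipy and invented scores with unequal spread:

```python
from scipy import stats

# Hypothetical scores with unequal variances (values invented for illustration)
group_a = [12, 14, 11, 13, 12, 15]
group_b = [10, 25, 3, 18, 7, 22]

# Student's t pools the two variances; Welch's t (equal_var=False) does not
print(stats.ttest_ind(group_a, group_b, equal_var=True))   # Student's t
print(stats.ttest_ind(group_a, group_b, equal_var=False))  # Welch's t (smaller df)
```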
60
Violating Sphericity
Repeated-measures ANOVA with 3+ levels
- In a paired t-test, we compare the difference between conditions A and B for each participant. We end up with one group of difference scores, from which we can calculate a mean difference and SD. This means we cannot violate homogeneity of variance, because there is only one set of difference scores and nothing for it to be homogeneous with.
- In a repeated-measures ANOVA with 3+ levels, we now have three sets of difference scores (A−B, B−C, A−C) with three different variances, which need to be roughly equal to one another (a "difference of differences" assumption). See the sketch below.
- In Jamovi we ask for the sphericity test, the Greenhouse-Geisser correction (applied if sphericity is violated), and the homogeneity test (for any between-subjects variables).
- The Greenhouse-Geisser correction makes the df smaller (more conservative): e.g., df = 2 becomes 1.76 (1.76/2 = a penalisation factor of .88).
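A sketch of what the sphericity assumption is looking at: the variances of the pairwise difference scores in a 3-level repeated-measures design. The data and the size of the epsilon value are invented for illustration.

```python
# Sphericity concerns the variances of the pairwise difference scores (A-B, B-C, A-C).
import numpy as np

rng = np.random.default_rng(2)
n = 20
a = rng.normal(10, 2, n)
b = a + rng.normal(1, 1, n)          # small, consistent change from A
c = a + rng.normal(2, 4, n)          # larger, noisier change from A

# Sphericity requires these three variances to be roughly equal.
print(np.var(a - b, ddof=1), np.var(b - c, ddof=1), np.var(a - c, ddof=1))

# A Greenhouse-Geisser correction multiplies both df terms by epsilon (<= 1),
# e.g. df = 2 becomes 2 * 0.88 = 1.76 when epsilon = .88.
```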
61
If we violate normality…
When our data are not normally distributed (e.g., skewed) we cannot run parametric tests (t/F tests). Why?
Normality assumptions
• Parametric tests are based on comparisons of means, which assumes the mean is a good estimate of the group average. If the data are skewed, the mean is pulled toward the tail (+/−), producing artificially big or small group differences; outliers warp the means and therefore the t and p-values.
• Using the mean to represent a group or condition assumes the mean is a good estimate.
Solution: use non-parametric (ranked) tests
• Based on medians, not means (the median is better because half of participants' scores fall above it and half below; it is based on participants, not on the size of their scores).
• Various ways of ranking individual data points, to determine whether high or low scores are more likely in one condition than another (a clear split in the data).
• More conservative than parametric tests (less likely to give a false positive).
• Less powerful than parametric tests (we are throwing out the individual scores and just focusing on high/low without looking at how different they are).

(A) Mann-Whitney U test
1. Non-parametric alternative to the independent t-test.
2. Rank-order all the participants' RTs.
3. Null hypothesis: all RTs are equally likely to come from either condition.
4. Research hypothesis: the faster RTs are more likely to come from one condition, and the slower RTs from the other (under the null, half the participants above the median come from group A and half below from group B; what is the probability of finding an 80-20 split?).
5. Use for non-normal data, especially if N is small.
*We lose power.
*There are no df for non-parametric tests, because df describe the sampling distribution, which non-parametric tests do not use (they rank the data and look at the probability that people fall in one group or the other).

(B) Wilcoxon Signed-Rank test
1. Non-parametric alternative to a paired t-test (e.g., comparing Condition A to Condition B within subjects).
2. For each participant, sort them into two groups based on whether they score higher in Condition A or Condition B (if half the participants are better at A and half at B, there is no difference between conditions; if we see an 80-20 split favouring A, it is highly unlikely that the between-condition difference is due to chance, and we reject the null hypothesis).
3. Null hypothesis: two equal groups (people are equally likely to be better in Condition A than Condition B).
4. Research hypothesis: unequal groups (more people with A > B than B > A).
*For a paired t-test the normality assumption applies to the difference scores, not the raw data (for an independent t-test it applies to the raw variables).
*No df for non-parametric tests because they do not use a sampling distribution; they rank the data.
*We would report it as, e.g., W = 3.50, p = .008, ηp² = .873.

(C) Kruskal-Wallis test (alternative to one-way ANOVA)
- Comparing 3+ means, between subjects.
- Run the normal ANOVA first.
- If we've violated the normality assumption (e.g., small n or skewed data), rerun the test as a Kruskal-Wallis non-parametric ANOVA.
- Uses a chi-square test statistic (χ²).
- Report it as, e.g., χ²(2) = 11.40, p = .003.

Post hocs
- A significant omnibus test tells us there is a difference somewhere, but not which group means differ, so we use post hocs (with or without correction, depending on whether we have a directional prediction).
- Non-parametric one-way ANOVAs use Dwass-Steel-Critchlow-Fligner pairwise comparisons.

(D) Friedman test (alternative to repeated-measures ANOVA)
- Uses a chi-square (χ²) test statistic.
- Pairwise comparisons (i.e., post hocs) use the Durbin-Conover procedure.
- If df = 2, there are 3 groups (df = number of groups − 1).
A sketch of running these four tests follows below.
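A sketch of the four non-parametric tests named above, using scipy. The data are simulated, skewed "reaction times" and the variable names are mine, purely for illustration.

```python
# The four non-parametric alternatives, run on simulated skewed RT data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
rt_a = rng.exponential(400, 25) + 200      # condition A (positively skewed RTs)
rt_b = rng.exponential(450, 25) + 200      # condition B
rt_c = rng.exponential(500, 25) + 200      # condition C

# (A) Two independent groups -> Mann-Whitney U
print(stats.mannwhitneyu(rt_a, rt_b))

# (B) Two paired conditions -> Wilcoxon signed-rank (same participants in A and B)
print(stats.wilcoxon(rt_a, rt_b))

# (C) 3+ independent groups -> Kruskal-Wallis (chi-square test statistic)
print(stats.kruskal(rt_a, rt_b, rt_c))

# (D) 3+ repeated measures -> Friedman
print(stats.friedmanchisquare(rt_a, rt_b, rt_c))
```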
62
Summary: 4 Parametric Tests and their Non-Parametric Substitutes
(A) Independent t-test → Mann-Whitney U test
(B) Paired t-test → Wilcoxon Signed-Rank test
(C) One-way ANOVA (between) → Kruskal-Wallis test
(D) One-way repeated-measures ANOVA → Friedman test
63
What about factorial ANOVA? We typically ignore violations of normality in these cases.
• ANOVA is robust to minor violations of normality (we can ignore them; it is not surprising that one group is slightly different from the others, and once the groups are averaged over it shouldn't matter much; a 2 × 2 design has 4 groups).
• Transformations: if there are big violations of normality we can transform the data (there is no non-parametric test for a factorial ANOVA).

Transformations
- Many options! Different transformations are possible, depending on the shape of your original distribution.
- A common option for a heavily positively skewed distribution is a logarithmic transformation. Take the log (base 10) of each value: 10 becomes 1 (10^1), 100 becomes 2 (10^2), 1000 becomes 3 (10^3), and so on. Extreme values in the tail get pulled in much closer to the middle of the distribution, making it look more like a normal distribution. The transformation acknowledges that these values are different, just not different enough to justify pulling the mean that far out. Note that the result is a transformed mean, not the actual mean; we care less about the specific value than about the effect of the IV on the DV between the two groups (the mean is pulled larger by a positive skew and smaller by a negative skew).
- We commonly use logarithmic transformations with EEG data in the lab. The extreme scores still sit in the high end of the distribution, just closer to the mean.
- Inverse transformation: put each individual score under 1 (e.g., 1/850) to scrunch the tail in even closer to the middle of the distribution.
- Exponents are another option.
- The type of transformation you use depends on the problem with your data! A logarithmic transformation is appropriate for positive (+) skew, not negative (−) skew; if the problem is a tail of low scores, a log will only make it worse.
*Find an expert source to help decide which transformation is best for your data. A small sketch follows below.
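A sketch of a log10 transformation applied to a simulated, positively skewed variable, showing how it reduces the skew. The lognormal data and the skew statistic are just one way to illustrate the idea.

```python
# Logarithmic (and inverse) transformation of a positively skewed variable.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
raw = rng.lognormal(mean=3, sigma=1, size=200)   # heavy positive skew

log_scores = np.log10(raw)                        # 10 -> 1, 100 -> 2, 1000 -> 3

print("skew before:", stats.skew(raw))
print("skew after: ", stats.skew(log_scores))

# An inverse (1/x) transformation is an even stronger option for extreme skew.
inverse_scores = 1 / raw
```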
64
Take-home messages
1. Check your assumptions.
2. Use corrections for variance- and sphericity-related violations.
3. Use non-parametric tests for violations of the normality assumption.
4. Use transformations when non-parametric tests aren't available (i.e., factorial designs).
5. Ignore minor violations in factorial and repeated-measures ANOVA (false-positive rates are not really affected).
6. Get help! (People specialise in understanding which correction is best for different data problems; go ask them.)
65
Types of scales (4) Types of variables (2)
Nominal (categorical)
- no link between categories
- cannot average, or say bigger or smaller = discrete
- e.g., gender, eye colour

Ordinal (ranked categorical)
- there is a natural, meaningful way to rank/order the categories
- e.g., position in a race; questionnaire items that gradually increase in extremity
- cannot average them

Interval (continuous)
- the numerical value is genuinely meaningful
- differences between intervals/scores are meaningful
- e.g., temperature in degrees
- addition, subtraction and averaging are meaningful, but ratios are not, because 0 is not meaningful

Ratio (interval with a true zero)
- 0 is meaningful = absence of the variable
- scores/numbers are meaningful
- can multiply and divide
- e.g., reaction time

Types of variables:
Discrete
- there is nothing in between two points/scores
- e.g., year you went to school
- can be nominal, ordinal, interval or ratio

Continuous
- interval and ratio scales
- a variable where, given any two values, it is logical for there to be a value in between
- e.g., RT; temperature in degrees (interval, e.g., Fahrenheit); number of true/false answers correct on a test (ratio); Likert scale (interval)
66
If we square the t-test statistic we get
If we square the t-test statistic, we get the F statistic for the ANOVA run on the same data (t² = F). See the sketch below.
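A quick check of the t² = F relationship on simulated two-group data; any two-group dataset would show the same thing.

```python
# Demonstrating that t^2 from a Student t-test equals F from a one-way ANOVA on two groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
g1 = rng.normal(10, 2, 15)
g2 = rng.normal(12, 2, 15)

t, _ = stats.ttest_ind(g1, g2)      # Student's t (equal variances assumed)
f, _ = stats.f_oneway(g1, g2)       # one-way ANOVA on the same two groups

print(t ** 2, f)                    # the two values match (up to rounding)
```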
67
Graphs have two primary functions:
Graphs have two primary functions: to help us understand our own data or to communicate our findings to the public
68
Reading: histogram and box plots
Histograms are the simplest graph; they work best with interval or ratio data and give you an overall impression of the variable. Their strength is that they show the entire spread of the data; their weaknesses are that their shape depends on the number of bins used and that they are not compact. They are not helpful for nominal data.

Boxplots (box-and-whisker plots) also work best for interval and ratio data. They include a visual presentation of the median, IQR and range of the data, and their shape is not influenced by a choice of bins (an advantage over histograms). They are compact, useful for exploratory analysis of your own data, and a good way to identify outliers. A plotting sketch follows below.
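A sketch of the two plot types side by side, using matplotlib on simulated scores; the data and figure layout are illustrative only.

```python
# Histogram vs boxplot of the same simulated scores.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
scores = rng.normal(70, 10, 200)

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(scores, bins=20)        # shape depends on the number of bins
ax1.set_title("Histogram")
ax2.boxplot(scores)              # median, IQR, range and outliers in one compact plot
ax2.set_title("Boxplot")
plt.show()
```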
69
Reading: Null hypothesis Testing
The goal of a hypothesis test is not to show that the alternative hypothesis is (probably) true; the goal is to show that the null hypothesis is (probably) false. Like a court trial, the null hypothesis is deemed true until we find sufficient evidence to prove beyond a reasonable doubt that it is false.

The goal behind statistical hypothesis testing is not to eliminate errors, but to minimise them; there will always be error. If we reject a null hypothesis that is actually true, we have made a Type I error. On the other hand, if we retain the null hypothesis when it is in fact false, we have made a Type II error.

The single most important design principle of the test is to control the probability of a Type I error, keeping it below some fixed probability. This probability, denoted α, is called the significance level of the test. To repeat, because it is central to the whole set-up: a hypothesis test is said to have significance level α if the Type I error rate is no larger than α.

What about the Type II error rate? We would like to keep that under control too, and we denote its probability by β. However, it is much more common to refer to the power of the test, the probability of rejecting the null hypothesis when it really is false, which is 1 − β. A "powerful" hypothesis test is one that has a small value of β while still keeping α fixed at some small desired level (e.g., .05).

An aside regarding the language you use to talk about hypothesis testing: one thing you really want to avoid is the word "prove". A statistical test doesn't prove that a hypothesis is true or false. Proof implies certainty and, as the saying goes, statistics means never having to say you're certain.
70
Reading: Test statistics and sampling distributions
We calculate a test statistic from our sample and compare it to its corresponding sampling distribution (the values we would expect if the null hypothesis were true). If our test statistic falls in the tail of the sampling distribution (within the rejection region), it is unlikely that the null hypothesis produced our results. The p-value is the probability, assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the one we observed if we replicated the study.
* This says nothing about proving the null wrong or the research hypothesis right.
* To fall in the tail, the test statistic has to be very big or very small (the rejection region is 5% in one tail for a one-tailed hypothesis, or 2.5% in each tail for a two-tailed hypothesis).
"Statistically significant" simply means we have enough evidence to reject the null and conclude there is a significant difference. It doesn't tell us how big or how important the finding is in practice. It doesn't tell us whether our study was "good". It doesn't tell us the probability that the null is true.
We don't usually talk in terms of minimising Type II errors. Instead, we talk about maximising the power of the test; since power is defined as 1 − β, this is the same thing.
71
Reading: Comparing two means
The +/− sign of the test statistic is arbitrary (it depends on which group is entered first). The main difference between the two forms of the test is that the standard error calculations are different: if the two populations have different standard deviations, it makes no sense to calculate a pooled standard deviation estimate, because you are averaging apples and oranges.

[Table 11.1 in the reading gives a (very) rough guide to interpreting Cohen's d.] The recommendation is not to use such benchmarks blindly. The d statistic has a natural interpretation in and of itself: it re-describes the difference in means as the number of standard deviations that separates those means, so it is generally a good idea to think about what that means in practical terms. In some contexts a "small" effect could be of big practical importance; in other situations a "large" effect may not be all that interesting.

In statistical jargon, tests that avoid the normality assumption are nonparametric tests. While avoiding the normality assumption is nice, there is a drawback: the Wilcoxon test is usually less powerful than the t-test (i.e., it has a higher Type II error rate).

An independent samples t-test is used to compare the means of two groups, and tests the null hypothesis that they have the same mean. It comes in two forms: the Student test (Section 11.3), which assumes the groups have the same standard deviation, and the Welch test (Section 11.4), which does not.
A paired samples t-test is used when you have two scores from each person and you want to test the null hypothesis that the two scores have the same mean. It is equivalent to taking the difference between the two scores for each person and then running a one-sample t-test on the difference scores (Section 11.5). A sketch of this equivalence follows below.
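A sketch verifying the equivalence just described: a paired t-test gives the same result as a one-sample t-test on the difference scores. The pre/post data are simulated for illustration.

```python
# Paired t-test == one-sample t-test on the difference scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
pre = rng.normal(100, 15, 30)
post = pre + rng.normal(3, 5, 30)    # each person shifts by ~3 points

paired = stats.ttest_rel(pre, post)
one_sample = stats.ttest_1samp(pre - post, 0)

print(paired)       # identical t and p to the line below
print(one_sample)
```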
72
What test do we do if we have a categorical DV?
A chi-square (χ²) test. See the sketch below.
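A minimal sketch of a chi-square test of independence on a 2×2 contingency table; the counts are made up (e.g., condition by pass/fail), purely for illustration.

```python
# Chi-square test of independence on a 2x2 table of made-up counts.
from scipy import stats

observed = [[30, 10],    # group A: 30 pass, 10 fail
            [18, 22]]    # group B: 18 pass, 22 fail

chi2, p, df, expected = stats.chi2_contingency(observed)
print(chi2, p, df)
```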