8 The replication crisis and the open science movement (flashcards)
Where did the idea of a replication crisis come from?
Large-scale replication studies found that:
- the mean effect size (r) of the replication effects (M_r = 0.197, SD = 0.257) was half the magnitude of the mean effect size of the original effects (M_r = 0.403, SD = 0.188), representing a substantial decline
- ninety-seven percent of the original studies had significant results (p < .05), but only thirty-six percent of the replications did
Why do we have a replication crisis?
problematic practices: selective reporting, selective analysis, insufficient specification of the conditions necessary or sufficient to obtain the results
publication bias, …
⇒ understanding is achieved through multiple, diverse investigations
a replication by itself only provides evidence for the reliability of a result
alternative explanations, … can account for diminished reproducibility
⇒ cultural practices in scientific communication
low-power research designs
publication bias
⇒ Reproducibility is not well understood because the incentives for individual scientists prioritize novelty over replication
What is predictive of replication success?
the strength of the initial evidence, rather than characteristics of the team conducting the research
What is “evaluating replication effect against null hypothesis of no effect”?
does the replication show a statistically significant effect in the same direction as the original study?
treating the 0.05 threshold as a bright-line criterion between replication success and failure is a key weakness of this method
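A minimal Python sketch of this criterion; all numbers (r_orig, r_rep, n_rep) are made-up illustration values, not data from any actual study:

```python
# Hypothetical numbers only: did the replication reach p < .05 in the same
# direction as the original effect?
import numpy as np
from scipy import stats

r_orig = 0.40            # original correlation
r_rep, n_rep = 0.15, 80  # replication correlation and sample size

# t-test of the replication correlation against r = 0
t = r_rep * np.sqrt((n_rep - 2) / (1 - r_rep**2))
p = 2 * stats.t.sf(abs(t), df=n_rep - 2)

same_direction = np.sign(r_rep) == np.sign(r_orig)
print("replicated by this criterion:", (p < 0.05) and same_direction)
```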
What is done if you evaluate the replication effect against the original effect size?
is the original effect size within the 95% CI of the effect size estimate from the replication?
-> precision of effect, not only direction
-> size, not only direction
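A short sketch of this check with hypothetical numbers, using the Fisher z transform to build a 95% CI around the replication correlation:

```python
# Hypothetical values: is the original r inside the replication's 95% CI?
import numpy as np

r_orig = 0.40
r_rep, n_rep = 0.15, 80

z = np.arctanh(r_rep)            # Fisher z of the replication estimate
se = 1 / np.sqrt(n_rep - 3)      # standard error of z
lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)

print(f"replication 95% CI: [{lo:.2f}, {hi:.2f}]")
print("original effect inside CI:", lo <= r_orig <= hi)
```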
What is done if you compare original and replication effect sizes for cumulative evidence?
descriptive comparison of effect sizes - does not provide info about the precision of either estimate or resolution of the cumulative evidence for the effect
→ solution: computing a meta-analytic estimate that combines the original and replication results
One qualification about this result is the possibility that the original studies have inflated effect sizes due to publication, selection, reporting, or other biases
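A sketch of a simple fixed-effect meta-analytic estimate combining an original and a replication correlation (inverse-variance weighting on the Fisher z scale; all numbers are illustrative):

```python
# Combine two correlations into one cumulative estimate (hypothetical values).
import numpy as np

effects = [(0.40, 50), (0.15, 80)]            # (r, n) for original and replication

zs = np.arctanh([r for r, _ in effects])      # Fisher z transform of each r
ws = np.array([n - 3 for _, n in effects])    # inverse-variance weights, var(z) = 1/(n - 3)

z_meta = np.sum(ws * zs) / np.sum(ws)
se_meta = 1 / np.sqrt(np.sum(ws))
lo, hi = np.tanh(z_meta - 1.96 * se_meta), np.tanh(z_meta + 1.96 * se_meta)

print(f"meta-analytic r = {np.tanh(z_meta):.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```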
Is replication the real problem?
meta-analyses show that most findings are replicated
The real problem is not a lack of replication; it is the distortion of our research literatures caused by publication bias and questionable research practices.
What do the researchers argue for? What is the real problem in psychological research?
(a) studies in most areas are replicated;
(b) failure to replicate a study is usually not evidence against the initial study’s conclusions;
(c) an initial study with a nonsignificant finding requires replication;
(d) a single study can never answer a scientific question;
(e) the widely used sequential study research program model does not work;
(f) randomization does not work when sample sizes are small.
What different types of replication exist?
(a) literal replication—the same researcher conducts a new study in exactly the same way as in the original study;
(b) operational replication—a different researcher attempts to duplicate the original study using exactly the same procedures (also called direct replication); and
(c) systematic replication—a different researcher conducts a study in which many features of the original study are maintained but some aspects (e.g., type of subjects or measures used) are changed (also called conceptual replication)
What are common errors in thinking about replication?
- believing that a replication should be interpreted in a stand-alone manner
→ this ignores statistical power
average statistical power in psychological literatures ranges from .40 to .50
(the likelihood that a test will detect an effect of a certain size if there is one)
Note that if confidence intervals (CIs) were used instead of significance tests, there would be far fewer "failures to replicate", because the CIs would often overlap, indicating no conflict between the two studies (see the sketch after this card). Research in meta-analysis has shown that no single study can answer any question.
sampling error = the difference between an estimate of a population parameter and the actual value of the population parameter that the sample is intended to estimate; other sources of error include measurement error, range variation, imperfect construct validity of measures, artificial dichotomization of continuous measures, and others
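A tiny sketch of the CI point above: two hypothetical studies of the same mean difference, one "significant" and one not, whose 95% CIs nonetheless overlap (normal approximation, made-up numbers):

```python
import numpy as np

def mean_diff_ci(diff, sd, n_per_group):
    """Approximate 95% CI for a two-group mean difference (normal approximation)."""
    se = sd * np.sqrt(2 / n_per_group)
    return diff - 1.96 * se, diff + 1.96 * se

print("original:   ", mean_diff_ci(0.50, sd=1.0, n_per_group=80))  # excludes 0 -> "significant"
print("replication:", mean_diff_ci(0.30, sd=1.0, n_per_group=40))  # includes 0 -> "nonsignificant"
# The two intervals overlap heavily, so the studies do not actually conflict.
```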
What about replicability of non-significant findings?
= a nonsignificant finding is usually interpreted as the absence of a relationship
→ this interpretation is unjustified
→ do nonsignificant findings not need replication?
⇒ should be followed up with additional studies
In fact, given typical levels of statistical power, a relation that shows consistent nonsignificant findings may be real.
Richard (2003) - average effect size in social psychology is d = .40
(based on >300 meta-analyses)
median sample size in psychology is only 40
-> with such power, about half of the studies should report significant and half nonsignificant findings
-> not the pattern we see
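A sketch of why this follows, using statsmodels to compute the power of a two-sample t-test for d = .40 at small sample sizes (the per-group n values are assumptions, since the "40" in the note may refer to total N):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n_per_group in (20, 40, 64):
    power = analysis.power(effect_size=0.40, nobs1=n_per_group, alpha=0.05)
    print(f"n per group = {n_per_group:3d} -> power ~ {power:.2f}")
# With power in the .40-.50 range, roughly half of studies of a real effect
# will come out nonsignificant purely because of sampling error.
```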
What are biases in the published literature?
- research fraud
- publication bias, source bias
- biasing effects of questionable research practices -> QRPs (most severe in laboratory experimental studies)
highest admission rate in social psychology (~40%)
(a) adding subjects one by one until the result is significant, then stopping;
(b) dropping studies or measures that are not significant;
(c) conducting multiple significance tests on a relation and reporting only those that show significance (cherry picking);
(d) deciding whether to include data after looking to see the effect on statistical significance;
(e) hypothesizing after the results are known (HARKing); and
(f) running a lab experiment over until you get the “right” results.
- limitations of random assignment
(claimed superiority of experimental studies)
randomization does not work if samples are not large, which is extremely rare
small randomized sample sizes produce neither equivalent groups nor groups representative of the population of interest
What approach should be taken to detect QRPs?
The frequency of statistical significance in some literatures is suspiciously high given the level of statistical power in the component studies
statistical power has not increased since Cohen first pointed it out in 1962
low power → nonsignificant findings → difficult to publish
→ researchers avoid this consequence by using QRPs
⇒ the result is an upward bias in mean effect sizes and a downward bias in the variability across effect sizes, due to the unavailability of low-effect-size studies
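A small simulation sketch of the bias just described (illustrative parameters only): many low-powered studies of a modest true effect are run, only the significant ones are "published", and the published effects come out inflated and less variable than the full set.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n, n_studies = 0.20, 30, 5000   # illustrative values

all_d, published_d = [], []
for _ in range(n_studies):
    a = rng.normal(true_d, 1, n)        # "treatment" group
    b = rng.normal(0.0, 1, n)           # "control" group
    d = (a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    all_d.append(d)
    if stats.ttest_ind(a, b).pvalue < 0.05:   # only significant studies get "published"
        published_d.append(d)

print(f"all studies:    mean d = {np.mean(all_d):.2f}, SD = {np.std(all_d):.2f}")
print(f"published only: mean d = {np.mean(published_d):.2f}, SD = {np.std(published_d):.2f}")
```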
What is false-positive psychology research practice?
despite empirical psychologists’ nominal endorsement of a low rate of false-positive findings (≤ .05),
flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates
false positive = incorrect rejection of a true null hypothesis (detecting an effect or difference when there is none)
What are researcher degrees of freedom?
- it is common for researchers to explore various analytic alternatives and report only “what worked”
ambiguity in how to best make a decision
desire to find statistically significant results
→ self-serving justifications
(highly subjective, variable across replications)
- flexibility in choosing among dependent variables
- choosing sample size
- using covariates
- reporting subsets of experimental conditions
What can be said about the influence of this flexibility on false-positive rates?
⇒ flexibility in analyzing two dependent variables (correlated at r = .50) nearly doubles the probability of obtaining a false-positive finding (see the sketch after this card)
⇒ adding 10 more observations when the findings are not yet significant doubles the probability as well
⇒ controlling for gender or for the interaction of gender with treatment produces a false-positive rate of 11.7%
⇒ a combination of all of these practices would lead to a false-positive rate of 61%
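A Monte Carlo sketch of the first result above (not the original authors' code; parameters are illustrative): under a true null, two dependent variables correlated at r = .50 are both tested, and a "finding" is counted if either test reaches p < .05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, sims = 20, 10000
cov = [[1.0, 0.5], [0.5, 1.0]]          # two DVs correlated at r = .50

false_pos = 0
for _ in range(sims):
    a = rng.multivariate_normal([0, 0], cov, n)   # "treatment" group, no true effect
    b = rng.multivariate_normal([0, 0], cov, n)   # "control" group
    p1 = stats.ttest_ind(a[:, 0], b[:, 0]).pvalue
    p2 = stats.ttest_ind(a[:, 1], b[:, 1]).pvalue
    false_pos += (p1 < 0.05) or (p2 < 0.05)

print(f"false-positive rate with two DVs: {false_pos / sims:.3f} (nominal rate: .05)")
```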
What is the main problem of this flexibility?
researchers often decide when to stop data collection on the basis of interim data analysis
effects that are significant in a small sample are not necessarily significant in a larger one
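A sketch of this problem under a true null (batch size and cap are illustrative assumptions): start with 10 observations per group, test, and keep adding 10 more whenever the result is not yet significant, up to 50 per group.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sims, step, max_n = 10000, 10, 50

false_pos = 0
for _ in range(sims):
    a, b = rng.normal(size=step), rng.normal(size=step)   # no true effect
    while True:
        p = stats.ttest_ind(a, b).pvalue
        if p < 0.05 or len(a) >= max_n:
            break
        a = np.concatenate([a, rng.normal(size=step)])    # peek, then collect more
        b = np.concatenate([b, rng.normal(size=step)])
    false_pos += p < 0.05

print(f"false-positive rate with interim peeking: {false_pos / sims:.3f}")
```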
What requirements for authors do the researchers suggest?
- must decide the rule for terminating data collection before it begins
- at least 20 observations per cell
- list all variables collected in a study
- report all experimental conditions, including failed manipulations
- if observations are eliminated, authors must report what the statistical results would be if those observations were included
- if an analysis includes a covariate, report the results with and without the covariate
What guidelines for reviewers do the researchers suggest?
- ensure the authors follow the requirements
- be more tolerant of imperfections in results
- demonstrate that results do not hinge on arbitrary analytic decisions
- require the authors to conduct an exact replication if justifications of data collection or analysis are not compelling
What is the open science movement?
if all elements of an experiment are completely accessible and clearly documented, then this
(1) increases the level to which exact replications can be conducted, and
(2) reduces the likelihood of researchers using questionable practices in their research, for example, it reduces the likelihood of p-hacking.
a collection of several research practices emphasizing openness, transparency, rigor, reproducibility, replicability, and accumulation of knowledge
What is p-hacking?
Data dredging (also known as data snooping or p-hacking) is the misuse of data analysis to find patterns in data that can be presented as statistically significant, thus dramatically increasing the risk of false positives while understating it
e.g. Selective reporting of significant results from a series of hypothesis tests with different dependent variables
What are some guidelines to ensure open science practice?
“good enough practice in scientific computing”
preregistration - arising from the need to promote purely confirmatory research and transparently demarcate exploratory research
→ counters cognitive biases, particularly confirmation and hindsight bias, and the pressure to publish large quantities of predominantly positive results
“almost no psychological research is conducted in a purely confirmatory fashion”
The solution lies in preregistration: researchers committing to the hypotheses, study design, and analyses before the data are accessible. In their paper, Wagenmakers et al. present an exemplary preregistered replication as an illustration of this practice.
making replication mainstream - a finding needs to be repeatable to count as a scientific discovery
teaching open science
What else is important to ensure good research practice in psychology?
correct statistical knowledge and correct reporting of models, hypotheses, and tests
What is the current understanding of a statistical model?
statistical model = a mathematical representation of data variability; a complex web of assumptions
- often contains unrealistic or unjustified assumptions
- defining the scope of a model: it should be a good representation of the observed data and of hypothetical alternative data that might have been observed
- the model is usually presented in highly compressed and abstract form
one assumption in the model is a hypothesis that a particular effect has a specific size and has been targeted for statistical analysis → the study hypothesis
Much statistical teaching and practice has developed a strong (and unhealthy) focus on the idea that the main aim of a study should be to test null hypotheses