M-side Flashcards
(38 cards)
Colquitt et al. (2019)
content validation
Content validation guidelines: Evaluation criteria for definitional correspondence and definitional distinctiveness
Definitional CORRESPONDENCE: Degree that scale items align with the target construct’s definition
Definitional DISTINCTIVENESS: Degree that scale items better reflect the focal construct than closely related (“orbiting”) constructs
Two main approaches for content validation
- Anderson & Gerbing (1991)
- Sorting-Based (ChatGPT-like logic)
- SMEs sort items into the construct they believe best represents the item’s meaning.
- Goal: Maximize agreement on correct classification.
Metrics:
- Proportion of Substantive Agreement (PSA) = CORRESPONDENCE = proportion of judges who assign the item to its intended construct
- Substantive Validity Coefficient (CSV) = DISTINCTIVENESS = difference between the proportion assigning the item to the intended construct and the highest proportion assigning it to any other construct (formula sketch after this list)
- Hinkin & Tracey (1999)
- Rating-Based (Human-like logic)
- JUDGES rate how well each item reflects each construct definition using Likert scales (1–5).
Metrics:
- Hinkin-Tracey Correspondence (HTC) = CORRESPONDENCE = average rating for focal construct ÷ number of anchors
- Hinkin-Tracey Distinctiveness (HTD) = DISTINCTIVENESS = average difference between focal and orbiting construct ratings ÷ (anchors – 1)
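Formula sketch for the four indices (standard forms; notation assumed here, not quoted from Colquitt et al.: N = number of judges, n_c = judges assigning the item to its intended construct, n_o = highest number assigning it to any other construct, A = number of scale anchors, r̄ = average rating):
```latex
p_{sa} = \frac{n_c}{N}, \qquad
c_{sv} = \frac{n_c - n_o}{N}, \qquad
\mathit{htc} = \frac{\bar{r}_{\text{focal}}}{A}, \qquad
\mathit{htd} = \frac{\bar{r}_{\text{focal}} - \bar{r}_{\text{orbiting}}}{A - 1}
```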
When modifying scales, the most common modification was dropping items from the scale (36.94%); however, modified scales should also be tested for convergent validity, discriminant validity, and comparative CFAs (Cortina et al., 2020)
Podsakoff et al. (2016)
Concept definitions
Recommendations for creating better concept definitions in social sciences
Lack of good conceptual definitions has been a longstanding problem in social sciences.
Clear conceptual definitions are essential for scientific progress
Main point: ensure the final version of conceptual definition is clear, concise, understandable to broad audiences, and not subject to multiple interpretations.
CONCEPTS
- Cognitive symbols (or abstract terms) that specify the features, attributes, or characteristics of the phenomenon in the real or phenomenological world that they are meant to represent and that distinguish it from other related phenomena
- Serve as the building blocks of theory
Problems with Lack of Conceptual CLARITY:
- Difficult to distinguish focal concept from other similar concepts, undermining discriminant validity
- Leads to proliferation of different terms for the same concept – “Old wine new bottle” problem [personal note: especially in leadership research]
- Difficult to specify and test the nomological network of the concept
- Difficult to operationally measure the concept – i.e., mismatch between concept and measures of it, undermining construct validity.
- Also increases the likelihood of contamination or deficiency of the conceptual measurement.
Recommendations: stages may overlap
(1) Identify potential attributes by collecting representative set of definitions
- Activities to aid in collecting attributes: searching dictionary for synonyms/antonyms, surveying literature, interviewing subject matter experts, focus groups, case studies, comparing the concept with its opposite, thinking how to operationalize the concept
(2) Organize potential attributes by theme and identify necessary/sufficient and shared ones
- Identifying underlying themes can help clarify the concept and distinguish from other related constructs.
- Need to identify attributes of the concept that are necessary and jointly sufficient.
- Necessary (essential) = essential properties that all exemplars of the concept must possess.
- Sufficient (unique) = properties that only exemplars of the concept possess.
(3) Develop preliminary definition of concept
- describe the general nature of the conceptual domain by specifying the property the construct represents and the entity that it applies to
- When the concept is multidimensional, each dimension or facet should be explicitly and clearly defined
- The conceptual definition should also specify whether the concept is stable over time and generalizable across situations
- The concept should be distinguished from other concepts (e.g., attributes unique to focal concept) to reduce construct proliferation.
- Identifying antecedents and consequences can help clarify the definition, but it shouldn’t be the sole definition
- Avoid tautological statements - definition that simply restates in different words the thing that is being defined.
(4) Refine the definition of the concept
- Revise the definition as needed (e.g., SME reviews, ask “what do you mean by that?”)
Zickar (2020)
Measurement annual review
Measurement development and evaluation
Before item writing, decide on the most appropriate format for the items and review item-writing best practices
Write significantly more items than needed
For negatively-worded items, there is much debate about whether they should be included or not. If you want to maximize unidimensionality, items should be all scored in the same direction. Reverse-coded items may introduce unintended method factors and tend to have smaller discrimination
Ambiguous items tend to perform worse psychometrically
- Double-barreled items confuse respondents overall and should be avoided
- For example: "Employers should be allowed to use urine drug tests but not hair follicle tests" (disagreement is ambiguous: it could mean both tests are acceptable or that neither is).
Measurement evaluation frameworks
- CTT: simplest, works well with smaller samples, item-total correlation is useful in EARLY scale development for weeding out bad items; but no way to determine model fit and assumes linear relationship
- EFA: good for inductive, and early stages; but often not replicable (generally assumes linear relationship)
- CFA: specific fit evaluations, useful once basic structure is clarified; but results often hypersensitive to wording of items (generally assumes linear relationship)
- IRT: detailed stats on how items individually function, understanding of response process; but requires large N and complex statistical programs
Cho (2016)
reliability
Making reliability reliable: A systematic approach to reliability coefficients
All four test models assume that true scores from one test are linearly related to the true scores from another test. They also assume the unidimensionality of items. Each model has different basic assumptions on measurement.
Parallel model
Assumes indicators of a given factor have equal loadings and equal error variances.
Assumes that true scores from different tests are equal when the same individual is involved.
The tests have intercepts of 0 and slope (coefficient) of 1 when linking the two true scores
→ test-retest reliability; alternate form
Tau-equivalent model
Assumes indicators of a given factor have equal loadings but differing error variances.
Assumes that true scores from different items are equal when the same person is involved.
equality of error variances is not assumed.
→ coefficient alpha
An essentially tau-equivalent model
- assumes indicators of a given factor have equal loadings but differing error variances
- Similar to tau-equivalent model, but this model frees the assumption of 0 intercepts in the regression describing two true scores. This model allows different true score means.
- The model is even less restrictive than the tau-equivalent test model.
- T1 = a + T2
A congeneric model
- Assumes indicators of a given factor have differing loadings and differing error variances.
→ most lenient → coefficient omega
- the congeneric test model has the least restrictive assumptions.
- Neither the true scores nor the error variances from two tests are assumed to be equal.
- The intercept and slope of the regression linking the two true scores are not constrained to 0 and 1, respectively.
- These assumptions allow this model to be most realistic.
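The four models summarized as equations linking two true scores (T1, T2), consistent with the intercept/slope language above (notation assumed, not quoted from Cho):
```latex
\text{Parallel: } T_1 = T_2,\ \sigma^2_{e_1} = \sigma^2_{e_2} \qquad
\text{Tau-equivalent: } T_1 = T_2,\ \sigma^2_{e_1} \neq \sigma^2_{e_2}
\text{Essentially tau-equivalent: } T_1 = a + T_2 \qquad
\text{Congeneric: } T_1 = a + b\,T_2
```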
STEP 1. Identify the dimensionality of the data.
Decision: If unidimensional, go to step 2; otherwise, go to step 3.
STEP 2. Identify the statistical similarity of the unidimensional data
Dependencies: Chi-square difference test
STEP 3. Determine the measurement model of the multidimensional data
Dependencies: Chi-square difference test and theoretical considerations
McNeish (2018)
alpha/omega/H
Thanks coefficient alpha, we’ll take it from here
empirical studies in psychology commonly report Cronbach’s alpha as a measure of internal consistency reliability despite the fact that many methodological studies have shown that Cronbach’s alpha is riddled with problems stemming from unrealistic assumptions
many times, violating these assumptions yields estimates of reliability that are too small -> making measures look less reliable than they actually are
however, published literature is still using CA
One interpretation of CA from Kline (1986): correlation between a scale and another hypothetical, same-length scale measuring the same construct
Assumptions of CA
(1) Tau-equivalence: Items must have equal factor loadings.
–> But most psychological scales are congeneric (unequal loadings), which leads to underestimates of reliability.
(2) Continuous, normally distributed items:
–> In reality, most items are discrete (e.g., Likert), violating this assumption.
–> Fix: use a polychoric covariance matrix instead of Pearson correlations.
(3) Uncorrelated error terms:
–> Often violated due to item wording/order, speeded tests, or changes in respondent mood.
–> Correlated errors often lead to overestimates of CA.
(4) Unidimensionality:
–> CA does not guarantee unidimensionality.
–> Must check with factor analysis before interpreting CA.
Alternatives to Cronbach’s alpha
(1) Omega = A more accurate estimate of composite reliability, designed for congeneric scales; Omega Total includes both general and specific factors; subsumes CA as a special case; Assumes uncorrelated errors, but can be generalized to handle correlated errors.
–> will be roughly equal to alpha if all of the correct assumptions for alpha are met (above), which is rare in psych research
(2) Coefficient H and Maximal Reliability = Measures how well a scale performs when items are optimally weighted (vs. equal weighting in CA). Best when using factor loadings to weight items differently, improving reliability estimation
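A minimal numerical sketch (not from McNeish, 2018) contrasting alpha and omega on a congeneric scale; the loadings, unique variances, and simulated sample are invented for illustration:
```python
# Sketch: alpha vs. omega total for a one-factor scale with unequal loadings.
import numpy as np

def cronbach_alpha(X):
    """Coefficient alpha from an n x k matrix of item scores."""
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)
    total_var = X.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def omega_total(loadings, uniquenesses):
    """Omega from standardized loadings and unique variances (single factor)."""
    return loadings.sum() ** 2 / (loadings.sum() ** 2 + uniquenesses.sum())

lam = np.array([0.9, 0.7, 0.6, 0.4])   # unequal loadings: violates tau-equivalence
theta = 1 - lam ** 2                   # standardized unique variances

rng = np.random.default_rng(1)
f = rng.normal(size=5000)              # latent factor scores
X = f[:, None] * lam + rng.normal(size=(5000, 4)) * np.sqrt(theta)

print(f"alpha = {cronbach_alpha(X):.3f}")        # ~.74, a slight underestimate here
print(f"omega = {omega_total(lam, theta):.3f}")  # ~.76, the model-implied reliability
```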
Campbell & Fiske (1959)
MTMM matrix
convergent/discriminant validity
matrix showing correlations among two or more measurement techniques used to assess two or more constructs or traits, as obtained from a multitrait–multimethod model.
It includes correlations among the same traits with different methods (i.e., monotrait–heteromethod) and among different traits with the same method (i.e., heterotrait–monomethod).
The former are expected to be larger, demonstrating convergent validity, whereas the latter (along with heterotrait–heteromethod correlations) are expected to be smaller, demonstrating discriminant validity (illustrative matrix below)
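A hypothetical 2-trait × 2-method fragment (values invented for illustration) showing the expected ordering: reliabilities in parentheses > validity diagonal (bold, monotrait–heteromethod) > heterotrait–monomethod > heterotrait–heteromethod:
```latex
\begin{array}{l|cc|cc}
 & T_1M_1 & T_2M_1 & T_1M_2 & T_2M_2 \\ \hline
T_1M_1 & (.90) & & & \\
T_2M_1 & .30 & (.88) & & \\ \hline
T_1M_2 & \mathbf{.65} & .15 & (.85) & \\
T_2M_2 & .15 & \mathbf{.60} & .25 & (.87)
\end{array}
```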
Hunsley & Meyer (2003)
incremental validity
The incremental validity of psychological testing and assessment: conceptual, methodological, and statistical issues
in selection: determine whether incorporating a new measure enhances decision-making accuracy
As with all validity evidence, a measure’s incremental validity is context-dependent; a test may add value in one setting but not in another.
Interpretation Challenges: The size of the incremental validity effect should be interpreted in light of practical significance, not just statistical significance.
Cost-Benefit Analysis: The added predictive value of a new measure should be weighed against the costs (financial, time, resources) associated with its implementation.
Requires careful consideration of methodological and statistical factors to ensure that new measures truly enhance predictive accuracy and decision-making in applied psychological settings.
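A minimal sketch (not from Hunsley & Meyer, 2003) of incremental validity as the gain in R² when a new measure is added to an existing battery; all data, predictors, and weights are simulated for illustration:
```python
# Sketch: Delta R^2 for a new predictor over an existing battery (simulated data).
import numpy as np

def r_squared(X, y):
    """R^2 from an OLS fit with an added intercept column."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(0)
n = 1000
existing = rng.normal(size=(n, 2))                     # e.g., current selection battery
new_test = 0.5 * existing[:, 0] + rng.normal(size=n)   # partly redundant new measure
y = existing @ np.array([0.4, 0.3]) + 0.2 * new_test + rng.normal(size=n)

r2_base = r_squared(existing, y)
r2_full = r_squared(np.column_stack([existing, new_test]), y)
print(f"Delta R^2 = {r2_full - r2_base:.3f}")  # incremental validity of the new measure
```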
Test bias
Can use Berry (2015) for applying this to adverse impact and cognitive ability tests
Internal bias = measurement bias: the test measures the latent trait differently across groups (test with measurement invariance analyses, DIF)
External bias = relational bias: the predictor-outcome relationship differs across groups; tested via differential validity (group differences in validity coefficients) and differential prediction/predictive bias (tested with moderated hierarchical regression)
PCA versus EFA
(Use textbook: Furr, 2018)
Exploratory factor analysis is used when it is not known how many factors there are between the items and which factors are determined by which items
EFA and PCA are two entirely different things! EFA = reflective; PCA = formative
EFA
Goal: reduce the redundancy among the variables by using a smaller number of factors; evaluate the internal factor structure
–> factor extraction –> enumeration –> rotation –> interpretation
It is used when little is known about the underlying structure
A factor is a dimension representing an underlying latent trait
EFA involves factoring a correlation (or covariance) matrix representing the shared or common variance
Factors are evidenced by patterns of correlation among indicators –> need to use correct correlation based on data (continuous, binary, categorical) –> If the indicators are not correlated in the first place, game over!
Variance of an observed variable
We partition the variance of yi into a component due to the common factors (communality) and a unique component (unique variance)
communality = shared variance among a set of indicators
if h2 = .70, then 70% of indicator (i.e., item) variance is explained by underlying factors. The rest, or 30% of the total variance, is unique variance
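The decomposition this refers to, written out for the orthogonal-factor case (λ and u are assumed notation, not quoted from Furr, 2018):
```latex
\operatorname{Var}(y_i) = \underbrace{h_i^2}_{\text{communality}} + \underbrace{u_i^2}_{\text{unique variance}},
\qquad h_i^2 = \sum_{m=1}^{M} \lambda_{im}^2
```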
Limitations
- naming the factors can be problematic. Factor names may not accurately reflect the variables within the factor.
- some variables are difficult to interpret because they load onto more than one factor, known as split loadings. These variables may correlate with each other to produce a factor despite having little underlying meaning for the factor
Principal Components Analysis
PCA is NOT a “true” factor model
However, beginning with PCA allows us to establish a number of core principles that apply to all factor analytic approaches
PCA is also widely used in practice, is often the default in major software packages, and is easily confused with EFA
PCA involves a mathematical procedure that transforms a set of correlated variables into a smaller set of uncorrelated principal components (PCs)
These PCs are linear combinations of the original variables and can be thought of as “new” variables (Johnson, 1998)
–> How can I abbreviate this set of variables?
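A minimal sketch (not from the textbook) of PCA as an eigendecomposition of the correlation matrix, yielding uncorrelated components that are linear combinations of the original variables; data are simulated:
```python
# Sketch: PCA via eigendecomposition of a correlation matrix (simulated variables).
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 6))
X[:, 3:] += X[:, :3]                      # induce correlation among variables

R = np.corrcoef(X, rowvar=False)          # 6 x 6 correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)      # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()       # proportion of total variance per component
Z = (X - X.mean(0)) / X.std(0)            # standardize before forming component scores
scores = Z @ eigvecs[:, :2]               # first two principal components ("new" variables)
print(np.round(explained, 2))
```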
EFA vs CFA
Can cite Furr (2018)
EFA vs. CFA: What gets analyzed
Structure
EFA
- All items load on all factors
- Goal is to pick a rotation that gives closest approximation to “simple structure” (clearly defined factors, fewest cross-loadings)
CFA
- CFA must be theory-driven: any structure is a testable hypothesis
- You specify number of latent variables and their structure
- You specify which items load on which latent variables
- You specify any additional relationships for method/other covariance
Matrix
EFA: Correlation matrix (of items = indicators)
- only correlations among observed item responses are used
- Only a standardized solution is provided
CFA: Covariance matrix (of items = indicators)
- Variances and covariances of observed item responses are analyzed
- Output includes unstandardized (covariance) AND standardized (correlation) solutions
Factor scores
EFA: Don’t use factor scores from an EFA
- Factor scores are indeterminate (especially due to rotation)
CFA: Factor scores can be used
- Factors can either be predictors (“exogenous” variables) or outcomes (“endogenous” variables) or both at once as needed (e.g., as mediators)
CFA identification
- under-identified: unknown parameters > known parameters
- just-identified: unknown parameters = known parameters –> fit cannot be tested (df = 0); not really a testable model, more a re-description of the data
- over-identified: unknown parameters < known parameters –> the only time we can test model fit
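A standard counting-rule sketch (not quoted from the source): with p observed indicators, the knowns are the non-redundant variances and covariances.
```latex
\text{knowns} = \frac{p(p+1)}{2}
\quad\Rightarrow\quad p = 4:\ 10 \text{ knowns; a one-factor model (factor variance fixed to 1)}
\text{estimates 4 loadings} + 4 \text{ error variances} = 8 \text{ unknowns, so } df = 10 - 8 = 2 \text{ (over-identified)}
```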
Measurement invariance
CFA vs IRT
(Tay et al., 2015)
Use CFA for:
- relationship among latent FACTORS [across groups]
- General equivalence of SCALE scores across groups (continuous indicators)
Use IRT for:
- general equivalence of TEST scores (total) across groups
- equivalence of TEST items across groups
- General equivalence of SCALE scores across groups (categorical indicators)
Use either for:
- Equivalence of SCALE items (if individual items themselves are more of interest, use IRT)
Flora et al. (2012)
CFA with ordinal data
MAIN POINT: when using CFA with ordinal data (e.g., Likert-type items), use polychoric correlations and a robust weighted least squares (WLS) estimator
Credé & Harms (2015)
25 years of higher‐order confirmatory factor analysis in the organizational sciences: a critical review and development of reporting recommendations.
Higher order CFA = used when constructs are hierarchically structured—that is, when several related first-order factors are believed to be explained by a broader, overarching second-order (or higher-order) factor (e.g. sub-dimensions of a trait like conscientiousness)
Second order factor = conscientiousness
First order factor = achievement striving, orderliness
Manifest/observed variables = individual items “I tend to keep things in order” (orderliness–and ONLY orderliness)
Error terms = measurement error for items
researchers should present 5 types of evidence to support a higher order model using CFA
1 - Higher order model (HOM) can accurately reproduce covariation among manifest variables in an ABSOLUTE sense
–> Evidence: Global fit indices (e.g., RMSEA, CFI, SRMR)
2 - HOM can reproduce the covariation among manifest variables as accurately as the bifactor model and the oblique lower order model
–> Evidence: Model comparison using fit indices and likelihood ratio tests. Incremental fit indices (e.g., difference in CFI)
3 - HOM is characterized by a higher-order factor that reproduces the covariation among FIRST-ORDER factors
–> Evidence: The correlation matrix of the first-order factors should closely match what the higher-order factor predicts. The higher-order factor loadings (from the first-order factors) should be strong and statistically significant.
4 - HOM explains substantial VARIATION in first order factors
–> Evidence: R² values (explained variance) for the first-order factors should be substantial. Higher-order factor loadings should indicate a strong relationship with each first-order factor.
5 - HOM explains substantial variation in manifest variables
–> Evidence: R² values for the manifest variables should indicate that the model explains a significant portion of their variance. Indirect effects can be calculated to show how the higher-order factor influences the manifest variables via the first-order factors.
found that organizational researchers typically do NOT bolster their claims of HOM with these 5 types of evidence (e.g., core self-evaluation construct; Erez & Judge, 2001)
Multidimensional forced choice format
(Use Lee et al., 2018)
MFC measures are designed to assess several traits simultaneously using statements (or, more generally, trait descriptors) that are intended to be factorially pure; that is, each statement is intended to measure just one trait. MFC items are groups of statements that are presented together for examinee consideration.
recent research shows that MFC have similar or higher construct and criterion-related validity than Likert-type personality inventories
MFC measures commonly use a two-alternative (pair), three-alternative (triplet), or four-alternative (tetrad) format.
MFC response formats can be classified into three general types:
–> PICK - pick the statement most like you
–> MOLE - choose the most like and least like statements
–> RANK - rank statements from most like you to least like you
Recent research suggests that the RANK response format yields better latent trait (person parameter) recovery than the MOLE and PICK (Joo et al., 2018) among tests of equal length
Recommended approach is MFC with rank response with triplets (best balance of info provided and cognitive load required to answer)
IRT basics
(use Furr, 2018)
IRT is a mathematical model that relates a test-taker’s latent trait or ability level with the probability of responding in a specific response category of an item.
Item characteristic curve (ICC) → function relating probability of a correct answer on an item to the ability (true score) measured by the test containing that item
Parameters of item characteristic curve:
Ability (θ) — the test-taker's standing on the latent construct ("true" ability) measured by the test –> plotted along the horizontal axis
Difficulty (b) — represents the difficulty of the item –> Shifts along the ability axis to show level of difficulty
Discrimination (a) — represents the discrimination of an item –> steeper curve = more discrimination
Guessing (c) — represents the possibility of being correct by guess. –> Lower asymptote for the probability of a correct response
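The standard three-parameter logistic (3PL) form of the ICC, tying the a, b, and c parameters above to the response probability (notation assumed):
```latex
P(X_i = 1 \mid \theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}}
```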
Assumes UNIDIMENSIONALITY (single trait) and LOCAL INDEPENDENCE (latent variable in a model fully explains why the observed items are related to one another)
Evaluate IRT model fit
- Item level: Item-level fit statistics (e.g., various Chi-square tests and fit indices) and plots
- Test level: Overall chi-square fit statistics
- The null hypothesis is that the model fits the data
- A non-significant chi-square therefore indicates that the model fits the data adequately
Newman (2014)
Missing data
Missing data: five practical guidelines
Types of Missing data
Missing completely at random (MCAR) = missingness is unrelated to any other variables and to the value of the missing variable itself
–> MCAR does not bias the results. Any kind of imputation can be used, and the results will be unbiased.
–> However, if a large portion of the data is MCAR, listwise deletion is inefficient because it reduces the sample size and inflates standard errors.
Missing at random (MAR) = missingness is related to other observed variables, but not to the value of Y itself
–> Handled well by ML or MI, especially when auxiliary variables related to missingness are included
Missing not at random (MNAR): Missingness is related to the value of Y itself
Best practice is to use maximum likelihood (e.g., full-information ML) or multiple imputation (20–40 imputed datasets, then combine the estimates)
MNAR WILL ALWAYS BE BIASED NO MATTER WHAT MECHANISM YOU USE. But - ML and MI allow for accurate standard errors in MNAR
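A minimal sketch (not from Newman, 2014) of why MAR missingness biases listwise deletion while a model-based fix recovers the estimate; stochastic regression imputation stands in crudely for ML/MI, and all data and the missingness rule are simulated:
```python
# Sketch: MAR missingness -> listwise deletion biased, simple imputation roughly unbiased.
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(scale=0.8, size=n)      # true mean of y is 0

# MAR: probability that y is missing depends on the observed x, not on y itself
miss = rng.random(n) < 1 / (1 + np.exp(-2 * x))
y_obs = np.where(miss, np.nan, y)

print(f"listwise mean: {np.nanmean(y_obs): .3f}")    # biased downward

# Stochastic regression imputation using complete cases (crude stand-in for ML/MI)
b = np.polyfit(x[~miss], y[~miss], 1)
resid_sd = (y[~miss] - np.polyval(b, x[~miss])).std()
y_imp = y_obs.copy()
y_imp[miss] = np.polyval(b, x[miss]) + rng.normal(scale=resid_sd, size=miss.sum())
print(f"imputed mean:  {y_imp.mean(): .3f}")         # approximately 0
```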
Aguinis et al. (2013)
Outliers
Best-practice recommendations for defining, identifying, and handling outliers
There are
(a) 14 unique and mutually exclusive outlier definitions, 39 outlier identification techniques, and 20 different ways of handling outliers;
(b) inconsistencies in how outliers are defined, identified, and handled in various methodological sources; and
(c) confusion and lack of transparency in how outliers are addressed by substantive researchers.
Three types of outliers
- ERROR outliers: data points that lie at a distance from other data points because they are results of inaccuracies (errors)
- INTERESTING outliers: accurate data points that lie at a distance from other data points (i.e., not error outliers) and may contain potentially valuable or unexpected knowledge.
- INFLUENTIAL outliers: model fit outliers and prediction outliers; data points whose presence alter the fit of a model or parameter estimates.
DeSimone et al. (2015)
data screening
Best practice recommendations for data screening
During study design
- Determine the forms of insufficient-effort responding most likely to be exhibited by respondents and how best to detect them
- Insert screening items at various points throughout the survey to detect localized lapses in effort, ensuring that these items use identical response options to surrounding items.
- Understand the range of possible values for each variable as well as distributional characteristics for variables that have been studied previously.
During study administration
- Time respondents (if possible)
- Observe respondents to determine whether or not they are attending to study-related tasks (has issues with demand effects)
After data has been collected
- Visually inspect the data to identify data-entry errors or implausible values for each variable.
- Calculate the distributional characteristics of each item to assist in identifying outliers
- PRIOR to examining individual respondents, carefully determine what should be considered insufficient-effort responding, noting that this will vary by research design
- Using a combination of different screening techniques, eliminate participants who are likely to have exhibited insufficient effort responding.
- Report the results of a study both BEFORE and AFTER employing data screening techniques, noting any differences in results that arise as the result of eliminating insufficient-effort respondents.
Zickar et al. (2023)
sampling quality annual review
Innovations in sampling: improving the appropriateness and quality of samples in organizational research.
Considerations when CHOOSING samples
- cost
- potential degree of accuracy of population you’ll get with the sample
- time to collect
- n of subgroup members
- generalizability considerations
Looking for careless / insufficient-effort responding (see the screening sketch after this list)
- attention checks [manipulation checks; bogus items]
- Mahalanobis D
- long string data
- response time
- consistency indices [even-odd, response consistency]
- response coherence [open-ended items]
- self-reported effort/attention
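A minimal sketch (not from Zickar et al., 2023) of two of the screens listed above, long-string analysis and Mahalanobis distance; the Likert responses are simulated and the cutoffs are illustrative, not recommended values:
```python
# Sketch: long-string and Mahalanobis-distance screens for careless responding.
import numpy as np

rng = np.random.default_rng(4)
X = rng.integers(1, 6, size=(300, 20))        # 300 respondents, 20 items, 1-5 scale
X[:5] = 3                                     # plant a few straight-liners

def longest_string(row):
    """Length of the longest run of identical consecutive responses."""
    best = run = 1
    for a, b in zip(row[:-1], row[1:]):
        run = run + 1 if a == b else 1
        best = max(best, run)
    return best

long_string = np.array([longest_string(r) for r in X])

centered = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
md2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)   # squared Mahalanobis D

flag = (long_string >= 15) | (md2 > np.percentile(md2, 99))   # illustrative cutoffs
print(f"flagged {flag.sum()} respondents for follow-up")
```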
Podsakoff et al. (2024)
annual review CMB
Common method bias: It’s bad, it’s complex, it’s widespread, and it’s not easy to fix
2 harmful effects of common method bias/method variance
1) method variance can bias estimates of reliability and validity of latent variables
2) can bias parameter estimates of the relationship between measures of different constructs
common rater effects = leniency, social desirability, affect
item characteristic effects = item wording, ambiguity, item-level demand effects, common scale anchors
item context effects = item priming effects, item proximity, scale length
measurement context effects = same time, same location, same medium
4 basic procedural remedies
1 - obtain measures of predictor and criterion from different sources
2 - separate the measurement of predictor and criterion variables temporally, proximally, psychologically
3 - protect respondent anonymity and reduce evaluation apprehension
4 - minimize common scale properties
best to match procedural remedy to source of CMB
Antonakis et al. (2010)
causality
On making causal claims: A review and recommendations
Major problem in non-experimental causal models:
endogeneity bias leads to biased and inconsistent estimates of the model parameters (unreliable results for inference or prediction): the effect of x on y (e.g., income and consumer preference) cannot be interpreted because the error term correlates with the predictor. Sources include:
- omitted causes: the omitted variable is absorbed into the error term and correlates with the included IV, violating the exogeneity assumption
- simultaneity: bidirectional causation between IV and DV (e.g., assertiveness to engagement and engagement to assertiveness)
- measurement error: the unmeasured portion becomes part of the error term, which then correlates with the IV
- self-selection: participants are not randomly selected (participation is voluntary), so the sample may not be representative of the population
Conditions for inferring that X causes Y:
- X must precede Y temporally (necessary but not sufficient condition)
- X must be reliably correlated with Y (beyond chance)
- The relationship between X and Y must not be explained by other causes
Failsafe way to establish causality = randomized experiments
Funder & Ozer (2019)
effect sizes
Evaluating effect size in psychological research: Sense and nonsense.
the nonsensical but widely used interpretation of effect size is Cohen’s (1988) suggestions in the context of power analysis: r = .10 is small, .30 is medium, and .50 is large
Cohen later regretted these suggestions
small, medium, and large are meaningless in the absence of a frame of reference
Suggest two easy ways to interpret effect sizes
–> use a benchmark: compare with classic studies, with other well-established psychological findings, comparisons with “all” studies, comparisons with intuitively understood non-psychological relations
–> estimate consequences: binomial effect size display (BESD) - the BESD illustrates the size of an effect, reported in terms of r, using a 2 x 2 table of outcomes; consequences in the long run - could this tiny effect size be truly important for a population (e.g., smoking and cancer)?
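A worked BESD example using the standard 0.50 ± r/2 conversion (illustrative numbers, not taken from the paper):
```latex
\text{success rate} = .50 \pm \tfrac{r}{2}
\quad\Rightarrow\quad r = .30:\ 65\% \text{ vs. } 35\% \text{ success across the two rows of the } 2 \times 2 \text{ table}
```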
Implications for Interpreting Research Findings
- Researchers should not automatically dismiss “small effects”
- Researchers should be more skeptical about “large” effects
- Researchers should be more realistic about the aim of their programs of psychological research
Recommendations for Research Practice
- Report effect sizes, always and prominently, with confidence intervals
- Conduct studies with large samples (when possible)
- Report effect sizes in terms that are meaningful in context
revise the cohen guidelines
r = .10 - small at level of single events but potentially more ultimately consequential
r = .20 - medium size that is somewhat explanatory and practical in short run
r = .30 - large, potentially very powerful in both short and long run
r > .40 - very large in psychological research and likely to be a gross overestimate that will rarely be found in a large sample or in a replication
Bosco et al. (2015)
effect size
Correlational effect size benchmarks
use meta-analysis to create correlation effect size benchmarks for applied psychology
In sum, results indicated that commonly used, existing ES benchmarks are not appropriately tailored to the applied psychology research context. Results also indicate that empirical benchmarks for effect size magnitude may vary as a function of bivariate relation type
effect sizes are larger when relations don’t include behaviors
–> medium effect sizes (r)
involving behaviors = .10 - .25
not involving behaviors = .20 - .40
found that relations with movement (e.g., turnover) are smaller than relations with performance
also calculated the sample sizes needed to achieve .80 power with the effect sizes they found
using the broad benchmarks, sample sizes required to achieve .80 power for a 50th percentile effect size vary between 97 and 150 (nonbehavioral relations) and between 215 and 304 (behavioral relations)
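A minimal sketch (not Bosco et al.'s method) of how required n for .80 power scales with r, using the Fisher z approximation; the exact percentile-based values they report are the ones cited above:
```python
# Sketch: approximate n for .80 power to detect a correlation r (two-tailed alpha = .05).
import numpy as np
from scipy.stats import norm

def n_for_power(r, alpha=0.05, power=0.80):
    z_r = np.arctanh(r)                              # Fisher z of the target correlation
    z_crit = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return int(np.ceil((z_crit / z_r) ** 2 + 3))

for r in (0.10, 0.16, 0.25, 0.30):
    print(f"r = {r:.2f}  ->  n ~ {n_for_power(r)}")
```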
Nosek et al. (2022)
replication crisis annual review
Replicability, robustness, and reproducibility in psychological science
A study is a REPLICATION when the innumerable differences from the original study are believed to be irrelevant for obtaining the evidence about the same finding.
Replication - testing the reliability of a prior finding with different data
Robustness - testing the reliability of a prior finding using the same data and a different analysis strategy
Reproducibility - testing the reliability of a prior finding using the same data and the same analysis strategy
Replication seems straightforward - but it is not. There is no such thing as an exact replication. This fact creates a tension
Can resolve this tension by:
- accepting that every study is unique and the evidence it produces applies only to a context that will never occur again
- understanding replication as a theoretical commitment.
SYSTEMATIC replication - replication efforts that define a sampling frame and conduct replications of as many studies in the sampling frame as possible to minimize selection biases
MULTI-SITE replications - studies that conduct the same replication protocol in a variety of samples and settings to obtain highly precise estimates of effect size and heterogeneity
Bakker et al. (2012) found that if the goal is to create as many positive results as possible, it is in the researchers’ self interest to run MANY UNDERPOWERED studies rather than fewer well powered ones
Research has found that researchers themselves disagree with the cultural devaluation of replication (hinting that the problem lies with the system and its rewards, not with the minds of scientists)