M-side Flashcards

(38 cards)

1
Q

Colquitt et al. (2019)

content validation

A

Content validation guidelines: Evaluation criteria for definitional correspondence and definitional distinctiveness

Definitional CORRESPONDENCE: Degree that scale items align with the target construct’s definition

Definitional DISTINCTIVENESS: Degree that scale items better reflect the focal construct than closely related (“orbiting”) constructs

Two main approaches for content validation

  1. Anderson & Gerbing (1991)
    - Sorting-Based (ChatGPT-like logic)
    - SMEs sort items into the construct they believe best represents the item’s meaning.
    - Goal: Maximize agreement on correct classification.

Metrics:
- Proportion of Substantive Agreement (PSA) = CORRESPONDENCE = % of judges correctly sorting items
- Substantive Validity Coefficient (CSV) = DISTINCTIVENESS = difference between correct vs. incorrect classifications

  2. Hinkin & Tracey (1999)
    - Rating-Based (Human-like logic)
    - JUDGES rate how well each item reflects each construct definition using Likert scales (1–5).

Metrics:
- Hinkin-Tracey Correspondence (HTC) = CORRESPONDENCE = average rating for focal construct ÷ number of anchors
- Hinkin-Tracey Distinctiveness (HTD) = DISTINCTIVENESS = average difference between focal and orbiting construct ratings ÷ (anchors – 1)
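
As a quick reference, the four indices can be written out. This is a sketch based on the descriptions above, using notation I am introducing here: n_c = number of judges assigning an item to the focal construct, n_o = the highest number assigning it to any other construct, N = total number of judges, k = number of response-scale anchors, and r-bar = the average definitional-correspondence rating.

```latex
\mathrm{PSA} = \frac{n_c}{N}
\qquad
c_{sv} = \frac{n_c - n_o}{N}
\qquad
\mathrm{HTC} = \frac{\bar{r}_{\mathrm{focal}}}{k}
\qquad
\mathrm{HTD} = \frac{\bar{r}_{\mathrm{focal}} - \bar{r}_{\mathrm{orbiting}}}{k - 1}
```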

When modifying scales, the most common modification was to drop items from the scale (36.94% of modified scales); however, modified scales should also be tested for convergent validity, discriminant validity, and comparative CFAs (Cortina et al., 2020)

2
Q

Podsakoff et al. (2016)

Concept definitions

A

Recommendations for creating better concept definitions in social sciences

Lack of good conceptual definitions has been a longstanding problem in social sciences.

Clear conceptual definitions are essential for scientific progress

Main point: ensure the final version of conceptual definition is clear, concise, understandable to broad audiences, and not subject to multiple interpretations.

CONCEPTS
- Cognitive symbols (or abstract terms) that specify the features, attributes, or characteristics of the phenomenon in the real or phenomenological world that they are meant to represent, and that distinguish it from other related phenomena
- Serve as the building blocks of theory

Problems with Lack of Conceptual CLARITY:
- Difficult to distinguish focal concept from other similar concepts, undermining discriminant validity
- Leads to proliferation of different terms for the same concept – “Old wine new bottle” problem [personal note: especially in leadership research]
- Difficult to specify and test the nomological network of the concept
- Difficult to operationally measure the concept – i.e., mismatch between concept and measures of it, undermining construct validity.
- Also increases the likelihood of contamination or deficiency of the conceptual measurement.

Recommendations: stages may overlap

(1) Identify potential attributes by collecting representative set of definitions
- Activities to aid in collecting attributes: searching dictionary for synonyms/antonyms, surveying literature, interviewing subject matter experts, focus groups, case studies, comparing the concept with its opposite, thinking how to operationalize the concept

(2) Organize potential attributes by theme and identify necessary/sufficient and shared ones
- Identifying underlying themes can help clarify the concept and distinguish from other related constructs.
- Need to identify attributes of the concept that are necessary and jointly sufficient.
- Necessary (essential) = essential properties that all exemplars of the concept must possess.
- Sufficient (unique) = properties that only exemplars of the concept possess.

(3) Develop preliminary definition of concept
- describe the general nature of the conceptual domain by specifying the property the construct represents and the entity that it applies to
- When the concept is multidimensional, each dimension or facet should be explicitly and clearly defined
- The conceptual definition should also specify whether the concept is stable over time and generalizable across situations
- The concept should be distinguished from other concepts (e.g., attributes unique to focal concept) to reduce construct proliferation.
- Identifying antecedents and consequences can help clarify the definition, but they should not constitute the definition itself
- Avoid tautological statements - definition that simply restates in different words the thing that is being defined.

(4) Refine the definition of the concept
- Revise the definition as needed (e.g., SME reviews, ask “what do you mean by that?”)

3
Q

Zickar (2020)

Measurement annual review

A

Measurement development and evaluation

Before item writing, decide on the most appropriate format for the items and review item-writing best practices

Write significantly more items than needed

For negatively-worded items, there is much debate about whether they should be included or not. If you want to maximize unidimensionality, items should be all scored in the same direction. Reverse-coded items may introduce unintended method factors and tend to have smaller discrimination

Ambiguous items tend to perform worse psychometrically
- Double-barreled items confuse respondents overall and should be avoided
- For example: Employers should be allowed to use urine drug tests but not hair follicle tests (Disagreement = either both are bad or both are okay).

Measurement evaluation frameworks
- CTT: simplest, works well with smaller samples, item-total correlation is useful in EARLY scale development for weeding out bad items; but no way to determine model fit and assumes linear relationship
- EFA: good for inductive, and early stages; but often not replicable (generally assumes linear relationship)
- CFA: specific fit evaluations, useful once basic structure is clarified; but results often hypersensitive to wording of items (generally assumes linear relationship)
- IRT: detailed stats on how items individually function, understanding of response process; but requires large N and complex statistical programs

4
Q

Cho (2016)

reliability

A

Making reliability reliable: A systematic approach to reliability coefficients

All four test models assume that true scores from one test are linearly related to the true scores from another test. They also assume the unidimensionality of items. Each model has different basic assumptions on measurement.

Parallel model
Assumes indicators of a given factor have equal loadings and equal error variances.
Assumes that true scores from different tests are equal when the same individual is involved.
The tests have intercepts of 0 and slope (coefficient) of 1 when linking the two true scores
→ test-retest reliability; alternate form

Tau-equivalent model
Assumes indicators of a given factor have equal loadings but differing error variances.
Assumes that true scores from different items are equal when the same person is involved.
Equality of error variances is not assumed.
→ coefficient alpha

An essentially tau-equivalent model
- assumes indicators of a given factor have equal loadings but differing error variances
- Similar to tau-equivalent model, but this model frees the assumption of 0 intercepts in the regression describing two true scores. This model allows different true score means.
- The model is even less restrictive than the tau-equivalent test model.
- T1 = a + T2

A congeneric model
- Assumes indicators of a given factor have differing loadings and differing error variances.
→ most lenient → coefficient omega
- the congeneric test model has the least restrictive assumptions.
- Neither the true scores nor the error variances from two tests are assumed to be equal.
- The intercept and slope of the regression equation linking the two test scores are not constrained to be 0 and 1, respectively.
- These assumptions allow this model to be most realistic.
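
As a sketch, all four models can be written as constraints on the same one-factor measurement equation, where x_j is the observed score on test/item j, T is the true score (common factor), and e_j is the error term:

```latex
x_j = \mu_j + \lambda_j T + e_j
```

- Parallel: intercepts, loadings, and error variances all equal across j
- Tau-equivalent: intercepts and loadings equal across j; error variances free
- Essentially tau-equivalent: loadings equal across j; intercepts and error variances free
- Congeneric: intercepts, loadings, and error variances all free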

STEP 1. Identify the dimensionality of the data.
Decision: If unidimensional, go to step 2; otherwise, go to step 3.

STEP 2. Identify the statistical similarity of the unidimensional data
Dependencies: Chi-square difference test

STEP 3. Determine the measurement model of the multidimensional data
Dependencies: Chi-square difference test and theoretical considerations

5
Q

McNeish (2018)

alpha/omega/H

A

Thanks coefficient alpha, we’ll take it from here

Empirical studies in psychology commonly report Cronbach's alpha (CA) as a measure of internal consistency reliability, despite the fact that many methodological studies have shown that Cronbach's alpha is riddled with problems stemming from unrealistic assumptions

Many times, violating these assumptions yields estimates of reliability that are too small -> making measures look less reliable than they actually are

However, the published literature still relies heavily on CA
One interpretation of CA from Kline (1986): correlation between a scale and another hypothetical, same-length scale measuring the same construct

Assumptions of CA
(1) Tau-equivalence: Items must have equal factor loadings.
–> But most psychological scales are congeneric (unequal loadings), which leads to underestimates of reliability.
(2) Continuous, normally distributed items:
–> In reality, most items are discrete (e.g., Likert), violating this assumption.
–> Fix: use a polychoric covariance matrix instead of Pearson correlations.
(3) Uncorrelated error terms:
–> Often violated due to item wording/order, speeded tests, or changes in respondent mood.
–> Correlated errors often lead to overestimates of CA.
(4) Unidimensionality:
–> CA does not guarantee unidimensionality.
–> Must check with factor analysis before interpreting CA.

Alternatives to Cronbach’s alpha

(1) Omega = A more accurate estimate of composite reliability, designed for congeneric scales; Omega Total includes both general and specific factors; subsumes CA as a special case; Assumes uncorrelated errors, but can be generalized to handle correlated errors.
–> will be roughly equal to alpha if all of the correct assumptions for alpha are met (above), which is rare in psych research

(2) Coefficient H and Maximal Reliability = Measures how well a scale performs when items are optimally weighted (vs. equal weighting in CA). Best when using factor loadings to weight items differently, improving reliability estimation
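
A hedged sketch of the formulas for the unidimensional case, with k items, item variances sigma_i^2, total-score variance sigma_X^2, standardized loadings lambda_i, and error variances theta_i:

```latex
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i} \sigma_i^2}{\sigma_X^2}\right)
\qquad
\omega = \frac{\left(\sum_i \lambda_i\right)^2}{\left(\sum_i \lambda_i\right)^2 + \sum_i \theta_i}
\qquad
H = \left(1 + \left(\sum_i \frac{\lambda_i^2}{1 - \lambda_i^2}\right)^{-1}\right)^{-1}
```

Omega reduces to alpha when loadings are equal (tau-equivalence) and errors are uncorrelated; H weights items by their loadings (optimal weighting) rather than weighting them equally.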

6
Q

Campbell & Fiske (1959)

MTMM matrix

A

convergent/discriminant validity

matrix showing correlations among two or more measurement techniques used to assess two or more constructs or traits, as obtained from a multitrait–multimethod model.

It includes correlations among the same traits with different methods (i.e., monotrait–heteromethod) and among different traits with the same method (i.e., heterotrait–monomethod).

The former are expected to be the largest, thus demonstrating convergent validity, whereas the latter are expected to be smaller, demonstrating discriminant validity

7
Q

Hunsley & Meyer (2003)

incremental validity

A

The incremental validity of psychological testing and assessment: conceptual, methodological, and statistical issues

in selection: determine whether incorporating a new measure enhances decision-making accuracy

As with all validity evidence, a measure’s incremental validity is context-dependent; a test may add value in one setting but not in another.

Interpretation Challenges: The size of the incremental validity effect should be interpreted in light of practical significance, not just statistical significance.

Cost-Benefit Analysis: The added predictive value of a new measure should be weighed against the costs (financial, time, resources) associated with its implementation.

Requires careful consideration of methodological and statistical factors to ensure that new measures truly enhance predictive accuracy and decision-making in applied psychological settings.

8
Q

Test bias

A

Can use Berry (2015) for applying this to adverse impact and cognitive ability tests

Internal bias = measurement bias; the test functions differently across groups at the same level of the latent trait (test with measurement invariance / DIF analyses)

External bias = relational bias; the predictor-outcome relationship differs across groups: differential validity (compare validity coefficients across groups) and differential prediction/predictive bias (test with moderated hierarchical regression; see the sketch below)
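
A minimal sketch of the differential-prediction test via moderated hierarchical regression. The data and variable names (score, group, perf) are hypothetical, not from Berry (2015):

```python
# Minimal sketch of a differential-prediction (predictive bias) test.
# Step 1 adds the group main effect (intercept differences);
# Step 2 adds the predictor-by-group interaction (slope differences).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
group = rng.integers(0, 2, size=n)            # 0/1 protected-group indicator
score = rng.normal(50, 10, size=n)            # predictor (e.g., cognitive ability test)
perf = 0.05 * score + rng.normal(size=n)      # criterion; toy data with no bias built in

df = pd.DataFrame({"perf": perf, "score": score, "group": group})

step1 = smf.ols("perf ~ score + group", data=df).fit()   # intercept bias
step2 = smf.ols("perf ~ score * group", data=df).fit()   # adds score:group for slope bias
print(step1.pvalues["group"])                             # test of intercept differences
print(step2.pvalues["score:group"])                       # test of slope differences
```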

9
Q

PCA versus EFA

(Use textbook: Furr, 2018)

A

Exploratory factor analysis is used when it is not known how many factors underlie the items or which items are determined by which factors

EFA and PCA are two entirely different things! EFA = reflective; PCA = formative

EFA

Goal: reduce the redundancy among the variables by using a smaller number of factors; evaluate the internal factor structure
–> factor extraction –> enumeration –> rotation –> interpretation

It is used when little is known about the underlying structure
Factor is referred to as a dimension underlying latent traits

EFA involves factoring a correlation (or covariance) matrix representing the shared or common variance

Factors are evidenced by patterns of correlation among indicators –> need to use correct correlation based on data (continuous, binary, categorical) –> If the indicators are not correlated in the first place, game over!

Variance of an observed variable
We partition the variance of yi into a component due to the common factors (communality) and a unique component (unique variance)
communality = shared variance among a set of indicators
if h2 = .70, then 70% of indicator (i.e., item) variance is explained by underlying factors. The rest, or 30% of the total variance, is unique variance
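
In symbols (standardized solution with orthogonal factors), the partition for indicator y_i is:

```latex
\mathrm{Var}(y_i) = h_i^2 + u_i^2, \qquad h_i^2 = \sum_{j} \lambda_{ij}^2
```

so h_i^2 = .70 implies a unique variance of u_i^2 = .30.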

Limitations
- naming the factors can be problematic. Factor names may not accurately reflect the variables within the factor.
- some variables are difficult to interpret because they may load onto more than one factor, which is known as split loadings. These variables may correlate with one another to produce a factor despite having little underlying meaning for the factor

Principal Components Analysis

PCA is NOT a “true” factor model

However, beginning with PCA allows us to establish a number of core principles that apply to all factor analytic approaches

PCA is also widely used in practice, is often the default in major software packages, and is easily confused with other types of EFA

PCA involves a mathematical procedure that transforms a set of correlated variables into a smaller set of uncorrelated principal components (PCs)

These PCs are linear combinations of the original variables and can be thought of as “new” variables (Johnson, 1998)
–> How can I abbreviate this set of variables?
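
A minimal sketch of what PCA actually does: an eigendecomposition of the correlation matrix, with components formed as linear combinations of the standardized variables. The data are toy values, not from any cited study:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # 200 respondents, 5 items (toy data)
X[:, 1] += 0.8 * X[:, 0]                   # induce some redundancy among items

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize items
R = np.corrcoef(Z, rowvar=False)           # item correlation matrix

eigvals, eigvecs = np.linalg.eigh(R)       # eigendecomposition (ascending order)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Z @ eigvecs                                    # principal component scores
print(eigvals / eigvals.sum())                          # proportion of variance per component
print(np.round(np.corrcoef(scores, rowvar=False), 2))   # PCs are uncorrelated
```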

10
Q

EFA vs CFA

Can cite Furr (2018)

A

EFA vs. CFA: What gets analyzed

Structure
EFA
- All items load on all factors
- Goal is to pick a rotation that gives closest approximation to “simple structure” (clearly defined factors, fewest cross-loadings)
CFA
- CFA must be theory-driven: any structure is a testable hypothesis
- You specify number of latent variables and their structure
- You specify which items load on which latent variables
- You specify any additional relationships for method/other covariance

Matrix
EFA: Correlation matrix (of items = indicators)
- only correlations among observed item responses are used
- Only a standardized solution is provided
CFA: Covariance matrix (of items = indicators)
- Variances and covariances of observed item responses are analyzed
- Output includes unstandardized (covariance) AND standardized (correlation) solutions

Factor scores
EFA: Don’t use factor scores from an EFA
- Factor scores are indeterminate (especially due to rotation)
CFA: Factor scores can be used
- Factors can either be predictors (“exogenous” variables) or outcomes (“endogenous” variables) or both at once as needed (e.g., as mediators)

CFA identification
- under-identified: unknown parameters > known parameters
- just-identified: unknown parameters = known parameters –> this is not really a testable model, more so a description of the data
- over-identified: unknown parameters < known parameters –> the only time we can test model fit
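
A quick worked count, assuming a one-factor CFA with 4 indicators and the factor variance fixed to 1:

```latex
\text{knowns} = \frac{p(p+1)}{2} = \frac{4 \cdot 5}{2} = 10,
\qquad
\text{unknowns} = 4\ \text{loadings} + 4\ \text{error variances} = 8,
\qquad
df = 10 - 8 = 2
```

So the model is over-identified and its fit can be tested.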

11
Q

Measurement invariance
CFA vs IRT

(Tay et al., 2015)

A

Use CFA for:
- relationship among latent FACTORS [across groups]
- General equivalence of SCALE scores across groups (continuous indicators)

Use IRT for:
- general equivalence of TEST scores (total) across groups
- equivalence of TEST items across groups
- General equivalence of SCALE scores across groups (categorical indicators)

Use either for:
- Equivalence of SCALE items (if individual items themselves are more of interest, use IRT)

12
Q

Flora et al. (2012)

CFA with ordinal data

A

MAIN POINT: when using a CFA with ordinal data (e.g., likert scale) use polychoric correlation and robust weighted least squares (WLS) estimator

13
Q

Credé & Harms (2015)

A

25 years of higher‐order confirmatory factor analysis in the organizational sciences: a critical review and development of reporting recommendations.

Higher order CFA = used when constructs are hierarchically structured—that is, when several related first-order factors are believed to be explained by a broader, overarching second-order (or higher-order) factor (e.g. sub-dimensions of a trait like conscientiousness)

Second order factor = conscientiousness
First order factor = achievement striving, orderliness
Manifest/observed variables = individual items “I tend to keep things in order” (orderliness–and ONLY orderliness)
Error terms = measurement error for items

researchers should present 5 types of evidence to support a higher order model using CFA

1 - Higher order model (HOM) can accurately reproduce covariation among manifest variables in an ABSOLUTE sense
–> Evidence: Global fit indices (e.g., RMSEA, CFI, SRMR)

2 - HOM can reproduce the covariation among manifest variables as accurately as the bifactor model and the oblique lower order model
–> Evidence: Model comparison using fit indices and likelihood ratio tests. Incremental fit indices (e.g., difference in CFI)

3 - HOM is characterized by a higher-order factor that reproduces the covariation among FIRST-ORDER factors
–> Evidence: The correlation matrix of the first-order factors should closely match what the higher-order factor predicts. The higher-order factor loadings (from the first-order factors) should be strong and statistically significant.

4 - HOM explains substantial VARIATION in first order factors
–> Evidence: R² values (explained variance) for the first-order factors should be substantial. Higher-order factor loadings should indicate a strong relationship with each first-order factor.

5 - HOM explains substantial variation in manifest variables
–> Evidence: R² values for the manifest variables should indicate that the model explains a significant portion of their variance. Indirect effects can be calculated to show how the higher-order factor influences the manifest variables via the first-order factors.

found that organizational researchers typically do NOT bolster their claims of HOM with these 5 types of evidence (e.g., core self-evaluation construct; Erez & Judge, 2001)

14
Q

Multidimensional forced choice format

(Use Lee et al., 2018)

A

MFC measures are designed to assess several traits simultaneously using statements (or, more generally, trait descriptors) that are intended to be factorially pure; that is, each statement is intended to measure just one trait. MFC items are groups of statements that are presented together for examinee consideration.

recent research shows that MFC have similar or higher construct and criterion-related validity than Likert-type personality inventories

MFC measures commonly use a two-alternative (pair), three-alternative (triplet), or four-alternative (tetrad) format.

MFC response formats can be classified into three general types:
–> PICK - pick the statement most like you
–> MOLE - choose the most like and least like statements
–> RANK - rank statements from most like you to least like you

Recent research suggests that the RANK response format yields better latent trait (person parameter) recovery than the MOLE and PICK (Joo et al., 2018) among tests of equal length

Recommended approach is MFC with rank response with triplets (best balance of info provided and cognitive load required to answer)

15
Q

IRT basics

(use Furr, 2018)

A

IRT is a mathematical model that relates a test-taker’s latent trait or ability level with the probability of responding in a specific response category of an item.

Item characteristic curve (ICC) → function relating probability of a correct answer on an item to the ability (true score) measured by the test containing that item

Parameters of item characteristic curve:
Ability (θ) — the latent trait or ability measured by the test –> plotted along the horizontal axis
Difficulty (b) — represents the difficulty of the item –> Shifts along the ability axis to show level of difficulty
Discrimination (a) — represents the discrimination of an item –> steeper curve = more discrimination
Guessing (c) — represents the possibility of being correct by guess. –> Lower asymptote for the probability of a correct response
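
These parameters combine in the item characteristic curve; for the three-parameter logistic (3PL) model, the probability of a correct response is:

```latex
P(X = 1 \mid \theta) = c + (1 - c)\,\frac{1}{1 + e^{-a(\theta - b)}}
```

Setting c = 0 gives the 2PL; additionally constraining a to be equal across items gives the 1PL/Rasch model.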

Assumes UNIDIMENSIONALITY (single trait) and LOCAL INDEPENDENCE (latent variable in a model fully explains why the observed items are related to one another)

Evaluate IRT model fit
- Item level: Item-level fit statistics (e.g., various Chi-square tests and fit indices) and plots
- Test level: Overall chi-square fit statistics
- The null hypothesis is that the model fits the data
- A non-significant chi-square therefore indicates that the model fits the data

16
Q

Newman (2014)

Missing data

A

Missing data: five practical guidelines

Types of Missing data
Missing completely at random (MCAR) = missingness is not related to any of the other variables or itself
–> MCAR does not bias the results. Any standard missing-data technique can be used, and the results will be unbiased.
–> However, if MCAR makes up a large portion of the data, listwise deletion would be inefficient because it reduces the sample size and standard errors get larger.

Missing at random (MAR) = Missingness is related to other variables, but not to the value of Y itself
–> Handled well by ML/MI, especially when auxiliary variables related to missingness are included

Missing not at random (MNAR): Missingness is related to the value of Y itself

Best practice is to use maximum likelihood (uses summary estimates) or multiple imputation (20-40 imputed datasets, then combine)

MNAR WILL ALWAYS BE BIASED NO MATTER WHAT TECHNIQUE YOU USE. But - ML and MI allow for accurate standard errors under MNAR
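
A minimal sketch of multiple imputation in Python, using scikit-learn's IterativeImputer as a stand-in for model-based imputation. The dataset, variable names, and 20-imputation loop are illustrative; in practice FIML or MI in a dedicated package would be used, and standard errors would be pooled with Rubin's rules:

```python
# Hedged sketch: generate several imputed datasets and combine point estimates.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "y"])
df.loc[rng.random(100) < 0.2, "y"] = np.nan        # ~20% missing on y (toy MCAR pattern)

estimates = []
for m in range(20):                                 # 20-40 imputations recommended
    imp = IterativeImputer(sample_posterior=True, random_state=m)
    completed = pd.DataFrame(imp.fit_transform(df), columns=df.columns)
    estimates.append(completed["y"].corr(completed["x1"]))

print(np.mean(estimates))                           # pooled point estimate across imputations
```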

17
Q

Aguinis et al. (2013)

Outliers

A

Best-practice recommendations for defining, identifying, and handling outliers

There are
(a) 14 unique and mutually exclusive outlier definitions, 39 outlier identification techniques, and 20 different ways of handling outliers;
(b) inconsistencies in how outliers are defined, identified, and handled in various methodological sources; and
(c) confusion and lack of transparency in how outliers are addressed by substantive researchers.

Three types of outliers
- ERROR outliers: data points that lie at a distance from other data points because they are results of inaccuracies (errors)
- INTERESTING outliers: accurate (non-error) data points that lie at a distance from other data points and may contain potentially valuable or unexpected knowledge worth further examination.
- INFLUENTIAL outliers: model fit outliers and prediction outliers; data points whose presence alter the fit of a model or parameter estimates.
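
For influential outliers, one common identification technique is Cook's distance from a fitted regression. A minimal sketch with toy data (not from the article); the 4/n cutoff is a common rule of thumb, not a prescription from Aguinis et al.:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)
y[0] = 8.0                                    # plant one influential point

model = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d = model.get_influence().cooks_distance[0]   # one distance per observation
print(np.where(cooks_d > 4 / len(y))[0])            # indices flagged by the 4/n rule of thumb
```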

18
Q

DeSimone et al. (2015)

data screening

A

Best practice recommendations for data screening

During study design
- Determine the forms of insufficient-effort responding most likely to be exhibited by respondents and how best to detect them
- Insert screening items at various points throughout the survey to detect localized lapses in effort, ensuring that these items use identical response options to surrounding items.
- Understand the range of possible values for each variable as well as distributional characteristics for variables that have been studied previously.

During study administration
- Time respondents (if possible)
- Observe respondents to determine whether or not they are attending to study-related tasks (has issues with demand effects)

After data has been collected
- Visually inspect the data to identify data-entry errors or implausible values for each variable.
- Calculate the distributional characteristics of each item to assist in identifying outliers
- PRIOR to examining individual respondents, carefully determine what should be considered insufficient-effort responding, noting that this will vary by research design
- Using a combination of different screening techniques, eliminate participants who are likely to have exhibited insufficient effort responding.

Report the results of a study both BEFORE and AFTER employing data screening techniques, noting any differences in results that arise as the result of eliminating insufficient-effort respondents.

19
Q

Zickar et al. (2023)

sampling quality annual review

A

Innovations in sampling: improving the appropriateness and quality of samples in organizational research.

Considerations when CHOOSING samples
- cost
- potential degree of accuracy of population you’ll get with the sample
- time to collect
- n of subgroup members
- generalizability considerations

Looking for careless / insufficient-effort responding (see the sketch after this list)
- attention checks [manipulation checks; bogus items]
- mahalanobis D
- long string data
- response time
- consistency indices [even-odd, response consistency]
- response coherence [open-ended items]
- self-reported effort/attention
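
A minimal sketch of two of these indices, Mahalanobis distance and long-string, computed by hand on a toy response matrix (dedicated packages exist for this; the data here are simulated):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.integers(1, 6, size=(100, 20)).astype(float)   # 100 respondents x 20 Likert items
X[0, :] = 3                                             # one straight-liner for illustration

# Mahalanobis distance of each respondent from the sample centroid
diff = X - X.mean(axis=0)
cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)      # squared distance per respondent

# Long-string index: longest run of identical consecutive responses
def longstring(row):
    best = run = 1
    for a, b in zip(row[:-1], row[1:]):
        run = run + 1 if a == b else 1
        best = max(best, run)
    return best

ls = np.array([longstring(r) for r in X])
print(d2[:3], ls[:3])    # flag respondents with extreme distances or very long strings
```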

20
Q

Podsakoff et al. (2024)

annual review CMB

A

Common method bias: It’s bad, it’s complex, it’s widespread, and it’s not easy to fix

2 harmful effects of common method bias/method variance

1) method variance can bias estimates of reliability and validity of latent variables
2) can bias parameter estimates of the relationship between measures of different constructs

common rater effects = leniency, social desirability, affect

item characteristic effects = item wording, ambiguity, item-level demand effects, common scale anchors

item context effects = item priming effects, item proximity, scale length

measurement context effects = same time, same location, same medium

4 basic procedural remedies
1 - obtain measures of predictor and criterion from different sources
2 - separate the measurement of predictor and criterion variables temporally, proximally, psychologically
3 - protect respondent anonymity and reduce evaluation apprehension
4 - minimize common scale properties

best to match procedural remedy to source of CMB

21
Q

Antonakis et al. (2010)

causality

A

On making causal claims: A review and recommendations

Major problem in non-experimental causal models:

Endogeneity bias leads to biased and inconsistent estimates of the model parameters (unreliable results for inference or prediction): the effect of x on y (e.g., income and consumer preference) cannot be interpreted because the error term is correlated with the predictor. Sources include:
- omitted causes: omitted variables end up in the error term and correlate with the included IVs, violating the exogeneity assumption
- simultaneity: bidirectional causality between IV and DV (e.g., assertiveness drives engagement and engagement drives assertiveness)
- measurement error: becomes part of the error term, which then correlates with the IV
- self-selection: participants are not randomly selected when participation is voluntary, so the sample may not be representative of the population

Three conditions for inferring causality:
  1. X must precede Y temporally (necessary but not sufficient condition)
  2. X must be reliably correlated with Y (beyond chance)
  3. The relationship between X and Y must not be explained by other causes

Failsafe way to establish causality = randomized experiments

22
Q

Funder & Ozer (2019)

effect sizes

A

Evaluating effect size in psychological research: Sense and nonsense.

the nonsensical but widely used interpretation of effect size is Cohen’s (1988) suggestions in the context of power analysis: r = .10 is small, .30 is medium, and .50 is large

Cohen later regretted these suggestions
small, medium, and large are meaningless in the absence of a frame of reference

Suggest two easy ways to interpret effect sizes
–> use a benchmark: compare with classic studies, with other well-established psychological findings, comparisons with “all” studies, comparisons with intuitively understood non-psychological relations
–> estimate consequences: binomial effect size display (BESD) - the BESD illustrates the size of an effect, reported in terms of r, using a 2 x 2 table of outcomes; consequences in the long run - could this tiny effect size be truly important for a population (e.g., smoking and cancer)?
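
Worked example of the BESD: a correlation r is displayed as the difference between "success" rates of 50 + 100r/2 percent and 50 - 100r/2 percent in a 2 x 2 table, so:

```latex
r = .30 \;\Rightarrow\; 65\% \text{ vs. } 35\%
```

i.e., a .30 correlation corresponds to a 65% versus 35% split in outcomes between the two groups.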

Implications for Interpreting Research Findings
- Researchers should not automatically dismiss “small effects”
- Researchers should be more skeptical about “large” effects
- Researchers should be more realistic about the aim of their programs of psychological research

Recommendations for Research Practice
- Report effect sizes, always and prominently, with confidence intervals
- Conduct studies with large samples (when possible)
- Report effect sizes in terms that are meaningful in context

revise the cohen guidelines
r = .10 - small at level of single events but potentially more ultimately consequential
r = .20 - medium size that is somewhat explanatory and practical in short run
r = .30 - large, potentially very powerful in both short and long run
r > .40 - very large in psychological research and likely to be a gross overestimate that will rarely be found in large sample or in a replication

23
Q

Bosco et al. (2015)

effect size

A

Correlational effect size benchmarks

use meta-analysis to create correlation effect size benchmarks for applied psychology

In sum, results indicated that commonly used, existing ES benchmarks are not appropriately tailored to the applied psychology research context. Results also indicate that empirical benchmarks for effect size magnitude may vary as a function of bivariate relation type

effect sizes are larger when relations don’t include behaviors
–> medium effect sizes (r)
involving behaviors = .10 - .25
not involving behaviors = .20 - .40

found that relations with movement (e.g., turnover) are smaller than relations with performance

also calculated the sample sizes needed to achieve .80 power with the effect sizes they found

using the broad benchmarks, sample sizes required to achieve .80 power for a 50th percentile effect size vary between 97 and 150 (nonbehavioral relations) and between 215 and 304 (behavioral relations)

24
Q

Nosek et al. (2022)

replication crisis annual review

A

Replicability, robustness, and reproducibility in psychological science

A study is a REPLICATION when the innumerable differences from the original study are believed to be irrelevant for obtaining the evidence about the same finding.

Replication - testing the reliability of a prior finding with different data
Robustness - testing the reliability of a prior finding using the same data and a different analysis strategy
Reproducibility - testing the reliability of a prior finding using the same data and the same analysis strategy

Replication seems straightforward - but it is not. There is no such thing as an exact replication. This fact creates a tension

Can resolve this tension by:
- accepting that every study is unique and the evidence it produces applies only to a context that will never occur again
- understanding replication as a theoretical commitment.

SYSTEMATIC replication - replication efforts that define a sampling frame and conduct replications of as many studies in the sampling frame as possible to minimize selection biases
MULTI-SITE replications - studies that conduct the same replication protocol in a variety of samples and settings to obtain highly precise estimates of effect size and heterogeneity

Bakker et al. (2012) found that if the goal is to create as many positive results as possible, it is in the researchers’ self interest to run MANY UNDERPOWERED studies rather than fewer well powered ones

Research has found that researchers do disagree with the cultural devaluation of replication (hints that problem is with system and rewards not the minds of scientists)

25
Q

Silberzahn et al. (2018)

soccer/red card dataset

A

Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results

Demonstrated the influence that data-analytic choices can have on results.

Used 29 teams (61 analysts total) to analyze the same data set to determine whether soccer referees are more likely to give red cards to dark-skin-toned versus light-skin-toned players.

Teams used a variety of analytic approaches; effect sizes ranged from 0.89 to 2.93 in odds-ratio units.

69% of teams found a statistically significant positive effect; 31% of teams did not.

Neither analysts' prior beliefs about the effect of interest, nor their level of expertise, nor peer ratings of the quality of their analyses explained the variation in outcomes.

Findings suggest that significant variation in the results of analyses of complex data may be difficult to avoid, even by experts with honest intentions.

Crowdsourcing data analysis - a strategy in which numerous research teams are recruited to simultaneously investigate the same research question - makes transparent how defensible, yet subjective, analytic choices influence research results.

26
Q

Bryan et al. (2021)

heterogeneity revolution

A

Behavioral science is unlikely to change the world without a heterogeneity revolution

The field's response to concerns about replicability has concentrated almost exclusively on efforts to control Type I error, but the single-minded focus on this issue is distracting from, and possibly aggravating, more fundamental problems standing in the way of behavioral science's potential to change the world:
--> the narrow emphasis on discovering main effects, and the common practice of drawing inferences about an intervention's likely effect at a population scale based on findings in haphazard convenience samples that cannot support such generalizations

A narrow focus on main effects in the population as a whole almost necessarily means a focus on effects in the group with the greatest numerical representation.

Need a heterogeneity revolution with a new paradigm defined by:
1 - a presumption that intervention effects are context dependent
2 - skepticism of insufficiently qualified claims about an intervention's 'true effect' that ignore or downplay heterogeneity
3 - understanding that variation in effect estimates across replications is to be expected even in the absence of Type I error

This paradigm shift will change current research practice in the following ways:
1 - increased attentiveness, in the hypothesis-generation phase, to likely sources of heterogeneity in treatment effects
2 - efforts to measure characteristics of samples and research contexts that might contribute to such heterogeneity
3 - use of new, conservative statistical techniques to identify sources of heterogeneity that might not have been predicted in advance
4 - large-scale investment in shared infrastructure to reduce the currently prohibitive cost to individual researchers of collecting data - especially field data - in high-quality, generalizable samples

2 key characteristics of the emerging paradigm that distinguish it from the current one:
1 - intervention effects are expected to be context and population dependent
2 - decline effects in later replications are not automatically attributed to questionable research practices in the original research

27
Q

Lewis-Beck & Lewis-Beck (2016)

Regression

A

Regression is the prediction of one variable's value based on another variable's value (Lewis-Beck & Lewis-Beck, 2016).

The fundamental goal of fitting a regression line to a dataset is to calculate regression weights that minimize the sum of the squared prediction errors, generally within a framework that assumes a linear relationship between X and Y.

Explanatory power is understood via the coefficient of determination (R squared).

Parameter estimates may not be significant due to (1) inadequate sample size, (2) Type II error, (3) specification error, or (4) restricted variance in X.

Core assumptions
--> no specification error (the relationship is linear and the relevant variables are included)
--> no measurement error (variables are accurately measured)
--> error terms are homoscedastic, uncorrelated, and normally distributed
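
In symbols, the bivariate case chooses the weights that minimize the sum of squared prediction errors, and R squared summarizes explanatory power:

```latex
\min_{b_0, b_1} \sum_i \left(y_i - b_0 - b_1 x_i\right)^2,
\qquad
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```
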
28
Q

Huffcutt (2004)

research perspectives on meta-analysis

A

As with traditional significance testing, the goal of meta-analysis is to make inferences about population characteristics and relationships using sample data. Thus, meta-analysis and significance testing are tied together by their common purpose. The main difference between them is that one focuses on analysis of a single study, while the other focuses on analysis of a collection of related studies.

General meta-analytic process
Step 1 - Clearly specify the characteristic being studied
Step 2 - Search for research studies which have analyzed that characteristic
Step 3 - Establish a list of criteria (i.e., standards) that located studies have to meet before they are actually included in the meta-analysis
Step 4 - Collect and record info from each study which meets the criteria established in the previous step
Step 5 - Summarize the findings of the studies mathematically

Conceptual premise of meta-analysis: founded upon the concept of sampling error
- Sampling error = difference between the characteristics of a sample and those of the population from which it was drawn; caused by chance and the direct result of dealing with a sample that typically represents only a small fraction of the population
- Because sampling errors are random, they have a tendency to average out when combined across studies
- Sampling errors tend to form a normal distribution with a mean of zero, so the mean test statistic becomes an approximate estimate of the population test statistic

29
Q

Meta-analytic approaches

(cite Huffcutt, 2004, for overall info; Hunter & Schmidt, 1990; Hedges & Olkin, 1985)

A

These five steps are generic and shared by both approaches (H&S and H&O):
1) Define the variable/construct.
2) Gather relevant studies.
3) Set inclusion criteria.
4) Extract data.
5) Summarize findings mathematically.

Central foundation of meta-analysis: sampling error
--> Sampling error = random difference between a sample and the true population value.
--> Key idea: by averaging across many studies, random errors tend to cancel out (central limit theorem).
--> Both methods account for sampling error, but Hunter & Schmidt treat it as more central to their model.

Hunter & Schmidt (1990):
- Seeks to estimate true effect sizes by correcting for known artifacts like measurement error and range restriction.
- Assumes much of the observed variation is due to statistical artifacts, and focuses on understanding construct-level relationships.
- Typically used more in I-O.

Hedges & Olkin (1985):
- Takes a conservative, statistical approach, modeling the observed effect sizes without making psychometric corrections, and places more emphasis on inference precision and heterogeneity modeling.

Testing for moderators (see the sketch below)
H&S = (a) if 75 percent or more of the observed variance is attributable to sampling error, assume no moderation; (b) run separate meta-analyses for different levels of the suspected moderator
H&O = Q statistic (test of homogeneity; a significant Q suggests moderators are present)

Focus of the effect
H&S = TRUE effect (psychometric)
H&O = OBSERVED effect (statistical)

Other notes:
--> Study compatibility: must ensure studies assess the same construct (otherwise results are meaningless).
--> Uneven sample sizes: very large studies can dominate results; some methods adjust for this.

Confidence vs. credibility intervals:
- Confidence interval (CI) = how precisely you estimate the mean effect size.
--> "If I repeated this meta-analysis with different samples, the average effect size would fall in this range most of the time."
- Credibility interval (CrI) = range of effect sizes that exist in the population (based on residual variance).
--> "In the real world, the strength of this effect could range from this low to this high, depending on the context."
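
Sketch of the key quantities behind the moderator tests (r-bar = sample-size-weighted mean correlation, N = study sample size, k = number of studies). The exact estimators vary by treatment, so treat these as illustrative:

```latex
\text{H\&S: } \hat{\sigma}_{e}^{2} = \frac{(1 - \bar{r}^2)^2}{\bar{N} - 1}
\ \ \text{(compared with the observed variance of } r \text{; 75\% rule)}
\qquad
\text{H\&O: } Q = \sum_{i=1}^{k} w_i\,(r_i - \bar{r})^2 \sim \chi^2_{k-1} \text{ under homogeneity}
```

where the w_i are inverse-variance weights (Q is often computed on Fisher-z transformed correlations).
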
30
Q

Page et al. (2021)

PRISMA

A

Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA): created to help increase standardization and transparency in systematic reviews.

PRISMA 2020 contains a 27-item checklist, an expanded checklist that details reporting recommendations for each item, the PRISMA 2020 abstract checklist, and revised flow diagrams for original and updated reviews.

31
Q

Using AI to help streamline meta-analysis

Tools:
- Elicit
- ASReview

A

1. Literature search and screening (van de Schoot et al., 2021)
- AI can automate or semi-automate identifying relevant studies.
- Semantic search tools (e.g., Elicit, Research Rabbit) use NLP to find papers that match the meta-analysis topic, beyond just keyword matches.
- AI-assisted abstract screening: tools like ASReview use machine learning to learn your inclusion/exclusion decisions and prioritize the most relevant papers.
- De-duplication and sorting: AI can cluster studies, remove duplicates, and tag preprints, grey literature, etc.

2. Data extraction
Text mining and NLP tools can identify and extract:
- Effect sizes (e.g., d, r, OR)
- Sample sizes
- Moderator variables
- Study characteristics

3. Effect size computation and conversion
AI can help standardize effect sizes, especially when original studies report statistics in inconsistent formats (e.g., F-values, t-tests, odds ratios).
- Automated effect size calculators can be paired with NLP to recognize stats and convert them into a common metric like Cohen's d or r.

4. Artifact identification and correction
- AI can flag or estimate measurement reliability, range restriction, or missing artifact data by cross-referencing external databases or using imputation models.
- For psychometric meta-analyses (Hunter & Schmidt), this can help correct for artifacts more efficiently.

5. Moderator analysis and pattern detection
AI (especially machine learning models) can:
- Detect complex moderator patterns (e.g., interactions, nonlinear effects).
- Help prioritize which moderators to test based on exploratory analysis.
- Use clustering or decision trees to identify subgroups of studies with different effects.

6. Reproducibility and workflow automation
AI tools can be integrated into reproducible workflows (e.g., using R, Python, or PRISMA-compliant pipelines) that document decisions transparently.
- AI could even monitor whether your meta-analysis is following PRISMA, MARS, or APA guidelines in real time.

Future potential: generative AI + meta-analysis
Imagine uploading 100 PDFs and asking an AI to:
- Extract data
- Compute corrected effect sizes
- Run a meta-analysis
- Generate a forest plot
- Summarize moderators
- Write the results section - all in one go.
This isn't far off. Researchers are already combining LLMs with statistical packages (like meta or metafor in R) to automate parts of this pipeline.

32
Q

Oh (2020)

Secondary uses of meta-analytic data (SUMAD) annual review

A

SUMAD enables researchers to:
- Develop or refine theories by examining patterns across various meta-analyses.
- Inform evidence-based practices by identifying consistent findings across multiple studies.
- Detect moderators or mediators that influence relationships in different contexts.
- Assess the generalizability of findings across diverse populations or settings.

Issues:
1. Avoid using meta-analytic results based on very small k (number of studies).
2. Don't remove outlier effect sizes - they are often due to real differences in contexts, measures, or samples, not errors.
3. Use multiple publication bias tests.

- SUMAD treats meta-analytic estimates as inputs for theory testing, like a correlation matrix for meta-analytic SEM (MASEM).
- If your inputs are biased, imprecise, or over-sanitized, the theoretical inferences you draw (e.g., mediation, moderation, causal assumptions) will be flawed.
- It's a "garbage in, garbage out" situation - but with even greater consequences, because you're typically not revisiting the original studies.

33
Q

Ployhart et al. (2025)

Intensive longitudinal models annual review

A

Ployhart et al. (2025) give a definition of ILM, examples, and different types of recommendations: theoretical, design/timing, analytical/modeling, and reporting.

Intensive longitudinal models: require multilevel data (Level 1, Level 2) - nested data collected through frequent measurements (typically 20 or more) over densely spaced durations. The desire is to understand temporal dynamics. NOT a specific type of model.

ESM (Experience Sampling Methodology) = most active area where versions of ILM are found.

Measurement occasions must be adjacent and sequential.

Classification typology of ILM:
- Time as substantive variable: linear or quadratic trend? Or ignore/model as control?
- Modeling time: lagged relationships, or examine equal intervals or event duration?
- Event/context as substantive variable: continuous model (event impacts trend), or event duration impacts?
- Linear vs. nonlinear trends: is it a linear trend, or is it truly nonlinear?
- Presence of multiple levels: two levels (within and between) or more than two levels?
- Model residual covariance structure: yes, or assume independence of residuals?

Theoretical recommendations:
- treat time as substantive
- incorporate time/duration in hypotheses
- contrast within- and between-person effects
- justify timing of measurement

Design recommendations:
- ensure measurement is aligned with the cadence of the construct/process
- determine whether to collapse scores across measurement occasions
- estimate reliability between- and within-person
- evaluate missing data

Analytical recommendations (see the sketch below):
- model time using a growth model
- model intercept and slope variance (random effects)
- model nonindependence of residuals
- test differences for within- and between-subject variance

Reporting recommendations:
- sample size, number of repeated measurement occasions, total observations
- report missing data, reliability estimates, and model estimation methods
- report both within- and between-person estimates, even if not relevant to hypotheses
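
A minimal sketch of the baseline analytical recommendation - a two-level linear growth model with random intercepts and slopes (t = measurement occasion, i = person):

```latex
\text{Level 1: } y_{ti} = \pi_{0i} + \pi_{1i}\,\mathrm{Time}_{ti} + e_{ti}
\qquad
\text{Level 2: } \pi_{0i} = \gamma_{00} + u_{0i}, \quad \pi_{1i} = \gamma_{10} + u_{1i}
```

The random effects u_{0i} and u_{1i} capture between-person variance in intercepts and slopes, and an autoregressive structure can optionally be placed on the e_{ti} to model nonindependent residuals.
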
34
Q

Woo et al. (2024)

person-centered modeling

A

Person-Centered Modeling: Techniques for Studying Associations Between People Rather Than Variables

Many commonly used variable-centered models (e.g., linear regression, ANOVA, factor analysis, item response theory, multilevel regression, latent growth models) assume that all individuals come from the same population and differ only by degrees.

Person-centered models assume that populations are heterogeneous: the population is composed of individuals from groups that differ from one another.

Traditional clustering has hard boundaries (e.g., k-means), whereas mixture models allow for fuzzy boundaries (e.g., LPA).

Examples (see the sketch below):
- K-means: used to group employees by personality profile based on trait scores, but treats profiles as fixed and doesn't account for measurement error or overlapping profiles.
- LPA (a mixture model): estimates the probability that each person belongs to each profile, accounts for error, and allows for statistical testing of profile differences on outcomes.

Newer methods such as multilevel mixture models can reveal variability in employee profiles across teams, organizations, or countries.
--> Techniques from machine learning, such as unsupervised learning algorithms, and cluster algorithms developed in the context of network models can also be used for person-centered modeling.
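
A minimal sketch of the hard vs. fuzzy distinction using scikit-learn, with KMeans as the hard-boundary clustering and GaussianMixture as a stand-in for LPA-style mixture modeling. The "profiles" and trait data are toy values, not from the article:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
# Toy trait scores: two latent "profiles" of 100 people each on 3 traits
X = np.vstack([rng.normal(0, 1, size=(100, 3)),
               rng.normal(2, 1, size=(100, 3))])

hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
fuzzy = gmm.predict_proba(X)

print(hard[:5])             # hard assignments: exactly one profile label per person
print(fuzzy[:5].round(2))   # fuzzy assignments: membership probability for each profile
```
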
35
Q

Issues in person-centered research

(Ployhart et al., 2025, annual review)

A

When conducting a person-centered study, researchers encounter many methodological decision points and challenges that can affect the substantive quality and/or informational value of their results. These issues include:
(a) determining an appropriate sample size for the method of choice,
(b) deciding on model constraints,
(c) selecting an optimal number of classes,
(d) deciding whether and how to include covariates in the analysis, and
(e) testing (versus assuming) invariance of latent classes across samples or time points.

36
Q

Beal (2015)

ESM annual review

A

Conceptual elements common to various forms of ESM:
- Natural environment - capturing experiences as closely as possible to how they would naturally occur
- Immediacy of experience - prioritizing concrete and immediate experiences over abstract or recalled experiences
- Representative sampling - assessing a range of experiences that accurately reflect an individual's daily life

Also discusses advantages and challenges of ESM.

37
Q

Gabriel et al. (2019)

ESM

A

Experience sampling methods: A discussion of critical trends and considerations for scholarly advancement

Q1: Building within-person theory with ESM
- Most org theories are implicitly within-person, but past research often relies on between-person designs.
- ESM can help test and refine within-person dynamics.
- Decide if you're studying experiences (momentary) or abstractions (aggregated patterns), as this influences how you build theory and apply ESM.

Q2: Isomorphism and homology
- Use multilevel construct validation to test whether your items work similarly at both within- and between-person levels.
- Homologous relationships = similar patterns at both levels; explain why theory/processes differ if they do not hold.
- Use strategies like: aggregating Level 1 data, measuring stable traits during ESM, modeling both levels simultaneously.

Q3: Sample size & power
- Level 1 (within-person) power is usually high - overpowering may be a concern.
- Report effect sizes and justify their importance.
- Benchmark averages: ~835 observations (L1), ~83 participants (L2).
- Base sample size needs on phenomenon duration (e.g., how long does it take for change to occur?).
- Always report actual sample sizes and missingness per variable.

Q4: Motivating ESM participation
- Financial incentives work well - clearly explain the payment plan (per survey, per day, etc.).
- Future research should test influence tactics (e.g., social proof, implementation intentions) to improve engagement and data quality.

Q5: Psychometrics of within-person measures
- Don't use test-retest reliability at Level 1 - use variance decomposition and multilevel reliability estimates.
- Use multilevel CFA (MCFA) to assess factor structure; report fit stats, loadings, and alternate models.

Q6: Adapting scales for ESM use
- Report why and how items were trimmed or modified for within-person use.
- Conduct content analysis on shortened scales, especially for formative constructs.
- Always report reliabilities and MCFA results for adapted measures.

Q7: Modeling trends and cycles
- Consider social context (e.g., 9-5 jobs) and individual patterns (e.g., shift work).
- Test for fixed/random trends or cycles (e.g., Beal & Weiss, 2003) - include them only if significant.

Q8: Common method bias (CMB) in ESM
- Person-mean centering removes the need for Level 2 controls in Level 1 analyses (see the sketch below).
- Control for mood effects using tools like PANAS-X or POMS, especially in morning surveys.
- If long surveys are a concern, use shortened/single-item mood scales.
- Use lagged variables and t-1 controls to reduce bias and support causal inference.
- Report analyses both with and without CMB controls.

Q9: Varying start times & work schedules
- Complex work schedules (e.g., shift work) are not a barrier - design individualized survey schedules.
- Track reminder and response times to capture temporal patterns in the data.

Q10: Using secondary data in ESM
- Use secondary sources only if they help address the research question and are feasible (e.g., IRB, funding).
- Consider daily or cross-level data from others (coworkers, spouses, sensors).
- Even if self-reports and secondary sources align, it can still add value.
- Consider ESM-based experiments with best practices (e.g., randomization, control groups, multilevel manipulations).
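
A minimal sketch of person-mean centering for a Level-1 ESM variable in pandas; the column names are hypothetical:

```python
import pandas as pd

esm = pd.DataFrame({
    "person": [1, 1, 1, 2, 2, 2],
    "stress": [3, 4, 5, 2, 2, 4],   # momentary (Level-1) report
})

# Within-person component (person-mean-centered) for Level-1 analyses
esm["stress_pmc"] = esm["stress"] - esm.groupby("person")["stress"].transform("mean")
# Between-person component (the person mean itself) for Level-2 analyses
esm["stress_pm"] = esm.groupby("person")["stress"].transform("mean")
print(esm)
```
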
38
Q

Wilhelmy & Kohler (2022)

Qualitative research

A

Qualitative research in work and organizational psychology journals: Practices and future opportunities

The paper focuses on inductive qualitative research, which develops insights from empirically collected data (i.e., moving from empirical data to interpretations and abstractions); the process is highly iterative.

Importance of qualitative research for I-O/OB:
1 - useful for studying research topics that explore individuals' experiences, sensemaking, or meaning-making phenomena
2 - qual methods attempt to explore or uncover mechanisms, whereas quantitative methods rely on the predetermination (i.e., hypothesizing) of such mechanisms (e.g., indirect effects) to be tested
3 - ideal for studying current organizational changes and newly emerging topics such as new ways of working
4 - allows the study of events/behaviors that occur with a lower base rate or are otherwise hard to capture (e.g., workplace bullying and physical violence)
5 - when studying sensitive topics or phenomena in vulnerable populations, qual research approaches make it easier to establish rapport with informants, which helps them open up and communicate their worldviews
6 - can help study topics with practical relevance and the possibility of making a real difference; rather than trying to ignore contextual factors by generalizing across them, qual research can bring to the fore contextual boundaries, limitations, and influencing factors, and explore how they shape how people feel, think, and behave

Types of qual approaches most commonly used: thematic analysis, case study, grounded theory

Types of qual data collection methods: semi-structured interviews, critical incidents, open-ended questions on surveys, focus groups, etc.