Selection Flashcards
(52 cards)
Morgeson & Campion (1997)
Social and cognitive sources of potential inaccuracy in job analysis
SOCIAL SOURCES:
social influence processes (e.g., conformity pressure, extremity shift, motivation loss) and self-presentation processes (e.g., impression management, social desirability, demand effects)
COGNITIVE SOURCES:
limitations in information processing (e.g., information overload, heuristics, categorization) and biases in information processing (e.g., carelessness, order and contrast, leniency and severity, method effects)
different sources of inaccuracy affect different parts of job analysis data: interrater reliability, interrater agreement, discriminability between jobs, dimensionality of factor structures, mean ratings, completeness of job information
a table also maps which of the above sources of inaccuracy apply to each job analysis facet:
- job descriptors (job oriented, worker oriented)
- analysis activity (generate, judge)
- data collection (group meeting, individual interview, observation, questionnaire)
- source of data (incumbent, supervisor, analyst)
- purpose (compensation, selection, training)
Strah & Rupp (2022)
Are there cracks in our foundation? An integrative review of diversity issues in job analysis
extending Morgeson & Campion (1997) - describes the sources of true and error (in)variance in JA data across demographic subgroups
job analysis needs to more inclusively and accurately capture the job experiences of individuals from different demographic subgroups
antecedents of TRUE differences in work across subgroups = job-relevant individual differences, performing differently in response to stereotypes, different assigned work, different environmental/societal restrictions –> true variance
diversity-related barriers = conformity to norms, impression management, lack of opportunity for voice, demand effects, over-reliance on specific perspectives, language bias, majority effect –> non-random error
true (in)variance + error (in)variance = total (in)variance –> HR practices
Campion et al. (2011)
Competency modeling
CM vs JA
1. executives typically pay more attention to CM
2. CM often attempt to distinguish top performers from average performers
3. CM often include how competencies change across employee level
4. CM usually linked directly with business objectives and strategies
5. CM typically developed top-down (start at C suite) rather than bottom up (start with employees)
6. CM may consider future job requirements (directly or indirectly)
7. CM can be easier to interact with (org-specific language, visuals, etc)
8. Finite number of competencies are identified across multiple functions/jobs
9. CM frequently used to align HR systems
10. CM are often used in org development and change, rather than simple data collection
overall best-practice theme: organize the competency information well and make it accessible to users
Sanchez & Levine (2009)
Competency modeling vs job analysis
CM should be used in tandem with TJA, using TJA data as a base for the models
in general TJA is used to better understand work assignments, capturing essential elements, work-focused, typical performance
while CM is more about influencing how assignments should be performed, worker-oriented, organization-wide, maximal performance
Putka et al. (2023)
Evaluating a natural language processing (NLP) approach to estimating KSA and interest job analysis ratings
input = job descriptions and task statements from O*NET (training) and an independent set of occupations from a large org (testing)
the ML approach produced KSAO predictions whose cross-validated correlations with SME ratings were:
knowledge (.74)
skills (.80)
abilities (.75)
interests, RIASEC (.84)
found clear evidence for validity of machine-based prediction based on:
(a) convergence of machine-based and SME-furnished ratings
(b) conceptually meaningful patterns of prediction and model regression coefficients among KSAOs
(c) conceptual relevance of the top predictors underlying related clusters of KSAOs in the PCAs analyzed (beyond the stats, the clusters made sense)
prediction models developed on ONET data produced meaningful results on the independent set of job descriptions and tasks (testing data, no KSAOs in that set)
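The general flavor of this kind of text-to-ratings pipeline can be illustrated with a minimal sketch: vectorize each occupation's description/task text, fit a regression model to SME-furnished ratings for one KSAO, and check cross-validated convergence. This is a hedged illustration only (a generic TF-IDF + ridge pipeline with made-up data), not Putka et al.'s actual features or models.

```python
# Minimal sketch (NOT Putka et al.'s pipeline): predict SME ratings for one
# KSAO from occupation text, then check cross-validated convergence.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Hypothetical training data: one text blob (description + task statements)
# and one SME importance rating per occupation (a real run would use the
# hundreds of O*NET occupations, one model per KSAO).
texts = ["operate lathes and milling machines to shape metal parts",
         "prepare financial statements and audit accounting records",
         "teach elementary school students reading and mathematics",
         "analyze job applicant data and build selection dashboards"]
sme_ratings = np.array([2.1, 3.8, 3.2, 3.5])

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge(alpha=1.0))

# Cross-validated predictions correlated with SME ratings mirror the
# convergence evidence (r ~ .74-.84 across KSAO domains) noted above.
preds = cross_val_predict(model, texts, sme_ratings, cv=2)
print(np.corrcoef(preds, sme_ratings)[0, 1])

# A fitted model can then score an independent set of job descriptions/tasks
# that has no SME ratings, as with the paper's large-org test set.
model.fit(texts, sme_ratings)
print(model.predict(["supervise warehouse staff and schedule shipments"]))
```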
Sackett et al. (2022)
Revisiting meta-analytic estimates of validity in personnel selection
discusses range restriction issues: the approaches traditionally used to correct for range restriction (building range restriction artifact distributions) have significant flaws that have generally led meta-analysts to substantially overcorrect for range restriction
after critiquing previous RR practices, they offer a best estimate of mean operational validity, which often reflects either a range restriction correction or no correction at all
new top 8: structured interview (.42), job knowledge test (.40), empirically keyed biodata (.38), work sample tests (.33), cognitive ability tests (.31), integrity tests (.31), personality-based emotional intelligence (.30), assessment centers (.29)
highest BW subgroup differences: cognitive ability tests (.79), work sample tests (.67), job knowledge tests (.54)
contextualized personality tests showed higher validity than general (non-contextualized) personality measures and have low BW subgroup differences
Sackett et al. (2023)
Revisiting the design of selection systems in light of new findings regarding validity of widely used predictors
A number of predictors at the top of the list, such as job knowledge tests, work sample tests, and empirically keyed biodata, are not generally applicable in situations where KSAs are developed after hire via training or on the job.
Since cognitive ability no longer emerges as the top predictor in the validity findings, it does not need to be the centerpiece of selection procedures; this also changes the nature of the validity-diversity tradeoff.
Ultimately, how should practitioners and researchers estimate operational validity? (a worked sketch of the two corrections follows this list)
1) Correct for reliability first, then for range restriction
2) Measurement error exists in all our criteria, correcting for unreliability is important for all validity studies
3) Use estimate of interrater reliability, not internal consistency
4) Consider local interrater reliability, if available
5) If not available, consider reliability estimates from similar settings with similar measures
6) If neither above are available, utilize relevant meta-analytic reliability estimate
7) triangulate between local and meta-analytic reliability estimates if multiple estimates available
8) Lower reliability estimates produce larger corrections (based on the formula)
9) If objective performance is used, consistency over time is the basis for reliability
10) Correcting for range restriction requires credible estimate of predictor standard deviation in applicant pool and the standard deviation among selected employees
11) If predictor in question was used in selecting validation sample, range restriction is particularly important issue
12) Range restriction generally does not have sizeable effect if predictor was not used in selecting validation sample
13) Obtain local applicant and incumbent sample standard deviation if possible
14) Be cautious when using formulas that convert selection ratio into U-ratio for range restriction correction
15) Be cautious about using publisher norms as estimate of applicant pool standard deviation
16) Do not use mean range restriction correction from meta-analysis as basis for correction in concurrent studies (key message from Sackett et al., 2022)
17) Use mean range restriction correction factor from meta-analysis with extreme caution
18) Make no correction unless confident in the standard deviation information at hand
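Putting points 1-10 together as arithmetic, here is a minimal sketch of the two-step correction (criterion unreliability first, then direct range restriction). The numbers are made up for illustration and are not values from the paper; the formulas are the standard disattenuation and Thorndike Case II corrections.

```python
import math

def correct_for_criterion_unreliability(r_obs, ryy):
    """Step 1: disattenuate the observed validity for criterion unreliability
    (e.g., interrater reliability of performance ratings): r / sqrt(ryy)."""
    return r_obs / math.sqrt(ryy)

def correct_for_direct_range_restriction(r, u):
    """Step 2: Thorndike Case II correction for direct range restriction,
    where u = SD(incumbents) / SD(applicant pool); u < 1 when the predictor
    was used to select the validation sample."""
    U = 1.0 / u
    return (r * U) / math.sqrt(1.0 + r ** 2 * (U ** 2 - 1.0))

# Hypothetical example: observed validity .25, interrater reliability of the
# criterion .60, u-ratio .80.
r_step1 = correct_for_criterion_unreliability(0.25, 0.60)            # ~ .32
r_operational = correct_for_direct_range_restriction(r_step1, 0.80)  # ~ .39
print(round(r_step1, 2), round(r_operational, 2))
# Note how a lower reliability estimate (point 8) or a smaller u-ratio would
# both produce larger upward corrections.
```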
Scherbaum et al. (2017)
Chapter on subgroup differences in selection assessments
big point: Combining multiple methods in a balanced selection battery can help mitigate adverse impact while maintaining predictive validity.
GMA tests, despite their predictive power, present the largest subgroup differences and the greatest risk of adverse impact.
Personality tests, integrity tests, structured interviews, and work samples are more equitable and offer viable alternatives or supplements to cognitive assessments.
Stanek & Ones (2018)
cognitive ability and personality – a massive compendium-style paper mapping the two construct domains and their interrelations
Schneider & Newman (2015)
Intelligence is multidimensional: Theoretical review and implications of specific cognitive abilities.
HRM usually treats cognitive ability as a unidimensional construct.
possible rationales for this choice = practical convenience, the parsimony of Spearman’s theory of general mental ability (g), positive manifold among cognitive tests (all positively related to each other), and empirical evidence of only modest incremental validity of specific cognitive abilities for predicting job and training performance over and above g.
Recommend use of narrower, second-stratum cognitive abilities (e.g., fluid reasoning, crystallized intelligence).
The renewed focus on multiple dimensions of intelligence is supported by several arguments:
- empirical evidence of modest incremental validity (typically at or above 2%; see the note after this list) of specific cognitive abilities predicting job performance beyond g
- compatibility principle - specific abilities predict specific job tasks better than general performance (e.g., spatial reasoning for engineering tasks)
- application of bifactor and relative importance methodologies to predict job performance via g and specific abilities simultaneously
- Selection tools emphasizing specific abilities may reduce racial subgroup differences compared to g-heavy tests
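The "2%" figure above is just the incremental R² when specific abilities are added to a model already containing g; a quick statement of that bookkeeping (notation mine):

```latex
\Delta R^{2} \;=\; R^{2}_{g+\text{specific}} \;-\; R^{2}_{g}
% e.g., R^2_g = .25 and R^2_{g+\text{specific}} = .27 gives \Delta R^2 = .02,
% i.e., the roughly 2% incremental validity referenced in the list above.
```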
Melson-Silimon et al. (2023) and commentaries
Personality testing and the Americans with Disabilities Act: Cause for concern as normal and abnormal personality models are integrated
Concerns that personality testing in selection may risk breaching the ADA as normal and abnormal personality models are integrated
FFM traits correlated with personality disorders (direction in parentheses):
- Neuroticism (+): Borderline PD
- Agreeableness (–): Narcissistic PD
- Extraversion (–): Avoidant PD, Schizoid PD
- Conscientiousness (–): Antisocial PD
RECOMMENDATIONS
- Establish job relatedness through a proper job analysis. Whenever possible, utilize alternative selection methods that are less invasive but with equivalent validity.
- Avoid personality tests that assess constructs closely related to PDs, "dark side" traits, and normal personality traits that are highly correlated with PDs.
- Conduct more research involving development and validation of personality tests to be used in preselection.
- Ensure items ask about behavior in the workplace.
- Do not involve persons with clinical or medical licensure in administration or interpretation unless clinical personality diagnosis is job related and, if so, administer the test AFTER a conditional job offer.
- Advocate for direct conversation with various disciplines in psychology and the EEOC through research and discussion on implications of an anticipated change in PD diagnosis.
Dahlke & Sackett (2017)
guidance on handling effect sizes in differential prediction
PREDICTIVE BIAS
subgroup differences and predictive bias can exist independently of one another
testing for predictive bias involves using moderated multiple regression, where the criterion measure is regressed on the predictor score, subgroup membership, and an interaction term between the two
Slope and/or intercept differences between subgroups indicate predictive bias
EFFECT SIZES
in predictive bias analyses, it is useful to consider effect sizes as well as statistical significance. See Nye & Sackett (2017) and Dahlke & Sackett (2017) for treatment of effect sizes in predictive bias analysis
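For reference, a minimal statement of the MMR (Cleary) model described above, with the criterion regressed on the predictor, dummy-coded subgroup membership, and their interaction (notation mine):

```latex
\hat{Y} \;=\; b_{0} + b_{1}X + b_{2}G + b_{3}\,(X \times G)
% b_3 \neq 0            -> slope differences between subgroups (slope bias)
% b_3 = 0,\ b_2 \neq 0  -> intercept differences (intercept bias)
% b_2 = b_3 = 0         -> no evidence of predictive bias
```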
Schmidt & Hunter (1998)
Precursor to Sackett et al. (2022) with meta-analytic estimates of predictor validity
Sackett et al. (2024)
A contemporary look at the relationship between general cognitive ability and job performance. [meta-analysis]
Main point: GCA is related to job performance, but our estimate of the magnitude (validity = .22) of the relationship is lower than prior estimates.
The relationship between general cognitive ability (GCA) and overall job performance has been a long-accepted fact in industrial and organizational psychology. However, the most prominent data on this relationship date back more than 50 years.
mean observed validity of .16, with a residual SD of .09. Correcting for unreliability in the criterion and correcting predictive studies for range restriction produces a mean corrected validity of .22 and a residual SD of .11.
While this is a much smaller estimate than the .51 value offered by Schmidt and Hunter (1998), that value has been critiqued by Sackett et al. (2022), who offered a mean corrected validity of .31 based on integrating findings from prior meta-analyses of 20th century data. (new estimate is based on 21st century data)
Hoffman et al. (2015)
A review of the content, criterion-related, and construct validity of ACs
big recommendation: don't use exercise-based scoring over dimension-based scoring, BUT both can be meaningful and should be further investigated
meta-analysis of the content, criterion-related, construct, and incremental validity of 5 common AC exercises
in-basket (given a set of info and need to respond accordingly), LGD (leaderless group discussion), case analysis, oral presentation, role play
all 5 types significantly related to job performance (rho = .16 - .19)
nomological network analysis –> exercises tend to be modestly associated with GMA, extraversion, and to a lesser extent openness, and unrelated to agreeableness, conscientiousness, and emotional stability
exercises explain variance well beyond GMA and the Big 5, and the different exercises are not redundant in what they measure
Kleinmann & Ingold (2019)
Toward a Better Understanding of Assessment Centers: A Conceptual Review [annual review]
ACs are a commonly-utilized method for assessing employees, especially leaders
ACs comprise multiple assessment components, at least one of which is a behavioral simulation exercise.
An AC may consist solely of simulation exercises, or combine them with other methods, such as interviews, personality inventories, and/or ability tests.
The result is a comprehensive, partially or fully behavioral evaluation of an assessee's proficiency on a set of job-relevant, behaviorally defined performance dimensions
Can be used for assessment, diagnostic, and developmental purposes
Dual-process (System 1 and System 2) theorizing can be applied to how assessors form ratings of assessees, and CAPS theory to assessee behavior across exercises
Kuncel & Sackett (2014)
Resolving the AC construct validity problem as we know it
importance of dimension variance in ACs
ongoing concern about the construct validity of AC dimensions: the long-standing worry is that post-exercise dimension ratings (PEDRs) reflect more exercise variance than dimension variance, such that scores say more about performance on specific exercises than about the dimension being measured (e.g., leadership)
however, PEDRs are not the final score; they are an intermediate step toward an overall dimension rating, and the overall dimension rating should be the focus of inquiry. Dimension variance quickly overtakes exercise-specific variance as the dominant source of variance when ratings from multiple exercises are combined (a good thing)
– with as few as 2 exercises combined, dimension variance can already exceed exercise-specific variance
however, the largest source of dimension variance is a general factor (meaning general performance or getting things done, which makes it difficult to really pinpoint multiple distinct constructs)
Suggests that ACs may not be measuring multiple, distinct constructs, but rather a general capability to perform well in workplace simulations.
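A minimal illustration of the composite logic behind the Kuncel & Sackett argument (notation mine, not their exact decomposition): treat each PEDR for a dimension as dimension effect + exercise-specific effect + error; when k exercises are summed, the shared dimension variance accumulates faster than the exercise-specific variance.

```latex
x_{j} = d + e_{j} + u_{j}, \qquad
\operatorname{Var}\!\left(\sum_{j=1}^{k} x_{j}\right)
   = k^{2}\sigma_{d}^{2} + k\,\sigma_{e}^{2} + k\,\sigma_{u}^{2}
% dimension share = k\sigma_d^2 / (k\sigma_d^2 + \sigma_e^2 + \sigma_u^2),
% which grows with k - so combining even 2 exercises can make dimension
% variance the dominant source, consistent with the point above.
```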
Speer et al. (2023)
meta-analysis on biodata in employment settings: providing clarity on criterion-related and construct-related validity estimates
main point: biodata inventories are highly predictive assessment methods and are likely to provide unique variance over other common predictors
2 defining features of biodata validity
(a) construct domain
(b) scoring method (rational, hybrid, empirical)
biodata had criterion related validity with job performance and additional outcomes, convergent validity with common external hiring measures
biodata inventories are one of the most predictive assessment methods available, but the relationship with work outcomes differs by construct domain and scoring method
- empirically keyed scales showed the strongest criterion-related validity (rho = .44) compared to rationally scored scales (rho = .29)
- among the narrow construct domains, scales developed to measure conscientiousness and leadership were generally the most predictive of job performance, particularly when empirically keyed
- when biodata scales were correlated with theoretically aligned performance ratings, rational scoring resulted in validity coefficients similar to empirical scoring
- biodata scales exhibited expected patterns of correlations with external measures and were only moderately correlated with cognitive ability and big 5
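For concreteness, a minimal sketch of what empirical keying involves (hypothetical data and a simple correlational key, not Speer et al.'s procedure): weight items by their observed relationship with the criterion in a development sample, then score a holdout sample and check cross-validated validity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical biodata item responses (n applicants x 20 items) and a job
# performance criterion, split into development and holdout samples.
X = rng.normal(size=(400, 20))
y = 0.4 * X[:, 0] - 0.3 * X[:, 5] + rng.normal(size=400)
X_dev, y_dev, X_hold, y_hold = X[:200], y[:200], X[200:], y[200:]

# Empirical key: each item's weight is its correlation with the criterion in
# the development sample (real keys often weight individual response options).
weights = np.array([np.corrcoef(X_dev[:, j], y_dev)[0, 1]
                    for j in range(X.shape[1])])

# Score the holdout sample with the key and estimate cross-validated validity.
holdout_scores = X_hold @ weights
print(np.corrcoef(holdout_scores, y_hold)[0, 1])
```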
Whetzel et al. (2020)
Situational Judgment Tests: An Overview of Development Practices and Psychometric Characteristics
Lots of guidance on SJTs:
SCENARIOS
- critical incidents enhance realism of scenarios
- SPECIFIC scenarios –> higher validity, fewer assumptions by examinee
- brief scenarios can reduce reading load, can reduce group differences
- avoid: sensitive topics, overly simplistic scenarios (one plausible response), overly complex scenarios
RESPONSE OPTIONS
- use SMEs to develop responses
- range of effectiveness levels
- be careful about transparency/obviousness of the construct being assessed
- only one action, no double-barreled
- have options of active bad (do something wrong) and passive bad (do nothing)
- check for tone cues
RESPONSE FORMAT
- use knowledge-based (should-do) in high stakes to help with faking
- use behavioral tendency (would-do) in non-cognitive constructs like personality
- use the method where examinees rate each option (higher reliability and favorable applicant reactions)
- single-response SJTs are easy for analysis but can have higher reading load on candidates
SCORING
- empirical and rational keys have similar levels of reliability and validity, use SME input
- develop more scenarios and options than you will end up needing
- use 10-12 raters with different perspectives
- use means (effectiveness levels) and SDs (rater agreement) to select options
reliability and validity
- do NOT use alpha for multidimensional SJTs
- instead use split-half reliability with the Spearman-Brown correction (assuming content is balanced; formula sketched after this list)
- validity is similar for knowledge and behavioral tendency
- SJTs have slight incremental validity over cog ability and personality, they likely also measure a general personality factor, and it can correlate with other constructs (cog ability/personality)
- have been used in military settings
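The split-half approach above uses the Spearman-Brown correction to step the half-test correlation up to full-test length (assuming content-balanced, roughly parallel halves):

```latex
r_{\text{full}} \;=\; \frac{2\,r_{\text{half}}}{1 + r_{\text{half}}}
% e.g., a split-half correlation of .50 implies full-length reliability ~ .67
```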
group differences
- smaller on SJTs than on GMA tests
- women perform slightly better
- behavioral tendency has smaller group differences than knowledge
- rate format has lower group differences than ranking or selecting best and worst
presentation methods
- avatar- and video-based SJTs have several advantages
- higher face and criterion-related validity, but may be less reliable
- using avatars may be less costly, but developers should consider uncanny valley effects when using 3D human imaging
faking
- faking DOES affect rank ordering of candidates and who is hired
- faking is more of a problem with BEHAVIORAL tendency (would do) than knowledge-based (should do), especially in high stakes situations
- SJTs generally appear less vulnerable to faking than personality measures
coaching
- examinees can be coached on how to maximize SJT responses, orgs can endorse this to help level the playing field (as opposed to individuals seeking it out on their own)
- scoring adjustments (key stretching, within-person standardization across scores)
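As one concrete example of the scoring adjustments in the last bullet, within-person standardization rescales each examinee's option ratings around their own mean and SD, so uniform score elevation (a common faking pattern) carries no weight; a minimal sketch with hypothetical ratings:

```python
import numpy as np

# Hypothetical SJT data: each row is one examinee's 1-7 effectiveness ratings
# across the same five response options.
ratings = np.array([[6, 7, 6, 7, 5],    # uniformly elevated profile
                    [2, 5, 3, 6, 1]],   # more differentiated profile
                   dtype=float)

# Within-person standardization: z-score each row with that person's own mean
# and SD, so only the relative ordering of options remains informative.
row_means = ratings.mean(axis=1, keepdims=True)
row_sds = ratings.std(axis=1, ddof=1, keepdims=True)
z_scores = (ratings - row_means) / row_sds
print(np.round(z_scores, 2))
```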
Hartwell et al. (2022)
social media assessment (SMA)
lays out a map for how to structure an SMA
identifies potential issues with using SMA including missing information, privacy concerns, discrimination
should base components of an SMA on components of structured interviews
structural components of SMA: job relatedness, procedural consistency, rating scales used, documentation, assessor training, having multiple raters, separating raters from the decision makers, informed consent, notifying about results
Huber et al. (2021)
Faking and the validity of personality tests: An experimental investigation using modern forced choice measures.
MFC (modern forced-choice) scales substantially reduced motivated score elevation but also appeared to elicit selective faking on work-relevant dimensions.
Despite reducing the effectiveness of impression management attempts, MFC scales did not retain more validity than Likert scales when participants faked.
However, results suggested that faking artificially bolstered the criterion-related validity of Likert scales while diminishing their construct validity.
Blackhurst et al. (2011)
Should You Hire BlazinWeedClown@Mail.Com?
conducted a study to test whether applicant email addresses are related to their owners’ job-related qualifications
Found that those with appropriate (versus inappropriate or questionable) email addresses had higher conscientiousness, professionalism, and work-related experience.
NO difference for cognitive ability
however, the distinction between questionable and appropriate addresses was not as strong
caution the hiring manager who wants to use only email addresses to screen applicants.
although there are significant differences between applicants with appropriate vs questionable or inappropriate email addresses, the effect sizes are not large.
there is a difference of roughly 10% between the high and low group means on each of the measures.
rather than using email addresses to screen applicants, the authors suggest viewing a less-than-professional email address as a yellow flag
Campion & Campion (2023)
overview of a special issue featuring shorter descriptions of practice-side work involving ML & selection
illustrative ML applications:
- scoring resumes and employment applications
- scoring constructed responses to assessments (interviews, write-in test answers)
- combining scores to increase prediction
- combining scores to reduce subgroup differences
- creating test questions
- analyzing jobs to determine requirements
- inferring skills and personality from narrative applicant information
lessons learned: alpha is not always the best reliability estimate (also look at test-retest); a model can be more reliable than the criterion yet still not fully accurate; etc.
emerging best practices
future research suggestions
McDaniel et al. (2011) and commentaries
The Uniform Guidelines are a detriment to the field of personnel selection.
UG hasn’t been updated in 30+ years – science and practice is outdated
SIOP should have a larger role in setting standards
UG’s perspective on separate ‘types’ of validity – rather than types of validity evidence
UG’s false assumptions regarding AI – the 4/5ths rule has no scientific basis and burden on employer to provide validity evidence can be very expensive for small or medium orgs