Psych Testing Flashcards
What is validity?
- Test Standards Definition: “Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests”
- Test standards = current framework/operational guidelines
- Validation is the joint responsibility of the test developer and the test user
- The test developer should present a rationale for recommended use and interpretation, accompanied by evidence and theory.
What are some past definitions of validity?
- 1954 Criterion-based view: A test is valid for anything it correlates with. Validity was treated as a static property of the test.
- Problems: The same test is often used for different purposes and in different groups, so a single criterion correlation cannot cover them all.
- 1966 Tripartite view: Criterion validity (concurrent and predictive), content validity (relevant and representative of the domain) and construct validity (convergent/discriminant)
- Problems: Construct validity rests on a nomological network, so when predictions fail it is unclear whether the theory or the test is wrong. Overemphasis on separate forms of validity (often indistinct) and on correlations as proof.
- Updated in 1985 to include consequences of testing.
What is the current 5 source view of validity?
- Unchanged since 1999: A unitary form of validity based on evidence from multiple sources to support an argument about what test scores mean.
- No different types of validity; validity is a property of the interpretation, not the test
- Evidence from 5 sources:
- Content: Relevance and representativeness of content
- Response Processes: If the test is intended to measure a process, this should be demonstrable, i.e. not affected by irrelevant manipulations
- Internal Structure: Factor analysis should match theory
- Relationship to other variables: Convergent/discriminant evidence, test-criterion relationships, and generalisation across situations (populations, conditions) and purposes (types of jobs)
- Consequences of testing: Consider intended and unintended consequences of testing (e.g. NAPLAN: funding vs competition)
What are some of the major purposes of psychological testing?
- Classification: Selection (education and employment), Screening, Certification, Placement
- Diagnosis/Treatment planning: Clinical, Educational (giftedness/learning difficulties), neuropsychological deficits
- Coaching/training: Insight (self-knowledge), career-counselling, coaching
- Legal Application: Diminished responsibility, special dispensations, compensation claims
- Research
- Program Evaluation
What are some of the main tests used to measure intelligence and aptitude?
- Aptitude tests differ somewhat from intelligence tests, since they relate to trainable outcomes
- Individually administered tests: Primarily for children and diagnosis, with an emphasis on rapport
- Stanford-Binet: Verbal and non-verbal factors across 5 areas.
- Wechsler scales: 3 tests for different age ranges; subscales and tests vary.
- Woodcock-Johnson: Achievement and intelligence test batteries.
- Group administered tests:
- ASVAB: Armed Services Vocational Aptitude Battery
- GAMSAT: Graduate Medical School Admissions Test (Australia)
- Raven’s Progressive Matrices
What is Holland’s vocational interest model?
- Holland’s model of vocational interests covers 6 domains
- Realistic: practical, hands-on, tool-oriented
- Investigative: analytical, intellectual, scientific, explorative
- Artistic: creative, independent, chaotic
- Social: cooperative, supporting, helping, healing
- Enterprising: competitive, leadership, persuading
- Conventional: detail-oriented, organising, clerical.
- These domains are arranged in a hexagon, ordered by the correlations between them (the least correlated domains sit opposite each other)
What are some applications of psychological tests?
- Neuropsychology: Checklists for frontal lobe dysfunction
- Luria-Nebraska Neuropsychological Battery (attention, language, memory, spatial, executive function), the Mini-Mental State Exam (MMSE)
- Health Psychology: McGill Pain Questionnaire, Beck Depression Inventory
- Alcoholism: TWEAK (tolerance, worry, eye-opener, amnesia, cut-down)
- Forensic Assessment: malingering, assessment for insanity plea, child custody
How can psychometric tests be used for selection and training in the workplace?
- Selection: Important to match selection criteria with job requirements. Steps:
- Job analysis: What tasks are required?
- Write job description: What qualities does the person need?
- Test the candidate pool
- Select best candidate
- Score feedback for training: Focus on the profile, not raw scores, and on developmental planning
- Compensatory strategies: reshape the problem, externalise
- Developmental activities: 1. deliberate practice, 2. training, 3. mentoring, 4. goal setting, 5. plan, monitor, evaluate.
What four factors affect reliability?
- People taking the test: Reliability is based on variability between people: a large SD = stronger reliability
- Match the person to the test to avoid floor/ceiling effects
- Test Characteristics: Bandwidth vs fidelity (a more specific test = higher reliability)
- Don’t sacrifice content coverage for reliability
- Item Characteristics: Internal consistency is affected by the number of items and the correlations between them.
- A reliable test either has many items with small inter-item correlations, or few items with strong ones.
- Method used to estimate reliability: test-retest vs internal consistency etc. Consider the appropriateness of the method (e.g. whether the construct is expected to change over time).
What is reliability?
- Reliability = the ratio of true score variance to observed score variance (true score plus error). Aim for:
- .9 for high stakes, .7 for research, .6 if multiple measures are used
- Validity is dependent on reliability: the maximum correlation between 2 measures is capped by the error in each test, as sketched below.
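
A minimal sketch of that ceiling under classical test theory (the function name is illustrative):

```python
import math

def max_observable_r(rel_x, rel_y):
    # Classical test theory bound: the observed correlation cannot exceed
    # the geometric mean of the two tests' reliabilities.
    return math.sqrt(rel_x * rel_y)

# Tests with reliabilities .7 and .6 can show an observed correlation of
# at most ~.65, even if the underlying constructs correlate perfectly:
print(round(max_observable_r(0.7, 0.6), 2))  # 0.65
```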
How does reliability relate to test length?
- Reliability increases as the number of items increases:
- Spearman-Brown formula: predicted reliability is a function of test length and existing reliability (assuming equally reliable, parallel items).
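
A minimal sketch of the Spearman-Brown prophecy formula (the function name is illustrative):

```python
def spearman_brown(reliability, k):
    # Predicted reliability when test length is multiplied by k,
    # assuming the added items are parallel to the existing ones.
    return (k * reliability) / (1 + (k - 1) * reliability)

# Doubling a test with reliability .70 predicts ~.82; halving it, ~.54:
print(round(spearman_brown(0.70, 2), 2))    # 0.82
print(round(spearman_brown(0.70, 0.5), 2))  # 0.54
```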
- Balancing test length:
- Too long a test causes boredom, exhaustion, loss of motivation
- Problems with short tests: previous item exposure, inadequate domain sampling.
- Solution: Adaptive testing and Computerised adaptive testing
What is adaptive testing? What are the advantages and disadvantages?
- Testing is adapted to the person’s level of ability: previous responses determine the next questions.
- Used in major batteries like the Stanford-Binet
- Computerised adaptive testing (CAT): A computer algorithm selects further items according to a rule (see the sketch below).
- Used in large-scale testing where security is important (ASVAB, TOEFL)
- CAT Advantages: Tests are shorter but just as reliable:
- economic advantage, fewer problems with motivation, easier to maintain test security
- CAT Disadvantages: Substantial preparation and outlay needed (a very large item pool, difficulty analysis, selection algorithms); requires computers.
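
A minimal illustration of the adaptive idea, using a simple staircase rule (real CAT systems such as the ASVAB use IRT-based ability estimation; the item bank, answer callback and selection rule here are illustrative assumptions):

```python
def run_cat(item_bank, answer, n_items=10):
    """item_bank: list of (item_id, difficulty) pairs;
    answer(item_id) -> True if the response was correct."""
    theta = 0.0  # current ability estimate
    step = 1.0   # step size, halved after every response
    used = set()
    for _ in range(n_items):
        # Rule: administer the unused item whose difficulty is
        # closest to the current ability estimate
        item_id, _ = min(
            (pair for pair in item_bank if pair[0] not in used),
            key=lambda pair: abs(pair[1] - theta),
        )
        used.add(item_id)
        # Move the estimate up after a correct answer, down after an error
        theta += step if answer(item_id) else -step
        step /= 2
    return theta
```

Because each item is pitched near the respondent’s current estimate, few items are wasted on questions that are far too easy or too hard, which is why CATs can be shorter at the same reliability.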
What are anchoring vignettes?
- Problems with self-rating scales: There are significant individual and cultural variations in responding style
- extreme vs conservative responders, and a tendency to ‘agree’ with statements
- Anchoring vignettes: Vignettes describing hypothetical people are given to the respondent to rate.
- The average vignette rating is then subtracted from the self-rating
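
A minimal sketch of that subtraction-based adjustment (the ratings and function name are illustrative):

```python
def adjusted_score(self_rating, vignette_ratings):
    # Correct the self-rating by the respondent's own average rating
    # of the shared hypothetical vignettes
    anchor = sum(vignette_ratings) / len(vignette_ratings)
    return self_rating - anchor

# An extreme responder and a conservative responder who describe the same
# underlying level become comparable after anchoring:
print(adjusted_score(6, [5, 6, 7]))  # 0.0
print(adjusted_score(4, [3, 4, 5]))  # 0.0
```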
- Examples: Significant cross-cultural discrepancies have been resolved using anchoring vignettes, such as the:
- relationship between teacher helpfulness and achievement
- relationship between conscientiousness and life expectancy
What are situational judgement tests?
- SJTs present situations and require the respondent to choose the best response option
- can assess typical or maximal performance (what you would vs should do)
- Seen as more engaging (higher face validity)
- Show lower adverse impact than IQ tests
- Development of SJTs: Use subject matter experts (SMEs)
- Collect critical situations from SMEs, and summarise these into items
- Collect responses from everyday respondents and SMEs
- Score answers based on SME opinions (see the sketch below)
- Pilot-test items and select the most reliable ones
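
A minimal sketch of one common SME-consensus scoring scheme (the proportional-credit rule and option labels are illustrative assumptions):

```python
from collections import Counter

def score_response(response, sme_answers):
    # Credit an answer in proportion to how many SMEs endorsed it
    counts = Counter(sme_answers)
    return counts[response] / len(sme_answers)

sme_answers = ["B", "B", "B", "A", "C"]  # five SMEs pick the best option
print(score_response("B", sme_answers))  # 0.6 - the majority SME choice
print(score_response("D", sme_answers))  # 0.0 - no SME endorsement
```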
What are the different motivations and situations that influence response distortion?
- High-stakes situations are prone to faking:
- Faking Good: Employment selection, internet dating and educational selection. NEO-PI-R example: “I strive for excellence”
- Faking Bad: Legal (benefits/diminished responsibility), education (special consideration), military (discharge, special duties, conscription)
- Faking is estimated in 30% of personal injury cases; in WWII both sides dropped instructions on faking to the opposing military
- Types of faking: Conscious and unconscious biases
- Self-deceptive enhancement: Linked to egoistic bias. Values agency (strong, competent); exaggeration of status (social, physical etc)
- Self-deceptive denial: Linked to moralistic bias. Values communion (good, kind); denial of socially deviant impulses/behaviours.
- Impression management: A conscious bias
What are some methods for detecting faking?
- Lie Scales: Paulhus’s Balanced Inventory of Desirable Responding (BIDR). Ask about socially aversive but universal tendencies, e.g. “I’ve never wanted to swear”
- Problem: May be measuring real personality. Neuroticism, conscientiousness and agreeableness all correlate strongly with lie scales.
- Response time rubrics: Longer response time = faking
- But for some people faking can actually be quicker
- Over-claiming technique: Paulhus. Respondents rate their familiarity with concepts, some of which don’t exist (foils). Compare confidence on real terms to foils (see the sketch below).
- Works well but is limited in the concepts you can test
- Bayesian truth serum: For each item, respondents estimate the proportion of people who would give the same answer. Because of the false-consensus effect, honest answerers over-estimate the number of others who share their belief
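
A minimal sketch of an over-claiming index in a signal-detection style (the hit-rate minus false-alarm-rate scoring and all terms/foils here are illustrative assumptions, not Paulhus’s published scoring):

```python
def overclaiming_index(claimed, real_terms, foils):
    # Hits on real terms vs false alarms on non-existent foils
    hit_rate = len(claimed & real_terms) / len(real_terms)
    false_alarm_rate = len(claimed & foils) / len(foils)
    return hit_rate - false_alarm_rate, false_alarm_rate

real_terms = {"cognitive dissonance", "halo effect", "priming"}
foils = {"retroactive placebo", "semantic backtracking"}  # do not exist
claimed = {"cognitive dissonance", "halo effect", "retroactive placebo"}

accuracy, bias = overclaiming_index(claimed, real_terms, foils)
print(round(accuracy, 2))  # 0.17 - knowledge corrected for over-claiming
print(bias)                # 0.5  - claiming foils signals over-claiming
```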
What are some methods for reducing faking?
- Warnings: The best warnings are consequence-based, but they can also be based on detection, reasoning (best interest), education (validity of the test) or morals.
- Forced Choice: Test takers are forced to choose between 2 equally desirable alternatives (“which is more like you?”).
- Results are only relative (ipsative): actual levels cannot be compared between people, as the sketch below shows
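
A minimal sketch of why forced-choice scores are ipsative (the traits and choices are illustrative):

```python
from collections import Counter

def ipsative_scores(choices):
    # choices: the trait picked from each forced pair of statements
    return Counter(choices)

a = ipsative_scores(["extraversion", "extraversion", "agreeableness"])
b = ipsative_scores(["conscientiousness", "extraversion", "agreeableness"])
print(a, sum(a.values()))  # both totals are 3 (the number of pairs),
print(b, sum(b.values()))  # so absolute levels can't be compared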
- Verifiable statements: People are less likely to fake information that is easily verified. “I work more than needed” vs “How many hours of overtime did you work?”
- Other reports: Referees or friends show lower levels of faking (but it is still present)
- Implicit measurement techniques: e.g. the Implicit Association Test
What are three paradigms in faking research? What has been found about faking levels?
- Group Comparison: Compare job applicants to other samples
- Measures the lower limit of faking (not everyone will fake)
- There may be real group differences
- Findings: changes in OCEAN levels found, largest variation for N and C
- Instructed faking: Compare scores under “answer honestly” vs “maximise your scores” instructions. Instruction type varies (e.g. “imagine you’re applying to X”), which can affect results
- Honest answers still contain self-deceptive biases.
- Findings: Huge changes found in OCEAN, particularly for N and C
- Incentive manipulation: Compare no-stakes conditions to reward conditions, e.g. top 10% get $10.
- But it is hard to mimic real-life reward levels.
- Conclusion: people fake but not maximally
What are the recommendations for dealing with faking in high stakes situations?
- Social desirability scales should not be used as indicators of faking:
- can indicate real personality factors and exclude good candidates
- If faking is detected, re-test or interpret with caution:
- risk of false positives
- Try to minimise rather than detect faking
- Use personality to screen out the lowest scorers rather than screen in the best
- Neutralise evaluative content of items
What are the reasons for measuring job performance?
- Decision making about individuals: high performance (promotions, bonuses, probational periods) as well as low performance (retention, termination, layoffs)
- Organisational Planning: Benchmarking performance, identifying developmental needs, assisting in goal identification
- Legal requirements for the profession: legal requirements for certain levels of performance (eg doctors), legal defence of hiring/firing decisions
- Feedback: individual, team and organisational
- Evaluation of procedures or changes: did selection processes work? did training work? other changes
What are some examples of subjective measurements of job performance?
- Subjective measures: Rating scales filled out by the employee or a supervisor
- Graphic rating scales: mark a point along a printed scale
- Behaviourally anchored rating scales (BARS): developed for a specific job dimension within a specific job. Each scale point lists example behaviours. Can the employee do X?
- Behavioural observation scale: Developed for specific job, have you observed the worker perform these behaviours?
- Checklists: list of behaviours, tick the ones that are observed
What are some objective measures of job performance? What are some problems with them?
- Objective measures of job performance:
- Production Counts: eg number of bricks laid
- Biodata: eg absenteeism
- Problems with objective data:
- Production counts sometimes not possible (eg a nanny)
- Doesn’t always take quality into account
- Production depends on situational variables as well as the worker (eg # customers served)
What are some issues with rating measures of job performance?
- Correlation between raters: Meta-analyses have shown variation
- Harris and Schaubroeck: self/peer = .36, self/supervisor = .35, but peer/supervisor = .62 (reasonable)
- Conway and Huffcutt: Both reliability and agreement are higher for low-complexity, non-managerial jobs.
- Reliability is highest for supervisors (lowest for subordinates)
- Correlations between sources are lower than Harris and Schaubroeck’s but show the same pattern
- Sources of error in rating scales:
- Social desirability (faking)
- Leniency/severity errors (response styles: personal thresholds for high/low ratings)
- “Halo” or “horns” effect: impression based on one quality
- Recency effects
- Causal attribution errors: effort vs ability attributions, actor/observer bias
- Personal bias (pregnancy, race, age)
What is the difference between task and contextual performance?
- Task Performance: Activities that contribute to an organisation’s technical core
- tasks required by the formal job role
- Lower correlations with personality
- Contextual performance: Activities that contribute to the social and psychological core of the organisation
- tasks are discretionary and not explicitly stated
- Higher correlations with personality