Week 2 - Score Normalisation and Reliability Flashcards
(17 cards)
‘What’ and ‘Why’ of psychological measurement
What - quantifying behaviour, attitudes, feelings to make inferences about constructs (unobservable attribute)
Why - assessment must be objective, thus tests use standardisation to avoid bias
Score normalisation
Raw score - not meaningful without a comparison point (e.g. criterion-referenced (pass mark), norm-referenced (relative to others))
Derived score - transforming raw score to find someone’s relative position to normative sample (percentiles and z-scores)
Percentiles
Percentage of people in the normative sample who fall below a particular raw score
(data points below / total values) x 100
P50 (median), P25 (first quartile), P75 (third quartile)
Advantages - easy comparison, easy to understand, universal
Limitation - inequality of units
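A minimal sketch of the percentile-rank formula above; the normative sample values are made up for illustration:

```python
def percentile_rank(scores, raw):
    # (data points below / total values) x 100
    below = sum(1 for s in scores if s < raw)
    return below / len(scores) * 100

norm_sample = [10, 12, 15, 15, 18, 20, 22, 25, 30, 35]  # hypothetical norms
print(percentile_rank(norm_sample, 20))  # 50.0 -> P50, the median
```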
Standard scores (z)
Measure of how extreme a score is relative to normative sample (in SD)
z = (X - M)/SD
Universally applicable - can be calculated for anything if M and SD are known
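A quick sketch of the z formula; the M = 100, SD = 15 norms are just illustrative values:

```python
def z_score(x, m, sd):
    # z = (X - M) / SD: distance from the normative mean in SD units
    return (x - m) / sd

print(z_score(130, 100, 15))  # 2.0 -> two SDs above the mean
```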
Transformations of z-scores
T-score (0-100) = (z x 10) + 50
Sten score (1-10) = (z x 2) + 5.5
Deviation IQ (25-175) = (z x 15) + 100
Stanine (1-9) = normal distribution divided into nine bands with fixed category percentages (4, 7, 12, 17, 20, 17, 12, 7, 4%)
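The same transformations as code, using the corrected deviation IQ formula (z x 15) + 100; the stanine conversion uses the standard round(2z + 5) rule, clipped to the 1-9 band:

```python
def t_score(z):        # T-score band 0-100
    return z * 10 + 50

def sten(z):           # sten band 1-10
    return z * 2 + 5.5

def deviation_iq(z):   # deviation IQ band 25-175
    return z * 15 + 100

def stanine(z):        # nine bands: round(2z + 5), clipped to 1-9
    return max(1, min(9, round(z * 2 + 5)))

print(t_score(1.0), sten(1.0), deviation_iq(1.0), stanine(1.0))  # 60.0 7.5 115.0 7
```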
Normalised standard scores
Scores are only comparable if they come from similarly shaped distributions.
A distribution can be normalised to force comparability
Raw score → percentile → normal-curve frequency table → normalised z-score
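One way to sketch that pipeline, substituting Python's built-in inverse normal CDF for a printed frequency table; the midpoint correction is one common convention to avoid percentiles of exactly 0 or 100:

```python
from statistics import NormalDist

def normalised_z(raw, norm_sample):
    below = sum(1 for s in norm_sample if s < raw)
    pct = (below + 0.5) / len(norm_sample)   # midpoint-corrected percentile
    return NormalDist().inv_cdf(pct)         # z under a forced normal curve

norm_sample = [10, 12, 15, 15, 18, 20, 22, 25, 30, 35]  # hypothetical norms
print(round(normalised_z(20, norm_sample), 2))
```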
Specificity of Norms
Norms (M, SD) specific to population they are derived from
Problems - WEIRD samples (Western, Educated, Industrialised, Rich, Democratic), lack of Australian norms
Test reliability
A good test is reliable (reproducible) and valid (measuring what is intended)
Reliability - rxx (correlation between scores on two administrations of test)
True score theory
People have a ‘true score’, but we can never measure it directly because of measurement error
If we administered an infinite number of tests - the mean of the distribution is the ‘true score’, the SD of the distribution is the standard error of measurement (SEm)
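A small simulation of this idea, with a hypothetical true score of 50 and SEm of 4: the mean of many administrations converges on the true score, and their SD on SEm:

```python
import random
from statistics import mean, stdev

true_score, sem = 50, 4                      # hypothetical values
observed = [true_score + random.gauss(0, sem) for _ in range(100_000)]
print(round(mean(observed), 1))              # ~50: mean recovers the true score
print(round(stdev(observed), 1))             # ~4: SD of the distribution is SEm
```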
Reliability and true scores
Individual (observed score (x) = true score (t) + error (e))
Sample (observed score variance (s2x) = true score variance (s2t) + error variance (s2e))
Reliability (rxx) = s2t/s2x
Error variance proportion (s2e/s2x) = 1 - rxx
Thus, reliability is the proportion of observed score variance that is due to true score variance
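A worked sketch with made-up variance components, showing rxx as the true-score share of observed variance:

```python
s2_true, s2_error = 80.0, 20.0      # hypothetical variance components
s2_obs = s2_true + s2_error         # s2x = s2t + s2e
rxx = s2_true / s2_obs              # reliability: true-score share of variance
error_prop = 1 - rxx                # proportion of variance that is error
print(round(rxx, 2), round(error_prop, 2))  # 0.8 0.2
```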
Test-retest
Exact same test done on two occasions
Error variance - changes in conditions and test-takers between administrations (environmental threats, time-related factors, order effects)
Alternate forms
Two versions of test administered to same people (immediate or delayed)
Error variance - differences in content, time sampling (delay)
Internal consistency
Split-half reliability
- One test administered but split into two halves
- Error variance - content sampling
- Problems - deciding on the split, timed tests, halving the test already reduces reliability (see the Spearman-Brown sketch below)
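A sketch of an odd-even split-half estimate; the final line applies the Spearman-Brown step-up formula, the standard correction for the halving problem noted above. The data layout (one row per person, one column per item) is an assumption for illustration:

```python
from statistics import correlation  # Python 3.10+

def split_half_reliability(data):
    # data: one list of item scores per person
    odd = [sum(person[0::2]) for person in data]    # odd-numbered items
    even = [sum(person[1::2]) for person in data]   # even-numbered items
    r_half = correlation(odd, even)                 # reliability of half a test
    return 2 * r_half / (1 + r_half)  # Spearman-Brown step-up to full length
```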
Cronbach’s Alpha and Kuder-Richardson
- One test administered; every item is correlated with every other item and the mean of these correlations taken (equivalent to the mean of all possible split-half reliabilities)
- Error variance - content sampling, heterogeneity of behaviour domain
CA - for items scored on a scale with more than two options
KR-20 - for dichotomous (e.g. true/false) items
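A minimal alpha sketch from the standard variance form, alpha = k/(k-1) x (1 - sum of item variances / total variance); with 0/1 items and population variances this reduces to KR-20. Data layout assumed as above:

```python
from statistics import pvariance

def cronbach_alpha(data):
    # data: one list of item scores per person
    k = len(data[0])                                       # number of items
    item_vars = sum(pvariance(col) for col in zip(*data))  # per-item variances
    total_var = pvariance([sum(person) for person in data])
    return (k / (k - 1)) * (1 - item_vars / total_var)
```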
Inter-rater reliability
Two raters give scores for an individual on a subjectively scored test
Error variance - differences between raters
Additive sources of error variance
Error variances from different reliability estimates can be added together to estimate total error variance (as long as each estimate reflects a different source of error)
E.g. delayed alternate form (time + content) + inter-rater
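The worked arithmetic for that example, with hypothetical reliability coefficients:

```python
r_delayed_alt = 0.85   # delayed alternate forms: time + content error
r_interrater = 0.90    # inter-rater: scorer error

total_error = (1 - r_delayed_alt) + (1 - r_interrater)  # 0.15 + 0.10 = 0.25
print(round(1 - total_error, 2))  # 0.75 of observed variance left as true score
```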
Sample impact on reliability
rxx is affected by the spread of individual differences in a group
rxx decreases as homogeneity increases (scores are similar, so true score variance shrinks and a larger share of observed variance is error)
How high does rxx need to be?
Should be > .8
Nunnally’s heuristic:
- 0.5 for test development
- 0.7 for test in research
- 0.9 for individual assessment