- How well a test measures what it purports to measure
important implications regarding:
- appropriateness of inferences made and
- actions taken on the basis of measurements
- sensitivity & specificity
- always a compromise between sensitivity & specificity
- usually screen first with a highly sensitive test
- then use a highly specific test to determine who actually has the condition (e.g., dementia)
- test needs to be accurate
- stability of measurement
- measurement is stable over time & within itself
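The sensitivity/specificity trade-off above can be made concrete with a small Python sketch. The counts here are hypothetical, purely to illustrate the two proportions:

```python
# Sensitivity and specificity from screening outcomes (hypothetical counts).
def sensitivity(true_pos, false_neg):
    """Proportion of people WITH the condition that the test correctly flags."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Proportion of people WITHOUT the condition that the test correctly clears."""
    return true_neg / (true_neg + false_pos)

# Hypothetical screening results:
# 90 true positives, 10 false negatives, 700 true negatives, 200 false positives
print(sensitivity(90, 10))    # 0.9  - high: good first-pass screening test
print(specificity(700, 200))  # ~0.78 - follow up positives with a more specific test
```

A highly sensitive screen misses few true cases but lets through false positives, which the second, highly specific test then filters out.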
What are the three components of reliability?
1) inter-rater reliability - more to do with scoring than the nature of tests
2) test-retest reliability - should get the same score when doing the same test twice
3) internal consistency - within the test, people should be scoring consistently
- items should be equally good at measuring what they are trying to measure
What is test reliability?
- this is not scorer reliability
- test-retest - stability over time
- internal consistency
- homogeneous - all items test just one factor (e.g., anxiety)
- should be equally good at assessing that factor
- need to be aware of how many factors/behaviours a test is measuring
- if intend to measure one then should only measure one
What is reliability?
- the proportion of total variance (σ²) made up of true variance (σ²tr)
- variability in test scores: σ² = σ²tr + σ²e
- a test score is always made up of true score + error
- error is made up of random error & systematic error
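The variance decomposition above translates directly into a two-line Python sketch (the variance values are hypothetical):

```python
# Reliability as the proportion of total variance that is true variance:
#   r = var_true / (var_true + var_error)
def reliability(var_true, var_error):
    var_total = var_true + var_error  # total variance = true + error variance
    return var_true / var_total

# Hypothetical: true-score variance 8, error variance 2
print(reliability(8, 2))  # 0.8 - i.e., 80% of score variance is "real"
```

With zero error variance, reliability is 1; the more error variance, the lower the reliability.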
Whenever we are talking about reliability & validity, we are talking about........
correlation or correlation coefficients
- i.e., how well scores correlate across different aspects
- test-retest (looking at the correlation between first & second time test taken)
- internal consistency (looking at the correlation between different items on the test)
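Both estimates above come down to a correlation coefficient. A minimal Pearson correlation sketch, with hypothetical test and retest scores for five people:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two score lists (e.g., test and retest)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores for five people tested twice
first  = [10, 12, 9, 15, 11]
second = [11, 13, 9, 14, 12]
print(round(pearson_r(first, second), 2))  # 0.93 - high test-retest reliability
```

The closer r is to 1, the more stable the rank ordering of scores across the two administrations.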
What are some sources of error variance?
- Test-Taker Variables
- Examiner-Related Variables
- Test Scoring/Interpretation
each can contain both random & systematic error
What is the difference between systematic & random error variance?
Systematic - a constant or proportionate source of error from variables other than the target variable
- shifts everyone equally, so should not affect the variance of scores
Random - caused by unpredictable fluctuations & inconsistencies in variables other than the target variable
- does inflate the variance of scores
Systematic changes should not affect the correlation; unpredictable changes will lower the correlation; the more robust the test is to fluctuation, the greater the reliability.
How does error occur in test construction?
- through the way you select or sample test items
- ideally, all items consistently perform in the same way (the way you intended them to)
- systematic error - could come from an ambiguous question - some people may consistently respond one way and others another
- random error - may have one or two questions where someone does not have enough experience to give the standard response to the item
How can error occur during test administration?
How do testtakers contribute to error?
- during test administration
- differences between people taking the tests
- systematic - different ages & not taking age into account
- random - age, personality, etc.
- don't necessarily want to minimise this by only testing 10-year-olds, because then the test is only relevant to 10-year-olds
- instead test 10-, 11-, 12-year-olds etc., then create norms for each age (age norms) - this takes care of the variable by having different normative data for different ages
How does the test environment contribute to error?
- during test administration
- one person may be tested in a noisy environment, another in a quiet one
- testing in a group or individually
affects test scores
How can examiners contribute to error?
- during test administration
- examiner humanness - may be exhausted by the last test of the day - may skip bits to hurry it up
How can test scoring/interpretation contribute to error?
- subjectively scored tests have greater error (because they rely on subjective judgements)
- moving toward computer-based scoring to remove this source of error
- cannot use computer-based scoring when it is the quality of the response being judged (qualitative)
- much more error in qualitative than quantitative scoring
What should we aim for with regard to error & reliability?
aim to remove systematic error and minimise random error so we get better reliability
What are some reliability estimates?
- test-retest
- parallel forms/alternate forms
- internal consistency (split-half, Cronbach's alpha, Kuder-Richardson)
What is a test-retest reliability estimate?
same test taken twice - then see how well the scores are correlated
issue: how long an interval between testings?
- the shorter the interval, the higher the test-retest reliability, because many things can change in an individual over time
- systematic changes should not affect test-retest reliability e.g., hot room, cold room (everyone affected equally)
- random changes will affect the correlation (test-retest reliability)
- the more robust the test is to fluctuation, the more reliable it is
e.g., a test that is not affected by time of day, or amount of sleep etc. - robust enough to wash those effects out
- participant factors will affect test-retest reliability - experience, practice, fatigue, memory, motivation, morningness/eveningness
- as everyone differs in these areas = greater error variance
- practice effects - the first sitting gives a clue about what is going to happen next time the same test is done - this may mean that we cannot use test-retest
When would we use Parallel or Alternate forms of a test?
- when we cannot use test-retest reliability
- due to e.g., practice effects giving the testtaker a clue about what will be on the test next time
What is a parallel forms or alternate forms reliability estimate?
parallel vs. alternate
- parallel forms - are better developed
- items have been selected so that the means & variances have been shown to be equal
- alternate forms - similar but no guarantee that variance is the same (hence have introduced a source of error)
- the testing process is similar to test-retest - do one test, then do the parallel or alternate form
What is one of the biggest problems faced when using a parallel form or alternate form of a test?
test sampling issues - the choice of items
- the best items are usually used when creating the initial version of the test
(unless both forms are created at the same time)
- identifying the source of error:
- is it because the trait is not stable over time, or because the different items (content) of the two forms introduce error?
- stability over time? (external)
- internal consistency across the two forms? (internal)
Internal Consistency (Reliability)
- Split-Half testing
Split into two halves
Obtain correlation coefficient
What is the point of Split-Half testing?
To obtain the internal consistency of the full version - via the Spearman-Brown formula
- which estimates the internal consistency of a test twice the length of each half
When is the Spearman-Brown formula used?
- To obtain internal consistency of full version - of split-half tests
- Estimates internal consistency of a test that is twice the length
- not used when a test measures more than one factor (heterogeneity)
- not appropriate for speed tests
- must have homogeneity when using the split-half method, because otherwise you could end up with an imbalanced distribution of the factors across the two halves
Spearman-Brown Split-Half Coefficient
rSB = 2rhh / (1 + rhh)
e.g., with rhh = 0.9:
rSB = 2 × 0.9 / (1 + 0.9)
rSB = 1.8 / 1.9
rSB ≈ 0.947
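The worked example above can be checked with a one-line Python function:

```python
def spearman_brown(r_hh):
    """Step a split-half correlation up to the reliability of the full-length test.
    rSB = 2 * rhh / (1 + rhh)
    """
    return (2 * r_hh) / (1 + r_hh)

print(round(spearman_brown(0.9), 3))  # 0.947, matching the worked example
```

Note the step-up: the full test (0.947) is more reliable than either half alone (0.9), because longer tests average out more random error.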
When would we use Cronbach's Alpha?
- when items are graded on a scale rather than scored right/wrong
- it estimates internal consistency for every possible split-half
- a generalised reliability coefficient for scoring systems that are graded item by item (it sums all of the individual item variances)
- used when items are graded (not used with dichotomous items - use Kuder-Richardson instead)
- essentially an estimate of ALL possible test-retest or split-half coefficients
- α can range between 0 and 1 (ideally closer to 1)
- cannot measure multiple traits - the test must be homogeneous
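One common form of Cronbach's α is α = (k/(k−1)) × (1 − Σσᵢ²/σₜ²), where k is the number of items, σᵢ² the variance of each item, and σₜ² the variance of total scores. A minimal sketch with hypothetical Likert-style data:

```python
from statistics import pvariance

def cronbach_alpha(items):
    """Cronbach's alpha. `items` is a list of per-item score lists,
    one inner list per item, one position per respondent."""
    k = len(items)
    item_vars = sum(pvariance(item) for item in items)   # sum of item variances
    totals = [sum(scores) for scores in zip(*items)]      # each person's total score
    return (k / (k - 1)) * (1 - item_vars / pvariance(totals))

# Hypothetical 3-item questionnaire, 4 respondents, each item scored 1-5
items = [
    [4, 3, 5, 2],  # item 1 scores across the 4 people
    [4, 2, 5, 3],  # item 2
    [5, 3, 4, 2],  # item 3
]
print(round(cronbach_alpha(items), 2))  # 0.89 - items hang together well
```

When every item tracks the same underlying trait, people who score high on one item score high on the others, the total-score variance grows, and α approaches 1.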
When would we use Kuder-Richardson?
- when test items are dichotomous (e.g., true/false, right/wrong)
- like Cronbach's alpha, it estimates every possible split-half (or test-retest) correlation
- mainly used as a split-half estimate
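For dichotomous items, the usual Kuder-Richardson formula (KR-20) is (k/(k−1)) × (1 − Σpq/σₜ²), where p is the proportion answering each item correctly and q = 1 − p. A sketch with hypothetical 0/1 data:

```python
from statistics import pvariance

def kr20(items):
    """KR-20 reliability. `items` is a list of per-item 0/1 score lists,
    one inner list per item, one position per testtaker."""
    k = len(items)
    n = len(items[0])
    # sum of p*q across items, where p = proportion correct on that item
    pq = sum((sum(it) / n) * (1 - sum(it) / n) for it in items)
    totals = [sum(scores) for scores in zip(*items)]  # each person's total score
    return (k / (k - 1)) * (1 - pq / pvariance(totals))

# Hypothetical 4-item true/false test, 5 testtakers (1 = correct)
items = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
]
print(round(kr20(items), 2))  # 0.88
```

KR-20 is what Cronbach's α reduces to when every item is scored 0 or 1, since p × q is exactly the variance of a dichotomous item.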
What is acceptable range of reliability?
Clinical – r > 0.85 acceptable
Research – r > ~0.7 acceptable
Reliabilities of Major Psychological Tests
- WAIS – r = 0.87
- MMPI – r = 0.84
- WAIS – r = 0.82
- MMPI – r = 0.74
Summary of reliability