Reliability Flashcards
TEST-RETEST
We consider the consistency of the test results when the test is administered on different occasions
only applies to stable traits
Sources of difference between test and retest?
Systematic carryover - everyone's score improves by the same number of points - does not harm reliability
Random carryover - changes are not predictable from earlier scores, or something affects some but not all test takers - does harm reliability
Practice effects - skills improve with practice
e.g., taking the same midterm exam twice - you would expect to do better the second time
Time before re-administration must be carefully evaluated
Short time: carryover and practice effects
Long time: poor reliability, change in the characteristic with age, combination
Well-evaluated test: test-retest
Well-evaluated test - reports many retest correlations for different time intervals between testing sessions; events occurring between sessions must also be considered
PARALLEL FORMS
We evaluate the test across different forms of the test
The forms use different items; however, the rules used to select items of a particular difficulty level are the same
Give two different forms to the same person (same day), calculate the correlation
Reduces learning effect
CON: not always practical - hard to come up with two forms that you expect to behave identically
SPLIT HALF/Internal Consistency
Administer the whole test - split it in half and calculate the correlation between halves
If items get progressively more difficult, use an odd-even split so the halves are comparable
CON: how do you decide which halves to compare? On a midterm, for example, you would not expect all questions to behave the same way
SPLIT HALF: Spearman-Brown Correction
Allows you to estimate what the correlation between the two halves would have been if each half had been the length of the whole test:
Corrected r = 2r / (1 + r)
Corrected r = the estimated correlation between the two halves if each half had the total number of items; this increases the estimate of reliability
r = the observed correlation between the two halves of the test
Assumes the variances of the two halves are similar
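A minimal Python sketch of the correction (the function name is just illustrative):

```python
def spearman_brown(r_half):
    """Estimate full-length reliability from the correlation between two half-tests.

    Assumes the two halves have roughly equal variances.
    """
    return (2 * r_half) / (1 + r_half)

# Halves correlating .60 give an estimated full-length reliability of .75
print(spearman_brown(0.60))  # 0.75
```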
SPLIT HALF: Cronbach’s Alpha
The coefficient alpha for estimating split-half reliability when the two halves have unequal variances
Provides a lower bound for reliability
α = 2[σ²x − (σ²y1 + σ²y2)] / σ²x
α = the coefficient alpha estimate of split-half reliability
σ²x = the variance for scores on the whole test
σ²y1, σ²y2 = the variances for the two separate halves of the test
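A rough Python sketch of this split-half alpha, assuming two arrays holding each person's scores on the two halves (names are illustrative):

```python
import numpy as np

def split_half_alpha(half1, half2):
    """Split-half coefficient alpha allowing unequal variances between halves."""
    half1 = np.asarray(half1, dtype=float)
    half2 = np.asarray(half2, dtype=float)
    var_x = (half1 + half2).var(ddof=1)   # variance of whole-test scores
    var_y1 = half1.var(ddof=1)            # variance of first half
    var_y2 = half2.var(ddof=1)            # variance of second half
    return 2 * (var_x - (var_y1 + var_y2)) / var_x
```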
SPLIT HALF: KR20 formula
A reliability estimate that uses mathematics to consider all possible split halves of the test at once
KR20 = [K / (K − 1)] × [1 − Σpq / s²], where K = the number of items on the test
s² = the variance of the total test scores
p = the proportion of people getting each item correct (found separately for each item)
q = the proportion of people getting each item incorrect; for each item, q = 1 − p
Σpq = the sum of the products p × q for each item on the test
to have nonzero reliability, the variance for the total test score must be greater than the sum of the variances for the individual items.
This will happen only when the items are measuring the same trait.
The total test score variance is the sum of the item variances plus the covariances between items
The only situation that makes the sum of the item variances less than the total test score variance is when there is covariance between the items
The greater the covariance among items, the smaller the Σpq term will be relative to the total test score variance
When the items covary, they can be assumed to measure the same general trait, and the reliability of the test will be high
CON of KR20: applies only to dichotomously scored items and requires calculating p and q for every item (see KR21 and coefficient alpha below)
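A sketch of KR-20 in Python, assuming a people-by-items matrix of 0/1 scores (variance conventions differ slightly across textbooks):

```python
import numpy as np

def kr20(item_scores):
    """KR-20 reliability for dichotomously scored (0/1) items.

    item_scores: 2-D array, rows = people, columns = items.
    """
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]                           # number of items
    p = x.mean(axis=0)                       # proportion passing each item
    q = 1 - p                                # proportion failing each item
    s2 = x.sum(axis=1).var(ddof=1)           # variance of total test scores
    return (k / (k - 1)) * (1 - (p * q).sum() / s2)
```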
SPLIT HALF: KR21 Formula
A similar but different version of KR20
Does not require the calculation of p and q for every item; instead, KR21 approximates the sum of the pq products using the mean test score
Assumptions need to be met:
most important is that all the items are of equal difficulty, or that the average difficulty level is 50%.
Difficulty is defined as the percentage of test takers who pass the item. In practice, these assumptions are rarely met, and it is usually found that the KR21 formula underestimates the split-half reliability
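A sketch of the KR-21 approximation, which needs only total scores and the number of items (assuming the equal-difficulty condition above; names are illustrative):

```python
import numpy as np

def kr21(total_scores, n_items):
    """KR-21: approximates KR-20 from the mean and variance of total scores."""
    scores = np.asarray(total_scores, dtype=float)
    m = scores.mean()                        # mean total score
    s2 = scores.var(ddof=1)                  # variance of total scores
    k = n_items
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * s2))
```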
SPLIT HALF: Coefficient Alpha
Compares the variance of the individual items with the variance of the total test score
Used for tests where there is no single correct answer (e.g., Likert-scale items)
Similar to KR20, except that Σpq is replaced by Σs²i, the sum of the variances of the individual items
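A sketch of coefficient alpha for items on any scale, using the same people-by-items layout as the KR-20 sketch above:

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Coefficient alpha: the Σpq term of KR-20 is replaced by the sum of item variances."""
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]                           # number of items
    item_vars = x.var(axis=0, ddof=1)        # variance of each item
    total_var = x.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```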
FACTOR ANALYSIS
Can be used to divide the items into subgroups, each internally consistent; the subgroups of items will not be related to one another
Helps a test constructor build a test that has submeasures for several different traits
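One possible sketch of this idea using scikit-learn's FactorAnalysis, assigning each item to the factor it loads on most strongly (the helper name is illustrative; real test construction would also examine rotated loadings):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

def group_items_by_factor(item_scores, n_factors):
    """Divide items into subgroups by their strongest factor loading.

    item_scores: 2-D array, rows = people, columns = items.
    Returns {factor index: [item indices]}.
    """
    loadings = FactorAnalysis(n_components=n_factors).fit(item_scores).components_.T
    strongest = np.abs(loadings).argmax(axis=1)     # best factor for each item
    groups = {f: [] for f in range(n_factors)}
    for item, factor in enumerate(strongest):
        groups[int(factor)].append(item)
    return groups
```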
Classical Test Theory - why the field is turning away:
- Requires that exactly the same test be administered to each person
- Some items are too easy and some are too hard, so few items concentrate on a person's exact ability level
- Assumes behavioral dispositions are constant over time
Item Response Theory
Basis of computer adaptive tests
Focuses on the range of item difficulty that helps assess an individual's ability level
In a computer adaptive test, the computer selects items near the test taker's estimated ability; for example, if the person gets several easy items correct, it quickly moves to more difficult items
A more reliable estimate of ability is obtained with a shorter test containing fewer items
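A toy sketch of the adaptive idea under a one-parameter (Rasch) model; real CATs use maximum-likelihood or Bayesian ability estimation, so the fixed step size and function names here are purely illustrative:

```python
import numpy as np

def rasch_prob(ability, difficulty):
    """P(correct) under a one-parameter logistic (Rasch) model."""
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

def adaptive_test(difficulties, answer_fn, n_items=10, step=0.5):
    """Present the unused item closest to the current ability estimate,
    nudging the estimate up after a correct answer and down after an error."""
    difficulties = np.asarray(difficulties, dtype=float)
    unused = set(range(len(difficulties)))
    ability = 0.0
    for _ in range(min(n_items, len(difficulties))):
        item = min(unused, key=lambda i: abs(difficulties[i] - ability))
        unused.remove(item)
        ability += step if answer_fn(item) else -step
    return ability

# Simulated respondent with true ability 1.0 answering from a 25-item bank
rng = np.random.default_rng(0)
bank = np.linspace(-3, 3, 25)
print(adaptive_test(bank, lambda i: rng.random() < rasch_prob(1.0, bank[i])))
```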
Item Response Theory - Difficulties
1. The method requires a bank of items that have been systematically evaluated for level of difficulty
2. Considerable effort must go into test development, and complex computer software is required
Reliability of a Difference Score
When might we want a difference score? E.g., the difference between performance at two points in time, such as before and after a training program
In a difference score, E is expected to be larger than either the observed score or T because E absorbs error from both of the scores used to create the difference score.
T might be expected to be smaller than E because whatever is common to both measures is canceled out when the difference score is created
The low reliability of a difference score should concern the practicing psychologist and education researcher. Because of their poor reliabilities, difference scores cannot be depended on for interpreting patterns.
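A sketch of the classical formula for the reliability of a difference score (assuming the two measures have equal variances); it illustrates why difference scores end up far less reliable than their components:

```python
def difference_score_reliability(r11, r22, r12):
    """Reliability of (X1 - X2): r11, r22 = reliabilities of the two measures,
    r12 = correlation between them. Assumes equal variances."""
    return (0.5 * (r11 + r22) - r12) / (1 - r12)

# Two tests each with reliability .80 that correlate .60 yield a difference
# score with reliability of only .50
print(difference_score_reliability(0.80, 0.80, 0.60))  # 0.5
```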
Interrater Reliability
Kappa statistic
introduced by J. Cohen (1960) as a measure of agreement between two judges who each rate a set of objects using nominal scales.
Kappa indicates the actual agreement as a proportion of the potential agreement following correction for chance agreement.
Values of kappa may vary between 1 (perfect agreement) and -1 (less agreement than can be expected on the basis of chance alone).
Greater than .75 = excellent
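A minimal sketch of computing kappa for two raters' nominal codes (sklearn.metrics.cohen_kappa_score does the same computation):

```python
import numpy as np

def cohens_kappa(ratings_a, ratings_b):
    """Agreement between two raters as a proportion of potential agreement,
    corrected for chance agreement."""
    a, b = np.asarray(ratings_a), np.asarray(ratings_b)
    observed = np.mean(a == b)                          # actual agreement
    expected = sum(np.mean(a == c) * np.mean(b == c)    # chance agreement
                   for c in np.union1d(a, b))
    return (observed - expected) / (1 - expected)

# Equivalent: sklearn.metrics.cohen_kappa_score(ratings_a, ratings_b)
```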
Interrater Reliability - Nominal scores
-1 = less agreement than expected by chance
1 = perfect agreement
Greater than .75 = excellent
.40 to .75 = fair to good
Less than .40 = poor
Sources of Error
Time Sampling issues
e.g., a state anxiety measure given a week later may yield different scores because the state itself fluctuates over time
This source of error is typically assessed using the test-retest method
Sources of Error - Item sampling
Some items may behave strangely; error arises from the particular items chosen to represent the construct
The same construct or attribute may be assessed using a wide pool of items.
Typically, two forms of a test are created by randomly sampling from a large pool of items believed to assess a particular construct.
The correlation between the two forms is used as the estimate of this type of reliability
Sources of Error - Internal Consistency
we examine how people perform on similar subsets of items selected from the same form of the measure
intercorrelations among items within the same test
If the test is designed to measure a single construct and all items are equally good candidates to measure that attribute, then there should be a high correspondence among the items.
The extent of internal consistency error is evaluated using split-half reliability, the KR20 method, or coefficient alpha