Module 2: Norms and Reliability Flashcards
What is Classical Test Theory (CTT)?
CTT is a model for understanding measurement
CTT is based on the True Score Model…
… for each person, their observed score on a test is the sum of two components: - Observed score (X) = True Score (T) + Error (E)
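A minimal sketch of the model in code, using entirely hypothetical simulated numbers (the true scores and errors below are assumptions for illustration, not values from the course material):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical true ability levels (T) and random measurement error (E)
true_scores = rng.normal(loc=100, scale=15, size=1000)
errors = rng.normal(loc=0, scale=5, size=1000)

# Observed scores under the True Score Model: X = T + E
observed = true_scores + errors
```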
What is a true score?
True score is a person’s actual true ability level (i.e. measured without error).
What is error?
Error is the component of an observed score that is unrelated to the test taker's true ability or the trait being measured.
True variance and error variance thus refer to the true and error components of the variability in a collection/population of test scores.
What is reliability?
Reliability refers to consistency in measurement.
- According to CTT: reliability is the proportion of the total variance attributable to true variance
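Written out (a standard statement of the CTT definition above, using $\sigma^2$ for variance):

$$ r_{XX} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E} $$

where $\sigma^2_T$ is true variance, $\sigma^2_E$ is error variance, and $\sigma^2_X = \sigma^2_T + \sigma^2_E$ is the total observed-score variance.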
What is test administration error?
Test administration: variation due to the testing environment
- Testtaker variables (e.g., arousal, stress, physical discomfort, lack of sleep, drugs, medication)
- Examiner variables (e.g., physical appearance, demeanour)
What is test scoring and interpretation error?
Test scoring and interpretation:
Variation due to differences in scoring and interpretation
What are methodological errors?
Variation due to poor training, unstandardized administration, unclear questions, or biased questions.
CTT True Score Model vs. Alternative
- The True Score Model of measurement (based on CTT) is simple, intuitive, and thus widely used
- Another widely used model of measurement is Item Response Theory (IRT)
- CTT's assumptions are more readily met than IRT's, and it assumes only two components to measurement
- But, CTT assumes all items on a test have an equal ability to measure the underlying construct of interest.
Item Response Theory (IRT)
- IRT provides a way to model the probability that a person with X ability level will correctly answer a question that is ‘tuned’ to that ability level.
What does IRT incorporate and consider?
- IRT incorporates considerations of item difficulty and discrimination
o Difficulty relates to an item not being easily accomplished, solved, or comprehended.
o Discrimination refers to the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or construct being measured.
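The notes don't name a specific IRT model; a common choice is the two-parameter logistic (2PL) model, sketched below, where difficulty shifts the response curve and discrimination controls its steepness (the function name and numbers are illustrative assumptions):

```python
import numpy as np

def p_correct_2pl(theta, difficulty, discrimination):
    """2PL item response function: probability that a person with
    ability `theta` answers an item with the given parameters correctly."""
    return 1.0 / (1.0 + np.exp(-discrimination * (theta - difficulty)))

# A more discriminating item separates people near its difficulty level more sharply
print(p_correct_2pl(theta=0.5, difficulty=0.0, discrimination=1.0))  # ~0.62
print(p_correct_2pl(theta=0.5, difficulty=0.0, discrimination=2.5))  # ~0.78
```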
Reliability estimates
Because a person’s true score is unknown, we use different mathematical methods to estimate the reliability of tests.
Common examples include:
- Test-retest reliability
- Parallel and alternate forms reliability
- Internal consistency reliability
o E.g., split-half, inter-item correlation, Cronbach’s alpha
- Interrater/interscorer reliability
Test-retest reliability
Test-retest reliability is an estimate of reliability over time
- Obtained by correlating pairs of scores from the same people on administrations of the same test at different times
- Appropriate for stable variables (e.g., personality)
- Estimates tend to decrease as time passes
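A minimal sketch of the calculation, using made-up scores for six people tested twice:

```python
import numpy as np

# Hypothetical scores for the same six people on the same test at two times
time_1 = np.array([12, 15, 9, 20, 14, 17])
time_2 = np.array([13, 14, 10, 19, 15, 18])

# Test-retest reliability = correlation between the two administrations
r_test_retest = np.corrcoef(time_1, time_2)[0, 1]
print(round(r_test_retest, 2))
```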
Parallel and Alternate Forms Reliability
- Parallel forms: two versions of a test are parallel if, in both versions, the means and variances of test scores are equal
- Alternate forms: there is an attempt to create two forms of a test, but they do not meet the strict requirements of parallel forms
- Obtained by correlating the scores of the same people measured with the different forms.
Split half reliability
Obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.
Entails three steps:
- Step 1: Divide the test into two halves
- Step 2: Correlate scores on the two halves of the test.
- Step 3: Generalise the half-test reliability to the full-test reliability using the Spearman-Brown formula.
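A minimal sketch of the three steps, using a hypothetical matrix of item scores and an odd-even split (both the data and the choice of split are assumptions for illustration):

```python
import numpy as np

# Hypothetical item scores: rows = test takers, columns = items
items = np.array([
    [1, 0, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 1, 0, 1, 0, 1],
])

# Step 1: divide the test into two halves (here, odd- vs. even-numbered items)
half_a = items[:, 0::2].sum(axis=1)
half_b = items[:, 1::2].sum(axis=1)

# Step 2: correlate scores on the two halves
r_half = np.corrcoef(half_a, half_b)[0, 1]

# Step 3: Spearman-Brown formula to generalise to full-test reliability
r_full = (2 * r_half) / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))
```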
Inter-item consistency/correlation
The degree of relatedness of items on a test; used to gauge the homogeneity of a test.
Kuder-Richardson formula 20
The statistic of choice for determining the inter-item consistency of dichotomous items.
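The formula itself is not given in the notes; the standard form is:

$$ KR_{20} = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i q_i}{\sigma^2_X}\right) $$

where $k$ is the number of items, $p_i$ is the proportion of test takers passing item $i$, $q_i = 1 - p_i$, and $\sigma^2_X$ is the variance of total test scores.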
Coefficient alpha
The mean of all possible split-half correlations, corrected by the Spearman-Brown formula. The most popular approach for estimating internal consistency. Values range from 0 to 1.
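A minimal computational sketch of coefficient alpha from a (people × items) score matrix; the function name and example data are hypothetical, and the calculation follows the standard item-variance formula rather than anything specific to these notes:

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Coefficient (Cronbach's) alpha:
    k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Example with hypothetical item scores (rows = people, columns = items)
print(round(cronbach_alpha([[1, 0, 1, 1], [1, 1, 1, 1], [0, 0, 1, 0], [1, 1, 0, 1]]), 2))
```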
Interrater/InterScorer Reliability
The degree of agreement/consistency between two or more scorers (or judges or raters).
- Often used with behavioural measures
- Guards against biases or idiosyncrasies in scoring
- Obtained by correlating scores from different raters:
o Use intraclass correlation for continuous measures
o Use Cohen’s Kappa for categorical measures
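A minimal sketch for the categorical case: Cohen's kappa compares observed rater agreement with the agreement expected by chance (the ratings below are made up for illustration):

```python
import numpy as np

def cohens_kappa(rater_1, rater_2):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    rater_1, rater_2 = np.asarray(rater_1), np.asarray(rater_2)
    categories = np.union1d(rater_1, rater_2)
    observed = np.mean(rater_1 == rater_2)
    expected = sum(np.mean(rater_1 == c) * np.mean(rater_2 == c) for c in categories)
    return (observed - expected) / (1 - expected)

# Hypothetical categorical ratings from two scorers for six cases
print(round(cohens_kappa([1, 2, 2, 1, 3, 2], [1, 2, 1, 1, 3, 2]), 2))
```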
Choosing Reliability Estimates
The nature of the test will often determine the appropriate reliability estimate, e.g.:
- Are the test items homogeneous or heterogeneous in nature?
- Is the characteristic, ability, or trait being measured presumed to be dynamic or static?
- Is the range of test scores restricted or not?
- Is the test a speed test (how many items can you complete in a set time) or a power test (items of increasing difficulty)?
- Is the test criterion-referenced (to pass, you need to reach a threshold) or not?
Otherwise, you can select whatever you think is appropriate.
How do we account for reliability in a single score?
- Our reliability coefficient tells us about error in our test in general
- We can use this reliability estimate to understand how confident we can be in a single observed score for one person.
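The notes don't give the formula here, but the quantity usually used for this is the standard error of measurement (SEM), which turns the test's standard deviation and reliability into a margin of error around one observed score:

$$ SEM = \sigma_X \sqrt{1 - r_{XX}} $$

A rough 95% confidence band for a single observed score is then $X \pm 1.96 \times SEM$.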
Standard Error of the Difference (SED)
The SED is a measure of how large a difference between two test scores would need to be in order to be considered ‘statistically significant’.
Helps with three questions (Note: test 1&2 must be on the same scale)
- How did Person A’s performance on test 1 compare with own performance on test 2?
- How did Person A’s performance on test 1 compare with Person B’s performance on test 1?
- How did Person A’s performance on test 1 compare with Person B’s performance on test 2?
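The notes don't state the formula; the usual form combines the standard errors of measurement of the two tests (assuming both are on the same scale with standard deviation $\sigma$):

$$ SED = \sqrt{SEM_1^2 + SEM_2^2} = \sigma\sqrt{2 - r_{11} - r_{22}} $$

where $r_{11}$ and $r_{22}$ are the reliability coefficients of the two tests; a difference of roughly $2 \times SED$ or more is conventionally treated as significant.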
Standardization
is the process of administering tests to representative samples to establish norms.
Sampling
the selection of an intended population for the test that has at least one common, observable characteristic.