Test Construction Flashcards
true score variability
variability due to real differences in ability or knowledge in the test-takers
error variability
variability caused by chance or random factors
classical test theory
observed test score = true score + error; thus total test score variability = true score variability + error variability
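A minimal Python sketch of this decomposition, using made-up score distributions (the means, SDs, and sample size are illustrative assumptions, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

# Classical test theory: observed score X = true score T + random error E
true_scores = rng.normal(loc=100, scale=15, size=1000)  # true ability (T)
error = rng.normal(loc=0, scale=5, size=1000)           # random error (E)
observed = true_scores + error                          # observed scores (X)

# Reliability = true score variance / total observed variance
reliability = true_scores.var() / observed.var()
print(f"estimated reliability: {reliability:.2f}")      # ~ 15^2 / (15^2 + 5^2) = 0.90
```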
reliability
the amount of consistency, repeatability, and dependability in scores obtained on a test
reliability coefficient
- represented as ‘r’
- ranges from 0.00 to 1.00
- 0.80 is generally the minimum acceptable value
- two factors that affect its size: the range of test scores and the homogeneity of test content
sources of errors in tests
- content sampling
- time sampling (ex - forgetting over time)
- test heterogeneity
factors that affect reliability
- number of items (the more the better)
- homogeneity (the more similar the items are, the better)
- range of scores (the greater the range, the better)
- susceptibility to guessing (true/false items are the easiest to guess, making them the least reliable)
test-retest reliability
(or coefficient of stability)
- correlating pairs of scores from the same sample of people who are administered the identical test at two points in time
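A minimal sketch of computing a coefficient of stability in Python; the score pairs are hypothetical:

```python
import numpy as np

# Hypothetical scores from the same 8 examinees tested at two points in time
time1 = np.array([82, 75, 91, 68, 88, 79, 85, 72])
time2 = np.array([80, 78, 93, 70, 85, 81, 84, 75])

# Coefficient of stability = Pearson r between the two administrations
r = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest reliability: {r:.2f}")
```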
parallel forms reliability
(or coefficient of equivalence)
- correlating the scores obtained by the same group of people on two roughly equivalent but not identical forms of the same test administered at two different points in time
internal consistency reliability
- looks at the consistency of the scores within the test
- two common methods: the Kuder-Richardson formula (KR-20) or Cronbach’s coefficient alpha
split half reliability
- splitting the test in half (ex - odd- vs. even-numbered questions) and correlating the two half-test scores; the resulting coefficient reflects a test only half the original length
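A sketch of an odd/even split in Python; the 0/1 item matrix is hypothetical:

```python
import numpy as np

# Hypothetical item scores: 6 examinees x 10 dichotomously scored items
items = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 0, 1, 1, 0, 1],
    [0, 1, 0, 1, 0, 1, 0, 1, 1, 0],
])

# Total each examinee's score on the odd- and even-numbered items
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

# Correlate the two half-test scores (reliability of a half-length test)
r_half = np.corrcoef(odd_half, even_half)[0, 1]
print(f"split-half correlation: {r_half:.2f}")
```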
Spearman-Brown prophecy formula
- a correction formula used with split-half reliability
- estimates how much more reliable the test would be if it were lengthened (ex - restored to full length after halving)
*inappropriate for speeded tests
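The formula projects the reliability of a test made k times as long: r_new = k*r / (1 + (k - 1)*r). A short Python sketch (the r values are hypothetical):

```python
def spearman_brown(r, k):
    """Project reliability if test length changes by a factor of k."""
    return k * r / (1 + (k - 1) * r)

# Correct a split-half correlation up to the full-length test (k = 2)
r_half = 0.70
print(f"full-test reliability: {spearman_brown(r_half, 2):.2f}")       # 0.82

# Estimate reliability if the test were tripled in length
print(f"tripled-length reliability: {spearman_brown(r_half, 3):.2f}")  # 0.88
```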
Kuder-Richardson and Cronbach’s coefficient alpha
- both estimate internal consistency by analyzing the correlation of each item with every other item in the test
- KR-20 is used when items are scored dichotomously (ex - right or wrong)
- Cronbach’s is used when items are scored non-dichotomously (ex - Likert scale)
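A sketch of the alpha computation in Python; with dichotomous 0/1 items like the hypothetical matrix below, the result is equivalent to KR-20:

```python
import numpy as np

def cronbach_alpha(items):
    """alpha = (k / (k - 1)) * (1 - sum of item variances / total score variance).
    With dichotomous (0/1) items this reduces to KR-20."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical right/wrong item matrix: 5 examinees x 4 items
scores = [[1, 1, 1, 0],
          [1, 0, 1, 0],
          [1, 1, 1, 1],
          [0, 0, 0, 0],
          [1, 1, 0, 1]]
print(f"KR-20 / alpha: {cronbach_alpha(scores):.2f}")
```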
interrater reliability
- the degree of agreement between two or more scorers when a test is subjectively scored
- best way to improve = provide opportunity for group discussion, practice exercises, and feedback
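The card names no specific index, so this Python sketch assumes two common ones, percent agreement and the correlation between the raters’ scores (all ratings hypothetical):

```python
import numpy as np

# Hypothetical ratings of 10 essays by two scorers (1-5 scale)
rater_a = np.array([4, 3, 5, 2, 4, 3, 5, 1, 4, 2])
rater_b = np.array([4, 3, 4, 2, 5, 3, 5, 2, 4, 2])

# Percent agreement: proportion of exactly matching ratings
agreement = (rater_a == rater_b).mean()

# Correlation between the raters' scores is another common index
r = np.corrcoef(rater_a, rater_b)[0, 1]
print(f"percent agreement: {agreement:.0%}, inter-rater r: {r:.2f}")
```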
validity
- the meaningfulness, usefulness, or accuracy of a measure
3 basic types of validity
- content
- criterion
- construct
face validity
- the degree to which a test subjectively appears to measure what it says it measures
content validity
- how adequately a test samples a particular content area
true positive
- test takers who are accurately identified as possessing what is being measured
*correct prediction
false positive
- test takers who are inaccurately identified as possessing what is being measured
*incorrect prediction
true negative
- test takers who are accurately identified as not possessing what is being measured
*correct prediction
false negative
- test takers who are inaccurately identified as not possessing what is being measured
*incorrect prediction
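A short Python sketch tallying the four outcomes above from hypothetical actual/predicted labels:

```python
# Classifying hypothetical test decisions against actual status
# (1 = possesses the attribute, 0 = does not)
actual    = [1, 1, 0, 0, 1, 0, 0, 1, 0, 1]
predicted = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # true positives
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # false positives
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))  # true negatives
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # false negatives
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")  # TP=3 FP=1 TN=4 FN=2
```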
item difficulty
- represented by ‘p’
- can range in value from 0.0 to 1.0 (low p = difficult, high p = easy)
- difficulty level = the proportion of test-takers who answered the item correctly
- items should have an average difficulty level of 0.50 and fall in a range of about 0.30 to 0.80
example - if ‘p’ is 0.10, that means only 10% of people got the item right (therefore it was difficult)
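A minimal Python sketch of the p calculation (the responses are hypothetical):

```python
# Hypothetical right/wrong responses to one item (1 = correct)
responses = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

# p = proportion of examinees answering correctly (higher p = easier item)
p = sum(responses) / len(responses)
print(f"item difficulty p = {p:.2f}")  # 0.70: a fairly easy item
```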
diagnostic validity
- the degree to which a test accurately identifies those who have and those who do not have the disorder
- an ideal situation would result in high sensitivity, specificity, hit rate, and predictive values, with few false positives and false negatives
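A sketch computing the standard diagnostic accuracy indices named above; the confusion-matrix counts are hypothetical:

```python
# Diagnostic accuracy indices from hypothetical confusion-matrix counts
tp, fp, tn, fn = 40, 10, 35, 15

sensitivity = tp / (tp + fn)          # proportion of disordered cases detected
specificity = tn / (tn + fp)          # proportion of non-disordered correctly cleared
ppv = tp / (tp + fp)                  # positive predictive value
npv = tn / (tn + fn)                  # negative predictive value
hit_rate = (tp + tn) / (tp + fp + tn + fn)  # overall proportion correct

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"PPV={ppv:.2f} NPV={npv:.2f} hit rate={hit_rate:.2f}")
```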