Test construction Flashcards
What is the item discrimination index?
The item discrimination index (D) indicates the difference between the percentage of examinees with high total test scores who answered the item correctly and the percentage of examinees with low total test scores who answered the item correctly. D ranges from -1.00 to +1.00; when the same percentage of examinees in the two groups answered the item correctly, D equals 0.
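For example, a minimal sketch of how D might be computed; the group sizes and counts below are hypothetical:

```python
# Hypothetical example: proportion of correct answers to one item in the
# upper- and lower-scoring groups (e.g., top and bottom 27% of examinees).
p_upper = 24 / 30   # 24 of 30 high scorers answered correctly -> .80
p_lower = 9 / 30    # 9 of 30 low scorers answered correctly  -> .30

D = p_upper - p_lower   # item discrimination index
print(round(D, 2))      # 0.5 (0 = no discrimination, negative = reverse discrimination)
```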
How can you increase the reliability coefficient?
Reliability coefficients tend to be larger for longer tests than for shorter tests, as long as the added items address content similar to the original items.
Reliability is also maximized when the range of scores is unrestricted; the range is maximized when examinees are heterogeneous with regard to the content measured by the test.
-the difficulty level of items also affects the range: all easy or all hard items lead to all high or all low test scores, so the average difficulty level of items should be mid-range
Explain classical test theory
Classical test theory is also known as true score theory and predicts that an obtained test score (X) is the sum of true score variability (T) and measurement error (E), i.e., X = T + E, with measurement error referring to random factors that affect test performance in unpredictable ways.
How do you interpret a reliability coefficient?
A reliability coefficient is interpreted directly as the amount of variability in test scores that’s due to true score variability. When a test’s reliability coefficient is .90, this means that 90% of variability in test scores is due to true score variability and the remaining 10% is due to measurement error.
What is the spearman-brown formula used for?
Test length is one of the factors that affects the size of the reliability coefficient, and the Spearman-Brown formula is often used to estimate the effect of lengthening or shortening a test on its reliability coefficient. The formula is especially useful for correcting the split-half reliability coefficient: assessing split-half reliability involves splitting the test in half and correlating scores on the two halves, and because shorter tests are less reliable than longer ones, split-half reliability tends to underestimate the test's actual reliability. The Spearman-Brown formula is used to estimate the reliability coefficient for the full length of the test.
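A sketch of the Spearman-Brown prophecy formula, where n is the factor by which the test is lengthened (n = 2 corrects a split-half coefficient to full length); the numbers are hypothetical:

```python
def spearman_brown(r, n):
    """Estimate the reliability of a test lengthened by a factor of n."""
    return (n * r) / (1 + (n - 1) * r)

# Hypothetical split-half coefficient of .70 corrected to full test length (n = 2):
print(round(spearman_brown(0.70, 2), 2))   # 0.82
```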
When is Cohen’s Kappa coefficient used?
The kappa coefficient is used to assess the consistency of ratings assigned by two raters when the ratings represent a nominal scale (e.g., when a rating scale classifies children as either meeting or not meeting the diagnostic criteria for ADHD).
used to evaluate inter-rater reliability
corrected for chance agreement between raters
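A minimal sketch of kappa for two raters making a yes/no classification; all counts below are hypothetical:

```python
# Hypothetical 2x2 agreement table for two raters classifying 100 children
# as meeting (yes) or not meeting (no) ADHD diagnostic criteria.
both_yes, both_no = 40, 30            # raters agree
r1_yes_r2_no, r1_no_r2_yes = 20, 10   # raters disagree
n = both_yes + both_no + r1_yes_r2_no + r1_no_r2_yes

p_observed = (both_yes + both_no) / n                     # percent agreement = .70
p1_yes = (both_yes + r1_yes_r2_no) / n                    # rater 1 "yes" rate
p2_yes = (both_yes + r1_no_r2_yes) / n                    # rater 2 "yes" rate
p_chance = p1_yes * p2_yes + (1 - p1_yes) * (1 - p2_yes)  # agreement expected by chance

kappa = (p_observed - p_chance) / (1 - p_chance)
print(round(kappa, 2))   # 0.4 -- lower than percent agreement once chance is removed
```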
When do you use the Kuder-Richardson 20?
Kuder-Richardson 20 (KR-20) can be used to assess a test's internal consistency reliability when test items are dichotomously scored (e.g., as correct or incorrect); it is an alternative to Cronbach's alpha.
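A sketch of KR-20 using assumed data, where each row is an examinee and each column an item scored 0/1:

```python
import numpy as np

# Hypothetical 0/1 item scores: 5 examinees x 4 items.
scores = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
])

k = scores.shape[1]                          # number of items
p = scores.mean(axis=0)                      # item difficulties (proportion correct)
q = 1 - p                                    # proportion incorrect
total_var = scores.sum(axis=1).var(ddof=1)   # variance of examinees' total scores

kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(round(kr20, 2))
```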
What is test reliability?
extent to which a test provides consistent information
r = reliability coefficient (a correlation coefficient)
-ranges from 0 to 1
-interpreted as the amount of variability in test scores that's due to true score variability
-do NOT square this coefficient; interpret it as is
formula to calculate standard error of measurement
SEM = SD × √(1 - r)
where SD = standard deviation of the test scores and r = reliability coefficient
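For example, with assumed values for a test with SD = 15 and a reliability coefficient of .91:

```python
sem = 15 * (1 - 0.91) ** 0.5   # SEM = SD * sqrt(1 - r)
print(round(sem, 1))           # 4.5
```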
how to construct CI for 68%, 95% and 99%
from the person's obtained score, add and subtract 1 SEM for the 68% CI, 2 SEM for the 95% CI, and 3 SEM for the 99% CI
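Continuing the hypothetical SEM example above (obtained score of 100, SEM = 4.5):

```python
score, sem = 100, 4.5
ci_68 = (score - 1 * sem, score + 1 * sem)   # (95.5, 104.5)
ci_95 = (score - 2 * sem, score + 2 * sem)   # (91.0, 109.0)
ci_99 = (score - 3 * sem, score + 3 * sem)   # (86.5, 113.5)
```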
what does squaring a correlation coefficient tell you?
a correlation coefficient can only be squared when it represents the correlation between two different tests
when squared, it provides a measure of shared variability; interpretations use terms like "accounted for by" or "explained by"
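A hypothetical example:

```python
r = 0.60                  # correlation between scores on two different tests
shared = r ** 2           # proportion of shared variability
print(round(shared, 2))   # 0.36: 36% of variability in one measure is explained by the other
```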
What does cronbach’s alpha measure?
internal consistency reliability; unlike KR-20, it can be used when items are scored dichotomously or on a multi-point scale
What is the problem with split-half reliability?
for split-half reliability, you split the test in half, administer it, and then look at the correlation between scores on the two halves
the problem is that shorter tests are less reliable than longer tests, so the reliability coefficient for the half-tests underestimates the full test's true reliability
this is corrected with the Spearman-Brown prophecy formula
what is percent agreement?
used to assess inter-rater reliability for 2 or more raters; it does not take chance agreement into account and can overestimate reliability
Cohen's kappa is preferred because it is corrected for chance agreement
What are factors that affect the reliability coefficient?
content homogeneity- homogeneous item content leads to larger reliability coefficients
range of scores- reliability coefficients are larger when the range of test scores is unrestricted
guessing- the easier it is to guess the correct answer, the lower the reliability coefficient
What is item analysis used for in test construction?
to determine which items to include based on difficulty level and ability to discriminate between examinees who obtain high and low scores
how is item difficulty determined
for dichotomously scored items, item difficulty (p) is the percentage of examinees who answered the item correctly; it ranges from 0 to 1.0, with smaller values indicating more difficult items
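For example, assuming hypothetical item responses scored 0/1:

```python
responses = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # 10 examinees, one item
p = sum(responses) / len(responses)          # item difficulty (p-value)
print(p)                                     # 0.7 -> a relatively easy item
```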
What is item response theory?
an alternative to classical test theory: CTT is test-based, while IRT is item-based.
overcomes limitations of CTT (e.g., item parameters do not depend on the particular sample of examinees), which makes it better suited for developing computerized adaptive tests
What is an item characteristic curve and what does it tell you?
describes the relationship between each item and the latent trait being measured by the test
x-axis = level of the latent trait (often estimated from total test scores)
y-axis = probability of answering the item correctly
location of the curve = difficulty parameter; curves for easier items (more likely to be answered correctly) are located toward the left side of the graph, and curves for harder items are located toward the right
slope of the curve = discrimination parameter, i.e., how well the item discriminates between individuals with high and low levels of the trait; the steeper the slope, the better the discrimination
point at which the curve crosses the y-axis = probability of guessing correctly; the closer it is to 0, the more difficult the item is to guess
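A sketch of a three-parameter logistic (3PL) item characteristic curve, which includes the three parameters described above (a = discrimination, b = difficulty, c = guessing); the parameter values are hypothetical:

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    """Probability of a correct response at trait level theta under the 3PL model."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)                 # range of latent trait levels
probs = icc_3pl(theta, a=1.5, b=0.0, c=0.20)  # steeper slope (larger a) = better discrimination
print(np.round(probs, 2))                     # probabilities rise from near c toward 1.0
```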
What is content validity?
items of the test are a clearly representative sample of the domain being tested
What is construct validity?
important for tests designed to measure a hypothetical trait that cannot be directly observed but is inferred from behavior
includes convergent and divergent (discriminant) validity
convergent- degree to which scores on the test have high correlations with scores on other measures designed to assess the same or related constructs
divergent- degree to which test scores have low correlations with measures of unrelated constructs
What is multitrait-multimethod matrix used for?
provides info about a test's reliability and its convergent and divergent validity
the test and 3 other measures are administered: 1) a test assessing the same trait with a different method, 2) a test of an unrelated trait using the same method, and 3) a test of an unrelated trait using a different method
correlate all pairs of test scores and interpret
how do you interpret the correlations from a multitrait-multimethod matrix
monotrait-monomethod- this is the reliability coefficient (e.g., coefficient alpha), obtained by correlating the test with itself
monotrait-heteromethod- correlation between the new test and the test that measures the same trait with a different method; when this coefficient is large, it provides evidence for convergent validity
heterotrait-monomethod- correlation between the new test and the test of a different trait using the same method; a small correlation demonstrates divergent validity
heterotrait-heteromethod- correlation between the new test and the test that assesses an unrelated trait with a different method; a small correlation is evidence for divergent validity
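A hypothetical set of MTMM coefficients and how each would be read (all values are made up for illustration):

```python
# Hypothetical correlations from a multitrait-multimethod study of a new test.
mtmm = {
    "monotrait-monomethod":    0.90,  # test with itself -> reliability
    "monotrait-heteromethod":  0.75,  # same trait, different method -> large = convergent validity
    "heterotrait-monomethod":  0.15,  # different trait, same method -> small = divergent validity
    "heterotrait-heteromethod": 0.05, # different trait, different method -> small = divergent validity
}
for label, r in mtmm.items():
    print(f"{label}: {r}")
```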
What is factor analysis used for?
to assess a test’s convergent and divergent validity
administer the test being developed along with tests of similar and unrelated traits, correlate all pairs of scores, place the correlations in a correlation matrix, derive a factor matrix, rotate the factor matrix, and then name and interpret the factors
the factor matrix is rotated so that the factors are easier to interpret
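A minimal sketch of the extraction step, using an assumed correlation matrix for four measures (tests 1-2 assess related traits, tests 3-4 assess unrelated traits); in practice, rotation and naming of the factors would follow:

```python
import numpy as np

# Hypothetical correlation matrix among the four measures.
R = np.array([
    [1.00, 0.70, 0.10, 0.05],
    [0.70, 1.00, 0.15, 0.10],
    [0.10, 0.15, 1.00, 0.60],
    [0.05, 0.10, 0.60, 1.00],
])

# Simple principal-components-style extraction: eigendecomposition of the correlation matrix.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]    # largest factors first
loadings = eigvecs[:, order[:2]] * np.sqrt(eigvals[order[:2]])
print(np.round(loadings, 2))         # unrotated loadings of each test on 2 factors
```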