Test Construction - Domain Quiz Flashcards
Content appropriateness, taxonomic level, and extraneous abilities are factors that are considered when evaluating:
Select one:
a. a test’s factorial validity.
b. a test’s incremental validity.
c. the relevance of test items.
d. the adequacy of the “actual criterion.”
In the context of test construction, relevance refers to the extent to which test items contribute to achieving the goals of testing.
Answer C is correct: Content appropriateness, taxonomic level, and extraneous abilities are three factors that may be considered when determining the relevance of test items.
Answer A is incorrect: Factorial validity refers to the extent to which a test has high correlations with factors it is expected to correlate with and low correlations with factors it is not expected to correlate with.
Answer B is incorrect: Incremental validity refers to the degree to which a test improves decision-making accuracy.
Answer D is incorrect: The actual criterion refers to the actual (versus ultimate) measure of performance.
The correct answer is: the relevance of test items.
For an achievement test item that has an item discrimination index (D) of +1.0, you would expect:
Select one:
a. high achievers to be more likely than low achievers to answer the item correctly.
b. low achievers to be more likely than high achievers to answer the item correctly.
c. moderate achievers to be more likely than high and low achievers to answer the item correctly.
d. low and high achievers to be equally likely to answer the item correctly.
The item discrimination index (D) is calculated by subtracting the percent of examinees in the lower-scoring group who answered the item correctly from the percent of examinees in the upper-scoring group who answered the item correctly. It ranges in value from -1.0 to +1.0.
Answer A is correct: When all examinees in the upper-scoring group and none in the lower-scoring group answered the item correctly, D is equal to +1.0.
The correct answer is: high achievers to be more likely than low achievers to answer the item correctly.
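As a minimal Python sketch of the calculation (the group sizes and counts below are illustrative, not taken from the question):

    # Item discrimination index: D = p_upper - p_lower, the difference between
    # the proportions of upper- and lower-scoring examinees answering correctly.
    def discrimination_index(upper_correct, upper_n, lower_correct, lower_n):
        return upper_correct / upper_n - lower_correct / lower_n

    # All 20 high scorers and none of the 20 low scorers got the item right:
    print(discrimination_index(20, 20, 0, 20))  # 1.0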
The item difficulty index (p) ranges in value from:
Select one:
a. -1.0 to +1.0.
b. -.50 to +.50.
c. 0 to +1.0.
d. 0 to 50.
The item difficulty index (p) indicates the proportion of examinees in the tryout sample who answered the item correctly.
Answer C is correct: The item difficulty index ranges in value from 0 to +1.0, with 0 indicating that none of the examinees answered the item correctly and +1.0 indicating that all examinees answered the item correctly.
The correct answer is: 0 to +1.0.
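A quick sketch of the calculation (the item scores below are made up):

    # Item difficulty index: p = proportion of examinees who answered correctly
    item_scores = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]  # 1 = correct, 0 = incorrect
    p = sum(item_scores) / len(item_scores)
    print(p)  # 0.7 -- 70% of the tryout sample answered the item correctly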
The optimal item difficulty index (p) for items included in a true or false test is:
Select one:
a. .25.
b. .50.
c. .75.
d. 1.0.
One factor that affects the optimal difficulty level of an item is the likelihood that an examinee can choose the correct answer by guessing, with the preferred level being halfway between 100% and the level of success expected by chance alone.
Answer C is correct: For true or false items, the probability of obtaining a correct answer by chance alone is .50. Therefore, the optimal difficulty level for true or false items is .75, which is halfway between 1.0 and .50.
The correct answer is: .75.
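A short sketch of the "halfway between chance and 1.0" rule; the four-option multiple-choice value is included for comparison:

    # Optimal item difficulty, adjusted for guessing:
    # p_optimal = (1.0 + chance probability of a correct answer) / 2
    def optimal_difficulty(chance):
        return (1.0 + chance) / 2

    print(optimal_difficulty(0.50))  # true/false items: 0.75
    print(optimal_difficulty(0.25))  # four-option multiple-choice: 0.625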
The slope (steepness) of an item characteristic curve indicates the item's:
Select one:
a. difficulty level.
b. discrimination.
c. reliability.
d. validity.
The various item response theory models yield item characteristic curves that provide information on one, two, or three parameters – i.e., difficulty level, discrimination, and probability of guessing correctly. Additional information on the item characteristic curve is provided in the Test Construction chapter of the written study materials.
Answer B is correct: An item’s ability to discriminate between high and low achievers is indicated by the slope of the item characteristic curve – the steeper the slope, the greater the discrimination.
The correct answer is: discrimination.
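To see how the slope works, here is a sketch of a two-parameter logistic item characteristic curve; the parameter values below are arbitrary:

    import math

    # Two-parameter logistic (2PL) ICC: P(theta) = 1 / (1 + exp(-a * (theta - b))),
    # where a = discrimination (slope) and b = difficulty (location).
    def icc(theta, a, b):
        return 1 / (1 + math.exp(-a * (theta - b)))

    # A steeper slope (larger a) separates examinees just below and just
    # above the item's difficulty level more sharply:
    for a in (0.5, 2.0):
        low = round(icc(-0.5, a, 0.0), 2)   # examinee slightly below b
        high = round(icc(0.5, a, 0.0), 2)   # examinee slightly above b
        print(a, low, high)  # a = 2.0 gives a much bigger gap than a = 0.5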
According to classical test theory, total variability in obtained test scores is composed of:
Select one:
a. true score variability plus random error
b. true score variability plus systematic error
c. a combination of communality and specificity
d. a combination of specificity and error
Answer A is correct: As defined by classical test theory, total variability in test scores is due to a combination of true score variability plus measurement (random) error - i.e., X = T + E.
The correct answer is: true score variability plus random error
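A small simulation (with arbitrary true score and error variances) illustrates the decomposition and its link to reliability:

    import random

    # Classical test theory: X = T + E. With random (uncorrelated) error,
    # var(X) = var(T) + var(E), and reliability = var(T) / var(X).
    random.seed(0)
    true_scores = [random.gauss(100, 15) for _ in range(100000)]
    observed = [t + random.gauss(0, 5) for t in true_scores]

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    print(var(true_scores) / var(observed))  # close to 225 / (225 + 25) = 0.90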
A problem with using percent agreement as a measure of inter-rater reliability is that it doesn’t take into account the effects of:
Select one:
a. sample heterogeneity.
b. test length.
c. chance agreement among raters.
d. inter-item inconsistency.
Inter-rater reliability can be assessed using percent agreement or by calculating the kappa statistic.
Answer C is correct: A disadvantage of percent agreement is that it doesn’t take into account the amount of agreement that could have occurred among raters by chance alone, which can provide an inflated estimate of the measure’s reliability. The kappa statistic is more accurate because it adjusts the reliability coefficient for the effects of chance agreement.
The correct answer is: chance agreement among raters.
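A minimal sketch of the adjustment (the agreement rates below are illustrative):

    # Cohen's kappa corrects observed percent agreement for chance agreement:
    # kappa = (p_observed - p_chance) / (1 - p_chance)
    def cohens_kappa(p_observed, p_chance):
        return (p_observed - p_chance) / (1 - p_chance)

    # Raters agree 80% of the time, but 50% agreement is expected by chance:
    print(cohens_kappa(0.80, 0.50))  # 0.6 -- lower than the inflated raw 0.80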
A researcher correlates scores on two alternate forms of an achievement test and obtains a reliability coefficient of .80. This means that ___% of observed test score variability reflects true score variability.
Select one:
a. 80
b. 64
c. 36
d. 20
Answer A is correct: A reliability coefficient is interpreted directly as the proportion of observed score variability that reflects true score variability, so a coefficient of .80 means that 80% of the variability is true score variability. Unlike a validity coefficient, a reliability coefficient is not squared when it is interpreted.
The correct answer is: 80
A test has a standard deviation of 12, a mean of 60, a reliability coefficient of .91, and a validity coefficient of .60. The test’s standard error of measurement is equal to:
Select one:
a. 12
b. 9.6.
c. 3.6.
d. 2.8.
To calculate the standard error of measurement, you need to know the standard deviation of the test scores and the test’s reliability coefficient.
Answer C is correct: The standard deviation of the test scores is 12 and the reliability coefficient is .91. To calculate the standard error of measurement, you multiply the standard deviation by the square root of one minus the reliability coefficient: 1 minus .91 is .09; the square root of .09 is .3; and .3 times 12 is 3.6. Additional information about the calculation and use of the standard error of measurement is provided in the Test Construction chapter of the written study materials.
The correct answer is: 3.6.
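A sketch of the calculation using the question's values (note that the mean and the validity coefficient are not needed):

    import math

    # Standard error of measurement: SEM = SD * sqrt(1 - reliability)
    def sem(sd, reliability):
        return sd * math.sqrt(1 - reliability)

    print(sem(12, 0.91))  # 3.6 (within floating-point rounding)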
Consensual observer drift tends to:
Select one:
a. increase the probability of answering a test item correctly by chance alone.
b. decrease the probability of answering a test item correctly by chance alone.
c. produce an overestimate of a test’s inter-rater reliability.
d. produce an underestimate of a test’s inter-rater reliability.
Consensual observer drift occurs when two or more observers working together influence each other’s ratings on a behavioral rating scale so that they assign ratings in a similar idiosyncratic way.
Answer C is correct: Consensual observer drift makes the ratings of different raters more similar, which artificially increases inter-rater reliability.
The correct answer is: produce an overestimate of a test’s inter-rater reliability.
For a newly developed test of cognitive flexibility, coefficient alpha is .55. Which of the following would be useful for increasing the size of this coefficient?
Select one:
a. adding more items that are similar in terms of content and quality
b. adding more items that are similar in terms of quality but different in terms of content
c. reducing the heterogeneity of the tryout sample
d. using a true or false format for the items rather than a multiple-choice format
For the exam, you want to be familiar with the methods for increasing reliability that are described in the Test Construction chapter of the written study materials.
Answer A is correct: A test’s reliability is increased when the test is lengthened by adding items of similar content and quality, the range of scores is unrestricted (i.e., the tryout sample heterogeneity is maximized), and the ability to choose the correct answer by guessing is reduced.
The correct answer is: adding more items that are similar in terms of content and quality
Sally Student receives a score of 450 on a college aptitude test that has a mean of 500 and standard error of measurement of 50. The 68% confidence interval for Sally’s score is:
Select one:
a. 400 to 450.
b. 400 to 500.
c. 450 to 550.
d. 350 to 550.
The standard error of measurement is used to construct a confidence interval around an obtained test score.
Answer B is correct: To construct the 68% confidence interval, one standard error of measurement is added to and subtracted from the obtained score. Since Sally obtained a score of 450 on the test, the 68% confidence interval for her score is 400 to 500. Additional information on constructing confidence intervals is provided in the Test Construction chapter of the written study materials.
The correct answer is: 400 to 500.
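A sketch of the calculation using Sally's values; the 95% interval (about 2 SEMs) is shown for comparison:

    # Confidence interval: obtained score +/- (number of SEMs) * SEM.
    # One SEM on each side gives the 68% interval; 1.96 SEMs gives the 95%.
    def confidence_interval(score, sem, n_sems=1.0):
        return score - n_sems * sem, score + n_sems * sem

    print(confidence_interval(450, 50))        # (400, 500) -- 68% interval
    print(confidence_interval(450, 50, 1.96))  # (352.0, 548.0) -- 95% interval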
The kappa statistic for a test is .95. This means that the test has:
Select one:
a. adequate inter-rater reliability.
b. adequate internal consistency reliability.
c. inadequate intra-rater reliability.
d. inadequate alternate forms reliability.
The kappa statistic (coefficient) is a measure of inter-rater reliability.
Answer A is correct: The reliability coefficient ranges in value from 0 to +1.0. Therefore a kappa statistic of .95 indicates a high degree of inter-rater reliability.
The correct answer is: adequate inter-rater reliability.
To assess the internal consistency reliability of a test that contains 50 items that are each scored as either “correct” or “incorrect,” you would use which of the following?
Select one:
a. KR-20
b. Spearman-Brown
c. kappa statistic
d. coefficient of concordance
For the exam, you want to be familiar with all of the measures listed in the answers to this question.
Answer A is correct: The Kuder-Richardson Formula 20 (KR-20) is a measure of internal consistency reliability that can be used when test items are scored dichotomously (correct or incorrect).
Answer B is incorrect: The Spearman-Brown formula is used to estimate the effects of lengthening or shortening a test on its reliability.
Answer C is incorrect: The kappa statistic (also known as the kappa coefficient) is a measure of inter-rater reliability.
Answer D is incorrect: The coefficient of concordance is another measure of inter-rater reliability.
The correct answer is: KR-20
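A minimal sketch of the KR-20 calculation (the toy response matrix is made up):

    # KR-20 = (k / (k - 1)) * (1 - sum(p_i * q_i) / total score variance),
    # where p_i is the proportion passing item i and q_i = 1 - p_i.
    def kr20(responses):  # rows = examinees, columns = items scored 0/1
        n, k = len(responses), len(responses[0])
        totals = [sum(row) for row in responses]
        mean_total = sum(totals) / n
        total_var = sum((t - mean_total) ** 2 for t in totals) / n
        pq = 0.0
        for j in range(k):
            p = sum(row[j] for row in responses) / n
            pq += p * (1 - p)
        return (k / (k - 1)) * (1 - pq / total_var)

    print(kr20([[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]))  # 0.75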
To determine a test’s internal consistency reliability by calculating coefficient alpha, you would:
Select one:
a. administer the test to a single sample of examinees two times.
b. administer two alternate forms of the test to a single sample of examinees.
c. administer the test to a single sample of examinees and have the tests scored by two raters.
d. administer the test to a single sample of examinees one time.
Knowing that coefficient alpha is a measure of internal consistency reliability would have helped you identify the correct answer to this question.
Answer D is correct: Determining internal consistency reliability with coefficient alpha involves administering the test once to a single sample of examinees and using the formula to determine the degree of inter-item consistency.
Answer A is incorrect: Administering the same test to a single sample of examinees on two occasions would be the procedure for assessing test-retest reliability.
Answer B is incorrect: Administering two alternate forms of the test to a single sample of examinees is the procedure for assessing alternate (equivalent) forms reliability.
Answer C is incorrect: Having a test that was administered to a single sample of examinees scored by two raters is the procedure for assessing inter-rater reliability.
The correct answer is: administer the test to a single sample of examinees one time.
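A minimal sketch of the calculation; note that it requires only one administration to a single sample (the response matrix is made up):

    # Cronbach's alpha = (k / (k - 1)) * (1 - sum of item variances / total variance)
    def cronbach_alpha(responses):  # rows = examinees, columns = items
        n, k = len(responses), len(responses[0])

        def var(xs):
            m = sum(xs) / len(xs)
            return sum((x - m) ** 2 for x in xs) / len(xs)

        item_var_sum = sum(var([row[j] for row in responses]) for j in range(k))
        total_var = var([sum(row) for row in responses])
        return (k / (k - 1)) * (1 - item_var_sum / total_var)

    print(round(cronbach_alpha([[3, 4, 3], [2, 2, 3], [5, 4, 4], [1, 2, 2]]), 2))  # 0.9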
To estimate the effects of lengthening a 50-item test to 100 items on the test’s reliability, you would use which of the following?
Select one:
a. eta
b. KR-20
c. kappa coefficient
d. Spearman-Brown formula
For the exam, you want to be familiar with the measures listed in the answers to this question. These are described in the Test Construction chapter of the written study materials.
Answer D is correct: The Spearman-Brown prophecy formula is used to estimate the effects of lengthening or shortening a test on its reliability coefficient.
The correct answer is: Spearman-Brown formula
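A sketch of the formula; the original reliability of .80 below is illustrative, since the question does not supply one:

    # Spearman-Brown prophecy formula: r_new = (n * r) / (1 + (n - 1) * r),
    # where n is the factor by which the test length changes.
    def spearman_brown(r, n):
        return (n * r) / (1 + (n - 1) * r)

    # Doubling a 50-item test to 100 items means n = 100 / 50 = 2:
    print(round(spearman_brown(0.80, 2), 2))  # 0.89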
Which of the following methods for evaluating reliability is most appropriate for speed tests?
Select one:
a. split-half
b. coefficient alpha
c. kappa statistic
d. coefficient of equivalence
Answer D is correct: Of the methods for evaluating reliability, the coefficient of equivalence (also known as alternate or equivalent forms reliability) is most appropriate for speed tests. Split-half reliability and coefficient alpha are types of internal consistency reliability, and measures of internal consistency reliability overestimate the reliability of speed tests. The kappa statistic is a measure of inter-rater reliability.
The correct answer is: coefficient of equivalence
You administer a test to a group of examinees on April 1st and then re-administer the same test to the same group of examinees on May 1st. When you correlate the two sets of scores, you will have obtained a coefficient of:
Select one:
a. internal consistency.
b. determination.
c. equivalence.
d. stability.
Correlating two sets of scores obtained by the same group of examinees produces a test-retest reliability coefficient.
Answer D is correct: Test-retest reliability indicates the stability of scores over time, and the test-retest reliability coefficient is also known as the coefficient of stability.
The correct answer is: stability.
A test developer uses a sample of 50 current employees to identify items for and then validate a new selection test (predictor). When she correlates scores on the test with scores on a measure of job performance (criterion) for this sample, she obtains a criterion-related validity coefficient of .63. When the test developer administers the test and the measure of job performance to a new sample of 50 employees, she will most likely obtain a validity coefficient that is:
Select one:
a. greater than .63.
b. less than .63.
c. about .63.
d. negative in value.
This question is asking about “shrinkage,” which occurs when a test is cross-validated on another sample.
Answer B is correct: The validity coefficient tends to “shrink” (be smaller) on the second sample because the test was tailor-made for the initial sample and the chance factors that contributed to the validity coefficient in the initial sample will not all be present in the second sample.
The correct answer is: less than .63.
A test’s content validity is established primarily by which of the following?
Select one:
a. conducting a factor analysis
b. assessing the test’s convergent and divergent validity
c. having subject matter experts systematically review the test’s items
d. testing hypotheses about the attribute(s) measured by the test
For the exam, you want to be familiar with the differences between content, construct, and criterion-related validity.
Answer C is correct: Content validity refers to the degree to which test items are an adequate sample of the content domain and is determined primarily by the judgment of subject matter experts. The methods listed in the other answers are used to establish a test’s construct validity.
The correct answer is: having subject matter experts systematically review the test’s items
A test’s specificity refers to the number of __________ that were identified by the test.
Select one:
a. true positives
b. false positives
c. true negatives
d. false negatives
For the exam, you want to know the difference between specificity and sensitivity, which are terms that are used to describe a test’s accuracy.
Answer C is correct: Specificity refers to the identification of true negatives (percent of cases in the validation sample who do not have the disorder and were accurately classified by the test as not having the disorder). Additional information on sensitivity and specificity is provided in the Test Construction chapter of the written study materials.
Answer A is incorrect: Sensitivity refers to the number of true positives.
The correct answer is: true negatives
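A sketch computing specificity alongside the three related accuracy indices from a 2x2 classification table (the counts below are illustrative):

    # Cell counts from a validation sample:
    tp, fp, tn, fn = 40, 10, 35, 15  # true/false positives and negatives

    sensitivity = tp / (tp + fn)  # cases with the disorder correctly flagged
    specificity = tn / (tn + fp)  # cases without the disorder correctly cleared
    ppv = tp / (tp + fp)          # positive predictive value
    npv = tn / (tn + fn)          # negative predictive value

    print(sensitivity, specificity, ppv, npv)  # ~0.73, ~0.78, 0.8, 0.7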
In a multitrait-multimethod matrix, a test’s construct validity would be confirmed when:
Select one:
a. monotrait-monomethod coefficients are low and heterotrait-heteromethod coefficients are high.
b. monotrait-heteromethod coefficients are high and heterotrait-monomethod coefficients are low.
c. monotrait-monomethod coefficients are high and monotrait-heteromethod coefficients are low.
d. heterotrait-monomethod coefficients and heterotrait-heteromethod coefficients are low.
This question is asking about the pattern of correlation coefficients in a multitrait-multimethod matrix that provide evidence of a test’s construct validity.
Answer B is correct: When monotrait-heteromethod (same trait-different methods) coefficients are large, this provides evidence of the test’s convergent validity – i.e., it shows that the test is measuring the trait it was designed to measure. Conversely, when heterotrait-monomethod (different traits-same method) coefficients are small, this provides evidence of the test’s discriminant validity – i.e., it shows that the test is not measuring a different trait. Additional information on the correlation coefficients contained in a multitrait-multimethod matrix is provided in the Test Construction chapter of the written study materials.
The correct answer is: monotrait-heteromethod coefficients are high and heterotrait-monomethod coefficients are low.
In a scatterplot constructed from data collected in a concurrent validity study, the number of “false negatives” is likely to increase if:
Select one:
a. the predictor and criterion cutoff scores are both raised.
b. the predictor and criterion cutoff scores are both lowered.
c. the predictor cutoff score is raised and/or the criterion cutoff score is lowered.
d. the predictor cutoff score is lowered and/or the criterion cutoff score is raised.
An illustration is provided in the Test Construction materials that can help you visualize what happens when the predictor and/or criterion cutoff scores are changed.
Answer C is correct: The number of false negatives increases when the predictor cutoff score is raised (moved to the right in a scatterplot) and/or when the criterion cutoff score is lowered (moved toward the bottom of the scatterplot).
The correct answer is: the predictor cutoff score is raised and/or the criterion cutoff score is lowered.
____________ refers to the percent of examinees who have the condition being assessed by a predictor who are identified by the predictor as having the condition.
Select one:
a. Specificity
b. Sensitivity
c. Positive predictive value
d. Negative predictive value
Answer B is correct: Sensitivity refers to the probability that a predictor will correctly identify people with the disorder from the pool of people with the disorder. It is calculated using the following formula: true positives / (true positives + false negatives).
The correct answer is: Sensitivity