Item Analysis & Test Reliability Flashcards
(26 cards)
this refers to error due to random factors that affect the test performance of examinees in unpredictable ways; examples include distractions during testing, ambiguously worded test items, and examinee fatigue
measurement error
this refers to the extent to which a test provides consistent information
test reliability
reliability coefficients are interpreted directly as…
the proportion of variability in obtained test scores that’s due to true score variability
e.g., if a test has a reliability coefficient of .80, this means that 80% of the variability in obtained test scores is due to true score variability and the remaining 20% is due to measurement error
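As a minimal Python sketch of this interpretation, using made-up numbers (the variance values are illustrative, not from a real test):
```python
# Minimal sketch: decomposing observed score variance using a
# reliability coefficient. Values are illustrative, not from a real test.

reliability = 0.80      # reliability coefficient
observed_variance = 25  # variance of obtained test scores (SD = 5)

# Reliability is interpreted directly as the proportion of observed
# variance that reflects true score variance.
true_variance = reliability * observed_variance          # 20.0
error_variance = (1 - reliability) * observed_variance   # 5.0

print(f"true: {true_variance}, error: {error_variance}")
```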
which type of tests have higher reliability coefficients: attitude tests, personality tests, or cognitive ability tests? why?
standardized cognitive ability tests, because they measure relatively stable characteristics and are objectively scored; scores on attitude and personality tests are more affected by response sets and by fluctuations in the characteristics being measured
match each description with the correct method of assessing reliability: test-retest, alternate forms, internal consistency, or inter-rater reliability
a) provides information about the consistency of scores over different test items; useful for tests that are designed to measure a single content domain or aspect of behavior
b) provides information on the consistency of scores or ratings assigned by different raters; most useful for measures that are subjectively scored
c) provides information about the consistency of scores over time; most useful for tests designed to measure a characteristic that’s stable over time
d) provides information about the consistency of scores over different forms of the test and, when the second form is administered at a later time, the consistency of scores over time; most useful/important whenever a test has more than one form
a) internal consistency reliability
b) inter-rater reliability
c) test-retest reliability
d) alternate forms reliability
what tests is internal consistency reliability not useful for and why?
speed tests (tests that measure speed of performance rather than knowledge or skill level)
* because it tends to overestimate their reliability
list 4 methods of evaluating internal consistency
1) coefficient alpha (aka Cronbach’s alpha)
2) Kuder-Richardson 20 (KR-20)
3) split-half reliability
4) Spearman-Brown prophecy formula
match the description of methods used to evaluate internal consistency with the correct name of the method: coefficient alpha, KR-20, split-half reliability, or Spearman-Brown
A) involves administering the test to a sample of examinees, splitting the test in half (often in terms of even- and odd-numbered items), and correlating the scores on the two halves
B) used to determine the effects of lengthening or shortening a test on its reliability coefficient; usually used to correct the split-half reliability coefficient, which underestimates reliability because each half contains fewer items than the full test
C) used when test items are dichotomously scored (e.g., as correct or incorrect)
D) involves administering the test to a sample of examinees & calculating the average inter-item consistency
A) split-half reliability
B) KR-20
C) Spearman-Brown prophecy formula
D) coefficient alpha
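As a rough Python sketch of how three of these methods are computed, the following applies the standard textbook formulas to a made-up matrix of dichotomously scored items (rows = examinees, columns = items); because the items are scored 0/1, coefficient alpha here coincides with KR-20:
```python
# Sketch of internal consistency computations on made-up 0/1 item scores.
import numpy as np

scores = np.array([
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
])
n_items = scores.shape[1]
totals = scores.sum(axis=1)

# Coefficient alpha: based on the ratio of summed item variances to the
# variance of total scores (equivalent to KR-20 for dichotomous items).
alpha = (n_items / (n_items - 1)) * (
    1 - scores.var(axis=0, ddof=1).sum() / totals.var(ddof=1)
)

# Split-half: correlate odd- and even-numbered item halves, then apply
# the Spearman-Brown prophecy formula to correct for halved test length.
odd, even = scores[:, 0::2].sum(axis=1), scores[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd, even)[0, 1]
split_half = (2 * r_half) / (1 + r_half)  # Spearman-Brown correction

print(f"alpha/KR-20: {alpha:.2f}, split-half (corrected): {split_half:.2f}")
```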
this method of evaluating inter-rater reliability can be calculated for 2 or more raters, but will not take chance agreement into account and can result in an overestimate of reliability
percent agreement
this method for evaluating inter-rater reliability is used to assess the consistency of ratings assigned by two raters when the ratings represent a nominal scale
Cohen’s kappa coefficient
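As a minimal Python sketch contrasting the two inter-rater methods, using made-up ratings from two raters; kappa applies the standard correction for chance agreement based on each rater’s marginal frequencies:
```python
# Sketch contrasting percent agreement with Cohen's kappa for two raters
# assigning nominal categories. Ratings are made up for illustration.
from collections import Counter

rater_a = ["yes", "yes", "no", "yes", "no", "yes", "no", "yes", "yes", "no"]
rater_b = ["yes", "no",  "no", "yes", "no", "yes", "yes", "yes", "yes", "no"]
n = len(rater_a)

# Percent agreement ignores agreement expected by chance alone.
p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Chance agreement: the probability both raters assign the same category
# by chance, summed over categories (based on each rater's marginals).
count_a, count_b = Counter(rater_a), Counter(rater_b)
p_chance = sum(count_a[c] * count_b[c] for c in count_a) / n**2

# Cohen's kappa: observed agreement corrected for chance agreement.
kappa = (p_observed - p_chance) / (1 - p_chance)

print(f"percent agreement: {p_observed:.2f}, kappa: {kappa:.2f}")
```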
the reliability of subjective ratings can be affected by this, which occurs when 2 or more raters communicate with each other while assigning ratings and results in increased consistency (but often decreased accuracy) in ratings & an overestimate of inter-rater reliability
consensual observer drift
list ways to eliminate/reduce consensual observer drift
- not having raters work together
- providing raters with adequate training
- regularly monitoring the accuracy of raters’ ratings
list & describe the 3 factors that affect the size of the reliability coefficient
1) content homogeneity: tests that are homogenous with regard to content tend to have larger reliability coefficients than heterogeneous tests, esp. for internal consistency reliability
2) range of scores: larger when test scores are unrestricted in terms of range, which occurs when the examinees included in the sample are heterogeneous with regard to the characteristic(s) measured by the test (e.g., when the sample includes examinees who have high, moderate, and low levels of the characteristic)
3) guessing: affected by the likelihood that test items can be answered correctly by guessing (i.e., the easier it is to choose the correct answer by guessing, the lower the reliability coefficient)
- a true/false test is likely to be less reliable than a multiple-choice test that has 3+ answer choices
this is used to determine which items to include in the test, involves determining each item’s difficulty level & ability to discriminate between examinees who obtain high & low total test scores, and is based on classical test theory
item analysis
with regard to item analysis, what is the typical range of p values (item difficulty indexes) for moderately difficult items?
.30 to .70
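An item’s p value is simply the proportion of examinees who answered the item correctly; as a minimal Python sketch with made-up item responses:
```python
# Sketch of the item difficulty index: p is the proportion of examinees
# who answered the item correctly. Responses are made up for illustration.
item_responses = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]  # 1 = correct, 0 = incorrect

p = sum(item_responses) / len(item_responses)
print(f"p = {p:.2f}")  # 0.70: within the moderately difficult range
```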
what size p value is preferred for test items used to identify examinees who have mastered a certain level of knowledge or skill (e.g., on mastery tests)?
higher p values (most examinees who have mastered the material are expected to answer these items correctly)
this index indicates the difference between the percentage of examinees with high total test scores (often the top 27%) who answered the item correctly and the percentage of examinees with low total test scores (often the bottom 27%) who answered the item correctly
item discrimination index (D)
As an example, when 90% of examinees in the high-scoring group and 20% of examinees in the low-scoring group answered an item correctly, the item’s D value is .90 minus .20, which is .70.
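The same arithmetic as a tiny Python sketch, reusing the percentages from this example:
```python
# Sketch of the item discrimination index: D = p(high group) - p(low group).
p_high = 0.90  # proportion of high scorers (top ~27%) answering correctly
p_low = 0.20   # proportion of low scorers (bottom ~27%) answering correctly

D = p_high - p_low
print(f"D = {D:.2f}")  # 0.70: a highly discriminating item
```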
an item’s difficulty level affects its ability to…
discriminate (higher levels of discrimination for items of moderate difficulty)
this indicates the range within which an examinee’s true score is likely to fall, given their obtained score
confidence interval
this is used to construct a confidence interval and is calculated by multiplying the test’s standard deviation by the square root of 1 minus the reliability coefficient: SEM = SD√(1 − r)
standard error of measurement
For instance, if a test has a standard deviation of 5 and a reliability coefficient of .84, its standard error of measurement equals 5 times the square root of 1 minus .84: 1 minus .84 is .16, the square root of .16 is .4, and 5 times .4 is 2.
* In other words, when a test’s standard deviation is 5 and its reliability coefficient is .84, its standard error of measurement is 2.
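The same computation as a short Python sketch, reusing the numbers from this example:
```python
# Sketch of the standard error of measurement: SEM = SD * sqrt(1 - r).
import math

sd = 5            # test's standard deviation
reliability = 0.84

sem = sd * math.sqrt(1 - reliability)
print(f"SEM = {sem}")  # 2.0
```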
constructing a 68%, 95%, & 99% confidence interval
68% CI: add & subtract 1 standard error of measurement to & from the obtained score
95% CI: add & subtract 2 standard errors of measurement to & from the obtained score
99% CI: add & subtract 3 standard errors of measurement to & from the obtained score
an examinee obtained a score of 90 on a test that has a standard error of measurement of 5. Identify the 95% confidence interval for this score.
80 to 100
(add and subtract 2 standard errors of measurement, 2 × 5 = 10, to and from the obtained score: 90 − 10 = 80 and 90 + 10 = 100)
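As a short Python sketch constructing all three confidence intervals for this example (obtained score 90, SEM 5):
```python
# Sketch: build 68%, 95%, and 99% confidence intervals around an obtained
# score by adding/subtracting 1, 2, or 3 standard errors of measurement.
obtained_score = 90
sem = 5

for n_sems, level in [(1, 68), (2, 95), (3, 99)]:
    lower = obtained_score - n_sems * sem
    upper = obtained_score + n_sems * sem
    print(f"{level}% CI: {lower} to {upper}")
# 95% CI: 80 to 100
```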