Ch. 8 - Test Development Flashcards

(65 cards)

1
Q

5 stages of test development

A
1. conceptualization
2. construction
3. tryout
4. item analysis
5. test revision

2
Q

test construction

A

process of writing possible test items

3
Q

test tryout

A

administering a test to a representative sample of testtakers under conditions that simulate those of the final version of the test

4
Q

some questions to ask when developing a new test

A

What is the test designed to measure? (what construct)
Is there a need for this test?
Who will use and take the test?
How will the test be administered?
Is there any potential for harm?
How will meaning be attributed to the scores on the test?

5
Q

on a norm-referenced test, a good item is one that…

A

high scorers on the test as a whole answer correctly

6
Q

on a criterion referenced test, you need to do exploratory/pilot work with…

A

a group known to have mastered the skill

7
Q

pilot work/study

Why is it done?

A

work done surrounding the creation of the prototype of a test
done to determine how to best measure a targeted construct

8
Q

scaling

A

setting rules for assigning numbers in measurement; the process by which a measuring device is designed and calibrated and by which numbers (or other indices), AKA scale values, are assigned to different amounts of the thing being measured.

9
Q

stanine scale

A

all raw scores on the test can be transformed into scores that range from 1 to 9
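
A minimal sketch of the stanine transformation, assuming raw scores have already been converted to z-scores; stanines have a mean of 5 and an SD of 2, and the linear formula below is one common approximation (the function name is illustrative):

```python
def to_stanine(z: float) -> int:
    """Map a z-score onto the stanine scale (mean 5, SD 2), clamped to 1-9."""
    return max(1, min(9, round(z * 2 + 5)))

print(to_stanine(0.0))   # 5: an average raw score
print(to_stanine(1.5))   # 8: well above the mean
print(to_stanine(-3.0))  # 1: clamped at the bottom of the scale
```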

10
Q

age and grade-based scales

A

used when testtakers’ performance as a function of age or grade is of critical interest

11
Q

Likert scale

A

very reliable; each item presents a scale of 1-5 or 1-7 response alternatives (e.g., strongly disagree to strongly agree)

12
Q

rating scales

provide what kind of data?

A

grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker
ALL rating scales provide ordinal data

13
Q

method of paired comparisons

A

testtakers are presented with a pair of stimuli and must choose between them.
provide ordinal data

14
Q

comparative scaling

A

testtaker must judge a stimulus in comparison with every other stimulus on the scale

15
Q

categorical scaling

A

stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum. For ex: sort into “never justified,” “sometimes justified,” “always justified”

16
Q

Guttman scale

A

items on it range sequentially from weaker to stronger expressions of the attitude or trait being measured (everyone who agrees with a stronger statement also agrees with the weaker ones). used in consumer research
AKA scalogram analysis
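
A small sketch of the scalogram property described above, assuming items are ordered from the weakest to the strongest statement and responses are coded 1 = agree, 0 = disagree (the helper name is made up):

```python
def is_guttman_pattern(responses: list[int]) -> bool:
    """With items ordered weakest -> strongest, a perfect Guttman pattern is a
    run of 1s (agree) followed only by 0s (disagree)."""
    seen_disagree = False
    for r in responses:
        if r == 1 and seen_disagree:
            return False
        if r == 0:
            seen_disagree = True
    return True

print(is_guttman_pattern([1, 1, 0, 0]))  # True: agrees only with the weaker statements
print(is_guttman_pattern([1, 0, 1, 0]))  # False: agrees with a stronger statement after disagreeing
```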

17
Q

direct estimation vs indirect estimation

A

in direct estimation, you don’t need to transform a testtaker’s responses into some other scale. in indirect, you do need to transform those responses.

18
Q

equal-appearing intervals method

A

the only rating scale described that has items that are interval in nature (ex: suicide scale) - there are presumed to be equal distances between the values on the scale (interval scale)

19
Q

How many test items should an item pool contain for a multiple-choice test?

A

twice the number of items planned for the final version of the test

20
Q

item pool

A

an assembly of many test items (from brainstorming all, or at least many, possible test items)

21
Q

selected-response format (item types)

A

multiple choice
matching
true-false (binary-choice item)

22
Q

constructed-response format (item types)

A

completion item
short answer
essay
looking for synthesis of info

23
Q

item bank

A

collection of GOOD test items; these items will continue to be selected and used or rotated. can be thought of as the finalized version of the item pool.

24
Q

CAT

A

computerized adaptive testing - a test-taking process wherein the items presented to the testtaker are based on performance on previous items. items may be displayed according to rules (e.g., only after you get 2 hard ones right do you see the next level).

25
Q

floor effect vs. ceiling effect reduced by?

A

CAT tends to reduce both.
floor effect - failure to distinguish among low scorers/low-ability testtakers
ceiling effect - failure to distinguish among high scorers/high-ability testtakers

26
Q

item branching

A

the ability of a computer to tailor the content and order of presentation of items on the basis of responses to previous items
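
A toy sketch of an item-branching rule of the kind described above; the specific thresholds and names are invented for illustration:

```python
def next_level(level: int, streak_correct: int, streak_wrong: int,
               max_level: int = 5) -> int:
    """Branching rule: move up a difficulty level after 2 consecutive correct
    answers, down a level after 2 consecutive wrong ones, otherwise stay."""
    if streak_correct >= 2:
        return min(level + 1, max_level)
    if streak_wrong >= 2:
        return max(level - 1, 1)
    return level

print(next_level(level=3, streak_correct=2, streak_wrong=0))  # 4: harder items next
```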
27
Q

CAT found to reduce...

A

the number of test items needed (by 50%)
measurement error (by 50%)

28
Q

class scoring AKA

A

category scoring
testtakers are placed in a certain group/class with other testtakers whose pattern of responses is similar in some way (e.g., diagnosis)

29
Q

cumulative scoring model

A

the higher the score, the higher the testtaker is on the thing being measured

30
Q

ipsative scoring

A

compares a testtaker's score on one scale within a test to their score on another scale within the same test. comparing yourself with yourself. e.g., Jon is cooler than he is smart, BUT you can't say Jon is cooler than Jenny
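
A tiny illustration of the ipsative constraint: scores are compared within one testtaker's own profile, never across testtakers (the names and numbers are made up):

```python
jon = {"coolness": 62, "smarts": 48}    # Jon's scale scores (hypothetical)
jenny = {"coolness": 70, "smarts": 55}  # Jenny's scale scores (hypothetical)

# Ipsative: Jon compared with Jon
print("Jon is cooler than he is smart:", jon["coolness"] > jon["smarts"])

# NOT allowed under ipsative scoring: comparing Jon's coolness with
# Jenny's would be a normative (between-person) comparison.
```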
31
Q

test tryout - how many people?

A

no fewer than 5 for EACH item on the test; the more the better, preferably 10 per item

32
Q

How do you tell whether a test item is good?

A

item analysis
generally, a good test item is answered correctly by high scorers on the test as a whole

33
Q

item analysis

A

the statistical scrutiny of test data

34
Q

what do you scrutinize in item analysis?

A

an item's: difficulty, validity, reliability, item discrimination (IDDRV)

35
Q

item-difficulty index | expected values

A

denoted by a lowercase italicized p
p = number of testtakers who answered the item correctly / total number of testtakers
the greater the p, the easier the item
if p > .90 - is the item really needed, other than as a giveaway/warmup for those with test anxiety?
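
A minimal sketch of the item-difficulty index as defined above (1 = correct, 0 = incorrect; the data are made up):

```python
def item_difficulty(responses: list[int]) -> float:
    """p = proportion of testtakers who answered the item correctly."""
    return sum(responses) / len(responses)

p = item_difficulty([1, 1, 1, 0, 1, 1, 1, 1, 1, 0])
print(p)                      # 0.8
print("giveaway?", p > 0.90)  # False; items above .90 may only serve as warmups
```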
36
Q

what is an item-difficulty index called on a personality test?

A

item-endorsement index

37
Q

item-reliability index

A

shows the internal consistency of a test; the higher the value, the greater the test's internal consistency
= s * r (item standard deviation * correlation between item score and total test score)

38
Q

item-validity index

A

measures the degree to which a test item is measuring what it's supposed to measure
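
Both the item-reliability and item-validity indices take the form s * r; only the second variable in the correlation differs (total test score for reliability, a criterion score for validity). A sketch using only the standard library (statistics.correlation requires Python 3.10+; all data are made up):

```python
import statistics as st

def sr_index(item_scores, other_scores):
    """s * r: item standard deviation times the correlation between the item
    and either total test score (reliability) or a criterion (validity)."""
    s = st.pstdev(item_scores)                     # item SD (population form)
    r = st.correlation(item_scores, other_scores)  # Pearson r
    return s * r

item   = [1, 0, 1, 1, 0, 1]              # item responses (1 = correct/endorsed)
totals = [48, 31, 52, 45, 35, 50]        # total test scores (made up)
crit   = [3.2, 2.1, 3.8, 3.0, 2.4, 3.5]  # criterion measure (made up)

print("item-reliability index:", sr_index(item, totals))
print("item-validity index:   ", sr_index(item, crit))
```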
39
Q

item-discrimination index

A

measures how adequately an item separates/discriminates between high and low scorers on the entire test
yields a lowercase italicized d
d compares performance on a particular item by testtakers in the upper region of a distribution of continuous test scores with performance by those in the lower region
the higher the d, the more high scorers (relative to low scorers) are answering it correctly

40
Q

what does a high d value mean

A

the higher the d, the more high scorers (relative to low scorers) answer the item correctly
Bonus: a negative d means more low scorers than high scorers answer it correctly
Bonus: if d = 0, the same number of high and low scorers get it right
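
A sketch of d computed from equal-sized upper and lower scoring groups (the counts are made up; a common convention is to use the top and bottom 27% of scorers):

```python
def discrimination_index(upper_correct: int, lower_correct: int,
                         group_size: int) -> float:
    """d = (U - L) / n, where U and L count correct answers in the upper and
    lower scoring groups and n is the number of testtakers in each group."""
    return (upper_correct - lower_correct) / group_size

print(discrimination_index(27, 8, 32))   # ~0.59: high scorers do far better
print(discrimination_index(10, 10, 32))  # 0.0: no discrimination
print(discrimination_index(6, 14, 32))   # negative: low scorers do better -> bad item
```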
41
Q

what does a high p mean

A

the test item is easy

42
Q

analysis of item alternatives

A

for multiple-choice items, see how many people chose each distractor and evaluate the distractors accordingly (e.g., one may be too distracting, or its wording may need to be changed)

43
Q

item-characteristic curve

A

ICC - a graphic representation of item difficulty and discrimination

44
Q

steep slope in ICC means?

A

greater item discrimination
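
These cards treat the ICC graphically; one standard way to model it mathematically is the two-parameter logistic curve from item response theory, sketched here as an assumption beyond the cards themselves (a = discrimination/slope, b = difficulty/location):

```python
import math

def icc(theta: float, a: float, b: float) -> float:
    """2PL item-characteristic curve: P(correct | ability theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A steeper slope (larger a) discriminates more sharply around b:
for a in (0.5, 2.0):
    print(a, [round(icc(t, a, b=0.0), 2) for t in (-1, 0, 1)])
# 0.5 [0.38, 0.5, 0.62]  <- shallow: weak discrimination
# 2.0 [0.12, 0.5, 0.88]  <- steep: strong discrimination
```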
45
"good" item looks like what in ICC
straight line with a slope
46
"good" item for a cutoff-score test or criterion-based test
looks like the top of an F
47
guessing - what issues does it present in item analysis?
- guesses are not made totally randomly - how do we deal with omitted items? - some people are luckier guessers than others
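
One classic response to these issues is the correction-for-guessing formula, score = R - W / (k - 1); this is standard psychometrics rather than something stated on these cards, and it is debated for exactly the reasons listed above:

```python
def corrected_score(right: int, wrong: int, choices_per_item: int) -> float:
    """Classic correction for guessing: R - W / (k - 1).
    Omitted items are simply left out, which is one of the problems noted above."""
    return right - wrong / (choices_per_item - 1)

print(corrected_score(right=40, wrong=12, choices_per_item=4))  # 40 - 12/3 = 36.0
```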
48
Q

item fairness

A

the degree (if any) to which a test item is biased

49
Q

biased test item

A

an item that favors one particular group of examinees when differences in group ability are controlled

50
Q

if an item is fair, its ICC should...

A

not be significantly different for different groups, regardless of ability

51
Q

item analysis in speed tests

A

yields misleading or uninterpretable results because items closer to the end appear more difficult simply because fewer people were able to reach them
52
Q

what are methods of qualitative item analysis

A

interviews, group discussions, "think aloud" test administration (sheds light on thought patterns), and sensitivity reviews

53
Q

sensitivity reviews

A

review by an expert panel: items on a test are examined for fairness to all prospective testtakers and flagged for offensive language and stereotypes

54
Q

test revision (as a stage in test development) - strategy

A

characterize each item according to its strengths and weaknesses
consider the purpose of the test:
- if for hiring and firing, eliminate biased items
- if for culling the most skilled performers, keep the items with the best item discrimination to ID the best of the best
55
Q

standardization

A

process used to introduce objectivity and uniformity into test administration, scoring, and interpretation

56
Q

What do you need to do after item analysis?

A

administer the revised test under standardized conditions, then cross-validate

57
Q

When should you revise a test?

A

stimulus materials look dated
dated vocabulary or offensive language
test norms are no longer adequate (group membership has changed)
age-related shifts in abilities over time
to improve the reliability or validity of the test
the theory on which the test was based has improved

58
Q

steps to revising an existing test

A

all the steps to make a new one (conceptualization, construction, tryout, item analysis, revision), plus determining whether there is equivalence between the old and new versions of the test. scores will likely not mean the same thing (use item analysis to evaluate the stability of items between revisions of the same test)

59
Q

cross-validation | what is inevitable?

A

re-validation of a test on a sample of testtakers other than the original group on which the test was found to be valid (aCROSS groups)
validity shrinkage is inevitable
60
Q

co-validation

A

test validation process conducted on two or more tests using the same sample of testtakers (economical: test subjects are identified once, saving personnel costs)

61
Q

co-norming | benefits?

A

co-validation of two tests combined with creating norms or revising existing norms
good for test users when the tests are often used together, because they are normed on the same population (sampling error between norm groups is essentially eliminated)
like co-validation, saves money

62
Q

quality assurance in test revision

A

confirming that the test is given the same way every time

63
Q

anchor protocol

A

a test protocol scored by a highly trained scorer, designed as a model for scoring and a mechanism for resolving scoring discrepancies

64
Q

protocol drift

A

the discrepancy between an anchor protocol and another scorer's protocol
65
Q

differential item functioning

A

(DIF) - when an item functions differently in one group of testtakers as compared to another group known to have the same/similar level of the underlying trait. This means that, for some reason, respondents from different groups have different probabilities of endorsing the item as a function of their group membership (ex: Asian women may avoid endorsing depression items because of cultural shame around feeling depressed)
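
A rough sketch of the DIF idea: match testtakers on total score (a proxy for the underlying trait), then compare endorsement rates between groups within each score band. Real DIF analyses use procedures such as Mantel-Haenszel or IRT-based methods; this illustrative version only eyeballs the per-band gap, and all names and data are made up:

```python
from collections import defaultdict

def dif_by_score_band(records):
    """records: (group, score_band, endorsed 0/1) tuples, where score_band
    groups testtakers with the same/similar total score (trait proxy).
    Returns endorsement rates per (band, group); a large gap within the
    same band suggests possible DIF."""
    tally = defaultdict(lambda: [0, 0])  # (band, group) -> [endorsed, n]
    for group, band, endorsed in records:
        tally[(band, group)][0] += endorsed
        tally[(band, group)][1] += 1
    return {key: e / n for key, (e, n) in tally.items()}

rates = dif_by_score_band([
    ("A", "high", 1), ("A", "high", 1), ("B", "high", 1), ("B", "high", 0),
])
print(rates)  # {('high', 'A'): 1.0, ('high', 'B'): 0.5} -> possible DIF
```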