CHAPTER 8: TEST DEVELOPMENT Flashcards

(73 cards)

1
Q

It is an umbrella term for all that goes into the process of creating a test.

A

Test Development

2
Q

The stage of test development at which the idea for the test is conceived.

A

Test Conceptualization

3
Q

A stage in the test development process that entails writing test items (or re-writing or revising existing items), formatting items, setting scoring rules, and otherwise designing and building a test.

A

Test Construction

4
Q

It is done after a preliminary form of the test has been developed; the test is administered to a representative sample of test takers under conditions that simulate those under which the final version of the test will be administered.

A

Test Tryout

5
Q

A process where statistical procedures are employed to assist in making judgments about which items are good as they are, which items need to be revised, and which items should be discarded. The analysis of the test’s items may include analyses of item reliability, item validity, and item discrimination. Depending on the type of test, item-difficulty level may be analyzed as well.

A

Item Analysis

6
Q

It refers to action taken to modify a test’s content or format for the purpose of improving the test’s effectiveness as a tool of measurement.

A

Test Revision

7
Q

The development of test items differs based on whether a test is norm-referenced or criterion-referenced. In norm-referenced tests, good items are those that effectively differentiate between high and low performers, aiming to rank individuals relative to each other. In contrast, criterion-referenced tests are designed to determine whether an individual has mastered specific knowledge or skills, regardless of how others perform. Item development in criterion-referenced tests focuses on clearly measuring mastery of defined criteria, often used in contexts like licensing exams or educational assessments, where competence is the goal, not comparison.

A

Norm-Referenced vs. Criterion-Referenced Tests: Item Development Issues

8
Q

Also known as a pilot study or pilot research, it refers to the preliminary research surrounding the creation of a prototype of the test.

A

Pilot Work

9
Q

It may be defined as the process of setting rules for assigning numbers in measurement. It is the process by which a measuring device is designed and calibrated and by which numbers (or other indices)—scale values—are assigned to different amounts of the trait, attribute, or characteristic being measured.

A

Scaling

10
Q

He significantly advanced the field of psychological measurement by introducing methodologically rigorous scaling methods. He was one of the first to adapt psychophysical techniques to assess psychological constructs like attitudes and values.

A

L. L. Thurstone

11
Q

It is a procedure for obtaining a measure of item difficulty across samples of test takers who vary in ability.

A

The notion of absolute scaling

12
Q

Categorizes data without any order (e.g., gender, diagnosis type).

A

Nominal scale

13
Q

Ranks data in order, but intervals between ranks are not equal (e.g., class ranking).

A

Ordinal scale

14
Q

Equal intervals between points, but no true zero (e.g., IQ scores).

A

Interval scale

15
Q

Equal intervals with an absolute zero point (e.g., reaction time, weight).

A

Ratio scale

16
Q

Measures performance in relation to age (e.g., developmental milestones).

A

Age-based scale

17
Q

Measures performance in relation to educational grade level (e.g., reading level).

A

Grade-based scale

18
Q

Raw scores are converted into a 1–9 scale, with a mean of 5 and a standard deviation of ~2.

A

Stanine scale

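The stanine conversion can be illustrated with percentile bands. The sketch below uses the standard 4–7–12–17–20–17–12–7–4 percent split of the distribution; mapping by percentile rank is one common way to assign stanines, not the only one.

```python
import bisect

# Stanine assignment by percentile rank (illustrative sketch).
# The standard stanine bands cover 4, 7, 12, 17, 20, 17, 12, 7, and 4
# percent of the distribution, yielding a 1-9 scale with mean ~5, SD ~2.

# Cumulative upper bounds (percentile ranks) of the first eight bands;
# stanine 9 covers everything above the last bound.
_BOUNDS = [4, 11, 23, 40, 60, 77, 89, 96]

def stanine(percentile_rank):
    """Map a percentile rank (0-100) to a stanine of 1-9."""
    return bisect.bisect_right(_BOUNDS, percentile_rank) + 1

print(stanine(50))  # 5 (the middle 20% band)
print(stanine(97))  # 9
```

Boundary cases (a rank falling exactly on a band edge) are a matter of convention; this sketch assigns them to the higher stanine.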
19
Q

Measures a single trait or construct (e.g., depression).

A

Unidimensional scale

20
Q

Measures multiple traits or constructs (e.g., Big Five personality traits).

A

Multidimensional scale

21
Q

Requires respondents to compare items or choose between options (e.g., forced-choice format).

A

Comparative scale

22
Q

Assigns responses to distinct, labeled categories (e.g., yes/no, agree/disagree).

A

Categorical scale

23
Q

It can be defined as a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the test taker.

A

Rating Scale

24
Q

It was developed to be “a practical means of assessing what people believe, the strength of their convictions, as well as individual differences in moral tolerance”, and it contains 30 items.

A

Morally Debatable Behaviors Scale–Revised (MDBS-R)

25
Q

A type of scale wherein the final test score is obtained by summing the ratings across all the items.

A

Summative Scale

26
Q

It is used extensively in psychology, usually to scale attitudes. These scales are relatively easy to construct. Each item presents the test taker with five alternative responses (sometimes seven), usually on an agree–disagree or approve–disapprove continuum.

A

Likert Scale
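The summative logic of Likert scoring can be sketched in Python. The five-point coding and the reverse-keyed item below are illustrative assumptions, not prescriptions from the chapter.

```python
# Summative scoring of a 5-point Likert scale (illustrative sketch).
# Responses are coded 1 (strongly disagree) through 5 (strongly agree);
# reverse-keyed items are flipped before summing.

def score_likert(responses, reverse_keyed=(), points=5):
    """Sum item ratings, flipping any reverse-keyed items."""
    total = 0
    for i, r in enumerate(responses):
        if i in reverse_keyed:
            r = points + 1 - r  # e.g., 1 -> 5 and 5 -> 1 on a 5-point item
        total += r
    return total

# Hypothetical five-item attitude scale; item 2 (0-indexed) is worded negatively:
print(score_likert([4, 5, 2, 3, 4], reverse_keyed={2}))  # 2 -> 4, total = 20
```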
27
Q

It is a scaling technique that is used to produce ordinal data, where test takers are presented with pairs of stimuli (e.g., objects, statements, images) and must choose one based on a specific criterion (e.g., preference, agreement, appeal). This method determines the relative ranking of stimuli by analyzing consistent patterns of choice across multiple pairs.

A

Method of Paired Comparisons
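Ranking stimuli from paired-comparison data can be sketched by counting how often each stimulus is chosen. The three statements and the judgments below are hypothetical.

```python
from collections import Counter

# Ordinal ranking from paired-comparison judgments (illustrative sketch).
# Each judgment records which of two stimuli the respondent chose; stimuli
# are then ranked by how often they were preferred across all pairs.

def rank_from_pairs(judgments):
    """judgments: list of (chosen, not_chosen) tuples."""
    wins = Counter(chosen for chosen, _ in judgments)
    return [stimulus for stimulus, _ in wins.most_common()]

# Hypothetical choices among three statements A, B, and C:
data = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"), ("C", "B")]
print(rank_from_pairs(data))  # A chosen most often, so it ranks first
```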
28
Q

It is yet another scaling method that yields ordinal-level measures. Items on it range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured.

A

Guttman Scale

29
Q

It is an item-analysis procedure and approach to test development that involves a graphic mapping of a test taker’s responses. The objective for the developer of a measure of attitudes is to obtain an arrangement of items wherein endorsement of one item automatically connotes endorsement of less extreme positions.

A

Scalogram Analysis
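The cumulative property of a Guttman scale can be checked mechanically: when items are ordered from weakest to strongest, a scalable response pattern endorses every item up to some point and none after it. A minimal sketch with hypothetical 0/1 endorsements:

```python
# Guttman-pattern check (illustrative sketch): in a perfect Guttman scale,
# endorsing an item implies endorsing every weaker item, so valid response
# patterns look like 1,1,...,1,0,...,0 when items run weak -> strong.

def is_guttman_pattern(responses):
    """responses: 0/1 endorsements ordered from weakest to strongest item."""
    seen_zero = False
    for r in responses:
        if r == 1 and seen_zero:
            return False  # stronger item endorsed after a weaker one was not
        if r == 0:
            seen_zero = True
    return True

print(is_guttman_pattern([1, 1, 1, 0, 0]))  # True  (scalable pattern)
print(is_guttman_pattern([1, 0, 1, 0, 0]))  # False (violates cumulativeness)
```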
30
Q

Developed by Thurstone, it is a scaling technique used to construct interval-level attitude scales.

A

Method of Equal-appearing Intervals
31
Q

Items presented in a ___________ format require test takers to select a response from a set of alternative responses.

A

Selected-response Format

32
Q

Items presented in a ___________ format require test takers to supply or create the correct answer, not merely to select it.

A

Constructed-response Format

33
Q

An item format that has three elements: (1) a stem, (2) a correct alternative or option, and (3) several incorrect alternatives or options, variously referred to as distractors or foils.

A

Multiple-choice Format

34
Q

It pertains to the incorrect alternatives or options in a multiple-choice item.

A

Distractors or Foils

35
Q

In this type of item, the test taker is presented with two columns: premises on the left and responses on the right. The test taker’s task is to determine which response is best associated with which premise.

A

Matching Item
36
Q

It is a multiple-choice item that contains only two possible responses.

A

Binary-choice Item

37
Q

The most familiar binary-choice item, a type of selected-response item, usually takes the form of a sentence that requires the test taker to indicate whether the statement is or is not a fact.

A

True-False Item

38
Q

A type of item that requires the examinee to provide a word or phrase that completes a sentence.

A

Completion Item

39
Q

It requires the test taker to respond briefly with a few words, a phrase, or one or two sentences.

A

Short Answer Item

40
Q

It requires the test taker to write a more extended, organized response that explains, discusses, or analyzes a topic in depth.

A

Essay Item

41
Q

It is a relatively large and easily accessible collection of test questions.

A

Item Bank
42
Q

It refers to an interactive, computer-administered test-taking process wherein the items presented to the test taker are based in part on the test taker’s performance on previous items.

A

Computerized Adaptive Testing (CAT)
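The adaptive logic can be sketched with a deliberately simplified branching rule: step up in difficulty after a correct response, step down after an incorrect one. The item pool and rule below are illustrative assumptions; operational CATs select items from IRT-based ability estimates rather than a fixed ladder.

```python
# Minimal sketch of adaptive item presentation (not any specific CAT
# algorithm): after a correct response the next item is harder; after an
# incorrect response it is easier.

def run_adaptive(item_pool, answers, start_level=2):
    """item_pool: dict mapping difficulty level -> item id.
    answers: iterable of bools (True = answered correctly)."""
    level = start_level
    administered = []
    for correct in answers:
        administered.append(item_pool[level])
        if correct:
            level = min(level + 1, max(item_pool))  # harder, capped at top
        else:
            level = max(level - 1, min(item_pool))  # easier, capped at bottom
    return administered

pool = {1: "easy", 2: "medium", 3: "hard", 4: "very hard"}
# Correct, correct, incorrect, correct:
print(run_adaptive(pool, [True, True, False, True]))
# ['medium', 'hard', 'very hard', 'hard']
```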
43
Q

It refers to the diminished utility of an assessment tool for distinguishing test takers at the low end of the ability, trait, or other attribute being measured.

A

Floor Effect

44
Q

It refers to the diminished utility of an assessment tool for distinguishing test takers at the high end of the ability, trait, or other attribute being measured.

A

Ceiling Effect

45
Q

The ability of the computer to tailor the content and order of presentation of test items on the basis of responses to previous items.

A

Item Branching

46
Q

The test taker’s responses earn credit toward placement in a particular class or category with other test takers whose pattern of responses is presumably similar in some way. This approach is used by some diagnostic systems wherein individuals must exhibit a certain number of symptoms to qualify for a specific diagnosis.

A

Class Scoring (or Category Scoring)
47
Q

An approach to scoring whose objective is to compare a test taker’s score on one scale within a test to another scale within that same test.

A

Ipsative Scoring

48
Q

It is a neurological disorder characterized by frequent and involuntary outbursts of laughing or crying that may or may not be appropriate to the situation.

A

Pseudobulbar Affect (PBA)

49
Q

It refers to the different types of statistical scrutiny that test data can potentially undergo after a test tryout.

A

Item Analysis
50
Q

A statistic that indicates how difficult a test item is. It represents the proportion of test takers who answered the item correctly.

A

Item-difficulty Index (p-value)
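The p-value arithmetic is simple enough to show directly. The ten hypothetical examinees below are an assumption for illustration, as is the often-cited rule of thumb that an optimal average item difficulty for selected-response items lies midway between chance success and 1.0.

```python
# Item-difficulty index: the proportion of test takers answering the item
# correctly. Note that a HIGHER p means an EASIER item.

def item_difficulty(responses):
    """responses: list of booleans (True = answered correctly)."""
    return sum(responses) / len(responses)

# 7 of 10 hypothetical examinees got the item right:
p = item_difficulty([True] * 7 + [False] * 3)
print(p)  # 0.7

# Rule of thumb for selected-response items: optimal average difficulty is
# often placed midway between the chance success rate and 1.0 -- e.g., for
# a four-option multiple-choice item:
optimal = (1 / 4 + 1.0) / 2
print(optimal)  # 0.625
```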
51
Q

It is used in non-cognitive assessments (such as personality or attitude scales) to indicate the proportion of respondents who endorsed or agreed with a particular statement or item.

A

Item-endorsement Index

52
Q

It provides an indication of the internal consistency of a test; the higher this index, the greater the test’s internal consistency.

A

Item-reliability Index
53
Q

A statistical tool useful in determining whether items on a test appear to be measuring the same thing(s).

A

Factor Analysis

54
Q

It refers to the degree to which items on a test that are supposed to measure the same construct yield similar or consistent responses. It is a type of internal consistency reliability, and it tells us whether the items are homogeneous—that is, whether they all measure the same underlying trait or concept.

A

Inter-item Consistency

55
Q

It is a statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure.

A

Item-validity Index
56
Q

A measure of item discrimination, symbolized by a lowercase italic d. It is the difference between the proportion of high scorers and the proportion of low scorers answering an item correctly; the higher the value of d, the more effectively the item discriminates between high and low scorers.

A

Item-discrimination Index
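With upper and lower scoring groups of equal size, d reduces to a one-line computation. The group counts below are hypothetical.

```python
# Item-discrimination index d: the proportion of the upper-scoring group
# minus the proportion of the lower-scoring group answering the item
# correctly. Ranges from -1 (only low scorers get it right) to +1.

def discrimination_index(upper_correct, lower_correct, group_size):
    """Counts of correct answers in equal-sized upper and lower groups."""
    return (upper_correct - lower_correct) / group_size

# Hypothetically, 27 of 32 high scorers vs. 13 of 32 low scorers answered correctly:
d = discrimination_index(27, 13, 32)
print(d)  # 0.4375
```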
57
Q

It is the process of evaluating how well each response option (the correct choice and the incorrect choices, or “distractors”) on a multiple-choice test functions.

A

Analysis of Item Alternatives

58
Q

It is a graphic representation of item difficulty and discrimination.

A

Item-Characteristic Curves
59
Q

A problem in test development and scoring that has eluded any universally acceptable solution.

A

Guessing

60
Q

It refers to the degree, if any, to which a test item is biased.

A

Item Fairness
61
Q

These are techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures.

A

Qualitative Methods

62
Q

It is a general term for various nonstatistical procedures designed to explore how individual test items work.

A

Qualitative Item Analysis

63
Q

A qualitative research tool designed to shed light on the test taker’s thought processes during the administration of a test. On a one-to-one basis with an examiner, examinees are asked to take a test, thinking aloud as they respond to each item.

A

“Think Aloud” Test Administration

64
Q

It is a study of test items, typically conducted during the test development process, in which items are examined for fairness to all prospective test takers and for the presence of offensive language, stereotypes, or situations.

A

Sensitivity Review

65
Q

It refers to the revalidation of a test on a sample of test takers other than those on whom test performance was originally found to be a valid predictor of some criterion.

A

Cross-validation
66
Q

The decrease in item validities that inevitably occurs after cross-validation of findings.

A

Validity Shrinkage

67
Q

It may be defined as a test validation process conducted on two or more tests using the same sample of test takers.

A

Co-validation

68
Q

A test validation process used in conjunction with the creation of norms or the revision of existing norms.

A

Co-norming

69
Q

It is a test protocol scored by a highly authoritative scorer that is designed as a model for scoring and a mechanism for resolving scoring discrepancies.

A

Anchor Protocol
70
Q

A discrepancy between the scoring in an anchor protocol and the scoring of another protocol.

A

Scoring Drift

71
Q

A phenomenon wherein an item functions differently in one group of test takers as compared to another group of test takers known to have the same (or similar) level of the underlying trait.

A

Differential Item Functioning (DIF)

72
Q

In this process, test developers scrutinize group-by-group item response curves, looking for what are termed DIF items.

A

DIF Analysis

73
Q

Items that respondents from different groups at the same level of the underlying trait have different probabilities of endorsing as a function of their group membership.

A

DIF Items