CHAPTER 8: TEST DEVELOPMENT Flashcards

(73 cards)

1
Q

It is an umbrella term for all that goes into the process of creating a test.

A

Test Development

2
Q

The stage of test development at which the idea for the test is conceived.

A

Test Conceptualization

3
Q

A stage in the test development process that entails writing test items (or re-writing or revising existing items), formatting items, setting scoring rules, and otherwise designing and building a test.

A

Test Construction

4
Q

It is done after a preliminary form of the test has been developed; the test is administered to a representative sample of test takers under conditions that simulate those under which the final version of the test will be administered.

A

Test Tryout

5
Q

A process where statistical procedures are employed to assist in making judgments about which items are good as they are, which items need to be revised, and which items should be discarded. The analysis of the test’s items may include analyses of item reliability, item validity, and item discrimination. Depending on the type of test, item-difficulty level may be analyzed as well.

A

Item Analysis

6
Q

It refers to action taken to modify a test’s content or format for the purpose of improving the test’s effectiveness as a tool of measurement.

A

Test Revision

7
Q

The development of test items differs based on whether a test is norm-referenced or criterion-referenced. In norm-referenced tests, good items are those that effectively differentiate between high and low performers, aiming to rank individuals relative to each other. In contrast, criterion-referenced tests are designed to determine whether an individual has mastered specific knowledge or skills, regardless of how others perform. Item development in criterion-referenced tests focuses on clearly measuring mastery of defined criteria, often used in contexts like licensing exams or educational assessments, where competence is the goal, not comparison.

A

Norm-Referenced vs. Criterion-Referenced Tests: Item Development Issues

8
Q

Also known as a pilot study or pilot research, it refers to the preliminary research surrounding the creation of a prototype of the test.

A

Pilot Work

9
Q

It may be defined as the process of setting rules for assigning numbers in measurement. It is the process by which a measuring device is designed and calibrated and by which numbers (or other indices)—scale values—are assigned to different amounts of the trait, attribute, or characteristic being measured.

A

Scaling

10
Q

He significantly advanced the field of psychological measurement by introducing methodologically rigorous scaling methods. He was one of the first to adapt psychophysical techniques to assess psychological constructs like attitudes and values.

A

L. L. Thurstone

11
Q

It is a procedure for obtaining a measure of item difficulty across samples of test takers who vary in ability.

A

The notion of absolute scaling

12
Q

Categorizes data without any order (e.g., gender, diagnosis type).

A

Nominal scale

13
Q

Ranks data in order, but intervals between ranks are not equal (e.g., class ranking).

A

Ordinal scale

14
Q

Equal intervals between points, but no true zero (e.g., IQ scores).

A

Interval scale

15
Q

Equal intervals with an absolute zero point (e.g., reaction time, weight).

A

Ratio scale

16
Q

Measures performance in relation to age (e.g., developmental milestones).

A

Age-based scale

17
Q

Measures performance in relation to educational grade level (e.g., reading level).

A

Grade-based scale

18
Q

Raw scores are converted into a 1–9 scale, with a mean of 5 and a standard deviation of ~2.

A

Stanine scale

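The stanine conversion can be illustrated with percentile bands. The sketch below uses the standard 4–7–12–17–20–17–12–7–4 percent split of the distribution; mapping by percentile rank is one common way to assign stanines, not the only one.

```python
import bisect

# Stanine assignment by percentile rank (illustrative sketch).
# The standard stanine bands cover 4, 7, 12, 17, 20, 17, 12, 7, and 4
# percent of the distribution, yielding a 1-9 scale with mean ~5, SD ~2.

# Cumulative upper bounds (percentile ranks) of the first eight bands;
# stanine 9 covers everything above the last bound.
_BOUNDS = [4, 11, 23, 40, 60, 77, 89, 96]

def stanine(percentile_rank):
    """Map a percentile rank (0-100) to a stanine of 1-9."""
    return bisect.bisect_right(_BOUNDS, percentile_rank) + 1

print(stanine(50))  # 5 (the middle 20% band)
print(stanine(97))  # 9
```

Boundary cases (a rank falling exactly on a band edge) are a matter of convention; this sketch assigns them to the higher stanine.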
19
Q

Measures a single trait or construct (e.g., depression).

A

Unidimensional scale

20
Q

Measures multiple traits or constructs (e.g., Big Five personality traits).

A

Multidimensional scale

21
Q

Requires respondents to compare items or choose between options (e.g., forced-choice format).

A

Comparative scale

22
Q

Assigns responses to distinct, labeled categories (e.g., yes/no, agree/disagree).

A

Categorical scale

23
Q

It can be defined as a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the test taker.

A

Rating Scale

24
Q

It was developed to be “a practical means of assessing what people believe, the strength of their convictions, as well as individual differences in moral tolerance”, and it contains 30 items.

A

Morally Debatable Behaviors Scale–Revised (MDBS-R)

25
Q

A type of scale wherein the final test score is obtained by summing the ratings across all the items.

A

Summative Scale

26
Q

It is used extensively in psychology, usually to scale attitudes. These scales are relatively easy to construct. Each item presents the test taker with five alternative responses (sometimes seven), usually on an agree–disagree or approve–disapprove continuum.

A

Likert Scale
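The summative logic of Likert scoring can be sketched in Python. The five-point coding and the reverse-keyed item below are illustrative assumptions, not prescriptions from the chapter.

```python
# Summative scoring of a 5-point Likert scale (illustrative sketch).
# Responses are coded 1 (strongly disagree) through 5 (strongly agree);
# reverse-keyed items are flipped before summing.

def score_likert(responses, reverse_keyed=(), points=5):
    """Sum item ratings, flipping any reverse-keyed items."""
    total = 0
    for i, r in enumerate(responses):
        if i in reverse_keyed:
            r = points + 1 - r  # e.g., 1 -> 5 and 5 -> 1 on a 5-point item
        total += r
    return total

# Hypothetical five-item attitude scale; item 2 (0-indexed) is worded negatively:
print(score_likert([4, 5, 2, 3, 4], reverse_keyed={2}))  # 2 -> 4, total = 20
```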
27
Q

It is a scaling technique that is used to produce ordinal data, where test takers are presented with pairs of stimuli (e.g., objects, statements, images) and must choose one based on a specific criterion (e.g., preference, agreement, appeal). This method determines the relative ranking of stimuli by analyzing consistent patterns of choice across multiple pairs.

A

Method of Paired Comparisons
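Ranking stimuli from paired-comparison data can be sketched by counting how often each stimulus is chosen. The three statements and the judgments below are hypothetical.

```python
from collections import Counter

# Ordinal ranking from paired-comparison judgments (illustrative sketch).
# Each judgment records which of two stimuli the respondent chose; stimuli
# are then ranked by how often they were preferred across all pairs.

def rank_from_pairs(judgments):
    """judgments: list of (chosen, not_chosen) tuples."""
    wins = Counter(chosen for chosen, _ in judgments)
    return [stimulus for stimulus, _ in wins.most_common()]

# Hypothetical choices among three statements A, B, and C:
data = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"), ("C", "B")]
print(rank_from_pairs(data))  # A chosen most often, so it ranks first
```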
28
Q

It is yet another scaling method that yields ordinal-level measures. Items on it range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured.

A

Guttman Scale

29
Q

It is an item-analysis procedure and approach to test development that involves a graphic mapping of a test taker’s responses. The objective for the developer of a measure of attitudes is to obtain an arrangement of items wherein endorsement of one item automatically connotes endorsement of less extreme positions.

A

Scalogram Analysis
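The cumulative property of a Guttman scale can be checked mechanically: when items are ordered from weakest to strongest, a scalable response pattern endorses every item up to some point and none after it. A minimal sketch with hypothetical 0/1 endorsements:

```python
# Guttman-pattern check (illustrative sketch): in a perfect Guttman scale,
# endorsing an item implies endorsing every weaker item, so valid response
# patterns look like 1,1,...,1,0,...,0 when items run weak -> strong.

def is_guttman_pattern(responses):
    """responses: 0/1 endorsements ordered from weakest to strongest item."""
    seen_zero = False
    for r in responses:
        if r == 1 and seen_zero:
            return False  # stronger item endorsed after a weaker one was not
        if r == 0:
            seen_zero = True
    return True

print(is_guttman_pattern([1, 1, 1, 0, 0]))  # True  (scalable pattern)
print(is_guttman_pattern([1, 0, 1, 0, 0]))  # False (violates cumulativeness)
```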
30
Q

Developed by Thurstone, it is a scaling technique used to construct interval-level attitude scales.

A

Method of Equal-appearing Intervals
31
Q

Items presented in a ___________ format require test takers to select a response from a set of alternative responses.

A

Selected-response Format

32
Q

Items presented in a ___________ format require test takers to supply or create the correct answer, not merely to select it.

A

Constructed-response Format

33
Q

An item format that has three elements: (1) a stem, (2) a correct alternative or option, and (3) several incorrect alternatives or options, variously referred to as distractors or foils.

A

Multiple-choice Format

34
Q

It pertains to the incorrect alternatives or options in a multiple-choice item.

A

Distractors or Foils

35
Q

In this type of item, the test taker is presented with two columns: premises on the left and responses on the right. The test taker’s task is to determine which response is best associated with which premise.

A

Matching Item
36
Q

It is a multiple-choice item that contains only two possible responses.

A

Binary-choice Item

37
Q

The most familiar binary-choice item, a type of selected-response item, usually takes the form of a sentence that requires the test taker to indicate whether the statement is or is not a fact.

A

True-False Item

38
Q

A type of item that requires the examinee to provide a word or phrase that completes a sentence.

A

Completion Item

39
Q

It requires the test taker to respond briefly with a few words, a phrase, or one or two sentences.

A

Short Answer Item

40
Q

It requires the test taker to write a more extended, organized response that explains, discusses, or analyzes a topic in depth.

A

Essay Item

41
Q

It is a relatively large and easily accessible collection of test questions.

A

Item Bank
42
Q

It refers to an interactive, computer-administered test-taking process wherein the items presented to the test taker are based in part on the test taker’s performance on previous items.

A

Computerized Adaptive Testing (CAT)
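The adaptive logic can be sketched with a deliberately simplified branching rule: step up in difficulty after a correct response, step down after an incorrect one. The item pool and rule below are illustrative assumptions; operational CATs select items from IRT-based ability estimates rather than a fixed ladder.

```python
# Minimal sketch of adaptive item presentation (not any specific CAT
# algorithm): after a correct response the next item is harder; after an
# incorrect response it is easier.

def run_adaptive(item_pool, answers, start_level=2):
    """item_pool: dict mapping difficulty level -> item id.
    answers: iterable of bools (True = answered correctly)."""
    level = start_level
    administered = []
    for correct in answers:
        administered.append(item_pool[level])
        if correct:
            level = min(level + 1, max(item_pool))  # harder, capped at top
        else:
            level = max(level - 1, min(item_pool))  # easier, capped at bottom
    return administered

pool = {1: "easy", 2: "medium", 3: "hard", 4: "very hard"}
# Correct, correct, incorrect, correct:
print(run_adaptive(pool, [True, True, False, True]))
# ['medium', 'hard', 'very hard', 'hard']
```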
43
Q

It refers to the diminished utility of an assessment tool for distinguishing test takers at the low end of the ability, trait, or other attribute being measured.

A

Floor Effect

44
Q

It refers to the diminished utility of an assessment tool for distinguishing test takers at the high end of the ability, trait, or other attribute being measured.

A

Ceiling Effect

45
Q

The ability of the computer to tailor the content and order of presentation of test items on the basis of responses to previous items.

A

Item Branching

46
Q

The test taker’s responses earn credit toward placement in a particular class or category with other test takers whose pattern of responses is presumably similar in some way. This approach is used by some diagnostic systems wherein individuals must exhibit a certain number of symptoms to qualify for a specific diagnosis.

A

Class Scoring (or Category Scoring)
47
Q

An approach to scoring whose objective is to compare a test taker’s score on one scale within a test to another scale within that same test.

A

Ipsative Scoring

48
Q

It is a neurological disorder characterized by frequent and involuntary outbursts of laughing or crying that may or may not be appropriate to the situation.

A

Pseudobulbar Affect (PBA)

49
Q

It refers to the different types of statistical scrutiny that test data can potentially undergo after a test tryout.

A

Item Analysis
50
Q

A statistic that indicates how difficult a test item is. It represents the proportion of test takers who answered the item correctly.

A

Item-difficulty Index (p-value)
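The p-value arithmetic is simple enough to show directly. The ten hypothetical examinees below are an assumption for illustration, as is the often-cited rule of thumb that an optimal average item difficulty for selected-response items lies midway between chance success and 1.0.

```python
# Item-difficulty index: the proportion of test takers answering the item
# correctly. Note that a HIGHER p means an EASIER item.

def item_difficulty(responses):
    """responses: list of booleans (True = answered correctly)."""
    return sum(responses) / len(responses)

# 7 of 10 hypothetical examinees got the item right:
p = item_difficulty([True] * 7 + [False] * 3)
print(p)  # 0.7

# Rule of thumb for selected-response items: optimal average difficulty is
# often placed midway between the chance success rate and 1.0 -- e.g., for
# a four-option multiple-choice item:
optimal = (1 / 4 + 1.0) / 2
print(optimal)  # 0.625
```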
51
Q

It is used in non-cognitive assessments (such as personality or attitude scales) to indicate the proportion of respondents who endorsed or agreed with a particular statement or item.

A

Item-endorsement Index

52
Q

It provides an indication of the internal consistency of a test; the higher this index, the greater the test’s internal consistency.

A

Item-reliability Index
53
Q

A statistical tool useful in determining whether items on a test appear to be measuring the same thing(s).

A

Factor Analysis

54
Q

It refers to the degree to which items on a test that are supposed to measure the same construct yield similar or consistent responses. It is a type of internal consistency reliability, and it tells us whether the items are homogeneous—that is, whether they all measure the same underlying trait or concept.

A

Inter-item Consistency

55
Q

It is a statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure.

A

Item-validity Index
56
Q

A measure of item discrimination, symbolized by a lowercase italic d. It is the difference between the proportion of high scorers and the proportion of low scorers answering an item correctly; the higher the value of d, the more effectively the item discriminates between high and low scorers.

A

Item-discrimination Index
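With upper and lower scoring groups of equal size, d reduces to a one-line computation. The group counts below are hypothetical.

```python
# Item-discrimination index d: the proportion of the upper-scoring group
# minus the proportion of the lower-scoring group answering the item
# correctly. Ranges from -1 (only low scorers get it right) to +1.

def discrimination_index(upper_correct, lower_correct, group_size):
    """Counts of correct answers in equal-sized upper and lower groups."""
    return (upper_correct - lower_correct) / group_size

# Hypothetically, 27 of 32 high scorers vs. 13 of 32 low scorers answered correctly:
d = discrimination_index(27, 13, 32)
print(d)  # 0.4375
```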
57
Q

It is the process of evaluating how well each response option (the correct choice and the incorrect choices, or “distractors”) on a multiple-choice test functions.

A

Analysis of Item Alternatives

58
Q

It is a graphic representation of item difficulty and discrimination.

A

Item-Characteristic Curves
59
Q

A problem in test development and scoring that has eluded any universally acceptable solution.

A

Guessing

60
Q

It refers to the degree, if any, to which a test item is biased.

A

Item Fairness
61
Q

These are techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures.

A

Qualitative Methods

62
Q

It is a general term for various nonstatistical procedures designed to explore how individual test items work.

A

Qualitative Item Analysis

63
Q

A qualitative research tool designed to shed light on the test taker’s thought processes during the administration of a test. On a one-to-one basis with an examiner, examinees are asked to take a test, thinking aloud as they respond to each item.

A

“Think Aloud” Test Administration

64
Q

It is a study of test items, typically conducted during the test development process, in which items are examined for fairness to all prospective test takers and for the presence of offensive language, stereotypes, or situations.

A

Sensitivity Review

65
Q

It refers to the revalidation of a test on a sample of test takers other than those on whom test performance was originally found to be a valid predictor of some criterion.

A

Cross-validation
66
Q

The decrease in item validities that inevitably occurs after cross-validation of findings.

A

Validity Shrinkage

67
Q

It may be defined as a test validation process conducted on two or more tests using the same sample of test takers.

A

Co-validation

68
Q

A test validation process used in conjunction with the creation of norms or the revision of existing norms.

A

Co-norming

69
Q

It is a test protocol scored by a highly authoritative scorer that is designed as a model for scoring and a mechanism for resolving scoring discrepancies.

A

Anchor Protocol
70
Q

A discrepancy between the scoring in an anchor protocol and the scoring of another protocol.

A

Scoring Drift

71
Q

A phenomenon wherein an item functions differently in one group of test takers as compared to another group of test takers known to have the same (or similar) level of the underlying trait.

A

Differential Item Functioning (DIF)

72
Q

In this process, test developers scrutinize group-by-group item response curves, looking for what are termed DIF items.

A

DIF Analysis

73
Q

Items that respondents from different groups at the same level of the underlying trait have different probabilities of endorsing as a function of their group membership.

A

DIF Items