Test Development Flashcards by Nino Benedict Lumanag

what is an umbrella term for all that goes into the process of creating a test?

test development

what are the 5 stages of developing a test?

Test Conceptualization
Test Construction
Test Tryout
Item Analysis
Test Revision

this stage of test development involves statistical procedures employed to assist in making judgments about which items are good as they are, which items need to be revised, and which items should be discarded

item analysis

this stage of test development entails writing test items (or rewriting or revising existing items), as well as formatting items, setting scoring rules, and otherwise designing and building a test

Test Construction

in this type of test, items should measure whether test-takers meet specific criteria, regardless of their position relative to others.

Success is defined by meeting set criteria, not by ranking.

criterion-referenced tests

this stage of test development refers to action taken to modify a test’s content or format for the purpose of improving the test’s effectiveness as a tool of measurement

test revision

in this type of test, items are deemed “good” if high scorers answer correctly and low scorers answer incorrectly.

norm-referenced tests

this term refers to the preliminary research and testing around the creation of a test prototype

pilot work

what is the purpose of pilot work/studies?

pilot studies help evaluate the potential test items to determine their suitability for the final version of the test

this term refers to the process of setting rules for assigning numbers or indices to measure different amounts of a trait, attribute, or characteristic.

scaling

what are the different types of scaling methods?

Rating Scale
Likert Scale
Method of Paired Comparisons
Comparative Scaling
Categorical Scale
Guttman Scale
Thurstone Scale

this type of scale is a summative scale where test-takers rate the strength of a trait, attitude, or emotion

rating scale

this type of scale is commonly used in psychology for attitudes, providing options on a continuum

by asking respondents to rate their agreement with a statement

likert scale
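
A minimal sketch of how a summative Likert score might be computed, assuming a 5-point agreement scale. The labels, the item names (q1 to q3), and the reverse-keying option are illustrative assumptions, not part of the deck.

```python
# Hypothetical 5-point agreement scale (labels are an assumption).
LIKERT_POINTS = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}


def likert_total(responses, reverse_keyed=()):
    """Sum the numeric values of all responses.

    Items listed in reverse_keyed are flipped (1 <-> 5, 2 <-> 4) so that a
    higher total always points in the same direction.
    """
    total = 0
    for item, label in responses.items():
        value = LIKERT_POINTS[label]
        if item in reverse_keyed:
            value = 6 - value
        total += value
    return total


# Hypothetical respondent; q3 is assumed to be a negatively worded item.
responses = {"q1": "agree", "q2": "strongly agree", "q3": "disagree"}
score = likert_total(responses, reverse_keyed={"q3"})
```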

what does a unidimensional rating mean?

the scale only measures one underlying dimension

what does a multidimensional rating mean?

the scale measures multiple dimensions

this type of scale produces ordinal data by comparing stimuli

presents two items at a time, asking respondents to choose one based on a specific criterion

method of paired comparisons/paired comparison scale

this type of scale involves judgement of stimuli in relation to others on the scale

rating an item relative to a benchmark or another item on the scale

comparative scaling

this term refers to the collection of potential test items that will be refined for the final test

Item Pool

this type of item-format requires choosing an answer from given options
(e.g., multiple-choice, true-false)

selected-response format

what are the two types of item formats?

Selected-Response Format
Constructed-Response Format

this type of item includes a “stem,” a correct option, and distractors.

multiple-choice items

this type of item offers only two possible responses, such as true/false

binary-choice item

this type of item involves matching premises with correct responses

matching items

this term refers to interactive testing where item selection depends on previous answers, reducing floor and ceiling effects

computerized adaptive testing (CAT)

this type of item format requires creating or supplying an answer

constructed-response format

this term refers to a large, accessible database of questions for computerized tests

item bank

this effect limits distinguishing low-ability test-takers

floor effect

this term refers to tailoring item content and order based on previous responses.

item branching
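
A rough sketch of one possible branching rule, assuming the items are stored in order from easiest to hardest and that the test moves one step harder after a correct answer and one step easier after an incorrect one. The function name and the one-step rule are assumptions for illustration; real adaptive tests typically use more elaborate selection rules.

```python
def next_item_index(current, was_correct, n_items):
    """Pick the next item position in a list ordered easiest to hardest.

    Moves one step harder after a correct answer, one step easier after an
    incorrect answer, and stays within the bounds of the item list.
    """
    step = 1 if was_correct else -1
    return max(0, min(n_items - 1, current + step))


# Hypothetical session: start in the middle of a 5-item list.
position = 2
path = [position]
for was_correct in [True, True, False]:
    position = next_item_index(position, was_correct, n_items=5)
    path.append(position)
```

Because the test-taker is steered toward items near their own level, fewer items are far too hard or far too easy for them.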

this effect limits distinguishing high-ability test-takers.

ceiling effect

in this scoring model, the higher the score on the test, the higher the test-taker's standing on the ability or trait being measured.

cumulative model

what does discriminative ability refer to?

it refers to how effectively a test item differentiates between high and low scorers; on a good item, high scorers are likely to answer correctly or as expected

this tool of item analysis measures the proportion of test-takers who answered an item correctly, denoted as p (e.g., p1 for item 1). A higher p indicates an easier item

item difficulty index
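
As a small illustration, the index can be computed as the mean of 0/1 item scores. The response data below are made up, and the acceptability check simply reuses the 0.20-0.80 range given elsewhere in this deck.

```python
def item_difficulty(item_scores):
    """Proportion of test-takers who answered the item correctly (p).

    item_scores holds 1 for a correct answer and 0 for an incorrect one.
    """
    return sum(item_scores) / len(item_scores)


# Hypothetical responses of ten test-takers to item 1.
item1 = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
p1 = item_difficulty(item1)

# Check against the 0.20-0.80 range quoted in this deck.
acceptable = 0.20 <= p1 <= 0.80
```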

this type of scoring compares a test-taker's score on one scale with that same test-taker's scores on other scales of the same test, focusing on comparisons within the individual rather than across individuals

ipsative scoring

this type of scoring involves responses that assign test-takers to specific classes or categories

class scoring (category scoring)

what is a needed characteristic of a good item?

Discriminative Ability

what are the tools for item analysis?

1. Item Difficulty Index
2. Item Reliability Index
3. Item Validity Index
4. Item Discrimination Index

this tool of item analysis indicates the internal consistency of a test. It is calculated as the product of the item-score standard deviation (s) and the correlation (r) between the item score and the total test score.

item reliability index

this tool of item analysis reflects how well an item measures what it is intended to measure, determined by the item-score standard deviation and the correlation between the item score and the criterion score

item validity index
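
The item reliability index and the item validity index share the same form, s × r, and differ only in what the item score is correlated with (total test score versus criterion score). A minimal sketch under that reading, using made-up data and the population standard deviation (an assumption; the deck does not say which form of s to use):

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def item_index(item_scores, other_scores):
    """Item-score standard deviation times the item/other-score correlation."""
    n = len(item_scores)
    mean = sum(item_scores) / n
    # Population standard deviation of the item scores (assumption).
    s = math.sqrt(sum((v - mean) ** 2 for v in item_scores) / n)
    return s * pearson_r(item_scores, other_scores)


# Hypothetical data for five test-takers.
item = [1, 0, 1, 1, 0]                        # 1 = correct, 0 = incorrect
total_scores = [9, 4, 8, 7, 3]                # total test scores
criterion_scores = [3.5, 2.0, 3.0, 3.8, 2.4]  # e.g. an external rating

reliability_index = item_index(item, total_scores)
validity_index = item_index(item, criterion_scores)
```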

this tool of item analysis measures how well an item differentiates between high and low scorers on the test, denoted by d.

item discrimination index
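
One common way to compute d is to compare an upper-scoring group (UG) with a lower-scoring group (LG): d = (U - L) / n, where U and L are the numbers of correct answers in each group and n is the group size. The sketch below assumes the groups are the top and bottom 27% by total score, a widely used convention that this deck does not itself specify; the data are made up.

```python
def discrimination_index(item_scores, total_scores, fraction=0.27):
    """d = (U - L) / n for upper and lower groups of equal size n.

    The groups are the top and bottom `fraction` of test-takers by total
    score (27% is a conventional choice, assumed here).
    """
    n = max(1, round(len(total_scores) * fraction))
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    lower_group = order[:n]
    upper_group = order[-n:]
    upper_correct = sum(item_scores[i] for i in upper_group)
    lower_correct = sum(item_scores[i] for i in lower_group)
    return (upper_correct - lower_correct) / n


# Hypothetical data for ten test-takers.
totals = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
good_item = [0, 0, 1, 0, 1, 0, 1, 1, 1, 1]  # mostly high scorers get it right
odd_item = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]   # mostly low scorers get it right

d_good = discrimination_index(good_item, totals)
d_odd = discrimination_index(odd_item, totals)  # negative: LG outperforms UG
```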

this term refers to whether a test item is biased against certain groups when controlling for group ability

item fairness

this term refers to the graphic representations that illustrate item difficulty and discrimination. A steeper slope indicates greater discrimination ability of the item

item-characteristic curves
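
To make the slope idea concrete, here is a sketch that assumes a two-parameter logistic curve, with a controlling the slope and b the difficulty. The logistic form and the parameter names are assumptions brought in for illustration; the deck only describes the curves graphically.

```python
import math


def icc(theta, a, b):
    """Probability of a correct answer at ability theta.

    a is the slope (discrimination) and b is the difficulty, under an
    assumed two-parameter logistic model.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# How much the probability rises between abilities just below and just
# above the item's difficulty, for a flat and a steep curve.
flat = icc(0.5, a=0.5, b=0.0) - icc(-0.5, a=0.5, b=0.0)
steep = icc(0.5, a=2.0, b=0.0) - icc(-0.5, a=2.0, b=0.0)
```

The steeper curve separates the two nearby ability levels far more sharply, which is what greater discrimination means here.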

what is the threshold for item discrimination?

greater than 0.19

what is the threshold for item difficulty?

0.20-0.80

what is the threshold for item reliability?

greater than 0.75

what are the two ways of defining a construct in test development?

1. a priori assumption
2. pilot work (qualitative preliminary work)

this stage of test development refers to the process of defining the construct to be measured and setting its parameters. This process also includes preliminary decisions about who, what, when, where, and why aspects of the test

test conceptualization

this scaling method involves rating an item relative to a benchmark or another item on the scale. Example: Evaluating a patient's anxiety level by comparing it to their typical anxiety level before treatment, using ratings such as "much lower," "slightly lower," "the same," "slightly higher," or "much higher."

comparative scale

this scaling method presents two items at a time, asking respondents to choose one based on a specific criterion. Example: Assessing preferences for different stress-reduction techniques by presenting pairs such as "meditation vs. exercise" and asking individuals to choose which they find more effective.

paired comparison scale
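
A small sketch of how paired choices might be turned into an ordinal ranking by counting how often each option is chosen. The third technique ("journaling") and the recorded choices are made up for illustration.

```python
from collections import Counter


def rank_by_wins(options, choices):
    """Order options from most often chosen to least often chosen.

    choices is a list of (option_a, option_b, chosen) tuples, one per pair
    shown to the respondent. Ties keep the order given in `options`.
    """
    wins = Counter({option: 0 for option in options})
    for option_a, option_b, chosen in choices:
        if chosen not in (option_a, option_b):
            raise ValueError("chosen option must be one of the pair")
        wins[chosen] += 1
    return sorted(options, key=lambda option: -wins[option])


# Hypothetical respondent comparing three stress-reduction techniques.
options = ["meditation", "exercise", "journaling"]
choices = [
    ("meditation", "exercise", "exercise"),
    ("meditation", "journaling", "meditation"),
    ("exercise", "journaling", "exercise"),
]
ranking = rank_by_wins(options, choices)
```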

this scaling method uses equal-appearing intervals to measure attitudes. Experts assign values to statements, and respondents select the statements they agree with. The average value of these statements becomes the respondent's score. Example: Measuring attitudes toward psychotherapy where experts rate a set of statements. Respondents then select statements that align with their views, and their attitude score is the average scale value of the statements they endorse.

Thurstone Scale
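
A minimal sketch of the scoring step described above: the respondent's score is the average scale value of the statements they endorse. The statement IDs and scale values are hypothetical.

```python
def thurstone_score(endorsed, scale_values):
    """Average the expert-assigned scale values of the endorsed statements."""
    if not endorsed:
        raise ValueError("respondent endorsed no statements")
    return sum(scale_values[s] for s in endorsed) / len(endorsed)


# Hypothetical expert-assigned scale values for four statements.
scale_values = {"s1": 2.1, "s2": 5.4, "s3": 8.7, "s4": 9.9}

# A respondent who endorses statements s2 and s3.
attitude_score = thurstone_score(["s2", "s3"], scale_values)
```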

this scaling method measures the extent to which individuals possess a particular attitude or characteristic. The items are arranged in a cumulative order, such that agreeing with a higher-level item implies agreement with all lower-level items. Example: Measuring social tolerance with items ranging from "willing to live in the same country" to "willing to marry" someone from a different racial group.

Guttman Scale
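
The cumulative property can be checked mechanically. In this sketch a response pattern is a list of 1s (agree) and 0s (disagree) ordered from the lowest-level item to the highest-level item; both function names are illustrative, not standard terminology.

```python
def is_guttman_consistent(pattern):
    """True if every agreement comes before every disagreement.

    pattern is ordered from the lowest-level (easiest to endorse) item to
    the highest-level item, with 1 = agree and 0 = disagree.
    """
    seen_disagreement = False
    for response in pattern:
        if response == 0:
            seen_disagreement = True
        elif seen_disagreement:
            # Agreed with a higher-level item after rejecting a lower one.
            return False
    return True


def guttman_score(pattern):
    """Number of items endorsed; meaningful when the pattern is consistent."""
    return sum(pattern)


consistent = is_guttman_consistent([1, 1, 1, 0])
inconsistent = is_guttman_consistent([1, 0, 1, 0])
```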

what scaling methods are the most flexible?

1. Rating Scale
2. Likert Scale

this scaling method uses categories or labels to classify responses. It is nominal, meaning the categories have no inherent order or numerical value.

categorical scale

what scaling methods are usually associated with behaviors?

1. Paired Comparison
2. Comparative Scale
3. Categorical

what scaling methods are associated with attitudes?

1. Guttman Scale
2. Thurstone Scale

what are the designations for the item difficulty index?

% correct
0-20: Very Difficult
21-60: Difficult
61-90: Moderately Difficult
91-100: Easy

what are the levels/designations for the item discrimination index?

0.19 or below = poor item; should be eliminated or revised
0.20-0.29 = marginal item; needs some revision
0.30-0.39 = reasonably good item, but possibly subject to improvement
0.40 or above = very good item
>0.50 = ideal
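
The designations above can be written as a simple lookup. How values that fall exactly on a boundary are handled is an assumption here, since the listed ranges leave small gaps.

```python
def classify_discrimination(d):
    """Map an item discrimination index to the designations listed above.

    Boundary handling (for example, treating anything below 0.20 as poor)
    is an assumption.
    """
    if d < 0.20:
        return "poor"
    if d < 0.30:
        return "marginal"
    if d < 0.40:
        return "reasonably good"
    if d <= 0.50:
        return "very good"
    return "ideal"


labels = [classify_discrimination(d) for d in (-0.30, 0.25, 0.35, 0.45, 0.60)]
```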

what is an unacceptable range for item difficulty index?

<0.20 or >0.80

what are the designations for the item reliability and validity index?

Threshold: >0.75 (r) and 0.75 (v)
Unacceptable: <0.20
Marginal: 0.21-0.40
Reasonable: 0.41-0.74
Ideal: >0.74

this term refers to the revalidation of a test on a sample of test-takers other than those on whom it was first found to be a valid predictor; the decrease in item validities that typically follows is called validity shrinkage

cross-validation

this term refers to a test protocol scored by a highly authoritative scorer and designed as a model for scoring and for resolving scoring discrepancies

anchor protocol

this term refers to the validation process conducted on two or more tests using the same sample of test-takers.

co-validation

what does a negative item discrimination index mean?

the lower-scoring group (LG) answered the item correctly more often than the upper-scoring group (UG)

this refers to a neurological disorder marked by involuntary episodes of laughing or crying, often without an appropriate trigger.

Pseudobulbar Affect (PBA)

Test Development Flashcards

(63 cards)