Chapter 8 Test Development Flashcards Preview


Flashcards in Chapter 8 Test Development Deck (56):

Anchor Protocol?

A test answer sheet developed by a test publisher to check the accuracy of examiners' scoring and to resolve scoring discrepancies.


Biased test item:

Biased test item is an item that favours one particular group of examinees in relation to another when differences in group ability are controlled.


How to detect a biased test item?

Methods of item analysis, such as item characteristic curves (ICCs).
Specific items are identified as biased if they exhibit differential item functioning.
The ICCs for the different groups should not be statistically different.


What is the order of Test Development from conceptualization?

1. Test conceptualization
2. Test construction
3. Test tryout
4. Item analysis
5. Test revision (then back to test tryout as needed)


What is a good item on a norm referenced achievement test?

An item that high scorers on the test tend to answer correctly and low scorers on the test tend to answer incorrectly.


What pattern should occur on a criterion referenced test?

On a criterion-referenced test, the pattern of results may be the same as on a norm-referenced test: high scorers get a particular item right whereas low scorers get it wrong.


Criterion-referenced test: difference ...

Ideally, each item on a criterion-referenced test addresses the issue of whether the test taker has met certain criteria, e.g. when assessing a pilot.
Norm-referenced testing is insufficient when knowledge of mastery is needed.


Pilot work

Pilot work refers to the preliminary research surrounding the creation of a prototype of the test.
The test developer typically attempts to determine how best to measure a targeted construct.


What is scaling?

Scaling is the process of setting rules for assigning numbers in measurement.
A process by which a measuring device is designed and calibrated and by which numbers - scale values - are assigned to different amounts of the trait, attribute or characteristic being measured.


Stanine scale?

A scale to which raw scores are transformed, with values that range from 1 to 9.
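
As a rough illustration of such a transformation, the sketch below assumes the raw score has already been converted to a z-score and applies the commonly used rule "stanine = 2z + 5, rounded and limited to 1 to 9" (stanines have a mean of 5 and a standard deviation of about 2). The function name `z_to_stanine` is hypothetical and not from the deck.

```python
import math


def z_to_stanine(z: float) -> int:
    """Convert a z-score to a stanine (1 to 9).

    Assumption: uses the common 2z + 5 approximation, rounding halves up,
    then clips the result so it stays within the 1 to 9 range.
    """
    rounded = math.floor(2 * z + 5.5)  # round half up
    return max(1, min(9, rounded))


if __name__ == "__main__":
    for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(z, z_to_stanine(z))
```

Extreme z-scores are clipped rather than rejected, so every test taker lands in one of the nine bands.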


What is the MDBS?

The MDBS (Morally Debatable Behaviours Scale) is an example of a rating scale.
It has 30 items, each rated on a 10-point scale from "never justified" to "always justified".
A rating scale is a grouping of words, statements or symbols on which judgements of the strength of a particular trait, attitude or emotion are indicated by the test taker.


What is a rating scale?

Rating scales are:
A grouping of words, statements or symbols on which judgements of the strength of a particular trait, attitude or emotion are indicated by the test taker.
Used to record judgements of oneself, others, experiences, or objects, and they can take several forms.


What is a summative scale?

A scale on which the final test score is obtained by summing the ratings across all the items.
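
A minimal sketch of summative scoring follows. The 1 to 5 rating range and the function name `summative_score` are assumptions chosen for illustration; a real instrument defines its own range and may reverse-score some items first.

```python
def summative_score(ratings, low=1, high=5):
    """Sum the ratings across all items to get the final test score.

    Assumption: every rating must lie between `low` and `high` inclusive.
    """
    for rating in ratings:
        if not low <= rating <= high:
            raise ValueError(f"rating {rating} outside {low}..{high}")
    return sum(ratings)


if __name__ == "__main__":
    print(summative_score([4, 5, 3, 2, 5]))
```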


What is the Likert Scale?

A summative scale used to scale attitudes.
Usually offers five alternative responses, sometimes seven.
Usually on an agree-disagree or approve-disapprove continuum.

Use of Likert scales results in ordinal-level data.


Unidimensional rating scale?

Only one dimension is underlying the ratings.


Multidimensional rating scales.

More than one dimension is thought to guide the test taker's responses, i.e. more than one dimension is tapped by an item (p. 241).


Method of paired comparisons?

A scaling method that produces ordinal data.
Test takers are presented with pairs of stimuli (two photos, two statements, two objects, ...).
They must select one of the stimuli according to some rule.
An advantage is that it forces test takers to choose between items.
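
One simple way to turn paired choices into ordinal data is to count how often each stimulus was selected and rank by that count. This is only an illustrative tally, not the scoring procedure of any particular test; the function name and the tuple layout are assumptions.

```python
from collections import Counter


def rank_by_paired_choices(choices):
    """Rank stimuli by how often each was chosen.

    `choices` is a list of (stimulus_a, stimulus_b, chosen) tuples.
    Assumption: ties are broken alphabetically for a stable ordering.
    """
    wins = Counter()
    for a, b, chosen in choices:
        if chosen not in (a, b):
            raise ValueError("chosen stimulus must be one of the pair")
        wins[a] += 0  # make sure never-chosen stimuli still appear
        wins[b] += 0
        wins[chosen] += 1
    return sorted(wins, key=lambda s: (-wins[s], s))


if __name__ == "__main__":
    data = [("A", "B", "A"), ("A", "C", "A"), ("B", "C", "B")]
    print(rank_by_paired_choices(data))
```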


Categorical scaling

Relies on sorting.
Stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum.
E.g. with the MDBS-R, sorting 30 cards into 3 piles:
- behaviours never justified
- sometimes justified
- always justified


Guttman scale:

A scaling method that yields ordinal-level measures.
Items range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured.
A key feature is that all respondents who agree with the stronger statements will also agree with the milder statements.
Guttman scales are assessed by scalogram analysis.
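
The cumulative property described above can be checked for a single respondent with a short sketch. It assumes the responses are booleans ordered from the mildest to the strongest statement; the function name is hypothetical, and a full scalogram analysis involves more than this one check.

```python
def is_guttman_consistent(responses):
    """Return True if agreement never reappears after a disagreement.

    `responses` holds True (agree) or False (disagree), ordered from the
    mildest statement to the strongest. On a perfect Guttman scale,
    agreeing with a stronger statement implies agreeing with every
    milder one.
    """
    seen_disagreement = False
    for agrees in responses:
        if not agrees:
            seen_disagreement = True
        elif seen_disagreement:
            return False  # agreed with a stronger item after rejecting a milder one
    return True


if __name__ == "__main__":
    print(is_guttman_consistent([True, True, False, False]))
    print(is_guttman_consistent([True, False, True]))
```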


Scalogram analysis.

An item analysis procedure and approach to test development that involves a graphic mapping of a test taker's responses.
Used to assess Guttman scales.


Item pool

An item pool is the reservoir from which items will or will not be drawn for the final version of a test.


Item format

Variables such as the form, plan, structure, arrangement, and layout of individual test items are collectively referred to as item format.
Two types:
Selected response format
Constructed response format.


Selected response format

Require test takers to select a response from a set of alternative responses.
Eg Multiple choice format


Constructed response format.

Requires test takers to supply or to create the correct answer, not merely to select it.
Eg essay
short answer


Multiple choice format.

3 elements:
1. a stem
2. correct alternative or option
3. several incorrect options - distractors or foils.


What sort of item is a matching item?

In a matching item the test taker is presented with two columns:
premises on the left and responses on the right.
The test taker's task is to determine which response is best associated with which premise.


Binary choice item.

A multiple-choice item that contains only two possible responses.
EG True - false.
Agree - disagree
Yes - no
Fact - opinion
Right - wrong.


Constructed response format:

Completion item
Short answer


Computer administration items:

Ability to store items in an item bank.
Item bank = large collection of testing questions.
Ability to individualize testing through item branching.


Computerized adaptive testing.

CAT refers to an interactive, computer-administered test-taking process wherein items presented to the test taker are based in part on the test taker's performance on previous items.


Floor effects

A floor effect refers to the diminished utility of an assessment tool for distinguishing test takers at the low end of the ability, trait, or other attribute being measured.
Solution = to add some less difficult items.


Ceiling effect

A ceiling effect refers to the diminished utility of an assessment tool for distinguishing test takers at the high end of the ability, trait, or other attribute being measured.
ie test too easy.
Solution- add some harder questions.


Item branching

The ability of the computer to tailor the content and order of presentation of test items on the basis of responses to previous items.
E.g. the items presented may depend on patterns such as consecutive correct responses (p. 252).
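
A toy branching rule based on consecutive responses might look like the sketch below. The streak length of 2, the difficulty levels 1 to 5, and the function name are all illustrative assumptions, not the rule used by any actual testing program.

```python
def next_difficulty_level(current_level, recent_results, streak=2, max_level=5):
    """Pick the difficulty level for the next item.

    `recent_results` is a list of booleans (True = correct), oldest first.
    Assumption: move up one level after `streak` consecutive correct
    responses, down one level after `streak` consecutive incorrect
    responses, and otherwise stay at the current level.
    """
    if len(recent_results) >= streak:
        last = recent_results[-streak:]
        if all(last):
            return min(max_level, current_level + 1)
        if not any(last):
            return max(1, current_level - 1)
    return current_level


if __name__ == "__main__":
    print(next_difficulty_level(3, [True, True]))
    print(next_difficulty_level(3, [False, False]))
    print(next_difficulty_level(3, [True, False]))
```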


Class or category scoring.

Test taker responses earn credit toward placement in a particular class or category with other test takers whose pattern of responses is similar.


Ipsative Scoring

A scoring model that compares a test taker's score on one scale within a test to another scale within that same test (p. 253).


Item fairness.
Biased item

A biased item is one that favours one particular group of examinees in relation to another when differences in group ability are controlled.


What do Item Characteristic Curves do?

They can be used to identify biased items.
Specific items are identified as biased in a statistical sense if they exhibit differential item functioning...different shapes of item-characteristic curves for different groups.


Qualitative Item Analysis

Is a general term for various non statistical procedures designed to explore how individual test items work.
Compares individual test items to each other and to the test as a whole.
Qualitative methods involve:
group discussions


Think aloud test administration

Cognitive assessment approach.
Respondents verbalize thoughts as they occur.
p.266 table


Qualitative Analysis
Expert panels

E.g. a sensitivity review: a study of items, conducted during the test development process, in which items are examined for fairness to all prospective test takers and for the presence of offensive language, stereotypes, etc.


Test Revision

Some items will be eliminated and others will be rewritten from the original pool.

Look at whether items are too difficult, too easy, biased, etc.



Cross-validation

Cross-validation refers to the revalidation of a test on a sample of test takers other than those on whom test performance was originally found to be a predictor of some criterion.


Validity Shrinkage

Validity shrinkage is the decrease in item validities that occurs after cross-validation of findings.
Such shrinkage is expected and is integral to the test development process.



Co-validation

Co-validation is a test validation process conducted on two or more tests using the same sample of test takers.



Co-norming

When used in conjunction with the creation of norms or the revision of existing norms, co-validation may also be referred to as co-norming.

A current trend among test publishers who publish more than one test designed for use with the same population is to co-validate and/or co-norm tests.


Anchor protocol

A mechanism for ensuring consistency in scoring.
It is a test protocol scored by an authoritative scorer that is designed as a model for scoring and a mechanism for resolving scoring discrepancies.


Scoring drift

Scoring drift is a discrepancy between the scoring in an anchor protocol and the scoring of another protocol.

Once protocols are scored, the data from them must be entered into a data base.


Item banks

Each of the items assembled as part of an item bank has undergone rigorous qualitative and quantitative evaluation.

Many items come from existing instruments.
New items may be written.
All items constitute the item pool.


What scales of measurement are there?

- Likert scales (e.g. 1 = strongly disagree to 7 = strongly agree)
- Binary choice scales (true/false; like/dislike)
- Forced choice (e.g. "I am happy most of the time" OR "I am sad most of the time")
- Semantic differential scales (e.g. strong ... weak)


Writing test items
What's the first step?

To create an item pool.

Two general item format options:
1. selected response items
2. constructed response items


What are the 4 analytic tools that test developers use to analyze and select items?

- Item difficulty Index
- Item discrimination index
- Item validity index
- Item reliability index


Item difficulty.
How is it calculated?

Item difficulty index is calculated as the proportion of test takers who answered the item correctly.

The p value ranges from 0 to 1.

Each item has a corresponding p value, e.g. p1 is read "item difficulty index for item 1".
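
The proportion described above can be computed directly. The sketch assumes each response to the item is coded 1 for correct and 0 for incorrect; the function name is hypothetical.

```python
def item_difficulty(item_responses):
    """Return the item difficulty index p for one item.

    `item_responses` is a list of 1 (correct) or 0 (incorrect), one entry
    per test taker. p is the proportion who answered correctly, so a
    higher p means an easier item.
    """
    if not item_responses:
        raise ValueError("need at least one response")
    return sum(item_responses) / len(item_responses)


if __name__ == "__main__":
    # Three of four test takers answered item 1 correctly.
    p1 = item_difficulty([1, 1, 0, 1])
    print(p1)
```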


What is the ideal level of item difficulty for a test as a whole?

The item difficulty for the test as a whole is calculated as the average of all the p values for the test items.

The optimal average item difficulty is about 0.5.

Individual items should range in difficulty from about 0.3 (somewhat difficult) to 0.8 (somewhat easy).
The effect of guessing must be taken into account.


Which items do not discriminate between test takers?

Items that everyone answers correctly (p = 1) and items that no one answers correctly (p = 0).

These items DO NOT DISCRIMINATE between test takers.
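
Given a list of p values, such items can be flagged with a short sketch like the one below. The function name and the 1-based item numbering are assumptions for illustration.

```python
def non_discriminating_items(p_values):
    """Return the 1-based numbers of items with p equal to 0 or 1.

    `p_values` is a list of item difficulty indices, one per item.
    Items that everyone passes (p = 1) or everyone fails (p = 0) cannot
    separate test takers from one another.
    """
    return [number for number, p in enumerate(p_values, start=1) if p in (0.0, 1.0)]


if __name__ == "__main__":
    print(non_discriminating_items([1.0, 0.5, 0.0, 0.8]))
```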


What is the Item Discrimination Index?

Item discrimination index is the degree to which an item differentiates correctly on the behaviour the test is designed to measure.

I.e. an item is good if most of the high scorers on the test overall answer the item correctly and most of the low scorers on the test answer the item incorrectly.
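
The card gives no formula, so the sketch below assumes the widely used upper-lower form d = (U - L) / n, where U and L are the numbers of test takers in the upper and lower scoring groups who answered the item correctly and n is the size of each group. Using the top and bottom 27% as the groups is a common convention and is likewise an assumption here, as is the function name.

```python
def discrimination_index(total_scores, item_correct, fraction=0.27):
    """Estimate the item discrimination index d = (U - L) / n.

    `total_scores[i]` is test taker i's total test score and
    `item_correct[i]` is 1 if that person answered the item correctly,
    else 0. Assumption: the upper and lower groups are each the top and
    bottom `fraction` of test takers by total score.
    """
    if len(total_scores) != len(item_correct) or not total_scores:
        raise ValueError("need equal-length, non-empty inputs")
    n = max(1, round(len(total_scores) * fraction))
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    lower, upper = order[:n], order[-n:]
    u = sum(item_correct[i] for i in upper)
    low = sum(item_correct[i] for i in lower)
    return (u - low) / n


if __name__ == "__main__":
    scores = list(range(1, 11))
    print(discrimination_index(scores, [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]))
```

Under this form, d runs from -1 to +1; a negative value means low scorers outperformed high scorers on the item, which usually signals a problem with it.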


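Item difficulty
Optimal difficulty when guessing is taken into account

Optimal item difficulty = (1 + probability of answering correctly by chance) / 2

E.g. with a chance probability of .25 (as on a four-option multiple-choice item):

(1 + .25) / 2 = .625, which rounds to .63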
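
The same arithmetic as a small function, for checking other chance levels. The function name is hypothetical, and the true-false example (chance of .5) is an added illustration rather than something stated on the card.

```python
def optimal_item_difficulty(chance_probability):
    """Return the midpoint between chance-level success and 1.0.

    `chance_probability` is the probability of answering correctly by
    guessing alone, e.g. 0.25 for a four-option multiple-choice item.
    """
    if not 0.0 <= chance_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return (1 + chance_probability) / 2


if __name__ == "__main__":
    print(optimal_item_difficulty(0.25))  # four-option multiple choice
    print(optimal_item_difficulty(0.5))   # true-false item
```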