Test Development Flashcards

1
Q

An emerging social phenomenon or pattern of behavior may serve as the stimulus for the development of a new test, as may the need to assess mastery in emerging occupations or professions.

A

Test Conceptualization

2
Q

Criterion-referenced testing and assessment are commonly employed in _ and _ contexts.

A

Licensing
Educational

3
Q

The items that best discriminate between 2 groups would be considered the _ items.

A

Good items

4
Q

A good item on a _ test is an item for which high scorers on the test respond correctly and low scorers respond incorrectly.

A

Norm-referenced test

5
Q

The preliminary research surrounding the creation of a prototype of the test. It is done to evaluate whether items should be included in the final form of the instrument.

A

Pilot work

6
Q

The process by which a measuring device is designed and calibrated and by which numbers are assigned to different amounts of the trait, attribute, or characteristic being measured.

A

Scaling

7
Q

He is credited with being at the forefront of efforts to develop methodologically sound scaling methods.

A

L. L. Thurstone

8
Q

Types of scales

A

Age-based scale
Grade-based scale
Stanine scale

9
Q

A type of scale where all raw scores on the test are transformed into scores that range from 1 to 9.

A

Stanine scale

10
Q

The 3 scaling methods

A

Rating Scale
Summative scale
Likert scale

11
Q

A grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker.

A

Rating Scale

12
Q

A scale on which the test score is obtained by summing the ratings across all the items.

A

Summative scale

13
Q

A type of summative rating scale that is used extensively in psychology to scale attitudes. Each item presents the testtaker with five alternative responses, usually on an agree-disagree or approve-disapprove continuum.

A

Likert scale
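The summative (Likert) scoring described in the two cards above can be sketched in Python; the five-item scale and the 1-5 agree-disagree coding are illustrative assumptions, not part of any specific instrument:

```python
# Minimal sketch of summative (Likert) scoring: the test score is
# simply the sum of the ratings across all items.

def summative_score(ratings):
    """Sum the ratings across all items to obtain the test score."""
    return sum(ratings)

# One testtaker's ratings on a hypothetical five-item Likert scale
# (1 = strongly disagree ... 5 = strongly agree).
ratings = [4, 5, 3, 4, 2]
print(summative_score(ratings))  # -> 18
```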

14
Q

When one dimension is presumed to underlie the ratings.

A

Unidimensional

15
Q

When more than 1 dimension is thought to guide the testtaker’s responses.

A

Multidimensional

16
Q

What are the 4 scaling methods that produce ordinal data?

A

Method of paired comparison
Comparative scaling
Categorical scaling
Guttman scale

17
Q

A scaling method that produces ordinal data. Testtakers are presented with pairs of stimuli which they are asked to compare, and they must select one of the stimuli according to some rule. Then they receive a higher score for selecting the option deemed more justifiable by the majority of a group of judges.

A

Method of Paired comparison
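The scoring rule for the method of paired comparison described above can be sketched as follows; the stimulus pairs and the judges' key are invented for illustration:

```python
# Sketch of paired-comparison scoring: the testtaker earns a point
# each time they select the member of a pair that a panel of judges
# deemed more justifiable.

def paired_comparison_score(selections, judged_key):
    """Count selections that match the judges' preferred option per pair."""
    return sum(1 for pair, choice in selections.items()
               if judged_key[pair] == choice)

# Hypothetical behaviors paired for comparison; judges deemed "lie"
# the more justifiable option in both pairs.
judged_key = {("steal", "lie"): "lie", ("cheat", "lie"): "lie"}
selections = {("steal", "lie"): "lie", ("cheat", "lie"): "cheat"}
print(paired_comparison_score(selections, judged_key))  # -> 1
```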

18
Q

A scaling method that produces ordinal data. Stimuli such as printed cards, drawings, photographs, or other objects are typically presented to testtakers for evaluation and must be sorted from most justifiable to least justifiable. It can also be accomplished through the use of a list of items on a sheet of paper.

A

Comparative scaling

19
Q

A scaling method that produces ordinal data. Stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum.

A

Categorical scaling

20
Q

A scaling method that produces ordinal data. Items on it range sequentially from weaker to stronger expressions of the attitude, belief or feeling being measured. All respondents who agree with the stronger statements of the attitude will also agree with milder statements.

A

Guttman scale

21
Q

The resulting data of a Guttman scale are analyzed by means of this: an item-analysis procedure and approach to test development that involves a graphic mapping of a testtaker's responses.

A

Scalogram Analysis
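The cumulative property that a scalogram analysis looks for in Guttman-scale data can be sketched with a simple pattern check; the 0/1 coding and mildest-to-strongest item ordering are assumptions for illustration:

```python
# Sketch of the "perfect" Guttman response pattern: with items ordered
# from mildest to strongest, endorsements (1s) should form an unbroken
# run followed only by non-endorsements (0s) -- anyone who agrees with
# a stronger statement also agrees with all milder ones.

def is_guttman_pattern(responses):
    """True if no endorsement (1) appears after a non-endorsement (0)."""
    seen_zero = False
    for r in responses:
        if r == 0:
            seen_zero = True
        elif seen_zero:  # a 1 after a 0 breaks the cumulative pattern
            return False
    return True

print(is_guttman_pattern([1, 1, 1, 0, 0]))  # -> True
print(is_guttman_pattern([1, 0, 1, 0, 0]))  # -> False
```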

22
Q

The reservoir from which items will or will not be drawn for the final version of the test. It comprises items available for use as well as new items created especially for it.

A

Item pool

23
Q

It is the form, plan, structure, arrangement and layout of individual test items.

A

Item format

24
Q

The two types of item format:

A

Selected response format
Constructed response format

25
It requires testtakers to select a response from a set of alternative responses.
Selected response format
26
3 Types of selected response format:
Multiple-choice format
Matching item
True-false
27
Several incorrect alternatives or options in a multiple choice format are referred to as _.
Distractors or foils
28
A selected response format where the testtaker is presented with 2 columns and must determine which response is best associated with which premise.
Matching item
29
A multiple choice item format that contains only two possible responses (binary choice) (agree or not, yes or no, right or wrong, fact or opinion). It usually takes the form of a sentence.
True-false
30
3 types of constructed response items:
Completion item
Short-answer item
Essay
31
A constructed response format that requires the examinee to provide a word or phrase that completes a sentence.
Completion item
32
A constructed response format where a word, term, sentence or paragraph may qualify as an answer.
Short-answer item
33
A constructed response format that requires the testtaker to respond to a question by writing a composition, typically one that demonstrates recall of facts, understanding, analysis and/or interpretation.
Essay
34
A relatively large and easily accessible collection of test questions.
Item bank
35
An interactive, computer-administered test-taking process wherein items presented to the testtaker are based in part on the test takers' performance on previous items.
Computerized adaptive testing
36
It refers to the diminished utility of an assessment tool for distinguishing testtakers at the low end of the ability, trait or other attribute being measured. Testtakers who have not yet achieved such ability might fail all the items.
Floor effect
37
It refers to the diminished utility of an assessment tool for distinguishing testtakers at the high end of the attribute being measured. Testtakers who answer all of the items correctly are likely to conclude that the test was too easy.
Ceiling effect
38
The ability of the computer to tailor the content and order of presentation of test items on the basis of responses to previous items.
Item branching
39
What are the 3 different scoring models?
Cumulative model
Class scoring or Category scoring
Ipsative scoring
40
Scoring model where the higher the score on the test, the higher the testtakers are on the ability or characteristic that the test purports to measure.
Cumulative model
41
Scoring model where testtakers earn credit toward placement in a particular class or category with other testtakers whose patterns of responses are presumably similar in some way. Used by some diagnostic systems.
Class scoring or Category scoring
42
Scoring model that compares a testtaker's score on one scale within a test to another scale within that same test.
Ipsative scoring
43
The informal rule of thumb for test tryout is that there should be no fewer than _ subjects and preferably as many as _ for each item on the test.
5 and 10
44
Factors that are actually just artifacts of a small sample size.
Phantom factors
45
A lowercase italic "p" is used to denote _.
Item Difficulty
46
The larger the item difficulty index, the _ the item.
Easier
47
The optimal average item difficulty for maximum discrimination among the abilities of testtakers.
Approximately 0.5
48
The optimal range of difficulty for individual items on the test.
0.3-0.8
49
For the possible effect of guessing, the optimal average item difficulty is usually the midpoint between _ and the chance success proportion.
1.00
50
The probability of answering correctly by random guessing.
Chance success proportion
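The difficulty statistics in the cards above can be sketched in Python; the response vector and the 4-option multiple-choice format are illustrative assumptions:

```python
# Sketch of the item-difficulty index p (the proportion of testtakers
# answering the item correctly) and of the guessing-adjusted optimal
# difficulty: the midpoint between 1.00 and the chance success
# proportion.

def item_difficulty(responses):
    """p = number correct / number of testtakers (1 = correct, 0 = wrong)."""
    return sum(responses) / len(responses)

def optimal_difficulty(n_options):
    """Midpoint between 1.00 and the chance success proportion."""
    chance = 1 / n_options
    return (1.0 + chance) / 2

print(item_difficulty([1, 1, 1, 0, 1, 0, 1, 1, 0, 1]))  # -> 0.7
print(optimal_difficulty(4))  # 4-option multiple choice -> 0.625
```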
51
The higher the item-reliability index, the greater the test's _.
Internal consistency
52
A statistical tool useful in determining whether items on a test appear to be measuring the same thing.
Factor analysis
53
It is a statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure.
Item-validity index
54
The higher the item validity index, the greater the test's _.
Criterion-related validity
55
It compares performance on a particular item with performance in the upper and lower regions of a distribution of continuous test scores.
Item discrimination index
56
Item discrimination index is symbolized by _.
Lowercase italic "d"
57
The _ the value of d, the more adequately the item discriminates the higher-scoring from the lower-scoring testtakers.
Higher
58
The highest possible value of d.
+1.00
59
The value of d that indicates the item is not discriminating for there is the same proportion of members of the upper and lower groups who pass the item.
0
60
The lowest value that an index of item discrimination can take. It indicates that all members of the upper group failed the item and all members of the lower group passed it.
-1.00
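The item-discrimination index d described in the cards above can be sketched as follows; the equal-sized upper and lower groups of 10 are an illustrative assumption:

```python
# Sketch of the item-discrimination index d: the difference between
# the proportions of upper-group and lower-group testtakers who pass
# the item, ranging from -1.00 to +1.00.

def discrimination_index(upper_pass, lower_pass, group_size):
    """d = (U - L) / n, where U and L are the numbers passing in each group."""
    return (upper_pass - lower_pass) / group_size

print(discrimination_index(9, 3, 10))   # -> 0.6 (discriminates well)
print(discrimination_index(0, 10, 10))  # -> -1.0 (lowest possible value)
```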
61
It is a graphic representation of item difficulty and discrimination. The steeper the slope, the greater the item discrimination.
Item-characteristic curves
62
It is an item that favors one particular group of examinees in relation to another when differences in group ability are controlled.
Biased test item
63
It is exemplified by different shapes of item-characteristic curves for different groups when the 2 groups do not differ in total test score.
Differential item functioning
64
These are techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures.
Qualitative methods
65
It is a general term for various nonstatistical procedures designed to explore how individual test items work. It involves exploration of the issues through verbal means.
Qualitative item analysis
66
A qualitative research tool designed to shed light on the testtaker's thought processes during the administration of a test. They are asked to think aloud as they respond to each item.
"Think aloud" test administration
67
A study of test items, typically conducted during the test development process, in which the items are examined for fairness to all prospective testtakers and for the presence of offensive language, stereotypes, or situations.
Sensitivity review
68
It refers to the revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion.
Cross validation
69
The decrease in item validities that inevitably occurs after cross-validation of findings.
Validity shrinkage
70
Test validation process conducted on two or more tests using the same sample of testtakers.
Co-validation
71
Used in conjunction with the creation of norms or the revision of existing norms.
Co-norming
72
Discrepancies between scorers are resolved by another scorer, who is called the _.
Resolver
73
A protocol scored by a highly authoritative scorer that is designed as a model for scoring and a mechanism for resolving scoring discrepancies.
Anchor protocol
74
Discrepancy between scoring in an anchor protocol and the scoring of another protocol.
Scoring drift
75
Items that respondents from different groups, at the same level of the underlying trait, have different probabilities of endorsing as a function of their group membership.
Differential item functioning (DIF) items