Creating a Test: Flashcards

1
Q

Give examples of some of the issues to be considered when conceptualizing a test

A

Consider what’s been done before and whether the proposed test offers anything new; who’s going to use it; the context in which it will be used; how long or effortful the measure can be; any practicalities to be considered; any ethical issues to address

2
Q

Give an example of the sort of practicalities that might need to be considered when conceptualizing a test

A

A large enough font size if testing older adults; size of budget and amount of time; expense if administered to thousands of people; skill of examiners

3
Q

What sort of ethical issues might you need to consider when conceptualizing a test?

A

Anonymity and whether this will create problems; whether there’s any potentially sensitive or offensive content

4
Q

What sort of issues should you consider when creating the materials for a new test?

A

Format of test items; scoring of test; whether items yield sufficient response variability (e.g. if norm-referenced); whether the items sample the domain of interest sufficiently; if items test people on the criterion; number of items (more items = more reliable but takes longer); whether you want/have face validity

5
Q

What key considerations might you need to take into account when drafting written content for a test?

A

Avoid ambiguity and lack of clarity; include reverse-worded questions

6
Q

If someone asks you advice on designing a good multiple-choice question, what would you tell them?

A

Make sure distractors are plausible; reduce the chance of a respondent guessing the correct answer without actually knowing it

7
Q

What are selected-response questions?

A

They test recognition; e.g. multiple-choice questions; matching options; true/false

8
Q

What are constructed-response questions?

A

Where respondents generate the answer (recall); e.g. fill in the blank; write an essay/account

9
Q

In a multiple-choice question, what is the “stem”?

A

The question part

10
Q

Describe five strategies you could use for pilot research into your new test

A
  1. Examine previous literature;
  2. Interview people who might know something about the area;
  3. Give people a more open-ended form of the test to generate ideas for items;
  4. Get people to complete a preliminary or “faked-up” version of the test and comment on it;
  5. Get people to give a running commentary about their thoughts as they complete a draft of the test
11
Q

List five strategies you could use for evaluating the quality of items in an achievement test

A

Item difficulty index; reliability item discrimination index; validity item discrimination index; if multiple choice, examine the pattern of responses across options; further in-depth examination

12
Q

What’s the item difficulty index?

A

The proportion of respondents who get the item correct

13
Q

What is the validity item discrimination index?

A

The item’s contribution to the scale’s relationship to an external criterion measure

14
Q

When examining the quality of items, what further in-depth questions need to be asked?

A

Are there items that respondents who were expected to get them correct are getting wrong? Are there items on which respondents who were expected to get them wrong are scoring above chance?

15
Q

Describe the four strategies you could use to evaluate item quality in a personality-type test

A

Examine histogram/frequency table of responses to each item to determine the spread; reliability item discrimination index; validity item discrimination index; further in-depth examination (expectations of high vs. low scorers)

16
Q

How do you calculate the item difficulty index?

A

Count the number of correct responses for each item and divide by the total number of people
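The calculation on this card can be sketched in Python; the answer sheet below is hypothetical, purely for illustration:

```python
# Hypothetical scored answer sheet: each inner list is one respondent,
# each column one item (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]

def item_difficulty(responses, item):
    """Item difficulty index p: correct responses on the item / total respondents."""
    return sum(r[item] for r in responses) / len(responses)

print(item_difficulty(responses, 0))  # 3 of 4 correct -> 0.75
```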

17
Q

If you have a norm-referenced aptitude-type test, what is your goal, and what’s the optimal strategy to achieve this?

A

The goal is to maximize the test’s opportunity to tell apart people high or low on the trait or characteristic; adjust your items so the item difficulty index is .5 (p = .5), i.e. half the respondents get each item correct

18
Q

What is the optimal value for the item difficulty index for a six option multiple-choice question in a norm-referenced aptitude test?

A

The optimal difficulty is halfway between chance and everyone being correct: chance is 100%/6 = 16.67%, so the optimum is (16.67% + 100%) / 2 ≈ 58.3%
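The halfway-between-chance-and-perfect rule generalizes to any number of options; a minimal sketch (the function name is my own):

```python
def optimal_difficulty(n_options):
    """Optimal item difficulty p for a multiple-choice item:
    halfway between chance (1 / n_options) and everyone correct (1.0)."""
    chance = 1 / n_options
    return (chance + 1.0) / 2

print(round(optimal_difficulty(6), 4))  # six options -> 0.5833
```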

19
Q

If the item difficulty index on an item is .92 on a norm-referenced test, what does this mean?

A

That 92% of people got it right and it was probably easy (ceiling effect); doesn’t tell people apart

20
Q

In a norm-referenced test, what if the item difficulty index is .11?

A

11% of people got it right, which is less than chance; the item could be really hard, or there could be an error (a floor effect)

21
Q

How can you examine the spread of scores for items in a personality test?

A

Plot a histogram or output a frequency table for each item; check if the scores are spread across all response options for all items without too much skew
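One way to eyeball the spread, assuming a 5-point Likert item (the responses are made up):

```python
from collections import Counter

# Hypothetical responses to one 5-point Likert item.
item_responses = [3, 4, 2, 5, 3, 3, 1, 4, 2, 3]

# Frequency table: how often each response option was chosen.
freq = Counter(item_responses)
for option in range(1, 6):
    print(option, "#" * freq[option])  # crude text histogram of the spread
```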

22
Q

If individual item scores on a personality test are skewed, does this mean the total score will be skewed too?

A

Not necessarily

23
Q

What is the item discrimination index?

A

It determines which items contribute the most to the internal consistency of the test (reliability); and how each item contributes to the overall validity

24
Q

How do you calculate the item discrimination index with respect to internal consistency?

A
  1. Count up correct answers for each person
  2. Define upper and lower groups as top and bottom 25% based on their total score
  3. For each item, count up the number of people in the high scoring group who got it correct (U) and the people in the low scoring group that got it correct (L)
  4. Plug in the formula
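The four steps above can be sketched in Python, assuming equal-sized upper and lower groups of size n and the common form of the formula, d = (U − L) / n:

```python
def discrimination_index(responses, item):
    """Reliability item discrimination index d = (U - L) / n."""
    # Step 1: total score per respondent; rank respondents by it.
    ranked = sorted(responses, key=sum)
    # Step 2: upper and lower groups = top and bottom 25%.
    n = max(1, len(responses) // 4)
    lower, upper = ranked[:n], ranked[-n:]
    # Step 3: count correct answers on this item in each group.
    U = sum(r[item] for r in upper)
    L = sum(r[item] for r in lower)
    # Step 4: plug into the formula.
    return (U - L) / n

# Hypothetical scored answers (rows = respondents, columns = items).
data = [
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
print(discrimination_index(data, 0))  # item 1 separates high/low perfectly -> 1.0
```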
25
Q

What is the maximum d can be on the item discrimination index, and what does it mean?

A

+1: all high scorers get it correct and all low scorers get it wrong; the closer d is to +1, the more the item behaves like the other items and the more it contributes to reliability

26
Q

What does an item discrimination index of -1 mean?

A

It’s the minimum d can be: all low scorers get it correct and all high scorers get it wrong (the opposite of what we want, so possibly a scoring error)

27
Q

What if d=0 on the item discrimination index?

A

There’s no difference in performance between high and low scorers so no correlation with other items (measuring a different thing)

28
Q

How do you calculate the item discrimination index with respect to criterion validity?

A

In addition to the quiz, all respondents complete a previously validated test measuring the same thing (to obtain criterion score), then we examine the correlation between total quiz score and criterion score; define upper and lower groups as top and bottom 25% based on criterion score, and use the d formula
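The same d calculation, but with the groups defined by the external criterion score rather than the total quiz score (names and data here are illustrative):

```python
def criterion_discrimination(responses, criterion_scores, item):
    """Validity item discrimination index d = (U - L) / n,
    with upper/lower groups defined by an external criterion score."""
    # Rank respondents by their criterion score, not their quiz total.
    ranked = [r for _, r in sorted(zip(criterion_scores, responses))]
    n = max(1, len(responses) // 4)
    lower, upper = ranked[:n], ranked[-n:]
    U = sum(r[item] for r in upper)
    L = sum(r[item] for r in lower)
    return (U - L) / n

# Hypothetical: a one-item quiz plus a previously validated criterion measure.
quiz = [[0], [0], [0], [1], [0], [1], [1], [1]]
criterion = [10, 20, 30, 40, 50, 60, 70, 80]
print(criterion_discrimination(quiz, criterion, 0))  # -> 1.0
```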

29
Q

Why do we examine the correlation between the total quiz score and a criterion score when calculating validity item discrimination index?

A

To see how each item contributes to the overall validity in case we want to increase validity by removing items

30
Q

What do the values of the item discrimination index mean if groups are defined by an external criterion?

A

The closer d is to 1, the more the item is contributing to the test’s criterion validity; if d = 0, then the item effectively has zero criterion validity (can’t predict the criterion at all)

31
Q

How do you calculate the item discrimination index for personality type tests?

A

Choose a threshold and label above or equal to as correct, and below as incorrect (halfway along the scale; e.g. If 1-5, threshold is 3); code responses and proceed with d calculation
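A minimal sketch of the dichotomizing step for a 1–5 scale, with the threshold at the midpoint (3):

```python
def dichotomize(response, threshold=3):
    """Code a Likert response as 1 ('correct') if at or above the threshold,
    else 0, so the standard d calculation can be applied."""
    return 1 if response >= threshold else 0

print([dichotomize(r) for r in [1, 2, 3, 4, 5]])  # -> [0, 0, 1, 1, 1]
```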

32
Q

How could you go about evaluating items for a criterion-referenced test?

A
  1. Test an experienced group who should fulfill the criterion;
  2. Compare with a novice group who shouldn’t fulfill the criterion;
  3. Work out the item discrimination indices, with the experienced and novice groups substituted for the upper and lower groups
  4. If multiple choice, see if novices perform above chance on any questions
  5. Examine items that experienced people are scoring poorly on
33
Q

What can examining the pattern of responses across items for high and low scorers in a multiple-choice test tell you?

A

It shows how the item is working and the spread of responses across distractors: if a distractor isn’t chosen at all, it’s redundant; if a distractor is chosen more often than the correct answer, the wording may be misleading, or the distractor is actually correct and there’s a scoring error

34
Q

Which of the item analyses covered in this lecture are appropriate for speed tests, and why?

A

None of them: what discriminates between people on a speed test is not how many items they get correct but how many they complete, so analyses focused on item content don’t apply