Chapter 2: Test Construction, Administration, and Interpretation Flashcards

1
Q

How are tests constructed? (1)

A

Identify the Need
This could mean testing something not yet tested or improving a test that is already made.

2
Q

How are tests constructed? (2) The Role of Theory

A

All tests are implicitly or explicitly influenced or guided by the theory or theories held by the test constructor.

A theory might yield some specific guidelines

Example: if a researcher thought that depression was a disturbance in four specific areas (self-esteem, sleep quality, etc.), then this would dictate the test they build to measure depression

The theory may also be less explicit and not well formalized. The creation of a test is intrinsically related to the person doing the creating and to their theoretical views.

Even tests said to be empirically developed (based on observations of real-life behaviors) can be influenced by theory

3
Q

How are tests constructed? (3) Practical Choices

A

What format will the items have?

Will they be true or false, multiple choice, or on a rating scale?

Will my instruments be designed for group administration?

4
Q

How are tests constructed? (4) Pool of Items

A

The next step is to develop a table of specifications, much like the blueprint needed to construct a house. This table of specifications would indicate the subtopics to be covered by the proposed test.

The table of specifications may reflect the researcher's thinking, theoretical notions in the current literature, other tests on the topic, and the thoughts of other experts.

The table of specifications may be formal, informal, or absent; when present, it guides the writing of the items.

The items on a test reflect the constructor's creativity or pertain to other researchers/literature. Writing good test questions is both a science and an art. Professionals know that they need to write an item pool 4 or 5 times larger than the number of items they actually need.

5
Q

How are tests constructed? (5) Tryouts and Refinement

A

The initial pool of items will probably be large and rather unrefined.

The intent of this step is to refine the pool of items to a smaller but usable pool.

Pilot testing is used: a preliminary form is administered to a sample of subjects to determine whether there are any glitches.

We may also do some preliminary statistical work and assemble the test for a trial run called a pretest.

Administer the test to two different groups and carry out item analyses to see which items in fact differentiate the two groups.

Keep the best items and then perform a content analysis, sorting them to determine which categories have too many questions and which have too few.

6
Q

How are tests constructed? (6) Reliability and Validity

A

We need to establish that our measuring instrument is reliable, that is, consistent, and measures what we set out to measure, that is, the test is valid.

7
Q

How are tests constructed? (7) Standardization and Norms

A

We need to standardize the instrument and develop norms. To standardize means that the administration, time limits, scoring procedures, and so on are all carefully spelled out so that no matter who administers the test, the procedure is the same.

Raw scores in psychology are often meaningless. We need to give meaning to raw scores by changing them into derived scores

We also need to be able to compare an individual’s performance on a test with the performance of a group of individuals; that information is what we mean by norms.

Simply because a sample is large, does not guarantee that it is representative. The sample should be representative of the population to which we generalize.

8
Q

How are tests constructed? (8) Further Refinements

A

Sometimes the changes reflect additional scientific knowledge, and sometimes societal changes, as in our greater awareness of gender bias in language.
One type of revision that often occurs is the development of a short form of the original test.
Typically, a different author takes the original test, administers it to a group of subjects, and shows by various statistical procedures that the test can be shortened without any substantial loss in reliability and validity.
Psychologists and others are always on the lookout for brief instruments, and so short forms often become popular, although as a general rule, the shorter the test the less reliable and valid it is.
Still another type of revision that occurs fairly frequently comes about through factor analysis.
The factor analysis will tell you whether all the items on the test are useful or whether some should be dropped because their contribution is minimal. It will also tell you whether different aspects of the test should be scored together or separately.
Finally, there are a number of tests that are multivariate, that is, the test is composed of many scales.
The pool of items that comprises the entire test is considered an "open system," and additional scales are developed as needs arise.

9
Q

What to avoid when writing test items

A

Biased Questions
Loaded Questions
Double-barreled Questions
Jargon
Double Negatives
Poor answer scale options

10
Q

Biased Questions

A

Leading questions that sway people to answer one way or another.
Example: How great is our hard-working customer service team?

11
Q

Loaded Question

A

Contains an assumption about a person's habits or perceptions.
Example: Where do you like to go to happy hour after work?

12
Q

Double-barreled Questions

A

Asks multiple questions within one item.
Example: Was the product easy to find and did you buy it?

13
Q

Jargon

A

An item includes words, phrases, or acronyms that the person is not familiar with or doesn't understand.
Example: The product helped me meet my OKRs.

14
Q

Double Negatives

A

Items phrased with two negatives confuse respondents; use proper grammar.
Example: I don't scarcely buy items online.

15
Q

Poor answer scale options

A

Make sure your answer scales match the content of your items. They should not be confusing or unbalanced.
Example: How easy was it for you to complete the exam on time?
Answer: Yes | No

16
Q

Types of Items

A
1. Multiple-choice
2. True-false
3. Analogies
4. Odd-man-out
5. Sequences
6. Matching
7. Completion
8. Fill-in-the-blank
9. Forced-choice items
10. Vignettes
11. Rearrangement or continuity
17
Q

What are the incorrect options on a multiple-choice item called?

A

Distractors

18
Q

What is the correct option on a multiple-choice item called?

A

Keyed response

19
Q

What is the keyed response on tests with no definitive answer?

A

On tests that assess mental health, where there is no correct answer, the keyed response is the response that reflects what the test assesses. If you are measuring depression, then the keyed response will be the choice that correlates with depression, e.g., "I feel withdrawn from others."

20
Q

What are the advantages of a multiple choice test?

A

They can be answered quickly, so the test can include more items, and they can be scored quickly and inexpensively.

21
Q

What are the disadvantages of a multiple choice test?

A

They are better at assessing factual knowledge than problem-solving.

22
Q

When is the best time to use true or false?

A

when there is no right answer

23
Q

Where are analogies usually found?

A

in tests of intelligence

24
Q

What are matching tests good at?

A

assessing factual knowledge

25
Q

What is a disadvantage of matching tests?

A

mismatching one item can affect other items and thus the questions are not independent

26
Q

Where are completion tests usually found?

A

on personality tests

27
Q

Where are forced-choice tests usually found?

A

personality tests

The respondent has to pick one of a few options (e.g., "I would rather spend time alone" vs. "I would rather spend time with friends").

28
Q

What is a vignette?

A

A brief scenario, like the synopsis of a play or novel.

The subject is asked to react in some way to the vignette, perhaps by providing a story completion, choosing from a set of alternatives, or making some type of judgment.

29
Q

What are the two categories of items?

A

Constructed-response items: subject is presented with a stimulus and produces a response
Example: essay exams or sentence completion

Selected-response items: subject selects the correct or best response from a list of options
Example: multiple choice

30
Q

Objective test formats

A

One single response is labeled as "correct."

31
Q

Subjective test formats

A

There is not one single answer or response that is labeled as "correct."

32
Q

How to decide which Item Format to Use?

A

Try to increase variation.

If it is multiple choice, then have many choices, such as "strongly agree, agree, undecided, disagree, strongly disagree."

Use more items: a 10-item test scored right/wrong can yield scores ranging from 0 to 10, while if each item is scored 1 to 5, raw scores can range from 10 to 50.

33
Q

Sequencing of Items

A

One plan is to use a spiral omnibus format, which involves a series of items from easy to difficult, followed by another series of items from easy to difficult, and so on.

Some scales contain filler items that are not scored but are designed to "hide" the real intent of the scale.

34
Q

Direct or Performance Assessment

A

Also called "authentic" assessment: direct measurement of the product or performance generated.

If we wanted to test the competence of a football player we would not administer a multiple-choice exam, but would observe that person’s ability to play football.

35
Q

How do we know when an item is working? (Philosophical Issues)

A
  1. By fiat
  2. Criterion-keyed tests
  3. Factor analysis
36
Q

Fiat

A

A decree on the basis of authority.

Claiming that your test effectively measures depression because you are an expert on depression and the content of the items clearly relates to the subject.

Examples: the Beck Depression Inventory and the Stanford-Binet test of intelligence.

37
Q

Criterion-Keyed Tests

A

Items are retained or dropped based on whether they empirically differentiate between criterion groups (e.g., depressed vs. nondepressed respondents), regardless of what the item content appears to measure.
38
Q

Factor analysis

A

Items are retained when they correlate highly with (load on) the factor the test is intended to measure, so that the statistically selected items form a homogeneous scale.
39
Q

Test Administration

A

Standardization is important.
We want to minimize all influences that contribute to error variance and may decrease test validity.

40
Q

Examiners and their tasks

A

should prepare in advance of administration

• Memorizing or familiarizing themselves with the instructions
• Preparation of test materials
• Layout of necessary materials
• Checking and calibration of equipment
• Before administering individual tests unsupervised, the examiner should complete supervised training, including demonstration and practice sessions
• If testing is to be administered in a group setting with multiple examiners, then the examiners should be briefed beforehand to assign the functions each will perform

41
Q

Testing Conditions

A

• The testing environment should be standardized
• Suitable testing room
• Free from undue noise and distraction
• Adequate lighting, seating, ventilation, and workspace
• Prevent interruptions
• Desks and chairs can make a difference
• Type of answer sheet
• Medium of administration (paper and pencil, computer)
• Is the examiner familiar or a stranger?
• Manner of the examiner (e.g., smiling, nodding, making positive comments)
• Presence of the examiner in the room (projective tests) or of other people

42
Q

Rapport

A

the "bond" between the examiner and the test taker
• Rewards should be consistent across respondents
• Rapport will vary with the test, the age of the respondents, group versus individual testing, the personalities of respondents, and any special difficulties of respondents
• Reassure respondents at the outset
• Eliminate elements of surprise
• With adults, "sell" the purpose of the test: it is in their best interest to do their best; for personality measures, reduce faking and encourage frank reporting

43
Q

Examiner and Situational Variables

A

• Effects are more likely with projective tests and individual intelligence tests
• Children are more susceptible
• Studies have examined examiner age, sex, ethnicity, professional or socioeconomic status, training, appearance, and personality; results are inconclusive
• Examiner's behavior preceding and during administration
• Interactions with the examiner
• Examiner expectations
• Timing of the test (e.g., military recruits testing shortly after induction)
• Test taker's activities shortly prior to the test (e.g., emotional disturbance, fatigue, success or failure)
• Effects of feedback

44
Q

Derived Scores

A

Relates the position of a raw score either to
• Other scores in the same distribution, or
• The distribution of raw scores obtained by a representative group
Key terms: norm group and test norms.

45
Q

Norm groups vs. test norms

A

Norm group – the reference group with known characteristics
Test norms – the distribution of test scores obtained for the norm group

46
Q

What do derived scores do?

A

Provide a standard frame of reference within which the meaning of a score can be better understood.
Make it possible for people, under certain conditions, to compare scores from different measures.

47
Q

What are the two kinds of derived score?

A
1. Those that preserve the proportional relation of interscore distances in the distribution (z scores and other linear transformations of raw scores)
2. Those that do not (e.g., percentiles)
48
Q

What is the formula for finding the kth percentile and the kth quartile?

A

i = (k/100) × (n + 1) for the kth percentile

i = (k/4) × (n + 1) for the kth quartile

where i is the index (the ranking or position of a data value) and n is the total number of data values
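
Below is a minimal Python sketch of these index formulas; the function names and sample data are illustrative, not from the text.

```python
# Hypothetical helpers illustrating i = (k/100)(n + 1) and i = (k/4)(n + 1).

def percentile_index(k: float, n: int) -> float:
    """Position of the kth percentile in a sorted list of n values."""
    return (k / 100) * (n + 1)

def quartile_index(k: int, n: int) -> float:
    """Position of the kth quartile (k = 1, 2, or 3)."""
    return (k / 4) * (n + 1)

scores = sorted([72, 85, 60, 91, 78, 66, 88])   # n = 7
i = percentile_index(50, len(scores))            # i = 4.0
# When i is a whole number, take that position; otherwise interpolate
# between the two neighboring values.
print(scores[int(i) - 1])                        # 78, the median
```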

49
Q

How do you calculate z scores? What is the mean value and SD value?

A

The mean (μ) of z scores is always 0 and the standard deviation (σ) is always 1.
The shape of the original distribution is not changed when converted.

z = (X − M) / SD, or z = (x − x̄) / s, or z = (x − μ) / σ

50
Q

What is the rule of thumb for mound-shaped distributions?

A
1. Approximately 68% of the measurements will have a z-score between -1 and 1.
2. Approximately 95% of the measurements will have a z-score between -2 and 2.
3. Approximately 99.7% of the measurements will have a z-score between -3 and 3.
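
A quick simulation check of the 68-95-99.7 rule, using randomly generated normal scores (illustrative only):

```python
# Empirical check of the mound-shaped (normal) distribution rule.
import random

random.seed(0)
zs = [random.gauss(0, 1) for _ in range(100_000)]

for bound, rule in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    share = sum(-bound <= z <= bound for z in zs) / len(zs)
    print(f"|z| <= {bound}: {share:.1%} (rule of thumb: {rule})")
```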
51
Q

How do you translate z scores to item difficulty?

A

use the normal-curve table in the back of the book, which converts the proportion of examinees passing an item into a corresponding z value

52
Q

Bandwidth-fidelity dilemma

A

A peaked test measures the people at the peak well, but others very poorly [high fidelity (precision), but low bandwidth].
A rectangular distribution tries to have a few questions at each difficulty level, so that the average difficulty level is around .50. Thus it will help differentiate people no matter where they fall on the trait.
But because the test has only a few items at each difficulty level, it won't be able to differentiate between the individuals at any given level very well.
This type of test has good bandwidth, but low fidelity.

53
Q

Define Item Difficulty

A

In psychology, we define item difficulty as the percentage of examinees who answer an item correctly.
𝑝 π‘£π‘Žπ‘™π‘’π‘’ π‘“π‘œπ‘Ÿ π‘–π‘‘π‘’π‘š 𝑖= π‘›π‘’π‘šπ‘π‘’π‘Ÿ π‘œπ‘“ π‘π‘’π‘Ÿπ‘ π‘œπ‘›π‘  π‘Žπ‘›π‘ π‘€π‘’π‘Ÿπ‘–π‘›π‘” π‘–π‘‘π‘’π‘š 𝑖 π‘π‘œπ‘Ÿπ‘Ÿπ‘’π‘π‘‘π‘™π‘¦ / π‘›π‘’π‘šπ‘π‘’π‘Ÿ π‘œπ‘“ π‘π‘’π‘Ÿπ‘ π‘œπ‘›π‘ π‘‘π‘Žπ‘˜π‘–π‘›π‘” h𝑑 𝑒𝑑𝑒𝑠𝑑 (𝑛)

54
Q

What is the p score if everyone gets the item correct (low difficulty)?

A

p= 1.00

p= 100 people got it right / 100 people

55
Q

What is the p score if everyone gets the item wrong (high difficulty)?

A

p= 0.00

p= 0 people got it right / 100 people

56
Q

What does calculating item difficulty tell us?

A

The relative frequency with which examinees choose the correct response.
It is a characteristic of both the item and the population taking the test: if we give the item to two different groups, the difficulty may not be the same.
The difficulty of items can be compared across domains.

57
Q

When is variability maximized?

A

When the item difficulty is closer to .50

58
Q

Bandwidth-fidelity dilemma, peaked

A

It is difficult for a test to measure all people well. Generally, it measures some people at a specific ability level better than others.
A peaked conventional test can provide high fidelity (i.e., precision) where it is peaked, but little bandwidth (i.e., it does not differentiate very well among individuals at other positions on the scale).

59
Q

Bandwidth-fidelity dilemma, rectangle

A

A rectangular distribution tries to have a few questions for each difficulty level, so that the average difficulty level is around .50. Thus, it will help differentiate people no matter what level they are on the trait. The test will only have a few items at each difficulty level, so it won’t be able to differentiate between the individuals at the various levels well.

60
Q

Problems with guessing

A

Guessing inflates the p value: a p value of .60 really means that among the 60% who answered the item correctly, a certain percentage answered correctly by lucky guessing.

61
Q

Ways to minimize the problems of guessing

A

score = right − wrong / (k − 1), where k is the number of answer choices per item

The more answer choices, the smaller the role of guessing (the chance of a lucky guess is 50% on a true-false item but 20% on a five-option multiple-choice item).
Tell all candidates to do the same thing: guess when unsure, leave doubtful items blank, etc.
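
A sketch of the correction formula in Python; the function name and numbers are illustrative:

```python
# score = right - wrong / (k - 1), where k = number of answer choices.
def corrected_score(right: int, wrong: int, k: int) -> float:
    """Penalize wrong answers to offset expected lucky guesses;
    omitted items are simply not counted."""
    return right - wrong / (k - 1)

# 60 right and 20 wrong on a 5-choice test (any omits are ignored):
print(corrected_score(right=60, wrong=20, k=5))  # 55.0
```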

62
Q

Item Discrimination

A

Item discrimination refers to the ability of an item to correctly "discriminate" between those who are higher on the variable in question and those who are lower.
We expect that those people who do well overall on the test will also do well on individual items. We also expect the opposite to be true.

63
Q

How do we determine what are the high scores and what are the low scores?

A

Option 1: find the median and label everything above it high and everything below it low.
Advantage: we use all the data we have.
Disadvantage: there is a lot of "noise" at the center of the distribution.

Option 2: label the top five high and the bottom five low.
Advantage: extreme scores are unlikely to change on retest, are likely not a result of guessing, and probably represent "real-life" correspondence.
Disadvantage: the sample is so small that we can't be sure any calculations performed are stable.

Resolution: select roughly the upper 27% and the lower 27%.
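
A small Python sketch of that resolution, forming high and low groups from roughly the upper and lower 27% of scores (names and data are illustrative):

```python
# Split total test scores into extreme groups for item analysis.
def split_extreme_groups(total_scores, fraction=0.27):
    ordered = sorted(total_scores)
    cut = max(1, round(len(ordered) * fraction))
    return ordered[-cut:], ordered[:cut]  # (high group, low group)

scores = [55, 62, 70, 71, 74, 78, 80, 83, 85, 90, 92, 95]
high, low = split_extreme_groups(scores)
print(high, low)  # [90, 92, 95] [55, 62, 70] -- 3 of 12 is roughly 27%
```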

64
Q

Index of Discrimination (D)

A

The index of discrimination is expressed as a percentage and is computed from two percentages: it is simply the difference between them.
This method breaks the test takers into the top test scores and the bottom test scores. We compare the number of people in each group who answered the item correctly. If the item is doing a good job of discriminating between the two groups, then more of the high scorers will answer correctly than the low scorers.
If this is an item where there is a correct answer, a negative D would alert us that there is something wrong with the item, that it needs to be rewritten. If this were an item from a personality test where there is no correct answer, the negative D would in fact tell us that we need to reverse the scoring.
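
A minimal sketch of computing D, assuming we already know how many people in each group answered the item correctly (the counts are made up):

```python
# D = percentage correct in the high group minus percentage correct
# in the low group.
def discrimination_index(high_correct, high_n, low_correct, low_n):
    return high_correct / high_n - low_correct / low_n

# 24 of 30 high scorers vs. 9 of 30 low scorers answered correctly:
d = discrimination_index(24, 30, 9, 30)
print(f"D = {d:.2f}")  # D = 0.50 -- the item separates the groups well
```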

65
Q

internal consistency

A

If we use the total test score as our criterion, then we will be retaining items that tend to be homogeneous, that is, items that tend to correlate highly with each other.

66
Q

external criterion

A

If we use an external criterion, that criterion will most likely be more complex psychologically than the total test score. For example, teachers' evaluations of being "good at math" may reflect not only math knowledge, but how likable the child is.

67
Q

Item-Total Correlation

A

This statistic is the simple correlation between the score on an item (a correct response usually receives a score of 1; an incorrect response receives a score of 0), and the total test score
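
A hedged sketch of this statistic in Python (the item and total scores are invented for illustration):

```python
# Item-total correlation: Pearson r between 0/1 item scores and totals.
import statistics

item = [1, 1, 0, 1, 0, 0]        # 1 = correct, 0 = incorrect
total = [18, 16, 9, 15, 11, 8]   # total test scores for the same people

r = statistics.correlation(item, total)  # requires Python 3.10+
print(round(r, 2))  # positive: those who pass the item score higher overall
```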

68
Q

Meanings of Item-Total Correlation

A

A positive item-total correlation indicates that the item successfully discriminates between those who do well on the test and those who do poorly.
An item-total correlation near zero indicates that the item doesn't differentiate between high and low scorers.
A negative item-total correlation indicates that the item scores and the overall test scores disagree: those who do well on an item with a negative item-total correlation do poorly on the test.

69
Q

Low Interitem Correlations

A

First, the item we are looking at may not be correlated with the other items in the test. If we want the test to be homogeneous, then we should consider dropping the item or rewriting it so that it assesses content similar to the other items.
Second, the item may show positive correlations with some items, but zero or negative correlations with other items on the test. This can occur if the test measures more than one attribute.

70
Q

What are Interitem Correlations

A

Compute the correlations among all the items.
You can use this information to compute the reliability of a test, given the average interitem correlation and the number of items on the test.
You can also use this information to help you interpret the item discrimination numbers you found.
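
One standard way to carry out the reliability computation mentioned here is the generalized Spearman-Brown formula (standardized alpha); the text does not name a specific formula, so this sketch is an assumption:

```python
# reliability = (k * r_bar) / (1 + (k - 1) * r_bar), where k is the
# number of items and r_bar is the average interitem correlation.
def reliability_from_interitem(k: int, r_bar: float) -> float:
    return (k * r_bar) / (1 + (k - 1) * r_bar)

# 20 items with an average interitem correlation of .25:
print(round(reliability_from_interitem(20, 0.25), 2))  # 0.87
```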

71
Q

What are the two Philosophies of Testing?

A

Factor analysis – tests should be pure measures of the dimension being assessed. Items are selected statistically and correlate highly with each other. The scale is homogeneous.
Con: useful for understanding a psychological phenomenon, but may not relate to real-world behavior.

Empiricism – scales should predict real-life behavior. Items are dropped or kept depending on whether they correlate with the criterion. The scale is heterogeneous.

72
Q

Item Response Theory (IRT)

A

In "classical" test theory, a test score is made up of two parts: a "true" score plus random "error." The more a person has of the variable, the more likely the person will answer the question correctly.

IRT also has a basic assumption: that performance on a test is a function of an unobservable proficiency variable.

In classical theory, the characteristics of a test item, such as item difficulty, are a function of the particular sample to whom the item was administered. Certain vocabulary words are harder for 2nd graders than they are for college students.

IRT, on the other hand, uses a theoretical mathematical model that unites the characteristics of an item, such as item difficulty, to an underlying hypothesized dimension.

IRT is concerned with the interplay of four aspects:
(1) the ability of the individual on the variable being assessed
(2) the extent to which a test item discriminates between high- and low-scoring groups
(3) the difficulty of the item
(4) the probability that a person of low ability on that variable makes the correct response
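
The chapter names these four ingredients without giving an equation; one common IRT formulation that combines them is the three-parameter logistic (3PL) model, sketched below as an assumption, with theta = ability, a = discrimination, b = difficulty, and c = the low-ability guessing floor:

```python
# 3PL model: P(correct) = c + (1 - c) / (1 + exp(-a * (theta - b)))
import math

def p_correct(theta: float, a: float, b: float, c: float) -> float:
    """Probability that a person of ability theta answers correctly."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# A discriminating (a = 1.5), medium-difficulty (b = 0) item with a
# .20 guessing floor (e.g., five-option multiple choice):
for theta in (-2, 0, 2):
    print(theta, round(p_correct(theta, a=1.5, b=0.0, c=0.2), 2))
# -2 -> 0.24, 0 -> 0.6, 2 -> 0.96: probability rises with ability
```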

73
Q

Why do we have norms?

A

We need to have some way to make sense of a test score.
We need to be able to compare a score with the scores of others who have taken the test.
Usually we compare scores with those obtained from a normative sample.

74
Q

How are norms selected?

A

In a perfect world, tests are administered to a representative group on the basis of random sampling, and normative groups are formed. From this data, we can learn what average scores are to be expected from particular samples.
Norms can be formed on the basis of random sampling or on the basis of certain criteria.

Stratified sampling is used when we test a normative sample that reflects specific percentages of relevant subgroups.
A sample of convenience is more typical.
Neither is random, nor necessarily representative.

75
Q

Age Norms

A

Age norms relate a level of test performance to the age of the people who have taken the test.
In establishing age norms, we need to obtain a representative sample at each of several ages and to measure the particular age-related characteristic in each of these samples.
We usually focus on the median because it shows what the typical performance level is at each age.
Remember that there is considerable variability within the same age.

76
Q

School Grade Norms

A

Very similar to age norms, except the baseline is the grade level rather than the age.
We need to be careful when interpreting scores with grade-level norms.
A child at a lower grade may get a score that is the grade equivalent of a higher grade, but it doesn't mean that the child should be in that higher grade.
The higher score may hold only for a subset of material and doesn't translate to all areas of what a child of that grade can do.

77
Q

Cautions for Interpreting Norms

A

Norms can be based on inappropriate target populations.
Test manuals can be based on samples that don’t adequately represent the populations to which the examinee’s scores should be compared.
Normative data can become out of date quickly.
The sample size of the norm group may be small, which can produce more sampling error than a larger sample would.

78
Q

Expectancy Tables

A

Expectancy tables present data showing the relationship between test scores and some other variable, based on the experience of members of the norm group.
They show what can be expected of a person with a particular score.
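
A minimal sketch of building such a table from (score band, outcome) records; the data and band labels are invented for illustration:

```python
# Expectancy table: percentage succeeding on the criterion per score band.
from collections import Counter

records = [("high", True), ("high", True), ("high", False),
           ("mid", True), ("mid", False), ("mid", False),
           ("low", True), ("low", False), ("low", False)]

totals = Counter(band for band, _ in records)
successes = Counter(band for band, ok in records if ok)
for band in ("high", "mid", "low"):
    print(f"{band}: {successes[band] / totals[band]:.0%} succeeded")
# high: 67%, mid: 33%, low: 33% -- what to expect at each score level
```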

79
Q

Relativity of Norms

A

The meaning given to a score may change depending on which set of norms you compare it to.

80
Q

Local Norms

A

Sometimes it may be more appropriate to compare a score to a set of local norms.
These data are gathered from a local group of individuals.
It may depend on what the scores will be used for and whether decisions are to be made using the scores.

81
Q

Criterion-Referenced Testing

A

You assess performance in comparison with some standard or set of standards, not in comparison with what others can do.
We must first of all be able to specify the criterion.
Second, criteria are not usually arbitrary, but are based on real-life observation. Criterion-referenced decisions can be normative decisions, often with the norms not clearly specified.
Lastly, criterion-referenced and norm-referenced refer to how the scores or test results are interpreted, rather than to the tests themselves.

In this course, I could develop a list of topics that I expect each student to master and then assess their ability to meet these standards.
It is very difficult to develop good criterion-referenced tests.
Specifying standards and determining whether people meet or exceed them is still an evolving practice.

82
Q

What is the difference between psychometric and edumetric?

A

Carver (1974) used the terms psychometric to refer to norm referenced and edumetric to refer to criterion referenced.

He argued that the psychometric approach focuses on individual differences, and that item selection and the assessment of reliability and validity are determined by statistical procedures

The edumetric approach, on the other hand, focuses on the measurement of gain or growth of individuals, and item selection, reliability and validity, all center on the notion of gain or growth.

83
Q

What are the four ways to combine test scores?

A

Combining scores using statistics: convert test scores to z scores so that they can be compared and combined.

Combining scores using clinical intuition: e.g., a college admissions officer deciding "accept" or "reject" by combining test scores, GPA, recommendations, etc.

Multiple cutoff scores: for college admissions, a person may need a GPA of 3.0; anyone lower will not be considered. Cutoffs can be determined by clinical judgment or statistical evidence. Sometimes a high score in one area can compensate for a low score in another, but not always.

Multiple regression: essentially expresses the relationship between a set of variables and a particular outcome that is being predicted, giving differential weighting to each of the variables. First, it is a compensatory model; that is, high scores on one variable can compensate for low scores on another. Second, it is a linear model; that is, it assumes that as scores increase on one variable (for example, IQ), scores will increase on the predicted variable (for example, GPA). Third, the variables that become part of the regression equation are those that have the highest correlations with the criterion and low correlations with the other variables in the equation.
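
A short Python sketch of the first and third strategies (statistical combination via z scores, and a multiple cutoff); the applicant data are invented:

```python
# Combining scores: equal-weight z-score sum, plus a GPA cutoff screen.
import statistics

def to_z(scores):
    mu, sd = statistics.mean(scores), statistics.pstdev(scores)
    return [(x - mu) / sd for x in scores]

gpas = [3.9, 3.1, 2.8]
sats = [1200, 1400, 1050]

# Statistical combination: sum of z scores puts both measures on one scale.
combined = [g + s for g, s in zip(to_z(gpas), to_z(sats))]

# Multiple cutoff: GPA must be at least 3.0, regardless of other scores.
eligible = [gpa >= 3.0 for gpa in gpas]

print([round(c, 2) for c in combined])  # composite standing of each applicant
print(eligible)                         # [True, True, False]
```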