Study Guide 3: Test Development, Pilot Testing, and Item Analysis Flashcards Preview

Flashcards in Study Guide 3: Test Development, Pilot Testing, and Item Analysis Deck (27):

Acquiescence bias

tendency to agree with any ideas or behaviors presented. Ex. Labeling each statement as true. Solution: balance the scale with items keyed true and items keyed false.


Cronbach’s Alpha-if-item-deleted

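tells us the reliability estimate (Cronbach’s alpha) that we would obtain for the test if a given item were dropped. It is computed for each item in turn; if alpha rises when an item is removed, that item is weakening the test’s internal consistency.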
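
As a rough illustration, alpha-if-item-deleted can be sketched with the standard Cronbach's alpha formula, k/(k-1) * (1 - sum of item variances / variance of total scores). The function names and the small data set are hypothetical, and population variances are assumed throughout (the ratio is the same with sample variances, as long as one kind is used consistently).

```python
from statistics import pvariance

def cronbach_alpha(responses):
    """Cronbach's alpha for rows of item scores (one row per test taker)."""
    k = len(responses[0])  # number of items
    item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def alpha_if_item_deleted(responses):
    """Alpha recomputed with each item left out in turn."""
    k = len(responses[0])
    return [
        cronbach_alpha([row[:i] + row[i + 1:] for row in responses])
        for i in range(k)
    ]

# Hypothetical data: 5 test takers, 3 dichotomous items (1 = correct).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(responses)
if_deleted = alpha_if_item_deleted(responses)
print(round(alpha, 3))                    # 0.577
print([round(a, 3) for a in if_deleted])  # [0.571, 0.571, 0.286]
```

In this made-up example, dropping the third item lowers alpha the most, so that item contributes most to internal consistency.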


Balanced scale

having a balance of questions that require true or false answers to avoid acquiescence.


Corrected item-total correlation

This is the correlation between an item and the rest of the exam, without that item considered part of the exam. If the correlation is low for an item, this means the item isn't really measuring the same thing the rest of the exam is trying to measure.
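
A minimal sketch of the calculation, assuming dichotomous items and a Pearson correlation: the item is correlated with the total of the remaining items. The helper names and data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def corrected_item_total(responses, item):
    """Correlate one item with the total score of all the other items."""
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    return pearson(item_scores, rest_scores)

# Hypothetical data: 5 test takers, 3 dichotomous items (1 = correct).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
r_corrected = corrected_item_total(responses, 0)
print(round(r_corrected, 3))  # 0.327
```

For the same item, the uncorrected item-total correlation in this data set is about .72, which shows how much an item can inflate its own correlation with the total.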



Cross-validation

A method of testing validity by using more than one sample of people from the same population.



Distractors

the incorrect options on a multiple-choice question, designed to appear correct to someone who doesn’t know the correct answer.


Extreme (or moderate) response bias

the tendency to use or avoid extreme response options.


Inter-item correlation matrix

displays the correlation of each item with every other item. When the items are dichotomous (e.g., scored correct/incorrect), each correlation is a phi coefficient, the result of correlating two dichotomous variables. Ideally, each item should be highly correlated with every other item to increase the test’s internal consistency.
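
The matrix can be sketched as follows. The example relies on the fact that a Pearson correlation computed on two 0/1 variables equals the phi coefficient; the helper names and data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; on two 0/1 variables this equals phi."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def inter_item_matrix(responses):
    """Correlation of every item with every other item."""
    k = len(responses[0])
    columns = [[row[i] for row in responses] for i in range(k)]
    return [[pearson(columns[i], columns[j]) for j in range(k)] for i in range(k)]

# Hypothetical data: 5 test takers, 3 dichotomous items (1 = correct).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
matrix = inter_item_matrix(responses)
for row in matrix:
    print([round(r, 3) for r in row])
# [1.0, 0.167, 0.408]
# [0.167, 1.0, 0.408]
# [0.408, 0.408, 1.0]
```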


Item bias

an item’s being easier for one group than for another. Item characteristic curves are useful for determining this.


Item characteristic curve

the line that results when we graph the probability of answering an item correctly against the level of ability on the construct being measured. It provides a picture of both the item’s difficulty and its discrimination. Difficulty is determined by the location of the point at which the curve indicates a probability of .5 (a 50-50 chance) of answering correctly: the higher the ability level associated with this point, the more difficult the question. Discrimination is shown by the slope of the curve: the steeper the slope, the better the item discriminates.
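
One way to picture such a curve is a two-parameter logistic function. This model is an assumption made here for illustration; the card does not name a model. In the sketch, `b` is the ability level at which the probability is .5 (difficulty) and `a` controls the slope (discrimination); both names are hypothetical.

```python
from math import exp

def icc(theta, a, b):
    """Probability of a correct answer at ability `theta`.

    Assumed two-parameter logistic form: `a` sets the slope
    (discrimination), `b` sets the location (difficulty).
    """
    return 1.0 / (1.0 + exp(-a * (theta - b)))

# At theta == b the probability of answering correctly is exactly .5.
print(icc(1.0, a=1.0, b=1.0))  # 0.5

# For a test taker of average ability (theta = 0), an item with a low
# difficulty point is more likely to be answered correctly than one
# with a high difficulty point.
easy = icc(0.0, a=1.0, b=-1.0)
hard = icc(0.0, a=1.0, b=1.0)
print(round(easy, 3), round(hard, 3))  # 0.731 0.269
```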


Item difficulty

the proportion of test takers who respond correctly. Calculated as a p value by dividing the number of persons who answered correctly by the total number of persons who responded to the question. Items with p values of .5 yield distributions of test scores with the most variation. Items with p values around .2 or below may be too difficult, and items with p values of .8 to 1.0 may be too easy.
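
A short sketch of the p value calculation. It assumes every test taker responded to every item, so the denominator is simply the number of rows; the function name and data are hypothetical.

```python
def item_difficulty(responses):
    """Return the p value (proportion correct) for each item.

    `responses` holds one row per test taker; each row contains
    1 (correct) or 0 (incorrect) for every item.
    """
    n_takers = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_takers for i in range(n_items)]

# Hypothetical data: 5 test takers, 3 items.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
]
p_values = item_difficulty(responses)
print(p_values)  # [0.8, 0.4, 0.2]
```

Here the first item would be flagged as possibly too easy and the third as possibly too difficult.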


Item discrimination

the degree to which an item differentiates people who score high on the total test from those who score low on the total test. It also indicates how an item might affect the test’s internal consistency. High values are preferred.


Item discrimination index

a statistic that compares the performance of the upper group, who had very high test scores, with the performance of the lower group, who had low test scores. D = U - L, where U = (number in upper group who responded correctly) / (total number in upper group) and L = (number in lower group who responded correctly) / (total number in lower group). The upper and lower groups may be determined by the 25th to 35th percentiles.
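
The D statistic can be sketched as below. The group size is an assumption: the top and bottom 30 percent of total scores are used here, which is one reading of the 25 to 35 range mentioned on the card. The function name and data are hypothetical.

```python
def discrimination_index(item_scores, total_scores, fraction=0.3):
    """D = U - L for one dichotomous item.

    `fraction` is the share of test takers placed in each extreme
    group (an assumed value, not one fixed by the source).
    """
    n = len(total_scores)
    k = max(1, round(n * fraction))  # size of each group
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    u = sum(item_scores[i] for i in upper) / k  # proportion correct, upper group
    l = sum(item_scores[i] for i in lower) / k  # proportion correct, lower group
    return u - l

# Hypothetical data: 10 test takers.
total_scores = [9, 8, 8, 6, 5, 5, 4, 3, 2, 1]
item_scores = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]
d = discrimination_index(item_scores, total_scores)
print(round(d, 3))  # 0.667
```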


Item-total correlation

a way of operationalizing an item’s discrimination. Compute the total score on a test, and then compute the correlation between an item and this total test score. It reflects the degree to which differences among persons’ responses to the item are consistent with differences in their total test scores.
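
A minimal sketch of the two steps just described, assuming a Pearson correlation and dichotomous items. The helper names and data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def item_total_correlation(responses, item):
    """Correlate one item with the total test score (item included)."""
    item_scores = [row[item] for row in responses]
    total_scores = [sum(row) for row in responses]
    return pearson(item_scores, total_scores)

# Hypothetical data: 5 test takers, 3 dichotomous items (1 = correct).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
r_item_total = item_total_correlation(responses, 0)
print(round(r_item_total, 3))  # 0.721
```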



Malingering

faking bad: respondents attempt to exaggerate their psychological problems.


Objective test format

a test format that has one response designated as correct, or as providing evidence of a specific construct. Ex. Multiple-choice, true/false, and fill-in-the-blank tests. Because these can be reliably scored and related to the test plan and the construct, they facilitate the test’s reliability and validity.


Quantitative item analysis

statistical analyses of the responses test takers gave to individual items.


Random responding

results when test takers are unwilling or unable to respond to test items accurately. They may respond randomly by answering without reading or considering the questions. This may happen when they cannot read the items, lack other required skills, or lack the desire to take the test.


Response styles

tied to stable characteristics of individuals (e.g., some individuals are more concerned in general about appearing socially desirable than others).


Reverse scoring

Items that are reverse scored will invert the numeric scale when the data is stored; however, it will not be possible for your participants to see the difference between reverse scored questions and regular questions. For example, a question would present the scale in ascending order 1, 2, 3, 4, 5, 6, 7. However, if the participant chooses option 7, the data will be stored as a 1. A 6 would be scored as a 2, etc. This feature allows you to recreate psychology measures which frequently use reverse scored items in order to prevent the participants from determining the intent of the survey. When filling in the Starting Value for a reverse-scored item, use the lowest number that would appear on the scale. On a normal scale this number will correspond with the left-most data option. On a reversed scale, this number will correspond with the right-most data option.
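
The inversion described above amounts to subtracting the response from the sum of the lowest and highest scale points. A minimal sketch, assuming a 1 to 7 scale as in the example; the function name and defaults are hypothetical.

```python
def reverse_score(value, low=1, high=7):
    """Invert a response on a scale running from `low` to `high`."""
    if not low <= value <= high:
        raise ValueError(f"response {value} is outside the {low}-{high} scale")
    return low + high - value

print(reverse_score(7))  # 1
print(reverse_score(6))  # 2
print([reverse_score(v) for v in range(1, 8)])  # [7, 6, 5, 4, 3, 2, 1]
```

The midpoint of the scale (4 on a 1 to 7 scale) is unchanged, and applying the function twice returns the original response.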



Scaling

the process of constructing a score scale that associates numbers or other ordered indicators with the performance of examinees (Kolen & Brennan, 2004). These numbers and ordered indicators are intended to reflect increasing levels of achievement or proficiency. The process of scaling produces a score scale, and the resulting scores are referred to as scale scores.


Social desirability bias

tendency of some test takers to choose answers that are socially acceptable or that present themselves in a favorable light.


Standards for Educational and Psychological Testing

(1999) a set of testing standards developed jointly by the American Educational Research Association (AERA), American Psychological Association (APA), and the National Council on Measurement in Education (NCME). Revised significantly from the 1985 version, the 1999 Standards for Educational and Psychological Testing has more in-depth background material in each chapter, a greater number of standards, and a significantly expanded glossary and index. The 1999 version Standards reflects changes in United States federal law and measurement trends affecting validity; testing individuals with disabilities or different linguistic backgrounds; and new types of tests as well as new uses of existing tests. The Standards is written for the professional and for the educated layperson and addresses professional and technical issues of test development and use in education, psychology and employment.


Subjective test format

a test format that does not have a designated correct response. Interpretation of the response as correct, or as providing evidence of a specific construct, is left to the judgment of the person who scores or interprets the test taker’s responses. Ex. Projective tests, open-ended or essay questions, employment interviews. Documenting the validity and reliability of these types of tests is harder.


Validity scales

sets of items embedded within a larger inventory that are intended to quantify the degree to which a respondent is manifesting specific response biases.


Response sets

factors that reflect the testing situation (e.g., consequences of testing) or the test itself (e.g., test format or ambiguity of items).


Qualitative item analysis

analysis based on non-statistical feedback about the test. Ex. Test takers complete a questionnaire about how they viewed the test itself and how they answered the questions. Other ways to gain qualitative information include individual or group discussions with test takers, to understand how they perceive the test and how changes in the test content or administration instructions would improve the accuracy of test results, and asking a panel of experts to review the test.