Test Construction Flashcards

1
Q

psychological test

A

an objective and standardized measure of a sample of behavior

2
Q

standardization

A

uniformity of procedure in administering and scoring the test;
test conditions and scoring procedures should be the same for all examinees

3
Q

norms

A

the scores of a representative sample of the population on a particular test;
interpretation of most psychological tests involves comparing an individual’s test score to norms

4
Q

conceptual points about norms

A

1) norms are obtained from a sample that is truly representative of the population for which the test is designed;
2) to be truly representative, a sample must be reasonably large;
3) examinee’s score should be compared to the scores obtained by a representative sample of the population to which he or she belongs;
4) norm-referenced scores indicate an examinee’s standing on a test as compared to other persons, which permits comparison of an individual’s performance on different tests;
5) norms don’t provide a universal standard of “good” or “bad” performance - they represent the performance of persons in the standardization sample

5
Q

objective

A

administration, scoring, and interpretation of scores are “independent of the subjective judgment of the particular examiner”;
the examinee will obtain the same score regardless of who administers or scores the test

6
Q

sample of behavior

A

the test measures a representative sample of the behavior domain in question rather than the entire domain

7
Q

reliability

A

yields repeatable, dependable, and consistent results;
yields scores that reflect examinees’ true scores on whatever attribute it measures

8
Q

validity

A

measures what it purports to measure

9
Q

maximum performance

A

tells us about an examinee’s best possible performance, or what a person can do;
achievement and aptitude tests

10
Q

typical performance

A

tells us what an examinee usually does or feels;
interest and personality tests

11
Q

pure speed (speeded) test

A

the examinee’s response rate is assessed;
have time limits and consist of items that all (or almost all) examinees would answer correctly if given enough time

12
Q

power test

A

assesses the level of difficulty a person can attain;
no time limit or a time limit that permits most or all examinees to attempt all items;
items are arranged in order from least difficult to most difficult

13
Q

mastery tests

A

designed to determine whether a person can attain a pre-established level of acceptable performance;
“all or none” score (e.g., pass/fail);
commonly employed to test basic skills (e.g., basic reading, basic math) at the elementary school level

14
Q

ipsative measure

A

the individual examinee (as opposed to a norm group or external criterion) is the frame of reference in score reporting;
scores are reported in terms of the relative strength of attributes within the individual examinee;
scores reflect which needs are strongest or weakest within the examinee, rather than as compared to a norm group;
examinees express a preference for one item over others rather than responding to each item individually (e.g., they are required to choose which of 2 statements appeals to them most)

15
Q

normative measures

A

provide a measure of the absolute strength of each attribute measured by the test;
examinees answer every item;
score can be compared to those of other examinees

16
Q

classical test theory

A

a given examinee’s obtained test score consists of two components: true score and measurement error

17
Q

true score

A

reflects the examinee’s actual status on whatever attribute is being measured by the test

18
Q

error (measurement error)

A

factors that are irrelevant to whatever is being measured; random;
does not affect all examinees in the same way

19
Q

reliability coefficient

A

a correlation coefficient that ranges in value from 0.0 to +1.0;
indicates the proportion of variability that is true score variability;
0.0 - test is completely unreliable; observed variability (differences) in test scores due entirely to random factors;
1.0 - perfect reliability; no error - all observed variability reflects true variability;
.90 - 90% of observed variability in obtained test scores due to true score differences among examinees and the remaining 10% of observed variability represents measurement error;
unlike other correlation coefficients, it is not squared when interpreted - the coefficient itself gives the proportion of true score variability

20
Q

test-retest reliability coefficient (“coefficient of stability”)

A

administering the same test to the same group of people, and then correlating scores on the first and second administrations

21
Q

“time sampling”

A

factors related to time that are sources of measurement error for the test-retest coefficient;
from one administration to the next, there may be changes in exam conditions (noise, weather) or in examinee factors such as illness, fatigue, worry, etc.

22
Q

practice effects

A

doing better the second time around due to practice

23
Q

drawbacks of test-retest reliability coefficient

A

examinees systematically tend to remember their previous responses;
not appropriate for assessing the reliability of tests that measure unstable attributes (e.g., mood);
recommended only for tests that are not appreciably affected by repetition; very few psychological tests fall into this category

24
Q

alternate forms (equivalent forms or parallel forms) reliability coefficient

A

administering two equivalent forms of a test to the same group of examinees, and then obtaining the correlation between the two sets of scores

25
drawbacks of alternate forms reliability coefficient
tends to be lower than the test-retest reliability coefficient; sources of measurement error: differences in content between the 2 forms (some examinees do better on Form A, others do better on Form B) and the passage of time, since the two forms cannot be administered at the same time; impractical and costly to construct two versions of the same test; should not be used to assess the reliability of a test that measures an unstable trait
26
internal consistency
obtaining correlations among individual items; split-half reliability, Cronbach’s coefficient alpha, Kuder-Richardson Formula 20; administer the test once to a single group of examinees
27
split-half reliability
dividing the test in two and obtaining a correlation between the halves as if they were two shorter tests
28
Spearman-Brown formula
estimates the effect that shortening (or lengthening) a test will have on the reliability coefficient
29
drawbacks of split-half reliability
the correlation will vary depending on how the items are divided; splitting the test in this manner artificially lowers the reliability coefficient, since the longer a test is, the more reliable it will be - so the Spearman-Brown formula (illustrated below) is used to correct the obtained coefficient
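A minimal Python sketch (not from the source; values are hypothetical) of the Spearman-Brown prophecy formula referenced in the two cards above, applied to the common case of correcting a split-half correlation for test length.

```python
# Spearman-Brown prophecy formula: estimated reliability when a test is
# lengthened (or shortened) by a given factor.
def spearman_brown(r: float, length_factor: float) -> float:
    return (length_factor * r) / (1 + (length_factor - 1) * r)

# Example: the two half-tests correlate .70; the full test is twice as long,
# so the split-half coefficient is corrected with length_factor = 2.
half_test_r = 0.70
print(round(spearman_brown(half_test_r, length_factor=2), 3))  # 0.824
```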
30
Kuder-Richardson Formula 20 (KR-20)
indicates the average degree of inter-item consistency; used when the test items are dichotomously scored (right/wrong, yes/no)
31
coefficient alpha
indicates the average degree of inter-item consistency; used for tests whose items have multiple scored response options (e.g., “usually”, “sometimes”, “rarely”, “never”)
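A hypothetical numpy sketch (variable names and data are illustrative, not from the source) of coefficient alpha computed from an examinees x items score matrix; with dichotomously scored items the same computation corresponds to KR-20.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha for a matrix with rows = examinees, columns = items."""
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total test scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Example: 5 examinees x 4 dichotomous (0/1) items
scores = np.array([[1, 1, 1, 0],
                   [1, 0, 1, 1],
                   [0, 0, 1, 0],
                   [1, 1, 1, 1],
                   [0, 0, 0, 0]])
print(round(cronbach_alpha(scores), 2))  # 0.79
```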
32
pros and cons of internal consistency reliability
pros: good for assessing the reliability of tests that measure unstable traits or are affected by repeated administration; cons: the major source of measurement error is item heterogeneity; inappropriate for assessing the reliability of speed tests
33
content sampling, or item heterogeneity
the degree to which items differ in terms of the content they sample
34
interscorer (or inter-rater) reliability
calculating a correlation coefficient between the scores of two different raters
35
kappa coefficient
measure of the agreement between two judges who each rate a set of objects using nominal scales
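An illustrative sketch (categories and ratings are made up) of Cohen's kappa, which corrects the two raters' percent agreement for the agreement expected by chance alone.

```python
from collections import Counter

def cohen_kappa(rater_a: list, rater_b: list) -> float:
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n   # observed agreement
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)  # chance agreement
    return (observed - expected) / (1 - expected)

# Example: two raters classify 8 observed behaviors into nominal categories
a = ["on-task", "on-task", "off-task", "on-task", "off-task", "on-task", "on-task", "off-task"]
b = ["on-task", "off-task", "off-task", "on-task", "off-task", "on-task", "on-task", "on-task"]
print(round(cohen_kappa(a, b), 2))  # 0.43
```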
36
mutually exclusive categories
a particular behavior clearly belongs to one and only one category
37
exhaustive categories
the categories cover all possible responses or behaviors
38
duration recording
rater records the elapsed time during which the target behavior or behaviors occur
39
frequency recording
observer keeps count of the number of times the target behavior occurs; useful for recording behaviors of short duration and those where duration is not important
40
interval recording
observing a subject at a given interval and noting whether the subject is engaging or not engaging in the target behavior during that interval; useful for behaviors that do not have a fixed beginning or end
41
continuous recording
recording all the behavior of the target subject during each observation session
42
standard error of measurement (σmeas)
indicates how much error an individual test score can be expected to have; used to construct a confidence interval
43
confidence interval
the range within which an examinee’s true score is likely to fall, given his or her obtained score
44
SEM formula
σmeas = SDx × √(1 − rxx), where σmeas = standard error of measurement, SDx = standard deviation of test scores, and rxx = reliability coefficient
45
CI formulas
68% CI = obtained score ± (1.0)(σmeas); 95% CI = obtained score ± (1.96)(σmeas); 99% CI = obtained score ± (2.58)(σmeas)
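A minimal Python sketch (all numbers hypothetical) of the SEM formula above and its use in building a 95% confidence interval around an obtained score.

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD_x * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1 - reliability)

sd_x, r_xx, obtained = 15, 0.89, 110         # e.g., a scale with SD = 15
s = sem(sd_x, r_xx)                          # ≈ 4.97
ci_95 = (obtained - 1.96 * s, obtained + 1.96 * s)
print(round(s, 2), tuple(round(x, 1) for x in ci_95))  # 4.97 (100.2, 119.8)
```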
46
factors affecting reliability
1. short tests are less reliable than longer tests;
2. as the group taking a test becomes more homogeneous, the variability of the scores - and hence the reliability coefficient - decreases;
3. if test items are too difficult, most people will get low scores; if items are too easy, most people will get high scores; either way, score variability decreases, resulting in a lower reliability coefficient;
4. the higher the probability that examinees can guess the correct answer to items, the lower the reliability coefficient;
5. for inter-item consistency measured by the KR-20 or coefficient alpha methods, reliability increases as the items become more homogeneous
47
content validity
the extent to which the test items adequately and representatively sample the content area to be measured; educational achievement tests, work samples, EPPP
48
assessment of content validity
judgment and agreement of subject matter experts; high correlation with other tests that purport to sample the same content domain; students who are known to have succeeded in learning a particular content domain do well on a test designed to sample that domain
49
face validity
appears valid to examinees who take it, personnel who administer it, and other technically untrained observers
50
criterion-related validity
useful for predicting an individual’s behavior in specified situations; applied situations (select employees, college admissions, place students in special classes)
51
criterion-related validity coefficient
a correlation coefficient (Pearson r) is used to determine the correlation between the predictor and the criterion
52
criterion-related validity coefficient formula
rxy, where x refers to the predictor and y refers to the criterion
53
validation
the procedures used to determine how valid a predictor is
54
concurrent validation
the predictor and the criterion data are collected at or about the same time; a test that is useful for estimating a current behavior is said to have high concurrent validity; the focus is on current status on the criterion
55
predictive validation
scores on the predictor are collected first, and the criterion data are collected at some future point; a test that is useful for forecasting a future behavior is said to have high predictive validity; the focus is on predicting future status on the criterion
56
standard error of estimate (or σest)
estimates the range in which a person’s actual criterion score is likely to fall, given the criterion score predicted for him/her by a predictor
57
standard error of estimate formula
σest = SDy × √(1 − rxy²), where σest = standard error of estimate, SDy = standard deviation of criterion scores, and rxy = validity coefficient
58
CI for standard error of estimate
68% = ± (1)(σest) of predicted criterion score; 95% = ± (1.96)(σest); 99% = ±(2.58)(σest )
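A parallel sketch (hypothetical values) for the standard error of estimate and a 68% confidence interval around a predicted criterion score.

```python
import math

def se_estimate(sd_y: float, validity: float) -> float:
    """Standard error of estimate: SD_y * sqrt(1 - r_xy^2)."""
    return sd_y * math.sqrt(1 - validity ** 2)

sd_y, r_xy, predicted = 10, 0.60, 75        # criterion SD, validity coefficient, predicted criterion score
see = se_estimate(sd_y, r_xy)               # 8.0
ci_68 = (predicted - see, predicted + see)  # (67.0, 83.0)
print(see, ci_68)
```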
59
differences between standard error of estimate and standard error of measurement
1) the SEM is related to the reliability coefficient; the SEE is related to the validity coefficient;
2) the SEM is used to estimate where an examinee’s true test score is likely to fall, given the obtained score on that same test (no predictor measure is involved); the SEE is used to determine where an examinee’s actual criterion score is likely to fall, given the criterion score that was predicted by another measure (a predictor is being used)
60
criterion cutoff
whether or not a person will meet or exceed a certain minimum standard of criterion performance
61
predictor cutoff score
if the examinee scores at or above the predictor cutoff score he or she is selected, but if the examinee scores below the predictor cutoff score, he or she is rejected
62
True Positives (or Valid Acceptances)
scored above the cutoff point on the predictor and turn out to be successful on the criterion; predictor said they would be successful on the job and it was right
63
False Positives (or False Acceptances)
scored above the cutoff point on the predictor but did not turn out to be successful on the criterion; the predictor wrongly indicated that they would be successful on the job
64
True Negatives (or Valid Rejections)
scored below the cutoff point on the predictor and turned out to be unsuccessful on the criterion; predictor correctly indicated that they would be unsuccessful on the job
65
False Negatives (or Invalid Rejections)
scored below the cutoff point on the predictor but turned out to be successful on the criterion; predictor incorrectly indicated that they would be unsuccessful on the job
66
"positive" and "negative" for predictor
“positive”: predictor says the person should be selected; “negative”: predictor says the person should not be selected
67
"true" and "false" for predictor
where the person actually stands on the criterion; “true”: predictor classified the person into the correct criterion group; “false”: predictor made an incorrect classification
68
predictor’s functional utility
determine the increase in the proportion of correct hiring decisions that would result from using the predictor as a selection tool, relative to when it is not used
69
Factors Affecting the Validity Coefficient
1) Heterogeneity of Examinees: the coefficient is lowered if there is a restricted range of scores on either the predictor or the criterion - the more homogeneous the validation group, the lower the validity coefficient;
2) Reliability of the Predictor and Criterion: for a predictor to be valid, both the predictor and the criterion must be reliable - an unreliable test will always be invalid, but a reliable test will not always be valid;
3) Moderator Variables: the criterion-related validity of a test may vary among subgroups within a population as a function of moderator variables;
4) Cross-Validation: after a test is validated, it is typically re-validated with a sample of individuals different from the original validation sample
70
moderator variables
variables that influence the relationship between two other variables
71
differential validity
test is more valid for one subgroup but not another
72
cross-validation
after a test is validated, it is typically re-validated with a sample of individuals different from the original validation sample
73
shrinkage
reduction that occurs in a criterion-related validity coefficient upon cross-validation; occurs because the predictor is “tailor-made” for the original validation sample and doesn't fully generalize to other samples
74
when is shrinkage greatest
the original validation sample is small; the original item pool is large; the number of items retained is small relative to the number of items in the item pool; items are not chosen based on a previously formulated hypothesis or experience with the criterion
75
criterion contamination
in the process of validating a test, the predictor scores themselves influence any individual’s criterion status; artificially inflates the validity coefficient - it makes the predictor look more valid than it actually is
76
construct
an abstract psychological variable that cannot be directly observed (e.g., intelligence, anxiety)
77
construct validity
the extent to which a test measures the theoretical construct or trait it is intended to measure
78
convergent validity
requires that different ways of measuring the same trait yield similar results (WISC, WJ); tests that measure the same trait have a high correlation, even when they use different methods
79
discriminant (divergent) validity
low correlation with another test that measures a different construct; two tests that measure different traits have a low correlation, even when they use the same method
80
multitrait-multimethod matrix
assessment of two or more traits by two or more methods (self-report inventory, peer ratings, projective test)
81
monotrait-monomethod coefficients
indicate the correlation between the measure and itself and are therefore reliability coefficients
82
monotrait-heteromethod coefficients
correlations between two measures that assess the same (mono) trait using different (hetero) methods; if a test has convergent validity, this correlation should be high
83
heterotrait-monomethod coefficients
correlations between two measures that measure different (hetero) traits using the same (mono) method; if a test has discriminant validity, this coefficient should be low
84
heterotrait-heteromethod coefficients
correlations between two measures that measure different (hetero) traits using different (hetero) methods; if a test has discriminant validity, this correlation should be low
85
factor analysis
reducing a set of many variables (e.g., tests) to fewer variables to assess construct validity of a test; detect structure in several variables; can allow you to start with a large number of variables and classify them into sets
86
underlying constructs
latent variables identified by the factor analysis; the tests in the analysis were not directly designed to measure these constructs
87
factor loading
the correlation between a given test and a given factor; range from +1 to -1; can be squared to determine the proportion of variability in the test accounted for by the factor
88
communality (h2)
the proportion of a test’s variance that is attributable to the factors; the part of true-score variability that is shared with the other tests in the analysis
89
common variance
variance that a test shares with the other tests included in the analysis (i.e., variance in those tests that is also accounted for by the factors)
90
unique variance (u2)
variance specific to the test and not explained by the factors; part of true variability unique to the test itself
91
explained variance, or eigenvalues
measure of the amount of variance in all the tests accounted for by the factor
92
things you should know about eigenvalues
1) factors are ordered in terms of the size of their eigenvalues - Factor I is larger than Factor II, which is larger than Factor III, etc., so Factor I explains more of “what’s going on” in the tests than Factor II;
2) the sum of the eigenvalues can be no larger than the number of tests included in the analysis
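An illustrative numpy sketch (data are simulated, names are mine) of the two points above: eigenvalues of a correlation matrix order the factors by explained variance, and they sum to the number of tests.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 4))       # 200 examinees on 4 hypothetical tests
scores[:, 1] += scores[:, 0]             # make tests 1 and 2 correlate
R = np.corrcoef(scores, rowvar=False)    # 4 x 4 correlation matrix

eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]         # Factor I (largest) first
print(eigenvalues.round(2), round(eigenvalues.sum(), 2))   # eigenvalues sum to 4.0 = number of tests
```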
93
rotation
procedure that facilitates interpretation of a factor matrix; re-dividing the test’s communalities so that a clearer pattern of loadings emerges
94
orthogonal
factors that are independent of each other (uncorrelated)
95
oblique
factors that are correlated with each other to some degree
96
factorial validity
when a test correlates highly with a factor it would be expected to correlate with
97
differences between principal components and factor analysis
1) terminology: a “factor” in factor analysis is usually referred to as a principal component or an eigenvector in principal components analysis;
2) in principal components analysis, variance has 2 elements: explained variance and error variance; in factor analysis, variance has 3 elements: communality, specificity, and error;
3) in principal components analysis, the factors (or components, or eigenvectors) are always uncorrelated
98
cluster analysis
place objects into categories; develop a taxonomy or classification system
99
differences between cluster analysis and factor analysis
1) only variables that are measured using interval or ratio data can be used in a factor analysis; variables measured using any type of data can be included in a cluster analysis;
2) factors in factor analysis are usually interpreted as underlying traits or constructs measured by the variables in the analysis; clusters in cluster analysis are just categories, not necessarily traits;
3) cluster analysis is used in studies where there is an a priori hypothesis regarding what categories the objects will cluster into; factor analysis is used to test a hypothesis regarding what traits a set of variables measures
100
relationship between reliability and validity
a test is reliable if it measures “something,” and a test is valid if that “something” is what the test developer claims it is; for a test to be valid, it must be reliable; the validity coefficient is less than or, at the most, equal to the square root of the reliability coefficient - it can't be higher; reliability places an upper limit on validity
101
correction for attenuation
the formula answers the following question: “What would the validity coefficient of my predictor be if both the predictor and the criterion were perfectly reliable?”; what would happen to the validity coefficient if reliability (of both the predictor and the criterion) were higher
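A minimal sketch (hypothetical coefficients, formula as commonly stated) of the correction for attenuation: the estimated validity coefficient if both predictor and criterion were perfectly reliable.

```python
import math

def corrected_validity(r_xy: float, r_xx: float, r_yy: float) -> float:
    """r_xy = obtained validity; r_xx, r_yy = predictor and criterion reliabilities."""
    return r_xy / math.sqrt(r_xx * r_yy)

print(round(corrected_validity(r_xy=0.40, r_xx=0.80, r_yy=0.50), 2))  # 0.63
```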
102
item analysis
used to determine which items will be retained for the final version of the test; can be qualitative (content of the test) and quantitative (measurement of item difficulty, item discrimination)
103
item difficulty index (“p”)
the percentage of examinees who answer the item correctly; the higher the p value, the less difficult the item; ideal items have p ≈ .50; p values represent an ordinal scale of measurement only
104
item difficulty index (p) for: gifted, mastery, true/false, multiple choice
gifted = .25; mastery = .80 to .90; true/false = .75; multiple choice = .60
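An illustrative computation (responses are made up) of the item difficulty index p, the proportion of examinees answering an item correctly.

```python
item_responses = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]   # 1 = correct, 0 = incorrect
p = sum(item_responses) / len(item_responses)
print(p)  # 0.7 -> a fairly easy item; p near .50 generally maximizes discrimination
```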
105
item discrimination
degree to which a test item differentiates among examinees in terms of the behavior that the test is designed to measure
106
item discrimination index ("D")
D = the proportion of examinees in the upper-scoring group who answered the item correctly minus the proportion in the lower-scoring group who did, ranging from -1.0 to +1.0; when selecting items for predictive purposes, choose items that have high correlations with the criterion but low correlations with each other
107
item characteristic curves (ICCs)
graphs that depict each item in terms of how difficult the item was for individuals in different ability groups
108
item response theory assumptions about test items
1) performance on an item is related to the estimated amount of a latent trait being measured by the item; this implies that the scores of individuals tested with different items can be directly compared to each other, since all the items measure the same latent trait;
2) the results of testing are sample free (“invariance of item parameters”) - an item should have the same parameters (difficulty and discrimination levels) across all random samples of a population, so it can be used with any individual to provide an estimate of their ability
109
adaptive testing of ability
administering a set of items tailored to the examinee’s estimated level of ability
110
norm-referenced interpretation
comparing an examinee’s score to norms (scores of other examinees in a standardization sample); indicates where the examinee stands in relation to others who have taken the test
111
developmental norms
indicate how far along the normal developmental path an individual has progressed
112
mental age (MA) score
comparing an examinee’s score to the average performance of others at different age levels
113
grade equivalent scores
computing the average raw score obtained by children in each grade; for educational achievement tests
114
disadvantages of developmental norms
don't permit comparisons of individuals at different age levels; grade equivalent scores on different tests are not comparable
115
within-group norms
provide a comparison of the examinee’s score to those of the most nearly comparable standardization sample
116
percentile rank (PR)
the percentage of persons in the standardization sample who fall below a given raw score
117
pros and cons of percentile rank
pro: easy to understand and interpret; con: represent ranks (ordinal data) and therefore do not allow interpretations in terms of absolute amount of difference between scores
118
standard scores
express a raw score’s distance from the mean in terms of standard deviation units; tell us how many standard deviation units a person’s score is above or below the mean
119
pros of using standard scores
scores can be compared across different age groups; allow for interpretation in terms of the absolute amount of differences between scores
120
Z-scores
mean of 0 and SD of 1; directly indicates how many standard deviation units a score falls above or below the mean
121
T-scores
mean of 50 and SD of 10; a T-score of 60 falls 1 standard deviation above the mean
122
Stanine Scores
scores range from 1 to 9; mean of 5 and a SD of 2
123
Deviation IQ scores
mean of 100 and a standard deviation of 15
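A hypothetical sketch converting one raw score into the standard-score metrics defined in the preceding cards; the raw-score mean and SD are assumed to come from the standardization sample, and the stanine is approximated by rounding the transformed score rather than using exact percentile bands.

```python
def z_score(raw: float, mean: float, sd: float) -> float:
    return (raw - mean) / sd                      # mean 0, SD 1

def t_score(z: float) -> float:
    return 50 + 10 * z                            # mean 50, SD 10

def deviation_iq(z: float) -> float:
    return 100 + 15 * z                           # mean 100, SD 15

def stanine(z: float) -> int:
    return int(min(9, max(1, round(5 + 2 * z))))  # 1-9, mean 5, SD 2 (approximation)

z = z_score(raw=68, mean=60, sd=8)                # z = 1.0
print(z, t_score(z), stanine(z), deviation_iq(z)) # 1.0 60.0 7 115.0
```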
124
differential prediction
a case where given scores on a predictor test predict different outcomes for different subgroups
125
single-group validity
a test is valid for one subgroup but not another subgroup
126
sensitivity of a test
the proportion of correctly identified cases; the ratio of examinees whom the test correctly identifies as having the characteristic to the total number of examinees who actually possess the characteristic
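An illustrative calculation (counts are hypothetical) of sensitivity using the true-positive / false-negative categories defined in the earlier cards.

```python
true_positives = 40    # possess the characteristic and the test identified them
false_negatives = 10   # possess the characteristic but the test missed them

sensitivity = true_positives / (true_positives + false_negatives)
print(sensitivity)  # 0.8 -> 80% of those who actually have the characteristic are identified
```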
127
triangulation
attempt to increase reliability by reducing systematic or method error through a strategy in which the researcher employs multiple methods of measurement (e.g., observation, survey, archival data)
128
calibration
attempt to increase reliability by increasing homogeneity of ratings through feedback to the raters when multiple raters are used; raters might meet during pretesting of the instrument to discuss items on which they have disagreed, seeking to reach consensus on rules for rating items (e.g., agreeing on what a rating of “2” means for an item dealing with job performance)