Test Construction Flashcards

1
Q

psychological test

A

an objective and standardized measure of a sample of behavior

2
Q

standardization

A

uniformity of procedure in administering and scoring the test;
test conditions and scoring procedures should be the same for all examinees

3
Q

norms

A

the scores of a representative sample of the population on a particular test;
interpretation of most psychological tests involves comparing an individual’s test score to norms

4
Q

conceptual points about norms

A

1) norms are obtained from a sample that is truly representative of the population for which the test is designed;
2) to be truly representative, a sample must be reasonably large;
3) examinee’s score should be compared to the scores obtained by a representative sample of the population to which he or she belongs;
4) norm-referenced scores indicate an examinee’s standing on a test as compared to other persons, which permits comparison of an individual’s performance on different tests;
5) norms don’t provide a universal standard of “good” or “bad” performance - they represent the performance of persons in the standardization sample

5
Q

objective

A

administration, scoring, and interpretation of scores are “independent of the subjective judgment of the particular examiner”;
the examinee will obtain the same score regardless of who administers or scores the test

6
Q

sample of behavior

A

the test measures a representative sample of the behavior domain in question rather than the entire domain

7
Q

reliability

A

yields repeatable, dependable, and consistent results;
yields scores that reflect examinees’ true scores on whatever attribute it measures

8
Q

validity

A

measures what it purports to measure

9
Q

maximum performance

A

tells us about an examinee’s best possible performance, or what a person can do;
achievement and aptitude tests

10
Q

typical performance

A

tells us what an examinee usually does or feels;
interest and personality tests

11
Q

pure speed (speeded) test

A

the examinee’s response rate is assessed;
have time limits and consist of items that all (or almost all) examinees would answer correctly if given enough time

12
Q

power test

A

assesses the level of difficulty a person can attain;
no time limit or a time limit that permits most or all examinees to attempt all items;
items are arranged in order from least difficult to most difficult

13
Q

mastery tests

A

designed to determine whether a person can attain a pre-established level of acceptable performance;
“all or none” score (e.g., pass/fail);
commonly employed to test basic skills (e.g., basic reading, basic math) at the elementary school level

14
Q

ipsative measure

A

the individual examinee (as opposed to a norm group or external criterion) is the frame of reference in score reporting;
scores are reported in terms of the relative strength of attributes within the individual examinee;
scores reflect which needs are strongest or weakest within the examinee, rather than as compared to a norm group;
examinees express a preference for one item over others rather than responding to each item individually (e.g., they are required to choose which of 2 statements appeals to them most)

15
Q

normative measures

A

provide a measure of the absolute strength of each attribute measured by the test;
examinees answer every item;
score can be compared to those of other examinees

16
Q

classical test theory

A

a given examinee’s obtained test score consists of two components: true score and measurement error

17
Q

true score

A

reflects the examinee’s actual status on whatever attribute is being measured by the test

18
Q

error (measurement error)

A

factors that are irrelevant to whatever is being measured; random;
does not affect all examinees in the same way

19
Q

reliability coefficient

A

a correlation coefficient that ranges in value from 0.0 to +1.0;
indicates the proportion of variability that is true score variability;
0.0 - test is completely unreliable; observed variability (differences) in test scores due entirely to random factors;
1.0 - perfect reliability; no error - all observed variability reflects true variability;
.90 - 90% of observed variability in obtained test scores due to true score differences among examinees and the remaining 10% of observed variability represents measurement error;
unlike other correlation coefficients, it is not squared when interpreted - the coefficient itself gives the proportion of true score variability

20
Q

test-retest reliability coefficient (“coefficient of stability”)

A

administering the same test to the same group of people, and then correlating scores on the first and second administrations

21
Q

“time sampling”

A

factors related to time that are sources of measurement error for the test-retest coefficient;
from one administration to the next, there may be changes in exam conditions (noise, weather) or in examinee factors such as illness, fatigue, worry, etc.

22
Q

practice effects

A

doing better the second time around due to practice

23
Q

drawbacks of test-retest reliability coefficient

A

examinees systematically tend to remember their previous responses;
not appropriate for assessing the reliability of tests that measure unstable attributes (e.g., mood);
recommended only for tests that are not appreciably affected by repetition; very few psychological tests fall into this category

24
Q

alternate forms (equivalent forms or parallel forms) reliability coefficient

A

administering two equivalent forms of a test to the same group of examinees, and then obtaining the correlation between the two sets of scores

25
drawbacks of alternate forms reliability coefficient
tends to be lower than the test-retest reliability coefficient; sources of measurement error: differences in content between the 2 forms (some examinees do better on Form A, others do better on Form B) and the passage of time, since the two forms cannot be administered at the same time; impractical and costly to construct two versions of the same test; should not be used to assess the reliability of a test that measures an unstable trait
26
internal consistency
obtaining correlations among individual items; split-half reliability, Cronbach’s coefficient alpha, Kuder-Richardson Formula 20; administer the test once to a single group of examinees
27
split-half reliability
dividing the test in two and obtaining a correlation between the halves as if they were two shorter tests
28
Spearman-Brown formula
estimates the effect that shortening (or lengthening) a test will have on the reliability coefficient
29
drawbacks of split-half reliability
the correlation will vary depending on how the items are divided; splitting the test in this manner artificially lowers the reliability coefficient, since the longer a test is, the more reliable it will be - so the Spearman-Brown formula (illustrated below) is used to correct the obtained coefficient
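A minimal Python sketch (not from the source; values are hypothetical) of the Spearman-Brown prophecy formula referenced in the two cards above, applied to the common case of correcting a split-half correlation for test length.

```python
# Spearman-Brown prophecy formula: estimated reliability when a test is
# lengthened (or shortened) by a given factor.
def spearman_brown(r: float, length_factor: float) -> float:
    return (length_factor * r) / (1 + (length_factor - 1) * r)

# Example: the two half-tests correlate .70; the full test is twice as long,
# so the split-half coefficient is corrected with length_factor = 2.
half_test_r = 0.70
print(round(spearman_brown(half_test_r, length_factor=2), 3))  # 0.824
```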
30
Kuder-Richardson Formula 20 (KR-20)
indicates the average degree of inter-item consistency; used when the test items are dichotomously scored (right/wrong, yes/no)
31
coefficient alpha
indicates the average degree of inter-item consistency; used for tests whose items have multiple scored response options (e.g., “usually”, “sometimes”, “rarely”, “never”)
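A hypothetical numpy sketch (variable names and data are illustrative, not from the source) of coefficient alpha computed from an examinees x items score matrix; with dichotomously scored items the same computation corresponds to KR-20.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha for a matrix with rows = examinees, columns = items."""
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total test scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Example: 5 examinees x 4 dichotomous (0/1) items
scores = np.array([[1, 1, 1, 0],
                   [1, 0, 1, 1],
                   [0, 0, 1, 0],
                   [1, 1, 1, 1],
                   [0, 0, 0, 0]])
print(round(cronbach_alpha(scores), 2))  # 0.79
```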
32
pros and cons of internal consistency reliability
pros: good for assessing the reliability of tests that measure unstable traits or are affected by repeated administration; cons: the major source of measurement error is item heterogeneity; inappropriate for assessing the reliability of speed tests
33
content sampling, or item heterogeneity
the degree to which items differ in terms of the content they sample
34
interscorer (or inter-rater) reliability
calculating a correlation coefficient between the scores of two different raters
35
kappa coefficient
measure of the agreement between two judges who each rate a set of objects using nominal scales
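An illustrative sketch (categories and ratings are made up) of Cohen's kappa, which corrects the two raters' percent agreement for the agreement expected by chance alone.

```python
from collections import Counter

def cohen_kappa(rater_a: list, rater_b: list) -> float:
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n   # observed agreement
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)  # chance agreement
    return (observed - expected) / (1 - expected)

# Example: two raters classify 8 observed behaviors into nominal categories
a = ["on-task", "on-task", "off-task", "on-task", "off-task", "on-task", "on-task", "off-task"]
b = ["on-task", "off-task", "off-task", "on-task", "off-task", "on-task", "on-task", "on-task"]
print(round(cohen_kappa(a, b), 2))  # 0.43
```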
36
mutually exclusive categories
a particular behavior clearly belongs to one and only one category
37
exhaustive categories
the categories cover all possible responses or behaviors
38
duration recording
rater records the elapsed time during which the target behavior or behaviors occur
39
frequency recording
observer keeps count of the number of times the target behavior occurs; useful for recording behaviors of short duration and those where duration is not important
40
interval recording
observing a subject at a given interval and noting whether the subject is engaging or not engaging in the target behavior during that interval; useful for behaviors that do not have a fixed beginning or end
41
continuous recording
recording all the behavior of the target subject during each observation session
42
standard error of measurement (σmeas)
indicates how much error an individual test score can be expected to have; used to construct a confidence interval
43
confidence interval
the range within which an examinee’s true score is likely to fall, given his or her obtained score
44
SEM formula
σmeas = SDx × √(1 − rxx), where σmeas = standard error of measurement, SDx = standard deviation of test scores, and rxx = reliability coefficient
45
CI formulas
68% CI = obtained score ± (1.0)(σmeas); 95% CI = obtained score ± (1.96)(σmeas); 99% CI = obtained score ± (2.58)(σmeas)
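A minimal Python sketch (all numbers hypothetical) of the SEM formula above and its use in building a 95% confidence interval around an obtained score.

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD_x * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1 - reliability)

sd_x, r_xx, obtained = 15, 0.89, 110         # e.g., a scale with SD = 15
s = sem(sd_x, r_xx)                          # ≈ 4.97
ci_95 = (obtained - 1.96 * s, obtained + 1.96 * s)
print(round(s, 2), tuple(round(x, 1) for x in ci_95))  # 4.97 (100.2, 119.8)
```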
46
factors affecting reliability
1. short tests are less reliable than longer tests;
2. as the group taking a test becomes more homogeneous, the variability of the scores - and hence the reliability coefficient - decreases;
3. if test items are too difficult, most people will get low scores; if items are too easy, most people will get high scores; either way, score variability decreases, resulting in a lower reliability coefficient;
4. the higher the probability that examinees can guess the correct answer to items, the lower the reliability coefficient;
5. for inter-item consistency measured by the KR-20 or coefficient alpha methods, reliability increases as the items become more homogeneous
47
content validity
the extent to which the test items adequately and representatively sample the content area to be measured; educational achievement tests, work samples, EPPP
48
assessment of content validity
judgment and agreement of subject matter experts; high correlation with other tests that purport to sample the same content domain; students who are known to have succeeded in learning a particular content domain do well on a test designed to sample that domain
49
face validity
appears valid to examinees who take it, personnel who administer it, and other technically untrained observers
50
criterion-related validity
useful for predicting an individual’s behavior in specified situations; applied situations (select employees, college admissions, place students in special classes)
51
criterion-related validity coefficient
a correlation coefficient (Pearson r) is used to determine the correlation between the predictor and the criterion
52
criterion-related validity coefficient formula
rxy, where x refers to the predictor and y refers to the criterion
53
validation
the procedures used to determine how valid a predictor is
54
concurrent validation
the predictor and the criterion data are collected at or about the same time; a test that is useful for estimating a current behavior is said to have high concurrent validity; the focus is on current status on the criterion
55
predictive validation
scores on the predictor are collected first, and the criterion data are collected at some future point; a test that is useful for forecasting a future behavior is said to have high predictive validity; the focus is on predicting future status on the criterion
56
standard error of estimate (or σest)
estimates the range in which a person’s actual criterion score is likely to fall, given the criterion score predicted for him/her by a predictor
57
standard error of estimate formula
σest = SDy × √(1 − rxy²), where σest = standard error of estimate, SDy = standard deviation of criterion scores, and rxy = validity coefficient
58
CI for standard error of estimate
68% = ± (1)(σest) of predicted criterion score; 95% = ± (1.96)(σest); 99% = ±(2.58)(σest )
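A parallel sketch (hypothetical values) for the standard error of estimate and a 68% confidence interval around a predicted criterion score.

```python
import math

def se_estimate(sd_y: float, validity: float) -> float:
    """Standard error of estimate: SD_y * sqrt(1 - r_xy^2)."""
    return sd_y * math.sqrt(1 - validity ** 2)

sd_y, r_xy, predicted = 10, 0.60, 75        # criterion SD, validity coefficient, predicted criterion score
see = se_estimate(sd_y, r_xy)               # 8.0
ci_68 = (predicted - see, predicted + see)  # (67.0, 83.0)
print(see, ci_68)
```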
59
differences between standard error of estimate and standard error of measurement
1) the SEM is related to the reliability coefficient; the SEE is related to the validity coefficient;
2) the SEM is used to estimate where an examinee’s true test score is likely to fall, given the obtained score on that same test (no predictor measure is involved); the SEE is used to determine where an examinee’s actual criterion score is likely to fall, given the criterion score that was predicted by another measure (a predictor is being used)
60
criterion cutoff
whether or not a person will meet or exceed a certain minimum standard of criterion performance
61
predictor cutoff score
if the examinee scores at or above the predictor cutoff score he or she is selected, but if the examinee scores below the predictor cutoff score, he or she is rejected
62
True Positives (or Valid Acceptances)
scored above the cutoff point on the predictor and turn out to be successful on the criterion; predictor said they would be successful on the job and it was right
63
False Positives (or False Acceptances)
scored above the cutoff point on the predictor but did not turn out to be successful on the criterion; the predictor wrongly indicated that they would be successful on the job
64
True Negatives (or Valid Rejections)
scored below the cutoff point on the predictor and turned out to be unsuccessful on the criterion; predictor correctly indicated that they would be unsuccessful on the job
65
False Negatives (or Invalid Rejections)
scored below the cutoff point on the predictor but turned out to be successful on the criterion; predictor incorrectly indicated that they would be unsuccessful on the job
66
"positive" and "negative" for predictor
“positive”: predictor says the person should be selected; “negative”: predictor says the person should not be selected
67
"true" and "false" for predictor
where the person actually stands on the criterion; “true”: predictor classified the person into the correct criterion group; “false”: predictor made an incorrect classification
68
predictor’s functional utility
determine the increase in the proportion of correct hiring decisions that would result from using the predictor as a selection tool, relative to when it is not used
69
Factors Affecting the Validity Coefficient
1) Heterogeneity of Examinees: the coefficient is lowered if there is a restricted range of scores on either the predictor or the criterion - the more homogeneous the validation group, the lower the validity coefficient;
2) Reliability of the Predictor and Criterion: for a predictor to be valid, both the predictor and the criterion must be reliable - an unreliable test will always be invalid, but a reliable test will not always be valid;
3) Moderator Variables: the criterion-related validity of a test may vary among subgroups within a population as a function of moderator variables;
4) Cross-Validation: after a test is validated, it is typically re-validated with a sample of individuals different from the original validation sample
70
moderator variables
variables that influence the relationship between two other variables
71
differential validity
test is more valid for one subgroup but not another
72
cross-validation
after a test is validated, it is typically re-validated with a sample of individuals different from the original validation sample
73
shrinkage
reduction that occurs in a criterion-related validity coefficient upon cross-validation; occurs because the predictor is “tailor-made” for the original validation sample and doesn't fully generalize to other samples
74
when is shrinkage greatest
the original validation sample is small; the original item pool is large; the number of items retained is small relative to the number of items in the item pool; items are not chosen based on a previously formulated hypothesis or experience with the criterion
75
criterion contamination
in the process of validating a test, the predictor scores themselves influence any individual’s criterion status; artificially inflates the validity coefficient - it makes the predictor look more valid than it actually is
76
construct
an abstract psychological variable that cannot be directly observed (e.g., intelligence, anxiety)
77
construct validity
the extent to which a test measures the theoretical construct or trait it is intended to measure
78
convergent validity
requires that different ways of measuring the same trait yield similar results (WISC, WJ); tests that measure the same trait have a high correlation, even when they use different methods
79
discriminant (divergent) validity
low correlation with another test that measures a different construct; two tests that measure different traits have a low correlation, even when they use the same method
80
multitrait-multimethod matrix
assessment of two or more traits by two or more methods (self-report inventory, peer ratings, projective test)
81
monotrait-monomethod coefficients
indicate the correlation between the measure and itself and are therefore reliability coefficients
82
monotrait-heteromethod coefficients
correlations between two measures that assess the same (mono) trait using different (hetero) methods; if a test has convergent validity, this correlation should be high
83
heterotrait-monomethod coefficients
correlations between two measures that measure different (hetero) traits using the same (mono) method; if a test has discriminant validity, this coefficient should be low
84
heterotrait-heteromethod coefficients
correlations between two measures that measure different (hetero) traits using different (hetero) methods; if a test has discriminant validity, this correlation should be low
85
factor analysis
reducing a set of many variables (e.g., tests) to fewer variables to assess construct validity of a test; detect structure in several variables; can allow you to start with a large number of variables and classify them into sets
86
underlying constructs
latent variables identified by the factor analysis; the tests in the analysis were not directly designed to measure these constructs
87
factor loading
the correlation between a given test and a given factor; range from +1 to -1; can be squared to determine the proportion of variability in the test accounted for by the factor
88
communality (h2)
the proportion of a test’s variance that is attributable to the factors; the part of true-score variability that is shared with the other tests in the analysis
89
common variance
variance that a test shares with the other tests included in the analysis (i.e., variance in those tests that is also accounted for by the factors)
90
unique variance (u2)
variance specific to the test and not explained by the factors; part of true variability unique to the test itself
91
explained variance, or eigenvalues
measure of the amount of variance in all the tests accounted for by the factor
92
things you should know about eigenvalues
1) factors are ordered in terms of the size of their eigenvalues - Factor I is larger than Factor II, which is larger than Factor III, etc., so Factor I explains more of “what’s going on” in the tests than Factor II;
2) the sum of the eigenvalues can be no larger than the number of tests included in the analysis
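An illustrative numpy sketch (data are simulated, names are mine) of the two points above: eigenvalues of a correlation matrix order the factors by explained variance, and they sum to the number of tests.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 4))       # 200 examinees on 4 hypothetical tests
scores[:, 1] += scores[:, 0]             # make tests 1 and 2 correlate
R = np.corrcoef(scores, rowvar=False)    # 4 x 4 correlation matrix

eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]         # Factor I (largest) first
print(eigenvalues.round(2), round(eigenvalues.sum(), 2))   # eigenvalues sum to 4.0 = number of tests
```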
93
rotation
procedure that facilitates interpretation of a factor matrix; re-dividing the test’s communalities so that a clearer pattern of loadings emerges
94
orthogonal
factors that are independent of each other (uncorrelated)
95
oblique
factors that are correlated with each other to some degree
96
factorial validity
when a test correlates highly with a factor it would be expected to correlate with
97
differences between principal components and factor analysis
1) terminology: a “factor” in factor analysis is usually referred to as a principal component or an eigenvector in principal components analysis;
2) in principal components analysis, variance has 2 elements: explained variance and error variance; in factor analysis, variance has 3 elements: communality, specificity, and error;
3) in principal components analysis, the factors (or components, or eigenvectors) are always uncorrelated
98
cluster analysis
place objects into categories; develop a taxonomy or classification system
99
differences between cluster analysis and factor analysis
1) only variables that are measured using interval or ratio data can be used in a factor analysis; variables measured using any type of data can be included in a cluster analysis;
2) factors in factor analysis are usually interpreted as underlying traits or constructs measured by the variables in the analysis; clusters in cluster analysis are just categories, not necessarily traits;
3) cluster analysis is used in studies where there is an a priori hypothesis regarding what categories the objects will cluster into; factor analysis is used to test a hypothesis regarding what traits a set of variables measures
100
relationship between reliability and validity
a test is reliable if it measures “something,” and a test is valid if that “something” is what the test developer claims it is; for a test to be valid, it must be reliable; the validity coefficient is less than or, at the most, equal to the square root of the reliability coefficient - it can't be higher; reliability places an upper limit on validity
101
correction for attenuation
the formula answers the following question: “What would the validity coefficient of my predictor be if both the predictor and the criterion were perfectly reliable?”; what would happen to the validity coefficient if reliability (of both the predictor and the criterion) were higher
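A minimal sketch (hypothetical coefficients, formula as commonly stated) of the correction for attenuation: the estimated validity coefficient if both predictor and criterion were perfectly reliable.

```python
import math

def corrected_validity(r_xy: float, r_xx: float, r_yy: float) -> float:
    """r_xy = obtained validity; r_xx, r_yy = predictor and criterion reliabilities."""
    return r_xy / math.sqrt(r_xx * r_yy)

print(round(corrected_validity(r_xy=0.40, r_xx=0.80, r_yy=0.50), 2))  # 0.63
```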
102
item analysis
used to determine which items will be retained for the final version of the test; can be qualitative (content of the test) and quantitative (measurement of item difficulty, item discrimination)
103
item difficulty index (“p”)
the percentage of examinees who answer the item correctly; the higher the p value, the less difficult the item; ideal items have p ≈ .50; p values represent an ordinal scale of measurement only
104
item difficulty index (p) for: gifted, mastery, true/false, multiple choice
gifted = .25; mastery = .80 to .90; true/false = .75; multiple choice = .60
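An illustrative computation (responses are made up) of the item difficulty index p, the proportion of examinees answering an item correctly.

```python
item_responses = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]   # 1 = correct, 0 = incorrect
p = sum(item_responses) / len(item_responses)
print(p)  # 0.7 -> a fairly easy item; p near .50 generally maximizes discrimination
```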
105
item discrimination
degree to which a test item differentiates among examinees in terms of the behavior that the test is designed to measure
106
item discrimination index ("D")
D = the proportion of examinees in the upper-scoring group who answered the item correctly minus the proportion in the lower-scoring group who did, ranging from -1.0 to +1.0; when selecting items for predictive purposes, choose items that have high correlations with the criterion but low correlations with each other
107
item characteristic curves (ICCs)
graphs that depict each item in terms of how difficult the item was for individuals in different ability groups
108
item response theory assumptions about test items
1) performance on an item is related to the estimated amount of a latent trait being measured by the item; this implies that the scores of individuals tested with different items can be directly compared to each other, since all the items measure the same latent trait;
2) the results of testing are sample free (“invariance of item parameters”) - an item should have the same parameters (difficulty and discrimination levels) across all random samples of a population, so it can be used with any individual to provide an estimate of their ability
109
adaptive testing of ability
administering a set of items tailored to the examinee’s estimated level of ability
110
norm-referenced interpretation
comparing an examinee’s score to norms (scores of other examinees in a standardization sample); indicates where the examinee stands in relation to others who have taken the test
111
developmental norms
indicate how far along the normal developmental path an individual has progressed
112
mental age (MA) score
comparing an examinee’s score to the average performance of others at different age levels
113
grade equivalent scores
computing the average raw score obtained by children in each grade; for educational achievement tests
114
disadvantages of developmental norms
don't permit comparisons of individuals at different age levels; grade equivalent scores on different tests are not comparable
115
within-group norms
provide a comparison of the examinee’s score to those of the most nearly comparable standardization sample
116
percentile rank (PR)
the percentage of persons in the standardization sample who fall below a given raw score
117
pros and cons of percentile rank
pro: easy to understand and interpret; con: represent ranks (ordinal data) and therefore do not allow interpretations in terms of absolute amount of difference between scores
118
standard scores
express a raw score’s distance from the mean in terms of standard deviation units; tell us how many standard deviation units a person’s score is above or below the mean
119
pros of using standard scores
scores can be compared across different age groups; allow for interpretation in terms of the absolute amount of differences between scores
120
Z-scores
mean of 0 and SD of 1; directly indicates how many standard deviation units a score falls above or below the mean
121
T-scores
mean of 50 and SD of 10; a T-score of 60 falls 1 standard deviation above the mean
122
Stanine Scores
scores range from 1 to 9; mean of 5 and a SD of 2
123
Deviation IQ scores
mean of 100 and a standard deviation of 15
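A hypothetical sketch converting one raw score into the standard-score metrics defined in the preceding cards; the raw-score mean and SD are assumed to come from the standardization sample, and the stanine is approximated by rounding the transformed score rather than using exact percentile bands.

```python
def z_score(raw: float, mean: float, sd: float) -> float:
    return (raw - mean) / sd                      # mean 0, SD 1

def t_score(z: float) -> float:
    return 50 + 10 * z                            # mean 50, SD 10

def deviation_iq(z: float) -> float:
    return 100 + 15 * z                           # mean 100, SD 15

def stanine(z: float) -> int:
    return int(min(9, max(1, round(5 + 2 * z))))  # 1-9, mean 5, SD 2 (approximation)

z = z_score(raw=68, mean=60, sd=8)                # z = 1.0
print(z, t_score(z), stanine(z), deviation_iq(z)) # 1.0 60.0 7 115.0
```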
124
differential prediction
a case where given scores on a predictor test predict different outcomes for different subgroups
125
single-group validity
a test is valid for one subgroup but not another subgroup
126
sensitivity of a test
the proportion of correctly identified cases; the ratio of examinees whom the test correctly identifies as having the characteristic to the total number of examinees who actually possess the characteristic
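An illustrative calculation (counts are hypothetical) of sensitivity using the true-positive / false-negative categories defined in the earlier cards.

```python
true_positives = 40    # possess the characteristic and the test identified them
false_negatives = 10   # possess the characteristic but the test missed them

sensitivity = true_positives / (true_positives + false_negatives)
print(sensitivity)  # 0.8 -> 80% of those who actually have the characteristic are identified
```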
127
triangulation
attempt to increase reliability by reducing systematic or method error through a strategy in which the researcher employs multiple methods of measurement (e.g., observation, survey, archival data)
128
calibration
attempt to increase reliability by increasing homogeneity of ratings through feedback to the raters when multiple raters are used; raters might meet during pretesting of the instrument to discuss items on which they have disagreed, seeking to reach consensus on rules for rating items (e.g., agreeing on what a rating of “2” means for an item dealing with job performance)