Test Construction Flashcards Preview

Flashcards in Test Construction Deck (46):


reliability

amount of consistency, repeatability, and dependability in scores obtained on a given test


classical test theory

any obtained score is a combination of truth (true score) and error
total variability = true score variability + error variability
reliability is the proportion of total variability that is true score variability
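
The decomposition above can be sketched in a few lines. This is only an illustration: the function name and the variance values are hypothetical, not part of the deck.

```python
def reliability_from_variances(true_var: float, error_var: float) -> float:
    """Reliability = true score variability / total variability."""
    total_var = true_var + error_var  # total = true + error
    if total_var <= 0:
        raise ValueError("total variance must be positive")
    return true_var / total_var


# Hypothetical values: true score variance 64, error variance 36
print(reliability_from_variances(64.0, 36.0))  # 0.64
```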


reliability coefficient

symbolized rxx or rtt
commonly derived by correlating scores obtained on a test at one point in time (x or t) with scores obtained at a second point in time (x or t)


common sources of error in tests (3)

content sampling, time sampling, test heterogeneity


content sampling error

occurs when a test, by chance, has items that tap into a test-taker's knowledge base or items that don't tap into a test-taker's knowledge


time sampling error

occurs when a test is given at two different points in time and the scores on each administration are different because of factors related to the passage of time (e.g. forgetting over time)


test heterogeneity error

error due to test heterogeneity occurs when a test has heterogeneous items tapping more than one domain


factors affecting reliability

number of items (reliability INCREASES when number of items is increased)
homogeneity of items - refers to items tapping into similar content (reliability INCREASES with increased homogeneity)
range of scores - unrestricted range maximizes reliability; related to heterogeneity of subjects (range of scores INCREASES with increased subject heterogeneity)
ability to guess - true/false tests are easier to guess (reliability DECREASES as ability to guess increases)


four estimates of reliability

test-retest reliability
parallel forms reliability
internal consistency reliability - split-half reliability, Kuder-Richardson (KR-20 & KR-21), Cronbach's Alpha
interrater reliability


test-retest reliability

expressed as coefficient of stability
involves correlating pairs of scores from the same sample of people who are administered the identical test at two points in time
major source of error = time sampling (correlation decreases when time interval between administrations increases)


parallel forms reliability

expressed as coefficient of equivalence
involves correlating the scores obtained by the same group of people on two roughly equivalent but not identical forms of the same test administered at two different points in time
major source of error = time sampling and content sampling (subjects may be more or less familiar with items on one version of the test)


internal consistency reliability

looks at consistency of scores within the test
test administered only once to one group of people
estimated with split-half reliability, Kuder-Richardson (KR-20 & KR-21), or Cronbach's coefficient alpha


split-half reliability

calculated by splitting the test in half and then correlating scores obtained on each half by each person
Spearman-Brown formula typically used to correct for the shortened length of each half
major source of error = item or content sampling (someone might, by chance, know more items on one half)
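
The card names the Spearman-Brown formula without stating it. A minimal sketch, assuming the standard form r_new = n*r / (1 + (n - 1)*r), where n is the factor by which the test is lengthened (n = 2 when stepping up from a half to the full test). The function name and example correlation are hypothetical.

```python
def spearman_brown(r_half: float, length_factor: float = 2.0) -> float:
    """Estimate reliability of a test lengthened by `length_factor`.

    With the default factor of 2, this steps a split-half correlation
    up to an estimate for the full-length test.
    """
    return length_factor * r_half / (1.0 + (length_factor - 1.0) * r_half)


# Hypothetical split-half correlation of .60
print(round(spearman_brown(0.60), 3))  # 0.75
```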


Kuder-Richardson (KR-20 & KR-21) & Cronbach's Coefficient Alpha

sophisticated forms of internal consistency reliability
involve analysis of correlation of each item with every other item on the test
reliability calculated by taking mean of correlation coefficients for every possible split-half
KR-20 & KR-21: when items are scored dichotomously (correct or incorrect)
Cronbach's Coefficient Alpha: when items are scored non-dichotomously and there is a range of possible scores for each item or category (e.g. Likert Scale)
major sources of error: content sampling and test heterogeneity
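
In practice, coefficient alpha is usually computed from item and total-score variances rather than by averaging every split-half. A minimal sketch, assuming the common variance form alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores). The function name and data are hypothetical. With 0/1 items and population variances, the same computation corresponds to KR-20.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """scores: one row per examinee, each row a list of item scores."""
    k = len(scores[0])  # number of items
    if k < 2:
        raise ValueError("need at least two items")
    # Variance of each item across examinees
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of examinees' total scores
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical Likert responses: 5 examinees x 3 items
likert = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 2], [3, 3, 3]]
print(round(cronbach_alpha(likert), 3))
```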


interrater reliability

looks at degree of agreement between two or more scorers when test subjectively scored


standard error of measurement

theoretical distribution: one person's scores if he/she were tested hundreds of times with alternate or equivalent forms of the test
standard deviation of a theoretically normal distribution of test scores obtained by one individual on equivalent tests
ranges from 0.0 to SD of test
when test is perfectly reliable, standard error of measurement would be 0.0

95% probability that a person's true score lies within two standard errors of measurement of the obtained score
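
A minimal sketch of the computation, assuming the standard formula SEM = SDx * sqrt(1 - rxx), which matches the range on the card (0.0 when rxx = 1.0, SDx when rxx = 0.0). The function names and the example values (SD 15, reliability .91, obtained score 100) are hypothetical.

```python
import math


def standard_error_of_measurement(sd_x: float, r_xx: float) -> float:
    """SEM = SD of the test * sqrt(1 - reliability)."""
    if not 0.0 <= r_xx <= 1.0:
        raise ValueError("reliability must be between 0.0 and 1.0")
    return sd_x * math.sqrt(1.0 - r_xx)


def confidence_interval_95(obtained: float, sem: float, z: float = 1.96):
    """Band of z standard errors of measurement around the obtained score."""
    margin = z * sem
    return obtained - margin, obtained + margin


sem = standard_error_of_measurement(15.0, 0.91)
low, high = confidence_interval_95(100.0, sem)
print(round(sem, 2))                  # 4.5
print(round(low, 2), round(high, 2))  # 91.18 108.82
```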


content validity

addresses how adequately a test samples a particular content area
quantified by asking a panel of experts if each item is essential, useful but not essential, or not necessary
no numerical validity coefficient is derived


criterion-related validity

looks at how adequately a test score can be used to infer, predict, or estimate a criterion outcome (e.g. how well SAT scores predict college GPA)
coefficient (rxy) ranges from -1.0 to 1.0
validities as low as 0.20 considered acceptable
two subtypes: concurrent validity and predictive validity


concurrent validity

predictor and criterion are measured and correlated at about the same time


predictive validity

delay between the measurement of the predictor and criterion


standard error of estimate

theoretical distribution: one person's criterion scores if he/she were measured hundreds of times on the criterion; spread of this distribution is the average amount of error in estimating
standard deviation of a theoretically normal distribution of criterion scores obtained by one person measured repeatedly
minimum value of 0.0 to maximum value of SD of the criterion (SDy)
when test is a perfect predictor, standard error of estimate is 0.0
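
A minimal sketch, assuming the standard formula SEE = SDy * sqrt(1 - rxy^2), which is consistent with the range on the card (0.0 for a perfect predictor, SDy when the validity coefficient is 0). The function name and example values are hypothetical.

```python
import math


def standard_error_of_estimate(sd_y: float, r_xy: float) -> float:
    """SEE = SD of the criterion * sqrt(1 - validity coefficient squared)."""
    if not -1.0 <= r_xy <= 1.0:
        raise ValueError("validity coefficient must be between -1.0 and 1.0")
    return sd_y * math.sqrt(1.0 - r_xy ** 2)


# Hypothetical criterion SD of 10 and validity coefficient of .60
print(round(standard_error_of_estimate(10.0, 0.60), 2))  # 8.0
```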


expectancy tables

list the probability that a person's criterion score will fall in a specified range based on the range in which that person's predictor score fell
probabilities expressed in terms of percentages or proportions


Taylor-Russell tables

show how much more accurate selection decisions are when using a particular predictor test as opposed to using no predictor test
key concepts: base rate, selection ratio, incremental validity

incremental validity optimized when base rate is moderate (about .5) and selection ratio is low (close to .1)


base rate

rate of selecting successful employees without using a predictor test


selection ratio

proportion of available openings to available applicants


incremental validity

amount of improvement in success rate that results from using a predictor test
optimized when base rate is moderate and selection ratio is low


decision making theory

takes the predictions of performance that were based on the predictor test and compares them with the actual criterion outcome

                               Predictor Cut Off
                  Negatives           |   Positives
Above criterion   False Negatives     |   True Positives
- - - - - - - - - - - Criterion Cut Off - - - - - - - - -
Below criterion   True Negatives      |   False Positives
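
The four quadrants can be sketched as a small classifier. This is an illustration only: the function name is hypothetical, and treating a score exactly at a cutoff as meeting it is an assumption the deck does not state.

```python
def classify_decision(predictor: float, criterion: float,
                      predictor_cutoff: float, criterion_cutoff: float) -> str:
    """Place one person in a quadrant of the predictor/criterion grid."""
    predicted_positive = predictor >= predictor_cutoff  # assumption: at cutoff counts as positive
    successful = criterion >= criterion_cutoff          # assumption: at cutoff counts as success
    if predicted_positive and successful:
        return "true positive"
    if predicted_positive:
        return "false positive"
    if successful:
        return "false negative"
    return "true negative"


# Hypothetical cutoffs: predictor 50, criterion 3.0
print(classify_decision(62, 3.4, 50, 3.0))  # true positive
print(classify_decision(41, 3.4, 50, 3.0))  # false negative
```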



item response theory

used to calculate to what extent a specific item on a test correlates with an underlying construct


factors affecting criterion-related validity

range of scores (validity maximized by unrestricted range of scores on both the predictor and criterion)
reliability of the predictor (validity [...] predictor scores before assigning them to criterion ratings)


correction for attenuation

calculates how much higher validity would be if the predictor and criterion were both perfectly reliable
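
A minimal sketch, assuming the standard correction formula: corrected rxy = rxy / sqrt(rxx * ryy), where rxx and ryy are the reliabilities of the predictor and the criterion. The function name and example values are hypothetical.

```python
import math


def correct_for_attenuation(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Estimated validity if predictor and criterion were perfectly reliable."""
    if r_xx <= 0.0 or r_yy <= 0.0:
        raise ValueError("reliabilities must be greater than 0")
    return r_xy / math.sqrt(r_xx * r_yy)


# Hypothetical: observed validity .30, reliabilities .81 and .64
print(round(correct_for_attenuation(0.30, 0.81, 0.64), 3))  # 0.417
```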


construct validity

looks at how adequately a new test measures a construct or trait
construct is a hypothetical concept that typically cannot be measured directly (e.g. motivation, fear, aggression)
evidence of construct validity most commonly ascertained using factor analysis or a multi-trait, multi-method matrix


multi-trait, multi-method matrix

table with information about convergent and divergent validity (both necessary for construct validity)


convergent validity

correlation of scores on the new test with other available measures of same trait


divergent (discriminant) validity

correlation of scores on the new test with scores on another test that measures a different trait or construct


Degrees of Freedom for T-Test

Single Sample

Matched/Correlated Sample

Independent Samples

Single Sample: N-1

Matched/Correlated Sample: #pairs-1

Independent Samples: N-2
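
The three rules can be sketched as one helper. The function name and design labels are hypothetical. For independent samples, N is taken to mean the total number of subjects across both groups.

```python
def t_test_df(design: str, n: int) -> int:
    """Degrees of freedom for a t-test.

    design: "single", "matched", or "independent"
    n: number of subjects (number of pairs for "matched";
       total across both groups for "independent")
    """
    if design in ("single", "matched"):
        return n - 1
    if design == "independent":
        return n - 2
    raise ValueError(f"unknown design: {design}")


print(t_test_df("single", 25))       # 24
print(t_test_df("matched", 12))      # 11
print(t_test_df("independent", 40))  # 38
```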


Degrees of Freedom for Chi Square

Single Sample

Multiple Sample

Single Sample: #rows-1

Multiple Sample: (#rows-1)(#columns-1)
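
A minimal sketch of both rules. The function name is hypothetical, and omitting the column count is used here to signal the single-sample case.

```python
from typing import Optional


def chi_square_df(n_rows: int, n_columns: Optional[int] = None) -> int:
    """Degrees of freedom for chi-square.

    Single sample (no column count given): #rows - 1
    Multiple sample: (#rows - 1) * (#columns - 1)
    """
    if n_columns is None:
        return n_rows - 1
    return (n_rows - 1) * (n_columns - 1)


print(chi_square_df(4))     # 3
print(chi_square_df(3, 4))  # 6
```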


Degrees of Freedom for One-Way ANOVA

df total

df between

df within

df total: N-1

df between: # groups - 1

df within: df total - df between
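
The three values follow from N and the number of groups. A minimal sketch, with a hypothetical function name and example:

```python
def one_way_anova_df(n_total: int, n_groups: int) -> dict:
    """Degrees of freedom for a one-way ANOVA."""
    df_total = n_total - 1
    df_between = n_groups - 1
    df_within = df_total - df_between  # equivalent to N - number of groups
    return {"total": df_total, "between": df_between, "within": df_within}


# Hypothetical study: 30 subjects in 3 groups
print(one_way_anova_df(30, 3))  # {'total': 29, 'between': 2, 'within': 27}
```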


How to calculate shared/explained variability

square the correlation


How to calculate correlation

square root shared/explained variability
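
Both directions in a short sketch, with hypothetical function names. Note that taking the square root recovers only the magnitude of the correlation; the sign is lost when r is squared.

```python
import math


def shared_variability(r: float) -> float:
    """Proportion of shared/explained variability = r squared."""
    return r ** 2


def correlation_magnitude(shared: float) -> float:
    """Magnitude of the correlation = square root of shared variability."""
    return math.sqrt(shared)


print(round(shared_variability(0.80), 2))     # 0.64
print(round(correlation_magnitude(0.64), 2))  # 0.8
```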


True score variability

Reliability coefficient interpreted directly (e.g. if reliability is .64, true score variability is 64%)


Pearson r range

-1.0 to +1.0


Range of reliability coefficient

0.0 to +1.0


Range of validity coefficient

-1.0 to +1.0


Range of standard error of measurement

0.0 to SDx


range of the standard error of the estimate

0.0 to SDy


how to calculate 95% confidence interval

multiply the standard error of measurement by 1.96 (or 2), then add the result to and subtract it from the examinee's score