Test Construction Flashcards by Patrice r

reliability

amount of consistency, repeatability, and dependability in scores obtained on a given test

classical test theory

any obtained score is a combination of truth and error
total variability = true score variability + error variability
reliability is the proportion of total variability that is true score variability
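
The variance decomposition on this card can be written as a worked equation. The symbols (X for obtained score, T for true score, E for error) and the numbers are illustrative assumptions, not taken from the deck.

```latex
\sigma_X^2 = \sigma_T^2 + \sigma_E^2,
\qquad
r_{xx} = \frac{\sigma_T^2}{\sigma_X^2}

% Hypothetical example: total variance 100, true score variance 64, error variance 36
r_{xx} = \frac{64}{64 + 36} = 0.64
```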

reliability coefficient

rxx or rtt
commonly derived by correlating the score obtained on a test at one point in time (x or t) with the score obtained at a second point in time (x or t)

common sources of error in tests (3)

content sampling, time sampling, test heterogeneity

content sampling error

occurs when a test, by chance, has items that tap into a test-taker’s knowledge base or items that don’t tap into a test-taker’s knowledge

time sampling error

occurs when a test is given at two different points in time and the scores on each administration are different because of factors related to the passage of time (e.g. forgetting over time)

test heterogeneity error

error due to test heterogeneity occurs when a test has heterogeneous items tapping more than one domain

factors affecting reliability

number of items: reliability INCREASES when the number of items is increased
homogeneity of items: refers to items tapping into similar content (reliability INCREASES with increased homogeneity)
range of scores: an unrestricted range maximizes reliability; related to heterogeneity of subjects (range of scores INCREASES with increased subject heterogeneity)
ability to guess: true/false tests are easier to guess (reliability DECREASES as ability to guess increases)

four estimates of reliability

test-retest reliability
parallel forms reliability
internal consistency reliability (split-half reliability, Kuder-Richardson KR-20 & KR-21, Cronbach’s alpha)
interrater reliability

test-retest reliability

expressed as the coefficient of stability
involves correlating pairs of scores from the same sample of people who are administered the identical test at two points in time
major source of error = time sampling (correlation decreases as the time interval between administrations increases)
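
A minimal sketch of a coefficient of stability, assuming it is computed as a Pearson correlation between the two administrations. The function name and the five pairs of scores are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from each mean
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical scores for the same five people at two points in time
time_1 = [10, 12, 15, 18, 20]
time_2 = [11, 13, 14, 19, 21]

stability = pearson_r(time_1, time_2)
print(round(stability, 2))
```

The same function would serve for a coefficient of equivalence if the second list held scores on a parallel form instead of a retest.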

parallel forms reliability

expressed as the coefficient of equivalence
involves correlating the scores obtained by the same group of people on two roughly equivalent but not identical forms of the same test administered at two different points in time
major sources of error = time sampling and content sampling (subjects may be more or less familiar with items on one version of the test)

internal consistency reliability

looks at consistency of scores within the test
test administered only once to one group of people
estimated by split-half reliability, Kuder-Richardson (KR-20 & KR-21), or Cronbach’s coefficient alpha

split-half reliability

calculated by splitting the test in half and then correlating the scores obtained on each half by each person
Spearman-Brown formula typically used to estimate the reliability of the full-length test from the half-test correlation
major source of error = item or content sampling (someone might, by chance, know more items on one half)
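
A short sketch of the Spearman-Brown formula mentioned on this card. The function name, the `factor` parameter, and the example correlation of 0.60 are illustrative assumptions.

```python
def spearman_brown(r_half, factor=2.0):
    """Estimate reliability of a test lengthened by `factor`.

    With factor=2 this steps a half-test correlation up to
    the estimated reliability of the full-length test.
    """
    return (factor * r_half) / (1 + (factor - 1) * r_half)

# Hypothetical correlation between the two halves of a test
full_test_reliability = spearman_brown(0.60)
print(full_test_reliability)  # 0.75 (up to floating-point rounding)
```

The corrected value is higher than the half-test correlation, which fits the card on factors affecting reliability: reliability increases with the number of items.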

Kuder-Richardson (KR-20 & KR-21) & Cronbach’s Coefficient Alpha

sophisticated forms of internal consistency reliability
involve analysis of the correlation of each item with every other item on the test
reliability calculated by taking the mean of the correlation coefficients for every possible split-half
KR-20 & KR-21: used when items are scored dichotomously (correct or incorrect)
Cronbach’s Coefficient Alpha: used when items are scored non-dichotomously and there is a range of possible scores for each item or category (e.g. Likert scale)
major sources of error: content sampling and test heterogeneity
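
A minimal sketch of coefficient alpha using the standard variance-based formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores). It assumes population variances throughout; under that assumption, dichotomous (0/1) items give the same value as KR-20, because each item variance equals p × q. The function name and the data are hypothetical.

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """item_scores: one row per person, each row a list of item scores."""
    k = len(item_scores[0])  # number of items
    # Variance of each item across people
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    # Variance of each person's total score
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical dichotomous data: 4 people x 3 items (1 = correct, 0 = incorrect)
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(responses)
print(alpha)  # 0.75 (up to floating-point rounding)
```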

interrater reliability

looks at the degree of agreement between two or more scorers when a test is subjectively scored

standard error of measurement

theoretical distribution: one person’s scores if he/she were tested hundreds of times with alternate or equivalent forms of the test
standard deviation of a theoretically normal distribution of test scores obtained by one individual on equivalent tests
ranges from 0.0 to the SD of the test
when a test is perfectly reliable, the standard error of measurement is 0.0

95% probability that a person’s true score lies within two standard errors of measurement of the obtained score
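
A brief sketch of the usual formula for the standard error of measurement, SEM = SD × sqrt(1 − rxx). It reproduces the range on this card: 0.0 when reliability is 1.0, and the SD of the test when reliability is 0.0. The function name and the example values are assumptions.

```python
from math import sqrt

def standard_error_of_measurement(sd_test, reliability):
    """SEM = SD of the test * sqrt(1 - reliability coefficient)."""
    return sd_test * sqrt(1 - reliability)

# Hypothetical test with SD = 15 and reliability = .91
sem = standard_error_of_measurement(15, 0.91)
print(round(sem, 2))  # 4.5
```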

content validity

addresses how adequately a test samples a particular content area
quantified by asking a panel of experts whether each item is essential, useful but not essential, or not necessary
no numerical validity coefficient is derived

criterion-related validity

looks at how adequately a test score can be used to infer, predict, or estimate a criterion outcome (e.g. how well SAT scores predict college GPA)
coefficient (rxy) ranges from -1.0 to +1.0
validities as low as 0.20 are considered acceptable
two subtypes: concurrent validity and predictive validity

concurrent validity

predictor and criterion are measured and correlated at about the same time

predictive validity

delay between the measurement of the predictor and criterion

standard error of estimate

theoretical distribution: one person’s criterion scores if he/she were measured hundreds of times on the criterion; the spread of this distribution is the average amount of error in estimating the criterion score
standard deviation of a theoretically normal distribution of criterion scores obtained by one person measured repeatedly
ranges from a minimum of 0.0 to a maximum of the SD of the criterion (SDy)
when the test is a perfect predictor, the standard error of estimate is 0.0
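
A brief sketch of the usual formula for the standard error of estimate, SEE = SDy × sqrt(1 − rxy²). It matches the range on this card: 0.0 for a perfect predictor, SDy when the validity coefficient is 0. The function name and the example values are assumptions.

```python
from math import sqrt

def standard_error_of_estimate(sd_criterion, validity):
    """SEE = SD of the criterion * sqrt(1 - validity coefficient squared)."""
    return sd_criterion * sqrt(1 - validity ** 2)

# Hypothetical criterion with SDy = 10 and validity coefficient rxy = .60
see = standard_error_of_estimate(10, 0.60)
print(round(see, 2))  # 8.0
```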

expectancy tables

list the probability that a person’s criterion score will fall in a specified range based on the range in which that person’s predictor score fell
probabilities expressed in terms of percentages or proportions

Taylor-Russell tables

show how much more accurate selection decisions are when using a particular predictor test as opposed to using no predictor test
key concepts: base rate, selection ratio, incremental validity

incremental validity optimized when base rate is moderate (about .5) and selection ratio is low (close to .1)

base rate

rate of selecting successful employees without using a predictor test

selection ratio

proportion of available openings to available applicants

incremental validity

amount of improvement in success rate that results from using a predictor test
optimized when base rate is moderate and selection ratio is low

decision making theory

takes the predictions of performance that were based on the predictor test and compares them with the actual criterion outcomes

                          below predictor cutoff    above predictor cutoff
                          (negatives)               (positives)
above criterion cutoff    False Negatives           True Positives
below criterion cutoff    True Negatives            False Positives
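
A small sketch of how the four outcomes follow from the two cutoffs. The function name, the cutoff values, and the choice to count a score exactly at the cutoff as a positive are all assumptions made for illustration.

```python
def classify_decision(predictor, criterion, predictor_cutoff, criterion_cutoff):
    """Label one person's outcome given predictor and criterion cutoffs."""
    # Assumption: a score exactly at a cutoff counts as meeting it
    predicted_positive = predictor >= predictor_cutoff
    actual_positive = criterion >= criterion_cutoff
    if predicted_positive and actual_positive:
        return "true positive"
    if predicted_positive and not actual_positive:
        return "false positive"
    if not predicted_positive and actual_positive:
        return "false negative"
    return "true negative"

# Hypothetical applicant: below the test cutoff but successful on the job
outcome = classify_decision(predictor=45, criterion=80,
                            predictor_cutoff=50, criterion_cutoff=70)
print(outcome)  # false negative
```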

item response theory

used to calculate to what extent a specific item on a test correlates with an underlying construct

factors affecting criterion-related validity

range of scores (validity maximized by unrestricted range of scores on both the predictor and criterion)
reliability of the predictor (validity predictor scores before assigning them to criterion ratings

correction for attenuation

calculates how much higher validity would be if the predictor and criterion were both perfectly reliable
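
A minimal sketch of the standard correction-for-attenuation formula, corrected rxy = rxy / sqrt(rxx × ryy), where rxx and ryy are the reliabilities of the predictor and the criterion. The function name and the example values are assumptions.

```python
from math import sqrt

def correct_for_attenuation(validity, predictor_reliability, criterion_reliability):
    """Estimated validity if predictor and criterion were both perfectly reliable."""
    return validity / sqrt(predictor_reliability * criterion_reliability)

# Hypothetical values: observed rxy = .36, rxx = .81, ryy = .64
corrected = correct_for_attenuation(0.36, 0.81, 0.64)
print(round(corrected, 2))  # 0.5
```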

construct validity

looks at how adequately a new test measures a construct or trait
a construct is a hypothetical concept that typically cannot be measured directly (e.g. motivation, fear, aggression)
evidence of construct validity is most commonly ascertained using factor analysis or a multi-trait, multi-method matrix

multi-trait, multi-method matrix

table with information about convergent and divergent validity (both necessary for construct validity)

convergent validity

correlation of scores on the new test with other available measures of same trait

divergent (discriminant) validity

correlation of scores on the new test with scores on another test that measures a different trait or construct

Degrees of Freedom for T-Test: Single Sample, Matched/Correlated Sample, Independent Samples

Single Sample: N-1
Matched/Correlated Sample: #pairs-1
Independent Samples: N-2

Degrees of Freedom for Chi Square: Single Sample, Multiple Sample

Single Sample: #rows-1
Multiple Sample: (#rows-1)(#columns-1)

Degrees of Freedom for One-Way ANOVA: df total, df between, df within

df total: N-1
df between: #groups-1
df within: df total - df between
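
A quick sketch of the one-way ANOVA degrees of freedom on this card. The function name and the study size (30 people in 3 groups) are hypothetical.

```python
def one_way_anova_df(n_total, n_groups):
    """Return (df total, df between, df within) for a one-way ANOVA."""
    df_total = n_total - 1
    df_between = n_groups - 1
    df_within = df_total - df_between
    return df_total, df_between, df_within

# Hypothetical study: 30 participants split across 3 groups
print(one_way_anova_df(30, 3))  # (29, 2, 27)
```

Note that df within also equals N minus the number of groups (30 - 3 = 27).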

How to calculate shared/explained variability

square the correlation

How to calculate correlation

take the square root of the shared/explained variability

True score variability

Reliability coefficient interpreted directly (e.g. if reliability is .64, true score variability is 64%)

Pearson r range

-1.0 to +1.0

Range of reliability coefficient

0.0 to +1.0

Range of validity coefficient

-1.0 to +1.0

Range of standard error of measurement

0.0 to SDx

range of the standard error of the estimate

0.0 to SDy

how to calculate 95% confidence interval

multiply the standard error of measurement by 1.96 (or 2), then add the result to and subtract it from the examinee's score
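
The steps on this card can be sketched as follows. The function name, the obtained score of 100, and the standard error of measurement of 4.5 are illustrative assumptions.

```python
def confidence_interval_95(obtained_score, sem, z=1.96):
    """Return (lower, upper) bounds: obtained score -/+ z * SEM."""
    margin = z * sem
    return obtained_score - margin, obtained_score + margin

# Hypothetical examinee: obtained score 100, SEM 4.5
lower, upper = confidence_interval_95(100, 4.5)
print(round(lower, 2), round(upper, 2))  # 91.18 108.82
```

Passing `z=2` gives the rounded version of the rule, an interval of 91 to 109 for the same examinee.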