Test Construction Flashcards

1
Q

Test

A

A “test” is a systematic method for measuring a sample of behavior. Although the exact procedures used to develop a test depend on its purpose and format, test construction ordinarily involves specifying the test’s purpose, generating test items, administering the items to a sample of examinees for the purpose of item analysis, evaluating the test’s reliability and validity, and establishing norms.

2
Q

Relevance

A

Refers to the extent to which test items contribute to achieving the stated goals of testing.

Determination is a qualitative judgment based on 3 factors:

  1. Content Appropriateness (Does item assess content it’s designed to evaluate?)
  2. Taxonomic level (Does item reflect approp. cog./ability level?)
  3. Extraneous Abilities (What extent does the item req. knowledge, skills or abilities outside the domain of interest?)
3
Q

Item Difficulty

A

(Ordinal Scale) An item’s difficulty level is calculated by dividing the # of individuals who answered the item correctly by the total number of individuals.

p=Total # of examinees passing the item/Total # of examinees

p value ranges from 0 (nobody answered item correctly; very difficult) to 1.0 (Item answered correctly by all; very easy).

An item difficulty index of p=.50 is optimal because it maximizes differentiation between individuals w/high & low ability & helps ensure a high reliability coefficient.

Ex: Developers of the EPPP would be interested in assessing item difficulty to make sure the exam does not contain too many items that are either too easy or too difficult.

1 exception: true/false tests bc probability of answering the question correctly by chance is .50; optimal difficulty level is p=.75 (Item Difficulty Index = p)

For a multiple-choice item w/4 options, the probability of answering the item correctly by guessing is 25% (.25); so the optimal p value is halfway btwn 1.0 & .25, which is .625.
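A minimal Python sketch of these calculations (function names and values are illustrative, not from the source):

    def item_difficulty(num_passing, num_examinees):
        """p = proportion of examinees who answered the item correctly (0 to 1.0)."""
        return num_passing / num_examinees

    def optimal_p(chance_level=0.0):
        """Optimal difficulty is halfway between 1.0 and the probability of guessing correctly."""
        return (1.0 + chance_level) / 2

    print(item_difficulty(40, 80))   # 0.5   -> moderate difficulty
    print(optimal_p(0.50))           # 0.75  -> true/false items
    print(optimal_p(0.25))           # 0.625 -> 4-option multiple-choice items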

4
Q

Item Difficulty Index (p)

A

(Ordinal Scale) For most tests a test developer wants items w/p values close to .50

When the goal of testing is to choose a certain # of top performers, the optimal p value corresponds to the proportion of examinees to be chosen.

The optimal value is affected by the likelihood that examinees can select the correct answer by guessing, w/the preferred difficulty level being halfway btwn 100% of examinees answering the item correctly & the probability of answering correctly by guessing.

The optimum p value also depends on the test’s ceiling & floor:

  • Adequate Ceiling: Occurs when the test can distinguish btwn examinees w/high levels of the attribute being measured.
    • Ceiling is maximized by including a large proportion of items w/a low p value (difficult items).
  • Adequate Floor: Occurs when the test can distinguish btwn examinees w/low levels of the attribute being measured.
    • Floor is maximized by including a large proportion of items w/a high p value (easy items).
5
Q

Item Discrimination

A

Refers to the extent to which a test item discriminates (differentiates) btwn examinees who obtain high vs. low scores on the entire test or on an external criterion.

To calculate, identify the examinees in the sample who obtained the highest & lowest scores on the test &, for each item, subtract the % of examinees in the lower-scoring group (L) who answered the item correctly from the % in the upper-scoring group (U) who answered it correctly:

D = U - L

The item discrimination index (D) ranges from -1.0 to + 1.0.

  • D = +1.0: all examinees in the upper group & none in the lower group answered the item correctly.
  • D = 0: the same percentage of examinees in both grps answered the item correctly.
  • D = -1.0: none of the examinees in the upper group & all examinees in the lower group answered the item correctly.

For most tests, a D of .35 or higher is considered acceptable; items w/a difficulty level of p = .50 have the greatest potential for maximum discrimination.
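A minimal sketch of D = U − L in Python (the group proportions below are made-up values for illustration):

    def item_discrimination(p_upper, p_lower):
        """D = proportion correct in the upper-scoring group minus proportion correct in the lower group."""
        return p_upper - p_lower

    print(item_discrimination(0.90, 0.40))   # D = 0.5 -> good discrimination
    print(item_discrimination(0.55, 0.55))   # D = 0.0 -> item does not discriminate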

6
Q

Item Response Theory (IRT)

A

Advantages of IRT are that item parameters are sample invariant (same across different samples) & performance on different sets of items or tests can be easily equated.

Use of IRT involves deriving an item characteristic curve for each item. (IRT = ICC)

7
Q

Item Characteristic Curve (ICC)

A

When using IRT, an ICC is constructed for each item by plotting the proportion of examinees in the tryout sample who answered the item correctly against either:

  • The total test score,
  • Performance on an external criterion, or
  • A mathematically-derived estimate of a latent ability or trait.

The curve provides info. on the relationship btwn an examinee's level on the ability or trait measured by the test & the probability that he/she will respond to the item correctly.

The difficulty level of an item is indicated by the ability level (e.g., on a scale of -3 to +3) at which 50% of examinees in the sample answered the item correctly; an item whose curve crosses the 50% point at an ability level of 0 has average difficulty.

The item's ability to discriminate btwn high & low achievers is indicated by the slope of the curve; the steeper the slope, the greater the discrimination.

Probability of guessing correctly is indicated by the point where the curve intercepts the vertical axis.
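These relationships can be illustrated with a logistic ICC. The sketch below assumes the common three-parameter logistic (3PL) form, where a is the slope (discrimination), b is the location (difficulty), & c is the lower asymptote (guessing); this is one standard model, not necessarily the exact one the source has in mind:

    import math

    def icc(theta, a=1.0, b=0.0, c=0.0):
        """Probability of a correct response at ability level theta (3PL model)."""
        return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

    # With no guessing (c = 0), the curve crosses .50 exactly at theta = b (the item's difficulty):
    print(round(icc(0.0, a=1.5, b=0.0), 2))            # 0.5
    # A nonzero c raises the curve for low-ability examinees (guessing):
    print(round(icc(-3.0, a=1.5, b=0.0, c=0.25), 2))   # ~0.26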

8
Q

Classical Test Theory

A

Theory of measurement that regards observed variability in test scores as reflecting 2 components:

  • True Score Variability (True differences btwn examinees on the attribute(s) measured by the test) &
  • Variability due to measurement (random) error (the effects of random factors).

Reliability is a measure of true score variability.

An examinee's obtained test score (X) is composed of 2 components:

  • (T) a true score component &
  • (E) an error component:

X = T + E

(Ex: X = the score obtained on the licensing exam is likely to be due to both (T) the knowledge the examinee has about the test content & (E) the effects of random factors such as anxiety, the way items were written, attention, etc.)
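A small simulation sketch of X = T + E (all values are assumed for illustration), showing that the proportion of observed score variance that is true score variance behaves like a reliability coefficient:

    import random
    import statistics

    random.seed(0)
    true_scores = [random.gauss(100, 10) for _ in range(5000)]   # T: true differences btwn examinees
    errors = [random.gauss(0, 5) for _ in range(5000)]           # E: random measurement error
    observed = [t + e for t, e in zip(true_scores, errors)]      # X = T + E

    # Reliability ~ true score variance / observed score variance (~.80 with these assumed SDs)
    print(round(statistics.variance(true_scores) / statistics.variance(observed), 2))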

9
Q

Reliability

A

Refers to the consistency of test scores; i.e., the extent to which a test measures an attribute w/out being affected by random fluctuations (measurement error) that produce inconsistencies over time, across items, or over different forms.

A test is reliable:

  1. To the degree that it is free from error & provides info. about examinees' “true” test scores &
  2. To the degree that it provides repeatable, consistent results.

Limitation: item & test parameters are sample-dependent.

Reliability is est. by evaluating consistency in scores over time or across different forms of the test, different test items, or different raters. This method is based on the assumption that:

  • True score variability is consistent
  • Variability due to measurement error is inconsistent.

Methods for establishing reliability include:

  • Test-retest,
  • Alternative forms,
  • Split-half,
  • Coefficient alpha, and
  • Inter-rater.
10
Q

Reliability Coefficient

A

Most methods of evaluating reliability produce a reliability coefficient: a correlation coefficient (calculated by correlating at least 2 scores obtained from each examinee in a sample) that ranges in value from:

  • 0.0 (all variability is due to measurement error; no reliability) to +1.0 (all variability is true score variability; perfect reliability).
  • The coefficient is interpreted directly as the proportion of variability in a set of test scores that is attributable to true score variability.

(Ex: a reliability of .80 indicates that 80% of variability in test scores is true score variability; the remaining 20% [1 - .80] is due to measurement error.)

(Reliability coefficient = rₓₓ)

11
Q

Test-Retest Reliability

A

A method for assessing reliability that involves administering the same test to the same group of examinees on 2 different occasions & correlating the 2 sets of scores (test-retest reliability).

Yields a Coefficient of Stability: Provides a measure of test score stability over time.

Approp. for tests designed to measure attributes/characteristics that are stable over time & not affected by repeated measurements, or that are affected only in a random way (ex: aptitude).

12
Q

Alternate Forms Reliability

A

Provides a measure of test score consistency over 2 forms of the test (aka parallel forms/equivalent forms).

  • Method for est. a test’s reliability that entails administering 2 equivalent forms of the test to the same group of examinees & correlating the 2 sets of scores.

Forms can be administered at about the same time (coefficient of equivalence) or at different times (coefficient of equivalence & stability).

Considered by some experts to be the best (most thorough) method for assessing reliability.

Best for determining the reliability of tests:

  • designed to measure attributes that are stable over time & not affected by repeated measurement;
  • designed to measure characteristics that fluctuate over time (when the forms are administered at about the same time); &
  • when exposure to 1 form is likely to affect performance on the other form in an unsystematic way.
13
Q

Internal Consistency Reliability

A

Degree to which items included in the test are measuring the same characteristic & indicates the degree of consistency across different test items.

  • Approp. for tests that measure a single content or behavior domain (Ex: subtest but not entire test)
  • Useful for est. the reliability of tests that measure characteristics that fluctuate over time or are susceptible to memory or practice effects.

Includes split-half reliability & coefficient alpha (Cronbach's alpha)/KR-20.

14
Q

Split-Half Reliability

A

A method for assessing internal consistency reliability & involves:

  • “splitting” the test in half (e.g., odd- versus even-numbered items) & correlating examinees' scores on the 2 halves of the test.

Since the size of a reliability coefficient is affected by test length, the split-half method tends to underestimate a test’s true reliability.

The split-half reliability coefficient is usually corrected with the Spearman-Brown formula.

  • The Spearman-Brown formula can also be used more generally to est. the effect of shortening or lengthening a test on its reliability coefficient.

(Split-half Reliability = Spearman-Brown Formula)

Not approp. for speeded tests in which score depends on speed of responding.

Shorter tests are less reliable than longer ones.

15
Q

Spearman-Brown Formula

A

The Spearman-Brown formula estimates what a test's reliability would be if it were based on the full length of the test; it is used w/the split-half method to obtain an estimate of the test's true reliability.

(Split-half Reliability = Spearman-Brown Formula)
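A short sketch of the standard Spearman-Brown prophecy formula, r_new = (k · r) / (1 + (k − 1) · r), where k is the factor by which the test is lengthened (k = 2 corrects a split-half coefficient to full length); the example values are made up:

    def spearman_brown(r, k):
        """Estimated reliability when a test with reliability r is lengthened by a factor of k."""
        return (k * r) / (1 + (k - 1) * r)

    print(round(spearman_brown(0.60, 2), 2))     # 0.75: split-half of .60 corrected to full length
    print(round(spearman_brown(0.75, 0.5), 2))   # 0.60: estimated reliability if the test were halved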

16
Q

Coefficient Alpha

A

Method for assessing internal consistency reliability that provides an index of average inter-item consistency rather than the consistency between 2 halves of the test.

KR-20 can be used as a substitute for coefficient alpha when test items are scored dichotomously.

(KR-20 = Coefficient Alpha)
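A minimal sketch of the usual coefficient alpha formula, alpha = (k / (k − 1)) × (1 − Σ item variances / variance of total scores); the item scores below are made-up values:

    import statistics

    def coefficient_alpha(item_scores):
        """item_scores: one list of item scores per examinee."""
        k = len(item_scores[0])                                        # number of items
        items = list(zip(*item_scores))                                # scores grouped by item
        item_variances = sum(statistics.variance(item) for item in items)
        totals = [sum(person) for person in item_scores]
        return (k / (k - 1)) * (1 - item_variances / statistics.variance(totals))

    data = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 2], [3, 4, 3]]     # 5 examinees, 3 items
    print(round(coefficient_alpha(data), 2))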

17
Q

Kuder-Richardson Formula 20 (KR-20)

A

Kuder-Richardson Formula 20 (KR-20) can be used as a substitute for coefficient alpha when test items are scored dichotomously (scored as right or wrong; Ex: T/F & multiple choice questions).

(KR-20 = Coefficient Alpha)
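KR-20 is the same calculation w/item variances replaced by p × q for each dichotomous item (p = proportion passing, q = 1 − p); a brief sketch with made-up 0/1 item scores:

    import statistics

    def kr20(item_scores):
        """item_scores: one list of 0/1 item scores per examinee."""
        k, n = len(item_scores[0]), len(item_scores)
        pq = sum((sum(item) / n) * (1 - sum(item) / n) for item in zip(*item_scores))
        totals = [sum(person) for person in item_scores]
        return (k / (k - 1)) * (1 - pq / statistics.variance(totals))

    data = [[1, 1, 1, 0], [1, 0, 0, 0], [1, 1, 1, 1], [0, 0, 1, 0], [1, 1, 0, 0]]
    print(round(kr20(data), 2))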

18
Q

Inter-Rater Reliability

A

Important for tests that are subjectively scored (i.e., scoring is based on judgment), such as essay & projective tests.

Used to make sure an examinee will obtain the same score no matter who is doing the scoring.

The scores assigned by different raters can be used to calculate a correlation (reliability) coefficient or to determine the percent agreement between raters; however, percent agreement can be artificially inflated by the effects of chance agreement.

Alternatively, a special correlation coefficient can be used, such as:

  • Cohen's Kappa Statistic: Used to measure agreement btwn 2 raters when scores represent a nominal scale.
  • Kendall's Coefficient of Concordance: Used to measure agreement btwn 3 or more raters when scores are reported as ranks.

Reliability coefficients over .80 are generally considered acceptable.

(Kappa Statistic = Inter-rater reliability)
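A minimal sketch of Cohen's kappa, (observed agreement − chance agreement) / (1 − chance agreement), for 2 raters assigning nominal categories; the ratings are made up:

    from collections import Counter

    def cohens_kappa(rater1, rater2):
        n = len(rater1)
        observed = sum(a == b for a, b in zip(rater1, rater2)) / n     # percent agreement
        c1, c2 = Counter(rater1), Counter(rater2)
        chance = sum((c1[cat] / n) * (c2[cat] / n) for cat in set(rater1) | set(rater2))
        return (observed - chance) / (1 - chance)

    r1 = ["yes", "yes", "no", "no", "yes", "no"]
    r2 = ["yes", "no", "no", "no", "yes", "no"]
    print(round(cohens_kappa(r1, r2), 2))   # ~0.67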

19
Q

Cohen’s Kappa Statistic & Kendall’s Coefficient of Concordance

A

Cohen’s Kappa Statistic: A correlation coefficient used to assess agreement btwn 2 raters (inter-rater reliability) when scores represent a nominal scale.

(Kappa Statistic = Inter-rater reliability)

Kendall's Coefficient of Concordance: Used to measure agreement btwn 3 or more raters when scores are reported as ranks.

20
Q

Factors that Affect the size of the Reliability Coefficient

A

4 factors affect the size of a test's reliability coefficient:

  1. Test Length
  2. Range of Scores/Heterogeneity
  3. Test Content
  4. Guessing
21
Q

Test Length

A

The larger the sample of the attribute being measured by a test, the smaller the relative effects of measurement error & the more likely the test will provide dependable, consistent information.

Longer tests are generally more reliable than shorter tests.

Test length is increased by adding items of similar content & quality.

The Spearman-Brown formula can be used to estimate the effect of lengthening a test on its reliability, although it tends to overestimate the test's true reliability.

22
Q

Range of Scores

A

A test's reliability coefficient increases w/the heterogeneity of the sample in terms of the attribute measured by the test, because greater heterogeneity increases the range of scores. (Items w/a difficulty of p = .50 also help maximize the range.)

Reliability is maximized when the range of scores is unrestricted.

23
Q

Test Content

A

The more homogeneous a test is w/regard to content, the higher its reliability coefficient.

Easiest to understand if you consider internal consistency: the more consistent the test items are in terms of content, the larger the coefficient alpha or split-half reliability coefficient.

24
Q

Guessing

A

A test’s reliability coefficient is also affected by the probability that examinees can guess the correct answer to test items.

As the probability of correctly guessing increases, the reliability coefficient decreases.

The more difficult it is to pick the correct answer by guessing, the larger the reliability coefficient.

25
Q

Confidence Interval

A

Helps a test user to estimate the range w/in which an examinee’s true score is likely to fall given their obtained score.

Bc tests are not totally reliable, an examinee’s obtained score may or may not be his/her true score.

The best way to interpret an examinee's obtained score is to construct a confidence interval around that score.

  • A confidence interval indicates the range w/in which an examinee’s true score is likely to fall given the obtained score.
  • It is derived using the Standard Error of Measurement (SEM):
    • 68% = +/- 1 SEM from obtained score
    • 95% = +/- 2 SEM from obtained score
    • 99% = +/- 3 SEM from obtained score

​Standard Error of Measurement (SEM) = Used to construct a confidence interval around a measured or obtained score.

26
Q

Standard Error of Measurement

A

It is used to construct a confidence interval around an examinee’s obtained (measured) score.

The SEM is calculated by multiplying the standard deviation of the test scores by the square root of (1 minus the reliability coefficient).

This is an index of the amount of error that can be expected in obtained scores due to the unreliability of the test.

SEM = SDx √(1 − rₓₓ)

Ex: A psychologist administers an interpersonal assertiveness test to a sales applicant who receives a score of 80. Since test‘s reliability is less than 1.0, the psych. knows that this score might be an imprecise est. of the applicant’s true score & decides to use the standard error of measurement to construct a 95% confidence interval. Assuming that the test‘s reliability coefficient is .84 and its standard deviation is 10, the standard error of measurement is equal to 4.0

The psych. constructs a 95% confidence interval by adding and subtracting 2 standard errors from the applicant’s obtained score: 80 ±2(4.0) = 72 to 88. There is a 95% chance that the applicant’s true score falls between 72 and 88.

SEM = SDx √(1 − rₓₓ) = 10 √(1 − .84) = 10 √.16 = 10(.4) = 4.0
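The same calculation as a small Python sketch (SD = 10, rₓₓ = .84, obtained score = 80, all from the example above):

    import math

    def standard_error_of_measurement(sd, reliability):
        return sd * math.sqrt(1 - reliability)

    sem = standard_error_of_measurement(10, 0.84)
    print(round(sem, 2))                                    # 4.0
    print(round(80 - 2 * sem, 2), round(80 + 2 * sem, 2))   # 72.0 88.0 -> ~95% confidence interval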

27
Q

Validity

A

Refers to a test's accuracy in terms of the extent to which the test measures what it is intended to measure.

3 Different Types of Validity include:

  1. Content Validity (content or behavior)
  2. Construct Validity (hypothetical trait or construct)
  3. Criterion-Related Validity (status/perf. on external criterion)
28
Q

Content Validity

A

Important for tests designed to measure a specific content or behavior domain; a test has content validity when its items are an accurate & representative sample of the content domain they are intended to represent.

Content validity is not the same as face validity.

Most important for achievement & job sample tests.

Determined primarily by “expert judgment”

This is of concern when a test has been designed to measure 1 or more content/behavior domains.

A test has content validity when its items are a representative sample of the domain(s) that the test is intended to measure.

Usually built into a test as it is being constructed & involves clearly defining:

  • The content/behavior domain
  • Divided into categories & sub-categories
  • Then writing or selecting a representative sample of items for each sub-category.

After test development, content validity is checked by having subject matter experts systematically evaluate the test & determine whether its items are an adequate & representative sample of the content or behavior domain.

Content validity is of interest when scores on the test (X) are important bc they provide info. on how much each examinee knows about a content domain.

29
Q

Construct Validity

A

(Broadest category of validity) This is important when a test will be used to measure a hypothetical trait or construct; the test has construct validity when it has been shown to actually measure the hypothetical trait it is intended to measure.

A method of assessing construct validity is to correlate test scores w/scores on other measures that do & do not measure the same trait, to determine if the test has both:

  • Convergent Validity: The correlation btwn the test we're validating & a measure of the same trait using a different method provides info. about the test's convergent validity. (Monotrait-Heteromethod; large)
  • Discriminant (divergent) Validity: The correlation btwn the test we're validating & measures of unrelated traits provides info. about the test's divergent validity. (Heterotrait-Monomethod; small)
  • Multitrait-Multimethod Matrix is used to eval. construct validity. A table of correlation coefficients that provide info. about the test’s convergent & divergent (discriminant) validity.

Convergent & divergent combo provide evidence that the test is actually measuring the construct it was designed to measure.

Other methods include:

  • Conducting a factor analysis to assess the test's factorial validity. This provides info. about convergent & divergent validity but is a more complex technique.
  • Determining if changes in test scores reflect expected developmental changes; &
  • Seeing if experimental manipulations have the expected impact on test scores.

These tests include measures of constructs such as achievement motivation, intelligence, & mechanical aptitude.

When scores on the test (X) are important bc they provide info. on how much each examinee knows about a content domain or on each examinee’s status w/regard to the traits being measured, then content or construct validity are of interest.

30
Q

Convergent & Discriminant Validity

A
  • Convergent Validity: The correlation btwn the test we're validating & a measure of the same trait using a different method provides info. about the test's convergent validity.
    • When a test has high correlations w/measures that assess the same construct.
    • (Monotrait-Heteromethod)
    • High correlations = convergent
  • Discriminant (divergent) Validity: The correlation btwn the test we're validating & measures of unrelated traits provides info. about the test's divergent validity.
    • When a test has low correlations w/measures of unrelated characteristics.
    • (Heterotrait-Monomethod; small)
    • Low correlations = discriminant (divergent)

Convergent & divergent combo provide evidence that the test is actually measuring the construct it was designed to measure.

Multitrait-Multimethod Matrix is used to eval. w/a table of correlation coefficients that provide info. about the test’s convergent & divergent (discriminant) validity.

31
Q

Multitrait-Multimethod Matrix

A

A systematic way to organize the data collected when assessing a test's convergent & discriminant validity.

The matrix is a table of correlation coefficients that provide info. about the test's convergent & divergent (discriminant) validity.

Requires measuring at least 2 different traits using at least 2 different methods for each trait. The matrix contains 4 types of correlation coefficients:

  1. Monotrait-Monomethod Coefficients: Measures the same-trait-same method.
  2. Monotrait-Heteromethod Coefficients: Measures same trait-different methods. (Convergent)
  3. Heterotrait-Monomethod Coefficients: Measures different traits-same method. (Divergent)
  4. Heterotrait-Heteromethod Coefficients: Measures different traits-different methods.
32
Q

Monotrait-Monomethod Coefficients (same-trait-same method)

A

Reliability coefficients: Indicate the correlation between a measure & itself.

Although these coefficients are not directly relevant to a test's convergent & discriminant validity, they should be large for the matrix to provide useful info.

33
Q

Monotrait-Heteromethod Coefficients (same trait-different methods)

A

These coefficients (coefficients in rectangles) indicate the correlation between different measures of the same trait.

It indicates that a test has convergent validity when the monotrait-heteromethod coefficients are large.

34
Q

Heterotrait-Monomethod Coefficients (different traits-same method)

A

These coefficients (coefficients in ellipses) show the correlation between different traits that have been measured by the same method.

A test has discriminant validity when the heterotrait-monomethod coefficients are small.

35
Q

Heterotrait-Heteromethod Coefficients (different traits-different methods)

A

These coefficients (underlined coefficients) indicate the correlation between different traits that have been measured by different methods.

A test has discriminant validity when the heterotrait-heteromethod coefficients are small.

36
Q

Factor Analysis

A

A multivariate statistical technique used to ID how many factors (constructs/dimensions) underlie the intercorrelations among a set of tests, subtests, or test items.

One use of obtained data is to determine if a test has construct validity by indicating the extent to which the test correlates with factors that it would & would not be expected to correlate with.

A test shows construct validity when it has high correlations w/factors expected to correlate with & low correlations w/factors not expected to correlate with.

True score variability consists of:

  • Communality & Specificity.

Factors ID in a factor analysis can be either:

  • Orthogonal
  • Oblique.
37
Q

5 Basic Steps for a Factor Analysis:

A

Factor analysis can be used to group a large number of test items into subtests & subscales, to test hypotheses about how test items, scales, & subscales are related to one another, & to assess construct validity.

  1. Administer Tests to a Sample of Examinees: Administer the test to be validated, along w/several other tests that measure the same & different traits, to a group of examinees.
  2. Derive & Interpret the Correlation Matrix: Correlate scores on each test w/scores on every other test to obtain a correlation (R) matrix, which indicates the correlations for all pairs of tests included in the analysis. ID clusters of tests that are highly correlated w/1 another; the number of clusters suggests how many factors should be extracted in the factor analysis.
  3. Extract the Initial Factor Matrix: Using 1 of several available factor analytic techniques, convert the correlation matrix to a factor matrix (often difficult to interpret). The factor matrix contains correlation coefficients that indicate the degree of association btwn each test & each factor.
  4. Rotate the Factor Matrix: Simplify interpretation of the factors by “rotating” them to obtain the final factor matrix.
  5. Name the Factors: Interpret & name the factors in the rotated factor matrix. When factor analysis is used to assess construct validity, the test should have high correlations w/factors it is expected to correlate with & low correlations w/factors it is not expected to correlate with.
38
Q

Factor Loadings

A

In factor analysis:

Factor Loading is the correlation btwn a test (or other variable included in the analysis) & a factor.

To interpret a factor loading, square it to determine the amount of variability in the test scores that is explained by the factor.

A squared factor loading provides a measure of “shared variability.”

Ex: If a test has a correlation of .50 w/Factor I, then .50² = .25, meaning 25% of the variability in test scores is explained by Factor I.

39
Q

Communality

A

The total amount of variability in scores on the test (or other variable) that is due to the factors the test shares in common w/the other tests in the analysis (the identified factors).

Communality is a lower-limit estimate of a test’s reliability coefficient.

40
Q

Specificity

A

The 2nd component of true score variability in a factor analysis: variability that is due to factors that are specific or unique to the test & not measured by any other test included in the analysis.

Specificity identifies the portion of true score variability that is not explained by the identified factors.

41
Q

Rotation (Matrix)

A

Rotation re-divides the communality of each test included in the analysis. A researcher decides which type of rotation is approp. based on his/her theory about the characteristics measured by the tests included in the analysis.

Due to the re-division, each factor accounts for a different proportion of a test's variability than it did prior to the rotation.

This process makes it easier to interpret factor loadings.

There are 2 types of rotations:

  1. Orthogonal (Uncorrelated)
  2. Oblique (Correlated)
42
Q

Orthogonal

A

When the rotation is orthogonal, the identified factors are uncorrelated (they measure unrelated attributes), & a test's communality can be calculated by summing its squared factor loadings.

Communality can be calculated this way only when the factors are orthogonal.

Ex: If a test has a correlation of .50 w/Factor 1 & a correlation of .20 w/Factor 2 & the factors are uncorrelated, the test's communality is:

.29 (Factor 1: .50² = .25; Factor 2: .20² = .04; .25 + .04 = .29)

This means that 29% of the variability in test scores is explained by the ID factors, while the remaining variability is due to some combo of specificity & measurement error.
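A one-line sketch of that calculation (loadings .50 and .20 from the example above, assuming orthogonal factors):

    def communality(loadings):
        """Sum of squared factor loadings (valid when factors are orthogonal)."""
        return sum(loading ** 2 for loading in loadings)

    print(round(communality([0.50, 0.20]), 2))   # 0.29 -> 29% of test score variability explained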

43
Q

Oblique

A

When the rotation is oblique the ID factors are correlated, & the attributes measured by the factors are not independent.

44
Q

Criterion-Related Validity

A

The type of validity that involves determining the relationship (correlation) btwn the predictor & the criterion. Important when test scores will be used to predict/est. scores on a criterion.

The correlation coefficient is referred to as the criterion-related validity coefficient.

This type of validity can be either:

  • Concurrent: Involves obtaining scores on the predictor & criterion at about the same time; current.
  • Predictive: Involves obtaining predictor scores before criterion scores; future.

This is of interest when a test has been designed to estimate or predict an examinee's standing or performance on another measure (an external criterion).

When the test (X) score will be used to predict scores on some other measure (Y) & it is the scores on Y that are of the most interest, then this type of validity is of interest.

45
Q

Criterion-Related Validity Coefficient

A

The validity coefficient represents the correlation btwn 2 different measures (the predictor & the criterion).

When the correlation btwn the 2 measures is squared, it provides a measure of shared variability.

(Tip: On exam if Q gives the correlation coefficient for X (predictor) & Y (criterion), & asks how much variability in Y is explained by X; need to square the correlation coefficient to get the correct answer).

46
Q

Concurrent Validity

A

When predictor & criterion scores are obtained at the same time.

Used to estimate current status/performance on the criterion. (Criterion-Related Validity)

47
Q

Predictive Validity

A

When predictor scores are obtained before criterion scores.

Preferred type to predict future performance on the criterion. (Criterion-Related Validity)

48
Q

Standard Error of Estimate

A

Used to construct a confidence interval around an estimated or predicted score. An index of error when predicting criterion scores from predictor scores.

It’s magnitude depends on 2 factors:

  1. The criterion’s SD
  2. The predictor’s validity coefficient

Used to construct a confidence interval around an examinee’s predicted criterion score:

  • 68% confidence interval is constructed by +/- 1 SEest from the predicted criterion score.
  • 95% interval by +/- 2 SEest from the predicted criterion score.
  • 99% interval by +/- 3 SEest from the predicted criterion score.

SEest = SDy √(1 − rxy²)

Ex: Given a criterion SD of 10 & a validity coefficient of .60, the standard error of estimate is:

SEest = 10 √(1 − .60²) = 10 √(1 − .36) = 10 √.64 = 10(.8) = 8
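The same calculation as a small Python sketch (criterion SD = 10, validity coefficient = .60, with an assumed predicted criterion score of 50 for the confidence interval):

    import math

    def standard_error_of_estimate(sd_y, validity):
        return sd_y * math.sqrt(1 - validity ** 2)

    see = standard_error_of_estimate(10, 0.60)
    predicted = 50                                                    # assumed predicted criterion score
    print(round(see, 2))                                              # 8.0
    print(round(predicted - 2 * see, 1), round(predicted + 2 * see, 1))  # 34.0 66.0 -> ~95% interval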

49
Q

Incremental Validity

A

The extent to which a predictor increases decision-making accuracy when the predictor is used to make selection decisions.

Calculated by subtracting the base rate from the positive hit rate.

Evaluated by comparing the number of correct decisions made with & w/out the new predictor.

This has been linked to predictor & criterion cutoff scores;

  • true & false positives;
  • true & false negatives.

Scatterplots are used to assess a predictor's incremental validity by dividing the plot into 4 quadrants; the predictor cutoff determines whether someone is a positive or negative, & the criterion cutoff determines whether the prediction is true or false.
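A minimal sketch of the hit-rate arithmetic described above, using assumed counts for the 4 scatterplot quadrants:

    def incremental_validity(tp, fp, tn, fn):
        """Positive hit rate (successes among those selected) minus the base rate."""
        total = tp + fp + tn + fn
        base_rate = (tp + fn) / total          # proportion who succeed regardless of the predictor
        positive_hit_rate = tp / (tp + fp)     # proportion of those selected who succeed
        return positive_hit_rate - base_rate

    print(incremental_validity(tp=30, fp=10, tn=40, fn=20))   # 0.25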

50
Q

True Positive

A

Scored high on predictor & criterion;

ppl predicted to be successful & are.

On scatterplot usually right upper quadrant. (Incremental Validity)

51
Q

False Positive

A

Scored high on predictor & low on criterion;

ppl predicted to be successful but are not.

To reduce the number of false positives, the predictor cutoff can be raised and/or the criterion cutoff can be lowered.

On scatterplot usually right bottom quadrant. (Incremental Validity)

52
Q

True Negative

A

Scored low on predictor & criterion;

ppl predicted to be unsuccessful & are.

On scatterplot usually left bottom quadrant. (Incremental Validity)

53
Q

False Negative

A

Scored low on predictor & high on criterion;

ppl predicted to be unsuccessful but are successful.

On scatterplot usually left upper quadrant. (Incremental Validity)

54
Q

Relationship Between Reliability & Validity

A

Reliability is a necessary but not sufficient condition for validity.

In terms of criterion-related validity, the validity coefficient can be no greater than the square root of the product of the reliabilities of the predictor & criterion.

The formula indicates that reliability places an upper limit on validity:

rxy ≤ √(rxx · ryy)

A valid test must be reliable but

A reliable test may or may not be valid

55
Q

Cross-Validation

A

Process of re-assessing a test’s criterion-related validity on a new sample to check the generalizability of the original validity coefficient.

56
Q

Shrinkage

A

Bc the predictor is often “tailor-made” for the orig. validation sample, the cross-validation coefficient tends to “shrink” (become smaller) bc the chance factors operating in the original sample are not all present in the cross-validation sample.

As a result the validity coefficient for the cross-validation sample is usually smaller.

Shrinkage refers to the reduction in the magnitude of a predictor's validity coefficient that occurs when the predictor is validated w/a new sample.

57
Q

Test Score Interpretation

A

An examinee’s raw score is often difficult to interpret unless it’s anchored to the performance of other examinees or a predefined standard of performance.

Types include:

  • Norm-Referenced (e.g., percentile ranks & standard scores)
  • Criterion-Referenced
58
Q

Norm-Referenced Interpretation

A

Involves comparing an examinee's test score to the scores obtained by a standardization sample or other comparison group.

This type of interpretation may entail converting an examinee's raw score to a percentile rank and/or standard score (e.g., z-scores & T scores).

The examinee's raw score is converted to a score that indicates his/her relative standing in the comparison group.

Percentile Ranks & Standard Scores

59
Q

Percentile Ranks (PR)

A

Range from 1 to 99 & express an examinee's score in terms of the percentage of examinees in the sample who achieved lower scores.

Advantage = easy to interpret

Limitation = Indicate an examinee’s relative position in a distribution but do not provide info. about absolute differences btwn examinees in terms of their raw scores.

The distribution of percentile ranks is always flat (rectangular) regardless of the shape of the raw score distribution; the conversion is a nonlinear transformation (the distribution looks different from the raw score distribution).

(Norm-Referenced Interpretation)

60
Q

Standard Scores

A

Anchor an examinee’s test score to those of the norm group by reporting the examinee’s score in terms of standard deviation units from the mean.

Most common is a z-score & the distribution has a mean of 0 and a SD of 1.0.

  • Calculated by subtracting the mean of the distribution from the raw score to obtain a deviation score & then dividing the deviation score by the distribution's SD. The resulting score indicates how far the raw score is from the mean of the distribution in SD units.
    • z = (X − M)/SD

Ex: Z-score of -1.0 indicates that an examinee’s raw score is 1 SD below the mean.

Ex: If an examinee obtains a score of 110 on a test that has a mean of 100 & SD of 10, the z-score is +1.0 (the raw score is 1 SD above the mean):

z = (110 − 100)/10 = 10/10 = +1.0

T-Score Distribution has a mean of 50 & SD of 10

  • An examinee whose raw score is 1 SD above the mean will have a T-score of 60.

Deviation IQ scores have a mean of 100 & SD of 15

  • Examinee whose raw score is 1 SD above the mean will have a score of 115

(Norm-Referenced Interpretation)
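A short Python sketch converting the example raw score of 110 (mean 100, SD 10) into the standard scores described above:

    def z_score(raw, mean, sd):
        return (raw - mean) / sd

    z = z_score(110, mean=100, sd=10)
    t_score = 50 + 10 * z            # T distribution: mean 50, SD 10
    deviation_iq = 100 + 15 * z      # deviation IQ: mean 100, SD 15
    print(z, t_score, deviation_iq)  # 1.0 60.0 115.0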

61
Q

Criterion-Referenced Interpretation

A

Interpretation of a test score in terms of a pre-specified standard:

  • Percentage score (% correct) - Indicate the proportion of the test content (e.g. % of test items) that examinees answered correctly.
  • Regression Equation - Predicted perf. on an external criterion.
  • Expectancy Table - Makes it possible to use an examinee’s predictor (test) score to estimate the probability that they will attain different scores on a criterion.
62
Q

Criterion Contamination

A

Refers to bias introduced into a person's criterion score as a result of the scorer's knowledge of that person's performance on the predictor.

Tends to artificially inflate the relationship between the predictor and criterion.

63
Q

Regression Model

A

Cleary's regression model (a model of test bias): if a test has the same regression line for members of both grps, the test is not biased, even if the grps have different mean scores.

64
Q

Correction for Guessing

A

Scores on objective tests are sometimes corrected for guessing in order to ensure that examinees don’t benefit from guessing wildly.

65
Q

What is the best way to control consensual observer drift?

A

Alternate raters. Consensual observer drift occurs when observers' ratings become increasingly less accurate over time in a systematic way: raters who are working together influence each other's ratings so that they assign ratings in increasingly similar (& idiosyncratic) ways.

66
Q

In a normal distribution of scores, what percentage of ppl receive T scores btwn 30 to 70? A. 50 B. 68 C. 95 D. 99

A

C. 95. To identify the correct answer to this question, you need to be familiar w/the areas under the normal curve & know that T scores have a mean of 50 & SD of 10. T scores of 30 & 70 are 2 SDs below & above the mean &, in a normal distribution, about 95% of scores fall btwn the scores that are +/- 2 SDs from the mean.

67
Q

Banding

A

Method of score adjustment used to take grp differences into account when assigning or interpreting scores. Ex: When a band is defined as 91-100 pts & examinees who receive scores of 91, 95, & 99 are treated the same.

68
Q

Consensual Observer Drift

A

Tends to artificially inflate the size of the inter-rater reliability coefficient.

Occurs when observer ratings become increasingly less accurate over time in a systematic way.

This happens when raters who are working together influence each other's ratings so that they assign ratings in increasingly similar & idiosyncratic ways.

69
Q

Face Validity

A

Refers to whether or not test items “look like” they’re measuring what the test is designed to measure.

Not an actual type of validity, but it is desirable in some situations.

If a test lacks this ppl may not be motivated to respond to items in an honest or accurate way.

70
Q

Kurtosis

A

The degree of peakedness or flatness of a probability distribution, relative to the normal distribution with the same variance. 2 Types:

  1. Leptokurtic: Distribution of scores more peaked than a normal distribution.
  2. Platykurtic: Distribution of scores flatter than a normal distribution.
71
Q

Principal Component Analysis & Eigenvalue

A

A characteristic of this type of analysis is that the components (factors) are extracted so that the 1st component reflects the greatest amount of variability, the 2nd component the 2nd greatest amount, etc.

  • Involves ID the components (factors) that underlie/explain the variability observed in a set of tests or other variables. (Similar to factor analysis)

Eigenvalue: For each principal component, the eigenvalue is calculated by squaring the correlation btwn each test & the component & then summing the results.

The resulting number indicates the total amount of variability in the set of tests that is explained by the principal component.