Test Construction Flashcards (EPPP)

Test Construction Deck (71 cards):
1

Test

A "test" is a systematic method for measuring a sample of behavior. Although the exact procedures used to develop a test depend on its purpose and format, test construction ordinarily involves specifying the test's purpose, generating test items, administering the items to a sample of examinees for the purpose of item analysis, evaluating the test’s reliability and validity, and establishing norms.

2

Relevance

Refers to the extent to which test items contribute to achieving the stated goals of testing.

Determination is a qualitative judgement based on 3 factors:

  1. Content Appropriateness (Does item assess content it's designed to evaluate?)
  2. Taxonomic level (Does item reflect approp. cog./ability level?)
  3. Extraneous Abilities (What extent does the item req. knowledge, skills or abilities outside the domain of interest?)

3

Item Difficulty

(Ordinal Scale) An item's difficulty level is calculated by dividing the # of individuals who answered the item correctly by the total number of individuals.

p = (# of examinees passing the item) / (Total # of examinees)

p value ranges from 0 (nobody answered item correctly; very difficult) to 1.0 (Item answered correctly by all; very easy).

An item difficulty index of p=.50 is optimal because it maximizes differentiation between individuals w/high & low ability & helps ensure a high reliability coefficient.

Ex: Developers of the EPPP would be interested in assessing item difficulty to make sure the exam does not contain too many items that are either too easy or too difficult.

1 exception: for true/false tests, the probability of answering correctly by chance is .50, so the optimal difficulty level is p = .75. (Item Difficulty Index = p)

For a multiple-choice item w/4 options, the probability of answering correctly by guessing is .25, so the optimum p value is halfway btwn 1.0 & .25, which is .625.
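The p and optimal-p calculations above can be sketched in Python (function names are illustrative, not from any testing library):

```python
def item_difficulty(num_passing, num_examinees):
    """Item difficulty index: p = # of examinees passing the item / total # of examinees."""
    return num_passing / num_examinees

def optimal_p(chance_level):
    """Optimal difficulty when guessing is possible: halfway btwn 1.0
    and the probability of answering correctly by chance."""
    return (1.0 + chance_level) / 2

print(item_difficulty(50, 100))  # 0.5 (moderate difficulty)
print(optimal_p(0.50))           # 0.75 (true/false items)
print(optimal_p(0.25))           # 0.625 (4-option multiple choice)
```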

4

Item Difficulty Index (p)

(Ordinal Scale) For most tests a test developer wants items w/p values close to .50

When the goal of testing is to choose a certain # of top performers, the optimal p value corresponds to the proportion of examinees to be chosen.

The optimal value is affected by the likelihood that examinees can select the correct answer by guessing, w/the preferred difficulty level being halfway btwn 100% of examinees answering the item correctly & the probability of answering correctly by guessing. 

The optimum p value also depends on the test's ceiling & floor:

  • Adequate Ceiling: Occurs when the test can distinguish btwn examinees w/high levels of the attribute being measured.
    • Ceiling is maximized by including a large proportion of items w/a low p value (difficult items).
  • Adequate Floor: Occurs when the test can distinguish btwn examinees w/low levels of the attribute being measured.
    • Floor is maximized by including a large proportion of items w/a high p value (easy items).

5

Item Discrimination

Refers to the extent to which a test item discriminates (differentiates) btwn examinees who obtain high vs. low scores on the entire test or on an external criterion.

To calculate, ID the examinees in the sample who obtained the highest & lowest scores on the test &, for each item, subtract the % of examinees in the lower-scoring group (L) from the % of examinees in the upper-scoring group (U) who answered the item correctly:

D = U - L

The item discrimination index (D) ranges from -1.0 to + 1.0.

  • D = +1.0: all examinees in the upper group & none in the lower group answered the item correctly.
  • D = 0: the same percent of examinees in both grps answered the item correctly.
  • D = -1.0: none of the examinees in the upper group & all examinees in the lower group answered the item correctly.

For most tests, a D of .35 or higher is considered acceptable; items w/a difficulty level of p=.50 have the greatest potential for maximum discrimination.
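The D = U - L calculation above, sketched in Python (proportions rather than percentages; names are illustrative):

```python
def item_discrimination(prop_upper_correct, prop_lower_correct):
    """Item discrimination index: D = U - L, where U and L are the
    proportions of the upper- and lower-scoring groups answering correctly."""
    return prop_upper_correct - prop_lower_correct

print(item_discrimination(1.0, 0.0))    # 1.0 (perfect discrimination)
print(item_discrimination(0.5, 0.5))    # 0.0 (no discrimination)
print(item_discrimination(0.60, 0.25))  # 0.35 (acceptable for most tests)
```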

6

Item Response Theory (IRT)

Advantages of IRT are that item parameters are sample invariant (same across different samples) & performance on different sets of items or tests can be easily equated.

Use of IRT involves deriving an item characteristic curve for each item. (IRT = ICC)

7

Item Characteristic Curve (ICC)

When using IRT, an ICC is constructed for each item by plotting the proportion of examinees in the tryout sample who answered the item correctly against either:

  • The total test score,
  • Performance on an external criterion, or
  • A mathematically-derived estimate of a latent ability or trait.

The curve provides info. on the relationship btwn an examinee's level on the ability or trait measured by the test & the probability that he/she will respond to the item correctly.

The difficulty level of an item is indicated by the ability level (e.g., on a scale of -3 to +3) at which 50% of examinees in the sample obtained a correct response. (Ex: if that point falls at ability level 0, the item's difficulty is 0, the average ability level.)

The item's ability to discriminate btwn high & low achievers is indicated by the slope of the curve; the steeper the slope, the greater the discrimination.

Probability of guessing correctly is indicated by the point where the curve intercepts the vertical axis.
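These notes describe the ICC's shape without giving a formula; one common parametric form (the 3-parameter logistic model from IRT) illustrates how difficulty (b), discrimination (a), & guessing (c) shape the curve. This is a sketch of one standard model, not the only way an ICC is derived:

```python
import math

def icc_3pl(theta, a, b, c):
    """3-parameter logistic ICC.
    theta = ability level (e.g., -3 to +3), a = discrimination (slope),
    b = difficulty, c = guessing parameter (lower asymptote)."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# At theta == b, the probability is halfway btwn c and 1.0:
print(icc_3pl(theta=0.0, a=1.5, b=0.0, c=0.25))  # 0.625
# Larger a -> steeper slope around theta = b -> greater discrimination.
```

Note: in the 3PL model, b marks the point where the probability is (1 + c)/2; when c = 0 (no guessing), that is exactly the 50% point described above.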

8

Classical Test Theory

Theory of measurement that regards observed variability in test scores as reflecting 2 components:

  • True Score Variability (True differences btwn examinees on the attribute(s) measured by the test) &
  • Variability due to measurement (random) error (the effects of random factors).

Reliability is a measure of true score variability.

An examinee's obtained test score (X) is composed of 2 components:

  • (T) a true score component &
  • (E) an error component:

X = T + E

(Ex: X = the score obtained on the licensing exam is likely due to both (T) the knowledge obtained about test items & (E) the effects of random factors such as anxiety, the way items were written, attention, etc.)

9

Reliability

Refers to the consistency of test scores; i.e., the extent to which a test measures an attribute w/out being affected by random fluctuations (measurement error) that produce inconsistencies over time, across items, or over different forms.

A test is reliable:

  1. To the degree that it is free from error & provides info. about examinees' "true" test scores &
  2. To the degree it provides repeatable, consistent results.

Problems: Item & test parameters are sample-dependent.

Reliability is est. by evaluating consistency in scores over time or across different forms of the test, different test items, or different raters. This method is based on the assumptions that:

  • True score variability is consistent
  • Variability due to measurement error is inconsistent.

Methods for establishing reliability include:

  • Test-retest,
  • Alternate forms,
  • Split-half,
  • Coefficient alpha, and
  • Inter-rater.

10

Reliability Coefficient

Most methods produce a Reliability Coefficient: a correlation coefficient, calc. by correlating at least 2 scores obtained from each examinee in a sample, that ranges in value from:

  • 0.0 (all variability is due to measurement error; no reliability) to +1.0 (all variability is true score variability; perfect reliability).

The coefficient is interpreted directly as the proportion of variability in a set of test scores that is attributable to true score variability.

(Ex: a reliability of .80 indicates that 80% of the variability in test scores is true score variability; the remaining 20% [1 - .80] is due to measurement error.)

(Reliability coefficient = rₓₓ)

11

Test-Retest Reliability

A method for assessing reliability that involves administering the same test to the same group of examinees on 2 different occasions & correlating the 2 sets of scores (test-retest reliability).

Yields a Coefficient of Stability: Provides a measure of test score stability over time.

Approp. for tests designed to measure attributes/characteristics that are stable over time & not affected by repeated measurements, or that are affected in a random way (ex: aptitude).

12

Alternate Forms Reliability

Provides a measure of test score consistency over 2 forms of the test (aka parallel forms/equivalent forms).

  • Method for est. a test's reliability that entails administering 2 equivalent forms of the test to the same group of examinees & correlating the 2 sets of scores.

Forms can be administered at about the same time (coefficient of equivalence) or at different times (coefficient of equivalence & stability).

Considered by some experts to be the best (most thorough) method for assessing reliability.

Best for determining the reliability of tests:

  • designed to measure attributes that are stable over time &
  • not affected by repeated measurements, or affected only in an unsystematic way.

Not approp. for characteristics that fluctuate over time or when exposure to 1 form is likely to systematically affect perf. on the other.

13

Internal Consistency Reliability

Degree to which items included in the test are measuring the same characteristic & indicates the degree of consistency across different test items.

  • Approp. for tests that measure a single content or behavior domain (Ex: subtest but not entire test)
  • Useful for est. the reliability of tests that measure characteristics that fluctuate over time or are susceptible to memory or practice effects.

Includes Split-Half Reliability & Cronbach's Coefficient Alpha/KR-20.

14

Split-Half Reliability

A method for assessing internal consistency reliability & involves:

  • "splitting" the test in half (e.g., odd- versus even-numbered items) & correlating examinees' scores on the 2 halves of the test.

Since the size of a reliability coefficient is affected by test length, the split-half method tends to underestimate a test's true reliability.

The split-half reliability coefficient is usually corrected with the Spearman-Brown formula.

  • The Spearman-Brown formula can also be used more generally to est. the effects of shortening or lengthening a test on its reliability coefficient.

(Split-half Reliability = Spearman-Brown Formula)

Not approp. for speeded tests in which score depends on speed of responding.

Shorter tests are less reliable than longer ones.

15

Spearman-Brown Formula

The Spearman-Brown formula estimates what the test's reliability would be if it were based on the full length of the test, providing an estimate of the test's true reliability.

(Split-half Reliability = Spearman-Brown Formula)
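The notes don't spell the formula out; the standard Spearman-Brown prophecy formula is rₙ = n·r / (1 + (n − 1)·r), where n is the factor by which the test is lengthened (n = 2 corrects a split-half coefficient to full length). A minimal sketch:

```python
def spearman_brown(r, length_factor):
    """Estimated reliability if the test were lengthened by `length_factor`.
    Use length_factor=2 to correct a split-half coefficient to full length."""
    return (length_factor * r) / (1 + (length_factor - 1) * r)

# Correcting a split-half reliability of .60 to the full-length estimate:
print(round(spearman_brown(0.60, 2), 4))    # 0.75
# Halving a test (length_factor = 0.5) lowers the estimated reliability:
print(round(spearman_brown(0.75, 0.5), 4))  # 0.6
```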

16

Coefficient Alpha

Method for assessing internal consistency reliability that provides an index of average inter-item consistency rather than the consistency between 2 halves of the test.

KR-20 can be used as a substitute for coefficient alpha when test items are scored dichotomously.

(KR-20 = Coefficient Alpha)

17

Kuder-Richardson Formula 20 (KR-20)

Kuder-Richardson Formula 20 (KR-20) can be used as a substitute for coefficient alpha when test items are scored dichotomously (scored as right or wrong; Ex: T/F & multiple choice questions).

(KR-20 = Coefficient Alpha)
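A minimal sketch of the KR-20 computation for dichotomous items (this version uses the population variance of total scores; a sample-variance convention also exists):

```python
def kr20(item_scores):
    """KR-20 = (k / (k - 1)) * (1 - sum(p*q) / variance of total scores),
    where p is each item's pass rate and q = 1 - p.
    item_scores: one list of 0/1 item scores per examinee."""
    n = len(item_scores)      # number of examinees
    k = len(item_scores[0])   # number of items
    totals = [sum(person) for person in item_scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    sum_pq = 0.0
    for i in range(k):
        p = sum(person[i] for person in item_scores) / n
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Hypothetical data: 5 examinees x 4 dichotomous items.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
print(round(kr20(scores), 4))  # 0.8
```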

18

Inter-Rater Reliability

Important for tests that are subjectively scored (based on judgement), such as essay & projective tests.

Used to be sure an examinee will obtain the same score no matter who is doing the scoring.

The scores assigned by different raters can be used to calculate a correlation (reliability) coefficient or to determine the percent agreement between raters; however, percent agreement can be artificially inflated by the effects of chance agreement.

Alt., a special correlation coefficient can be used, such as:

  • Cohen's Kappa Statistic: Used to measure agreement btwn 2 raters when scores represent a nominal scale. 
  • Kendall's Coefficient of Concordance: Used to measure agreement btwn 3 or more raters when scores are reported as ranks.

Reliability coefficients over .80 are generally considered acceptable.

(Kappa Statistic = Inter-rater reliability)
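The kappa correction for chance agreement can be sketched as (standard formula: kappa = (pₒ − pₑ)/(1 − pₑ); the rating data below are hypothetical):

```python
def cohens_kappa(ratings_a, ratings_b):
    """Kappa corrects percent agreement for chance agreement:
    kappa = (p_observed - p_chance) / (1 - p_chance)."""
    n = len(ratings_a)
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    categories = set(ratings_a) | set(ratings_b)
    p_chance = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n) for c in categories
    )
    return (p_observed - p_chance) / (1 - p_chance)

# 2 raters classifying 50 cases as "y"/"n": 35/50 raw agreement,
# but half of that agreement is expected by chance alone.
rater_a = ["y"] * 25 + ["n"] * 25
rater_b = ["y"] * 20 + ["n"] * 5 + ["y"] * 10 + ["n"] * 15
print(round(cohens_kappa(rater_a, rater_b), 4))  # 0.4
```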

19

Cohen's Kappa Statistic & Kendall's Coefficient of Concordance

Cohen's Kappa Statistic: A correlation coefficient used to assess agreement btwn 2 raters (inter-rater reliability) when scores represent a nominal scale.

(Kappa Statistic = Inter-rater reliability)


Kendall's Coefficient of Concordance: Used to measure agreement btwn 3 or more raters when scores are reported as ranks.
 

20

Factors that Affect the size of the Reliability Coefficient

4 factors affect the size of a test's reliability coefficient:

  1. Test Length
  2. Range of Scores/Heterogeneity
  3. Test Content
  4. Guessing

21

Test Length

The larger the sample of the attribute being measured by a test, the smaller the relative effects of measurement error & the more likely the test will provide dependable, consistent information.

Longer tests are generally more reliable than shorter tests.

A way to increase test length is by adding items of similar content & quality.

The Spearman-Brown formula can be used to est. the effect of lengthening a test on its reliability, yet it tends to overestimate a test's true reliability.

22

Range of Scores

A test's reliability coefficient increases with the heterogeneity of the sample in terms of the attributes measured by the test, bc heterogeneity increases the range of scores. (Items w/p = .50 maximize score variability.)

Reliability is maximized when the range of scores is unrestricted.

23

Test Content

The more homogeneous a test is w/regard to content, the higher its reliability coefficient.

Easiest to understand if you consider internal consistency; the more consistent the test items are in terms of content, the larger the coefficient alpha or split-half reliability coefficient.

24

Guessing

A test's reliability coefficient is also affected by the probability that examinees can guess the correct answer to test items.

As the probability of correctly guessing increases, the reliability coefficient decreases.

The more difficult it is to pick the right answer by guessing, the larger the reliability coefficient.

25

Confidence Interval

Helps a test user estimate the range w/in which an examinee's true score is likely to fall given his/her obtained score.

Bc tests are not totally reliable, an examinee's obtained score may or may not be his/her true score.

It is always best to interpret an examinee's obtained score by constructing a confidence interval around that score.

  • A confidence interval indicates the range w/in which an examinee's true score is likely to fall given the obtained score.
  • It is derived using the Standard Error of Measurement (SEM):
    • 68% = +/- 1 SEM from obtained score
    • 95% = +/- 2 SEM from obtained score
    • 99% = +/- 3 SEM from obtained score

Standard Error of Measurement (SEM) = Used to construct a confidence interval around a measured or obtained score.

26

Standard Error of Measurement

It is used to construct a confidence interval around an examinee's obtained (measured) score.

The SEM is calculated by multiplying the standard deviation of the test scores by the square root of 1 minus the reliability coefficient.

This is an index of the amount of error that can be expected in obtained scores due to the unreliability of the test.

SEM = SDx√(1 - rₓₓ)

Ex: A psychologist administers an interpersonal assertiveness test to a sales applicant who receives a score of 80. Since the test's reliability is less than 1.0, the psych. knows that this score might be an imprecise est. of the applicant's true score & decides to use the standard error of measurement to construct a 95% confidence interval. Assuming that the test's reliability coefficient is .84 and its standard deviation is 10, the standard error of measurement is equal to 4.0.

The psych. constructs a 95% confidence interval by adding and subtracting 2 standard errors from the applicant's obtained score: 80 ±2(4.0) = 72 to 88. There is a 95% chance that the applicant's true score falls between 72 and 88.

SEM = SDx√(1 - rₓₓ)

= 10√(1 - .84)

= 10√.16

= 10(.4) = 4.0
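The worked example above, sketched as code (function names are illustrative):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SEM = SD * sqrt(1 - rxx)."""
    return sd * math.sqrt(1 - reliability)

def true_score_interval(obtained, sd, reliability, n_sems=2):
    """Confidence interval around an obtained score:
    +/- 1, 2, or 3 SEMs ~ 68%, 95%, 99%."""
    margin = n_sems * sem(sd, reliability)
    return obtained - margin, obtained + margin

print(round(sem(10, 0.84), 4))  # 4.0
low, high = true_score_interval(80, 10, 0.84, n_sems=2)
print(round(low), round(high))  # 72 88
```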

27

Validity

Refers to a test's accuracy; the extent to which the test measures what it is intended to measure.

3 Different Types of Validity include:

  1. Content Validity (content or behavior)
  2. Construct Validity (hypothetical trait or construct)
  3. Criterion-Related Validity (status/perf. on external criterion)

28

Content Validity

Important for tests designed to measure a specific content or behavior domain; items should be an accurate & representative sample of the content domain they represent.

Content validity is not the same as face validity.

Most important for achievement & job sample tests.

Determined primarily by "expert judgment"

This is of concern when a test has been designed to measure 1 or more content/behavior domains.

A test has content validity when its items are a representative sample of the domain(s) that the test is intended to measure.

Usually built into a test as it is being constructed & involves:

  • Clearly defining the content/behavior domain,
  • Dividing it into categories & sub-categories, &
  • Then writing or selecting items that represent each sub-category (a representative sample of items).

After test devel., content validity is checked by having subject matter experts evaluate the test in a systematic way to determine if test items are an adequate & representative sample of the content or behavior domain.

Content validity is of interest when scores on the test (X) are important bc they provide info. on how much each examinee knows about a content domain.

29

Construct Validity

(Broadest category of validity) Important when a test will be used to measure a hypothetical trait or construct; the test has construct validity when it has been shown that it actually measures the hypothetical trait it's intended to measure.

A method of assessing construct validity is to correlate test scores w/scores on other measures that do & do not measure the same trait. To determine if the test has both:

  • Convergent Validity: The correlation btwn the test we're validating & a measure of the same trait using a different method provides info. about the test's convergent validity. (Monotrait-Heteromethod; should be large)
  • Discriminant (divergent) Validity: The correlation btwn the test we're validating & measures of unrelated traits provides info. about the test's divergent validity. (Heterotrait-Monomethod; should be small)
  • A Multitrait-Multimethod Matrix is used to eval. construct validity: a table of correlation coefficients that provides info. about the test's convergent & divergent (discriminant) validity.

Convergent & divergent combo provide evidence that the test is actually measuring the construct it was designed to measure.

Other methods include:

  • Conducting a factor analysis to assess the test's factorial validity; provides info. about convergent & divergent validity but is a more complex technique.
  • Determining if changes in test scores reflect expected developmental changes; &
  • Seeing if experimental manipulations have the expected impact on test scores.

Examples of such constructs include achievement motivation, intelligence, & mechanical aptitude.

When scores on the test (X) are important bc they provide info. on how much each examinee knows about a content domain or on each examinee's status w/regard to the trait being measured, content or construct validity is of interest.

30

Convergent & Discriminant Validity

  • Convergent Validity: The correlation btwn the test we're validating & a measure of the same trait using a different method provides info. about the test's convergent validity.
    • When a test has high correlations w/measures that assess the same construct.
    • (Monotrait-Heteromethod; large)
    • High correlations = convergent
  • Discriminant (divergent) Validity: The correlation btwn the test we're validating & measures of unrelated traits provides info. about the test's divergent validity.
    • When a test has low correlations w/measures of unrelated characteristics.
    • (Heterotrait-Monomethod; small)
    • Low correlations = discriminant (divergent)

Convergent & divergent combo provide evidence that the test is actually measuring the construct it was designed to measure.


The Multitrait-Multimethod Matrix is used to eval. these: a table of correlation coefficients that provides info. about the test's convergent & divergent (discriminant) validity.
 


31

Multitrait-Multimethod Matrix

A systematic way to org. the data collected when assessing a test's convergent & discriminant validity.

The matrix is a table of correlation coefficients that provide info. about the test's convergent & divergent (discriminant) validity. 

Requires measuring at least 2 different traits using at least 2 different methods for each trait. The matrix contains 4 types of correlation coefficients:

  1. Monotrait-Monomethod Coefficients: Measures the same-trait-same method.
  2. Monotrait-Heteromethod Coefficients: Measures same trait-different methods. (Convergent)
  3. Heterotrait-Monomethod Coefficients: Measures different traits-same method. (Divergent)
  4. Heterotrait-Heteromethod Coefficients:  Measures different traits-different methods.

32

Monotrait-Monomethod Coefficients (same-trait-same method)

Reliability coefficients: Indicate the correlation between the measure & itself.

Although these coefficients are not directly relevant to a test's convergent & discriminant validity, they should be large for the matrix to provide useful info.

33

Monotrait-Heteromethod Coefficients (same trait-different methods)

These coefficients (coefficients in rectangles) indicate the correlation between different measures of the same trait.

It indicates that a test has convergent validity when the monotrait-heteromethod coefficients are large.

34

Heterotrait-Monomethod Coefficients (different traits-same method)

These coefficients (coefficients in ellipses) show the correlation between different traits that have been measured by the same method.

It indicates discriminant validity when the heterotrait-monomethod coefficients are small.

35

Heterotrait-Heteromethod Coefficients (different traits-different methods)

These coefficients (underlined coefficients) indicate the correlation between different traits that have been measured by different methods.

It indicates discriminant validity when the heterotrait-heteromethod coefficients are small.

36

Factor Analysis

A multivariate statistical technique used to ID the factors (constructs/dimensions) that underlie the intercorrelations among a set of tests, subtests, or test items.

One use of the obtained data is to determine if a test has construct validity by indicating the extent to which the test correlates with factors it would & would not be expected to correlate with.

A test shows construct validity when it has high correlations w/factors expected to correlate with & low correlations w/factors not expected to correlate with.

True score variability consists of:

  • Communality & Specificity.

Factors ID in a factor analysis can be either:

  • Orthogonal
  • Oblique.

37

5 Basic Steps for a Factor Analysis:

Includes grouping a large number of test items into subtests & subscales, testing hypotheses about how test item scales & subscales relate to one another, & testing construct validity.

  1. Administer Tests to a Sample of Examinees: Admin. the test to be validated, along w/several other tests measuring the same & diff. traits, to a group of examinees.
  2. Derive & Interpret the Correlation Matrix: Correlate scores on each test w/scores on every other test to obtain a correlation (R) matrix, which indicates the correlations of all the pairs included in the analysis. ID clusters of tests that are highly correlated w/1 another; the number of clusters determines how many factors should be extracted in the factor analysis.
  3. Extract the Initial Factor Matrix: Using 1 of several available factor analytic techniques, convert the correlation matrix to a factor matrix, which is difficult to interpret. (The data in the correlation matrix are used to derive a factor matrix that contains correlation coefficients indicating the degree of association btwn each test & factor.)
  4. Rotate the Factor Matrix: To obtain the final product & simplify the interpretation of the factors, "rotate" them.
  5. Name the Factors: Interpret & name the factors in the rotated factor matrix; when used to assess construct validity, the test should have high correlations w/factors it's expected to correlate with & low correlations w/factors it's not expected to correlate with.

38

Factor Loadings

In factor analysis:

Factor Loading is the correlation btwn a test (or other variable included in the analysis) & a factor.

To interpret a factor loading, square it to determine the amount of variability in the test scores that is explained by the factor.

A squared factor loading provides a measure of "shared variability."

Ex: If a test has a correlation of .50 with Factor I, then .50² = .25, which means that 25% of the variability in test scores is explained by Factor I.

39

Communality

The total amount of variability in test scores on the test (or other variable) that is due to the factors that the test shares in common w/other tests in the analysis (identified factors).

Communality is a lower-limit estimate of a test's reliability coefficient.

40

Specificity

2nd component of true score variability in a factor analysis: variability that is due to factors that are specific or unique to the test & not measured by any other test included in the analysis.

The specificity identifies the portion of true score variability that has not been explained by the factor analysis.

41

Rotation (Matrix)

Involves redividing the communality of each test included in the analysis. A researcher decides which rotation is approp. based on his/her theory about the characteristics measured by the tests included in the analysis.

Due to the re-division, each factor accounts for a different proportion of a test's variability than prior to the rotation.

This process makes it easier to interpret factor loadings.

There are 2 types of rotations:

  1. Orthogonal (Uncorrelated)
  2. Oblique (Correlated)

42

Orthogonal

When the rotation is orthogonal, the ID factors are uncorrelated (measure unrelated attributes), & a test's communality can be calculated by summing the squared factor loadings.

Communality can only be calc. this way when the factors are orthogonal.

Ex: If a test has a correlation of .50 w/Factor 1 & a correlation of .20 w/Factor 2 & the factors are uncorrelated, the test's communality is

.29 (F1 = .50² = .25; F2 = .20² = .04; .25 + .04 = .29).

This means that 29% of the variability in test scores is explained by the ID factors, while the remaining variability is due to some combo of specificity & measurement error.
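The sum-of-squared-loadings calculation above, as a sketch:

```python
def communality(factor_loadings):
    """Sum of squared loadings; valid only when factors are orthogonal
    (uncorrelated)."""
    return sum(loading ** 2 for loading in factor_loadings)

print(round(communality([0.50, 0.20]), 4))  # 0.29
# Remaining variability (1 - .29 = .71) reflects specificity + measurement error.
```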

43

Oblique

When the rotation is oblique the ID factors are correlated, & the attributes measured by the factors are not independent.

44

Criterion-Related Validity

The type of validity that involves determining the relationship (correlation) btwn the predictor & the criterion. Important when test scores will be used to predict/est. scores on a criterion.

The correlation coefficient is referred to as the criterion-related validity coefficient.

This type of validity can be either:

  • Concurrent: Involves obtaining scores on the predictor & criterion at about the same time; current.
  • Predictive: Involves obtaining predictor scores before criterion scores; future.

This is of interest when a test has been designed to estimate or predict an examinee's standing or performance on another measure (external criterion).

When the test (X) score will be used to predict scores on some other measure (Y) & it is the scores on Y that are of the most interest, then this type of validity is of interest.

45

Criterion-Related Validity Coefficient

The validity coefficient represents the correlation btwn 2 different measures (the predictor & the criterion).

When the correlation btwn the 2 diff. measures is squared, it provides a measure of shared variability.

(Tip: On the exam, if a Q gives the correlation coefficient for X (predictor) & Y (criterion) & asks how much variability in Y is explained by X, square the correlation coefficient to get the correct answer.)

46

Concurrent Validity

When predictor & criterion scores are obtained at the same time.

Used to estimate current status/performance on the criterion. (Criterion-Related Validity)

47

Predictive Validity

When predictor scores are obtained before criterion scores.

Preferred type to predict future performance on the criterion. (Criterion-Related Validity)

48

Standard Error of Estimate

Used to construct a confidence interval around an estimated or predicted score. An index of error when predicting criterion scores from predictor scores.

Its magnitude depends on 2 factors:

  1. The criterion's SD
  2. The predictor's validity coefficient

Used to construct a confidence interval around an examinee's predicted criterion score:

  • 68% confidence interval is constructed by +/- 1 SEest from the predicted criterion score.
  • 95% interval by +/- 2 SEest from the predicted criterion score.
  • 99% interval by +/- 3 SEest from the predicted criterion score.

SEest = SDy√(1 - rxy²)

Ex: Given a criterion SD of 10 & a validity coefficient of .60, calc. the standard error of est.:

      SEest = 10√(1 - .60²)

               = 10√(1 - .36)

               = 10√.64

               = 10(.8)

               = 8
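The same derivation, sketched as code:

```python
import math

def se_estimate(sd_criterion, validity):
    """Standard error of estimate: SEest = SDy * sqrt(1 - rxy**2)."""
    return sd_criterion * math.sqrt(1 - validity ** 2)

print(round(se_estimate(10, 0.60), 4))  # 8.0
# 95% interval around a predicted criterion score of 50: 50 +/- 2(8) = 34 to 66.
```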

49

Incremental Validity

The extent to which a predictor increases decision-making accuracy when the predictor is used to make selection decisions.

Calculated by subtracting the base rate from the positive hit rate.

Evaluated by comparing the number of correct decisions made with & w/out the new predictor.

This has been linked to predictor & criterion cutoff scores;

  • true & false positives;
  • true & false negatives.

Scatterplots are used to assess a predictor's incremental validity by dividing the plot into 4 quadrants; the predictor cutoff determines whether someone is + or -, & the criterion cutoff determines true or false.
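The quadrant logic can be sketched as follows (cutoff values and names are illustrative):

```python
def classify_decision(predictor, criterion, predictor_cutoff, criterion_cutoff):
    """Quadrant label for one examinee: the predictor cutoff determines
    positive/negative; the criterion cutoff determines true/false."""
    positive = predictor >= predictor_cutoff     # predicted to succeed
    successful = criterion >= criterion_cutoff   # actually succeeded
    if positive:
        return "true positive" if successful else "false positive"
    return "false negative" if successful else "true negative"

print(classify_decision(85, 90, predictor_cutoff=70, criterion_cutoff=75))  # true positive
print(classify_decision(60, 90, predictor_cutoff=70, criterion_cutoff=75))  # false negative
```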

50

True Positive

Scored high on predictor & criterion;

ppl predicted to be successful & are.

On scatterplot usually right upper quadrant. (Incremental Validity)

51

False Positive

Scored high on predictor & low on criterion;

ppl predicted to be successful but are not.

To reduce the number of false positives, the predictor cutoff can be raised and/or the criterion cutoff can be lowered.

On scatterplot usually right bottom quadrant. (Incremental Validity)

52

True Negative

Scored low on predictor & criterion;

ppl predicted to be unsuccessful & are.

On scatterplot usually left bottom quadrant. (Incremental Validity)

53

False Negative

Scored low on predictor & high on criterion;

ppl predicted to be unsuccessful but are successful.

On scatterplot usually left upper quadrant. (Incremental Validity)

54

Relationship Between Reliability & Validity

Reliability is necessary but not sufficient condition for validity.

In terms of criterion-related validity, the validity coefficient can be no greater than the square root of the product of the reliabilities of the predictor & criterion.

The formula indicates that reliability places an upper limit on validity:

rxy ≤ √(rₓₓ · rᵧᵧ)

 

A valid test must be reliable but

A reliable test may or may not be valid
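The upper-bound relationship described above (validity can be no greater than the square root of the product of the predictor's & criterion's reliabilities) as a sketch:

```python
import math

def max_validity(rxx, ryy=1.0):
    """Upper bound on the criterion-related validity coefficient:
    rxy <= sqrt(rxx * ryy). With a perfectly reliable criterion
    (ryy = 1.0), this reduces to rxy <= sqrt(rxx)."""
    return math.sqrt(rxx * ryy)

print(round(max_validity(0.81), 4))        # 0.9 (perfectly reliable criterion)
print(round(max_validity(0.81, 0.64), 4))  # 0.72
```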

55

Cross-Validation

Process of re-assessing a test's criterion-related validity on a new sample to check the generalizability of the original validity coefficient.

56

Shrinkage

Bc the predictor is often "tailor-made" for the orig. validation sample, the cross-validation coefficient tends to "shrink" (becomes smaller) bc the chance factors operating in the original sample are not all present in the cross-validation sample.

As a result the validity coefficient for the cross-validation sample is usually smaller.

Shrinkage refers to a reduction in the magnitude of a measure's validity coefficient when the predictor is validated w/a new sample.

57

Test Score Interpretation

An examinee's raw score is often difficult to interpret unless it's anchored to the performance of other examinees or a predefined standard of performance.

Types include:

  • Norm-Referenced (e.g., percentile ranks & standard scores)
  • Criterion-Referenced (e.g., percentage scores)

58

Norm-Referenced Interpretation

Involves comparing an examinee's test scores to scores obtained in a standardization sample or other comparison group.

This type of interpretation may entail converting an examinee's raw score to a percentile rank and/or standard score (e.g., z-scores & T scores).

The examinee's raw score is converted to a score that indicates his/her relative standing in the comparison group.

Percentile Ranks & Standard Scores

59

Percentile Ranks (PR)

Range from 1 to 99 & express an examinee's score in terms of the percentage of examinees in the sample who achieved lower scores.

Advantage = easy to interpret

Limitation = Indicate an examinee's relative position in a distribution but do not provide info. about absolute differences btwn examinees in terms of their raw scores.

The PR distribution is always flat (rectangular) regardless of the shape of the raw score distribution; the PR transformation is nonlinear, so the PR distribution can look different from the raw score distribution.

(Norm-Referenced Interpretation)
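The definition above (percentage of examinees with lower scores) can be sketched as follows. The function name and the sample of scores are hypothetical:

```python
def percentile_rank(score, scores):
    """Percent of scores in the sample strictly below the given score."""
    below = sum(1 for s in scores if s < score)
    return 100 * below / len(scores)

# Hypothetical sample of 10 raw scores
sample = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
print(percentile_rank(80, sample))  # 50.0 -- half the sample scored lower
```

Note that the raw score gaps (5 points each here) disappear in the PR scale, illustrating why PRs carry no info. about absolute differences.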

60

Standard Scores

Anchor an examinee's test score to those of the norm group by reporting the examinee's score in terms of standard deviation units from the mean.

Most common is a z-score & the distribution has a mean of 0 and a SD of 1.0.

  • Calculated by subtracting the mean of the distribution from the raw score to obtain a deviation score & then dividing the deviation score by the distribution's SD. The resulting score indicates how far the raw score is from the mean of the distribution.
    • z = (X − M)/SD

Ex: Z-score of -1.0 indicates that an examinee's raw score is 1 SD below the mean.

Ex: If an examinee obtains a score of 110 on a test that has a mean of 100 & SD of 10, the z-score is +1.0 (the raw score is 1 SD above the mean)

z = (110 − 100)/10 = 10/10 = +1.0

T-Score Distribution has a mean of 50 & SD of 10

  • Examinee whose raw score is 1 SD above the mean will have a T-score of 60

Deviation IQ scores have a mean of 100 & SD of 15

  • Examinee whose raw score is 1 SD above the mean will have a score of 115

(Norm-Referenced Interpretation)
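The three conversions above (z, T, deviation IQ) can be sketched together. The function names are hypothetical; the means & SDs are the standard ones from the card:

```python
def z_score(raw, mean, sd):
    """z distribution: mean 0, SD 1."""
    return (raw - mean) / sd

def t_score(z):
    """T distribution: mean 50, SD 10."""
    return 50 + 10 * z

def deviation_iq(z):
    """Deviation IQ distribution: mean 100, SD 15."""
    return 100 + 15 * z

# Raw score of 110 on a test with mean 100 & SD 10 (1 SD above the mean)
z = z_score(110, 100, 10)
print(z, t_score(z), deviation_iq(z))  # 1.0 60.0 115.0
```

All three scales describe the same relative standing; only the mean & SD of the reporting scale change.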

61

Criterion-Referenced Interpretation

Interpretation of a test score in terms of a pre-specified standard:

  • Percentage score (% correct) - Indicates the proportion of the test content (e.g., % of test items) that examinees answered correctly.
  • Regression Equation - Predicted perf. on an external criterion.
  • Expectancy Table - Makes it possible to use an examinee's predictor (test) score to estimate the probability that they will attain different scores on a criterion.
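An expectancy table can be sketched as conditional probabilities estimated from a sample of paired scores. The data and function name below are hypothetical:

```python
# Hypothetical paired observations: (predictor band, criterion outcome)
pairs = [("high", "pass"), ("high", "pass"), ("high", "fail"),
         ("low", "pass"), ("low", "fail"), ("low", "fail")]

def expectancy(pairs, predictor_band, criterion_outcome):
    """Estimate P(criterion outcome | predictor band) from the sample."""
    outcomes = [c for p, c in pairs if p == predictor_band]
    return outcomes.count(criterion_outcome) / len(outcomes)

# Probability of passing the criterion given a high predictor score
print(expectancy(pairs, "high", "pass"))  # 2 of 3 = 0.666...
```

A full expectancy table would tabulate these probabilities for every predictor band and criterion outcome.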

62

Criterion Contamination

Refers to bias introduced into a person's criterion score when the person scoring the criterion knows how he/she performed on the predictor.

Tends to artificially inflate the relationship between the predictor and criterion.

63

Regression Model

Cleary's regression model (aka the regression model of test bias): if a test has the same regression line for members of both grps, the test is not biased, even if the grps have different mean scores.

64

Correction for Guessing

Scores on objective tests are sometimes corrected for guessing in order to ensure that examinees don't benefit from guessing wildly.
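The card does not give the formula, but the classic correction-for-guessing formula is R − W/(k − 1), where R is the number right, W the number wrong, & k the number of response options. A minimal sketch with hypothetical values:

```python
def corrected_score(num_right, num_wrong, num_options):
    """Classic correction for guessing: R - W / (k - 1).

    Subtracts the number of right answers expected from wild guessing,
    given the number of wrong answers observed.
    """
    return num_right - num_wrong / (num_options - 1)

# Hypothetical: 40 right, 12 wrong on a 4-option multiple-choice test
print(corrected_score(40, 12, 4))  # 36.0
```

With 4 options, each wrong answer implies roughly 1/3 of a lucky guess elsewhere, so 12 wrong answers remove 4 points.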

65

What is the best way to control consensual observer drift?

Alternate raters. Consensual observer drift occurs when observers' ratings become increasingly less accurate over time in a systematic way; specifically, when raters who are working together influence each other's ratings so that they assign ratings in increasingly similar (& idiosyncratic) ways. Alternating raters prevents this mutual influence.

66

In a normal distribution of scores, what percentage of ppl receive T scores btwn 30 & 70? A. 50 B. 68 C. 95 D. 99

C. 95. To identify the correct answer to this question, you need to be familiar with the areas under the normal curve & know that T scores have a mean of 50 & SD of 10. T scores of 30 & 70 are 2 SDs below & above the mean &, in a normal distribution, about 95% of scores fall btwn the scores that are +/- 2 SDs from the mean.

67

Banding

Method of score adjustment used to take grp differences into account when assigning or interpreting scores. Ex: When a band is defined as 91-100 pts, examinees who receive scores of 91, 95, & 99 are treated the same.

68

Consensual Observer Drift

Tends to artificially inflate the size of the inter-rater reliability coefficient.

Occurs when observer ratings become increasingly less accurate over time in a systematic way.

This happens when raters who are working together influence each other's ratings so that they assign ratings in increasingly similar & idiosyncratic ways.

69

Face Validity

Refers to whether or not test items "look like" they're measuring what the test is designed to measure.

Not an actual type of validity, but it is desirable in some situations.

If a test lacks face validity, ppl may not be motivated to respond to items in an honest or accurate way.

70

Kurtosis

The degree of peakedness or flatness of a probability distribution, relative to the normal distribution with the same variance. 2 Types:

  1. Leptokurtic: Distribution of scores more peaked than a normal distribution.
  2. Platykurtic: Distribution of scores flatter than a normal distribution.

71

Principal Component Analysis & Eigenvalue

A characteristic of this type of analysis is that the components (factors) are extracted so that the 1st component reflects the greatest amount of variability, the 2nd component the 2nd greatest, etc.

  • Involves identifying the components (factors) that underlie/explain the variability observed in a set of tests or other variables. (Similar to factor analysis)

Eigenvalue: (for each principal component) calculated by squaring the correlation btwn each test & the component & then summing the results.

The resulting number indicates the total amount of variability in the tests that is explained by the principal component.
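The eigenvalue calculation above (square each test-component correlation, then sum) can be sketched directly. The function name and the four loadings are hypothetical:

```python
def eigenvalue(loadings):
    """Sum of squared test-component correlations (loadings)."""
    return sum(r ** 2 for r in loadings)

# Hypothetical correlations between 4 tests & the 1st principal component
print(eigenvalue([0.8, 0.7, 0.6, 0.5]))  # 0.64 + 0.49 + 0.36 + 0.25 = 1.74
```

Dividing 1.74 by the number of tests (4) would give the proportion of total variability explained by this component, about 44%.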