Test Construction Flashcards Preview

Flashcards in Test Construction Deck (126)

Item Characteristic Curve

A graphical representation of a test item's

  1. difficulty
  2. discrimination
  3. chance of false positive.

Difficulty (degree of attribute needed to pass item):

  • indicated by position of curve on the X axis.

Discrimination (ability to differentiate between high and low scorers):

  • indicated by slope of the curve.

Chance of false positives (probability of getting answer correct by guessing):

  • indicated by the Y-intercept of the curve
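
As a rough illustration of how the three properties above shape the curve, the sketch below uses the three-parameter logistic (3PL) model. The card does not name a specific model, so the 3PL form and the parameter names `a` (discrimination, slope), `b` (difficulty, position on the X axis), and `c` (guessing, the lower end of the curve) are assumptions, and the item values are made up.

```python
import math

def icc_3pl(theta, a, b, c):
    """Probability of a correct response at trait level theta (assumed 3PL model)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item: moderate difficulty, four-option multiple choice (c = 0.25).
item = {"a": 1.2, "b": 0.5, "c": 0.25}

for theta in (-3.0, -1.0, 0.5, 2.0, 4.0):
    print(f"theta={theta:+.1f}  P(correct)={icc_3pl(theta, **item):.3f}")
```

Raising `b` shifts the curve to the right (a harder item), raising `a` steepens it (a more discriminating item), and raising `c` lifts the low end of the curve (more correct answers by guessing).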


Criterion-Related Validity Coefficient

A value that indicates the strength of the correlation between test scores & performance on a chosen criterion.


Test Characteristic Curve

A graphical representation of the expected number of test items a participant answers correctly, plotted against the participant's level of the construct measured by the test.


Item difficulty

AKA item difficulty index or 'p'.

Percent (%) of examinees that answer the item correctly

(how much of the attribute an individual must possess to pass the item).


What are the item difficulty (p) ranges?

0 to 1. A value of 0 means that no one passed the item (too hard) and 1 means that everyone passed (too easy). Average item difficulty should be 0.5.
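
A minimal sketch of the calculation, assuming each response is coded 1 for correct and 0 for incorrect; the function name and the response data are hypothetical.

```python
def item_difficulty(responses):
    """Proportion of examinees who answered the item correctly (1 = correct, 0 = incorrect)."""
    if not responses:
        raise ValueError("responses must not be empty")
    return sum(responses) / len(responses)

# Made-up responses from ten examinees to a single item.
responses = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
print(item_difficulty(responses))
```

Here seven of the ten examinees pass the item, so p is 0.7, a somewhat easy item.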


With item difficulty, what are the floor and ceiling effects?

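Floor effects: a test's inability to distinguish among people at the low end of a distribution (items are too hard, so low scorers bunch together at the bottom).

Ceiling effects: a test's inability to distinguish among people at the high end of a distribution (items are too easy, so high scorers bunch together at the top).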


What is item discrimination?

The ability of the item to unambiguously separate out those who fail from those who pass.

On an item characteristic curve, discrimination is represented by the slope of the curve.

Steeper slopes indicate more discrimination.


How is item discrimination assessed?

Index D (item discrimination index): the proportion of high-scorers who answered the item correctly minus the proportion of low-scorers who answered the item correctly.


What are the D ranges?

-1 to +1; it is desirable to have positive values of D, which indicate that more high-scoring examinees (rather than low-scoring examinees) answered the item correctly.
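
The index can be sketched as follows. The card does not say how the high- and low-scoring groups are formed; taking the top and bottom 27% of total scores is a common convention and is an assumption here, as are the function name and the data.

```python
def discrimination_index(records, fraction=0.27):
    """D = p(upper group correct) - p(lower group correct).

    records: list of (total_score, item_correct) pairs, item_correct being 0 or 1.
    fraction: share of examinees in each extreme group (assumed convention).
    """
    ranked = sorted(records, key=lambda r: r[0])
    n = max(1, round(len(ranked) * fraction))
    lower, upper = ranked[:n], ranked[-n:]
    p_upper = sum(item for _, item in upper) / n
    p_lower = sum(item for _, item in lower) / n
    return p_upper - p_lower

# Made-up data: total test scores 1..10 paired with correctness on one item.
records = list(zip(range(1, 11), [0, 0, 1, 0, 1, 0, 1, 1, 1, 1]))
print(round(discrimination_index(records), 3))
```

With ten examinees each extreme group holds three people; all three high scorers and one of the three low scorers pass the item, so D is 1 - 1/3, about 0.667.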


Ratio measure

A level of measurement describing a variable with attributes that have all the qualities of nominal, ordinal, and interval measures as well as a true zero point; measurement of physical objects is an example of ratio measure.


Interval measure

A level of measurement describing a variable whose attributes are rank-ordered and have equal distances between adjacent attributes, with no true zero point; the Fahrenheit temperature scale is an example of this, because the distance between 17 and 18 is the same as the distance between 89 and 90.


Nominal scale

A variable whose attributes are simply labels for groups and have no ranked relationship; gender would be an example of a nominal scale of measurement because male does not imply more gender than female.


Item Response Theory

IRT focuses on determining specific parameters of test items.

Makes use of characteristic curves, which provide info about

  • item difficulty
  • item discrimination
  • probability of false positives.


Assumptions of IRT

  1. Single underlying trait
  2. Requires large sample size
  3. Relationship between trait and item response can be displayed in an item characteristic curve


Computer Adaptive Assessment

Uses IRT; customizes test to the examinee's ability level.


Classical Test Theory

CTT; AKA Classical Measurement Theory.

Observed score = True score + Measurement error

CTT is an approach to testing that assumes that individual items are as good a measure of a latent trait as other items; thus, CTT focuses on the reliability of a set of items. In CTT, item and test parameters are sample-dependent.


The main purpose of Classical Test Theory within psychometric testing is to recognise and develop the reliability of psychological tests and assessment; reliability is evaluated through the performance of the individual taking the test and the difficulty level of the questions or tasks in the test. It is calculated from the individual's score on the test (observed score) and the amount of error in the test itself (error); together these give an indication of what the person's true score would have been without the errors in the test measurements. Errors can arise from mistakes within the process of testing, as well as from everyday factors such as the examinee being tired or hungry; if a standard error can be found, it becomes easier to factor this out of the equation.
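
To make the relationship between observed score, true score, and error concrete, the sketch below uses two standard CTT results that the card does not state explicitly: when true scores and errors are uncorrelated, reliability is the share of observed-score variance due to true scores, and the standard error of measurement is SD × sqrt(1 − reliability). The function names and the numbers are hypothetical.

```python
import math

def reliability_from_variances(true_var, error_var):
    """Reliability = true-score variance / observed-score variance (errors assumed uncorrelated)."""
    return true_var / (true_var + error_var)

def standard_error_of_measurement(sd_observed, reliability):
    """SEM = SD of observed scores * sqrt(1 - reliability)."""
    return sd_observed * math.sqrt(1.0 - reliability)

# Made-up variances: observed variance is 80 + 20 = 100, so the observed SD is 10.
r = reliability_from_variances(true_var=80.0, error_var=20.0)
sem = standard_error_of_measurement(sd_observed=10.0, reliability=r)
print(r, round(sem, 3))
```

A reliability of 0.8 with an observed SD of 10 gives a standard error of roughly 4.47 score points around an individual's observed score.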


Kappa Coefficient

Measures the degree to which judges agree. A measure of inter-rater reliability. Increases when raters are well-trained and aware of being observed.

Applicable only with

  1. nominal
  2. ordinal
  3. discontinuous data.


Ranges of Kappa Coefficient

-1 to +1; .80 - .90 indicates good agreement
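
A minimal sketch of the calculation for two raters assigning nominal categories, assuming the card refers to Cohen's kappa: (observed agreement − chance agreement) / (1 − chance agreement). The function name and ratings are made up.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of cases")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's category proportions.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)

# Made-up ratings of ten cases ("y" = symptom present, "n" = absent).
rater_a = ["y", "y", "n", "y", "n", "n", "y", "n", "y", "y"]
rater_b = ["y", "n", "n", "y", "n", "y", "y", "n", "y", "y"]
print(round(cohens_kappa(rater_a, rater_b), 3))
```

The raters agree on 8 of 10 cases (0.80), but 0.52 agreement is expected by chance, so kappa is 0.28 / 0.48, about 0.583.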


Convergent Validity

Indicates the degree of correlation between two instruments that are intended to measure the same thing


Continuous Data

A term used to refer to interval/ratio data


Internal Consistency

A measure indicating the extent to which items within an instrument are correlated with each other; internal consistency indicates the extent to which the given items measure the same construct.


Kuder-Richardson Formula 20

A method of evaluating internal consistency reliability

  • used when test items are dichotomously scored
  • used when test items vary in difficulty
  • indicates the degree to which test items are homogeneous
  • falsely elevates internal consistency when used with timed tests
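
The formula itself is not given on the card; the sketch below uses the standard form KR-20 = (k / (k − 1)) × (1 − Σpq / variance of total scores), where k is the number of items, p is each item's proportion correct, and q = 1 − p. Using the population variance of total scores is an assumption, and the score matrix is made up.

```python
from statistics import pvariance

def kr20(score_matrix):
    """KR-20 for dichotomous items; one row per examinee, one 0/1 entry per item."""
    n = len(score_matrix)
    k = len(score_matrix[0])
    if k < 2:
        raise ValueError("KR-20 needs at least two items")
    totals = [sum(row) for row in score_matrix]
    total_var = pvariance(totals)  # population variance of total scores (assumed)
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in score_matrix) / n  # item difficulty
        pq_sum += p * (1.0 - p)
    return (k / (k - 1)) * (1.0 - pq_sum / total_var)

# Made-up scores: five examinees, four items of increasing difficulty.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(scores), 3))
```

For this matrix Σpq is 0.8 and the total-score variance is 2.0, so KR-20 is (4/3) × (1 − 0.4) = 0.8.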


Single-Subjects Designs

Involve one or more participants and focus on assessing variables within an individual rather than between individuals. They are idiographic (differences within a participant) rather than nomothetic (differences between participants).


2 types of single-subject designs

Case study (describes an individual by using tests or naturalistic observation) or experimental (determines how the introduction of a factor affects behavior)


Problems with single-subject designs

  1. Autocorrelation (when measured on the same variable multiple times, the variable becomes correlated with itself)
  2. Time-intensive (multiple assessments or intense observations are time-consuming)
  3. Generalizability (may not generalize)
  4. Practice effects (scores may increase from practice)



Nomothetic approach

An approach to personality that focuses on groups of individuals and tries to find the commonalities between individuals.



Multicollinearity

Two or more of the predictors in a regression model are moderately or highly correlated. Adversely affects regression analysis.



Quantitative Research

Systematic empirical exploration of relationships; deductive, rather than inductive. Involves the collection and statistical analysis of quantitative data, whose results can often be generalized.



Reliability

Refers to the consistency or repeatability of data; pertains to quantitative research



ANOVA (analysis of variance)

  1. Tests for differences in the mean scores of groups based on one or more variables.
  2. DV must be continuous and IV must be categorical.
  3. Tests the null hypothesis that the means of the groups are equal.
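
The three points above match a one-way analysis of variance. Assuming that reading, the sketch below computes the F statistic by hand as the between-group mean square divided by the within-group mean square; the function name and the scores are hypothetical, and no p-value is computed.

```python
def one_way_anova_f(groups):
    """F = (SS_between / (k - 1)) / (SS_within / (N - k)) for k groups of scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more scores than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Made-up continuous DV scores for three levels of a categorical IV.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(round(one_way_anova_f(groups), 3))
```

For these groups the between-group sum of squares is 26 on 2 degrees of freedom and the within-group sum of squares is 6 on 6 degrees of freedom, giving F = 13. A large F relative to the F distribution with those degrees of freedom is evidence against the null hypothesis of equal group means.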