CHAPTER 5: RELIABILITY Flashcards

1
Q

It is an index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total variance.

A

Reliability Coefficient
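
Worked note (values are hypothetical, not from the deck): total variance decomposes as σ² = σ²tr + σ²e, and the reliability coefficient is the proportion σ²tr / σ². A minimal Python sketch of the arithmetic:

```python
# Reliability as the ratio of true-score variance to total variance.
# The variance values below are made up for illustration only.
true_variance = 80.0   # variance due to true differences among test takers
error_variance = 20.0  # variance due to random, irrelevant sources
total_variance = true_variance + error_variance  # observed-score variance

reliability = true_variance / total_variance
print(reliability)  # 0.8 -> 80% of score variance reflects true differences
```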

2
Q

A statistic useful for describing sources of test score variability; it is the standard deviation squared.

A

Variance (σ²)

3
Q

Variance that comes from true differences in the characteristic being measured.

A

True Variance (σ²tr)

4
Q

Variance from irrelevant, random sources.

A

Error Variance (σ²e)

5
Q

It is the difference between a person’s observed score on a test and their true score. It reflects inaccuracies or inconsistencies in the testing process, such as unclear questions, environmental distractions, or test-taker factors like fatigue. Collectively, it refers to all of the factors associated with the process of measuring some variable other than the variable being measured.

A

Measurement Error

6
Q

It is a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process. Sometimes referred to as “noise,” this source of error fluctuates from one testing situation to another with no discernible pattern that would systematically raise or lower scores.

A

Random error

7
Q

It refers to a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured.

A

Systematic Error

8
Q

Terms that refer to variation among items within a test as well as to variation among items between tests.

A

Item Sampling or Content Sampling

9
Q

What are the sources of error variance?

A

Test construction
Test administration
Test scoring and interpretation

10
Q

It introduces error variance primarily through item or content sampling, meaning that variation in item wording or topic selection can influence test scores. Even tests aiming to measure the same trait or knowledge area may differ significantly based on what and how the content is presented. A test taker’s performance can be boosted or hindered depending on whether the specific items align with what they know or expect. The key challenge for test developers is to minimize error variance and maximize true variance so that scores more accurately reflect the intended construct.

A

Test Construction

11
Q

A source of error variance arising from environmental, test-taker, and examiner-related factors. Environmental conditions like room temperature, noise, or seating can distract examinees, while physical or emotional discomfort, fatigue, or even current events may impact their performance. Test-taker variables such as illness, medications, or personal experiences also affect scores. Additionally, examiner behavior, such as inconsistent administration, physical cues, or personal biases, can unintentionally influence outcomes. Altogether, these factors can distort test results, making them less reflective of the true ability or trait being measured.

A

Test Administration

12
Q

It can introduce error variance, especially in assessments requiring human judgment. While computer scoring has minimized errors for objective tests, many assessments—like intelligence, personality, creativity, and behavioral tests—still depend on trained scorers. Subjectivity can lead to variability in scoring, especially when responses fall in gray areas or when raters interpret behaviors differently. Even with detailed scoring guidelines, inconsistencies may arise due to individual differences among scorers. To reduce such errors, rigorous training and clear criteria are essential to ensure scoring reliability and fairness.

A

Test Scoring and Interpretation

13
Q

It is an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test.

A

Test-retest Reliability
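
Worked note: the estimate is simply the correlation (typically Pearson's r) between the two administrations. A minimal sketch with made-up scores:

```python
import numpy as np

# Hypothetical scores for the same five people on two administrations.
time_1 = np.array([10, 14, 18, 22, 26])
time_2 = np.array([11, 13, 19, 21, 27])

# The test-retest reliability estimate is the Pearson r between the two sets.
r = np.corrcoef(time_1, time_2)[0, 1]
print(round(r, 3))
```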

14
Q

It is a type of test-retest reliability estimate that reflects the consistency of test scores over a long time interval, typically more than six months.

A

Coefficient of Stability

15
Q

It refers to an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal.

A

Parallel Forms Reliability

16
Q

The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability.

A

Coefficient of Equivalence

17
Q

P_____ ____ of a test exist when, for each form of the test, the means and the variances of observed test scores are equal. They are strictly equivalent in means, variances, and reliability.

A

Parallel forms

18
Q

It refers to an estimate of the extent to which different forms of the same test have been affected by item sampling error, or other error.

A

Alternate Forms Reliability

19
Q

These are simply different versions of a test that have been constructed so as to be parallel. Although they do not meet the requirements for the legitimate designation “parallel,” ____ forms of a test are typically designed to be equivalent with respect to variables such as content and level of difficulty. Different versions intended to be similar; may not be statistically equal.

A

Alternate forms

20
Q

It is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once. It is a useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice (because of factors such as time or expense).

A

Split-half Reliability

21
Q

It is a method of estimating internal consistency by correlating odd vs. even items on a test. It is a type of split-half reliability where you split the test into two halves by assigning all odd-numbered items (1, 3, 5…) to one half, and even-numbered items (2, 4, 6…) to the other half.

A

Odd-even Reliability

22
Q

It allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test.

A

Spearman–Brown formula
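
Worked note for the three cards above: split the test into halves (e.g., odd vs. even items), correlate the half scores, then apply the Spearman–Brown formula r_sb = n*r / (1 + (n - 1)*r) with n = 2 to estimate full-length reliability. A sketch with hypothetical 0/1 item scores:

```python
import numpy as np

# Hypothetical item scores: rows = test takers, columns = items (0/1).
scores = np.array([
    [1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 1],
])

odd_half = scores[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
even_half = scores[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...

r_half = np.corrcoef(odd_half, even_half)[0, 1]  # half-test correlation

# Spearman-Brown correction for a test twice as long as each half (n = 2):
r_sb = (2 * r_half) / (1 + r_half)
print(round(r_half, 3), round(r_sb, 3))
```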

23
Q

It refers to the degree of correlation among all the items on a scale.

A

Inter-item Consistency

24
Q

H_________ in testing means that all items on a test measure the same trait or construct.

A

Homogeneity

25
Q

H__________ in testing means that a test measures multiple traits or factors, not just one.

A

Heterogeneity

26
Q

____ tells you how consistently the test items measure the same construct when the answers are either 0 or 1 (wrong or right).

A

Kuder–Richardson formula 20 (KR-20)

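Worked note: KR-20 = (k / (k - 1)) * (1 - sum(p*q) / var(total)), where k is the number of items, p is the proportion passing each item, and q = 1 - p. A minimal sketch with made-up 0/1 data:

```python
import numpy as np

# Hypothetical dichotomous (0/1) item scores: rows = people, columns = items.
X = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
])

k = X.shape[1]                   # number of items
p = X.mean(axis=0)               # proportion passing each item
q = 1 - p                        # proportion failing each item
total_var = X.sum(axis=1).var()  # variance of total scores (population form)

kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(round(kr20, 3))
```
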
27
Q

It is a measure of internal consistency reliability—that is, how well the items in a test hang together or measure the same overall concept. It tells you how consistently people respond to items in a test or questionnaire that use scales (like rating scales, Likert scales, etc.).

A

Cronbach’s alpha

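Worked note: coefficient alpha generalizes KR-20 beyond 0/1 items: alpha = (k / (k - 1)) * (1 - sum(item variances) / var(total)). A sketch with hypothetical Likert-type responses:

```python
import numpy as np

# Hypothetical 5-point Likert responses: rows = people, columns = items.
X = np.array([
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [1, 2, 1, 2],
    [3, 3, 3, 4],
])

k = X.shape[1]
item_vars = X.var(axis=0)        # variance of each item (population form)
total_var = X.sum(axis=1).var()  # variance of total scores

alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(round(alpha, 3))
```
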
28
Q

This tells you how consistent the responses are across all test items by checking the average proportional distance between item scores. Smaller differences between items indicate higher internal consistency; larger differences indicate lower internal consistency.

A

Average Proportional Distance (APD)

29
Q

It is the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure.

A

Inter-scorer Reliability

30
Q

It is a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences.

A

Dynamic Characteristic

31
Q

It is a trait, state, or ability presumed to be relatively unchanging.

A

Static Characteristic

32
Q

It refers to a situation in which the full range of possible scores or variability in a dataset is limited or narrowed, often due to the way a sample is selected. This typically leads to lower correlation coefficients because there is less variability to detect relationships between variables.

A

Restriction of range

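Worked note: a quick simulation (entirely made-up data) of the effect; keeping only the top half of one variable shrinks the observed correlation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate two positively correlated variables for 1,000 cases.
x = rng.normal(size=1000)
y = 0.7 * x + rng.normal(scale=0.7, size=1000)

r_full = np.corrcoef(x, y)[0, 1]

# Restrict the range: keep only cases above the median of x,
# as when a sample is preselected on the predictor.
mask = x > np.median(x)
r_restricted = np.corrcoef(x[mask], y[mask])[0, 1]

print(round(r_full, 3), round(r_restricted, 3))  # restricted r is smaller
```
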
33
Q

It occurs when the range of scores is artificially widened, which can inflate the correlation coefficient, potentially exaggerating the strength of relationships between variables.

A

Inflation of range (or inflation of variance)

34
Q

A test designed with a strict time limit where items are typically easy and uniform in difficulty. The main goal is to measure how quickly a test-taker can perform. Most test-takers do not finish all items, and scores are based on the number of items completed correctly within the time limit.

A

Speed Test

35
Q

A test designed without a strict time limit (or with a time limit generous enough that time is not a factor). Items range in difficulty, and few, if any, test-takers get a perfect score. It measures the level of ability or knowledge rather than speed.

A

Power Test

36
Q

It is designed to provide an indication of where a test taker stands with respect to some variable or criterion, such as an educational or vocational objective. C________-________ tests tend to contain material that has been mastered in hierarchical fashion.

A

Criterion-referenced Test

37
Q

It is a method used to measure how reliable and accurate a test is. It says that a person's test score is made up of two parts: their true score (the actual ability or trait being measured) and error (random factors that affect the score).

A

Classical Test Theory (CTT)

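Worked note: the X = T + E decomposition can be simulated; with independent true scores and errors, var(X) is approximately var(T) + var(E), and reliability is the true-variance share. All parameters below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Classical test theory: observed score X = true score T + random error E.
true_scores = rng.normal(loc=100, scale=9.0, size=5000)  # hypothetical T
errors = rng.normal(loc=0, scale=4.5, size=5000)         # hypothetical E
observed = true_scores + errors                          # X = T + E

# Reliability: share of observed variance that is true variance.
reliability = true_scores.var() / observed.var()
print(round(reliability, 2))  # close to 9**2 / (9**2 + 4.5**2) = 0.8
```
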
38
Q

It seeks to estimate how much the test score is influenced by specific sources of variation under certain conditions. In this theory, the test's reliability is viewed as a measure of how accurately the test samples the behavior domain it intends to assess.

A

Domain Sampling Theory

39
Q

It is a modification of domain sampling theory, where a "universe score" replaces the true score. It suggests that a person’s test scores vary due to variables in the testing situation. It considers how facets of the testing situation (e.g., number of items, tester’s experience, test purpose) affect the scores.

A

Generalizability Theory

40
Q

In generalizability theory, it is the score that represents the true level of the trait being assessed in the "universe" of possible situations, rather than just a true score in a specific test.

A

Universe Score

41
Q

The different elements of the testing situation that can affect the test score, such as test items, scorer training, or test conditions (e.g., group vs. individual testing).

A

Facets

42
Q

A study that examines how test scores generalize across different testing situations and contexts, assessing the influence of various facets of the "universe" (e.g., test administration conditions) on the scores.

A

Generalizability Study

43
Q

These coefficients are similar to reliability coefficients and indicate how much different facets (e.g., test items, test conditions) contribute to the reliability or generalizability of test scores across different situations.

A

Coefficients of Generalizability

44
Q

A study that applies the results from the generalizability study to determine how useful the test scores are in making decisions (e.g., placement, hiring) and how reliable those scores are in different contexts. It helps assess the practical use of test scores in decision-making.

A

Decision Study

45
Q

It is a method used to understand how a person's abilities or traits affect their performance on test items. It looks at how difficult the test items are and how well they distinguish between people with different levels of the trait being measured.

A

Item Response Theory (IRT)

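Worked note: one common formulation (an assumption here, not necessarily the model this deck has in mind) is the two-parameter logistic model, P(theta) = 1 / (1 + exp(-a * (theta - b))), where b is item difficulty and a is discrimination:

```python
import numpy as np

def item_response_prob(theta, a, b):
    """Two-parameter logistic (2PL) IRT model: probability of a correct
    response at trait level theta, for an item with discrimination a and
    difficulty b. Fixing a = 1 for every item gives the Rasch (1PL) model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical item: average difficulty (b = 0), high discrimination (a = 2).
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta, round(item_response_prob(theta, a=2.0, b=0.0), 3))
```
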
46
Q

A model that explains how unobservable traits (like abilities or personality traits) influence a person’s performance on a test.

A

Latent-Trait Theory

47
Q

The ability of a test item to differentiate between individuals who have high or low levels of the trait being measured.

A

Discrimination

48
Q

Test items that have only two possible answers, such as true/false, yes/no, or correct/incorrect.

A

Dichotomous Test Items

49
Q

Test items that have three or more possible answers, where only one is correct or consistent with the trait being measured.

A

Polytomous Test Items

50
Q

A specific type of IRT model that assumes all test items measure the same underlying trait in the same way.

A

Rasch Model

51
Q

It provides a measure of the precision of an observed test score: an estimate of the amount of error inherent in an observed score or measurement.

A

Standard Error of Measurement

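Worked note: the usual formula is SEM = SD * sqrt(1 - r), where SD is the standard deviation of test scores and r is the reliability coefficient. With hypothetical values:

```python
import math

sd = 15.0  # hypothetical standard deviation of the score scale
r = 0.91   # hypothetical reliability coefficient

sem = sd * math.sqrt(1 - r)  # SEM = SD * sqrt(1 - r)
print(round(sem, 2))  # 4.5
```
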
52
Q

It is a range around a test score that is likely to contain the person's true score, based on a certain level of confidence (like 95%). It accounts for measurement error in testing.

A

Confidence Interval

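Worked note: a common simple form (centered on the observed score rather than the estimated true score) is X ± z * SEM. Using the hypothetical SEM from the previous sketch:

```python
observed_score = 106.0  # hypothetical observed score
sem = 4.5               # hypothetical standard error of measurement
z = 1.96                # z value for 95% confidence

lower = observed_score - z * sem
upper = observed_score + z * sem
print(round(lower, 1), round(upper, 1))  # 97.2 to 114.8
```
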
53
Q

It is an estimate of how much two test scores (often from the same person on different tests or subtests) are expected to differ just by chance due to measurement error. It helps determine whether the difference between two scores is statistically meaningful.

A

Standard Error of the Difference (SED)

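Worked note: SED = sqrt(SEM1**2 + SEM2**2); a difference larger than about 1.96 * SED is unlikely (p < .05) to be due to measurement error alone. With hypothetical values:

```python
import math

sem_1 = 4.5  # hypothetical SEM of the first score
sem_2 = 5.0  # hypothetical SEM of the second score

sed = math.sqrt(sem_1**2 + sem_2**2)

# Minimum difference needed for significance at the .05 level (two-tailed):
print(round(sed, 2), round(1.96 * sed, 2))
```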