CHAPTER 5: RELIABILITY Flashcards

1
Q

It is an index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total variance.

A

Reliability Coefficient
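
Worked note (values are hypothetical, not from the deck): total variance decomposes as σ² = σ²tr + σ²e, and the reliability coefficient is the proportion σ²tr / σ². A minimal Python sketch of the arithmetic:

```python
# Reliability as the ratio of true-score variance to total variance.
# The variance values below are made up for illustration only.
true_variance = 80.0   # variance due to true differences among test takers
error_variance = 20.0  # variance due to random, irrelevant sources
total_variance = true_variance + error_variance  # observed-score variance

reliability = true_variance / total_variance
print(reliability)  # 0.8 -> 80% of score variance reflects true differences
```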

2
Q

A statistic useful for describing sources of test score variability; it is the standard deviation squared.

A

Variance (σ²)

3
Q

Variance that comes from true differences in the characteristic being measured.

A

True Variance (σ²tr)

4
Q

Variance from irrelevant, random sources.

A

Error Variance (σ²e)

5
Q

It is the difference between a person’s observed score on a test and their true score. It reflects inaccuracies or inconsistencies in the testing process, such as unclear questions, environmental distractions, or test-taker factors like fatigue. Collectively, it refers to all of the factors associated with the process of measuring some variable other than the variable being measured.

A

Measurement Error

6
Q

It is a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process. Sometimes referred to as “noise,” this source of error fluctuates from one testing situation to another with no discernible pattern that would systematically raise or lower scores.

A

Random error

7
Q

It refers to a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured.

A

Systematic Error

8
Q

Terms that refer to variation among items within a test as well as to variation among items between tests.

A

Item Sampling or Content Sampling

9
Q

What are the sources of error variance?

A

Test construction
Test administration
Test scoring and interpretation

10
Q

It introduces error variance primarily through item or content sampling, meaning that variation in item wording or topic selection can influence test scores. Even tests aiming to measure the same trait or knowledge area may differ significantly based on what and how the content is presented. A test taker’s performance can be boosted or hindered depending on whether the specific items align with what they know or expect. The key challenge for test developers is to minimize error variance and maximize true variance so that scores more accurately reflect the intended construct.

A

Test Construction

11
Q

A source of error variance arising from environmental, test-taker, and examiner-related factors. Environmental conditions like room temperature, noise, or seating can distract examinees, while physical or emotional discomfort, fatigue, or even current events may impact their performance. Test-taker variables such as illness, medications, or personal experiences also affect scores. Additionally, examiner behavior, such as inconsistent administration, physical cues, or personal biases, can unintentionally influence outcomes. Altogether, these factors can distort test results, making them less reflective of the true ability or trait being measured.

A

Test Administration

12
Q

It can introduce error variance, especially in assessments requiring human judgment. While computer scoring has minimized errors for objective tests, many assessments—like intelligence, personality, creativity, and behavioral tests—still depend on trained scorers. Subjectivity can lead to variability in scoring, especially when responses fall in gray areas or when raters interpret behaviors differently. Even with detailed scoring guidelines, inconsistencies may arise due to individual differences among scorers. To reduce such errors, rigorous training and clear criteria are essential to ensure scoring reliability and fairness.

A

Test Scoring and Interpretation

13
Q

It is an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test.

A

Test-retest Reliability
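
Worked note: the estimate is simply the correlation (typically Pearson's r) between the two administrations. A minimal sketch with made-up scores:

```python
import numpy as np

# Hypothetical scores for the same five people on two administrations.
time_1 = np.array([10, 14, 18, 22, 26])
time_2 = np.array([11, 13, 19, 21, 27])

# The test-retest reliability estimate is the Pearson r between the two sets.
r = np.corrcoef(time_1, time_2)[0, 1]
print(round(r, 3))
```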

14
Q

It is a type of test-retest reliability estimate that reflects the consistency of test scores over a long time interval, typically more than six months.

A

Coefficient of Stability

15
Q

It refers to an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal.

A

Parallel Forms Reliability

16
Q

The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability.

A

Coefficient of Equivalence

17
Q

P_____ ____ of a test exist when, for each form of the test, the means and the variances of observed test scores are equal. They are strictly equivalent in means, variances, and reliability.

A

Parallel forms

18
Q

It refers to an estimate of the extent to which different forms of the same test have been affected by item sampling error, or other error.

A

Alternate Forms Reliability

19
Q

These are simply different versions of a test that have been constructed so as to be parallel. Although they do not meet the requirements for the legitimate designation “parallel,” ____ forms of a test are typically designed to be equivalent with respect to variables such as content and level of difficulty. Different versions intended to be similar; may not be statistically equal.

A

Alternate forms

20
Q

It is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once. It is a useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice (because of factors such as time or expense).

A

Split-half Reliability

21
Q

It is a method of estimating internal consistency by correlating odd vs. even items on a test. It is a type of split-half reliability where you split the test into two halves by assigning all odd-numbered items (1, 3, 5…) to one half, and even-numbered items (2, 4, 6…) to the other half.

A

Odd-even Reliability

22
Q

It allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test.

A

Spearman–Brown formula
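
Worked note for the three cards above: split the test into halves (e.g., odd vs. even items), correlate the half scores, then apply the Spearman–Brown formula r_sb = n*r / (1 + (n - 1)*r) with n = 2 to estimate full-length reliability. A sketch with hypothetical 0/1 item scores:

```python
import numpy as np

# Hypothetical item scores: rows = test takers, columns = items (0/1).
scores = np.array([
    [1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 1],
])

odd_half = scores[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
even_half = scores[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...

r_half = np.corrcoef(odd_half, even_half)[0, 1]  # half-test correlation

# Spearman-Brown correction for a test twice as long as each half (n = 2):
r_sb = (2 * r_half) / (1 + r_half)
print(round(r_half, 3), round(r_sb, 3))
```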

23
Q

It refers to the degree of correlation among all the items on a scale.

A

Inter-item Consistency

24
Q

H_________ in testing means that all items on a test measure the same trait or construct.

A

Homogeneity

25
Q

H__________ in testing means that a test measures multiple traits or factors, not just one.

A

Heterogeneity

26
Q

____ tells you how consistently the test items measure the same construct when the answers are either 0 or 1 (wrong or right).

A

Kuder–Richardson formula 20 (KR-20)

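Worked note: KR-20 = (k / (k - 1)) * (1 - sum(p*q) / var(total)), where k is the number of items, p is the proportion passing each item, and q = 1 - p. A minimal sketch with made-up 0/1 data:

```python
import numpy as np

# Hypothetical dichotomous (0/1) item scores: rows = people, columns = items.
X = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
])

k = X.shape[1]                   # number of items
p = X.mean(axis=0)               # proportion passing each item
q = 1 - p                        # proportion failing each item
total_var = X.sum(axis=1).var()  # variance of total scores (population form)

kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(round(kr20, 3))
```
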
27
Q

It is a measure of internal consistency reliability—that is, how well the items in a test hang together or measure the same overall concept. It tells you how consistently people respond to items in a test or questionnaire that use scales (like rating scales, Likert scales, etc.).

A

Cronbach’s alpha

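Worked note: coefficient alpha generalizes KR-20 beyond 0/1 items: alpha = (k / (k - 1)) * (1 - sum(item variances) / var(total)). A sketch with hypothetical Likert-type responses:

```python
import numpy as np

# Hypothetical 5-point Likert responses: rows = people, columns = items.
X = np.array([
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [1, 2, 1, 2],
    [3, 3, 3, 4],
])

k = X.shape[1]
item_vars = X.var(axis=0)        # variance of each item (population form)
total_var = X.sum(axis=1).var()  # variance of total scores

alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(round(alpha, 3))
```
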
28
Q

This tells you how consistent the responses are across all test items by checking the average proportional distance between item scores. Smaller differences between items indicate higher internal consistency; larger differences indicate lower internal consistency.

A

Average Proportional Distance (APD)

29
Q

It is the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure.

A

Inter-scorer Reliability

30
Q

It is a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences.

A

Dynamic Characteristic

31
Q

It is a trait, state, or ability presumed to be relatively unchanging.

A

Static Characteristic

32
Q

It refers to a situation in which the full range of possible scores or variability in a dataset is limited or narrowed, often due to the way a sample is selected. This typically leads to lower correlation coefficients because there is less variability to detect relationships between variables.

A

Restriction of range

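Worked note: a quick simulation (entirely made-up data) of the effect; keeping only the top half of one variable shrinks the observed correlation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate two positively correlated variables for 1,000 cases.
x = rng.normal(size=1000)
y = 0.7 * x + rng.normal(scale=0.7, size=1000)

r_full = np.corrcoef(x, y)[0, 1]

# Restrict the range: keep only cases above the median of x,
# as when a sample is preselected on the predictor.
mask = x > np.median(x)
r_restricted = np.corrcoef(x[mask], y[mask])[0, 1]

print(round(r_full, 3), round(r_restricted, 3))  # restricted r is smaller
```
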
33
Q

It occurs when the range of scores is artificially widened, which can inflate the correlation coefficient, potentially exaggerating the strength of relationships between variables.

A

Inflation of range (or inflation of variance)

34
Q

A test designed with a strict time limit where items are typically easy and uniform in difficulty. The main goal is to measure how quickly a test-taker can perform. Most test-takers do not finish all items, and scores are based on the number of items completed correctly within the time limit.

A

Speed Test

35
Q

A test designed without a strict time limit (or with a time limit generous enough that time is not a factor). Items range in difficulty, and few, if any, test-takers get a perfect score. It measures the level of ability or knowledge rather than speed.

A

Power Test

36
Q

It is designed to provide an indication of where a test taker stands with respect to some variable or criterion, such as an educational or vocational objective. C________-________ tests tend to contain material that has been mastered in hierarchical fashion.

A

Criterion-referenced Test

37
Q

It is a method used to measure how reliable and accurate a test is. It says that a person's test score is made up of two parts: their true score (the actual ability or trait being measured) and error (random factors that affect the score).

A

Classical Test Theory (CTT)

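Worked note: the X = T + E decomposition can be simulated; with independent true scores and errors, var(X) is approximately var(T) + var(E), and reliability is the true-variance share. All parameters below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Classical test theory: observed score X = true score T + random error E.
true_scores = rng.normal(loc=100, scale=9.0, size=5000)  # hypothetical T
errors = rng.normal(loc=0, scale=4.5, size=5000)         # hypothetical E
observed = true_scores + errors                          # X = T + E

# Reliability: share of observed variance that is true variance.
reliability = true_scores.var() / observed.var()
print(round(reliability, 2))  # close to 9**2 / (9**2 + 4.5**2) = 0.8
```
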
38
Q

It seeks to estimate how much the test score is influenced by specific sources of variation under certain conditions. In this theory, the test's reliability is viewed as a measure of how accurately the test samples the behavior domain it intends to assess.

A

Domain Sampling Theory

39
Q

It is a modification of domain sampling theory, where a "universe score" replaces the true score. It suggests that a person’s test scores vary due to variables in the testing situation. It considers how facets of the testing situation (e.g., number of items, tester’s experience, test purpose) affect the scores.

A

Generalizability Theory

40
Q

In generalizability theory, it is the score that represents the true level of the trait being assessed in the "universe" of possible situations, rather than just a true score in a specific test.

A

Universe Score

41
Q

The different elements of the testing situation that can affect the test score, such as test items, scorer training, or test conditions (e.g., group vs. individual testing).

A

Facets

42
Q

A study that examines how test scores generalize across different testing situations and contexts, assessing the influence of various facets of the "universe" (e.g., test administration conditions) on the scores.

A

Generalizability Study

43
Q

These coefficients are similar to reliability coefficients and indicate how much different facets (e.g., test items, test conditions) contribute to the reliability or generalizability of test scores across different situations.

A

Coefficients of Generalizability

44
Q

A study that applies the results from the generalizability study to determine how useful the test scores are in making decisions (e.g., placement, hiring) and how reliable those scores are in different contexts. It helps assess the practical use of test scores in decision-making.

A

Decision Study

45
Q

It is a method used to understand how a person's abilities or traits affect their performance on test items. It looks at how difficult the test items are and how well they distinguish between people with different levels of the trait being measured.

A

Item Response Theory (IRT)

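Worked note: one common formulation (an assumption here, not necessarily the model this deck has in mind) is the two-parameter logistic model, P(theta) = 1 / (1 + exp(-a * (theta - b))), where b is item difficulty and a is discrimination:

```python
import numpy as np

def item_response_prob(theta, a, b):
    """Two-parameter logistic (2PL) IRT model: probability of a correct
    response at trait level theta, for an item with discrimination a and
    difficulty b. Fixing a = 1 for every item gives the Rasch (1PL) model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical item: average difficulty (b = 0), high discrimination (a = 2).
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta, round(item_response_prob(theta, a=2.0, b=0.0), 3))
```
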
46
Q

A model that explains how unobservable traits (like abilities or personality traits) influence a person’s performance on a test.

A

Latent-Trait Theory

47
Q

The ability of a test item to differentiate between individuals who have high or low levels of the trait being measured.

A

Discrimination

48
Q

Test items that have only two possible answers, such as true/false, yes/no, or correct/incorrect.

A

Dichotomous Test Items

49
Q

Test items that have three or more possible answers, where only one is correct or consistent with the trait being measured.

A

Polytomous Test Items

50
Q

A specific type of IRT model that assumes all test items measure the same underlying trait in the same way.

A

Rasch Model

51
Q

It provides a measure of the precision of an observed test score: an estimate of the amount of error inherent in an observed score or measurement.

A

Standard Error of Measurement

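Worked note: the usual formula is SEM = SD * sqrt(1 - r), where SD is the standard deviation of test scores and r is the reliability coefficient. With hypothetical values:

```python
import math

sd = 15.0  # hypothetical standard deviation of the score scale
r = 0.91   # hypothetical reliability coefficient

sem = sd * math.sqrt(1 - r)  # SEM = SD * sqrt(1 - r)
print(round(sem, 2))  # 4.5
```
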
52
Q

It is a range around a test score that is likely to contain the person's true score, based on a certain level of confidence (like 95%). It accounts for measurement error in testing.

A

Confidence Interval

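Worked note: a common simple form (centered on the observed score rather than the estimated true score) is X ± z * SEM. Using the hypothetical SEM from the previous sketch:

```python
observed_score = 106.0  # hypothetical observed score
sem = 4.5               # hypothetical standard error of measurement
z = 1.96                # z value for 95% confidence

lower = observed_score - z * sem
upper = observed_score + z * sem
print(round(lower, 1), round(upper, 1))  # 97.2 to 114.8
```
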
53
Q

It is an estimate of how much two test scores (often from the same person on different tests or subtests) are expected to differ just by chance due to measurement error. It helps determine whether the difference between two scores is statistically meaningful.

A

Standard Error of the Difference (SED)

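Worked note: SED = sqrt(SEM1**2 + SEM2**2); a difference larger than about 1.96 * SED is unlikely (p < .05) to be due to measurement error alone. With hypothetical values:

```python
import math

sem_1 = 4.5  # hypothetical SEM of the first score
sem_2 = 5.0  # hypothetical SEM of the second score

sed = math.sqrt(sem_1**2 + sem_2**2)

# Minimum difference needed for significance at the .05 level (two-tailed):
print(round(sed, 2), round(1.96 * sed, 2))
```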