Module 2: Norms and Reliability Flashcards
What is Classical Test Theory (CTT)?
CTT is a model for understanding measurement
CTT is based on the True Score Model…
… for each person, their observed score on a test is the sum of two components: - Observed score (X) = True Score (T) + Error (E)
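A minimal sketch of the model in code, using entirely hypothetical simulated numbers (the true scores and errors below are assumptions for illustration, not values from the course material):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical true ability levels (T) and random measurement error (E)
true_scores = rng.normal(loc=100, scale=15, size=1000)
errors = rng.normal(loc=0, scale=5, size=1000)

# Observed scores under the True Score Model: X = T + E
observed = true_scores + errors
```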
What is a true score?
True score is a person’s actual true ability level (i.e. measured without error).
What is error?
Error is the component of an observed score that is unrelated to the test taker's true ability or the trait being measured.
True variance and error variance thus refer to the true and error components of the variability in a collection/population of test scores.
What is reliability?
Reliability refers to consistency in measurement.
- According to CTT: reliability is the proportion of the total variance attributable to true variance
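Written out (a standard statement of the CTT definition above, using $\sigma^2$ for variance):

$$ r_{XX} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E} $$

where $\sigma^2_T$ is true variance, $\sigma^2_E$ is error variance, and $\sigma^2_X = \sigma^2_T + \sigma^2_E$ is the total observed-score variance.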
What is test administration error?
Test administration: variation due to the testing environment
- Testtaker variables (e.g., arousal, stress, physical discomfort, lack of sleep, drugs, medication)
- Examiner variables (e.g., physical appearance, demeanour)
What is test scoring and interpretation error?
Test scoring and interpretation:
Variation due to differences in scoring and interpretation
What are methodological errors?
Variation due to poor training, unstandardized administration, unclear questions, or biased questions.
CTT True Score Model vs. Alternative
- The True Score Model of measurement (based on CTT) is simple, intuitive, and thus widely used
- Another widely used model of measurement is Item Response Theory (IRT)
- CTT's assumptions are more readily met than IRT's, and it assumes only two components to measurement
- But, CTT assumes all items on a test have an equal ability to measure the underlying construct of interest.
Item Response Theory (IRT)
- IRT provides a way to model the probability that a person with X ability level will correctly answer a question that is ‘tuned’ to that ability level.
What does IRT incorporate and consider?
- IRT incorporates considerations of item difficulty and discrimination
o Difficulty relates to an item not being easily accomplished, solved, or comprehended.
o Discrimination refers to the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or construct being measured.
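The notes don't name a specific IRT model; a common choice is the two-parameter logistic (2PL) model, sketched below, where difficulty shifts the response curve and discrimination controls its steepness (the function name and numbers are illustrative assumptions):

```python
import numpy as np

def p_correct_2pl(theta, difficulty, discrimination):
    """2PL item response function: probability that a person with
    ability `theta` answers an item with the given parameters correctly."""
    return 1.0 / (1.0 + np.exp(-discrimination * (theta - difficulty)))

# A more discriminating item separates people near its difficulty level more sharply
print(p_correct_2pl(theta=0.5, difficulty=0.0, discrimination=1.0))  # ~0.62
print(p_correct_2pl(theta=0.5, difficulty=0.0, discrimination=2.5))  # ~0.78
```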
Reliability estimates
Because a person’s true score is unknown, we use different mathematical methods to estimate the reliability of tests.
Common examples include:
- Test-retest reliability
- Parallel and alternate forms reliability
- Internal consistency reliability
o E.g., split-half, inter-item correlation, Cronbach’s alpha
- Interrater/interscorer reliability
Test-retest reliability
Test-retest reliability is an estimate of reliability over time
- Obtained by correlating pairs of scores from the same people on administrations of the same test at different times
- Appropriate for stable variables (e.g., personality)
- Estimates tend to decrease as time passes
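A minimal sketch of the calculation, using made-up scores for six people tested twice:

```python
import numpy as np

# Hypothetical scores for the same six people on the same test at two times
time_1 = np.array([12, 15, 9, 20, 14, 17])
time_2 = np.array([13, 14, 10, 19, 15, 18])

# Test-retest reliability = correlation between the two administrations
r_test_retest = np.corrcoef(time_1, time_2)[0, 1]
print(round(r_test_retest, 2))
```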
Parallel and Alternate Forms Reliability
- Parallel forms: two versions of a test are parallel if, in both versions, the means and variances of test scores are equal
- Alternate forms: there is an attempt to create two forms of a test, but they do not meet the strict requirements of parallel forms
- Obtained by correlating the scores of the same people measured with the different forms.
Split half reliability
Obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.
Entails three steps:
- Step 1: Divide the test into two halves
- Step 2: Correlate scores on the two halves of the test.
- Step 3: Generalise the half-test reliability to the full-test reliability using the Spearman-Brown formula.
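A minimal sketch of the three steps, using a hypothetical matrix of item scores and an odd-even split (both the data and the choice of split are assumptions for illustration):

```python
import numpy as np

# Hypothetical item scores: rows = test takers, columns = items
items = np.array([
    [1, 0, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 1, 0, 1, 0, 1],
])

# Step 1: divide the test into two halves (here, odd- vs. even-numbered items)
half_a = items[:, 0::2].sum(axis=1)
half_b = items[:, 1::2].sum(axis=1)

# Step 2: correlate scores on the two halves
r_half = np.corrcoef(half_a, half_b)[0, 1]

# Step 3: Spearman-Brown formula to generalise to full-test reliability
r_full = (2 * r_half) / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))
```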
Inter-item consistency/correlation
The degree of relatedness of items on a test; used to gauge the homogeneity of a test.
Kuder-Richardson formula 20
The statistic of choice for determining the inter-item consistency of dichotomous items.
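The formula itself is not given in the notes; the standard form is:

$$ KR_{20} = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i q_i}{\sigma^2_X}\right) $$

where $k$ is the number of items, $p_i$ is the proportion of test takers passing item $i$, $q_i = 1 - p_i$, and $\sigma^2_X$ is the variance of total test scores.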
Coefficient alpha
The mean of all possible split-half correlations, corrected by the Spearman-Brown formula. The most popular approach for estimating internal consistency. Values range from 0 to 1.
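A minimal computational sketch of coefficient alpha from a (people × items) score matrix; the function name and example data are hypothetical, and the calculation follows the standard item-variance formula rather than anything specific to these notes:

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Coefficient (Cronbach's) alpha:
    k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Example with hypothetical item scores (rows = people, columns = items)
print(round(cronbach_alpha([[1, 0, 1, 1], [1, 1, 1, 1], [0, 0, 1, 0], [1, 1, 0, 1]]), 2))
```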
Interrater/InterScorer Reliability
The degree of agreement/consistency between two or more scorers (or judges or raters).
- Often used with behavioural measures
- Guards against biases or idiosyncrasies in scoring
- Obtained by correlating scores from different raters:
o Use intraclass correlation for continuous measures
o Use Cohen’s Kappa for categorical measures
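A minimal sketch for the categorical case: Cohen's kappa compares observed rater agreement with the agreement expected by chance (the ratings below are made up for illustration):

```python
import numpy as np

def cohens_kappa(rater_1, rater_2):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    rater_1, rater_2 = np.asarray(rater_1), np.asarray(rater_2)
    categories = np.union1d(rater_1, rater_2)
    observed = np.mean(rater_1 == rater_2)
    expected = sum(np.mean(rater_1 == c) * np.mean(rater_2 == c) for c in categories)
    return (observed - expected) / (1 - expected)

# Hypothetical categorical ratings from two scorers for six cases
print(round(cohens_kappa([1, 2, 2, 1, 3, 2], [1, 2, 1, 1, 3, 2]), 2))
```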
Choosing Reliability Estimates
The nature of the test will often determine the appropriate reliability estimate, e.g.:
- Are the test items homogeneous or heterogeneous in nature?
- Is the characteristic, ability, or trait being measured presumed to be dynamic or static?
- Is the range of test scores restricted or not?
- Is the test a speed test (how many items can you complete in a set time) or a power test (items of increasing difficulty)?
- Is the test criterion-referenced (to pass, you need to reach a threshold) or not?
Otherwise, you can select whatever you think is appropriate.
How do we account for reliability in a single score?
- Our reliability coefficient tells us about error in our test in general
- We can use this reliability estimate to understand how confident we can be in a single observed score for one person.
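The notes don't give the formula here, but the quantity usually used for this is the standard error of measurement (SEM), which turns the test's standard deviation and reliability into a margin of error around one observed score:

$$ SEM = \sigma_X \sqrt{1 - r_{XX}} $$

A rough 95% confidence band for a single observed score is then $X \pm 1.96 \times SEM$.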
Standard Error of the Difference (SED)
The SED is a measure of how large a difference between two test scores would need to be in order to be considered ‘statistically significant’.
Helps with three questions (Note: test 1&2 must be on the same scale)
- How did Person A’s performance on test 1 compare with own performance on test 2?
- How did Person A’s performance on test 1 compare with Person B’s performance on test 1?
- How did Person A’s performance on test 1 compare with Person B’s performance on test 2?
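The notes don't state the formula; the usual form combines the standard errors of measurement of the two tests (assuming both are on the same scale with standard deviation $\sigma$):

$$ SED = \sqrt{SEM_1^2 + SEM_2^2} = \sigma\sqrt{2 - r_{11} - r_{22}} $$

where $r_{11}$ and $r_{22}$ are the reliability coefficients of the two tests; a difference of roughly $2 \times SED$ or more is conventionally treated as significant.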
Standardization
is the process of administering tests to representative samples to establish norms.
Sampling
the selection of an intended population for the test that has at least one common, observable characteristic.