chapter 5 Flashcards by Antoinette Cruz

Is a synonym for dependability or consistency.
Refers to consistency in measurement.

Reliability

Is an index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total variance.

Reliability coefficient
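
The ratio on this card can be written in symbols. The notation here is an assumption, since the card gives none: \sigma^2 is total variance, \sigma^2_{tr} is true variance, \sigma^2_{e} is error variance, and r_{xx} is the reliability coefficient.

```latex
\sigma^2 = \sigma^2_{tr} + \sigma^2_{e},
\qquad
r_{xx} = \frac{\sigma^2_{tr}}{\sigma^2}
```

With this reading, a coefficient of 0.80 would mean that 80% of the observed score variance is attributed to true differences and 20% to error.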

Holds that a score on an ability test is presumed to reflect not only the testtaker’s true score on the ability being measured but also error.

Classical test theory

Variance from true differences.

True variance

A statistic useful in describing sources of test score variability.
This statistic is useful because it can be broken into components.
The standard deviation squared.

Variance
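
A small sketch may help show what "broken into components" means. The scores below are hypothetical and were chosen so that the true and error parts are uncorrelated, which is the condition under which the two variances add up exactly to the total.

```python
from statistics import pstdev, pvariance

# Hypothetical true-score and error components for four testtakers.
# They are constructed to be uncorrelated, so the variances add exactly.
true_part = [1, 1, -1, -1]
error_part = [1, -1, 1, -1]
observed = [t + e for t, e in zip(true_part, error_part)]

total_variance = pvariance(observed)    # variance of the observed scores
true_variance = pvariance(true_part)    # variance from true differences
error_variance = pvariance(error_part)  # variance from random error

# Variance is the standard deviation squared (up to floating-point rounding).
sd_squared = pstdev(observed) ** 2
```

Here `total_variance` equals `true_variance + error_variance`, and `sd_squared` matches `total_variance` up to rounding.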

Refers collectively to all of the factors associated with the process of measuring some variable, other than the variable being measured.

Measurement error

Variance from irrelevant, random sources

Error variance

Is a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process.
- Sometimes referred to as “noise,” this source of error fluctuates from one testing situation to another with no discernible pattern that would systematically raise or lower scores.

Random error

Refers to a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured.

Systematic error

Sources of Error Variance:

Sources of error variance include test construction, administration, scoring, and/or interpretation.

These terms refer to variation among items within a test as well as to variation among items between tests.
- A source of error variance under test construction

Item sampling or content sampling

Sources of error variance that occur during test administration may influence the testtaker’s attention or motivation. The testtaker’s reactions to those influences are the source of one kind of error variance.
- Examples of untoward influences during administration of a test include room temperature, level of lighting, and amount of ventilation and noise.

Test environment

Other potential sources of error variance during test administration: pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication.

Test-taker variables

The examiner’s physical appearance and demeanor—even the presence or absence of an examiner—are some factors for consideration here.

Examiner-related variables

In many tests, the advent of computer scoring and a growing reliance on objective, computer-scorable items have virtually eliminated error variance caused by scorer differences.
However, not all tests can be scored from grids blackened by no. 2 pencils. Individually administered intelligence tests, some tests of personality, tests of creativity, various behavioral measures, essay tests, portfolio assessment, situational behavior tests, and countless other tools of assessment still require scoring by trained personnel.

Test scoring and interpretation

Surveys and polls are two tools of assessment commonly used by researchers who study public opinion.
Certain types of assessment situations lend themselves to particular varieties of systematic and nonsystematic error.

Other sources of error

Reliability Estimates:

Test-Retest Reliability Estimates
Parallel-Forms and Alternate-Forms Reliability Estimates
Split-Half Reliability Estimates

Is an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test.
- Is appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time, such as a personality trait.

Test-retest reliability
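
As a rough illustration of "correlating pairs of scores," the sketch below computes a Pearson correlation between two administrations of the same test. The scores are hypothetical, and the choice of Pearson r is an assumption, since the card does not name a specific correlation statistic.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

# Hypothetical scores for the same five people on two administrations.
first_administration = [10, 12, 15, 18, 20]
second_administration = [11, 12, 14, 19, 21]

test_retest_r = pearson_r(first_administration, second_administration)
```

Each position in the two lists must belong to the same person. A value close to 1 suggests the scores held their relative order across the two sessions.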

When the interval between testing is greater than six months, the estimate of test-retest reliability is often referred to as

Coefficient of stability

The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability, which is often termed the

Coefficient of equivalence

Exist when, for each form of the test, the means and the variances of observed test scores are equal.

Parallel forms

Refers to an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal.

Parallel forms reliability

Are simply different versions of a test that have been constructed so as to be parallel.

Alternate forms

Refers to an estimate of the extent to which different forms of the same test have been affected by item sampling error or other error.

Alternate forms reliability

Deriving this type of estimate entails an evaluation of the internal consistency of the test items. Logically enough, it is referred to as an

Internal consistency estimate of reliability or as an estimate of inter-item consistency

There are different methods of obtaining internal consistency estimates of reliability. One such method is the

Split-half estimate

Is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.
- It is a useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice (because of factors such as time or expense).

Split-half reliability

Assigning odd-numbered items to one half of the test and even-numbered items to the other half yields an estimate of split-half reliability that is also referred to as

Odd-even reliability

Allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test.

Spearman–Brown formula
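
The split-half cards above can be sketched end to end: split the items into odd and even halves, correlate the half scores, then apply the Spearman–Brown adjustment. The item scores are hypothetical, and the general form with a length factor `n` is the standard statement of the formula rather than something given on the card; `n=2` corresponds to the two-halves case.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

def spearman_brown(r_half, n=2):
    """Estimated reliability of a test n times as long as the one that gave r_half."""
    return (n * r_half) / (1 + (n - 1) * r_half)

# Hypothetical 0/1 item scores: one row per testtaker, one column per item.
item_scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
]

odd_half = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5
even_half = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6

r_half = pearson_r(odd_half, even_half)      # correlation between the halves
full_test_estimate = spearman_brown(r_half)  # adjusted to full test length
```

The adjustment is needed because each half is only half as long as the full test, and shorter tests tend to give lower reliability estimates.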

Refers to the degree of correlation among all the items on a scale.

Inter-item consistency

- Is the degree to which a test measures a single factor; in other words, the extent to which items in a scale are unifactorial.
- Derived from the Greek words homos, meaning “same,” and genos, meaning “kind.”

Homogeneity

- Describes the degree to which a test measures different factors.
- A heterogeneous test is composed of items that measure more than one trait.

Heterogeneity, heterogeneous

Dissatisfaction with existing split-half methods of estimating reliability compelled G. Frederic Kuder and M. W. Richardson (1937; Richardson & Kuder, 1939) to develop their own measures for estimating reliability.
- The 20th formula developed in a series

Kuder–Richardson formula 20
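
The card does not state the formula itself. A minimal sketch of the commonly cited form for dichotomous items follows, with k the number of items, p the proportion passing each item, q = 1 - p, and the variance taken over total scores. The data are hypothetical, and using the population variance (dividing by N) is an assumption; some texts divide by N - 1.

```python
from statistics import pvariance

def kr20(item_scores):
    """KR-20 for dichotomous (0/1) items; rows are testtakers, columns are items."""
    n_people = len(item_scores)
    k = len(item_scores[0])
    totals = [sum(row) for row in item_scores]
    total_variance = pvariance(totals)  # population variance of total scores
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_scores) / n_people  # proportion passing item j
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / total_variance)

# Hypothetical 0/1 item scores for five testtakers on six items.
item_scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
]

kr20_estimate = kr20(item_scores)
```

The function assumes at least two items and some spread in total scores, since it divides by `k - 1` and by the total-score variance.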

A selected assortment of tests and assessment procedures used in the process of evaluation.
- Typically composed of tests designed to measure different variables

Test battery

- Developed by Cronbach (1951) and subsequently elaborated on by others (such as Kaiser & Michael, 1975; Novick & Lewis, 1967)
- May be thought of as the mean of all possible split-half correlations, corrected by the Spearman–Brown formula.

Coefficient alpha
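
For items that are not scored 0/1, the usual computational form of coefficient alpha compares the sum of the item variances with the variance of the total scores. The sketch below uses hypothetical rating-scale data and, as an assumption, population variances throughout.

```python
from statistics import pvariance

def coefficient_alpha(item_scores):
    """Coefficient alpha; rows are testtakers, columns are items."""
    k = len(item_scores[0])
    totals = [sum(row) for row in item_scores]
    item_variances = [
        pvariance([row[j] for row in item_scores]) for j in range(k)
    ]
    return (k / (k - 1)) * (1 - sum(item_variances) / pvariance(totals))

# Hypothetical ratings (1 to 5) from five testtakers on three items.
ratings = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 1],
    [3, 3, 3],
]

alpha = coefficient_alpha(ratings)
```

With these made-up ratings the items rise and fall together across people, so the estimate comes out high (about 0.98).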

- A relatively new measure for evaluating the internal consistency of a test
- Focuses on the degree of difference that exists between item scores

Average proportional distance (APD) method

- Is the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure.
- Often used when coding nonverbal behavior

Inter-scorer reliability

The simplest way of determining the degree of consistency among scorers in the scoring of a test is to calculate a coefficient of correlation.

Coefficient of inter-scorer reliability

A source of error attributable to variations in the test-taker’s feelings, moods, or mental state over time.

Transient error

Recall that a test is said to be homogeneous in items if it is functionally uniform throughout. Tests designed to measure one factor, such as one ability or one trait, are expected to be homogeneous in items. For such tests, it is reasonable to expect a high degree of internal consistency. By contrast, if the test is heterogeneous in items, an estimate of internal consistency might be low relative to a more appropriate estimate of test-retest reliability.

Homogeneity versus heterogeneity of test items

Is a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences.

Dynamic characteristic

In using and interpreting a coefficient of reliability, the issue variously referred to as restriction of range or restriction of variance (or, conversely, inflation of range or inflation of variance) is important. If the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower. If the variance of either variable in a correlational analysis is inflated by the sampling procedure, then the resulting correlation coefficient tends to be higher.

Restriction or inflation of range
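
The effect of restricting range can be seen with a small constructed example. The data are hypothetical: y follows x with a fixed alternating disturbance, and the "restricted" sample keeps only the middle x values, as a selective sampling procedure might.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

# Full sample: x runs from 1 to 10; y is x plus an alternating +1/-1 disturbance.
x_full = list(range(1, 11))
disturbance = [1, -1] * 5
y_full = [x + e for x, e in zip(x_full, disturbance)]

# Restricted sample: keep only cases with x between 4 and 7.
kept = [(x, y) for x, y in zip(x_full, y_full) if 4 <= x <= 7]
x_restricted = [x for x, _ in kept]
y_restricted = [y for _, y in kept]

r_full = pearson_r(x_full, y_full)
r_restricted = pearson_r(x_restricted, y_restricted)
```

In this example `r_restricted` is lower than `r_full`, even though the underlying relationship between x and y is the same in both samples.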

When a time limit is long enough to allow testtakers to attempt all items, and if some items are so difficult that no testtaker is able to obtain a perfect score, then the test is a

Power test

Generally contains items of uniform level of difficulty (typically uniformly low) so that, when given generous time limits, all testtakers should be able to complete all the test items correctly.

Speed test

Is designed to provide an indication of where a testtaker stands with respect to some variable or criterion, such as an educational or a vocational objective.
- Tends to contain material that has been mastered in hierarchical fashion

Criterion-referenced tests

Referred to as the true score (or classical) model of measurement.

Classical test theory (CTT)

A value that, according to classical test theory, genuinely reflects an individual’s ability (or trait) level as measured by a particular test. Note that this value is very test dependent.

True score

Seeks to estimate the extent to which specific sources of variation under defined conditions are contributing to the test score.

Domain sampling theory

Is based on the idea that a person’s test scores vary from testing to testing because of variables in the testing situation.

Generalizability theory

Cronbach encouraged test developers and researchers to describe the details of the particular test situation or universe leading to a specific test score. This universe is described in terms of its _____, which include things like the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration.

Facets

According to generalizability theory, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained. This test score is the _____ ________.

Universe score

Examines how generalizable scores from a particular test are if the test is administered in different situations.

Generalizability study

The influence of particular facets on the test score is represented by

Coefficients of generalizability

Developers examine the usefulness of test scores in helping the test user make decisions

Decision study

- Another alternative to the true score model
- A synonym for IRT in the academic literature is latent-trait theory

Item response theory (IRT)

In the context of IRT, it signifies the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or whatever it is that is being measured.

Discrimination

Test items or questions that can be answered with only one of two alternative responses, such as true–false, yes–no, or correct–incorrect questions

Dichotomous test items

Test items or questions with three or more alternative responses, where only one is scored correct or scored as being consistent with a targeted trait or other construct

Polytomous test items

Is a reference to an IRT model with very specific assumptions about the underlying distribution.

Rasch model

- Provides a measure of the precision of an observed test score; provides an estimate of the amount of error inherent in an observed score or measurement.
- Is the tool used to estimate or infer the extent to which an observed score deviates from a true score.
- The relationship between the SEM and the reliability of a test is inverse: the higher the reliability of a test (or individual subtest within a test), the lower the SEM.
- Denoted by the symbol σmeas, the standard error of measurement is an index of the extent to which one individual’s scores vary over tests presumed to be parallel. In accordance with the true score model, an obtained test score represents one point in the theoretical distribution of scores the testtaker could have obtained.

Standard Error of Measurement
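
The inverse relationship described on this card follows from the usual formula, SEM = SD × sqrt(1 − reliability), which the card itself does not state. A minimal sketch, with a hypothetical standard deviation of 15 and hypothetical reliability values:

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM from the test's standard deviation and its reliability coefficient."""
    return sd * sqrt(1 - reliability)

# Hypothetical test with SD = 15 at two different reliability levels.
sem_high_reliability = standard_error_of_measurement(sd=15, reliability=0.91)
sem_low_reliability = standard_error_of_measurement(sd=15, reliability=0.75)
```

With these inputs the more reliable test gives an SEM of about 4.5 and the less reliable one about 7.5, which is the inverse relationship in numeric form.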

A range or band of test scores that is likely to contain the true score

Confidence interval
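
One common way to build such a band is to take the observed score plus or minus a z value times the standard error of measurement. The sketch below assumes normally distributed measurement error and centers the band on the observed score; both are assumptions not stated on the card, and the score and SEM are hypothetical.

```python
from statistics import NormalDist

def confidence_interval(observed_score, sem, confidence=0.95):
    """Band around an observed score, assuming normally distributed error."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return observed_score - z * sem, observed_score + z * sem

# Hypothetical observed score of 100 with an SEM of 4.5.
low, high = confidence_interval(observed_score=100, sem=4.5)
```

For these values the 95% band runs from roughly 91.2 to 108.8. A smaller SEM or a lower confidence level would narrow it.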

A statistical measure that can aid a test user in determining how large a difference should be before it is considered statistically significant

Standard error of the difference
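
A sketch of how this statistic might be computed when two scores are on the same scale: the usual formula is SD × sqrt(2 − r1 − r2), where r1 and r2 are the reliability coefficients of the two tests. The formula is not given on the card, and the values below are hypothetical, as is the choice of a 1.96 multiplier for the .05 level.

```python
from math import sqrt

def standard_error_of_difference(sd, r1, r2):
    """Standard error of the difference between two scores on the same scale."""
    return sd * sqrt(2 - r1 - r2)

# Hypothetical: both tests have SD = 15, with reliabilities 0.91 and 0.84.
se_diff = standard_error_of_difference(sd=15, r1=0.91, r2=0.84)

# Smallest difference treated as significant at the .05 level (two-tailed).
min_significant_difference = 1.96 * se_diff
```

Here `se_diff` is about 7.5, so two scores would need to differ by roughly 14.7 points before the difference is treated as statistically significant under these assumptions.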

chapter 5 Flashcards

(62 cards)