Reliability & Validity Flashcards
(39 cards)
Reliability
- Are the results consistent?
- Provides an estimate of the proportion of unsystematic error; the degree of unsystematic error must be known to determine reliability
Validity
- Does it measure what it says it measures?
- Overall eval of evidence and degree of trustworthiness
- Determine if enough support exists to use the test in a certain way
Classical Test Theory
- Observed score = T + E
- T is the true score if the test is completely free from error
- E is the error
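A simulated sketch of the model (all numbers hypothetical): reliability falls out as the share of observed-score variance that is true-score variance, var(T)/var(X).

```python
# Minimal sketch of the classical test model X = T + E, assuming
# normally distributed true scores and random (unsystematic) error.
import random

random.seed(0)
true_scores = [random.gauss(100, 15) for _ in range(1000)]  # T
errors      = [random.gauss(0, 5)    for _ in range(1000)]  # E, mean 0
observed    = [t + e for t, e in zip(true_scores, errors)]  # X = T + E

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Reliability = proportion of observed-score variance that is
# true-score variance: var(T) / var(X).
print(variance(true_scores) / variance(observed))  # ~ 15**2 / (15**2 + 5**2) = 0.9
```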
Unsystematic Error
- Random errors: mood, health, fatigue
- Administration differences
- Scoring differences
- Random guessing
Systematic Error
Constant errors that occur every time the test is taken, e.g., a typo in a test item
Reliability Related to Validity
- High validity can occur if high reliability exists
- High validity cannot occur if low reliability
- High reliability does not suggest high validity
Correlation Related to Reliability
- Correlation: Statistical technique used to examine consistency
- Reliability is often based on consistency between two sets of scores
Positive Correlation
As one increases, so does the other
Negative Correlation
As one increases, the other decreases
Correlation Coefficient (Pearson Product-Moment)
- Correlation coefficient: numerical indicator of the relationship between two sets of data
- PPM correlation coefficient - most common
- Ranges from -1 to +1; the closer |r| is to 1, the stronger the relationship
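A from-scratch sketch of the Pearson product-moment coefficient; the two score lists are hypothetical.

```python
# Pearson product-moment correlation, computed by hand.
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx  = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy  = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

scores_a = [12, 15, 9, 20, 17, 11]
scores_b = [14, 16, 10, 19, 18, 12]
print(pearson_r(scores_a, scores_b))  # near +1: strong positive relationship
```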
Test-Retest
- Give same test twice to same group
- Correlation between first and second administration (typically 2-6 weeks apart)
- Possible influences: a shorter gap inflates the correlation; changes in administration; interventions between testings; practice effects
- Ex: skills-based test
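As a sketch, the test-retest coefficient is simply the correlation between the two administrations; the scores below are hypothetical, and statistics.correlation (Python 3.10+) stands in for the hand-rolled version above.

```python
# Hypothetical scores from the same group tested twice, 2-6 weeks apart.
from statistics import correlation  # Python 3.10+

time_1 = [34, 28, 41, 25, 37, 30]
time_2 = [32, 29, 40, 27, 36, 31]
print(correlation(time_1, time_2))  # high r suggests stable (reliable) scores
```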
Alternate Forms
- Very difficult to construct
- Correlation of scores from two equivalent forms of a test
- Measures stability (over time) and equivalence (construct similarity)
- Uses samples of different items from the same domain
Internal Consistency
- One administration
- One form of instrument
- Divides instrument and correlates the scores from the different portions
Split-Half Reliability
- Given once then split in half to determine reliability
- Need to divide instrument into equivalent halves, like even and odd
- Problem: dividing instrument in half makes number of items smaller —> smaller correlation
- A front-half/back-half split doesn’t work if the test increases in difficulty, and no split quickly fixes the smaller-correlation problem
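A minimal split-half sketch using an even/odd split of hypothetical 0/1 item scores; the Spearman-Brown prophecy formula (not named in the card, but the standard correction) adjusts the half-test correlation for the full test length.

```python
# Split-half reliability with an odd/even item split, then the
# Spearman-Brown correction for the halved test length.
from statistics import correlation  # Python 3.10+

items = [  # rows = examinees, columns = items (1 = correct); hypothetical
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 1, 1, 1, 0],
]

odd_half  = [sum(row[0::2]) for row in items]  # items 1, 3, 5, 7
even_half = [sum(row[1::2]) for row in items]  # items 2, 4, 6, 8

r_half = correlation(odd_half, even_half)
r_full = (2 * r_half) / (1 + r_half)  # Spearman-Brown prophecy formula
print(r_half, r_full)                 # corrected estimate is larger
```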
Kuder-Richardson
- KR-20: heterogeneous items
- KR-21: homogeneous items - single construct (cannot be used if items are not from the same domain or differ in difficulty)
- Lower reliability coefficient than split-half
- Purpose: Estimate the average of all split-half reliabilities from all ways of splitting the instrument
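A minimal KR-20 sketch for dichotomous (0/1) items, following the standard formula KR-20 = (k/(k−1))(1 − Σpq/σ²); the item matrix is hypothetical.

```python
# KR-20 for dichotomously scored items.
def kr20(items):
    n = len(items)       # examinees
    k = len(items[0])    # items
    totals = [sum(row) for row in items]
    mean_t = sum(totals) / n
    var_t  = sum((t - mean_t) ** 2 for t in totals) / n  # total-score variance
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in items) / n             # proportion correct
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_t)

items = [  # hypothetical 0/1 responses
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
]
print(kr20(items))
```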
Coefficient Alpha
- Used for non-dichotomous scoring
- Ex: Likert scales
- Cronbach’s alpha
- Takes into account variance of each item
- Conservative estimate of reliability
- Most common
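A minimal Cronbach's alpha sketch for Likert-type data, using the standard formula α = (k/(k−1))(1 − Σσ²_item/σ²_total); the responses are hypothetical.

```python
# Cronbach's alpha for non-dichotomous (e.g., Likert) items.
def cronbach_alpha(items):
    k = len(items[0])  # number of items
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    item_vars = sum(var([row[j] for row in items]) for j in range(k))
    total_var = var([sum(row) for row in items])
    return (k / (k - 1)) * (1 - item_vars / total_var)

likert = [  # rows = respondents, columns = 1-5 Likert items; hypothetical
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [3, 2, 3, 3],
    [4, 4, 5, 4],
]
print(cronbach_alpha(likert))
```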
Standard Error of Measurement (SEM)
- Provides estimate of range of scores if someone were to take instrument repeatedly
- Based on idea that if someone takes test multiple times, scores would fall into a normal distribution
SEM v. SD
- SD is spread of scores between students
- SEM is spread of scores for one student
- Computed from the same quantities: SEM = SD × √(1 − r), where r is the reliability coefficient
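A minimal SEM sketch using that standard formula; all numbers are hypothetical.

```python
# Standard error of measurement: SEM = SD * sqrt(1 - reliability).
import math

sd          = 15    # spread of scores across students
reliability = 0.91  # e.g., a coefficient alpha for the same instrument
sem = sd * math.sqrt(1 - reliability)

observed = 104
# About 68% of repeated testings would fall within +/- 1 SEM.
print(f"SEM = {sem:.1f}; band: {observed - sem:.1f} to {observed + sem:.1f}")
```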
Content-Related Validity
- Test items measure the objectives they are supposed to measure
- Focus on how content was determined
- May be based on test creator’s own analysis of topic or expert analysis
- How well do test items reflect the domain of material being tested
Criterion-Related Validity
- Test scores related to specific criterion/variable
- Sources of criterion scores: academic achievement, level of education, performance in specialized training, job performance, psychiatric diagnosis, ratings by supervisors, correlations with previously available tests
Concurrent Validity (Criterion-Related)
- Scores on test and criterion measure are collected at same point
- Ex: achievement, certification
- Coefficients are typically higher than for predictive validity
- Require reliable and bias-free measures
Predictive Validity (Criterion-Related)
- Test is administered first and scores on criterion measure are collected at a later time
- Ex: SAT, college GPA
- Require reliable and bias-free measures
Construct Validity
- What do scores on this test mean or signify?
- Construct: Grouping of variables that make up observed behavior patterns
- Ex: Self-efficacy, personality
- Measured by correlation of 2 scores or factor analysis
- Often seen in psych tests
Convergent v. Discriminant (Construct Validity)
- Convergent: Positive correlation with other tests measuring the same/similar construct
- Discriminant: Low correlation with tests measuring different, unrelated constructs
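A toy illustration with hypothetical scales: convergent evidence shows up as a high correlation with a measure of the same construct, discriminant evidence as a near-zero correlation with an unrelated one.

```python
# Convergent vs. discriminant evidence via simple correlations.
from statistics import correlation  # Python 3.10+

self_efficacy   = [30, 22, 35, 18, 27, 31]  # hypothetical target test
similar_scale   = [28, 24, 34, 17, 26, 30]  # same construct
unrelated_scale = [15, 11, 14, 18, 12, 16]  # different construct

print(correlation(self_efficacy, similar_scale))    # high  -> convergent
print(correlation(self_efficacy, unrelated_scale))  # ~zero -> discriminant
```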