Flashcards in Test Psychometrics Overview Deck (25):
the extent to which a test measures what it set out to measure
Types of Validity
does the test appear to be measuring something meaningful?
Three MAIN types of validity
-What do experts believe is being measured? This is the least quantitative form. Does the content fit with the construct?
-Contains some inter-rater reliability (Kappa)
-Important for intelligence/achievement
does the measure appropriately predict aspects that it should? 3 kinds:
(c) Known groups
Types of criterion validity
– includes all forms; the extent to which the suggested construct is actually being measured. 3 types:
(a) Convergent (and divergent)
(c) Internal Structure
- does it correlate with other measures given at the same time (aka, predict complementary performance)?
– does it predict future performance? (e.g., GRE score)
known groups validity
– using groups with expected, different outcomes (e.g., giving intelligence tests to individuals with MR and giftedness)
– is the target test related to other tests it theoretically should be related to?
measure of depression should be positively correlated with other depression measures
Divergent – does the test relate NEGATIVELY to other tests that it SHOULDN'T be positively related to? (e.g., a happiness measure should be negatively correlated with a depression measure)
relation to a theoretically unrelated construct. Should be uncorrelated with it.
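Convergent and divergent validity are typically checked with correlations between measures. A minimal sketch in Python (the scale names and scores below are hypothetical, purely for illustration):

```python
from math import sqrt

def pearson_r(xs, ys):
    # Pearson product-moment correlation between two score lists
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs)
                      * sum((y - my) ** 2 for y in ys))

depression_a = [10, 25, 18, 5, 30]   # scores on one depression measure
depression_b = [8, 22, 15, 4, 27]    # scores on a second depression measure
happiness    = [28, 10, 15, 30, 6]   # scores on a happiness measure

convergent = pearson_r(depression_a, depression_b)  # strongly positive
divergent  = pearson_r(depression_a, happiness)     # strongly negative
```

A strong positive correlation between the two depression measures supports convergent validity; a strong negative correlation with the happiness measure supports divergent validity.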
internal structure validity
(aka, Factor Validity) – looks at the factors within the construct. Most tests have bad internal structure. Why? Not theory driven! Only three intelligence tests have good internal structure (according to Dr. MacDonald):
KABC-II (5/4 factors of CHC)
whether the measure increases the predictive ability of an existing method of assessment. In other words, incremental validity asks whether the new test adds information beyond what could be obtained with simpler, already existing methods. Example: some have argued that the Rorschach has poor incremental validity, since other, more easily administered personality tests gather the same data in a less tedious way.
whether the measure appropriately simulates real-world phenomena. This should not be confused with external validity, which refers to the generalizability of findings to the real world. In other words, an ecologically valid measure should appropriately capture the feel of the corresponding real-world scenario. Example: mock juries may produce externally valid findings; however, most mock juries do not observe actual court proceedings, relying instead on transcripts of a trial. Thus, mock juries could be said to have poor ecological validity.
the consistency of a measure
components of reliability
-Cronbach's alpha (internal consistency measure in statistics)
-Alternate forms (e.g. Blue and Green forms of WRAT-5)
-Split-half (splitting the test in half and correlating one half with the other)
-Test-retest (temporal stability of scores)
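Two of these estimates can be sketched in plain Python. The 5-respondent, 4-item dataset below is hypothetical, purely for illustration:

```python
from statistics import pvariance
from math import sqrt

# Hypothetical data: 5 respondents x 4 items, scores 1-5.
scores = [
    [4, 5, 4, 5],
    [2, 3, 2, 3],
    [5, 4, 5, 4],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
]

def cronbach_alpha(rows):
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    k = len(rows[0])
    item_vars = [pvariance([r[i] for r in rows]) for i in range(k)]
    total_var = pvariance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def split_half(rows):
    # Correlate odd-item totals with even-item totals, then apply the
    # Spearman-Brown correction for full test length: 2r / (1 + r).
    a = [sum(r[::2]) for r in rows]   # one half (items 1, 3, ...)
    b = [sum(r[1::2]) for r in rows]  # other half (items 2, 4, ...)
    n = len(rows)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    r = cov / sqrt(sum((x - ma) ** 2 for x in a)
                   * sum((y - mb) ** 2 for y in b))
    return 2 * r / (1 + r)
```

Note that the split-half correlation is computed on a test of half the original length, which is why the Spearman-Brown step corrects it upward.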
standard error of measurement
We want the SEM to be low so that reliability (r) is high; a low SEM means the observed score closely approximates the person's true score, reflecting an accurate assessment of the characteristic(s) in question.
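The standard formula is SEM = SD × √(1 − r_xx). A quick sketch (the SD = 15 scale and reliability values are illustrative):

```python
import math

def sem(sd, reliability):
    # Standard error of measurement: SEM = SD * sqrt(1 - r_xx)
    return sd * math.sqrt(1 - reliability)

# On a Wechsler-style scale (SD = 15), higher reliability shrinks the SEM:
high_rel = sem(15, 0.95)   # small SEM, tight band around the true score
low_rel  = sem(15, 0.70)   # larger SEM, wider band
# Approximate 95% confidence band around an observed score of 100:
band = (100 - 1.96 * high_rel, 100 + 1.96 * high_rel)
```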
-To interpret test data accurately and pinpoint a person's position relative to a standardization sample, we need a normative reference group; otherwise a raw score has no meaning. We need to see where the person falls in the sample's relative standing.
-Raw score is converted into a derived score (a relative measure), which tells the person relative standing.
-There is a need for cultural/ethnic normative groups, which can be accomplished through stratified random sampling.
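The raw-to-derived conversion runs through the z-score relative to the normative sample. A sketch (the normative raw scores below are made-up numbers, purely for illustration):

```python
from statistics import mean, pstdev

# Hypothetical normative sample of raw scores (illustrative values only).
norm_sample = [38, 42, 45, 47, 50, 50, 52, 55, 58, 63]

def derived_score(raw, sample, new_mean=100, new_sd=15):
    # z-score against the normative sample, rescaled onto a
    # deviation-IQ-style metric (mean 100, SD 15)
    z = (raw - mean(sample)) / pstdev(sample)
    return new_mean + new_sd * z
```

A raw score equal to the normative mean lands at 100; scores above or below the mean land proportionally above or below it, expressing relative standing.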
Objective test validity
Objective tests have high face validity: their intent is easy to discern, so participants can fake their responses. They also require the person to be introspective and answer truthfully, often resulting in false positives. In addition, a person's defensiveness may prevent accurate responding.
projective test validity
Projective tests are better predictors of long-term behavioral patterns, while self-report measures work best when test items and criterion behaviors are assessed at or near the same time and are matched for specificity; the longer the time interval, the less predictive the test. Objective measures are best at predicting short-term behavior patterns. It is best to use a combination of objective and projective measures.
Reliability Rules of thumb (cut offs)
.90 or above for decision-making tasks
.80 or above for clinical and psychoeducational tasks (moderate)
.70-.79: subtests are relatively reliable
.60-.69: subtests are marginally reliable
less than .60: unreliable
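These cutoffs can be encoded as a quick lookup (thresholds taken directly from this card; the function name is just illustrative):

```python
def reliability_label(r):
    # Rule-of-thumb labels for a reliability coefficient r,
    # following the cutoffs listed on this card
    if r >= 0.90:
        return "decision-making tasks"
    if r >= 0.80:
        return "clinical and psychoeducational tasks"
    if r >= 0.70:
        return "relatively reliable (subtest)"
    if r >= 0.60:
        return "marginally reliable (subtest)"
    return "unreliable"
```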