Flashcards in Test Construction Deck (70)
Match test characteristics with their definition:
A. Level of difficulty one can attain
B. Best possible performance
C. Attains a pre-established level of acceptable performance
D. What one usually does
E. Response rate
A. Level of difficulty one can attain..power test. Either no time limit, or one that allows examinees to attempt all items. Items usually arranged from least to most difficult, with some no one can answer. E.g., the Information subtest.
B. Test of maximum performance
E.g., achievement or aptitude tests
C. Mastery..usually an all-or-none score
E.g., tests of basic skills
D. Test of typical performance
E.g., personality or interest tests
E. Speed test..has time limits and consists of items that all or almost all examinees can answer correctly. E.g., Digit Symbol.
A. Uniform procedure for administering and scoring a test.
B. includes providing details for administration.
C. Includes establishing norms
D. Objectivity is a product of the standardization process.
E. A standardized test is a sample of behavior that is representative of the whole behavior.
All are true; A is the best answer.
Compare test scores to norms or representative sample of population on a test.
Norms must include a truly representative sample of the population for which the test is designed (and to which the examinee belongs).
To be truly representative a sample must be reasonably large.
Often tests have different norms for different groups..kids, males, etc.
Advantage of norms: allows comparison to others in the norm group, and allows comparison of performance on different tests.
Disadvantage: norms don't provide an absolute or universal standard of good or bad performance. To the degree the norm group is large and representative, this is less relevant. Norms are always relative, not absolute, standards.
Objective...independent of subjective judgement. Uniform procedures for administering and scoring, so examinees should get the same score regardless of who scores the test.
What limits a test when the measure doesn't include an adequate range of items at the extremes?
Ceiling effects...not an adequate range of hard items, so all high-achieving examinees get similar scores.
Floor effects...not an adequate range of easy items, so all low-achieving examinees get similar scores.
Sometimes discussed in the context of internal validity; they represent an interaction between selection and testing.
Ipsative vs normative measures?
Ipsative...the individual is the frame of reference in score reporting.
Reflects relative strengths within the individual examinee.
Examinees express a preference for one item over another rather than responding to items individually.
Normative...measures the absolute strength of each attribute measured by the test. Allows comparison to others.
Defining characteristic of an objective test?
A. Existence of norms
B. Standardized set of scoring and administration procedures
C. Examiner discretion in scoring and interpreting
D. Reliability and validity
B. A standardized set of scoring and administration procedures makes a test objective, i.e., independent of subjective judgement.
An IQ test is given to a group on Oct 1, and the same test on Nov 1 (a test-retest design). Interested in:
whether the test is vulnerable to response sets.
Drawback of norm-referenced interpretation is:
A. The person's performance is compared to the performance of other examinees
B. Doesn't allow comparison of an individual's scores on different tests
C. Doesn't indicate where one stands in relation to others in the same population
D. Doesn't provide absolute standards of performance
D. Doesn't provide absolute standards of good and poor performance; scores must be interpreted in light of the norm group as a whole.
What are the two ways to think of reliability?
A. Repeatable, dependable results
B. free from error and yields true score
C. Error is minimized
D. Measures what it is supposed to
C may be right, but it is not definitive enough.
True score..the examinee's actual status on the attribute being measured.
Error (measurement error) refers to factors that are irrelevant to what is being measured. It doesn't affect all examinees the same way and is due to many factors.
What values can the reliability coefficient take, and how is it interpreted?
A. 0 to 1, don't square
B. -1 to 1, square
C. Interpret inversely
D. Can include any number
A. Correct! If the coefficient is 0, the test is not reliable; scores are due entirely to random factors.
If it is 1, there is no error: perfect reliability.
.90 means the test is 90 percent reliable, with 10 percent error.
In other words, reliability represents the proportion of the total observed variance that is true variance.
Personality tests usually run about .70.
Selection tests in industrial/organizational settings run about .90.
Pair each method of estimating reliability with its definition:
A. Internal consistency
B. alternate forms
C. Coefficient of stability
D. divide test in two and correlate halves
E. single administration of test for internal consistency when dichotomously scored.
F. Single administration of a test for internal consistency with multiple scored items
G. Interscorer reliability
A. Internal consistency: correlations among individual items. Three kinds..split-half, Cronbach's coefficient alpha, Kuder-Richardson Formula 20.
B. Alternate forms (coefficient of equivalence, equivalent forms, or parallel forms)..administer two equivalent forms to the same group of examinees and correlate the scores.
C. Coefficient of stability: test-retest reliability...administer the same test to the same group of people and correlate the scores from the first and second administrations.
D. Split-half
E. Kuder-Richardson Formula 20
F. Cronbach's coefficient alpha
G. Interscorer (inter-rater) reliability...involves rater judgement. Most often, correlate the scores assigned by two raters. Kappa is used with nominal data.
What are sources of measurement error for the test retest coefficient?
A. Factors of time or time sampling.
B. practice effects
C. Longer interval between administrations decreases error
D. Memory for exam
A. True...such as changes in exam conditions (noise, weather).
B. True..examinees do better the second time.
C. Nope. A shorter interval between administrations decreases error.
D. Memory for the exam...a disadvantage, especially with a short interval between administrations; the coefficient will then be spuriously high.
Not appropriate for instruments that measure unstable attributes like mood: the coefficient would reflect the instability of the attribute rather than the test. The same goes for tests affected by repetition. So it suits few psychological tests.
What are the sources of error for alternate forms reliability?
D. Practice effects
A and B: true.
C and D: Nope! Alternate forms reduce these problems; however, if the content of the two forms is similar, they may still be a bit of a problem.
The two forms can't be administered at the same time, but if they are administered in immediate succession, time is not considered a source of error.
Some say this is the best method to use...if the coefficient is high, the test is consistent across time and across different content.
Can't be used for unstable traits.
Describe the three types of coefficients of internal consistency:
Give the error sources.
Describe what they are good and not good for measuring.
Correlations among individual items.
Split-half is the correlation between the two halves of the test, treated as if they were two shorter tests (e.g., odd- vs even-numbered items). This lowers the reliability estimate, because a shorter test is less reliable.
Overcome with Spearman-Brown, and with Kuder-Richardson or coefficient alpha (for multiple-scored items).
Error sources...content sampling or item heterogeneity. The coefficient is lowered if items differ in content sampled.
Bad for speed tests..the coefficient would be spuriously high, near 1.0.
Inter-rater reliability is increased if:
A. Raters are well trained
B. Raters know they are being observed
C. The rating scale is adequate
E.g., on a behavioral rating scale, items should be mutually exclusive and exhaustive.
Match methods of recording with their definitions:
A. Recorder keeps a count of the number of times the target behavior occurs.
B. Observe at intervals and note whether the subject is engaging in the behavior or not.
C. Record all behavior the subject engages in during the observation period.
D. Record the elapsed time during which the target behavior occurs.
A. Frequency recording..good for short periods and when duration doesn't matter.
B. Interval recording..good when the behavior has no fixed beginning or end.
C. Continuous recording...usually recording all the behavior of the target subject during each observation session: a narrative description, in chronological order, of everything the subject does.
D. Duration recording.
Standard error of measurement is
A. How much error a test contains
B. how much error an individual test score can be expected to have.
C. How much error expected in a criterion score estimated by predictor
D. Measure of internal consistency
B is correct.
A describes the reliability coefficient.
C describes the standard error of estimate.
D describes correlations among individual items (internal consistency).
The standard error of measurement is used to build a confidence interval: the range within which an examinee's true score is likely to fall, given the obtained score.
Formula: SEM = SD multiplied by the square root of (1 minus the reliability coefficient).
If reliability is reduced, error increases.
68% CI: score +/- 1 SEM.
95% CI: score +/- 1.96 SEM.
99% CI: score +/- 2.58 SEM.
E.g., with SEM = 4: the 68% CI is 100 +/- 4, or between 96 and 104.
The 95% CI is 100 +/- (1.96)(4), roughly 100 +/- 8, or between 92 and 108.
The 99% CI is roughly 90 to 110.
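The SEM arithmetic above can be sketched in code. The SD and reliability values here are assumptions chosen so that SEM works out to 4, matching the card's worked example:

```python
# Standard error of measurement: SEM = SD * sqrt(1 - reliability).
# SD and reliability below are assumed values chosen so SEM = 4,
# matching the card's worked example around an obtained score of 100.
import math

sd = 20.0
reliability = 0.96
sem = sd * math.sqrt(1 - reliability)  # 20 * sqrt(0.04) = 4.0

score = 100.0
ci68 = (score - sem, score + sem)                 # about 96 to 104
ci95 = (score - 1.96 * sem, score + 1.96 * sem)  # about 92 to 108
ci99 = (score - 2.58 * sem, score + 2.58 * sem)  # roughly 90 to 110
print(sem, ci68, ci95, ci99)
```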
If the reliability coefficient is 1.0, then the standard error of measurement is:
A. Less than 1
B. Zero
C. Between zero and 1
D. Needs to be calculated
B. Perfect reliability means no error.
You can also use the formula: conversely, if the reliability coefficient is 0, the standard error of measurement equals the standard deviation of the scores.
The same goes for the standard error of estimate, used to interpret an individual's predicted score on a given criterion measure: if the validity coefficient is 1, the standard error of estimate is 0 (no error in prediction).
If the validity coefficient is 0, the standard error of estimate equals the standard deviation of the criterion scores.
What factors impact the reliability coefficient?
A. Decrease in score variability
B. anything that increases error variance
C. Longer tests
D. Homogeneous group
E. type of questions asked. T/F
F. Homogeneous items using stats
A and B: true.
C. Longer tests are more reliable (the Spearman-Brown formula is applied to estimate the effect of lengthening or shortening a test on its reliability).
D. The more homogeneous the group, the lower the score variability, and the lower the reliability coefficient.
E. If test items are too hard or too easy, variability is decreased and so is the coefficient (floor/ceiling effects).
F. Reliability is lower if examinees can guess correctly. So true/false is less reliable than multiple choice, which is less reliable than fill-in-the-blank.
G. For inter-item consistency in particular, as measured by Kuder-Richardson or coefficient alpha, reliability increases as items become more homogeneous.
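The Spearman-Brown prophecy formula mentioned in C can be sketched as follows; the starting reliability of .70 is just an assumed value:

```python
# Spearman-Brown prophecy formula: estimated reliability after changing
# test length by a factor k: r_new = k*r / (1 + (k - 1)*r).
def spearman_brown(r: float, k: float) -> float:
    return (k * r) / (1 + (k - 1) * r)

r = 0.70  # assumed current reliability
doubled = spearman_brown(r, 2)   # lengthening raises reliability
halved = spearman_brown(r, 0.5)  # shortening lowers it
print(f"doubled: {doubled:.2f}, halved: {halved:.2f}")
```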
Would not use Kuder-Richardson for:
A dichotomously scored test
A measure of an unstable trait
A way to improve the inter-rater reliability of a behavioral observation scale would be to use:
Mutually exclusive rating categories
Non-exhaustive rating categories
Highly valid rating categories
Empirically derived rating categories
The first: categories should be mutually exclusive (and exhaustive).
Standard error of measurement is:
A. Inversely related to the reliability coefficient and inversely related to the standard deviation
B. Positively related to the reliability coefficient and positively related to the standard deviation
C. Positively related to the reliability coefficient and inversely related to the standard deviation
D. Inversely related to the reliability coefficient and positively related to the standard deviation
D, per the formula SEM = SD * sqrt(1 - reliability).
When practical, it is most advisable to use:
the alternate forms reliability coefficient.
According to classical test theory, an observed test score reflects:
A. True score variance plus systematic error
B. True score variance plus random error variance
C. True score variance plus random and systematic error
D. True score variance only
B. Error is random by definition.
Which method of recording is most useful when the target behavior has no fixed beginning or end?
Interval recording...during each interval, decide whether the behavior is occurring, not when it begins or ends.
Match the type of validity with its definition:
A. Predicts someone's status on an external criterion measure.
B. Measures a theoretical, non-observable construct or trait.
C. Measures knowledge of the domain the test is designed to measure.
D. High correlation with another test that measures the same thing.
E. Low correlation with a test that measures something different.
A. Criterion related validity
B. construct validity
C. Content validity
D. Convergent validity
E. Divergent validity
A. Test measures what it is supposed to measure
B. Test's usefulness
C. Consistent over time
D. Must consider what the test is for
A, B, and D are correct.
No test has validity per se.
Content validity: t/f
A. Especially useful for achievement tests
B. extent test items adequately and representatively sample content to be measured
C. Determine via statistical analysis
D. Appears valid to those who take it.
A. Correct, and also used in industrial settings (e.g., work samples, licensing exams).
C. False. Established by the judgement and agreement of subject matter experts. Clearly identify the domain, break it into subcategories, and select items from each to make the test representative. You may also want the test to correlate highly with tests of the same content domain, or with success in the relevant class.
D. False! That describes face validity, which is not really a type of validity, but it is desirable: without it, people may not cooperate, may lack motivation, etc.
Criterion related validity:
A. Scores on predictor test correlated w outside criterion
B. used a correlation coefficient
C. Gather data at same time.
D. Useful at predicting individuals behavior in specific situations.
A. True! The criterion is job performance, school achievement, test scores, or whatever is being predicted.
B. True. A coefficient like Pearson r is used, called the criterion-related validity coefficient. Ranges -1 to 1; 1 is perfect validity, 0 is none. Few exceed .60; even .30 may be acceptable. Square it to interpret: the result is the proportion of variability in the criterion explained by (shared with) variability in the predictor.
C. Partly true. Gathering predictor and criterion data at the same time is called concurrent validity (e.g., a typing test); best for assessing current status.
Less costly and more convenient, concurrent validation is often used in place of predictive validation (e.g., pilots: can't hire everyone, so test and pick the best).
Alternatively, in predictive validation the predictor is measured first and the criterion later (e.g., GRE scores correlated with later GPAs). Best for predicting future status.
D. True, especially for selecting employees, making admissions decisions, and placing students in special classes.
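The squaring rule in B can be shown in a one-liner; the .60 coefficient is just the card's "few exceed .6" benchmark used as an assumed input:

```python
# Squaring a criterion-related validity coefficient gives the proportion
# of criterion variance explained by the predictor. 0.60 is an assumed
# (unusually strong) coefficient.
validity = 0.60
explained = validity ** 2
print(f"{explained:.0%} of criterion variance explained")
```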
What is used to determine where the actual criterion score will fall?
Standard error of the mean
Standard error of measurement
Standard error of estimate
A describes how much a sample mean can be expected to deviate from the population mean.
B is for reliability measurement.
D. Regression allows prediction of the unknown value of one variable from the known value of another; it is used to get the PREDICTED score.
C. The standard error of estimate is used to build a confidence interval around the predicted score:
there is a 95% chance the true criterion score will be within predicted score +/- (1.96)(standard error of estimate).
E.g., an IQ of 115 put into the regression equation predicts a math score of 80. If the standard error of estimate is 5,
there is a 95% chance the actual score is between about 70 and 90.
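The prediction-interval arithmetic in this card can be sketched in code. The standard error of estimate formula mirrors the SEM logic given earlier (SEest = criterion SD * sqrt(1 - validity squared)); the card's predicted score of 80 and SEest of 5 are taken as given:

```python
# Standard error of estimate: SEest = SD(criterion) * sqrt(1 - validity**2).
# Mirrors the SEM logic: a validity of 1 gives SEest = 0 (perfect
# prediction); a validity of 0 gives SEest = the SD of criterion scores.
import math

def se_estimate(sd_criterion: float, validity: float) -> float:
    return sd_criterion * math.sqrt(1 - validity ** 2)

# The card's example: regression predicts a math score of 80, SEest = 5
predicted = 80.0
se_est = 5.0
ci95 = (predicted - 1.96 * se_est, predicted + 1.96 * se_est)
print(ci95)  # roughly 70 to 90
```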