10- Test Construction Flashcards

Question

``` A large monotrait-heteromethod coefficient in a multitrait-multimethod matrix indicates evidence of: Select one: A. convergent validity B. concurrent validity C. predictive validity D. discriminant validity ```

Answer 1

Correct Answer is: A A multitrait-multimethod matrix is a complicated method for assessing convergent and discriminant validity. Convergent validity requires that different ways of measuring the same trait yield the same result. Monotrait-heteromethod coefficients are correlations between two measures that assess the same trait using different methods; therefore if a test has convergent validity, this correlation should be high. Heterotrait-monomethod and heterotrait-heteromethod, both confirm discriminatory validity, and monotrait-monomethod coefficients are reliability coefficients. Additional Information: Convergent and Discriminant (Divergent) Validation

Answer 2

Correct Answer is: B The clue here is the practice effect. That means that if you give a test, just taking it will give the person practice so that next time, he or she is not a naive person. To control for that, we want to eliminate the situation where the person is administered the same test again. So we do not use test-retest. We can use the two other methods listed. We can use split-half since, here, only one administration is used (the two parts are thought of as two different tests). And, in the alternative forms method, a different test is given the second time, controlling for the effects of taking the same test twice.

Answer 3

Correct Answer is: C One of the factors that affect the reliability coefficient is guessing. Guessing correctly decreases the reliability coefficient. The incorrect options ("is not affected," "stays the same," and "increases") are not true in regards to the reliability coefficient. Additional Information: Factors Affecting Reliability

Answer 4

Correct Answer is: C Triangulation is the attempt to increase reliability by reducing systematic or method error through a strategy in which the researcher employs multiple methods of measurement (e.g., observation, survey, archival data). If the alternative methods do not share the same source of systematic error, examination of data from the alternative methods gives insight into how individual scores may be adjusted to come closer to reflecting true scores, thereby increasing reliability. calibration Calibration is the attempt to increase reliability by increasing homogeneity of ratings through feedback to the raters, when multiple raters are used. For example, raters might meet during pretesting of the instrument to discuss items on which they have disagreed seeking to reach consensus on rules for rating items (e.g.., defining a "2" for an item dealing with job performance). intraclass correlation (ICC) Intraclass correlation (ICC) is used to measure inter-rater reliability for two or more raters and may also be used to assess test-retest reliability. ICC may be conceptualized as the ratio of between-groups variance to total variance. correction for attenuation Correction for attenuation is a method used to adjust correlation coefficients upward because of errors of measurement when two measured variables are correlated; the errors always serve to lower the correlation coefficient as compared with what it would have been if the measurement of the two variables had been perfectly reliable. Additional Information: Factors Affecting Reliability

Answer 5

Correct Answer is: B The standard error of measurement indicates how much error an individual test score can be expected to have. A confidence interval indicates the range within which an examinee's true score is likely to fall, given his or her obtained score. To calculate the 95% confidence interval we simply add and subtract two standard errors of measurement to the obtained score. Two standard errors of measurement in this case equal 10. We're told that the examinee's obtained score is 90. 90 +/- 10 results in a confidence interval of 80 to 100. In other words, we can be 95% confident that the examinee's true score falls within 80 and 100. Additional Information: Standard Error of Measurement

Answer 6

Correct Answer is: D Use of a multitrait-multimethod matrix is one method of assessing a test's construct validity. The matrix contains correlations among different tests that measure both the same and different traits using similar and different methodologies. The heterotrait-heteromethod coefficient, one of the correlation coefficients that would appear on this matrix, reflects the correlation between two tests that measure different (hetero) traits using different (hetero) methods. An example might be the correlation between vocabulary subtest scores on the WAIS-IV for intelligence and scores on the Beck Depression Inventory for depression. Since these measures presumably measure different constructs, the correlation coefficient should be low, indicating high divergent or discriminant validity. Additional Information: Convergent and Discriminant (Divergent) Validation

Answer 7

Correct Answer is: C To determine criterion-related validity, scores on a predictor test are correlated with an outside criteria. The criteria is that which is being predicted, or the "predictee." Additional Information: Relationship between Reliability and Validity

Answer 8

Correct Answer is: B A simple way to answer this question is with reference to a chart such as the one displayed under the topic "Criterion-Related Validity" in the Psychology-Test Construction section of your materials. If you look at this chart, you can see that increasing the predictor cutoff score (i.e., moving the vertical line to the right) decreases the number of false positives as well as true positives (you can also see that the number of both true and false negatives would be increased). You can also think about this question more abstractly by coming up with an example. Imagine, for instance, that a general knowledge test is used as a predictor of job success. If the cutoff score on this test is raised, fewer people will score above this cutoff and, therefore, fewer people will be predicted to be successful. Another way of saying this is that fewer people will come up "positive" on this predictor. This applies to both true positives and false positives. Additional Information: Decision-Making

Answer 9

Correct Answer is: A The standard error of measurement indicates how much error an individual test score can be expected to have. A confidence interval indicates the range within which an examinees's true score is likely to fall, given his or her obtained score. To calculate the 68% confidence interval we simply add and subtract one standard error of measurement to the obtained score. impossible to calculate without the reliability coefficient This choice is incorrect because although the reliability coefficient is needed to calculate a standard error of measurement, in this case, we are provided with the standard error. Additional Information: Standard Error of Measurement

Answer 10

Correct Answer is: A The question is just a roundabout way of asking "what is the standard error of measurement?", though it does supply a practical application of the concept. According to classical test theory, an obtained test score consists of truth and error. The truth component reflects the degree to which the score reflects the actual characteristic the test measures, and the error component reflects random or chance factors affecting the score. For instance, on an IQ test, a score will reflect to some degree the person's "true" IQ and to some degree chance factors such as whether the person was tired the day he took the test, whether some of the questions happen to be a particularly good fit with the person's knowledge base, etc. The standard error of measurement of a test indicates the expected amount of error a score on that test will contain. It can be used to answer the question, "given an obtained score, what is the likely true score?" For example, if the test referenced had a standard error of measurement of 5, there would be a 68% chance that the true test score lies within one standard error of measurement of the obtained score (between 128 and 138 in this case), and a 95% chance that the true score lies within two standard errors of measurement (between 123 - 143). So in the example, the parent would be interested to know what the test's standard error of measurement because the higher it is, the greater the possibility that an obtained score of 133 actually reflects a true score of 135 or above. Additional Information: Standard Error of Measurement

Answer 11

Correct Answer is: A The rating scale described by the question has good inter-rater reliability, or consistency across raters. However, it may or may not have good validity; that is, it may or may not measure what it purports to measure. The question illustrates that high reliability is a necessary but not a sufficient condition for high validity. Additional Information: Interscorer Reliability

Answer 12

Correct Answer is: D A test with limited ceiling has an inadequate number of difficult items resulting in few low scores. Therefore the distribution would be negatively skewed. Additional Information: Skewed Distributions, Measures of Central Tendency, Skewed Distributions

Answer 13

Correct Answer is: A Factor analysis is a statistical procedure that is designed to reduce measurements on a number of variables to fewer, underlying variables. Factor analysis is based on the assumption that variables or measures highly correlated with each other measure the same or a similar underlying construct, or factor. For example, a researcher might administer 250 proposed items on a personality test and use factor analysis to identify latent factors that could account for variability in responses to the items. These factors would then be interpreted based on logical analysis or the researcher's theories. If one of the factors identified by the analysis correlated highly with items that asked about the person's happiness, level of energy, and hopelessness, that factor might be labeled "Depressive Tendencies." In factor analysis, rotation is usually the final statistical step. Its purpose is to facilitate the interpretation of data by identifying variables that load (i.e., correlate) highly on one factor and not others. Additional Information: Interpreting and Naming the Factors

Answer 14

Correct Answer is: A All the choices refer to methods of recording behaviors that can be used by observational raters or researchers. In interval recording (the correct answer), the rater observes a subject at given intervals and notes whether or not the subject is engaging in the target behavior during that interval. For instance, a rater might observe a student for 10 seconds every three minutes and record whether on not the student is on-task during those 10 seconds. Interval recording is most useful for behaviors that do not have a fixed beginning or end -- such as being on task. Frequency recording involves keeping count of the number of times a behavior occurs; this would not be practical in keeping track of whether or not a person is on task. Continuous recording involves recording all the behaviors of the target subject during each observation session. Although it's possible to keep track of whether a person is on-task using this method, it is not as practical or meaningful for this purpose as interval recording. Finally, duration recording involves recording the elapsed time during which the target behavior or behaviors occur. This would not be practical for a behavior that has no fixed beginning or end. Additional Information: Interscorer Reliability

Answer 15

Correct Answer is: C In factor analysis, rotating the factors changes the factor loadings for the variables and eigenvalue for each factor although the total of the eigenvalues remains the same. Additional Information: Interpreting and Naming the Factors

Answer 16

Correct Answer is: A A criterion measure is one on which a predictor test attempts to predict outcome; it could be termed the "predictee." For example, if scores on a personality test were used to predict job success as measured by supervisor evaluations, the supervisor evaluations would be the criterion measure. Criterion contamination occurs when a factor irrelevant to what is being measured affects scores on the criterion. When the criterion measure is based on subjective ratings, rater knowledge of predictor scores is a common source of criterion contamination. In our example, if supervisors knew employees' results on the personality test, their evaluations might be biased based on their knowledge of these scores. Additional Information: Criterion Contamination

Answer 17

Correct Answer is: D A percentage score indicates the number of items answered correctly. A percentile rank compares one examinee's score with all other examinee's scores. Additional Information: Percentile Ranks

Answer 18

Correct Answer is: D When a factor analysis produces a series of factors, it is useful to determine how much of the variance is accounted for by each factor. An eigenvalue is based on the factor loadings of all the variables in the factor analysis to a particular factor. When the factor loadings are high, the eigenvalue will be large. A large eigenvalue would mean that a particular factor accounts for a large proportion of the variance among the variables. Additional Information: Explained Variance (or Eigenvalues)

Answer 19

Correct Answer is: D A moderator variable is any variable which moderates, or influences, the relationship between two other variables. If the validity of a job selection test is different for different ethnic groups (i.e. there is differential validity), then ethnicity would be considered a moderator variable since it is influencing the relationship between the test (predictor) and actual job performance (the criterion). A confounding variable is a variable in a research study which is not of interest to the researcher, but which exerts a systematic effect on the DV. Criterion contamination is the artificial inflation of validity which can occur when raters subjectively score ratees on a criterion measure after they have been informed how the ratees scored on the predictor. Additional Information: Factors Affecting the Validity Coefficient

Answer 20

Correct Answer is: D The correlation coefficient for a test and an identified factor is referred to as a factor loading. To obtain a measure of shared variability, the factor loading is squared. This example, the factor loading is .80, meaning that 64% (.80 squared) of variability in the test is accounted for by the factor. The other identified factor(s) probably also account for some variability in Test A, which is why this option is not the best answer: only 64% of variability in Test A is accounted for by the factor analysis. Additional Information: Factor Analysis

Answer 21

Correct Answer is: B According to classical test theory, the reliability of a test indicates the degree to which examinees' scores are free from error and reflect their "true" test score. Reliability is typically measured by obtaining the correlation between scores on the same test, such as by having examinees take then retake the test and correlating both sets of scores (test-retest reliability) or by dividing the test in half and correlating scores on both halves (split-half reliability). Cronbach's alpha, like split-half reliability, is categorized as an internal consistency reliability coefficient. Its calculation is based on the average of all inter-item correlations, which are correlations between responses on two individual items. Mathematically, Cronbach's alpha works out to the average of all possible split-half correlations (there are many possible split-half correlations because there are many different ways of splitting the test in half). Regarding the other choices, the Spearman-Brown formula is used to estimate the effects of lengthening a test on its reliability coefficient. Longer tests are typically more reliable. The Spearman-Brown formula is commonly used to adjust the split-half coefficient to estimate what reliability would have been if the halved tests had as many items as the full test. The chi-square test is used to test predictions about observed versus expected frequency distributions of nominal, or categorical, data; for example, if you flip a coin 100 times, you can use the chi-square test to determine if the distribution of heads versus tails outcomes falls into the expected range or if there is evidence that the coin toss was "fixed." And the point-biserial correlation coefficient is used to correlate dichotomously scaled variables with interval or ratio data; for example, it can be used to correlate responses on test items scored as correct or incorrect with scores on the test as a whole. Additional Information: Internal Consistency Reliability

Answer 22

Correct Answer is: B Based on performance on the predictor and the criterion, individuals may be classified as false positives, true positives, false negatives, or true negatives. False negatives, like the athlete, are not identified as having used substances when, in fact, they have. Conversely, false positives are identified by the drug screening test as having used or having substances present, when they have not. True positives are individuals identified by the screening test as having substances present and they do. True negatives are individuals not identified by the screening test to have substances present and do not. Additional Information: Decision-Making

Answer 23

Correct Answer is: B An item difficulty index indicates the percentage of individuals who answer a particular item correctly. For example, if an item has a difficulty index of .80, it means that 80% of test-takers answered the item correctly. Although it appears that the item difficulty index is a ratio scale of measurement, according to Anastasi (1982) it is actually an ordinal scale because it does not necessarily indicate equivalent differences in difficulty. Additional Information: Item Difficulty

Answer 24

Correct Answer is: A Item response theory is a highly technical mathematical approach to item analysis. Use of item analysis is based on a number of complex mathematical assumptions. One of these assumptions, known as invariance of item parameters, holds that the characteristics of items should be the same for all theoretically equivalent groups of subjects chosen from the same population. Thus, any culture-free test should demonstrate such invariance; i.e., a set of items shouldn't have a different set of characteristics for minority and non-minority subgroups. it cannot be applied in the attempt to develop culture-fair tests. For this reason, item response theory has been applied to the development of culture-free tests, so this choice is not a true statement. The other choices are all true statements about item response theory, and therefore incorrect answers to this question. it's a useful theory in the development of computer programs designed to create tests tailored to the individual's level of ability. Consistent with this choice, item response theory is the theoretical basis of computer adaptive assessment, in which tests tailored to the examinee's ability level are computer generated. one of its assumptions is that test items measure a "latent trait." As stated by this choice, an assumption of item response theory is that items measure a latent trait, such as intelligence or general ability. it usually has little practical significance unless one is working with very large samples. And, finally, research supports the notion that the assumptions of item response theory only hold true for very large samples. Additional Information: Item Difficulty

Answer 25

Correct Answer is: B The purpose of factor analysis is to determine the degree to which many tests or variables are measuring fewer, underlying constructs. For example, factor analyses of the WAIS-III have suggested that four factors -- verbal comprehension, perceptual organization, processing speed, and working memory -- explain, to a large degree, scores on the fourteen subtests. Another way of saying this is that a factor analysis helps to identify the underlying structure in a set of variables. Additional Information: Factor Analysis

Answer 26

Feedback Correct Answer is: A Floor refers to a test's ability to distinguish between examinees at the low end of the distribution, which would be an issue when distinguishing between those with mild versus moderate retardation. Limited floor occurs when the test does not contain enough easy items. Note that "ceiling" would be of concern for tests designed to distinguish between examinees at the high end of the distribution: distinguish between above-average and gifted students. Additional Information: Ceiling and Floor Effects

Answer 27

Correct Answer is: C Variables that affect the validity of a test are moderator variables. When a moderator variable is present a test is said to have differential validity--meaning there would be a different validity coefficient for the New Zealanders group than for the others. Additional Information: Factors Affecting the Validity Coefficient

Answer 28

Feedback Correct Answer is: C You might have been able to guess correctly using the process of elimination. If so, note that R-squared tells you how much your ability to predict is improved using the regression line, compared to not using it. The most possible improvement is 1 and the least is 0. The number of values that are free to vary in a statistical calculation This choice is the definition of degrees of freedom. The variability of scores This is the definition of variance. The relationship between two variables that have a nonlinear relationship And this is a description of the coefficient eta.

Answer 29

Correct Answer is: B In most cases, you would square the correlation coefficient to obtain the answer to this question. However, the reliability coefficient is an exception to this rule: it is never squared. Instead, it is interpreted directly. This means that the value of the reliability coefficient itself indicates the proportion of variance in a test that reflects true variance. Additional Information: Reliability (Shared Variability)

Answer 30

Correct Answer is: C You probably remember that the alternate forms coefficient is considered by many to be the best reliability coefficient to use when practical (if you don't, commit this factoid to memory now). Everything else being equal, it is also likely to have a lower magnitude than the other types of reliability coefficients. The reason for this is similar to the reason why it is considered the best one to use. To obtain an alternate forms coefficient, one must administer two forms of the same test to a group of examinees, and correlate scores on the two forms. The two forms of the test are administered at different times and (because they are different forms) contain different items or content. In other words, there are two sources of error (or factors that could lower the coefficient) for the alternate forms coefficient: the time interval and different content (in technical terms, these sources of error are referred to respectively as "time sampling" and "content sampling"). The alternate forms coefficient is considered the best reliability coefficient by many because, for it to be high, the test must demonstrate consistency across both a time interval and different content. Additional Information: Alternate Forms Reliability

Answer 31

Correct Answer is: C If a test item has an item difficulty level of .50, this means that 50% of examinees answered the item correctly. Therefore, items with this difficulty level are most useful for discriminating between "high scoring" and "low scoring" groups. Additional Information: Item Difficulty

Answer 32

Correct Answer is: C Test-retest reliability, or the coefficient of stability, involves administering the same test to the same group on two occasions and then correlating the scores. split-half Split-half reliability is a method of determining internal consistency reliability. equivalence Alternative forms reliability, or coefficient of equivalence, consists of administering two alternate forms of a test to the same group and then correlating the scores. internal consistency Internal consistency reliability utilizes a single test administration and involves obtaining correlations among individual test items. Additional Information: Test-Retest Reliability

Answer 33

Correct Answer is: A The slope of a regression line for a test is directly related to the test's criterion-related validity: The steeper the slope, the greater the validity. A test has differential validity when it has different validity coefficients for different groups, which is what is suggested by different regression line slopes in a scatterplot. Factorial validity refers to the extent a test or test item correlates with factors expected to be correlated with in a factor analysis. The extent a test does not correlate with measures of an unrelated construct is referred to as divergent validity. Convergent validity refers to the degree a test correlates with measures of the same or a similar construct. Additional Information: Factors Affecting the Validity Coefficient

Answer 34

Correct Answer is: C Confidence interval indicates the range within which an examinees' true score is likely to fall, given his or her obtained score. The standard error of measurement indicates how much error an individual test score can be expected to have and is used to construct confidence intervals. To calculate the 68% confidence interval, add and subtract one standard error of measurement to the obtained score. To calculate the 95% confidence interval, add and subtract two standard errors of measurement to the obtained score. Two standard errors of measurement in this case equal 10. We're told that the examinee's obtained score is 70. 70 + 10 results in a confidence interval of 80 to 100. In other words, we can be 95% confident that the examinee's true score falls within 60 and 80. Additional Information: Standard Error of Measurement

Answer 35

Correct Answer is: D Confidence intervals allow us to determine the range within which an examinee's true score on a test is likely to fall, given his or her obtained score. The standard error of measurement is used to construct confidence intervals, not the other way around. Additional Information: Standard Error of Measurement

Answer 36

Correct Answer is: A On a job selection test, a "false positive" is someone who is identified by the test as successful but who does not turn out to be successful, as measured by a performance criterion. If you raise the selection test cutoff score, you will reduce false positives, since, by making it harder to "pass" the test, you will be ensuring that the people who do pass are more qualified and therefore more likely to be successful. By lowering the criterion score, what you are in effect doing is making your definition of success more lax. It therefore becomes easier to be considered successful, and many of the people who were false positives will now be considered true positives. If you understand concepts in pictures better than in words, refer to the Psychology-Test Construction section, where a graph is used to explain this idea. Additional Information: Decision-Making

Answer 37

Correct Answer is: D Use of a multitrait-multimethod matrix is one method of assessing a test's construct validity. The matrix contains correlations among different tests that measure both the same and different traits using similar and different methodologies. The heterotrait-monomethod coefficient, one of the correlation coefficients that would appear on this matrix, reflects the correlation between two tests that measure different traits using similar methods. An example might be the correlation between a test of depression based on self-report data and a test of anxiety also based on self-report data. If a test has good divergent validity, this correlation would be low. Divergent validity is the degree to which a test has a low correlation with other tests that do not measure the same construct. Using the above example, a test of depression would have good divergent validity if it had a low correlation with other tests that purportedly measure different traits, such as anxiety. This would be evidence that the depression test is not measuring traits that are unrelated to depression. Additional Information: Convergent and Discriminant (Divergent) Validation

Answer 38

Correct Answer is: C A predictor that is highly sensitive will more likely identify the presence of a characteristic; that is, it will result in more positives (true and false). This may be desirable when the risk of not detecting a problem is high. For example, in the detection of cancer, a blood test that results in a high number of false positives is preferable to one that has many false negatives. A positive test result can then be verified by another method, for example, a biopsy. Measurement error is the part of test scores which is due to random factors. Type II error is an error made when an experimenter erroneously accepts the null hypothesis. Additional Information: Decision-Making

Answer 39

Correct Answer is: B This question is difficult because the language of the response choices is convoluted and imprecise. We don't write questions like this because we're sadistic; it's just that you'll sometimes see this type of language on the exam as well, and we want to prepare you. What you need to do on questions like this is bring to mind what you know about the issue being asked about, and to choose the answer that best applies. Here, you should bring to mind what you know about the relationship between reliability and validity: For a test to have high validity, it must be reliable; however, for a test to have high reliability, it does not necessarily have to be valid. With this in mind, you should see that "high validity assumes high reliability" is the best answer. This means that a precondition of high validity is high reliability. The second best choice states that low reliability assumes low validity. This is a true statement if you interpret the word "assume" to mean "implies" or "predicts." But if you interpret the word "assume" to mean "depends on" or "is preconditioned by," the statement is not correct. Additional Information: Relationship between Reliability and Validity

Answer 40

Correct Answer is: A As described in the Federal Uniform Guidelines on Employee Selection, differential prediction is a potential cause of test unfairness. Differential prediction occurs when the use of scores on a selection test systematically over- or under-predict the job performance of members of one group as compared to members of another group. a rater's knowledge of ratees' performance on the predictor biases his/her ratings of ratees' performance on the criterion Criterion contamination occurs when a rater's knowledge of ratees' performance on the predictor biases his/her ratings of ratees' performance on the criterion. a predictor's validity coefficient differs for different groups Differential validity, also a possible cause of adverse impact, occurs when a predictor's validity coefficient differs for different groups. a test has differential validity When a test has differential validity, there is a slope bias. Slope bias refers to differences in the slope of the regression line. Additional Information: Adverse Impact

Answer 41

Correct Answer is: A This question is best answered with reference to the formula for the standard error of measurement, which appears in the Psychology-Test Construction section. It is calculated by subtracting 1 by reliability coefficent, and taking the square root of this value; then this is multiplied by the standard deviation of x.You need to know the minimum and maximum values of the reliability coefficient -- 0 and +1.0, respectively.If the reliability coefficient is +1.0, you will find from the above formula that the standard error of measurement is 0, which is its minimum value. And when the reliability coefficient is 0, you find from the formula that the standard error of measurement is equal to the standard deviation of test scores, which is its maximum value. Additional Information: Standard Error of Measurement

Answer 42

Correct Answer is: B The item difficulty index ranges from 0 to 1, and it indicates the number of examinees who answered the item correctly. Items with a moderate difficulty level, typically 0.5, are preferred because it helps to maximize the test's reliability. Additional Information: Item Difficulty

Answer 43

Correct Answer is: C A validity coefficient and the standard error of estimate are both measures of the accuracy of a predictor test. A validity coefficient is the correlation between scores on a predictor and a criterion (outcome) measure. A coefficient of 1.0 reflects a perfect correlation; it would mean that one would always be able to perfectly predict, without error, the scores on the outcome measure. The standard error of estimate indicates how much error one can expect in the prediction or estimation process. If a predictor test has perfect validity, there would be no error of estimate; you would always know the exact score on the outcome measure just from the score on the predictor. Therefore, the closer the validity coefficient is to 1.0, the smaller the value of the standard error of estimate, and if the coefficient were 1.0, the standard error of estimate would be 0. Additional Information: Standard Error of Estimate

Answer 44

Correct Answer is: A There are a number of ways to estimate the interscorer reliability, but the most common involves calculating a correlation coefficient between the scores of two different raters. The Kappa coefficient is a measure of the agreement between two judges who each rate a set of objects using the nominal scales. Additional Information: Interscorer Reliability

Answer 45

Correct Answer is: D In any test used to make a "yes/no" decision (e.g., screening tests, medical tests such as pregnancy tests, and job selection tests in some cases), the term "sensitivity" refers to the proportion of correctly identified cases--i.e., the ratio of examinees whom the test correctly identifies as having the characteristic to the total number of examinees who actually possess the characteristic. You can also conceptualize sensitivity in terms of true positives and false negatives. A "positive" on a screening test means that the test identified the person as having the condition, while a "negative" is someone classified by the test as not having the condition. The term true and false in this context refer to the accuracy or correctness of test results. Therefore, sensitivity can be defined as the ratio of true positives (people with the condition whom the test correctly detects) to the sum of true positives and false negatives (all the examinees who have the condition).

Answer 46

Correct Answer is: C Internal consistency is one of several types of reliability. As its name implies, it is concerned with the consistency within a test, that is, the correlations among the different test items. Split-half reliability is one of the measures of internal consistency and involves splitting a test in two and correlating the two halves with each other. Other measures of internal (inter-item) consistency are the Kuder-Richardson Formula 20 (for dichotomously scored items) and Cronbach's coefficient alpha (for multiple-scored items). Test-retest reliability is not concerned with internal consistency, but rather, the stability of a test over time, and uses the correlations of scores between different administrations of the same test. Alternative forms reliability is concerned with the equivalence of different versions of a test. And the kappa coefficient is used as a measure of inter-rater reliability, that is, the amount of agreement between two raters. Additional Information: Internal Consistency Reliability

Answer 47

Correct Answer is: D Tests can be compared to each other in terms of whether they emphasize power or speed. A pure speed test contains relatively easy items and has a strict time limit; it is designed to measure examinees' speed of response. A pure power test supplies enough time for most examinees to finish and contains items of varying difficulty. Power tests are designed to assess examinees' knowledge or ability in whatever content domain is being measured. Many tests measure both power and speed. An internal consistency reliability coefficient measures the correlation of responses to different items within the same test. On a pure speed test, all items answered are likely to be correct. As a result, the correlation between responses is artificially inflated; therefore, for speed tests, other measures of reliability, such as test-retest or alternate forms, are more appropriate. Additional Information: Speed, Power, and Mastery Tests

Answer 48

Correct Answer is: A Perceptual speed tests are highly speeded and are comprised of very easy items that every examinee, it is assumed, could answer correctly with unlimited time. The best way to estimate the reliability of speed tests is to administer separately timed forms and correlate these, therefore using a test-retest or alternate forms coefficient would be the best way to assess the reliability of the test in this question. The other response choices are all methods for assessing internal consistency reliability. These are useful when a test is designed to measure a single characteristic, when the characteristic measured by the test fluctuates over time, or when scores are likely to be affected by repeated exposure to the test. However, they are not appropriate for assessing the reliability of speed tests because they tend to produce spuriously high coefficients. Additional Information: Reliability

Answer 49

Correct Answer is: B In the context of factor analysis, "oblique" means correlated or dependent. ("Orthogonal" means uncorrelated or independent.) Additional Information: Interpreting and Naming the Factors

Answer 50

Correct Answer is: B The question describes the kind of information that is provided in an item response curve, which is constructed for each item to determine its characteristics when using item response theory as the basis for test development. (Note that there is no such thing as "item utility theory.") Additional Information: Item Response Theory and Item Response Curve

Answer 51

Correct Answer is: D An oblique rotation is used when the variables included in the analysis are considered to be correlated. effects of one or more variables have been removed from X and Y. This choice describes semi-partial correlation. effects of one or more variables have been removed from X only. This describes partial correlation. variables included in the analysis are uncorrelated. When the variables included in the analysis are believed to be uncorrelated, an orthogonal rotation is used. Additional Information: Interpreting and Naming the Factors

Answer 52

Correct Answer is: C A test's reliability sets an upper limit on its criterion-related validity. Specifically, a test's validity coefficient can never be higher than the square root of its reliability coefficient. In practice, a validity coefficient will never be that high, but, theoretically, that's the upper limit. Additional Information: Relationship between Reliability and Validity

Answer 53

Correct Answer is: D The item response curve provides information about an item's difficulty; ability to discriminate between those who are high and low on the characteristic being measured; and the probability of correctly answering the item by guessing. The position of the curve indicates its difficulty* and the steeper the slope of the item response curve, the better its ability to discriminate (correct response) between examinees who are high and low on the characteristic being measured. The item response curve does not indicate reliability* or validity* (* incorrect options). Additional Information: Item Response Theory and Item Response Curve

Answer 54

Correct Answer is: A Biographical Information Blanks (BIB) have actually been found to be highly predictive of job success and only slightly less valid than cognitive ability tests for predicting job performance. However, they often lack face validity since some of the questions do not appear to the applicants to have anything to do with job performance. Additional Information: Content Validity

Answer 55

Correct Answer is: D Before we can compare different forms of scores, we must transform them into some form of standardized measure. A Math test which has a mean of 50 and an SD of 10 indicates that a raw score of 70 would fall 2 standard deviations above the mean. Assuming a normal distribution of scores, a percentile rank of 84 on a History test is equivalent to 1 standard deviation above the mean. If you haven't memorized that, you could still figure it out: Remember that 50% of all scores in a normal distribution fall below the mean and 50% fall above the mean. And 68% of scores fall within +/- 1 SD of the mean. If you divide 68% by 2 - you get 34% (the percentage of scores that fall between 0 and +1 SD). If you then add that 34% to the 50% that fall below the mean - you get a percentile rank of 84. Thus, the 84 percentile score is equivalent to 1 SD above the mean. Finally, looking at the T-score on the English test - we know that T-scores always have a mean of 50 and an SD of 10. Thus a T-score of 65 is equivalent to 1½ standard deviations above the mean. Comparing the 3 test scores we find the highest score was in Math at 2 SDs above the mean, followed by English at 1½ SDs above the mean, and History at 1 SD above the mean. Additional Information: Standard Scores

Answer 56

Correct Answer is: D In computerized adaptive testing, the examinee's previous responses are used to tailor the test to his or her ability. As a result, inaccuracy of scores is reduced across ability levels.

Answer 57

Correct Answer is: C Multiple regression is the preferred technique for combining test scores in this situation as it is a compensatory technique since a low score on one test can be offset (compensated for) by high scores on other tests. Multiple baseline* is a research design, not a method for combining test scores. Multiple hurdle* and multiple cutoff* are noncompensatory techniques (* incorrect options). Additional Information: Multiple Correlation and Multiple Regression

Answer 58

Correct Answer is: C Differential prediction is a bit of a technical term, but in a non-technical way, it can be defined as a case where given scores on a predictor test predict different outcomes for different subgroups. Using the example in the question: if the average predicted GPA for men scoring 500 on the verbal SAT was 2.7, the average predicted GPA for females with the same SAT score was 3.3, and this type of difference is statistically significant across scores on the SAT, then use of the SAT would result in differential prediction based on gender. Differential prediction could result in selection bias in favor of one group at the expense of others. In the example under discussion, if 500 were the cutoff score for college admission, the men selected for admission would be less qualified than the women selected, and there would be a number of women not selected for admission who were equally or more qualified than the men who were selected. So use of the test would not be fair to female candidates. Regarding the other choices: differential validity means that a test is more valid for one subgroup but not another, and single-group validity would mean that a test is valid for one subgroup but not another subgroup. In both cases, it means that the validity coefficient, or the correlation between the predictor and criterion, is different for different subgroups. This could be, but is not necessarily, the cause of differential prediction. In our example, it could be that, even though criterion scores are different for men and women at the same SAT score, the SAT predicts those scores at the same accuracy level for both groups (e.g., in our example, the score 500 provides the same level of predictive power for both men and women). In this scenario, the validity coefficients would be the same for both groups. Finally, adverse impact occurs when the use of a selection test results in a substantially lower rate of selection for one subgroup as compared to another subgroup--specifically, when the selection rate of one subgroup is 80% of less of the selection rate of another. For example, if use of the SAT resulted in 80% of males and 50% of females being admitted to college, the test would have adverse impact against females (50/80 = .625 or 62.5%, less than 80%). Since the question contains no information about the selection rates for men and women, this is not the best choice.

Answer 59

Correct Answer is: A The procedure used to determine what items will be retained for the final version of a test is the definition of item analysis. The degree to which items discriminate among examinees is the definition of Item Discrimination. A graph that depicts percentages of people is an item characteristic curve.

Answer 60

Correct Answer is: D Item characteristic curves (ICCs), which are associated with item response theory, are graphs that depict individual test items in terms of the percentage of individuals in different ability groups who answered the item correctly. For example, an ICC for an individual test item might show that 80% of people in the highest ability group, 40% of people in the middle ability group, and 5% of people in the lowest ability group answered the item correctly. Although costly to derive, ICCs provide much information about individual test items, including their difficulty, discriminability, and probability that the item will be guessed correctly. Additional Information: Item Response Theory and Item Response Curve

Answer 61

Correct Answer is: B The kappa statistic is used to evaluate inter-rater reliability, or the consistency of ratings assigned by two raters, when data are nominal or ordinal. Interval and ratio data is sometimes referred to by the term metric. Additional Information: Interscorer Reliability

Answer 62

Correct Answer is: B There are several ways an examinee's test score can be interpreted. In this question, a criterion-referenced interpretation, an examinee's test performance is interpreted in terms of an external criterion, or standard of performance. norm-referenced interpretation In a norm-referenced interpretation, an examinee's test performance is compared to the performance of members of the norm group (other people who have taken the test). domain-referenced interpretation Domain-referenced interpretation is used to determine how much of a specific knowledge domain the examinee has mastered. objectives-referenced interpretation Objectives-referenced interpretation involves interpreting an examinee's performance in terms of achievement of instructional objectives. Additional Information: Criterion-Referenced Interpretation

Answer 63

Correct Answer is: D As the name implies, test-retest reliability involves administering a test to the same group of examinees at two different times and then correlating the two sets of scores. This would be most appropriate when evaluating a test that purports to measure a stable trait, since it should not be significantly affected by the passage of time between test administrations. Additional Information: Test-Retest Reliability

Answer 64

Correct Answer is: B A norm-referenced score is one that is interpreted in terms of a comparison to others who have taken the same test. Norm-referenced assessment is the method that compares a student with the age or grade-level expectancies of a norm group. It is generally used to sort students rather than to measure individual performance against a standard or criterion. The GRE and intelligence tests used in determining eligibility for special education programs are examples of norm-referenced measures. Additional Information: Norm-Referenced Interpretation

Answer 65

Correct Answer is: A Pure speed tests and pure power tests are opposite ends of a continuum. A speed test is one with a strict time limit and easy items that most or all examinees are expected to answer correctly. Speed tests measure examinees' response speed. A power test is one with no or a generous time limit but with items ranging from easy to very difficult (usually ordered from least to most difficult). Power tests measure level of content mastered. Additional Information: Speed, Power, and Mastery Tests

Answer 66

Correct Answer is: D If a score falls between 1 and 2 standard deviations in a normal distribution we can readily conclude that it's T-Score is between 60 and 70 and it's z-score is between 1 and 2 (since z-scores are stated in standard deviation units). We can, therefore, eliminate these two choices "T-score is greater than 70" and "z-score is greater than 2." To determine percentile ranks you can do a simple calculation if you know the areas under a normal curve. Remember that 50% of all scores in a normal distribution fall below the mean and 50% fall above the mean. And 68% of scores fall within +/- 1 SD of the mean. If you divide 68% by 2, you get 34% (the percentage of scores that fall between 0 and +1 SD). If you then add that 34% to the 50% that fall below the mean, you get a percentile rank of 84. Thus, the 84 percentile score is equivalent to 1 SD above the mean. The same calculation is used for determining the percentile rank at 2 standard deviations. Since 95% of all scores fall within +/- 2 SD, we divide 95% by 2 which equals 47.5 and add that to the 50% which falls below the mean, which totals 97.5 (rounded off = 98). Thus, the percentile rank is between 84 and 98. Additional Information: Standard Scores

10- Test Construction Flashcards

(90 cards)