3rd exam Flashcards

(128 cards)

1
Q

What is the formula for classical testing theory?

A

X = T + E, where X = observed score, T = true score, and E = error (systematic and random)

2
Q

What creates a problem for classical testing theory?

A

Guessing on an achievement test can cause the observed score to misrepresent the true score

3
Q

Do we know when people guess?

A

We never know when someone is guessing

4
Q

Abbott's formula

A

allows you to estimate the true score by correcting for blind guessing

5
Q

If you guess wrong, what happens within classical testing theory?

A

the observed score does not reflect the true score

6
Q

Abbott's actual math formula

A

Corrected score = R - W/(K - 1), where R = number of correct responses, W = number of wrong responses, and K = number of answer alternatives

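The wording of the formula card is ambiguous about grouping; the standard correction for guessing subtracts W/(K - 1) from R. A minimal Python sketch (the function name and the numbers are illustrative, not from the cards):

```python
def abbott_corrected_score(r, w, k):
    """Abbott's correction for blind guessing: subtract the expected
    number of lucky guesses, W / (K - 1), from the number correct R."""
    return r - w / (k - 1)

# 40 correct, 12 wrong on 4-alternative items: 40 - 12/3 = 36
print(abbott_corrected_score(40, 12, 4))  # 36.0
```

With 4 alternatives, a blind guesser gets about 1 right for every 3 wrong, which is why each wrong answer offsets a third of a point.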
7
Q

To overcome the influence of blind guessing

A

one should advise examinees to attempt every question, since not all guessing is blind: examinees can often narrow down the options and answer correctly, and truly blind guessing tends to be infrequent

8
Q

What is an error in multiple choice questions?

A

not the question itself but the response options you choose from

9
Q

What is the error within short-answer questions?

A

the issue is interpreting what the question is asking and how to answer it; this ambiguity affects reliability

10
Q

Ebel's idea of reliability and response options

A

reliability studies have been done on the number of response options; a better way to increase test reliability is to add more items (the number of response options should be around 5)

11
Q

Speed tests

A

the best way to calculate reliability for speeded tests is to do a split-half reliability on the test

12
Q

With speed tests how should you do reliability

A

administer half the test with half the time limit, and administer the two halves about two weeks apart; this is a better indicator of reliability

13
Q

Halo Effect

A

a rater's tendency to perceive an individual who is high (or low) in one area as also high (or low) in other areas

14
Q

2 kinds of halo effects

A

general impression model and salient dimension model

15
Q

General impression model

A

the tendency of a rater to allow an overall impression of an individual to influence judgments of that person's performance (ex: a rater may judge a reporter as "impressive" and therefore also rate his/her speech as strong)

16
Q

Salient dimension model

A

one salient quality of a person affects the rating of other qualities (ex: people rated as attractive are also rated as more honest); inferences about an individual are made from one salient trait or quality

17
Q

Simpson's paradox

A

aggregating data can change its apparent meaning and obscure conclusions because of a third variable

18
Q

Percentages are at the heart of Simpson's paradox; why are they problematic?

A

because they obscure the relationship between the numerator and denominator (ex: 8/10 and 80/100 are both 80%, but the number of people who reviewed each restaurant is very different)

19
Q

What is important in knowing the percentage?

A

you need to know what the numerator and denominator are, or you will misinterpret the percentages

20
Q

What happens when you disaggregate the data?

A

you can truly see whether the phenomenon is actually occurring (Simpson's paradox)

21
Q

Clinical Decision-Making

A

make decisions based on one's own clinical experience

22
Q

Mechanical decision-making

A

make decisions based on data or statistics

23
Q

Clinical psychologists often feel that their decision making is

A

absolute, but it is flawed: biases affect the decisions we make

24
Q

Robin Dawes

A

asserts that mechanical prediction is better than clinical prediction

25
Dawes example
asked faculty to rate students in the graduate program from 1964-1967 on a 5-point scale; there was a very low correlation between current faculty ratings and the admissions committee's ratings, but faculty ratings were correlated with GRE scores and undergraduate GPA
26
quantitative data (mechanical decisions) were
more predictive than clinical judgment
27
When can mechanical or quantitative prediction work?
when people identify which variables to examine for prediction; people are necessary to choose the variables
28
Dawes's crude mechanical decision making
ex: marital relationship satisfaction was predicted from the ratio of sex to arguments; people tend to rate relationships higher if they have more sex and fewer fights
29
People are not good at what with the data according to Dawes?
integrating the data in unbiased ways
30
There is resistance to which kind of prediction?
mechanical prediction; our belief in our own prediction is reinforced by isolated incidents we can recall (even though we rely on testing, which is quantitative data)
31
Why do you always need to know the base rate?
to avoid making clinical judgment errors
32
Clinical decision making always has to be balanced by
Mechanical decision making
33
When people seek out treatment, they seek it out when they are most
Severe, or something is really impacting them
34
When you are severe, you generally don't get more severe, which relates to the
regression to the mean: extreme scores tend to move back toward the middle
35
Why is mechanical better than clinical prediction?
Dawes says that humans make errors in judgment because they ignore base rates, third variables, and regression to the mean
36
Third variable examples
ice cream sales and crime both go up in the summer; the third variable is heat
37
Representative thinking
we tend to make decisions based on the information we readily have access to; we use such shortcuts in everyday life, but for diagnosis we need to do more
38
Using representative thinking
can sometimes cause errors in thinking.
39
Heuristic
simple rule to make decisions
40
Factor analysis goes under
Nondichotomous scoring systems
41
Item response theory goes under both
Item analysis for both dichotomous and nondichotomous
42
Generalizability theory goes under the
Overall test
43
Factor analysis
determines which items are associated with latent constructs (constructs that cannot be measured directly); we do this mathematically, which allows us to look at item quality
44
Anxiety as a latent construct
3 buckets (overarching constructs): physical, emotional/psychological and cognitive (every disorder has buckets)
45
Within anxiety the latent construct, what would the 3 overarching constructs contain?
Physical (heart rate, sweating, shaking, GI distress), Emotional/psychological (irritability, worry, nervousness), Cognitive (poor concentration, rumination)
46
3 necessary conditions to write a factor analysis
1. factor structure represents what we know about the construct 2. factor structure can be replicated 3. factor structure is clearly interpretable with precise scaling
47
what type of sample does a factor analysis require?
needs an over-inclusive, larger sample of between 200-500 subjects
48
facets
well-defined, homogeneous item clusters that directly map onto the higher-order factors
49
What happens when there are more items in a factor analysis?
creates the ability to tap into aspects of the construct you may not have anticipated; it can also produce facets or sub-constructs
50
Which item format can you not use in factor analysis?
you cannot use dichotomous item response formats because they can cause a serious disturbance in the correlation matrix
51
why do authors suggest having rating scales or likert scales from 5 to 7 points?
with more response options, a greater amount of variance can be captured
52
Who should you sample for factor analysis?
heterogeneity is needed; researchers should get a sample that represents all trait dimensions
53
one of the reasons for conducting a factor analysis
develop and identify a hierarchical factor structure
54
Hierarchical factor structure
allows us to statistically identify the items that appear relevant to the construct; it may also identify an area or construct not considered before the items were put together
55
Major criticism of factor analysis
the items are developed on constructs that may or may not have a measurable criterion
56
the second reason for conducting factor analysis
improving psychometric properties of a test
57
how to improve psychometric properties of a test?
factor analysis can help developers determine which items to remove, revise, or add in order to improve the internal consistency reliability of the items
58
all tests with sound items should have a strong?
Internal consistency
59
with the sample size if the factors are well defined you can use a
smaller sample of between 100-200
60
The third reason for conducting a factor analysis is developing items that discriminate between samples
some items may be endorsed by certain groups, and then you may need to revise those items so they are more discriminating for another group
61
The fourth reason for conducting factor analysis, developing more unique items- decreasing redundancy
having identical items is inefficient: whatever error is present will be associated with both items
62
Why are short forms good?
more efficient, less time consuming, easier for examinee and assessor
63
2 primary objections to short form development
1) can the short form give the information needed for an appropriate assessment? 2) is the short form accurate and valid?
64
General problems for short forms
1) there is an assumption that all the reliability and validity of the long form automatically applies to the abbreviated short form (due to the reduced coverage, one cannot assume similar reliability and validity); 2) there is an assumption that the new shorter measure requires less validity evidence (the primary problem: with fewer items and less content coverage, the validity of the measure is compromised as well)
65
Empirical evidence of short forms (Smith, McCarthy & Andersen)
Examined 12 short forms for equivalence to the longer original forms. Found: if the long measure does not have good validity, a short one cannot either; reducing the items may compromise content coverage; there was a significant reduction in reliability coefficients; many researchers do not run another factor analysis on the short form; the short form must be administered to an independent sample to determine validity; the short form must be used to classify clinical populations and compared to the long form; and genuine time and money savings must be established for a short form
66
Item response theory 2 types
difficulty and discriminability
67
Item Response Theory
a mathematical and statistical tool to determine item quality, and to see how items perform differently for specific groups or for individuals who are a part of a group
68
Classical testing theory is limited because
all error is lumped together in one term, E (in the formula); we can't determine error at the individual item level
69
Item Response theory relating to error from Classical Testing theory
allows us to examine error at the item level, using hierarchical mathematical modeling to observe scoring patterns
70
Two types of item analysis
item difficulty and discriminability
71
How do we know what a good item is on a test
first we did factor analysis, but there are sometimes problems with it; according to IRT, we use item difficulty or discriminability
72
Item difficulty Dichotomous
defined by the proportion of people who get a particular item correct (ex: if 84% of people get item #24 correct, then the difficulty level for that item is .84)
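The difficulty index on the card above can be sketched in Python (the response data are made up for illustration):

```python
# Dichotomous responses to one item: 1 = correct, 0 = incorrect
responses = [1, 1, 0, 1, 1, 0, 1, 1, 1, 1]

# Difficulty index p = proportion of examinees answering correctly
p = sum(responses) / len(responses)
print(p)  # 0.8 -> a fairly easy item (higher p means easier)
```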
73
Item difficulty levels based on higher or lower Dichotomous
the higher the number the easier the item, the lower the number the harder the item
74
Item difficulty is based on
Chance
75
What should item difficulty be set at?
items should be set at a moderate level of difficulty, with average difficulty equal to about .50
76
When deciding difficulty levels need to consider what
it depends on who you are testing (ex: items for medical students might be around .2, vs. .7-.9 for disabled students, whose skill set is limited)
77
What is the best level of difficulty?
best tests choose items that are between .3-.7 in difficulty
78
Test floor
you should have a sufficient number of easy items (e.g., for disabled examinees); this tests the floor
79
Test ceiling
a sufficient number of hard items (for doctoral-level students, medical students)
80
item discriminability Dichotomous
determines whether people who have done well on a particular item have also done well on the entire test
81
extreme group method
compares people who have done very well with those who have done very poorly on a test
82
How is discrimination found? Dichotomous
if an item discriminates between the upper group and the lower group, it is a very good item, because it is able to discriminate between groups
83
difference between higher and lower numbers for discrimination Dichotomous
the higher the number the more discrimination, the lower the number the less discrimination
84
overthinking the problem
when there is a negative discrimination number: high scorers miss the item, suggesting they read too much into it
85
D= index of discrimination
the numbers of persons passing in the upper and lower groups are expressed as percentages, and the difference between those percentages is the index of discrimination
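The extreme-group calculation above can be sketched in a few lines of Python (group sizes and pass counts are hypothetical):

```python
def discrimination_index(upper_pass, upper_n, lower_pass, lower_n):
    """Extreme-group method: D = proportion passing in the upper group
    minus proportion passing in the lower group."""
    return upper_pass / upper_n - lower_pass / lower_n

# 15 of 20 top scorers pass the item, but only 5 of 20 bottom scorers
print(discrimination_index(15, 20, 5, 20))  # 0.5
```

A large positive D means the item separates strong from weak examinees; a negative D flags a problem item.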
86
how do we know it is dichotomous?
whenever we see the word "correct," because dichotomous scoring is right or wrong
87
Point Biserial Method
find the correlation between performance on the item and performance on the entire test
88
Point Biserial positive meaning
ranges from -1 to +1; a positive value (closer to +1) indicates the item discriminates: those who scored higher on the test overall also tended to get this particular item correct
89
Point Biserial negative meaning
ranges from -1 to +1; a negative point biserial indicates that there may be a problem with the item
90
Point Biserial chart explanation
shows the relationship between overall score and item difficulty: how many of those individuals are getting the item correct
91
item characteristic curves are
dichotomous, and let you know whether the item is good
92
overthinking representation
when the curve on an item characteristic curve rises and then falls (ex: the upper group's curve goes up and then comes down)
93
we focus on which group?
the upper group for dichotomous
94
item response function
a mathematical function describing the relation between where an individual falls on the continuum of a given construct such as depression and the probability that he/she will give a particular response to a scale item designed to measure that construct
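The card above defines the item response function abstractly; one common concrete form, assumed here purely as an illustration, is the two-parameter logistic model:

```python
import math

def irf_2pl(theta, a, b):
    """2-parameter logistic IRF: probability of the keyed response given
    trait level theta, item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At theta equal to the difficulty b, the probability is exactly 0.5;
# a larger 'a' makes the curve steeper (a more discriminating item).
print(irf_2pl(0.0, a=1.5, b=0.0))  # 0.5
```

This matches the later cards: the curve shifts right as difficulty (severity) increases, and the steepest slope marks the most discriminating item.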
95
difficulty for non-dichotomous
symptom severity: L = mild, M = moderate, and U = severe; the farther a curve is from the y-axis, the more severe
96
Discriminability nondichotomous
means that the item discriminates between individuals that have severe symptoms and mild symptoms
97
item difficulty curve non-dichotomous
the curve that is furthest from the y-axis is considered the most difficult
98
item difficulty and discrimination for non-dichotomous will always provide
the mathematical model will always provide a curve to show these
99
item discrimination curve for non-dichotomous
the curve that has the steepest slope is most discriminating item
100
advantages of IRT over CTT
IRT can look at the probability of getting an item correct based on the test taker's ability and qualities; it can adapt to computer administration to present specific items matched to ability level; it lets us better test those at higher and lower abilities; it lets us compare different groups (ethnicities, genders) on the same items to examine patterns of responding; and it moves us away from biased questions, with greater accuracy at the item level
101
generalizability theory is based on what aspects of the test?
the overall test, a new understanding of reliability
102
Why is generalizability theory moving away from Classical Testing Theory?
to understand how reliability is affected by various sources of error
103
Classical Testing theory only assumes 2 sources of error
random and systematic error
104
measurement error
error that is associated with trying to quantify a specific construct or concept
105
measurement error is associated with 3 errors
procedural error, instrumental error and evaluator error
106
procedural error
a non-standardized administration; this is not chance-based, because the more you practice, the less you will commit this error
107
instrumental error
error associated with the instrument or the items on the test
108
evaluator error
any error committed by the assessor, such as making problematic interpretations of the data or not scoring correctly
109
measurement error is similar to circumscribed error
accounting for all possibilities
110
2 components of generalizability theory
generalizability and dependability
111
generalizability
can we generalize this observed test score to all the possible universe scores for that person? (ex: a husband and wife test-drove a Prius one time, said it was great, and generalized that all Priuses are good); when testing someone once, does the observed score represent their true score?
112
dependability
will the observed score remain constant even if we change the testing parameters? (ex: their new Prius drives great in normal weather but not well when it's raining; will it remain constant if road conditions change?)
113
generalizability closer to 1 means
the closer it is to 1, the more confident we are that the observed score can be generalized to all the possible scores for that particular person
114
dependability closer to 1 means
the closer it is to 1, the more the observed score will remain constant irrespective of the testing parameters
115
Generalizability theory allows us to look at sources of measurement error, which could be
items on the test, raters, setting, assessment, or time (ex: in a prison setting, someone could give different responses)
116
a problem with classical testing theory is that it only recognizes
two sources of variance (test-retest and internal consistency)
117
according to generalizability theory, variance and error in classical testing theory are
synonymous words
118
how does generalizability theory extend the true score model?
by acknowledging that multiple factors may affect the error associated with measurement of one's true score
119
rater is another way of saying
assessor
120
Sources of error
a noisy room, specific items, examinee fatigue, the administrator of the test (some will have minimal experience, some a lot); none of these could be addressed in CTT
121
Fundamental equation
reliability = variance of T divided by variance of X (where variance of X = variance of T + variance of E); the larger the variance of T in relation to X, the higher the reliability
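The fundamental equation above can be checked with toy numbers (the variance values are assumed, purely for illustration):

```python
# Hypothetical variance components for the fundamental equation
var_t = 80.0           # true-score variance
var_e = 20.0           # error variance
var_x = var_t + var_e  # observed-score variance = var(T) + var(E)

reliability = var_t / var_x
print(reliability)  # 0.8 -- the larger var_t relative to var_x, the higher the reliability
```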
122
sources of variance
p = person taking the test, i = items on the test, e = random error, pi = interaction between the person taking the test and the items on the test
123
what does a bigger circle on the Venn diagram say about error?
there is more error associated with it
124
adding another source of variance to the Venn diagram
j = judge (evaluator), pj = person-by-judge interaction, ij = item-by-judge interaction (some judges might favor certain items over others), pij = interaction among the person taking the test, the items on the test, and the judge
125
Norm oriented perspective
tends to be associated with generalizability coefficients; only uses indices that have p (person) involved
126
Domain-oriented perspective
associated with the dependability coefficient; looks at all the indices
127
whenever you see a T in the formula, what is it equal to?
T is equivalent to P: the true score corresponds to the person
128
What do we use to understand item discriminability with dichotomous scoring
Extreme group & Point Biserial