Ch. 12: Data-Based and Statistical Reasoning Flashcards

1
Q

defn: measures of central tendency

A

those that describe the middle of a sample

2
Q

how do we find the mean + aka?

A

aka: average, arithmetic mean

add up all the individual values within the data set and divide the result by the number of values
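
A minimal Python sketch of this calculation. The function name and the sample values are illustrative, not from the cards:

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by the number of values."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


sample = [2, 4, 4, 6, 9]  # hypothetical data set
print(mean(sample))       # 5.0
```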

3
Q

when are means good indicators of central tendency?

A

when all of the values tend to be fairly close to one another

4
Q

defn + impact: outlier

A

an extremely large or extremely small value compared to the other data values (can shift the mean toward one end of the range)

5
Q

defn + how to find: median

A

the midpoint of a set of data (half of data points are greater than the value and half are smaller)

in data sets with an odd number of values, the median will be one of the data points

in data sets with an even number of values, the median will be the mean of the two central data points

to calculate, first organize the data in increasing fashion
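
The odd and even cases above can be sketched in Python as follows. The function name and inputs are illustrative only:

```python
def median(values):
    """Midpoint of the data: sort first, then pick or average the center."""
    ordered = sorted(values)  # organize the data in increasing order
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # odd count: the median is one of the data points
        return ordered[mid]
    # even count: the median is the mean of the two central data points
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 3]))     # 3
print(median([4, 1, 3, 2]))  # 2.5
```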

6
Q

when is the median a good tool to use? when is it not helpful?

A

GOOD FOR: it is the least susceptible to outliers

BAD FOR: may not be useful for data sets with large ranges or multiple modes

7
Q

what does it mean if the mean and median are far from each other? if they are close to each other?

A

IF FAR: this implies the presence of outliers or a skewed distribution

IF CLOSE: implies a symmetrical distribution
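
A small illustration using Python's standard `statistics` module. The two data sets are made up to show how a single outlier pulls the mean away from the median:

```python
from statistics import mean, median


def mean_median_gap(values):
    """Distance between the mean and the median of a data set."""
    return abs(mean(values) - median(values))


symmetric = [1, 2, 3, 4, 5]       # hypothetical: mean 3, median 3
with_outlier = [1, 2, 3, 4, 100]  # hypothetical: mean 22, median 3

print(mean_median_gap(symmetric))     # gap of 0: consistent with symmetry
print(mean_median_gap(with_outlier))  # gap of 19: the outlier shifts the mean
```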

8
Q

defn: mode

A

the number that appears the most often in a set of data

there may be multiple modes (or even no mode!)

peaks represent modes in a data set
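
A sketch of finding the mode(s) in Python. It assumes one common convention that the card does not spell out: if every value occurs equally often, the data set is reported as having no mode.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] when there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    winners = [value for value, count in counts.items() if count == top]
    # assumed convention: if all distinct values tie, report no mode
    if len(counts) > 1 and len(winners) == len(counts):
        return []
    return winners


print(modes([1, 2, 2, 3]))     # [2]
print(modes([1, 1, 2, 2, 3]))  # [1, 2]
print(modes([1, 2, 3]))        # []
```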

9
Q

is the mode a measure of central tendency?

A

not typically used as one, but the number of modes and their distance from one another is informative

10
Q

what does it mean to “solve” a normal distribution?

A

we can transform any normal distribution to a STANDARD normal distribution with a mean of zero and a standard deviation of one, and then use the newly generated curve to get information about probabilities or percentages of populations
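
One way to sketch this transformation in Python, using `statistics.NormalDist` from the standard library. The z-score formula z = (x - mean) / SD is the usual standardization step and is not written out on the card; the example numbers are hypothetical:

```python
from statistics import NormalDist


def z_score(x, mean, sd):
    """Rescale x to the standard curve (mean 0, standard deviation 1)."""
    return (x - mean) / sd


def fraction_below(x, mean, sd):
    """Share of a normal population expected to fall below x."""
    return NormalDist().cdf(z_score(x, mean, sd))


# hypothetical population: mean 100, SD 15
print(z_score(130, 100, 15))                   # 2.0
print(round(fraction_below(130, 100, 15), 3))  # 0.977
```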

11
Q

what is the basis of the bell curve?

A

the normal distribution

12
Q

what % of the distribution (normal) is within one SD? within 2 SD? within 3 SD?

A

1 SD: 68%

2 SD: 95%

3 SD: 99.7% (often rounded to 99%)
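
These percentages can be checked against the standard normal curve. A short sketch with `statistics.NormalDist`; the helper name is illustrative:

```python
from statistics import NormalDist


def fraction_within(k):
    """Share of a normal distribution lying within k SDs of the mean."""
    standard = NormalDist()  # mean 0, standard deviation 1
    return standard.cdf(k) - standard.cdf(-k)


for k in (1, 2, 3):
    print(k, round(fraction_within(k), 3))
# 1 0.683
# 2 0.954
# 3 0.997
```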

13
Q

defn: skewed distribution

A

one that contains a tail on one side or the other of the data set

14
Q

why are skewed distributions often confusing?

A

the VISUAL shift in the data appears OPPOSITE the direction of the skew

the direction of a skew in a sample is determined by its TAIL, not the bulk of the distribution

15
Q

defn: negatively vs. positively skewed distribution

A

NEGATIVELY = tail on left (negative) side

POSITIVELY = tail on right (positive) side

16
Q

why is the mean of a negatively skewed distribution lower than the median?

why is the mean of a positively skewed distribution higher than the median?

A

because the mean is more susceptible to outliers than the median

17
Q

defn: bimodal

A

a distribution containing 2 peaks with a valley between them

note: it might only have one actual MODE if one peak is slightly higher than the other

18
Q

in what circumstances can we (but don’t have to!) analyze bimodal distributions as two separate distributions?

A

if there is sufficient separation of the two peaks, or a sufficiently small amount of data within the valley region

19
Q

can measures of central tendency and measures of distribution be applied to bimodal distributions?

A

Yes!

20
Q

defn: range

A

the difference between its largest and smallest values

21
Q

what is range affected heavily by?

A

the presence of data outliers

22
Q

what is an estimate of the SD based on the range when it is not possible to calculate the SD?

A

SD is approx. 1/4 range

23
Q

defn: quartile

A

divide data (when placed in ascending order) into groups that comprise one-fourth the entire set

24
Q

what are the 4 steps to calculating the quartiles?

A
  1. to find the position of Q1 in a set of data sorted in ascending order, multiply n by 0.25
  2. if this is a whole number, the quartile is the mean of the value at this position and the next highest position
  3. if this is a decimal, round up to the next whole number and take that as the quartile position
  4. to calculate the position of Q3 multiply the value of n by 0.75. Again, if this is a whole number, take the mean of this position and the next. If it is a decimal, round up to the next whole number, and take that as the quartile position
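
The four steps above can be sketched in Python as follows. Positions are counted from 1, as in the card; the function name and the sample data are illustrative:

```python
import math


def quartile(values, fraction):
    """Quartile by the position rule: fraction is 0.25 for Q1, 0.75 for Q3."""
    ordered = sorted(values)  # data must be in ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("quartiles of an empty data set are undefined")
    position = n * fraction   # 1-based position in the sorted data
    if position == int(position):
        # whole number: mean of the value here and at the next position
        p = int(position)
        return (ordered[p - 1] + ordered[p]) / 2
    # decimal: round up and take the value at that position
    return ordered[math.ceil(position) - 1]


data = [1, 3, 5, 7, 9, 11, 13, 15]  # hypothetical, n = 8
print(quartile(data, 0.25))  # 4.0  (position 2, mean of 3 and 5)
print(quartile(data, 0.75))  # 12.0 (position 6, mean of 11 and 13)
```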
25
Q

how do you calculate the interquartile range?

A

IQR = Q3 - Q1

26
Q

what is the IQR helpful for determining? + how?

A

outliers

any value that falls more than 1.5 IQRs below the first quartile or above the third quartile is considered an outlier
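
A sketch of the 1.5 IQR rule in Python. Q1 and Q3 are passed in as already-computed values, and the data set is hypothetical:

```python
def iqr_outliers(values, q1, q3):
    """Values more than 1.5 IQRs below Q1 or above Q3."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]


data = [1, 3, 5, 7, 9, 11, 13, 40]     # hypothetical data set
print(iqr_outliers(data, q1=4, q3=12))  # [40]  (fences at -8 and 24)
```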

27
Q

what is the most informative measure of distribution?

A

standard deviation

28
Q

how is std dev calculated (in words)?

A

by taking the difference between each data point and the mean, squaring this value, dividing the sum of all of these squared values by the number of points in the data set minus one, and then taking the square root of the result
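
The same calculation, step by step, as a Python sketch. The function name and sample values are illustrative:

```python
import math


def sample_std_dev(values):
    """Standard deviation with n - 1 in the denominator, as described above."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    mean = sum(values) / n
    # difference between each point and the mean, squared, then summed
    squared_diffs = sum((x - mean) ** 2 for x in values)
    # divide by (n - 1), then take the square root
    return math.sqrt(squared_diffs / (n - 1))


sample = [2, 4, 4, 6, 9]  # hypothetical: mean 5, squared differences sum to 28
print(round(sample_std_dev(sample), 3))  # 2.646 (the square root of 28 / 4)
```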

29
Q

how can you use the std dev to determine whether a data point is an outlier?

A

if the data point falls more than 3 SDs from the mean, it is considered an outlier

30
Q

what are the three main causes of outliers + examples?

A
  1. a true statistical anomaly (a person who is over 7 feet tall)
  2. a measurement error (reading the tape measure wrong)
  3. a distribution that is not approximated by the normal distribution (a skewed distribution with a long tail)
31
Q

how do you approach the data set across each of the three causes of outliers?

A
  1. measurement error –> exclude the data from the analysis
  2. true measurement, but not representative –> weight to reflect its rarity, include normally, or exclude (should be decided ahead of the study, not after the outlier is found)
  3. not normal distribution? repeated or larger samples will demonstrate the truth
32
Q

defn: independent vs. dependent events

A

independent events: have no effect on one another

dependent events: do have an impact on one another, such that the order changes the probability

33
Q

defn: mutually exclusive

A

outcomes that cannot occur at the same time

the probability of two mutually exclusive outcomes occurring together is 0%

34
Q

does the term mutually exclusive apply to events? or only outcomes?

A

only outcomes

35
Q

defn + example: exhaustive

A

a group of outcomes is said to be exhaustive if there are no other possible outcomes

example: flipping heads or tails are the exhaustive outcomes of a coin flip

36
Q

how do you calculate the probability of two or more independent events occurring at the same time?

A

P(A and B) = P(A) x P(B)

37
Q

how do you calculate the probability of at least one of two independent events occurring?

A

P(A or B) = P(A) + P(B) - P(A and B)
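
A brief Python sketch of this rule together with the product rule from the previous card. It assumes the two events are independent; the coin-flip numbers are illustrative:

```python
def p_both(p_a, p_b):
    """P(A and B) for independent events: multiply the probabilities."""
    return p_a * p_b


def p_either(p_a, p_b):
    """P(A or B): add the probabilities, then subtract the overlap."""
    return p_a + p_b - p_both(p_a, p_b)


# hypothetical example: heads on each of two fair coin flips
print(p_both(0.5, 0.5))    # 0.25
print(p_either(0.5, 0.5))  # 0.75
```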

38
Q

what do hypothesis testing and confidence intervals allow us to do?

A

to draw conclusions about populations based on our sample data

39
Q

defn: null hypothesis

A

a hypothesis of equivalence

says that two populations are equal or that a single population can be described by a parameter equal to a given value

40
Q

what are the two options for the alternative hypothesis?

A

nondirectional: the populations are not equal

directional: example, the mean of population A is greater than the mean of population B

41
Q

what distributions do z-tests and t-tests rely on?

A

z-tests: standard normal distribution

t-tests: t-distribution

42
Q

defn: test statistic/p-value

A

test statistic: calculated from the data collected and compared to a table to find the p-value

p-value: the likelihood that a statistic at least that extreme was obtained by random chance, under the assumption that our null hypothesis is true

43
Q

func + most common value + greek letter + meaning in words: significance level

A

func: compare to our p-value

greek letter: alpha

common value: 0.05

meaning: the level of risk we are willing to accept for incorrectly rejecting the null hypothesis

44
Q

how do we respond to the null hypothesis if the p-value is greater than alpha?

if the p-value is less than alpha?

AND what does it mean?

A

p-value > alpha: we fail to reject the null hypothesis = there is not a statistically significant difference between the two populations

p-value < alpha: we reject the null hypothesis = there is a statistically significant difference between the two groups
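
To make the decision rule concrete, here is a sketch of a two-sided z-test in Python. The z formula (sample mean minus the null-hypothesis mean, divided by SD/√n) is the standard one but is not given on these cards, and the cards do not say how to handle a p-value exactly equal to alpha, so the code's choice (fail to reject) is an assumption. All numbers are hypothetical:

```python
import math
from statistics import NormalDist


def two_sided_z_test(sample_mean, null_mean, population_sd, n, alpha=0.05):
    """Return (z, p_value, decision) for a two-sided z-test."""
    standard_error = population_sd / math.sqrt(n)
    z = (sample_mean - null_mean) / standard_error
    # chance of a statistic at least this extreme if the null is true
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    if p_value < alpha:
        decision = "reject the null hypothesis"
    else:
        # assumption: p_value equal to alpha also lands here
        decision = "fail to reject the null hypothesis"
    return z, p_value, decision


z, p, decision = two_sided_z_test(103, 100, 15, 100)
print(z, round(p, 4), decision)  # 2.0 0.0455 reject the null hypothesis
```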

45
Q

defn: type I vs. type II error

A

type I error = occurs when we incorrectly reject the null hypothesis = the likelihood that we report a difference between two populations when one does not actually exist

type II error = occurs when we incorrectly fail to reject the null hypothesis = the likelihood that we report no difference between two populations when one actually exists

46
Q

what is the greek letter for a type II error?

A

beta

47
Q

defn + eqn: power

A

the probability of correctly rejecting a false null hypothesis

= 1 - β (beta)

48
Q

defn: confidence

A

the probability of correctly failing to reject a true null hypothesis = reporting no difference between two populations when one does not exist

49
Q

defn + calc: confidence intervals

A

the reverse of hypothesis testing

we determine a range of values from the sample mean and standard deviation. rather than finding a p-value, we begin with a desired confidence level (95% is standard) and use a table to find its corresponding z- or t-score

when we multiply the z- or t-score by the standard deviation (strictly, the standard error of the mean, SD/√n, when the interval is for a mean), and then add and subtract this number from the mean, we create a range of values
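
A sketch of a z-based confidence interval for a mean in Python. It assumes the score is multiplied by the standard error (SD/√n), the usual choice for an interval around a mean, and it looks up the z-score with `statistics.NormalDist` instead of a table. The sample numbers are hypothetical:

```python
import math
from statistics import NormalDist


def confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Return (low, high) around the sample mean using a z-score."""
    # z-score that leaves (1 - confidence) / 2 in each tail
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sample_sd / math.sqrt(n)  # z times the standard error
    return sample_mean - margin, sample_mean + margin


low, high = confidence_interval(100, 15, 100)
print(round(low, 2), round(high, 2))  # 97.06 102.94
```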

50
Q

defn: charts

A

present information in a visual format and are frequently used for categorical data

51
Q

func + downside: pie/circle charts

A

used to represent relative amounts of entities and are especially popular in demographics

downside: as the number of represented categories increases, the visual representation loses impact and becomes confusing

52
Q

func + benefit: bar charts

A

used for categorical data

likely to contain significantly more info than a pie chart for the same amount of page space

length of a bar is generally proportional to the value it represents

53
Q

what should you be wary of in bar charts?

A

graphs that contain breaks! they may be enlarging the difference between bars

54
Q

func: histogram + benefit

A

present numerical data rather than discrete categories

useful for determining the mode of a data set because they are used to display the distribution of a data set

55
Q

func: box plots

A

used to show the range, median, quartiles, and outliers for a set of data

56
Q

defn: box-and-whisker

A

a labeled box plot

the box is bounded by Q1 and Q3
Q2 (median) is the line in the middle of the box
ends of the whiskers correspond to max and min values of the data set

OR outliers can be presented as individual data points, with the ends of the whiskers corresponding to the largest and smallest values in the data set that are no more than 1.5 × IQR below Q1 or above Q3

57
Q

what are box-and-whisker plots useful for and why?

A

comparing data because they contain a large amount of data in a small amount of space, and multiple plots can be oriented on a single axis

57
Q

how to approach a graph on test day? (3)

A
  1. attempt to draw rough conclusions immediately
  2. do not spend time analyzing all the details of the graph unless asked to do so by a question
  3. look at the axes first
58
Q

what makes good map data?

A

examining one or at most two pieces of information simultaneously

any further data may inhibit clarity

59
Q

defn + char (2): linear graphs

A

show the relationships between two variables

typically involve two direct measurements

do not have to be a straight line

60
Q

what are the 4 shapes of curves on linear graphs?

A
  1. linear
  2. parabolic
  3. exponential
  4. logarithmic
61
Q

defn: slope (m)

A

change in the y-direction divided by the change in the x-direction for any two points
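
A minimal Python sketch of the slope calculation. The function name and points are illustrative:

```python
def slope(point_1, point_2):
    """Change in y divided by change in x between two (x, y) points."""
    (x1, y1), (x2, y2) = point_1, point_2
    if x2 == x1:
        raise ValueError("slope is undefined for a vertical line")
    return (y2 - y1) / (x2 - x1)


print(slope((1, 2), (3, 8)))  # 3.0
```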

62
Q

defn + benefit + char (2): semilog graphs

A

a specialized representation of a logarithmic data set

may be easier to interpret because the otherwise curved nature of the log data is made linear by a change in the axis ratio

one axis (usually x) maintains traditional unit spacing

other axis assigns spacing based on a ratio (the multiples may be of any number as long as there is consistency in the ratio from one point on the axis to the next)
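
A short Python illustration of why a ratio-spaced axis straightens this kind of data. The made-up data set doubles at each step; taking the base-10 log of each value turns that constant ratio into a constant step, which is what a straight line on a semilog graph means:

```python
import math


def log_steps(ys):
    """Differences between successive log10 values of the data."""
    logs = [math.log10(y) for y in ys]
    return [b - a for a, b in zip(logs, logs[1:])]


ys = [5 * 2 ** x for x in range(5)]  # hypothetical data: 5, 10, 20, 40, 80
print([round(step, 3) for step in log_steps(ys)])  # [0.301, 0.301, 0.301, 0.301]
```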

63
Q

defn + char: log-log graph

A

both axes use a constant ratio from point to point on the axis

64
Q

how should you approach a table on test day? (4)

A
  1. take a brief moment to glance at the title of a table
  2. tables that do not have unusual data values should be approached especially briefly
  3. when a table DOES contain significant organization, this structure is likely to be relevant to answering questions (e.g. a trend that suddenly appears or disappears)
  4. you should be able to convert the data to a rough graph or linear equation
65
Q

defn: correlation

A

refers to a connection (direct relationship, inverse, or otherwise) between data

66
Q

defn: positive vs. negative correlation

A

POSITIVE: if two variables trend together (as one increases so does the other)

NEGATIVE: if two variables trend in opposite directions (one increases as the other decreases)

67
Q

defn + meaning of values: correlation coefficient

A

a number between -1 and +1 that represents the strength of a relationship

+1 = strong positive relationship
-1 = strong negative relationship
0 = no apparent relationship
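
A sketch of computing a correlation coefficient in Python. It assumes the coefficient meant here is Pearson's r, which the card does not name; the data are hypothetical:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("r is undefined when one variable never changes")
    return covariance / (spread_x * spread_y)


print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 3))  # 1.0  (trend together)
print(round(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]), 3))  # -1.0 (trend opposite)
```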

68
Q

does correlation imply causation?

A

no, not necessarily. avoid this assumption

69
Q

what are the 6 things we should do when + after interpreting data?

A
  1. state the apparent relationships between data
  2. begin to draw connections to other concepts in science and to our background knowledge
  3. consider the impact of the new data on the existing hypothesis
  4. ideally, the new data would be integrated into all future investigations on the topic
  5. develop a plausible rationale for the results
  6. make decisions about our data’s impact on the real world and determine whether or not our evidence is substantial and impactful enough to necessitate changes in understanding or policy