Midterm 1: Ch 1-10 Flashcards

(159 cards)

1
Q

What is statistics?

A

quantitative technology for empirical science – logic and methodology for the measurement of uncertainty, and for an examination of that uncertainty

2
Q

What are the goals of statistics (2)?

A
  • estimate the values of important parameters
  • test hypotheses about those parameters

3
Q

What is data?

A

measurements of one or more variables made on a collection of individuals

4
Q

What is a variable?

A

characteristic measured on individuals drawn from a population under study

5
Q

What are the two types of variables?

A
  • response variable (dependent variable)
  • explanatory variable (independent variable)

6
Q

What is a response variable?

A

(dependent variable – y-axis) variable that we try to predict or explain from the explanatory variable

7
Q

What is an explanatory variable?

A

(independent variable – x-axis) variable used to predict or explain the response variable

8
Q

What are parameters?

A

descriptive measures of an entire population

  • population parameters are constants
  • e.g. mean length of salmon
9
Q

What are estimates?

A

descriptive measures of a sample

  • random variables – change from one random sample to the next, from the same population
  • e.g. mean of some sample of salmon
10
Q

Do samples look exactly like the population?

A

no

11
Q

What is a sample of convenience?

A

collection of individuals that happen to be available at the time – biased

12
Q

What is bias?

A

systematic discrepancy between estimates and the true population characteristic

13
Q

What are the goals of estimation? (2)

A
  • accuracy
  • precision

14
Q

What is accuracy?

A

on average gets the correct answer

  • accurate = unbiased
  • inaccurate = biased
15
Q

What is precision?

A

gives a similar answer repeatedly

16
Q

What are some determinants of precision (when unbiased)?

A
  • sample size
  • precision of instrument

17
Q

Unbiased and Precise

A
  • on average, answer is correct
  • repeated samples/estimates have very similar results

18
Q

Unbiased and Imprecise

A

on average, answer is accurate, BUT each individual estimate is off

19
Q

Biased and Precise

A
  • most dangerous – may not even realize there’s a problem, and may have a lot of false confidence in the answer
  • repeated samples/estimates have very similar results, BUT average value of estimates is off
20
Q

Biased and Imprecise

A
  • on average, answer is incorrect
  • unconfident in the answer, but best guess would be wrong anyways – not as deadly as being confident and wrong

21
Q

What are properties of a good sample? (3)

A
  • independent selection of individuals
  • random selection of individuals
  • sufficiently large
22
Q

What is a random sample?

A

each member of a population has an equal and independent chance of being selected

23
Q

What is independent sampling?

A

chance of an individual being included in the sample does NOT depend on who else is sampled

24
Q

What is sampling error?

A

difference between the estimate and average value of the estimate

measurement of precision

25
Do smaller or larger samples have smaller sampling error?
larger samples → smaller sampling error, on average
26
What is high sampling error?
low precision: estimates differ widely from one sample (study) to the next
27
What is low sampling error?
high precision: small differences, i.e. low variance between the estimates obtained each time we do a study
28
What are the two types of data?
- categorical variables (class or nominal variables) | - numerical variables (quantitative variables)
29
What are categorical variables?
fall into categories
30
What are the 2 types of numerical variables?
  • continuous: can be measured, e.g. arm length, height, weight, age
  • discrete: can be counted, e.g. number of limbs, number of offspring, number of petals
31
What is a frequency table?
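table showing the number of observations (frequency) in each category or value of a variable; frequency is NOT a variable – it counts observations rather than measuring individuals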
32
What graph do you use for graphing categorical variables?
bar graph
33
What graph do you use for graphing numerical variables?
- histogram | - cumulative frequency distribution (CDF)
34
What data do histograms graph?
continuous numerical variables
  • no gaps between bars – conveys that these are continuous variables running together
  • widths of the bars are the same
35
What is cumulative frequency of a value?
proportion of individuals less than or equal to that value
  • 0 = none of the individuals are at or below that value
  • 1 = all individuals are at or below that value
36
What is a contingency table?
describes association between two (or more) categorical variables by displaying frequencies of all combinations of categories
37
What graphs are used for graphing the association between two categorical variables?
- contingency table - grouped bar graph - mosaic plot
38
What data do mosaic plots use?
relative frequencies scaled to 1 – does NOT use discrete numbers
  • width of bars indicates the number of individuals in the treatment
39
What data do stacked bar plots use?
discrete numbers or frequency
40
What graphs are used for graphing the association between a categorical (x-axis) and numerical (y-axis) variable?
- multiple histogram - cumulative frequency distribution (CDF) - box plot
41
What graphs are used for graphing the association between two numerical variables?
scatter plot
42
What are two common descriptions of data?
- location: central tendency | - width: spread – how variable the data is
43
What are 3 measures of location?
- mean - median - mode
44
What is the mean (or average)?
add all measurements together and divide by the number of data points – centre of gravity of the distribution
45
What is the median?
  • odd number of measurements: middle measurement in a set of ordered data
  • even number of measurements: average of the two middle measurements in a set of ordered data
46
What is the mode?
most frequent measurement
47
Why might the mean and median be different?
skewed data – lot of the weight is on one side of the distribution
48
Why might the mean and median be the same or similar?
symmetrical distribution of data – bell-shaped
49
Mean vs. Median
- mean has nice statistical properties, can be quantified easily using theories - mean has good predictive behaviours
50
What are the 4 measures of width?
- range - variance - standard deviation - coefficient of variation
51
What is the range?
maximum minus minimum - poor measure of distribution width – useless in statistics
52
Is sample range a biased estimator of the true population range?
yes, smaller sample → lower estimates of range - sample range is not expected to match population range
53
In the equation for variance, why do we square the value
if we took unsquared value, negative and positive deviations cancel out
54
What is sample variance?
unbiased estimator of population variance – used to try to learn about population variance
55
What is standard deviation?
positive square root of the variance
  • σ: true (population) standard deviation
  • s: sample standard deviation – estimate of the population standard deviation (s^2 is an unbiased estimator of σ^2, but s itself is slightly biased)
56
What is the coefficient of variation (CV)?
standard deviation expressed as a percentage of the mean: CV = 100% × s / Ȳ – good for comparing distributions of different magnitudes
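A minimal Python sketch of the location and width measures from the cards above, using only the standard library. The sample values are made up for illustration; the coefficient of variation is computed as 100 × s / mean.

```python
import statistics

# Hypothetical sample (made-up values, for illustration only)
sample = [2.0, 4.0, 4.0, 5.0, 10.0]

mean = statistics.mean(sample)          # sum of values / number of data points
median = statistics.median(sample)      # middle value of the ordered data
mode = statistics.mode(sample)          # most frequent measurement
data_range = max(sample) - min(sample)  # maximum minus minimum
variance = statistics.variance(sample)  # sample variance s^2 (divides by n - 1)
sd = statistics.stdev(sample)           # sample standard deviation s
cv = 100 * sd / mean                    # coefficient of variation, in percent

print(mean, median, mode, data_range, variance, sd, cv)
```

The single large value (10.0) pulls the mean above the median, which is the right-skew pattern described in the mean vs. median cards.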
57
What is skew?
measurement of asymmetry – refers to the pointy tail of the distribution
  • right-skewed: pointy tail is on the right
  • left-skewed: pointy tail is on the left
58
Mean – Nomenclature
population parameter: µ; sample statistic: Ȳ
59
Variance – Nomenclature
population parameter: σ^2; sample statistic: s^2
60
Standard Deviation – Nomenclature
population parameter: σ; sample statistic: s
61
Manipulating Means: Mean of Sum of Two Variables
E[X + Y] = E[X] + E[Y]
62
Manipulating Means: Mean of Sum of Variable and Constant
E[X + c] = E[X] + c, e.g. temperature conversions
63
Manipulating Means: Mean of Product of Variable and Constant
E[c X] = c E[X], e.g. measurement conversions
64
Manipulating Variance: Variance of Sum of Two Variables
Var[X + Y] = Var[X] + Var[Y], ONLY if X and Y are independent
65
Manipulating Variance: Variance of Sum of Variable and Constant
Var[X + c] = Var[X]; spread of data has not changed, so the variance is the same, e.g. adding 10 cm to every measurement
66
Manipulating Variance: Variance of Product of Variable and Constant
Var[c X] = c^2 Var[X]; variance is in units^2, therefore multiply by constant^2
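The constant rules for means and variances can be checked numerically. The sketch below uses a made-up list of measurements and a made-up constant, and treats the list as a whole population (hence `pvariance`); it is only an arithmetic check, not a proof.

```python
import statistics

# Hypothetical measurements and constant (made-up values)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
c = 10.0

mean_x = statistics.mean(x)        # E[X]
var_x = statistics.pvariance(x)    # Var[X], treating x as the whole population

shifted = [xi + c for xi in x]     # X + c
scaled = [c * xi for xi in x]      # c X

mean_shifted = statistics.mean(shifted)      # should equal E[X] + c
var_shifted = statistics.pvariance(shifted)  # should equal Var[X]
mean_scaled = statistics.mean(scaled)        # should equal c E[X]
var_scaled = statistics.pvariance(scaled)    # should equal c^2 Var[X]

print(mean_shifted, var_shifted, mean_scaled, var_scaled)
```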
67
What happens every time we take a sample from a population?
every sample will look different
68
What happens if we take many samples from a population?
samples will look similar to each other
69
How does variance change with sample size?
larger sample size = smaller variance of the sampling distribution of the mean
70
What is standard error of an estimate?
standard deviation of its sampling distribution; predicts the sampling error of the estimate
71
What is the problem with the equation for the standard error of the mean?
the equation, σ_Ȳ = σ / √n, requires 𝜎, but in most cases we don’t know 𝜎 – we only have a sample
72
What is the estimate of the standard error of the mean?
SE_Ȳ = s / √n (sample standard deviation in place of 𝜎); gives some knowledge of the likely difference between sample mean and true population mean
73
What is the 95% confidence interval?
provides a plausible range for a parameter - all values for the parameter within the interval are plausible - all values for the parameter outside the interval are unlikely
74
What is the 2SE rule-of-thumb?
the interval from Ȳ – 2 SE to Ȳ + 2 SE, which provides a rough estimate of the 95% CI for the mean, assuming a normally distributed population and/or a sufficiently large sample size
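As a rough sketch, the estimated standard error of the mean (s / √n) and the 2SE rule of thumb can be computed as follows. The sample is made up and deliberately tiny, so it only illustrates the arithmetic; with so few points the rule of thumb is not a good approximation.

```python
import math
import statistics

# Hypothetical sample (made-up values); far too small for the rule of thumb
# to be reliable, used here only to show the calculation.
sample = [2.0, 4.0, 4.0, 5.0, 10.0]

n = len(sample)
mean = statistics.mean(sample)
s = statistics.stdev(sample)   # sample standard deviation
se = s / math.sqrt(n)          # estimated standard error of the mean

# 2SE rule of thumb: rough 95% confidence interval for the population mean
ci_low = mean - 2 * se
ci_high = mean + 2 * se

print(se, ci_low, ci_high)
```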
75
Correct or Incorrect: "we are 95% confident that the population mean lies within the 95% CI"
correct
76
Correct or Incorrect: "there is a 95% probability that the population mean is within a particular 95% CI"
incorrect – the population mean is a constant, so a particular interval either contains it or it does not
77
What is pseudoreplication?
error that occurs when samples are not independent, but they are treated as though they are, e.g. taking multiple measurements from one individual and using each as an individual of the sample
  • EXAMPLE: taking 10 measurements from each of 6 climbers to get 60 measurements
  • to avoid pseudoreplication: take the mean of the 10 measurements for each climber, so that you have 6 values, one for each climber (n = 6)
78
What is the probability of an event?
its true relative frequency – proportion of times event would occur if we repeated same process over and over again
79
What does mutually exclusive mean?
when two events cannot both be true: Pr[A and B] = 0
80
What does independent mean?
when the occurrence of one event gives no information about whether the second event will occur
81
What is the probability distribution?
describes the true relative frequency of all possible values of a random variable; all probabilities have to sum to 1
82
What is the addition principle?
if two events A and B are mutually exclusive, Pr[A or B] = Pr[A] + Pr[B]
83
What is the probability of a range?
Pr[number ≥ 6] = Pr[6] + Pr[7] + Pr[8]...
84
What is the probability of 'not'?
Pr[not rolling a 2] = 1 – Pr[rolling a 2] = 5/6
85
What is the general addition principle?
Pr[A or B] = Pr[A] + Pr[B] – Pr[A and B]; need to subtract Pr[A and B], otherwise it’ll be counted twice
86
What is the multiplication principle?
if two events A and B are independent, Pr[A and B] = Pr[A] × Pr[B]
87
What is the general multiplication principle?
Pr[A and B] = Pr[A] Pr[B | A] and Pr[A and B] = Pr[B] Pr[A | B]; therefore, Pr[A] Pr[B | A] = Pr[B] Pr[A | B]
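The probability rules above can be checked by brute-force counting on a fair six-sided die. This is a small sketch; the helper `pr` and the example events are illustrative names, not part of the course material.

```python
from fractions import Fraction

outcomes = range(1, 7)  # faces of a fair six-sided die


def pr(event):
    """Probability of an event, given as a true/false test on a single roll."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def is_even(o):
    return o % 2 == 0


def at_least_5(o):
    return o >= 5


def both(o):
    return is_even(o) and at_least_5(o)


# 'not' rule: Pr[not rolling a 2] = 1 - Pr[rolling a 2]
p_not_two = 1 - pr(lambda o: o == 2)

# general addition principle: Pr[A or B] = Pr[A] + Pr[B] - Pr[A and B]
p_or = pr(is_even) + pr(at_least_5) - pr(both)
p_or_direct = pr(lambda o: is_even(o) or at_least_5(o))

# multiplication principle for independent events: two sixes on two rolls
p_two_sixes = pr(lambda o: o == 6) * pr(lambda o: o == 6)

# general multiplication principle: Pr[A and B] = Pr[A] Pr[B | A]
p_b_given_a = pr(both) / pr(is_even)  # Pr[at least 5 | even]
p_and = pr(is_even) * p_b_given_a

print(p_not_two, p_or, p_two_sixes, p_and)
```

Using `Fraction` keeps the results exact, so the rule-based and directly counted probabilities can be compared with `==`.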
88
What are dependent events?
probability of one event depends on the outcome of another event
89
Are variables always independent?
no
90
What is the conditional probability of an event?
probability of that event occurring given that a condition is met; Pr[X | Y] is the probability of X given Y (if Y is true)
91
Law of Total Probability
Pr[X] = Σ Pr[X | Y] Pr[Y], summed over all possible (mutually exclusive) values of Y
92
When is Bayes' Theorem used?
when you want to flip a conditional probability, i.e. obtain Pr[A | B] from Pr[B | A]: Pr[A | B] = Pr[B | A] Pr[A] / Pr[B]
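A short worked sketch of flipping a conditional probability with Bayes' theorem, with the denominator obtained from the law of total probability. All the numbers (a disease prevalence and two test properties) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from fractions import Fraction

# Hypothetical inputs (made-up numbers)
p_disease = Fraction(1, 100)            # Pr[disease]
p_pos_given_disease = Fraction(9, 10)   # Pr[positive | disease]
p_pos_given_healthy = Fraction(1, 20)   # Pr[positive | no disease]

# Law of total probability: sum over the two mutually exclusive conditions
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: Pr[disease | positive] = Pr[positive | disease] Pr[disease] / Pr[positive]
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(p_pos, p_disease_given_pos)
```

Even with a fairly sensitive test, Pr[disease | positive] is much smaller than Pr[positive | disease] here because the condition is rare, which is why the two conditional probabilities must not be confused.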
93
What is hypothesis testing?
asks how unusual it is to get data that differ from the null hypothesis
  • if the data would be quite unlikely under H0, we reject H0
  • assumes random sampling
  • hypotheses are about populations, but are tested with data from samples
94
What is the null hypothesis?
specific statement about a population parameter made for the purposes of argument - simplest statement - specific - good H0 would be interesting if proven wrong
95
What is an alternative hypothesis?
represents all other possible parameter values except that stated in the null hypothesis
  • statement of greatest interest
  • non-specific
96
Steps of Hypothesis Testing
  • population → sample → estimate → test statistic
  • null hypothesis → test statistic
  • null hypothesis → construct a new population under H0 → imagined repeated sampling → sample from H0 and calculate test statistic → null distribution of test statistic
  • test statistic + null distribution of test statistic:
      - how weird would these data be if the null hypothesis were true?
      - compare distribution from H0 to observed sample
      - how likely would it be to obtain our data sample if H0 were true?
97
What is a test statistic?
number calculated to represent/summarize the match between a set of data and the null hypothesis
  • can be compared to a general distribution to infer probability – for any given value of a test statistic, we can say how likely those possible outcomes are
98
What is a null distribution (sampling distribution under H0)?
probability distribution of alternative outcomes for a test statistic when a random sample is taken from a population corresponding to the null expectation
99
If H0 is true, do we expect variance between samples?
yes; need to evaluate the range and distribution of test statistics we would obtain if we sampled repeatedly
100
What is the P-value?
probability of getting the data, or something as or more unusual/extreme, if the null hypothesis were true
  • NOT the probability that H0 is true
  • NOT the probability that HA is true
101
How can we find P-values? (3)
- simulation - parametric tests - permutation
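Of the three approaches, simulation is the easiest to sketch in code. The example below assumes a hypothetical study with 20 independent trials and 16 observed successes, tests H0: p = 0.5, and uses the number of successes as the test statistic; the counts, the seed, and the number of simulations are all made up for illustration.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical study (made-up numbers)
n_trials = 20
observed_successes = 16
p0 = 0.5                      # proportion proposed by the null hypothesis
expected = n_trials * p0      # expected number of successes under H0
n_sims = 10_000               # number of simulated samples from H0


def simulate_successes():
    """Number of successes in one sample drawn from the H0 population."""
    return sum(1 for _ in range(n_trials) if random.random() < p0)


# Two-tailed: count simulated outcomes at least as far from the null
# expectation as the observed result, in either direction.
observed_deviation = abs(observed_successes - expected)
n_extreme = sum(
    1 for _ in range(n_sims)
    if abs(simulate_successes() - expected) >= observed_deviation
)

p_value = n_extreme / n_sims
reject_h0 = p_value <= 0.05   # compare with the significance level alpha

print(p_value, reject_h0)
```

The simulated samples play the role of the null distribution: the P-value is simply the fraction of them that are as or more extreme than the observed data.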
102
What is the significance level?
acceptable probability of rejecting a true null hypothesis 𝜶 = usually 0.05
103
What does it mean to be statistically significant?
if p-value for a test is ≤ 𝜶, then H0 is rejected
104
What does it mean to be statistically insignificant?
if p-value for a test is > 𝜶, then H0 is NOT rejected
105
How does sample size influence the range of test statistic we see under the null hypothesis?
larger sample → estimate has smaller confidence interval larger sample → more power to reject a false null hypothesis
106
What is a Type I error?
rejecting a true null hypothesis probability of Type I error is 𝜶 (significance level)
107
What is a Type II error?
not rejecting a false null hypothesis
  • probability of Type II error is 𝜷, which depends on:
      - what the real world looks like
      - our sample size
      - our 𝜶
  • smaller 𝜷 = larger power of the test
108
What is power?
ability of a test to reject a false null hypothesis – how likely we will reject it 1 – 𝜷
109
How are power and sample size related?
larger sample size (more information) → larger power
  • increase sample size → decrease standard deviation of null distribution → increase power to reject a false H0
110
What is a two-tailed test?
deviation in either direction would reject the null hypothesis - most tests are two-tailed - normally 𝜶 is divided into 𝜶/2 on one side, and 𝜶/2 on the other
111
When is a one-tailed test used?
only when the other tail is nonsensical, e.g. comparing grades on a multiple choice test to those expected by random guessing
112
What is a critical value?
value of a test statistic beyond which the null hypothesis can be rejected - we never ‘accept the null hypothesis’
113
Where is the value proposed by a rejected null hypothesis relative to the 95% CI?
in general, if a hypothesis test rejects a null hypothesis (p < 0.05), the value proposed by the null hypothesis is outside the 95% confidence interval
114
2-Sample T-tests
115
2-Sample T-tests: The more different the sample means are....
(when taking into account sample spread and size, and assuming we’ve randomly sampled), the less likely it is they were drawn from populations with the same mean
116
2-Sample T-tests: What would a P-value of 0.03 mean?
there is a 3% chance of getting means that are at least this different if they’re drawn from populations with the same mean
117
2-Sample T-tests: Higher vs. Lower P-values
  • higher p-values:
      - higher probability of 2 sample means being at least this different, if drawn from populations with same mean
      - less evidence of differences between population means
  • lower p-values:
      - lower probability of 2 sample means being at least this different, if drawn from populations with same mean
      - more evidence of differences between population means
118
What is a confounding variable?
unmeasured variable that may be the cause of both X and Y
119
What is a proportion?
fraction of individuals having a particular attribute
120
What is a binomial distribution?
describes the probability of a given number of ‘successes’ from a fixed number of independent trials, each with the same probability of success
121
What are the 2 properties of binomial distributions?
  • mean of number of successes: μ = n p
  • variance of number of successes: σ^2 = n p (1 – p)
122
What is the estimate of a proportion?
number of ‘successes’ over total sample size: p̂ = X / n
123
What are the 2 properties of sample proportions?
  • mean: p
  • variance: p (1 – p) / n
124
How does the standard error of the estimate of a proportion change with sample size?
larger sample → lower standard error
125
What is the name for the 95% CI for a proportion?
Agresti-Coull confidence interval
126
How does the Agresti-Coull confidence interval change with sample size?
larger sample → more symmetrical distribution
127
For the Agresti-Coull confidence interval, what are the +2 and +4 factors?
‘fudge factors’ that are there for more asymmetrical distributions
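A minimal sketch of the Agresti-Coull interval, assuming the common textbook form in which 2 is added to the number of successes and 4 to the sample size before applying the usual ± 1.96 standard errors. The function name and the example counts are illustrative, not taken from the course material.

```python
import math


def agresti_coull_ci(successes, n, z=1.96):
    """Approximate 95% CI for a proportion using the +2 / +4 adjustment."""
    p_adj = (successes + 2) / (n + 4)                 # adjusted proportion
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / (n + 4))
    return p_adj - half_width, p_adj + half_width


# Hypothetical data: 8 successes out of 16 individuals
low, high = agresti_coull_ci(8, 16)
print(low, high)

# Same observed proportion with a larger sample gives a narrower interval
low_big, high_big = agresti_coull_ci(80, 160)
print(low_big, high_big)
```

The second call illustrates the sample-size card: the observed proportion is unchanged, but the interval shrinks as n grows.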
128
What is Murphy's law?
anything that can go wrong will go wrong
129
What is the binomial test?
uses data to test whether a population proportion (p) matches a null expectation for the proportion
  • H0: relative frequency of successes in the population is p0
  • HA: relative frequency of successes in the population is not p0
130
What is a goodness-of-fit test?
compares count data to a probability distribution (expected frequencies) of a set of categories
131
What are the hypotheses for a 𝜒2 goodness-of-fit test?
  • H0: data come from a particular probability distribution
  • HA: data do NOT come from that distribution
132
What is the test statistic for 𝜒2 goodness-of-fit test?
𝜒2 = Σ (Observed – Expected)^2 / Expected, summed over all categories
133
How does the number of categories affect 𝜒2 goodness-of-fit test?
the more categories you have, the more opportunities to deviate from expectations
134
What is the degree of freedom of a test?
specifies which of a family of distributions to use
135
What is the equation for degrees of freedom for 𝜒2 test?
df = (number of categories) - (number of parameters estimated from the data) - 1
136
What is the critical value?
value of the test statistic where P = 𝛼
  • if observed 𝜒2 > critical 𝜒2 for the given df, we reject the null hypothesis
  • if observed 𝜒2 < critical 𝜒2 for the given df, we DO NOT reject the null hypothesis
137
What is a test statistic?
number calculated from the data and the null hypothesis that can be compared to a standard distribution to find the P-value of the test
138
Can 𝜒2 goodness-of-fit test substitute for binomial test?
yes, because it works even when there are only two categories - very useful if the number of data points is large - BUT not recommended if binomial test is possible – two categories (success and failure)
139
What are the assumptions of the 𝜒2 goodness-of-fit test?
- no more than 20% of categories have Expected < 5 - no category with Expected ≤ 1 (if needed, combine categories to satisfy these requirements)
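The goodness-of-fit calculation can be sketched by hand as below. The observed counts and null proportions are made up; the critical value 5.991 is the standard table value for df = 2 at 𝛼 = 0.05, hard-coded here as an assumption rather than computed.

```python
# Hypothetical count data for three categories (made-up numbers)
observed = [30, 50, 20]
null_proportions = [0.25, 0.50, 0.25]   # probability distribution under H0

n = sum(observed)
expected = [n * p for p in null_proportions]

# Assumption checks: no more than 20% of categories with Expected < 5,
# and no category with Expected <= 1
frac_small = sum(1 for e in expected if e < 5) / len(expected)
assumptions_ok = frac_small <= 0.20 and all(e > 1 for e in expected)

# Test statistic: sum over categories of (Observed - Expected)^2 / Expected
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# df = (number of categories) - (parameters estimated from the data) - 1
n_estimated_params = 0
df = len(observed) - n_estimated_params - 1

critical_value = 5.991   # table value for df = 2, alpha = 0.05 (assumed)
reject_h0 = chi2 > critical_value

print(chi2, df, assumptions_ok, reject_h0)
```

Here the observed 𝜒2 is below the critical value, so H0 is not rejected.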
140
What is a discrete distribution?
probability distribution describing a discrete numerical random variable
  • e.g. number of heads from 10 flips of a coin
  • e.g. number of flowers in a square meter
  • e.g. number of disease outbreaks in a year
141
What is the Poisson distribution?
describes the probability that a certain number of events occur in a block of time or space, when those events happen independently of each other and occur with equal probability at every point in time or space
  • used to ask questions about random events (by chance)
142
What is contingency analysis?
test the independence of two or more categorical variables
143
What is the equation for degree of freedom for 𝜒2 contingency analysis?
df = (# of columns - 1) (# of rows - 1)
144
What are the assumptions for 𝜒2 contingency analysis?
this test is just a special case of the 𝜒2 goodness-of-fit test, therefore the same rules apply - no more than 20% of categories have Expected < 5 - no category with Expected ≤ 1
145
What is the Fisher's exact test?
test for 2 x 2 contingency analysis
  • does not make assumptions about the size of expectations
  • R (or other programs) will do it, but it is difficult to do by hand
146
What are odds?
probability of success divided by the probability of failure
147
What is odds ratio?
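odds of success in one group divided by the odds of success in another group
  • when ‘success’ is the undesirable outcome (e.g. getting sick) and the treatment group is in the numerator:
      - OR < 1 means treatment helps
      - OR > 1 means treatment makes things worse
  • OR = 1 means the odds are the same in both groups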
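A small sketch of odds and the odds ratio for a hypothetical 2 x 2 study. The counts are made up, and ‘success’ is taken to be the undesirable outcome (getting sick), which is the reading under which an odds ratio below 1 means the treatment helps.

```python
from fractions import Fraction


def odds(successes, total):
    """Odds = probability of success / probability of failure."""
    p = Fraction(successes, total)
    return p / (1 - p)


# Hypothetical counts: number of individuals who got sick in each group
odds_treatment = odds(10, 100)   # 10 of 100 treated individuals got sick
odds_control = odds(20, 100)     # 20 of 100 control individuals got sick

odds_ratio = odds_treatment / odds_control

print(odds_treatment, odds_control, odds_ratio)
```

With these numbers the odds ratio is below 1: the odds of getting sick are lower in the treatment group than in the control group.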
148
What is a normal distribution?
  • distribution fully described by its mean and standard deviation
  • symmetric around its mean
  • mean, median, and mode are all the same
  • about 68% of random draws from a normal distribution are within one standard deviation of the mean
  • 95% of random draws from a normal distribution are within about two (more precisely, 1.96) standard deviations of the mean
149
For a standard normal distribution, what is the mean?
mean (μ) = 0
150
For a standard normal distribution, what is the standard deviation?
standard deviation (σ) = 1
151
What is a standard normal table?
gives probability of getting a random draw from a standard normal distribution greater than a given value
152
Is a standard normal symmetric?
yes: Pr[Z > x] = Pr[Z < -x]
153
What is the total area under the curve of a standard normal distribution?
1; therefore Pr[Z < x] = 1 – Pr[Z > x]
154
Are all normal distributions shaped the same?
yes, just with different means and variances
155
Can any normal distribution be converted to a standard normal distribution?
yes, by Z, the standard normal deviate: Z = (Y – μ) / σ
  • Z tells us how many standard deviations Y is from the mean
  • probability of getting a value > Y is the same as the probability of getting a value > Z from a standard normal distribution
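A brief sketch of converting a value to a standard normal deviate and looking up the upper-tail probability, using the complementary error function from the standard library in place of a printed standard normal table. The population mean, standard deviation, and value of Y are hypothetical.

```python
import math


def upper_tail(z):
    """Pr[Z > z] for a standard normal distribution (mean 0, sd 1)."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical normal population and value of interest
mu, sigma = 100.0, 15.0
y = 130.0

z = (y - mu) / sigma          # number of standard deviations Y is from the mean
p_greater = upper_tail(z)     # Pr[value > Y] equals Pr[Z > z]

# Symmetry and total area of 1 give the probability of falling within k sd
within_one_sd = 1 - 2 * upper_tail(1.0)
within_196_sd = 1 - 2 * upper_tail(1.96)

print(z, p_greater, within_one_sd, within_196_sd)
```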
156
Are sample means normally distributed?
yes, if the variable itself is normally distributed
  • mean of the sample means: μ
  • standard deviation of the sample means: σ / √n
157
What is the standard error of an estimate of a mean?
the standard deviation of the distribution of sample means
158
What is the central limit theorem?
sum or mean of a large number of measurements randomly sampled from any population is approximately normally distributed
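A rough simulation sketch of the central limit theorem: sample means drawn from a strongly right-skewed population (an exponential distribution with mean 1 and standard deviation 1, chosen here as an assumption) cluster around the population mean with a spread close to σ / √n. The seed, sample size, and number of samples are arbitrary.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

n = 50             # measurements per sample
n_samples = 2000   # number of repeated samples


def sample_mean():
    """Mean of one random sample from an exponential population (mean 1, sd 1)."""
    return statistics.mean([random.expovariate(1.0) for _ in range(n)])


sample_means = [sample_mean() for _ in range(n_samples)]

mean_of_means = statistics.mean(sample_means)   # should be close to 1
sd_of_means = statistics.stdev(sample_means)    # should be close to 1 / sqrt(n)
predicted_se = 1.0 / math.sqrt(n)

print(mean_of_means, sd_of_means, predicted_se)
```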
159
Why do we “fail to reject H0” rather than “accept H0” after a test in which the P-value is calculated to be greater than α?
failing to reject H0 does not mean H0 is correct, because the power of the test might be limited
  • the null hypothesis is the default and is either rejected or not rejected