Statistics Presentation Notes Flashcards

(80 cards)

1
Q

What term describes a process in which the same inputs always produce exactly one output?

A

deterministic

2
Q

What is the opposite of something deterministic; in other words, what term refers to a process where the same inputs and factors produce multiple outputs?

A

stochastic

3
Q

What stochastic factor gives rise to multiple outputs when the true result is the same but the technology used to record the observation is imprecise?

A

measurement error

4
Q

What stochastic factor describes the variation which exists between subjects of study that gives rise to different results?

A

natural heterogeneity

5
Q

What stochastic factor includes variables like the disappearance of funding or poor weather?

A

uncontrollable factors

6
Q

What part of data analysis refers to visually displaying observations, removing the outliers, and subsetting the data?

A

preprocessing

7
Q

What are the two goals of exploratory data analysis (EDA)?

A
  1. identifying potential issues with the observed data
  2. taking note of trends which the intuition of the scientist doesn’t pick up
8
Q

What is the long-term frequency of an event taking place known as?

A

probability

9
Q

What word refers to probabilities associated with integers and categories (e.g., number of oranges on a tree)?

A

discrete

10
Q

What do statisticians employ to analyze discrete outcomes?

A

probability mass functions (PMFs)

11
Q

What are the four common kinds of distribution associated with probability mass functions?

A
  1. J. Bernoulli distribution
  2. S. Poisson distribution
  3. binomial distribution
  4. multinomial distribution
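
Worked sketch: a probability mass function assigns a probability to each possible discrete outcome, and those probabilities sum to 1. The snippet below is a minimal illustration using the binomial distribution; the function name `binomial_pmf` and the values n = 10 and p = 0.3 are assumptions chosen for the example, not part of the notes.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical example: 10 trials, success probability 0.3
pmf = [binomial_pmf(k, n=10, p=0.3) for k in range(11)]
for k, prob in enumerate(pmf):
    print(f"P(X = {k:2d}) = {prob:.4f}")
print("total probability:", round(sum(pmf), 6))
```

With n = 1 the same function reduces to the Bernoulli distribution.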
12
Q

What word refers to probabilities associated with non-integer numbers (e.g., likelihood that someone was born exactly 250 years after Horatio Nelson)?

A

continuous

13
Q

What do statisticians employ to analyze continuous outcomes?

A

probability density functions (PDFs)

14
Q

What are the three common kinds of distribution associated with probability density functions?

A
  1. beta distribution
  2. gamma distribution
  3. normal distribution
15
Q

What kinds of distributions do statisticians employ for continuous outcomes without either positive or negative constraints?

A

normal distribution

16
Q

Normal distributions form the cornerstone of what three kinds of data analyses?

A
  1. t-tests
  2. ANOVAs
  3. regression analysis (simple/multiple)
17
Q

The t-test, ANOVA, simple regression analysis and multiple regression analysis are all considered what?

A

linear models

18
Q

What theorem says that even if the data are not normally distributed, the means of sufficiently large samples will tend towards a normal distribution?

A

central limit theorem
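
Worked sketch: one way to see the central limit theorem is to draw repeated samples from a clearly non-normal population and watch the distribution of the sample means become more symmetric as the sample size grows. The exponential population, the sample sizes, and the helper names `sample_means` and `skewness` below are all assumptions made for illustration.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(1)  # fixed seed so the sketch is repeatable

def sample_means(sample_size, n_samples=2000):
    """Means of repeated samples from a skewed (exponential) population."""
    return [
        fmean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]

def skewness(values):
    """Population skewness: 0 for a symmetric distribution such as the normal."""
    m, s = fmean(values), pstdev(values)
    return fmean([((v - m) / s) ** 3 for v in values])

skew_by_size = {n: skewness(sample_means(n)) for n in (2, 30, 200)}
for n, skew in skew_by_size.items():
    print(f"sample size {n:>3}: skewness of sample means = {skew:.2f}")
```

The skewness of the sample means should shrink towards 0 as the sample size increases, which is the behavior the theorem describes.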

19
Q

What is the formula used to express a normal distribution?

A

y ~ N(mu, sigma^2)

20
Q

In the formula for the normal distribution, what do the following variables refer to:
1. “y”
2. “N”
3. “mu”
4. “sigma^2”

A
  1. outputs based on inputs
  2. normal distribution
  3. mean/median of the data
  4. variance (spread of the values around the center)
21
Q

The t-distribution is widely used in many statistical models and looks like the normal distribution, but becomes more divergent with smaller sample sizes due to the influence of what parameter?

A

degrees of freedom (df)

22
Q

The normal distribution is useful with what kind of model, which assumes the mean/median (mu) varies linearly with the explanatory variables?

A

regression models

23
Q

What does an analysis of variance show about the treatment?

A

whether the treatment affects the results relative to the control

24
Q

What is a necessary characteristic of a valid hypothesis?

A

falsifiability

25
What is an example of a non-falsifiable hypothesis (HINT: it remains a popular idea in most people's heads nonetheless)?
God created the Universe
26
Information collected by hypothesis testing can cause what three things to subsequently occur?
1. rejection of original claim 2. modification of original claim 3. confirmation of original claim
27
What are the four steps statistical hypothesis testing is often broken into?
1. development of null and alternative hypotheses 2. calculation of a test statistic 3. converting the test statistic to a P-value 4. deriving a conclusion
28
A (1)__________ hypothesis may be defined as the theory of no (2)_____________ or the absence of any (3)________________; it contradicts the notion of the (4)____________________ relationship.
(1) null (2) difference (3) pattern (4) cause-and-effect
29
It is thought the "Ghostbuster" eggplant is larger than the "Night Shadow" variety. What would be the null (Ho) and alternative (Ha) hypotheses?
Ho = "Ghostbuster" and "Night Shadow" eggplants are the same size
Ha = "Ghostbuster" eggplants are larger than "Night Shadow" fruits
30
Considering the question of whether "Ghostbuster" eggplants are larger than or the same size as "Night Shadow" fruits, how might one go about establishing a histogram which shows the distribution curve?
  1. measure 1,000 "Ghostbuster" and 1,000 "Night Shadow" eggplants, take the average mass of each group (mu[G] and mu[NS]), then take the observed difference (mu[G] - mu[NS])
  2. mix the 2,000 observations, pull 1,000 of them at random as a hypothetical "Ghostbuster" group and treat the other 1,000 as a hypothetical "Night Shadow" group, then take the difference of the two group averages (mu[g1] - mu[ns1])
  3. repeat the previous step 999 more times (mu[g2] - mu[ns2], mu[g3] - mu[ns3], ..., mu[g1000] - mu[ns1000]) and plot the frequency of the 1,000 simulated differences as a histogram
31
Considering the question of whether "Ghostbuster" eggplants are larger than or the same size as "Night Shadow" fruits, how might one go about converting the distribution curve to a useful P-value?
  1. plot the true difference in mass (mu[G] - mu[NS]) on the histogram of simulated mass differences
  2. count the number of simulated differences which are larger than the true difference; the P-value is that count divided by 1,000
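
Worked sketch: the shuffle-and-count procedure from the two eggplant cards can be written as a short permutation test. This is a minimal illustration, not the textbook's code; the function name `permutation_p_value`, the group size of 200, and the simulated masses (in grams) are hypothetical stand-ins for real measurements.

```python
import random
from statistics import fmean

def permutation_p_value(group_a, group_b, n_shuffles=1000, seed=0):
    """One-sided permutation test: is mean(group_a) larger than mean(group_b)?"""
    rng = random.Random(seed)
    observed = fmean(group_a) - fmean(group_b)   # true difference
    pooled = list(group_a) + list(group_b)       # mix all observations
    n_a = len(group_a)
    count_larger = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)                      # random re-assignment to groups
        simulated = fmean(pooled[:n_a]) - fmean(pooled[n_a:])
        if simulated > observed:
            count_larger += 1
    # P-value: share of simulated differences larger than the true one
    return observed, count_larger / n_shuffles

# Hypothetical masses in grams, generated only to exercise the function
data_rng = random.Random(42)
ghostbuster = [data_rng.gauss(520, 40) for _ in range(200)]
night_shadow = [data_rng.gauss(500, 40) for _ in range(200)]

observed, p_value = permutation_p_value(ghostbuster, night_shadow)
print(f"observed difference = {observed:.1f} g, P = {p_value:.3f}")
```

Plotting the simulated differences as a histogram would reproduce the distribution curve described in the cards; here only the count is kept.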
32
What reason does the textbook (Gotelli & Ellison, 2013) give for the establishment of 0.05 as the orthodox critical P-value?
"... after many decades of custom, tradition and vigilant enforcement by editors and journal reviewers"
33
While the P-value is important, what are three aspects of the data which need also be considered?
[1] sample size (n) [2] the measurement under investigation (such as difference in size between two cultivars) [3] level of variation (sigma^2)
34
What causes larger sample sizes to result in smaller P-values than smaller sample sizes, even if both groups have a high number of observations (e.g., comparing a 1,000 size sample to a 10,000 size sample)?
larger sample sizes are more reflective of the overall population or true value
35
Which groups will produce smaller P-values, those with a large amount of variation among their observations or a small amount of variation?
the lower the amount of variation within the groups, the lower the resulting P-values
36
What is the downside of the inferences made from hypothesis testing, even with very large sample sizes?
all conclusions are based on incomplete information, which may not actually reflect the true situation
37
What are the two possible correct decisions when testing a null hypothesis?
(1) failure to reject a null hypothesis which is true (2) rejection of a null hypothesis which is false
38
What occurs when someone commits a Type I Error (alpha)?
false rejection of a null hypothesis which is true
39
What is the definition of a Type I Error in simple English?
something is thought to be occurring when nothing actually is
40
What occurs when someone commits a Type II Error (beta)?
failure to reject a null hypothesis which is actually false
41
What is the definition of a Type II Error in simple English?
nothing is thought to be occurring when something actually is
42
What is statistical power?
the probability of correctly rejecting a null hypothesis which is false
43
What is the relationship between statistical power and the Type II Error?
statistical power = 1.0 - beta (probability of a type II error)
44
If one fails to find significant results in their data, what might be the cause, which is related to statistical power?
the study may have been under-powered, meaning it had too few samples
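
Worked sketch: the relationship power = 1.0 - beta, and the way small samples leave a study under-powered, can be checked numerically. The example below assumes a one-sided z-test with a known standard deviation, which is a simplification chosen for illustration; the function name `power_one_sided_z` and the effect size, sigma, and sample sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test when the true mean exceeds the null mean by `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha)   # rejection threshold under the null
    shift = effect / (sigma / sqrt(n))         # true effect in standard-error units
    beta = std_normal.cdf(z_crit - shift)      # P(fail to reject | null is false)
    return 1.0 - beta                          # power = 1 - beta

power_by_n = {n: power_one_sided_z(effect=2.0, sigma=10.0, n=n) for n in (10, 50, 200)}
for n, power in power_by_n.items():
    print(f"n = {n:>3}: power = {power:.2f}")
```

With the same effect and variation, power rises as n grows, so a non-significant result from a small sample may reflect low power rather than the absence of an effect.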
45
What is the relationship between Type I and Type II Errors?
they are inverse -- as one grows, the other shrinks
46
Why are the stakes of a Type II Error lower than those of a Type I Error, generally speaking?
a type I error proclaims there is a phenomenon where none actually exists, which can lead people to do foolish things for bad reasons; a type II error, as a false negative, doesn't generally mobilize people, and can often be corrected in the future with more sensitive technology
47
What statistical projection represents the simplest starting place for describing some relationship between one or more explanatory variables (x1, x2, ... x[n]) and the response variable (y)?
linear model
48
Some non-linear relationships can be treated as approximately linear under what condition (think of a relationship with a semicircular curve)?
over a very narrow range of x values, the function can appear more linear
49
What assumption for a linear model requires that the explanatory (x) and response (y) variables have a linear relationship between them?
linearity
50
What assumption for a linear model requires that the errors be normally distributed for any given value of the explanatory (x) variable?
normality
51
What assumption for a linear model requires that the variance in the response variable be constant across the explanatory variables?
homogeneity of variance
52
What key assumption of linear models says that, for any given value of explanatory variables, the responses will have independent errors?
independence
53
What is the difference between the observed value of the response variable (y) and the value of response predicted by the linear model (y-hat) known as?
residual
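
Worked sketch: residuals are simply y minus y-hat for each observation. The snippet below fits a simple linear model by ordinary least squares and computes them; the helper name `fit_line` and the x and y values are hypothetical, chosen only to show the arithmetic.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for a simple linear model."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (
        sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        / sum((xi - x_bar) ** 2 for xi in x)
    )
    return slope, y_bar - slope * x_bar

x = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical explanatory values
y = [2.1, 3.9, 6.2, 7.8, 10.1]     # hypothetical responses

slope, intercept = fit_line(x, y)
y_hat = [intercept + slope * xi for xi in x]          # model predictions
residuals = [yi - yh for yi, yh in zip(y, y_hat)]     # observed minus predicted
print([round(e, 3) for e in residuals])
```

For a least-squares fit with an intercept the residuals sum to zero (up to rounding), which is why they are expected to scatter evenly about the line y = 0.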
54
When testing linearity with the residual~scatter-plot method, we expect that residuals should be (1)___________________ about the line (2)_________, and that the values are (3)____________ with predictions.
(1) evenly distributed (2) y = 0 (3) uncorrelated
55
If a non-linear model is a better representation of a phenomenon which is erroneously being described with a linear model, what can be a consequence?
inflated estimate of variance
56
What kind of graph is invoked to check the normality assumption?
normal qq-plot
57
What allows the statistician to apply a relative significance to a set of values in order to standardize the errors when testing for normality?
weighted standard deviation
58
What are the steps necessary to establish a qq-plot to check for normality?
  1. calculate the residuals (e) for each value of the response variable (y - y-hat)
  2. standardize each residual (e) by dividing it by the weighted standard deviation (sigma)
  3. for the scatter-plot, set the theoretical quantiles of the differences on the x-axis and the standardized residuals on the y-axis
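
Worked sketch: the coordinates for a normal qq-plot can be computed without any plotting library. Two simplifications are assumed here and are not from the notes: the residuals are divided by their plain standard deviation rather than a weighted one, and the theoretical quantiles use the plotting positions (i - 0.5) / n, one of several common conventions. The function name `qq_points` and the residual values are hypothetical.

```python
from statistics import NormalDist, pstdev

def qq_points(residuals):
    """Return (theoretical quantile, standardized residual) pairs for a normal qq-plot."""
    sd = pstdev(residuals)  # assumption: plain (unweighted) standard deviation
    standardized = sorted(e / sd for e in residuals)
    n = len(standardized)
    # assumption: plotting positions (i - 0.5) / n for the theoretical quantiles
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, standardized))

residuals = [-1.2, 0.4, 0.1, -0.3, 1.5, -0.6, 0.2, -0.1]  # hypothetical residuals
points = qq_points(residuals)
for q, e in points:
    print(f"theoretical {q:6.2f}   standardized residual {e:6.2f}")
```

Plotting the pairs with the theoretical quantiles on the x-axis, together with the reference line y = x, gives the qq-plot; points close to that line suggest roughly normal residuals.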
59
In a normal qq-plot, what does the hypothetical linear abline (which we code for separately and is not based on the real values) represent?
the linear abline shows where standardized residuals would fall if they were perfectly normal
60
If weighted residuals are normally distributed, what should their associated qq-plot look like?
all the values fall nicely on the abline
61
If weighted residuals are skewed to the left, what should their associated qq-plot look like?
all the values are linear early on and curve upward later (think of the curve of a circle with center at (0, 0) which is being looked at in Quadrant IV)
62
If weighted residuals are skewed to the right, what should their associated qq-plot look like?
all the values curve upward early on and become more linear later (think of the curve of a circle with center at (0, 0) which is being looked at in Quadrant II)
63
What type of distribution in the errors produces a qq-plot that roughly looks like an "S"?
distribution of errors with fat-tailed residuals
64
What type of distribution in the errors produces a qq-plot that roughly looks like a backwards "S"?
distribution of errors with thin-tailed residuals
65
Normality in the data is important because it allows us to use (1)_______________________ to construct (2)_______________________ and to engage in (3)___________________________.
(1) parametric theory (2) confidence intervals (3) hypothesis testing
66
To examine homogeneity of variance, what kind of graph do we establish?
t-plot of residual values versus the fitted data
67
In a t-plot of predicted values (y-hat) against the residuals (e), what kind of behavior do we want the values to appear as across the line e = 0?
homoscedastic
68
What is the kind of distribution in the residual values we do not wish to see when testing for homogeneity of variance?
heteroscedastic
69
How can statisticians best ensure that errors are independent of one another?
maintain a sampling design with subjects that are independent spatially and/or temporally from one another
70
What are the four common kinds of data transformation used by statisticians to normalize the data?
(1) base-10 ("common") logarithmic (2) base-e ("natural") logarithmic (3) square-root (4) arcsine square root
71
Log transformations are useful if the ratio between the smallest and largest outputs is what?
orders of magnitude in size
72
Square-root transformations can be used when all of the values are greater than what?
values need to be greater than zero
73
What is the term used for the kind of data square-root transformations are used with, which is related to the Poisson distribution?
count data
74
Under what circumstances would someone transform the data with the arcsine square root?
outputs are compressed between (0, 1.0)
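
Worked sketch: the four transformations named in the cards above map directly onto functions in Python's `math` module. The count values and proportions below are hypothetical, picked so the effect of each transformation is easy to see.

```python
from math import asin, log, log10, sqrt

counts = [1, 10, 100, 1000]             # hypothetical values spanning orders of magnitude
proportions = [0.05, 0.25, 0.5, 0.95]   # hypothetical values bounded between 0 and 1.0

log10_values = [log10(v) for v in counts]             # base-10 ("common") logarithm
ln_values = [log(v) for v in counts]                  # base-e ("natural") logarithm
sqrt_values = [sqrt(v) for v in counts]               # square root, often used for count data
arcsine_values = [asin(sqrt(p)) for p in proportions] # arcsine square root, in radians

print("log10:  ", [round(v, 3) for v in log10_values])
print("ln:     ", [round(v, 3) for v in ln_values])
print("sqrt:   ", [round(v, 3) for v in sqrt_values])
print("arcsine:", [round(v, 3) for v in arcsine_values])
```

The log transformations pull values that differ by orders of magnitude onto an evenly spaced scale, while the arcsine square root stretches proportions near 0 and 1.0.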
75
How many kinds of t-test are there? What are they called? What do they have in common?
three kinds of t-test: (1) one-sample t-test (2) two-sample t-test (3) paired t-test; all t-tests compare the means of the data
76
To derive a test statistic (z[obs]), what is the relationship between the observed data (x-bar), the standard error (sigma / n^0.5), and the expected value under the null hypothesis (mu)?
z[obs] = (x-bar - mu) / (sigma / n^0.5) (In English, the test statistic is equal to the sample mean minus the value expected under the null hypothesis, divided by the standard error of the mean)
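
Worked sketch: the formula above is a one-line calculation. The numbers used here (sample mean 52.0, null mean 50.0, sigma 5.0, n = 25) are hypothetical, and the conversion to a one-sided P-value assumes the normal distribution applies, i.e. sigma is known.

```python
from math import sqrt
from statistics import NormalDist

def z_observed(x_bar, mu, sigma, n):
    """Test statistic: (sample mean - null mean) / standard error."""
    return (x_bar - mu) / (sigma / sqrt(n))

# Hypothetical inputs: standard error = 5.0 / sqrt(25) = 1.0
z = z_observed(x_bar=52.0, mu=50.0, sigma=5.0, n=25)

# One-sided P-value under the standard normal (assumes sigma is known)
p_one_sided = 1.0 - NormalDist().cdf(z)
print(f"z[obs] = {z:.2f}, one-sided P = {p_one_sided:.4f}")
```

When sigma has to be estimated from a small sample, the same ratio is compared against a t-distribution rather than the normal.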
77
What is the difference between a t-statistic and a z-score?
the t-statistic is used when the sample size is small or the standard deviation of the population is not known
78
Both the two-sample t-test and the paired t-test compare the means of two sample groups. What are the two main reasons to use a paired t-test over a two-sample t-test?
(1) paired t-tests are used if two measurements are taken on the same unit (2) paired t-tests can remove variability between units
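
Worked sketch: the way pairing removes variability between units shows up when both statistics are computed on the same data. The before/after measurements below are hypothetical; each unit changes only slightly, but the units differ a lot from one another. The two-sample statistic uses the unpooled standard error, which equals the pooled one here because the groups are the same size.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical repeated measurements on the same five units
before = [10.0, 20.0, 30.0, 40.0, 50.0]
after = [11.0, 22.0, 32.0, 41.0, 53.0]
n = len(before)

# Paired t: work on the within-unit differences
differences = [a - b for a, b in zip(after, before)]
t_paired = mean(differences) / (stdev(differences) / sqrt(n))

# Two-sample t: ignores the pairing, so between-unit spread inflates the error
standard_error = sqrt(variance(before) / n + variance(after) / n)
t_two_sample = (mean(after) - mean(before)) / standard_error

print(f"paired t = {t_paired:.2f}, two-sample t = {t_two_sample:.2f}")
```

The mean difference is identical in both cases; only the denominator changes, which is why the paired statistic is much larger for data like these.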
79
What is the underlying reason for which randomized block design is undertaken?
control of sources of variation
80