terms and definitions Flashcards

1
Q

the problem of multiple comparisons

A

beware whenever someone runs many tests and then reports only the one that looks good

eg we flip 1000 fair coins 100 times each, then select the 10 “best” coins (those that came up heads most often) and declare them “lucky”; in fact we have no causal basis for the top 10’s apparent favoritism of heads, only selection after many comparisons
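
A minimal simulation of this effect (a sketch in Python/NumPy; the coin counts, seed, and “top 10” cutoff are just the numbers from the example, not real data):

  import numpy as np

  rng = np.random.default_rng(0)
  flips = rng.integers(0, 2, size=(1000, 100))   # 1000 fair coins, 100 flips each
  heads = flips.sum(axis=1)                      # heads count per coin
  top10 = np.argsort(heads)[-10:]                # the 10 "luckiest" coins
  print(heads[top10])                            # often ~60+ heads purely by chance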

2
Q

prior probability

A

in Bayesian statistical inference, prior probability is the probability of an event based on established knowledge, before empirical data is collected; “what is originally believed before new evidence is introduced”

e.g. consider classifying an illness, knowing only that a person has a fever and a headache. These symptoms are indications of both influenza and Ebola virus. But far more people have the flu than Ebola (the prior probability of influenza is much higher than that of Ebola), so based on those symptoms alone you would classify the illness as the flu

3
Q

posterior probability

A

in Bayesian statistics, the posterior probability is the probability after conditioning on the observed event, ie after the desired “new” information has come in and been used to refine the probability distribution–an “adjusted guess”

a posterior can, in turn, serve as a prior, if still newer information arrives that leads to a further revision of the (posterior-cum-prior) probability
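
A small numerical sketch of a prior becoming a posterior via Bayes’ rule, using the flu/Ebola example (the prevalence and symptom probabilities below are invented purely for illustration, and the calculation is restricted to these two candidate illnesses):

  # hypothetical numbers: prior prevalence and P(symptoms | illness)
  prior = {"flu": 0.10, "ebola": 1e-6}
  likelihood = {"flu": 0.90, "ebola": 0.99}

  unnormalized = {k: prior[k] * likelihood[k] for k in prior}
  total = sum(unnormalized.values())
  posterior = {k: v / total for k, v in unnormalized.items()}
  print(posterior)   # flu dominates: the prior drives the classification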

4
Q

margin of error (in context of samples)

A

Since a sample is used to represent a population, the sample’s result is expected to differ from the result you would have obtained by surveying the entire population. The margin of error quantifies this expected difference: at a stated confidence level, it is the half-width of the confidence interval around the sample result. [eg in inferences on population means]

5
Q

t-statistic

A

usually in context of hypothesis testing between means (eg between two sample means, or between a sample mean and a population mean)

a general form for t statistics is,
t(ŷ) = (ŷ - y) / s.e.(ŷ),

ie the t statistic for point estimate ŷ is ŷ recentered about the reference value y (eg the null-hypothesis value), divided by the standard error of ŷ
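
A one-sample illustration of this form (a sketch; the data and the null value mu0 are invented):

  import numpy as np

  x = np.array([5.1, 4.8, 5.6, 5.0, 4.7, 5.3])   # hypothetical sample
  mu0 = 5.0                                      # null-hypothesis mean (the "y" above)
  y_hat = x.mean()                               # point estimate
  se = x.std(ddof=1) / np.sqrt(len(x))           # standard error of the mean
  t = (y_hat - mu0) / se
  print(t)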

6
Q

probability triplet; event space, sample space

A
  • a probability space, (O,A,P)
    • O the sample space (eg the real line)
    • A a sigma algebra of subsets of O (eg Borel sigma algebra)
    • P a measure, normalized as P(O)=1
  • subsets of O are called events; elements of A are called random events (ie can be measured)
  • if O is countable, we generally call A the event space
  • a random variable then maps outcomes in the sample space to some associated state space (ie assigns values to outcomes)
7
Q

state space

A

this involves the “separation” of random outcomes and the values assigned to those outcomes

for each outcome in the sample space (eg the result of 10 coin tosses), we can assign a value; a random variable’s state space consists of these values

8
Q

variance

A

as the second centered moment:

var(X) = E( [X-E(X)]^2 ) = E(X^2)-(E(X))^2

9
Q

linear functions of a random variable

A

let f(X) = aX+b; then:

  • E[f(X)] = aE(X) + b
  • var[f(X)] = a^2 var(X)
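
A quick Monte Carlo check of these two identities (a sketch; a, b, and the underlying distribution are arbitrary choices):

  import numpy as np

  rng = np.random.default_rng(1)
  X = rng.exponential(scale=2.0, size=1_000_000)  # E(X)=2, var(X)=4
  a, b = 3.0, -1.0
  fX = a * X + b
  print(fX.mean(), a * X.mean() + b)              # agree: a E(X) + b
  print(fX.var(), a**2 * X.var())                 # agree: a^2 var(X)
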
10
Q

statistical inference

A

deducing properties of an underlying probability distribution from a dataset

11
Q

statistic

A

property of a sample from the population

12
Q

point estimate (and bias, relative efficiency, and MSE)

A
  • a point estimate, x_e, of a population parameter, x_a, is a best guess of the value of x_a
  • the bias of a point estimate is, bias = E(x_e)-x_a
    standard unbiased estimates include:
    • sample mean for the mean of any distribution
    • p_e=X/n for binomial B(n,p_a)
    • sum_i (x_i - x̄)^2 / (n-1), with x̄ the sample mean, for the variance of any distribution
  • sampling distribution is the distribution of the point estimate derived from samples
  • relative efficiency for two different point estimates, var(x_e1) / var(x_e2)
  • mean square error for point estimate, E((x_e-x_a)^2) = var(x_e) + bias^2
13
Q

standard error (and eg means)

A
  • standard error for a point estimate is the standard deviation of the sampling distribution
    • ie we repeatedly pull n samples from the population, and consider the s.d. of the resulting distribution of parameter values
    • eg for s.e. on the mean for n samples from a population with variance sig^2, s.e. = sig / sqrt(n)
  • standard error is often estimated from a sample, and called the standard error estimate (or just standard error);
    eg the estimated s.e. of the mean for n samples from a population is sig_e / sqrt(n-1), where sig_e is the “biased” sample standard deviation (n in the denominator of the variance); dividing by sqrt(n-1) applies the Bessel correction, and the result equals s / sqrt(n), with s the Bessel-corrected sample standard deviation
  • typically for point estimates, the s.e. is estimated from a sample, and a special distribution (fitted to the case, such as Student’s t assuming the population is approximately normal) is used
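
A sketch comparing the analytic s.e. of the mean with the spread of an actual sampling distribution (the population, sample size, and seed are arbitrary choices):

  import numpy as np

  rng = np.random.default_rng(2)
  population = rng.normal(loc=10, scale=3, size=100_000)
  n = 25
  means = [rng.choice(population, size=n).mean() for _ in range(5_000)]
  print(np.std(means))                       # empirical s.d. of the sampling distribution
  print(population.std() / np.sqrt(n))       # analytic s.e. = sig / sqrt(n)
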
14
Q

error vs residual

A

error–an error term accounts for inherent randomness in a statistical model for a given population’s data; the parameters of the statistical model are generally not known

residual–after using sample data to form estimates of statistical model parameters, the difference between the predictive model’s output and the sample observation is the residual

eg

  • a very simple statistical model supposes each height in the population is, mean height + error (note we do not necessarily know the population’s mean height)
  • we take a sample of people, compute the average height, and our fitted model is now the sample mean height; when we use this to make predictions (note it has no dependence on predictor variables), the difference between a sample observation and the prediction is the residual
15
Q

law of large numbers

A

the sample mean converges to the population mean as sample size increases

16
Q

error types (I and II) and power of a test

A
  • a table with rows H_0 accepted, H_0 rejected, and columns H_0 true, H_0 false
  • type I error corresponds to the lower-left cell: H_0 true but falsely rejected; can occur with the problem of “too many tests”
  • type II error corresponds to the upper-right cell: H_0 false but falsely accepted
    • can occur when sample sizes are too small
    • in the context of H_A, we’ve concluded that H_0 is plausible, when in reality H_A is so plausible as to be true
  • power of a hypothesis test = 1 - probability of type II error
    • the probability the hypothesis was not falsely accepted (higher is better)
    • with respect to an H_A:
      • a higher-power test gives some degree of complementarity, with better ability to discriminate between H_0 and H_A
      • with a lower-power test, even if H_0 seems plausible, H_A may still be true–this is a weak result
17
Q

Bonferroni (multiple comparisons)

A
  • for large numbers of comparisons, an arguably very conservative method for reducing type I error (false rejection of the null; eg when comparing population means, falsely declaring a difference in means significant)
  • simply divide the desired alpha level for the p-values (eg 0.05) by the number of tests run, as sketched below
  • see also Holm-Bonferroni, FDR, and FWER
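
A minimal sketch of the correction (the p-values here are placeholders, not from any real study):

  alpha = 0.05
  p_values = [0.001, 0.012, 0.030, 0.047, 0.20]   # hypothetical results of 5 tests
  m = len(p_values)
  rejected = [p < alpha / m for p in p_values]    # Bonferroni: compare to alpha/m = 0.01
  print(rejected)                                 # only p = 0.001 survives
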
18
Q

Simpson’s paradox

A

an association between two random variables appears in the population as a whole, but the association disappears or reverses when the population is divided into subgroups

eg graduate admission rates at a university are lower overall for women (total relationship), while within individual departments the effect reverses (partial relationship)

19
Q

confidence interval

A
  • the range of values within which the population’s result is expected to lie, at the confidence level of the study: sample result +/- the margin of error
  • in most general terms, given point estimate x_e for population parameter x_a, and standard error estimate s (i.e. allowing a t-statistic)
    • the point estimate, normalized by s, will follow some distribution, determined from the population distribution and the parameter in question
    • the normalized p.e. distribution is then used to translate one- or two-sided confidence at some level (95%, etc.) into an allowed range for x_e - x_a
  • eg 29 +/- 18 with 95% confidence
    • point estimate is 29
    • margin of error is 18 (half-width, in true units)
    • confidence level is 95%
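
A sketch of a two-sided, t-based 95% CI for a mean (the data are invented; assumes an approximately normal population):

  import numpy as np
  from scipy import stats

  x = np.array([29, 35, 18, 42, 25, 31, 22, 30])  # hypothetical sample
  n = len(x)
  se = x.std(ddof=1) / np.sqrt(n)
  t_crit = stats.t.ppf(0.975, df=n - 1)           # two-sided 95%
  margin = t_crit * se                            # margin of error (half-width)
  print(x.mean(), "+/-", margin)
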
20
Q

sample variance (and e.g. normal populations)

A
  • s^2 = sum_i (x_i - x̄)^2 / n, where x̄ is the sample mean
  • for an unbiased estimate of the population variance (Bessel correction),
    s^2 = sum_i (x_i - x̄)^2 / (n-1)
  • if X is normally distributed, then
    (x_1 - x̄)^2 + … + (x_n - x̄)^2 = sig_p^2 Chi^2_{n-1}
    where sig_p^2 is the population variance and Chi^2_{n-1} is a chi-squared random variable with n-1 degrees of freedom (using the true mean in place of x̄ instead gives sig_p^2 [N(0,1)^2 + … + N(0,1)^2] = sig_p^2 Chi^2_n)
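
A quick check of the biased vs Bessel-corrected versions (a sketch; NumPy’s ddof argument controls the denominator, and the data are arbitrary):

  import numpy as np

  x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
  n = len(x)
  print(np.var(x))                             # divides by n      (biased)
  print(np.var(x, ddof=1))                     # divides by n - 1  (Bessel-corrected)
  print(((x - x.mean())**2).sum() / (n - 1))   # same as ddof=1
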
21
Q

moments of a distribution (and e.g. Fourier transform)

A
  • the nth raw moment is E[X^n]
  • the nth central moment is E[(X-E(X))^n]
  • link to the Fourier transform
    • assume continuous random variable X with pdf f(x)
    • then the Fourier transform of f(x), F(s), can be expressed as a Taylor-like series with raw-moment coefficients:
      F(s) = sum_n E[X^n] (-2πis)^n / n!   (ie the terms of the expansion of E[exp(-2πisX)])
22
Q

moment generating function

A
  • the moment generating function, MGF, is a special function that generates the moments of pdf/pmf f(x)
  • when it exists (in a neighborhood of t=0), the MGF uniquely determines f(x)
  • MGF of r.v. X = E[e^{tX}] = 1 + tE(X) + t^2 E(X^2) / 2! + …
  • generally, the MGF is the two-sided Laplace transform of the pdf / pmf, evaluated at -t
  • differentiating the MGF k times and evaluating at t=0 extracts the kth raw moment
  • c.f. characteristic function
23
Q

characteristic function

A
  • a special function that is bijectively determined from a pdf/pmf f(x)
  • CF of r.v. X = E[e^{itX}]; expand the exponential as a Taylor series to obtain terms involving the raw moments
  • in the case of a pdf, the CF is the Fourier transform of f(x)
  • the characteristic function of a distribution always exists, even when the probability density function or moment-generating function do not
  • c.f. moment generating function
24
Q

maximum likelihood estimate

A
  • given unknown parameter y, independent observations {x_i}_i, and pdf f(x,y), the likelihood of the event [x_1,…,x_n] is,
    f(x_1,y)·…·f(x_n,y)
  • find y so that the likelihood function is maximized
  • note the log likelihood is often easier to work with (ie take log of l.f.)
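
A small sketch for an exponential model f(x, y) = y·exp(-y·x), where maximizing the log likelihood gives y_hat = 1 / mean(x); the data below are simulated with an arbitrary true rate:

  import numpy as np

  rng = np.random.default_rng(3)
  x = rng.exponential(scale=1 / 2.5, size=10_000)   # true rate y = 2.5

  def log_likelihood(y):
      return np.sum(np.log(y) - y * x)              # sum of log f(x_i, y)

  y_hat = 1 / x.mean()                              # analytic maximizer
  print(y_hat, log_likelihood(y_hat) > log_likelihood(2.0))
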
25
Q

method of moments

A
  • assume we’ve a population under a specific distribution, with parameters a_1,…,a_k: pdf = f(x,a_1,…,a_k)
  • suppose further, we can express the population moments (mean, variance, skewness, kurtosis, …) as a function of the parameters a_1,…,a_k;
    a simple example would be a uniform distribution with endpoint parameters a,b: ie f=1/(b-a) on the interval [a,b]; then mu = (a+b)/2; var = (b-a)^2 / 12
  • we can then create a system of equations, k equations in k unknowns, by setting the population parameter-moments equal to respective moment estimates derived from a population sample
  • solving this system produces estimates of the parameters a_1,…,a_k, and hence a fitted distribution (see the sketch below)
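
A sketch of the uniform example above: equate the sample mean and variance to (a+b)/2 and (b-a)^2/12 and solve (the sample is simulated with arbitrary true endpoints):

  import numpy as np

  rng = np.random.default_rng(4)
  x = rng.uniform(2.0, 7.0, size=10_000)      # true a=2, b=7

  m, v = x.mean(), x.var()
  half_width = np.sqrt(3 * v)                 # from var = (b-a)^2 / 12
  a_hat, b_hat = m - half_width, m + half_width
  print(a_hat, b_hat)                         # close to 2 and 7
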
26
Q

stochastic process

A
  • eg in the discrete, half-infinite case, we have a family of random variables, (X0,X1,…), indexed over the set N (naturals), defined over the common probability space (O,B,P) (O the sample space, B the event space), and all mapping into the same measurable space (O1,B1) (O1 the state space)
  • each random variable in the process may be considered a function of both the index (eg “time”) and the outcome drawn from O
  • a sample function aka realization aka sample path aka trajectory aka path function aka path is a single outcome of a stochastic process–ie one realized value of each random variable in the family; note that though all Xi are defined on the same (O,B,P) space, they are not necessarily independent
  • stationary stochastic process–the joint distribution of subsets is invariant to shifts in the time index
27
Q

Chebyshev inequality

A
  • for random variable X with mean mu and standard deviation sig,
    P(mu - c*sig <= X <= mu + c*sig) >= 1 - 1/c^2
  • note, this inequality is referenced to the true population mean and standard deviation (which we may not know)
  • may be useful for non-normal distributions, as a rough bound
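
An empirical check on a skewed, non-normal distribution (a sketch; the exponential choice and c = 3 are arbitrary):

  import numpy as np

  rng = np.random.default_rng(5)
  x = rng.exponential(scale=1.0, size=1_000_000)   # mean 1, s.d. 1
  mu, sig, c = x.mean(), x.std(), 3.0
  inside = np.mean(np.abs(x - mu) <= c * sig)
  print(inside, ">=", 1 - 1 / c**2)                # roughly 0.98 >= 0.889
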
28
Q

effect size

A
  • this is a general term, referring to the “strength of the relationship between two variables”
  • categories of measurements of effect size include
    • correlation–eg Pearson’s r
    • differences between means–eg Cohen’s d (i.e. some underlying variable is related to the means, so larger difference means a stronger relation)
    • categorical–for effect sizes among categorical variables, eg odds ratio (for comparing two binary variables)
  • may be considered in context of statistical significance (eg highly significant, but with small effect size)
29
Q

coefficient of determination

A

ANOVA techniques can be applied to obtain measures and statistics on linear regression fits, extensible to regression model output in general

r^2 = SSM/SST = 1 - SSE/SST

  • where SST = SSM + SSE (T=total, M=model, E=error)
  • ie r^2 is literally the proportion of the total (data) variability that is captured by the model
  • SSE comes from the residuals–sum_i (y_i - ŷ_i)^2
  • SST is related to the total sample variance–sum_i (y_i - ȳ)^2, with ȳ the sample mean of y

relation to Pearson’s r: if fitting a linear regression model Y~X, then the coefficient of determination of the fit equals the square of Pearson’s r between X and Y
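
A sketch verifying r^2 = 1 - SSE/SST and its equality with Pearson’s r squared for a simple linear fit (the data are simulated with arbitrary slope, intercept, and noise):

  import numpy as np

  rng = np.random.default_rng(6)
  x = rng.uniform(0, 10, size=200)
  y = 2.0 * x + 1.0 + rng.normal(scale=3.0, size=200)

  slope, intercept = np.polyfit(x, y, 1)
  y_hat = slope * x + intercept
  sse = np.sum((y - y_hat) ** 2)
  sst = np.sum((y - y.mean()) ** 2)
  r2 = 1 - sse / sst
  print(r2, np.corrcoef(x, y)[0, 1] ** 2)          # the two agree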

30
Q

counterfactual

A
  • something that didn’t actually happen, and generally can’t be observed
  • eg (to test effects of capital punishment on crime) the number of crimes that go “uncommitted” because capital punishment exists
  • eg how much would someone have earned on the job if, all else being equal, they were a different gender
31
Q

statistics (Kaplan)

A

(variance centric) the explanation of variation in the context of what remains unexplained

32
Q

census

A

a “sample” equal to the entire population

33
Q

unalikeability

A

for a factor with k levels, a measure of how un-alike (heterogeneous) the observations are (0 when perfectly homogeneous)
procedure for n samples:

  • consider all ordered pairs (x_i, x_j), i ≠ j, and total the number of pairs with x_i, x_j from different classes
  • divide the result by n(n-1)
  • the result will be in [0,1] (=0 means all in the same class)
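
A direct implementation of this procedure (a sketch; the example labels are arbitrary):

  def unalikeability(labels):
      n = len(labels)
      # count ordered pairs (i, j), i != j, whose labels differ
      unlike = sum(1 for i in range(n) for j in range(n)
                   if i != j and labels[i] != labels[j])
      return unlike / (n * (n - 1))

  print(unalikeability(["a", "a", "a", "a"]))      # 0.0: all in the same class
  print(unalikeability(["a", "a", "b", "b"]))      # 8/12, about 0.667
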
34
Q

sampling variability

A

the degree of variation in a sample-based parameter estimate; related to confidence intervals

35
Q

sampling distribution

A
  • in context of sample-based parameter estimate, this amounts to the “pdf” of the point estimate
  • this can be created simply by repeatedly drawing samples of size n from the population and recomputing the estimate
36
Q

statistical bootstrapping

A
  • given a large enough sample, estimate what the sampling distribution (for a parameter estimate) looks like, enabling eg confidence interval estimates
  • from an observed sample of size n, repeatedly resample n observations with replacement, recomputing the estimate each time (see the sketch below)
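
A sketch of a percentile bootstrap CI for a mean (the “observed” sample is simulated here; 10,000 resamples):

  import numpy as np

  rng = np.random.default_rng(7)
  sample = rng.exponential(scale=5.0, size=50)      # stand-in for an observed sample

  boot_means = [rng.choice(sample, size=len(sample), replace=True).mean()
                for _ in range(10_000)]
  ci = np.percentile(boot_means, [2.5, 97.5])       # 95% percentile interval
  print(sample.mean(), ci)
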
37
Q

partial vs total relationship

A
  • partial relationship–a relationship between predictor and outcome variables, with one or more covariates / confounders / nuisance variables held constant
  • total relationship–a relationship between predictor and outcome variables, letting any other explanatory / independent / predictor variables change as they will (aka mutatis mutandis)
38
Q

frequentist vs subjectivist

A
  • frequentist–a perception of probability that, eg for a coin toss, suggests:
    • you model the coin as having some fixed probability of coming up heads, and
    • the value you assign to that probability comes from running lots of experiments (coin flips) and taking the resulting long-run frequency of heads as the probability parameter
  • subjectivist–the probabilities of events encode the modeler’s assumptions and beliefs
    • eg tomorrow’s forecast calls for 10% chance of rain: the subjectivist would interpret the 10% as the forecaster’s way of imparting some information, based on their experience, available data, etc.
    • useful for encoding beliefs, but the probability calculus should be used to work through the consequences of these beliefs (as with eg Bayes)
39
Q

percentile vs quantile

A
  • percentile–the input argument is a measured value, what could be the output of a single draw from the probability distribution; the output is the proportion of the distribution at or below that value (eg someone’s IQ, it being in some percentile)
  • quantile–the input is a proportion/percentile, while the output is on the scale of the measured variable (eg what is the 0.25 quantile, ie the 25th percentile, of home prices on the local market)
40
Q

deductive vs inductive reasoning

A
  • deductive
    • a series of rules that bring you from given assumptions to the consequences of those assumptions
    • eg a syllogism or symbolic algebra
  • inductive
    • generalizes or extrapolates from a set of observations to conclusions
    • can be wrong
41
Q

significance level

A

the significance level of a hypothesis test is the accept/reject threshold for the p-value; this is the conditional probability, P(reject the null | the world is such that the null is true)

42
Q

alternative hypothesis

A

the pet idea of what the world is like if the null hypothesis is wrong; this usually plays the role of the thing you’d like to prove
complementarity and links to Type II error

  • “What is the probability that, in such a world [as the alternate hypothesis being true], you would end up failing to reject the null hypothesis [under the just-the-H_0 test]?”–such a “mistake” is called Type II error
  • in a sense, a high-power test makes the outcome more informative with respect to H_A: rejecting the null is consistent with H_A being true, and, conversely, failure to reject the null suggests H_A is not true (since, with high power, a true H_A would likely have led to rejection)
43
Q

test statistic

A

the sample-derived value that is the focus of the study or hypothesis test; eg a sample mean, or a model coefficient

44
Q

F statistic / F value

A
  • a “first principle” test in the context of ANOVA, indicating how significant a fitted model’s R^2 is
  • this is a ratio of two chi-squared random variables, each divided by its respective degrees of freedom
  • in ANOVA for models:
    • the SSM (model sum of squares) and SSE (residual sum of squares) are obtained, and each is divided by its respective degrees of freedom
    • the ratio of the results forms an F value, which, under some normality assumptions, can be checked against an F distribution
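
A sketch of the ANOVA-for-regression calculation (simulated data; one predictor, so the degrees of freedom are dfM = 1 and dfE = n - 2):

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(8)
  n = 100
  x = rng.uniform(0, 5, size=n)
  y = 1.5 * x + rng.normal(scale=2.0, size=n)

  slope, intercept = np.polyfit(x, y, 1)
  y_hat = slope * x + intercept
  ssm = np.sum((y_hat - y.mean()) ** 2)
  sse = np.sum((y - y_hat) ** 2)
  F = (ssm / 1) / (sse / (n - 2))
  print(F, stats.f.sf(F, 1, n - 2))                 # F value and its p-value
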
45
Q

the logic of a hypothesis test (Kaplan)

A
  • don’t try to reason about the real world, and build a model from that, but set up a hypothetical world that is completely understood, base a model on that, then compare the result of the model to the “observed patterns of the data”
  • accept / reject
    • accepting the null as plausible does not tell us much (many other possibilities could explain this) (converse)
    • rejecting the null gives salient information, that something with our assumptions is wrong (contrapositive)