terms and definitions Flashcards
the problem of multiple comparisons
beware whenever someone runs many tests and then picks the one that looks best
eg we flip 1000 fair coins 100 times each; we then select the 10 “best” coins that came up heads most often, claiming these coins are “lucky”–but chance alone guarantees some coins land near the top, so we have no causal claim to the top 10’s favoritism of heads
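A quick simulation of this card (hypothetical setup, Python stdlib only): every coin is fair, yet the post-hoc “best” coins look lucky.

```python
import random

random.seed(0)  # deterministic for reproducibility

# 1000 fair coins, 100 flips each; count heads per coin
heads = [sum(random.random() < 0.5 for _ in range(100)) for _ in range(1000)]

# cherry-pick the 10 "best" coins after the fact
top10 = sorted(heads, reverse=True)[:10]
print(top10)  # all well above the expected 50 heads, though every coin is fair
```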
prior probability
in Bayesian statistical inference, prior probability is the probability of an event based on established knowledge, before empirical data is collected; “what is originally believed before new evidence is introduced”
e.g. consider classifying an illness, knowing only that a person has a fever and a headache. These symptoms are indications of both influenza and of Ebola virus. But far more people have the flu than Ebola (the prior probability of influenza is much higher than that of Ebola) so based on those symptoms, you would classify the illness as the flu
posterior probability
in Bayesian statistics, the posterior probability is the probability after conditioning on new evidence–ie after the desired “new” information has come in–used to refine the probability distribution and make an “adjusted guess”
a posterior can, in turn, become a prior, if still newer information arrives that leads to its revision
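A minimal sketch of a prior-to-posterior update with Bayes’ rule, using the flu/Ebola card above; the base rates and symptom likelihoods are made-up numbers for illustration only.

```python
# made-up base rates and symptom likelihoods, for illustration only
prior = {"flu": 0.999, "ebola": 0.001}     # P(disease)
likelihood = {"flu": 0.90, "ebola": 0.95}  # P(fever & headache | disease)

# Bayes' rule: posterior is proportional to prior * likelihood
unnorm = {d: prior[d] * likelihood[d] for d in prior}
total = sum(unnorm.values())
posterior = {d: unnorm[d] / total for d in unnorm}
print(posterior)  # flu still dominates: the prior swamps the likelihood
```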
margin of error (in context of samples)
Since a sample is used to represent a population, the sample’s results are expected to differ from what the result would have been if you had surveyed the entire population. This difference is called the margin of error. [eg in inferences on population means]
t-statistic
usually in context of hypothesis testing between means (eg between two sample means, or between a sample mean and a population mean)
a general form for t statistics is,
t(ŷ) = (ŷ-y)/s.e.(ŷ),
ie the t statistic for point estimate ŷ is the recentered ŷ divided by the standard error for point estimate ŷ
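The general form above, sketched as a one-sample t statistic computed by hand (hypothetical data; ŷ is the sample mean, y the hypothesized population mean):

```python
import math

sample = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.4, 5.1]  # hypothetical data
y = 5.0                   # hypothesized population mean
n = len(sample)
y_hat = sum(sample) / n   # point estimate: the sample mean

# s.e.(y_hat): unbiased sample variance (Bessel), divided by n, square root
s2 = sum((x - y_hat) ** 2 for x in sample) / (n - 1)
se = math.sqrt(s2 / n)

t = (y_hat - y) / se      # recentered estimate over its standard error
print(t)
```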
probability triplet; event space, sample space
- a probability space, (O,A,P)
- O the sample space (eg the real line)
- A a sigma algebra of subsets of O (eg Borel sigma algebra)
- P a measure, normalized as P(O)=1
- subsets of O are called events; elements of A are called random events (ie they are measurable, so P assigns them a probability)
- if O is countable, we generally call A the event space
- a random variable then maps events in the probability space to some associated state space (ie assigns values to events)
state space
the state space “separates” the random events from the values assigned to those events
for each outcome in the sample space (eg the result of 10 coin tosses), we can assign a value; a random variable’s state space consists of these values
variance
as the second centered moment:
var(X) = E( [X-E(X)]^2 ) = E(X^2)-(E(X))^2
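The two forms of the second centered moment can be checked numerically on a small made-up discrete distribution:

```python
# a small made-up discrete distribution
values = [0, 1, 2, 3]
probs = [0.1, 0.2, 0.3, 0.4]

ex = sum(v * p for v, p in zip(values, probs))       # E(X)
ex2 = sum(v * v * p for v, p in zip(values, probs))  # E(X^2)
var_centered = sum((v - ex) ** 2 * p for v, p in zip(values, probs))
var_moment = ex2 - ex ** 2
print(var_centered, var_moment)  # the two forms agree
```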
linear functions of a random variable
let f(X) = aX+b; then:
- E[f(X)] = aE(X) + b
- var[f(X)] = a^2 var(X)
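Both identities can be verified on a toy sample (they hold exactly for sample means and variances as well):

```python
a, b = 3.0, -2.0
xs = [1.0, 4.0, 2.5, 7.0, 3.5]   # a toy sample
fx = [a * x + b for x in xs]

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

# E[aX+b] = a E(X) + b  and  var(aX+b) = a^2 var(X)
print(mean(fx), a * mean(xs) + b)
print(var(fx), a ** 2 * var(xs))
```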
statistical inference
deducing properties of an underlying probability distribution from a dataset
statistic
property of a sample from the population
point estimate (and bias, relative efficiency, and MSE)
- a point estimate, x_e, of a population parameter, x_a, is a best guess of the value of x_a
- the bias of a point estimate is, bias = E(x_e)-x_a
- standard unbiased estimates include:
  - sample mean for the mean of any distribution
  - p_e = X/n for binomial B(n,p_a)
  - sum_i (x_i - x̄)^2 / (n-1), with x̄ the sample mean, for the variance of any distribution (dividing by n-1 rather than n is the Bessel correction)
- sampling distribution is the distribution of the point estimate derived from samples
- relative efficiency for two different point estimates, var(x_e1) / var(x_e2)
- mean square error for point estimate, E((x_e-x_a)^2) = var(x_e) + bias^2
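A simulation sketch of bias and the Bessel correction: dividing by n underestimates the variance by a factor (n-1)/n on average, while dividing by n-1 is unbiased. Numbers are illustrative.

```python
import random

random.seed(2)  # deterministic for reproducibility

true_var = 4.0  # samples drawn from N(0, sd=2)
n = 5
trials = 50_000

biased_sum = unbiased_sum = 0.0
for _ in range(trials):
    xs = [random.gauss(0, 2) for _ in range(n)]
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)
    biased_sum += ss / n          # divide by n: biased low
    unbiased_sum += ss / (n - 1)  # divide by n-1: Bessel correction

biased = biased_sum / trials      # ~ (n-1)/n * true_var = 3.2
unbiased = unbiased_sum / trials  # ~ 4.0
print(biased, unbiased)
```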
standard error (and eg means)
- standard error for a point estimate is the standard deviation of the sampling distribution
- ie we repeatedly pull n samples from the population, and consider the s.d. of the resulting distribution of parameter values
- eg for s.e. on the mean for n samples from a population with variance sig^2, s.e. = sig / sqrt(n)
- standard error is often estimated from a sample, and called the standard error estimate (or just standard error)
- eg the estimated s.e. on the mean for n samples from a population is sig_e / sqrt(n-1), where sig_e is the “biased” sample standard deviation (sqrt(n) in its denominator) and dividing by sqrt(n-1) applies the Bessel correction; equivalently, s / sqrt(n), with s the unbiased sample standard deviation
- typically for point estimates, the s.e. is estimated from a sample, and a distribution fitted to the case (such as Student’s t, assuming the population is approximately normal) is used
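Computing the estimated standard error of the mean from a single (made-up) sample, using the unbiased sample standard deviation:

```python
import math

sample = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0, 10.0, 13.0]  # made-up data
n = len(sample)
xbar = sum(sample) / n

# unbiased sample standard deviation (n-1 in the denominator)
s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
se = s / math.sqrt(n)  # equals sig_e / sqrt(n-1) with the biased s.d.
print(xbar, se)
```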
error vs residual
error–an error term accounts for inherent randomness in a statistical model for a given population’s data; the parameters for the statistical model are generally not known
residual–after using sample data to form estimates of statistical model parameters, the difference between the predictive model’s output and the sample observation is the residual
eg
- a very simple statistical model supposes the population of people heights is, mean height + error (note we do not necessarily know the mean height of the population)
- we take a sample of people, take the average height, and our fitted model is now, mean sample height; when we use this to make predictions (note this has no dependencies on predictor variables), the difference between sample observation and prediction is the residual
law of large numbers
the sample mean converges to the population mean as sample size increases (in probability for the weak law; almost surely for the strong law)
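A sketch: the running mean of fair-coin flips (true mean 0.5) settles toward 0.5 as n grows.

```python
import random

random.seed(3)  # deterministic for reproducibility

flips = [random.random() < 0.5 for _ in range(100_000)]  # fair coin, mean 0.5
for n in (10, 1_000, 100_000):
    print(n, sum(flips[:n]) / n)  # running mean approaches 0.5
```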
error types (I and II) and power of a test
- a table with rows H_0 accepted, H_0 rejected, and columns H_0 true, H_0 false
- type I error corresponds to the lower-left cell–H_0 was falsely rejected; can occur with the problem of “too many tests”
- type II error corresponds to the upper-right cell–H_0 was falsely accepted; can occur when sample sizes are too small
- in context of an H_A, a type II error means we’ve concluded H_0 is plausible when in reality H_A is true
- power of a hypothesis test = 1 - probability of type II error
- the probability the hypothesis was not falsely accepted (higher is better)
- with respect to an H_A:
- a high-power test discriminates well between the hypotheses: failing to reject H_0 is then good evidence against H_A
- with a low-power test, even if H_0 seems plausible, H_A may still be true–a weak result
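A Monte Carlo sketch of power (hypothetical two-sided z-test at alpha = 0.05, true effect mu = 0.5, sigma = 1): larger samples give higher power, ie fewer type II errors.

```python
import math
import random

random.seed(4)  # deterministic for reproducibility

def rejects(n, mu_true, z_crit=1.96):
    """Two-sided z-test of H_0: mu = 0 at alpha ~ 0.05, known sigma = 1."""
    xs = [random.gauss(mu_true, 1) for _ in range(n)]
    z = (sum(xs) / n) * math.sqrt(n)  # sample mean over its s.e. 1/sqrt(n)
    return abs(z) > z_crit

# estimate power = P(reject H_0 | H_A: mu = 0.5) by simulation
powers = {}
for n in (10, 50):
    powers[n] = sum(rejects(n, 0.5) for _ in range(5_000)) / 5_000
print(powers)  # power grows with sample size
```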
Bonferroni (multiple comparisons)
- a method for reducing type I error (false rejection of the null–eg, when comparing population means, falsely declaring a difference in means significant) across large numbers of comparisons; arguably very conservative
- simply divide the desired alpha level for the p-values (eg 0.05) by the number of tests run
- see also Holm-Bonferroni, FDR, and FWER
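The plain Bonferroni procedure on a set of hypothetical p-values:

```python
alpha = 0.05
p_values = [0.001, 0.008, 0.012, 0.04, 0.30]  # hypothetical p-values
m = len(p_values)

threshold = alpha / m  # Bonferroni: test each p-value at alpha / m
significant = [p for p in p_values if p < threshold]
print(threshold, significant)  # only the first two survive the correction
```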
Simpson’s paradox
an association between two random variables appears in the population as a whole, but disappears or reverses when the population is divided into subgroups (or vice versa)
eg graduate admission rates at a university are overall lower for women (total relationship), while within each department the effect disappears or reverses (partial relationship)
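A made-up numerical example of the admissions case: women have the lower admission rate overall, yet the higher rate in every department (because they apply more often to the harder department).

```python
# hypothetical admissions counts: (admitted, applied)
data = {
    "dept A": {"women": (18, 20), "men": (80, 100)},
    "dept B": {"women": (10, 100), "men": (2, 30)},
}

def overall_rate(group):
    admitted = sum(data[d][group][0] for d in data)
    applied = sum(data[d][group][1] for d in data)
    return admitted / applied

# total relationship: women admitted at the lower rate overall
print(overall_rate("women"), overall_rate("men"))

# partial relationship: women admitted at the higher rate in each department
for d in data:
    w, m = data[d]["women"], data[d]["men"]
    print(d, w[0] / w[1], m[0] / m[1])
```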