terms and definitions Flashcards
statistical model
a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data (and similar data from a larger population)
a statistical model represents, often in considerably idealized form, the data-generating process
a statistical model is usually specified as a mathematical relationship between one or more random variables and other non-random variables
the problem of multiple comparisons
beware of whenever someone does many tests and picks one that looks good
eg we flip 1000 fair coins 100 times each, then select the 10 “best” coins (the ones that came up heads most often) and claim these coins are “lucky”; in fact we have no causal basis for the top 10’s favoring of heads: with 1000 coins, some will score high by chance alone
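The coin example above can be simulated directly (a sketch; the seed and counts are just illustrative). The selected coins look “lucky” on the run used to select them, but a fresh run of the same coins gives ordinary results:

```python
import random

random.seed(0)

def flip(n_flips=100):
    """number of heads in n_flips tosses of a fair coin"""
    return sum(random.random() < 0.5 for _ in range(n_flips))

first_run = [flip() for _ in range(1000)]   # 1000 fair coins, 100 flips each
best_ids = sorted(range(1000), key=lambda i: first_run[i], reverse=True)[:10]

# the selected coins look "lucky" on the run we selected them from
# (all well above 50 heads)...
print([first_run[i] for i in best_ids])

# ...but on a fresh run the same coins are ordinary (~50 heads on average):
# selection after many comparisons, not luck, explained the first result
second_run = [flip() for _ in best_ids]
print(sum(second_run) / 10)
```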
prior probability
in Bayesian statistical inference, prior probability is the probability of an event based on established knowledge, before empirical data is collected; “what is originally believed before new evidence is introduced”
e.g. consider classifying an illness, knowing only that a person has a fever and a headache. These symptoms are indications of both influenza and Ebola. But far more people have the flu than Ebola (the prior probability of influenza is much higher than that of Ebola), so based on those symptoms alone, you would classify the illness as the flu
posterior probability
in Bayesian statistics, the posterior probability is the probability after conditioning on the observed event, ie after the desired “new” information has come in; it refines the probability distribution into an “adjusted guess”
a posterior can, in turn, become a prior, if we have, in turn, newer information that leads to its (the posterior-cum-prior) revision
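The flu-vs-Ebola card above can be worked as a prior-to-posterior update. The numbers here are made up for illustration (only their relative sizes matter); the update itself is just Bayes’ theorem, posterior ∝ prior × likelihood:

```python
# hypothetical numbers: priors reflect that flu is far more common,
# likelihoods P(fever & headache | illness) are similar for both
prior = {"flu": 0.999, "ebola": 0.001}
likelihood = {"flu": 0.9, "ebola": 0.95}

# Bayes' theorem: posterior is proportional to prior * likelihood,
# normalized so the posterior probabilities sum to 1
unnorm = {d: prior[d] * likelihood[d] for d in prior}
total = sum(unnorm.values())
posterior = {d: p / total for d, p in unnorm.items()}

# flu dominates: the prior swamps the similar likelihoods
print(posterior)
```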
margin of error (in context of samples)
Since a sample is used to represent a population, the sample’s results are expected to differ from what the result would have been if you had surveyed the entire population. This difference is called the margin of error. [eg in inferences on population means]
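A minimal sketch of a margin-of-error calculation for a sample mean, assuming the normal approximation (the z = 1.96 critical value gives a two-sided 95% level; the sd and n below are illustrative):

```python
import math

def margin_of_error(sample_sd, n, z=1.96):
    """95% margin of error for a sample mean (normal approximation);
    z=1.96 is the two-sided 95% critical value"""
    return z * sample_sd / math.sqrt(n)

# e.g. a survey of n=400 with sample sd 10: the sample mean is expected
# to land within about +/-0.98 of the population mean, 95% of the time
print(round(margin_of_error(10, 400), 2))   # -> 0.98
```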
t-statistic
usually in context of hypothesis testing between means (eg between two sample means, or between a sample mean and a population mean)
a general form for t statistics is,
t(ŷ) = (ŷ-y)/s.e.(ŷ),
ie the t statistic for point estimate ŷ is the recentered ŷ (ŷ minus the hypothesized value y) divided by the standard error of ŷ
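The general form above, specialized to the one-sample case (testing a sample mean against a hypothesized population mean), can be sketched as follows; the sample data is illustrative:

```python
import math

def t_statistic(sample, mu0):
    """one-sample t statistic: the recentered sample mean
    divided by its estimated standard error"""
    n = len(sample)
    mean = sum(sample) / n
    # Bessel-corrected sample variance
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    se = math.sqrt(var / n)          # estimated s.e. of the mean
    return (mean - mu0) / se

# hypothetical measurements, testing against a hypothesized mean of 5.0
sample = [5.1, 4.9, 5.6, 5.2, 4.8, 5.4]
print(t_statistic(sample, 5.0))
```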
probability triplet; event space, sample space
- a probability space, (O,A,P)
- O the sample space (eg the real line)
- A a sigma algebra of subsets of O (eg Borel sigma algebra)
- P a measure, normalized as P(O)=1
- subsets of O are called events; elements of A are called random events (ie events that can be measured, so a probability can be assigned)
- if O is countable, we generally call A the event space
- a random variable then maps outcomes in the sample space to some associated state space (ie assigns values to outcomes)
state space
this involves the “separation” of random events and values assigned to those events
for each outcome in the sample space (eg the result of 10 coin tosses), we can assign a value; a random variable’s state space consists of these values
variance
as the second centered moment:
var(X) = E( [X-E(X)]^2 ) = E(X^2)-(E(X))^2
linear functions of a random variable
let f(X) = aX+b; then:
* E[f(X)] = aE(X) + b
* var[f(X)] = a^2 var(X)
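The two identities above can be checked exactly on a small discrete distribution; a fair die with f(X) = 3X + 2 is used here purely as an example:

```python
# a fair die X, with the linear function f(X) = 3X + 2
vals = [1, 2, 3, 4, 5, 6]
p = 1 / 6

EX = sum(x * p for x in vals)                   # E(X) = 3.5
varX = sum((x - EX) ** 2 * p for x in vals)     # var(X) = 35/12

a, b = 3, 2
Ef = sum((a * x + b) * p for x in vals)
varf = sum((a * x + b - Ef) ** 2 * p for x in vals)

print(Ef, a * EX + b)        # both 12.5: E[f(X)] = aE(X) + b
print(varf, a * a * varX)    # both 26.25: var[f(X)] = a^2 var(X)
```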
statistical inference
deducing properties of an underlying probability distribution from a dataset
statistic
a quantity computed from a sample of the population (eg the sample mean)
point estimate
- a point estimate, x_e, of a population parameter, x_a, is a best guess of the value of x_a
- the bias of a point estimate is, bias = E(x_e)-x_a
- standard unbiased estimates include:
- sample mean for the mean of any distribution
- p_e = X/n for binomial B(n,p_a), where X is the observed number of successes
- sum_i (x_i - x̄)^2 / (n-1) for the variance of any distribution, where x̄ is the sample mean (Bessel’s correction)
- sampling distribution is the distribution of the point estimate derived from samples
- relative efficiency for two different point estimates, var(x_e1) / var(x_e2)
- mean square error for point estimate, E((x_e-x_a)^2) = var(x_e) + bias^2
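The bias definition above can be checked by simulation: dividing by n gives a variance estimate that is biased low, while the (n-1)-divisor estimate from the list above is unbiased. A sketch with an illustrative standard normal population (true variance 1):

```python
import random

random.seed(1)

true_var = 1.0            # variance of the standard normal population
n, trials = 5, 200_000    # many small samples to estimate E(x_e)

biased_sum, unbiased_sum = 0.0, 0.0
for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(n)]
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)
    biased_sum += ss / n           # n-divisor: biased low
    unbiased_sum += ss / (n - 1)   # Bessel-corrected: unbiased

# E(biased) = (n-1)/n * true_var = 0.8, so bias = -0.2;
# E(unbiased) = 1.0, so bias = 0
print(biased_sum / trials, unbiased_sum / trials)
```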
standard error
- standard error for a point estimate is the standard deviation of the point estimate’s sampling distribution; eg the s.e. of the mean of n samples from a population with variance sig^2 is s.e. = sig / sqrt(n)
- standard error is often estimated from a sample, and called the standard error estimate (or just standard error); eg the estimated s.e. of the mean for n samples is sig_e / sqrt(n), where sig_e is the sample standard deviation (with Bessel’s correction)
- typically for point estimates, the s.e. is estimated from a sample, and a special distribution (fitted to the case, such as Student’s t) is used
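A sketch comparing the estimated s.e. of the mean against the true value, for an illustrative population (uniform on [0,1), whose true sd is 1/sqrt(12)):

```python
import math
import random

random.seed(2)

def est_se_of_mean(sample):
    """estimated standard error of the mean:
    Bessel-corrected sample sd divided by sqrt(n)"""
    n = len(sample)
    m = sum(sample) / n
    s = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
    return s / math.sqrt(n)

# population: uniform on [0,1), true sd = 1/sqrt(12)
n = 50
true_se = (1 / math.sqrt(12)) / math.sqrt(n)

sample = [random.random() for _ in range(n)]
# the estimate computed from one sample is close to the true s.e.
print(est_se_of_mean(sample), true_se)
```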
error vs residual
error–an error term is the amount by which an observation differs from its true, unobservable population value (eg the population mean)
residual–a residual is the amount by which an observation differs from an estimate computed from sample data (eg the sample mean), including a model’s prediction