CHAPTER 3: Useful ideas and methods for inference Flashcards
Theorem 10 (CLT for iid variables).
If random variables X1, . . . , Xn are independent and
identically distributed with mean µ and variance σ^2 < ∞, then
[Σ from i=1 to n of X_i − nµ] / (σ√n) → Z ∼ N(0, 1), as n → ∞
Theorem 11 (CLT for iid random vectors).
If random vectors bold(X1), . . . , bold(Xn) are independent
and identically distributed with mean vector bold(µ) and variance-covariance matrix bold(Σ), finite,
then
[Σ from i=1 to n of bold(X_i) − n bold(µ)] / √n → bold(Z) ∼ N(bold(0), bold(Σ)), as n → ∞
(here the convergence is of random vectors, and the limit is the multivariate Normal distribution)
CLT notes:
- The → in these theorems denotes convergence in distribution.
- A quantity that converges in distribution to a Normal distribution is often said to be asymptotically Normal as n → ∞.
- The Central Limit Theorem remains true for dependent and/or non-identically distributed random variables/vectors under suitable conditions (see the simulation sketch below for the iid case).
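A minimal simulation sketch of Theorem 10 (not from the notes): the choice of an exponential distribution with mean 2.3, the sample sizes, and the number of replications are illustrative assumptions. Standardized sums should behave increasingly like N(0, 1) draws as n grows.

```python
# Sketch: standardized sums of iid Exponential(mean 2.3) variables approach N(0, 1).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.3, 2.3                     # mean and sd of Exponential with mean 2.3

for n in (5, 50, 500):
    # 10_000 replications of (sum(X_i) - n*mu) / (sigma * sqrt(n))
    samples = rng.exponential(scale=mu, size=(10_000, n))
    z = (samples.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))
    # For a N(0, 1) limit we expect mean ~ 0, variance ~ 1, and ~95% of values within +/- 1.96.
    print(n, z.mean().round(3), z.var().round(3), np.mean(np.abs(z) < 1.96).round(3))
```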
Likelihood and inference
We wish to draw conclusions about an unknown parameter θ, one- or multi-dimensional, on the basis of data x and a model f_X.
The sample observations x_1, . . . , x_n = bold(x) are modelled as the values of random variables X_1, . . . , X_n = bold(X),
whose joint probability density function (probability function in the discrete case) f_X depends on the unknown parameter θ.
Definition 12:
LIKELIHOOD
The likelihood of θ based on observed data x is defined to be the function
of θ:
L(θ) = L(θ; x) = f_X(x; θ).
*In the discrete case, for each θ, L(θ) gives the probability of observing the data x if θ is
the true parameter (provided f is from the correct family of distributions).
- L(θ) can be regarded as a measure of how plausible θ is as the value that generated the observed data x. In the continuous case, measurements are made only to a bounded precision, and the probability density function is proportional to the probability of finding the random variable in a small interval around the observed value.
Ratio of likelihoods
The ratio L(θ_1)/L(θ_2) measures how plausible θ_1 is relative to θ_2 as the value generating the data.
maximum likelihood
The most plausible value θˆ is the value of θ for which
L(θˆ) = max_θ [L(θ)];
it is called the maximum likelihood estimate.
Relative Likelihood
all values of θ for which the Relative Likelihood
RL(θ) = L(θ)/L(θˆ)
is not too much different from 1 are plausible in the light of the observed x.
(This is the ratio L(θ_1)/L(θ_2) with θ_2 taken to be θˆ, the value maximizing the likelihood.)
log-likelihood
It is often convenient to work with, and plot, the likelihood on a log scale.
log-likelihood is defined to be
l(θ) = log L(θ).
- Under independence the likelihood is a product, and the log transforms the multiplications into additions.
- Statements about relative likelihoods become statements about differences of log-likelihoods.
- Densities involving exponentials become easier to handle.
likelihood regions
Thus values of θ plausible in the light of the data (or consistent with the data) are those
contained in sets of the form
{θ : l(θ) > l(θˆ) − c}
for suitable constants c
*In the 1-dimensional case such a set is typically an interval (a likelihood interval).
- The value θˆ is the maximum likelihood estimator (mle) of θ: the value within the parameter space (the set of permissible values of the parameter) maximizing L(θ). To emphasize its dependence on the data x we may write θˆ(x).
*For inferences about θ, only relative values of the likelihood matter, so we can neglect constants (factors not depending on θ) and use whatever version of L or l is convenient.
*If we re-parametrize to φ = g(θ), where g is a continuous invertible function, then the likelihood changes in the obvious way: if L_1 denotes the likelihood with respect to φ, then L_1(φ) = L(g^−1(φ)). Also, most usefully, φˆ = g(θˆ): the maximum likelihood estimate transforms in the same way as the parameter.
Independence and logs
For independent X_i we have
L(θ) = ∏ from i=1 to n [f_{X_i}(x_i; θ)]
and
l(θ) = Σ from i=1 to n [log f_{X_i}(x_i; θ)],
where f_{X_i} denotes the density function of X_i.
likelihood equation(s)
θˆ may be found as the solution of the likelihood equation(s)
∂L(θ)/∂θ= 0
or equivalently,
∂l(θ)/∂θ = 0
i.e., the usual first-order condition for a maximum of the function.
EXAMPLE 8. Exponential sample
We observe a random sample x_1, . . . , x_n from the exponential distribution with unknown mean θ > 0. (For example, we could observe a Poisson process until we have n occurrences, and let x_i be the ith inter-occurrence time.)
The probability density function for each observation is
f_{Xi}(x; θ) =
{(1/θ)e^{−x/θ} x ≥ 0
{0 x < 0
so that
l(θ) =
{n(−log θ − x̄/θ)   if min x_i ≥ 0
{−∞   otherwise
Since
∂l/∂θ = n(−1/θ + ¯x/θ^2),
the maximum likelihood estimator is ˆθ = ¯x.
Recall that the usual parametrization of the exponential distribution uses the rate parameter λ = 1/θ, so that the density is instead
f_{X_i}(x; λ) =
{λ e^{−λx}   x ≥ 0
{0   x < 0
If we write down the log likelihood for λ, we get
l(λ) = n(log λ − λx¯), and maximizing this
gives λˆ = 1/x¯ = 1/ˆθ as expected.
A likelihood interval
would be found in this case by finding the values of θ for which
l(ˆθ) − l(θ) =
n (¯x/θ − 1 − log(¯x/θ)) < c.
Evidently numerical or graphical solution would be needed.
[Figure: plot of l(θ) based on a sample of size n = 10 for which x̄ = 2.3. The log-likelihood is a skewed hump with its maximum at 2.3; values of θ whose log-likelihood is within about 2 of the maximum are plausible estimates of the parameter given this small sample. A numerical sketch follows.]
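A minimal numerical sketch of the likelihood interval computation above, assuming the values n = 10 and x̄ = 2.3 from the figure; the cutoff c = 2 and the grid range are illustrative assumptions.

```python
# Sketch: likelihood interval {theta : l(theta_hat) - l(theta) < c} for the exponential sample.
import numpy as np

n, x_bar, c = 10, 2.3, 2.0

def drop(theta):
    # l(theta_hat) - l(theta) = n * (x_bar/theta - 1 - log(x_bar/theta))
    r = x_bar / theta
    return n * (r - 1 - np.log(r))

# Scan a grid of theta values and keep those within c of the maximum.
theta = np.linspace(0.5, 10, 10_000)
plausible = theta[drop(theta) < c]
print("mle:", x_bar, "approximate likelihood interval:", plausible.min(), plausible.max())
```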
Example 9. Markov chain
We consider a two state Markov chain (Xn), as in Example 4 but with state space S = {1, 2},
with transition matrix
[1 − θ θ ]
[ φ 1 − φ]
We assume that the chain is in equilibrium, and we consider finding the likelihood for the
parameters θ = (θ, φ).
The stationary distribution here is (φ/(θ+φ), θ/(θ+φ)).
Imagine we observe X_0 = 2, X_1 = 1. Because we assume the chain is in equilibrium, we have
P(X_0 = 2) = θ/(θ+φ)
so
P(X_0 = 2, X_1 = 1) =
[θ/(θ + φ)]φ
Hence this expression also gives us the likelihood of (θ, φ) given our observation, and we can
write
L(θ, φ; x) = θφ/(θ + φ).
Alternatively, imagine we observe the sequence of states 2, 1, 1, 2, 2, 2. Then our likelihood
becomes
L(θ, φ; x) =
[θ/(θ + φ)]φ(1 − θ)θ(1 − φ)(1 −φ) = [θ^2φ(1 − θ)(1 − φ)^2]/(θ + φ)
Plotting
*[Contour plot of L(θ, φ) against θ and φ for the two observations: the likelihood increases as θ and φ increase, since starting in state 2 is most probable when θ is high and the transition 2 → 1 is most probable when φ is high; the contours resemble reciprocal curves.]
**[Contour plot for the six observed states: with more information the plot shows a maximum near θˆ ≈ 0.57, φˆ ≈ 0.26.]
(In each case the likelihood is found as the stationary-distribution probability of the first state multiplied by the transition probabilities of the successive states.) A grid-evaluation sketch follows.
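A minimal sketch (not part of the notes) of the kind of computation behind the contour plot: evaluate the log-likelihood of the six-state sequence on a grid over (θ, φ) and locate its maximum. The grid resolution is an arbitrary choice.

```python
# Sketch: log-likelihood of the two-state chain (Example 9) for the observed sequence 2,1,1,2,2,2,
# assuming the chain starts from its stationary distribution.
import numpy as np

def loglik(theta, phi, states=(2, 1, 1, 2, 2, 2)):
    P = np.array([[1 - theta, theta],
                  [phi, 1 - phi]])                        # transition matrix
    stationary = np.array([phi, theta]) / (theta + phi)   # (pi_1, pi_2)
    ll = np.log(stationary[states[0] - 1])                # P(X_0 = first observed state)
    for a, b in zip(states, states[1:]):                  # add log transition probabilities
        ll += np.log(P[a - 1, b - 1])
    return ll

grid = np.linspace(0.01, 0.99, 99)
L = np.array([[loglik(t, p) for p in grid] for t in grid])
i, j = np.unravel_index(L.argmax(), L.shape)
print("grid maximum near (theta, phi) =", grid[i].round(2), grid[j].round(2))  # roughly (0.57, 0.26)
```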
Approximating the log likelihood
Expanding l(θ) in a Taylor series about its maximum, the first-derivative term vanishes at θˆ. It turns out that in many cases the log-likelihood can usefully be approximated by a quadratic function of θ, so it can be summarized by the position of the maximum and the curvature there.
Example 10. Exponential sample continued
[Figure: log relative likelihood l(θ) − l(θˆ) plotted against θ.] The figure shows the log relative likelihoods from samples of sizes n = 10, 20, 40 and 80 from the exponential distribution.
Each sample had mean
x¯ = 2.3.
Evidently as n increases
the log-likelihood becomes more peaked around its maximum. Thus it becomes less and less plausible that values of θ a fixed distance away from the maximum generated the data.
The curvature of l at ˆθ is measured by minus the second derivative −∂^2l/∂θ^2:
−∂^2l(θ)/∂θ^2=
n(2x¯/θ^3−1/θ^2)
which reduces at θ = ˆθ to n/x¯^2
increasing with n.
(The second derivative itself is negative at a peak, so its negative is positive.)
Definition 13:
OBSERVED INFORMATION
For 1-dimensional θ the function J(θ) = −∂^2l/∂θ^2
is called the observed
information about θ in the sample.
For p-dimensional θ the observed information is a matrix with components
J(θ)_rs =
−∂^2l(θ)/(∂θ_r∂θ_s)
*In the 1-dimensional case we will usually find J(θˆ) > 0; in the multi-dimensional case J(θˆ) is usually a positive definite matrix.
Log-likelihood approximated by a quadratic
For most likelihoods, not just the one in the example, it’s true that close to ˆθ the log
likelihood is well approximated by a quadratic function of θ:
l(θ) − l(ˆθ) ≈
(1/2)(θ − ˆθ)^2 ∂^2l(ˆθ)/∂θ^2
= −1/2(θ − ˆθ)^2 J(ˆθ)
This is only useful if ‘close’ includes the values of θ that are plausible. Usually, this is
increasingly true as the amount of information increases, for example as n increases in the i.i.d. case.
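A minimal sketch (not from the notes) comparing the exact log relative likelihood of the exponential example with its quadratic approximation −(1/2)(θ − θˆ)^2 J(θˆ); the value x̄ = 2.3 and the particular θ values and sample sizes are illustrative assumptions.

```python
# Sketch: exact log relative likelihood vs. quadratic approximation for the exponential sample.
import numpy as np

x_bar = 2.3
for n in (10, 80):
    J_hat = n / x_bar**2                     # observed information at theta_hat = x_bar
    for theta in (1.8, 2.3, 2.8):
        # exact: l(theta) - l(theta_hat) = -n * (x_bar/theta - 1 - log(x_bar/theta))
        exact = -n * (x_bar / theta - 1 - np.log(x_bar / theta))
        quad = -0.5 * (theta - x_bar) ** 2 * J_hat
        print(n, theta, round(exact, 3), round(quad, 3))
```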
How uncertain are findings from the likelihood?
The mle θˆ will
generally take different values for different data x
SAMPLING VARIABILITY
Consider, for example, five samples, each of size 20, from the exponential distribution with mean θ = 2.3: each sample yields a different value of θˆ.
Sampling variability is addressed by thinking of θˆ as a random variable θˆ(X), a function of the random vector X.
Let θ_0 denote the value of θ in the distribution from which the X_i were generated (the 'true' value of θ), and denote the maximum likelihood estimate of θ based on a random sample of size n by
θˆ_n = θˆ(X_1, . . . , X_n).
Under repeated sampling θˆ differs from θ_0 by an amount which, for large n, is approximately Normally distributed. Moreover, we can find its variance from the log-likelihood.
Definition 14:
EXPECTED INFORMATION FUNCTION/MATRIX
For 1-dimensional θ define the expected information function about θ by
I(θ) = −E(∂^2l(θ; X)/∂θ^2)
and for vector θ correspondingly define the expected information matrix as the matrix with components
I(θ)_rs = −E(∂^2l(θ; X)/(∂θ_r∂θ_s)).
The expectations here are with respect to the variation in X.
Key Fact 1 (Asymptotic Normality of mles in the iid case)
For θ of dimension p ≥ 1 we have the following result:
In the random sample case, under mild conditions, as sample size n → ∞,
I(θ_0)^{1/2}(θˆ_n(X) − θ_0) → N_p(0, 1_p)
in distribution, where
N_p(0, 1_p) denotes the multivariate Normal distribution with covariance matrix the p-dimensional unit matrix 1_p.
i.e., for large n,
θˆ .∼ N_p(θ_0, I(θ_0)^−1)
(where .∼, a dot over the ∼, means 'is approximately distributed as').
*Variants: θˆ .∼ N_p(θ_0, I(θˆ)^−1) and θˆ .∼ N_p(θ_0, J(θˆ)^−1).
These follow from continuity of I or J and the fact that θˆ approximates θ_0 more and more closely as n increases. They are useful since θˆ replaces the unknown θ_0 in the (co)variances, simplifying calculations. For example, in the 1-dimensional case the last variant is just
θˆ − θ_0 .∼ N(0, 1/J(θˆ)),
giving a 95% confidence interval as follows.
95% CI
the approximate 95% confidence interval for θ_0
(θˆ − 1.96√(1/J(θˆ)), θˆ + 1.96√(1/J(θˆ)))
There is some evidence that this interval based on observed information has better coverage properties than the corresponding interval based on expected information I.
Example 11. Exponential sample continued
From Examples 8 and 10, we have ∂l/∂θ = n(−1/θ + x̄/θ^2) and −∂^2l(θ)/∂θ^2 = n(2x̄/θ^3 − 1/θ^2).
Hence J(θ) = n(2x̄/θ^3 − 1/θ^2).
Since the expected value of X¯ is also θ, we have
I(θ) = n(2θ/θ^3 − 1/θ^2) = n/θ^2,
and as ˆθ = ¯x both
I(ˆθ) and J(ˆθ) are equal to n/(x¯^2)
Hence an approximate 95% confidence interval for θ is
(x̄ − 1.96 x̄/√n, x̄ + 1.96 x̄/√n).
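A minimal sketch of this interval, assuming for illustration the sample size n = 10 and mean x̄ = 2.3 used in the earlier figure.

```python
# Sketch: approximate 95% confidence interval for theta from the exponential sample.
import numpy as np

n, x_bar = 10, 2.3
se = x_bar / np.sqrt(n)                  # sqrt(1 / J(theta_hat)) = x_bar / sqrt(n)
ci = (x_bar - 1.96 * se, x_bar + 1.96 * se)
print("approximate 95% CI for theta:", ci)
```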