Chapter 7: The posterior Flashcards
Often we describe a distribution by its summary characteristics. For example, we often want to know the mean value of a parameter. Essentially, what is this?
This is essentially a weighted mean (where the weights are provided by the values of the probability density function).
If we have the mathematical formula for a continuous distribution, how do we calculate this?
We calculate this by the following integral:
E[θ | data] = ∫₀¹ θ × p(θ | data) dθ
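A minimal numerical sketch of this integral, assuming for illustration a Beta(3, 9) posterior (e.g. from observing 2 disease-positive people out of 10 with a uniform prior):

```python
import numpy as np
from scipy.stats import beta

# Illustrative posterior: Beta(3, 9), e.g. 2 disease-positive out of 10 with a uniform prior.
theta = np.linspace(0, 1, 10_001)           # grid over the parameter's support
posterior = beta.pdf(theta, a=3, b=9)       # p(theta | data) evaluated on the grid

# Weighted mean: the integral of theta * p(theta | data), approximated by a Riemann sum.
dtheta = theta[1] - theta[0]
posterior_mean = np.sum(theta * posterior) * dtheta
print(posterior_mean)                        # ~0.25, matching the analytic mean a / (a + b)
```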
A point estimate is dangerous to use without what?
A point estimate is dangerous to use without some measure of our confidence in the value.
Describe how to calculate one useful measure of uncertainty and name an alternative approach
One useful measure of uncertainty is a parameter’s variance:
var(θ | data) = ∫₀¹ (θ − E[θ | data])² × p(θ | data) dθ.
An alternative way to summarise uncertainty is by specifying an interval rather than a point estimate.
It is usually easier to understand a measure of uncertainty if it is expressed in the same units as the mean. How is this achieved?
By taking the square root of the variance, which gives the standard deviation.
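Continuing the illustrative Beta(3, 9) posterior from above, a sketch of the variance and its square root:

```python
import numpy as np
from scipy.stats import beta

theta = np.linspace(0, 1, 10_001)
post = beta.pdf(theta, a=3, b=9)              # same illustrative Beta(3, 9) posterior
dtheta = theta[1] - theta[0]

post_mean = np.sum(theta * post) * dtheta
post_var = np.sum((theta - post_mean) ** 2 * post) * dtheta   # var(theta | data)
post_sd = np.sqrt(post_var)                                   # same units as the mean
print(post_mean, post_var, post_sd)           # ~0.25, ~0.014, ~0.12
```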
Bayesian inference satisfies a property known as data order invariance. What does this mean?
If we have two sets of data and want to use one to calculate a posterior, which then becomes the prior for analysing the other set, it does not matter in which order the two updates are carried out: both orderings produce the same final posterior.
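A sketch of data order invariance under an assumed conjugate Beta-binomial model; the batch sizes are invented for illustration:

```python
def update(prior_a, prior_b, successes, trials):
    """Conjugate Beta-binomial update: returns the posterior's Beta parameters."""
    return prior_a + successes, prior_b + (trials - successes)

batch1 = (2, 10)   # hypothetical: 2 disease-positive out of 10
batch2 = (7, 20)   # hypothetical: 7 disease-positive out of 20

# Analyse batch1 first, then use that posterior as the prior for batch2 ...
a, b = update(1, 1, *batch1)
a, b = update(a, b, *batch2)
print(a, b)        # Beta(10, 22)

# ... or the other way round: the final posterior is identical.
a, b = update(1, 1, *batch2)
a, b = update(a, b, *batch1)
print(a, b)        # Beta(10, 22)
```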
How do we investigate how changes in the prior distribution p(θ) affect the posterior?
Suppose that we find that two individuals in our sample of 10 people are disease-positive. We can use Bayes’ rule to write down an expression for the posterior diseased proportion (using a binomial model):
p(θ | X = 2, N = 10) = p(X = 2 | θ, N = 10) × p(θ) / p(X = 2 | N = 10)
∝ p(X = 2 | θ, N = 10) × p(θ),
i.e. posterior ∝ likelihood × prior.
This tells us that the posterior is a sort of weighted geometric average of the likelihood and prior. This means that the posterior peak will be situated somewhere between the peaks of the likelihood and prior, so any changes to the prior will be mirrored by changes in the posterior.
see figure 7.3 in docs
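A rough sketch of this prior sensitivity using a conjugate Beta-binomial model; the three priors below are illustrative choices, not taken from the book:

```python
X, N = 2, 10                                   # 2 disease-positive out of 10, as in the card
priors = {"Beta(1, 1)": (1, 1),                # flat prior
          "Beta(2, 8)": (2, 8),                # prior belief in low prevalence
          "Beta(8, 2)": (8, 2)}                # prior belief in high prevalence

for name, (a, b) in priors.items():
    post_a, post_b = a + X, b + N - X          # conjugate Beta-binomial update
    peak = (post_a - 1) / (post_a + post_b - 2)  # posterior mode
    print(f"prior {name} -> posterior Beta({post_a}, {post_b}), peak at {peak:.2f}")
# The posterior peak sits between the likelihood peak (0.2) and each prior's peak.
```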
How do we investigate how changes in the likelihood affect the posterior?
Using the same expression. As we increase the number of disease-positive individuals, from X = 0 (left column) to X = 5 (middle column) to X = 10 (right column), we see that the likelihood shifts to the right and, correspondingly, the posterior peak shifts to give more weight to higher disease prevalences.
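A small illustration of the same effect, assuming a uniform prior for simplicity: the posterior mean moves to the right as X grows.

```python
N = 10
for X in (0, 5, 10):                 # disease-positive counts, as in the figure
    a, b = 1 + X, 1 + N - X          # posterior under an assumed uniform Beta(1, 1) prior
    print(f"X = {X:2d}: posterior mean = {a / (a + b):.2f}")   # 0.08, 0.50, 0.92
```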
How does sample size affect the shape of the curve?
As the sample size increases, the likelihood function becomes narrower and much smaller in value, since the probability of generating a larger data set with any particular characteristics diminishes. Maintaining the proportion of disease-positive individuals in our sample at 20%, we can demonstrate how the posterior changes as we increase the sample size in figure 7.5 in docs
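A sketch of this narrowing effect, again assuming a uniform prior and keeping 20% of each sample disease-positive:

```python
import math

for N in (10, 50, 250):                        # growing sample, 20% disease-positive throughout
    X = N // 5
    a, b = 1 + X, 1 + N - X                    # posterior under an assumed uniform prior
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))   # posterior standard deviation
    print(f"N = {N:3d}: posterior sd = {sd:.3f}")          # shrinks as N grows
```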
Why does the sample size have such an effect on the shape of the curve?
Since the posterior is related to the product of the likelihood and prior, it is sensitive to small values of either part. This means that as the sample size increases, and the likelihood function becomes smaller and narrower, the position of the posterior shifts towards the location of the likelihood peak.
While we can estimate the full posterior distribution for a parameter, we are often required to present point estimates. Why is this and what point does the book make regarding this topic?
This is sometimes to facilitate direct comparison with Frequentist approaches, but more often it is to allow policy makers to make decisions. The authors argue that, even if we are asked to provide a single estimated value, it is crucial that we provide a corresponding measure of uncertainty.
What are the three predominant point estimates in bayesian statistics?
There are three predominant point estimators in Bayesian statistics:
• the posterior mean
• the posterior median
• the maximum a posteriori (MAP) estimator
As described earlier, the posterior mean is just the expected value of the posterior distribution. For a univariate continuous example, this is calculated by an integral:
E[θ | data] = ∫ θ × p(θ | data) dθ
How is this calculated for discrete cases?
For the discrete case, we replace the above integral with a sum
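Concretely, for a parameter taking discrete values θᵢ, this standard replacement gives:
E[θ | data] = Σᵢ θᵢ × p(θᵢ | data)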
What is the posterior median?
The posterior median is the point of a posterior distribution where 50% of probability mass lies on either side of it.
What is the MAP estimator?
The MAP estimator is simply the parameter value that corresponds to the highest point in the posterior and consequently is also referred to as the posterior mode.
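The three estimators side by side, for the same illustrative Beta(3, 9) posterior; the mode uses the closed form for a Beta density, and the printed values are approximate:

```python
from scipy.stats import beta

a, b = 3, 9                               # illustrative posterior: Beta(3, 9)

post_mean = beta.mean(a, b)               # expected value of the posterior
post_median = beta.median(a, b)           # 50% of probability mass on either side
post_mode = (a - 1) / (a + b - 2)         # MAP estimate (posterior mode)

print(post_mean, post_median, post_mode)  # ~0.25, ~0.24, 0.20
```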
While each of these three estimators can be optimal in different circumstances, the authors believe that there is a clear hierarchy among them.
What estimator is at the top of the hierarchy? What is the reason for this?
At the top of the hierarchy is the posterior mean. This is the authors' favourite for two reasons: first, it typically yields sensible estimates which are representative of the central position of the posterior distribution; second, and more mathematically, this estimator makes sense from a measure-theoretic perspective, since it accounts for the measure (we don't need to fully understand this point).
While each of these three estimators can be optimal in different circumstances, the authors believe that there is a clear hierarchy among them.
What estimator is at the middle of the hierarchy? What is the reason for this and when is it preferable to use this estimator in comparison to the mean?
In the middle of the hierarchy is the posterior median. This is usually pretty close to the mean (see Figure 7.6 in docs) and is often indicative of the centre of the posterior distribution. It is sometimes preferable to use the median if the mean is heavily skewed by extreme values, although the choice between the two estimators depends on circumstance.
While each of these three estimators can be optimal in different circumstances, the authors believe that there is a clear hierarchy among them.
What estimator is at the bottom of the hierarchy? Why do people use this estimator?
At the bottom of the hierarchy, we have the MAP estimator. Proponents argue that the simplicity of this estimator is a benefit. It is simple to calculate because the denominator of Bayes' rule does not depend on the parameter, meaning that, to find the posterior mode, we can simply find the parameter value that maximises the numerator (likelihood × prior).
Why do the authors recommend against using the MAP estimator?
Its simplicity is misleading. The mode of a distribution often lies away from the bulk of probability mass and is hence not a particularly indicative central measure of the posterior. This estimator also does not make sense mathematically because it is based on the density, which depends on the particular parameterisation in question. You should not use the MAP estimator unless you have a very good reason for doing so.
What does the book describe as 'the mainstay of Frequentist statistics' and what is its Bayesian equivalent?
The mainstay of the Frequentist estimation procedure is the confidence interval. The Bayesian equivalent is the credible interval.
In applied research, these intervals often form part of the main results of a paper. For example:
“From our research, we concluded that the percentage of penguins with red tails, RT, has a 95% confidence interval of 1% ≤ RT ≤ 5%.”
How should this be interpreted?
This is often incorrectly taken as having an implicit meaning: ‘There is a 95% probability that the true percentage of penguins with red tails lies in the range of 1% to 5%.’ However, what it actually captures is uncertainty about the interval we calculate, rather than the parameter in question. In the Frequentist paradigm we imagine taking repeated samples from a population of interest, and for each of the fictitious samples, we estimate a confidence interval. A 95% confidence interval means that across the infinity of intervals that we calculate, the true value of the parameter will lie in this range 95% of the time.
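A small simulation of this repeated-sampling reading; the penguin numbers are invented for illustration. The interval varies from sample to sample, and roughly 95% of the intervals contain the fixed true value.

```python
import numpy as np

rng = np.random.default_rng(1)
true_rt = 0.03                       # an assumed 'true' proportion of red-tailed penguins
n, n_samples = 1_000, 10_000
covered = 0

for _ in range(n_samples):           # imagine repeatedly re-sampling 1,000 penguins
    x = rng.binomial(n, true_rt)
    p_hat = x / n
    half = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)   # simple Wald 95% confidence interval
    covered += (p_hat - half) <= true_rt <= (p_hat + half)

print(covered / n_samples)           # close to 0.95: it is the interval that varies, not the parameter
```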
How is this (confidence intervals) different to how we interpret credible intervals?
A confidence interval indicates uncertainty about the interval we obtain, rather than a statement of probability about the parameter of interest. The uncertainty is quantified in terms of all the samples we could have taken, not just the one we observe. Bayesian credible intervals, in contrast to confidence intervals, describe our uncertainty in the location of the parameter values, estimated using the current sample. They are calculated from the posterior density. In particular, a 95% credible region satisfies the condition that 95% of the posterior probability lies in this parameter range.
Therefore how would you interpret the following statement?
From our research, we concluded that the percentage of penguins with red tails, RT, has a 95% credible interval of 0% ≤ RT ≤ 4%
It can be interpreted straightforwardly as ‘From our research, we conclude that there is a 95% probability that the percentage of penguins with red tails lies in the range 0% ≤ RT ≤ 4%.’
What allows for this more straightforward interpretation of credible intervals?
An arbitrary credible interval of X% can be constructed from the posterior density by finding a region whose area is equal to X / 100.
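A sketch of that construction for the illustrative Beta(3, 9) posterior, taking the central interval between the 2.5% and 97.5% quantiles:

```python
from scipy.stats import beta

a, b = 3, 9                                        # illustrative posterior: Beta(3, 9)
lower, upper = beta.ppf([0.025, 0.975], a, b)      # central 95% credible interval
print(f"95% of the posterior probability lies in [{lower:.2f}, {upper:.2f}]")
```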