Chapter 7: The posterior Flashcards
Often we describe a distribution by its summary characteristics. For example, we often want to know the mean value of a parameter. Essentially, what is this?
This is essentially a weighted mean (where the weights are provided by the values of the probability density function).
If we have the mathematical formula for a continuous distribution, how do we calculate this?
We calculate this by the following integral:
E[θ | data] = ∫₀¹ θ × p(θ | data) dθ
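A minimal numerical sketch of this integral, assuming for illustration a Beta(3, 9) posterior (e.g. from observing 2 disease-positive people out of 10 with a uniform prior):

```python
import numpy as np
from scipy.stats import beta

# Illustrative posterior: Beta(3, 9), e.g. 2 disease-positive out of 10 with a uniform prior.
theta = np.linspace(0, 1, 10_001)           # grid over the parameter's support
posterior = beta.pdf(theta, a=3, b=9)       # p(theta | data) evaluated on the grid

# Weighted mean: the integral of theta * p(theta | data), approximated by a Riemann sum.
dtheta = theta[1] - theta[0]
posterior_mean = np.sum(theta * posterior) * dtheta
print(posterior_mean)                        # ~0.25, matching the analytic mean a / (a + b)
```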
A point estimate is dangerous to use without what?
A point estimate is dangerous to use without some measure of our confidence in the value.
Describe how to calculate one useful measure of uncertainty and name an alternative approach
One useful measure of uncertainty is a parameter’s variance:
var(θ | data) = ∫₀¹ (θ − E[θ | data])² × p(θ | data) dθ.
An alternative way to summarise uncertainty is by specifying an interval rather than a point estimate.
It is usually easier to understand a measure of uncertainty if it is expressed in the same units as the mean. How is this achieved?
By taking the square root of the variance, which gives the standard deviation.
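Continuing the illustrative Beta(3, 9) posterior from above, a sketch of the variance and its square root:

```python
import numpy as np
from scipy.stats import beta

theta = np.linspace(0, 1, 10_001)
post = beta.pdf(theta, a=3, b=9)              # same illustrative Beta(3, 9) posterior
dtheta = theta[1] - theta[0]

post_mean = np.sum(theta * post) * dtheta
post_var = np.sum((theta - post_mean) ** 2 * post) * dtheta   # var(theta | data)
post_sd = np.sqrt(post_var)                                   # same units as the mean
print(post_mean, post_var, post_sd)           # ~0.25, ~0.014, ~0.12
```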
Bayesian inference satisfies a property known as data order invariance. What does this mean?
If we have two sets of data and want to use one to calculate a posterior, which then becomes the prior for analysing the other set, it does not matter in which order the two updates are carried out: both orderings produce the same final posterior.
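A sketch of data order invariance under an assumed conjugate Beta-binomial model; the batch sizes are invented for illustration:

```python
def update(prior_a, prior_b, successes, trials):
    """Conjugate Beta-binomial update: returns the posterior's Beta parameters."""
    return prior_a + successes, prior_b + (trials - successes)

batch1 = (2, 10)   # hypothetical: 2 disease-positive out of 10
batch2 = (7, 20)   # hypothetical: 7 disease-positive out of 20

# Analyse batch1 first, then use that posterior as the prior for batch2 ...
a, b = update(1, 1, *batch1)
a, b = update(a, b, *batch2)
print(a, b)        # Beta(10, 22)

# ... or the other way round: the final posterior is identical.
a, b = update(1, 1, *batch2)
a, b = update(a, b, *batch1)
print(a, b)        # Beta(10, 22)
```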
How do we investigate how changes in the prior distribution p(θ) affect the posterior?
Suppose that we find that two individuals in our sample of 10 people are disease-positive. We can use Bayes’ rule to write down an expression for the posterior diseased proportion (using a binomial model):
p(θ | X = 2, N = 10) = p(X = 2 | θ, N = 10) × p(θ) / p(X = 2 | N = 10)
∝ p(X = 2 | θ, N = 10) × p(θ),
i.e. posterior ∝ likelihood × prior.
This tells us that the posterior is a sort of weighted geometric average of the likelihood and prior. This means that the posterior peak will be situated somewhere between the peaks of the likelihood and prior, so any changes to the prior will be mirrored by changes in the posterior.
see figure 7.3 in docs
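A rough sketch of this prior sensitivity using a conjugate Beta-binomial model; the three priors below are illustrative choices, not taken from the book:

```python
X, N = 2, 10                                   # 2 disease-positive out of 10, as in the card
priors = {"Beta(1, 1)": (1, 1),                # flat prior
          "Beta(2, 8)": (2, 8),                # prior belief in low prevalence
          "Beta(8, 2)": (8, 2)}                # prior belief in high prevalence

for name, (a, b) in priors.items():
    post_a, post_b = a + X, b + N - X          # conjugate Beta-binomial update
    peak = (post_a - 1) / (post_a + post_b - 2)  # posterior mode
    print(f"prior {name} -> posterior Beta({post_a}, {post_b}), peak at {peak:.2f}")
# The posterior peak sits between the likelihood peak (0.2) and each prior's peak.
```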
How do we investigate how changes in the likelihood affect the posterior?
Using the same expression. As we increase the number of disease-positive individuals, from X = 0 (left column) to X = 5 (middle column) to X = 10 (right column), we see that the likelihood shifts to the right and, correspondingly, the posterior peak shifts to give more weight to higher disease prevalences.
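A small illustration of the same effect, assuming a uniform prior for simplicity: the posterior mean moves to the right as X grows.

```python
N = 10
for X in (0, 5, 10):                 # disease-positive counts, as in the figure
    a, b = 1 + X, 1 + N - X          # posterior under an assumed uniform Beta(1, 1) prior
    print(f"X = {X:2d}: posterior mean = {a / (a + b):.2f}")   # 0.08, 0.50, 0.92
```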
How does sample size affect the shape of the curve?
As the sample size increases, the likelihood function becomes narrower and much smaller in value, since the probability of generating a larger data set with any particular characteristics diminishes. Maintaining the proportion of disease-positive individuals in our sample at 20%, we can demonstrate how the posterior changes as we increase the sample size in figure 7.5 in docs
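A sketch of this narrowing effect, again assuming a uniform prior and keeping 20% of each sample disease-positive:

```python
import math

for N in (10, 50, 250):                        # growing sample, 20% disease-positive throughout
    X = N // 5
    a, b = 1 + X, 1 + N - X                    # posterior under an assumed uniform prior
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))   # posterior standard deviation
    print(f"N = {N:3d}: posterior sd = {sd:.3f}")          # shrinks as N grows
```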
Why does the sample size have such an effect on the shape of the curve?
Since the posterior is related to the product of the likelihood and prior, it is sensitive to small values of either part. This means that as the sample size increases, and the likelihood function becomes smaller and narrower, the position of the posterior shifts towards the location of the likelihood peak.
While we can estimate the full posterior distribution for a parameter, we are often required to present point estimates. Why is this and what point does the book make regarding this topic?
This is sometimes to facilitate direct comparison with Frequentist approaches, but more often it is to allow policy makers to make decisions. The authors argue that, even if we are asked to provide a single estimated value, it is crucial that we provide a corresponding measure of uncertainty.
What are the three predominant point estimates in bayesian statistics?
There are three predominant point estimators in Bayesian statistics:
• the posterior mean
• the posterior median
• the maximum a posteriori (MAP) estimator
As described earlier, the posterior mean is just the expected value of the posterior distribution. For a univariate continuous example, this is calculated by an integral:
E[θ | data] = ∫ θ × p(θ | data) dθ
How is this calculated for discrete cases?
For the discrete case, we replace the above integral with a sum
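Concretely, for a parameter taking discrete values θᵢ, this standard replacement gives:
E[θ | data] = Σᵢ θᵢ × p(θᵢ | data)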
What is the posterior median?
The posterior median is the point of a posterior distribution where 50% of probability mass lies on either side of it.
What is the MAP estimator?
The MAP estimator is simply the parameter value that corresponds to the highest point in the posterior and consequently is also referred to as the posterior mode.
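The three estimators side by side, for the same illustrative Beta(3, 9) posterior; the mode uses the closed form for a Beta density, and the printed values are approximate:

```python
from scipy.stats import beta

a, b = 3, 9                               # illustrative posterior: Beta(3, 9)

post_mean = beta.mean(a, b)               # expected value of the posterior
post_median = beta.median(a, b)           # 50% of probability mass on either side
post_mode = (a - 1) / (a + b - 2)         # MAP estimate (posterior mode)

print(post_mean, post_median, post_mode)  # ~0.25, ~0.24, 0.20
```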
While each of these three estimators can be optimal in different circumstances, the authors believe that there is a clear hierarchy among them.
What estimator is at the top of the hierarchy? What is the reason for this?
At the top of the hierarchy is the posterior mean. This is the authors' favourite for two reasons: first, it typically yields sensible estimates which are representative of the central position of the posterior distribution; second, and more mathematically, this estimator makes sense from a measure-theoretic perspective, since it accounts for the measure (we don't need to fully understand this point).
While each of these three estimators can be optimal in different circumstances, the authors believe that there is a clear hierarchy among them.
What estimator is at the middle of the hierarchy? What is the reason for this and when is it preferable to use this estimator in comparison to the mean?
In the middle of the hierarchy is the posterior median. This is usually pretty close to the mean (see Figure 7.6 in docs) and is often indicative of the centre of the posterior distribution. It is sometimes preferable to use the median if the mean is heavily skewed by extreme values, although the choice between the two estimators depends on circumstance.
While each of these three estimators can be optimal in different circumstances, the authors believe that there is a clear hierarchy among them.
What estimator is at the bottom of the hierarchy? Why do people use this estimator?
At the bottom of the hierarchy, we have the MAP estimator. Proponents argue that the simplicity of this estimator is a benefit. It is simple to calculate because the denominator of Bayes' rule does not depend on the parameter, meaning that, to find the posterior mode, we can simply find the parameter value that maximises the numerator (likelihood × prior).
Why do the authors recommend against using the MAP estimator?
Its simplicity is misleading. The mode of a distribution often lies away from the bulk of probability mass and is hence not a particularly indicative central measure of the posterior. This estimator also does not make sense mathematically because it is based on the density, which depends on the particular parameterisation in question. You should not use the MAP estimator unless you have a very good reason for doing so.
What does the book describe as 'the mainstay of Frequentist statistics' and what is its Bayesian equivalent?
The mainstay of the Frequentist estimation procedure is the confidence interval. The Bayesian equivalent is the credible interval.
In applied research, these intervals often form part of the main results of a paper. For example:
“From our research, we concluded that the percentage of penguins with red tails, RT, has a 95% confidence interval of 1% ≤ RT ≤ 5%.”
How should this be interpreted?
This is often incorrectly taken as having an implicit meaning: ‘There is a 95% probability that the true percentage of penguins with red tails lies in the range of 1% to 5%.’ However, what it actually captures is uncertainty about the interval we calculate, rather than the parameter in question. In the Frequentist paradigm we imagine taking repeated samples from a population of interest, and for each of the fictitious samples, we estimate a confidence interval. A 95% confidence interval means that across the infinity of intervals that we calculate, the true value of the parameter will lie in this range 95% of the time.
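A small simulation of this repeated-sampling reading; the penguin numbers are invented for illustration. The interval varies from sample to sample, and roughly 95% of the intervals contain the fixed true value.

```python
import numpy as np

rng = np.random.default_rng(1)
true_rt = 0.03                       # an assumed 'true' proportion of red-tailed penguins
n, n_samples = 1_000, 10_000
covered = 0

for _ in range(n_samples):           # imagine repeatedly re-sampling 1,000 penguins
    x = rng.binomial(n, true_rt)
    p_hat = x / n
    half = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)   # simple Wald 95% confidence interval
    covered += (p_hat - half) <= true_rt <= (p_hat + half)

print(covered / n_samples)           # close to 0.95: it is the interval that varies, not the parameter
```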
How is this (confidence intervals) different to how we interpret credible intervals?
A confidence interval indicates uncertainty about the interval we obtain, rather than a statement of probability about the parameter of interest. The uncertainty is quantified in terms of all the samples we could have taken, not just the one we observe. Bayesian credible intervals, in contrast to confidence intervals, describe our uncertainty in the location of the parameter values, estimated using the current sample. They are calculated from the posterior density. In particular, a 95% credible region satisfies the condition that 95% of the posterior probability lies in this parameter range.
Therefore how would you interpret the following statement?
From our research, we concluded that the percentage of penguins with red tails, RT, has a 95% credible interval of 0% ≤ RT ≤ 4%
It can be interpreted straightforwardly as ‘From our research, we conclude that there is a 95% probability that the percentage of penguins with red tails lies in the range 0% ≤ RT ≤ 4%.’
What allows for this more straightforward interpretation of credible intervals?
An arbitrary credible interval of X% can be constructed from the posterior density by finding a region whose area is equal to X / 100.
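A sketch of that construction for the illustrative Beta(3, 9) posterior, taking the central interval between the 2.5% and 97.5% quantiles:

```python
from scipy.stats import beta

a, b = 3, 9                                        # illustrative posterior: Beta(3, 9)
lower, upper = beta.ppf([0.025, 0.975], a, b)      # central 95% credible interval
print(f"95% of the posterior probability lies in [{lower:.2f}, {upper:.2f}]")
```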