Chapter 5: Priors Flashcards
What part of Bayes’ theorem comprises the prior?
The p(θ) in the equation p(θ|data) = p(data|θ) × p(θ) / p(data) is the prior.
Chapter 4 introduced the concept of a likelihood and how this can be used to derive Frequentist estimates of parameters using the method of maximum likelihood.
What does this presuppose? Comment on this
This presupposes that the parameters in question are immutable, fixed quantities that actually exist and can be estimated by methods that can be repeated, or imagined to be repeated, many times
Is it reasonable to assume the parameters in question are fixed?
Gill (2007) indicates this is unrealistic for the vast majority of social science research: it is simply not possible to rerun elections, repeat surveys under exactly the same conditions, replay the stock market with exactly matching market forces, or re-expose clinical subjects to identical stimuli.
Furthermore, since parameters only exist because we have invented a model, we should be suspicious of any analysis which assumes they have a single ‘true’ value.
Gelman et al. (2013) suggest that there are two different interpretations of parameter probability distributions
What are these and why are they relevant?
The subjective state of knowledge interpretation, where we use a probability distribution to represent our uncertainty over a parameter’s true value; and the more objective population interpretation, where the parameter’s value varies between different samples we take from a population distribution
In both viewpoints, the model parameters are not viewed as static, unwavering constants as in Frequentist theory
If we adopt the state of knowledge viewpoint, what does our prior probability distribution represent?
The prior probability distribution represents our pre-data uncertainty for a parameter’s true value.
For example, imagine that a doctor gives their probability that an individual has a particular disease before the results of a blood test become available. Using their knowledge of the patient’s history, and their expertise on the particular condition, they assign a prior disease probability of 75%
Alternatively, imagine we want to estimate the proportion of the UK population that has this disease. Based on previous analyses we probably have an idea of the underlying prevalence, and uncertainty in this value. In this case, the prior is continuous and represents our beliefs for the prevalence
See figures for graphs of these examples
If we adopt the population viewpoint, what does our prior probability distribution represent?
Adopting the population perspective, we imagine the value of a parameter is drawn from a population distribution, which is represented by our prior.
For the disease prevalence example, we imagine the observed data sample is partly determined by the characteristics of the subpopulations from which the individuals were drawn. The other variability is sampling variation within those subpopulations. Here we can view the individual subpopulation characteristics as drawn from an overall population distribution of parameters, representing the entirety of the UK.
Is the prior always a valid probability distribution?
The prior is always a valid probability distribution and can be used to calculate prior expectations of a parameter’s value.
Why do we even need priors at all?
Bayes’ rule is really only a way to update our initial beliefs in light of data:
initial beliefs = {Bayes' rule + data} => new beliefs
Another question that can be asked is: Why can’t we simply let the prior weighting be constant across all values of θ?
Firstly, how would we achieve this?
Set p(θ) = 1 in the numerator of Bayes' rule, resulting in a posterior that takes the form of a normalised likelihood:
p(θ|data) = p(data|θ) / p(data)
This would surely mean we can avoid choosing a prior and, hence, thwart attempts to denounce Bayesian statistics as more subjective than Frequentist approaches. So why do we not do just that? Give two reasons
1) There is a pedantic, mathematical, argument against this, which is that p(θ) must be a valid probability distribution to ensure that the posterior is similarly valid. If our parameter is unbounded and we choose p(θ) = 1 (or in fact any positive constant), then the integral of p(θ) over all parameter values (for a continuous parameter) is infinite, and so p(θ) is not a valid probability distribution. See the short derivation below.
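In symbols, for a continuous parameter that can take any value on the real line:

```latex
\int_{-\infty}^{\infty} p(\theta)\,\mathrm{d}\theta
  = \int_{-\infty}^{\infty} c\,\mathrm{d}\theta
  = \infty \quad \text{for any constant } c > 0,
```

so a constant p(θ) cannot integrate to 1 and is therefore not a valid probability distribution.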
2) Another, perhaps more persuasive, argument is that assuming all parameter values are equally probable can result in nonsensical conclusions being drawn.
What if you use a prior which is not a valid probability distribution? Can you still get a valid probability distribution in the posterior?
Even if the prior is not a valid probability distribution, the resultant posterior can sometimes satisfy the properties of one. However, take care using these distributions for inference, as they are not technically valid posteriors, because Bayes' rule requires us to use a valid prior distribution. Such posteriors should be viewed, at best, as limiting cases obtained as the parameters of a proper prior distribution tend to ±∞.
Name a perhaps more persuasive, more intuitive argument against setting the prior to a constant (unity) across all parameter values
Assuming all parameter values are equally probable can result in nonsensical conclusions being drawn.
Demonstrate, with a coin flip example, that assuming all parameter values are equally probable can result in nonsensical conclusions
Suppose we want to determine whether a coin is fair, with an equal chance of heads and tails, or biased, with a very strong weighting towards heads. If the coin is fair, θ = 1, and if it is biased, θ = 0. Imagine that the coin is flipped twice, with the result {H,H}. Assuming a uniform prior results in a strong posterior weighting towards the coin being biased. This is because, if we assume that the coin is biased, then the probability of obtaining two heads is high, whereas, if we assume that the coin is fair, then the probability of obtaining this result is only 1/4. The maximum likelihood estimate (which coincides with the posterior mode due to the flat prior) is hence that the coin is biased.
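A minimal sketch of this calculation in Python, assuming the biased coin lands heads with probability 0.9 (the text only says "a very strong weighting towards heads", so this value is illustrative):

```python
# Posterior for the fair-vs-biased coin after observing {H, H}.
# Assumption: the "biased" coin lands heads with probability 0.9.
p_heads = {"fair": 0.5, "biased": 0.9}
prior = {"fair": 0.5, "biased": 0.5}                 # uniform prior over the two hypotheses

likelihood = {h: p_heads[h] ** 2 for h in p_heads}   # probability of two heads
numerator = {h: prior[h] * likelihood[h] for h in prior}
p_data = sum(numerator.values())
posterior = {h: numerator[h] / p_data for h in numerator}

print(posterior)  # {'fair': ~0.24, 'biased': ~0.76}: strong posterior weighting towards 'biased'
```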
Why is the Bayesian approach of choosing a prior seen as more honest in the eyes of some Bayesians?
All analysis involves a degree of subjectivity, particularly the choice of a statistical model. This choice is often viewed as objective, with little justification for the underlying assumptions necessary to arrive there. The choice of prior is at least explicit, leaving this aspect of Bayesian modelling subject to the same academic examination to which any analysis should be subjected. The statement of pre-experimental biases actually forces the analyst to self-examine and perhaps also reduces the temptation to manipulate the analysis to serve one’s own ends.
Describe the structure of a Bayes’ box with the following example:
Imagine a bowl of water covered with a cloth, containing five fish, each of which is either red or white. We want to estimate the total number of red fish in the bowl after we pick out a single fish and find it to be red. Before we pulled the fish out of the bowl, we had no strong belief in there being a particular number of red fish, so suppose that all possibilities (0 to 5) are equally likely and hence each has probability 1/6 in our discrete prior. Further, suppose that the random variable X ∈ {0,1} indicates whether the sampled fish is white or red. As before, we choose a Bernoulli likelihood:
Pr(X = 1 | Y = a) = a/5
where a ∈ {0,1,2,3,4,5} represents the possible numbers of red fish in the bowl, and X = 1 indicates that the single fish we sampled is red.
We start by listing the possible numbers of red fish in the bowl in the leftmost column. In the second column, we specify our prior probabilities for each of these numbers of red fish. In the third column, we calculate the likelihood for each of these possibilities using Pr(X = 1 | Y = a) = a/5. In the fourth column, we multiply the prior by the likelihood (the numerator of Bayes' rule), which when summed yields Pr(X = 1) = 1/2; the denominator of Bayes' rule, which normalises the numerator to yield the posterior distribution, is shown in the fifth column. See this table in the doc; a sketch of the same calculation follows below.
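A minimal sketch of this Bayes' box in Python, with the columns computed in the order described above:

```python
from fractions import Fraction

# Bayes' box for the fish-bowl example: a = possible number of red fish (0..5).
values = range(6)
prior = [Fraction(1, 6)] * 6                      # uniform discrete prior
likelihood = [Fraction(a, 5) for a in values]     # Pr(X = 1 | Y = a) = a/5

numerator = [p * l for p, l in zip(prior, likelihood)]   # prior x likelihood
p_data = sum(numerator)                                  # Pr(X = 1) = 1/2
posterior = [n / p_data for n in numerator]              # sums to 1

for a, p, l, n, post in zip(values, prior, likelihood, numerator, posterior):
    print(a, p, l, n, post)
```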
What does it mean for the posterior if the prior or likelihood is 0?
If either the prior or the likelihood is 0, as for the case of zero red fish being in the bowl (impossible, since we sampled a red fish), then this ensures that the posterior distribution is 0 at this point.
Explain the shape of the posterior acquired (in figures) in terms of Bayes’ rule
To explain its shape we resort to Bayes’ rule:
p(θ|data) = p(data|θ) × p(θ) / p(data)
∝ p(data|θ) × p(θ) {likelihood × prior}
where we obtain the second line because the denominator contains no θ dependence. Viewed in this light, the posterior is a sort of weighted (geometric) average of the likelihood and the prior. Because, in the above example, we specify a uniform prior, the posterior's shape is entirely determined by the likelihood.
Imagine that we believe that the game-maker likes fish of all colours and tends to include comparable numbers of each colour, so we modify our prior accordingly
How could this look and what effect would it have on the posterior?
You could use a (discretised) normal-shaped prior over 0–5, peaked at intermediate numbers of red fish. Again, because the posterior is essentially a weighted average of the likelihood and prior, this new prior results in a posterior that is less extreme, with a stronger posterior weighting towards more moderate numbers of red fish in the bowl. A sketch is shown below.
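A sketch of the same Bayes' box with a peaked prior; the exact prior weights below are assumptions, chosen only to be symmetric and peaked in the middle:

```python
# Same Bayes' box as before, but with a prior peaked at intermediate numbers of
# red fish. The weights are illustrative assumptions, not values from the text.
values = range(6)
weights = [1, 3, 6, 6, 3, 1]                         # symmetric, peaked shape
prior = [w / sum(weights) for w in weights]
likelihood = [a / 5 for a in values]                 # Pr(X = 1 | Y = a) = a/5

numerator = [p * l for p, l in zip(prior, likelihood)]
posterior = [n / sum(numerator) for n in numerator]

print([round(p, 3) for p in posterior])
# Compared with the uniform-prior posterior (a/15), this puts less weight on the
# extreme value a = 5 and more weight on moderate numbers of red fish.
```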
Suppose that we substitute our fish bowl example for a sample of individuals taken from the UK population. We assume the independence of individuals within our sample, and also that they are from the same population and are therefore identically distributed. We want to draw conclusions about the overall proportion of individuals within the population with a disease, θ. Suppose that in a sample of 10 individuals there are 3 who are disease-positive
What would our likelihood function look like?
We have a binomial likelihood of the form:
Pr(Z = 3 | θ) = (10 choose 3) θ^3 (1 − θ)^(10−3)
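A quick sketch of evaluating this likelihood at a few values of θ:

```python
from math import comb

def likelihood(theta, n=10, z=3):
    """Binomial likelihood Pr(Z = z | theta) for the disease example."""
    return comb(n, z) * theta**z * (1 - theta)**(n - z)

for theta in (0.1, 0.3, 0.5):
    print(theta, round(likelihood(theta), 4))
# The likelihood peaks at theta = 3/10 = 0.3.
```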
Why can we no longer use a Bayes' box for this example, as done previously?
Since the parameter of interest is now continuous, it appears that we cannot use Bayes’ box, as there would be infinitely many rows (corresponding to the continuum of possible θ values) to sum over.
How and why may we use a Bayes box for a continuous example?
We can still use it to approximate the shape of the posterior if we discretise the prior and likelihood at 0.1 intervals across the [0,1] range for θ
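A minimal sketch of that discretised Bayes' box, assuming a uniform prior over the grid:

```python
from math import comb

# Discretised Bayes' box for the disease-prevalence example: theta on a grid of
# 0.1 intervals, a uniform prior over the grid points, and a binomial likelihood
# for 3 disease-positive individuals out of 10.
grid = [i / 10 for i in range(11)]                 # theta = 0.0, 0.1, ..., 1.0
prior = [1 / len(grid)] * len(grid)                # discrete uniform prior
likelihood = [comb(10, 3) * t**3 * (1 - t)**7 for t in grid]

numerator = [p * l for p, l in zip(prior, likelihood)]
posterior = [n / sum(numerator) for n in numerator]

for t, post in zip(grid, posterior):
    print(f"theta = {t:.1f}  posterior ≈ {post:.3f}")
# The approximate posterior peaks at theta = 0.3, the same value as the likelihood.
```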
How does the calculation of the exact continuous posterior compare with the discretised Bayes' box?
The method to calculate the exact continuous posterior is identical to that in the discretised Bayes’ box except now we multiply two functions – one for the prior, the other for the likelihood.
What effect does choosing a flat prior have on the shape of the posterior?
The impact of using a flat prior is that the posterior is peaked at the same value of θ as the likelihood.
If we were uncertain about the proportion of individuals in a population with a particular disease, then we might specify a uniform prior. The use of a prior that has a constant value, p(θ) = constant, is attractive.
Why is it attractive?
because, in this case:
p(θ|data) = p(data|θ) × p(θ) / p(data)
∝ p(data|θ) × p(θ)
∝ p(data|θ),
and the shape of the posterior distribution is determined by the likelihood function. This is seen as a merit of uniform priors since they ‘let the data speak for itself’ through the likelihood.
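A quick numerical check of this point, reusing the disease example (3 positives out of 10) on a fine grid of θ values:

```python
from math import comb

# With a flat prior, the discretised posterior has exactly the same shape as the
# normalised likelihood, and its mode coincides with the maximum likelihood value.
grid = [i / 1000 for i in range(1001)]                    # fine grid over [0, 1]
lik = [comb(10, 3) * t**3 * (1 - t)**7 for t in grid]

flat_prior = [1.0] * len(grid)
post = [p * l for p, l in zip(flat_prior, lik)]
post = [x / sum(post) for x in post]
lik_norm = [x / sum(lik) for x in lik]

print(max(abs(a - b) for a, b in zip(post, lik_norm)))    # 0.0: identical shapes
print(grid[post.index(max(post))])                        # 0.3: the likelihood's peak
```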