Chapter 5: Priors Flashcards
What part of Bayes’ theorem comprises the prior?
The p(θ) in the equation p(θ|data) = p(data|θ) × p(θ) / p(data) is the prior.
Chapter 4 introduced the concept of a likelihood and how this can be used to derive Frequentist estimates of parameters using the method of maximum likelihood.
What does this presuppose? Comment on this
This presupposes that the parameters in question are immutable, fixed quantities that actually exist and can be estimated by methods that can be repeated, or imagined to be repeated, many times
Is it reasonable to assume the parameters in question are fixed?
Gill (2007) indicates this is unrealistic for the vast majority of social science research: it is simply not possible to rerun elections, repeat surveys under exactly the same conditions, replay the stock market with exactly matching market forces, or re-expose clinical subjects to identical stimuli.
Furthermore, since parameters only exist because we have invented a model, we should be suspicious of any analysis which assumes they have a single ‘true’ value.
Gelman et al. (2013) suggest that there are two different interpretations of parameter probability distributions
What are these and why are they relevant?
The subjective state of knowledge interpretation, where we use a probability distribution to represent our uncertainty over a parameter’s true value; and the more objective population interpretation, where the parameter’s value varies between different samples we take from a population distribution
In both viewpoints, the model parameters are not viewed as static, unwavering constants as in Frequentist theory
If we adopt the state of knowledge viewpoint, what does our prior probability distribution represent?
The prior probability distribution represents our pre-data uncertainty for a parameter’s true value.
For example, imagine that a doctor gives their probability that an individual has a particular disease before the results of a blood test become available. Using their knowledge of the patient’s history, and their expertise on the particular condition, they assign a prior disease probability of 75%
Alternatively, imagine we want to estimate the proportion of the UK population that has this disease. Based on previous analyses we probably have an idea of the underlying prevalence, and uncertainty in this value. In this case, the prior is continuous and represents our beliefs for the prevalence
See figures for graphs of these examples
If we adopt the population viewpoint, what does our prior probability distribution represent?
Adopting the population perspective, we imagine the value of a parameter is drawn from a population distribution, which is represented by our prior.
For the disease prevalence example, we imagine the observed data sample is partly determined by the characteristics of the subpopulations from which the individuals were drawn. The other variability is sampling variation within those subpopulations. Here we can view the individual subpopulation characteristics as drawn from an overall population distribution of parameters, representing the entirety of the UK.
Is the prior always a valid probability distribution?
The prior is always a valid probability distribution and can be used to calculate prior expectations of a parameter’s value.
Why do we even need priors at all?
Bayes’ rule is really only a way to update our initial beliefs in light of data:
initial beliefs = {Bayes' rule + data} => new beliefs
Another question that can be asked is: Why can’t we simply let the prior weighting be constant across all values of θ?
Firstly, how would we achieve this?
Set p(θ) = 1 in the numerator of Bayes' rule, resulting in a posterior that takes the form of a normalised likelihood:
p(θ|data) = p(data|θ) / p(data)
This would surely mean we can avoid choosing a prior and, hence, thwart attempts to denounce Bayesian statistics as more subjective than Frequentist approaches. So why do we not do just that? Give two reasons
1) There is a pedantic, mathematical, argument against this, which is that p(θ) must be a valid probability distribution to ensure that the posterior is similarly valid. If our parameter is unbounded and we choose p(θ) = 1 (or in fact any positive constant), then the integral of p(θ) over all parameter values (for a continuous parameter) is infinite, and so p(θ) is not a valid probability distribution. See the short derivation below.
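In symbols, for a continuous parameter that can take any value on the real line:

```latex
\int_{-\infty}^{\infty} p(\theta)\,\mathrm{d}\theta
  = \int_{-\infty}^{\infty} c\,\mathrm{d}\theta
  = \infty \quad \text{for any constant } c > 0,
```

so a constant p(θ) cannot integrate to 1 and is therefore not a valid probability distribution.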
2) Another, perhaps more persuasive, argument is that assuming all parameter values are equally probable can result in nonsensical conclusions being drawn.
What if you use a prior which is not a valid probability distribution? Can you still get a valid probability distribution in the posterior?
Even if the prior is not a valid probability distribution, the resultant posterior can sometimes satisfy the properties of one. However, take care using these distributions for inference, as they are not technically valid posteriors, because Bayes' rule requires us to use a valid prior distribution. Such posteriors should be viewed, at best, as limiting cases obtained as the parameters of a proper prior distribution tend to ±∞.
Name a perhaps more persuasive, more intuitive argument against setting the prior to a constant (unity) across all parameter values
Assuming all parameter values are equally probable can result in nonsensical conclusions being drawn.
Demonstrate, with a coin flip example, that assuming all parameter values are equally probable can result in nonsensical conclusions
Suppose we want to determine whether a coin is fair, with an equal chance of heads and tails, or biased, with a very strong weighting towards heads. If the coin is fair, θ = 1, and if it is biased, θ = 0. Imagine that the coin is flipped twice, with the result {H,H}. Assuming a uniform prior results in a strong posterior weighting towards the coin being biased. This is because, if we assume that the coin is biased, then the probability of obtaining two heads is high, whereas, if we assume that the coin is fair, then the probability of obtaining this result is only 1/4. The maximum likelihood estimate (which coincides with the posterior mode due to the flat prior) is hence that the coin is biased.
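A minimal sketch of this calculation in Python, assuming the biased coin lands heads with probability 0.9 (the text only says "a very strong weighting towards heads", so this value is illustrative):

```python
# Posterior for the fair-vs-biased coin after observing {H, H}.
# Assumption: the "biased" coin lands heads with probability 0.9.
p_heads = {"fair": 0.5, "biased": 0.9}
prior = {"fair": 0.5, "biased": 0.5}                 # uniform prior over the two hypotheses

likelihood = {h: p_heads[h] ** 2 for h in p_heads}   # probability of two heads
numerator = {h: prior[h] * likelihood[h] for h in prior}
p_data = sum(numerator.values())
posterior = {h: numerator[h] / p_data for h in numerator}

print(posterior)  # {'fair': ~0.24, 'biased': ~0.76}: strong posterior weighting towards 'biased'
```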
Why is the Bayesian approach of choosing a prior seen as more honest in the eyes of some Bayesians?
All analysis involves a degree of subjectivity, particularly the choice of a statistical model. This choice is often viewed as objective, with little justification for the underlying assumptions necessary to arrive there. The choice of prior is at least explicit, leaving this aspect of Bayesian modelling subject to the same academic examination to which any analysis should be subjected. The statement of pre-experimental biases actually forces the analyst to self-examine and perhaps also reduces the temptation to manipulate the analysis to serve one’s own ends.
Describe the structure of a Bayes’ box with the following example:
Imagine a bowl of water covered with a cloth, containing five fish, each of which is either red or white. We want to estimate the total number of red fish in the bowl after we pick out a single fish and find it to be red. Before we pulled the fish out of the bowl, we had no strong belief in there being a particular number of red fish, so suppose that all possibilities (0 to 5) are equally likely and hence each has probability 1/6 in our discrete prior. Further, suppose that the random variable X ∈ {0,1} indicates whether the sampled fish is white or red. As before, we choose a Bernoulli likelihood:
Pr(X = 1 | Y = a) = a/5
where a ∈ {0,1,2,3,4,5} represents the possible numbers of red fish in the bowl, and X = 1 indicates that the single fish we sampled is red.
We start by listing the possible numbers of red fish in the bowl in the leftmost column. In the second column, we specify our prior probabilities for each of these numbers of red fish. In the third column, we calculate the likelihood for each of these possibilities using Pr(X = 1 | Y = a) = a/5. In the fourth column, we multiply the prior by the likelihood (the numerator of Bayes' rule), which when summed yields Pr(X = 1) = 1/2; the denominator of Bayes' rule, which normalises the numerator to yield the posterior distribution, is shown in the fifth column. See this table in the doc; a sketch of the same calculation follows below.
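A minimal sketch of this Bayes' box in Python, with the columns computed in the order described above:

```python
from fractions import Fraction

# Bayes' box for the fish-bowl example: a = possible number of red fish (0..5).
values = range(6)
prior = [Fraction(1, 6)] * 6                      # uniform discrete prior
likelihood = [Fraction(a, 5) for a in values]     # Pr(X = 1 | Y = a) = a/5

numerator = [p * l for p, l in zip(prior, likelihood)]   # prior x likelihood
p_data = sum(numerator)                                  # Pr(X = 1) = 1/2
posterior = [n / p_data for n in numerator]              # sums to 1

for a, p, l, n, post in zip(values, prior, likelihood, numerator, posterior):
    print(a, p, l, n, post)
```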
What does it mean for the posterior if the prior or likelihood is 0?
If either the prior or the likelihood is 0, as for the case of zero red fish being in the bowl (impossible, since we sampled a red fish), then this ensures that the posterior distribution is 0 at this point.
Explain the shape of the posterior acquired (in figures) in terms of Bayes’ rule
To explain its shape we resort to Bayes’ rule:
p(θ|data) = p(data|θ) × p(θ) / p(data)
∝ p(data|θ) × p(θ) {likelihood × prior}
where we obtain the second line because the denominator contains no θ dependence. Viewed in this light, the posterior is a sort of weighted (geometric) average of the likelihood and the prior. Because, in the above example, we specify a uniform prior, the posterior's shape is entirely determined by the likelihood.
Imagine that we believe that the game-maker likes fish of all colours and tends to include comparable numbers of each colour, so we modify our prior accordingly
How could this look and what effect would it have on the posterior?
You could use a (discretised) normal-shaped prior over 0–5, peaked at intermediate numbers of red fish. Again, because the posterior is essentially a weighted average of the likelihood and prior, this new prior results in a posterior that is less extreme, with a stronger posterior weighting towards more moderate numbers of red fish in the bowl. A sketch is shown below.
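A sketch of the same Bayes' box with a peaked prior; the exact prior weights below are assumptions, chosen only to be symmetric and peaked in the middle:

```python
# Same Bayes' box as before, but with a prior peaked at intermediate numbers of
# red fish. The weights are illustrative assumptions, not values from the text.
values = range(6)
weights = [1, 3, 6, 6, 3, 1]                         # symmetric, peaked shape
prior = [w / sum(weights) for w in weights]
likelihood = [a / 5 for a in values]                 # Pr(X = 1 | Y = a) = a/5

numerator = [p * l for p, l in zip(prior, likelihood)]
posterior = [n / sum(numerator) for n in numerator]

print([round(p, 3) for p in posterior])
# Compared with the uniform-prior posterior (a/15), this puts less weight on the
# extreme value a = 5 and more weight on moderate numbers of red fish.
```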
Suppose that we substitute our fish bowl example for a sample of individuals taken from the UK population. We assume the independence of individuals within our sample, and also that they are from the same population and are therefore identically distributed. We want to draw conclusions about the overall proportion of individuals within the population with a disease, θ. Suppose that in a sample of 10 individuals there are 3 who are disease-positive
What would our likelihood function look like?
We have a binomial likelihood of the form:
Pr(Z = 3 | θ) = (10 choose 3) θ^3 (1 − θ)^(10−3)
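A quick sketch of evaluating this likelihood at a few values of θ:

```python
from math import comb

def likelihood(theta, n=10, z=3):
    """Binomial likelihood Pr(Z = z | theta) for the disease example."""
    return comb(n, z) * theta**z * (1 - theta)**(n - z)

for theta in (0.1, 0.3, 0.5):
    print(theta, round(likelihood(theta), 4))
# The likelihood peaks at theta = 3/10 = 0.3.
```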
Why can we no longer use a Bayes' box for this example, as done previously?
Since the parameter of interest is now continuous, it appears that we cannot use Bayes’ box, as there would be infinitely many rows (corresponding to the continuum of possible θ values) to sum over.
How and why may we use a Bayes box for a continuous example?
We can still use it to approximate the shape of the posterior if we discretise the prior and likelihood at 0.1 intervals across the [0,1] range for θ
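A minimal sketch of that discretised Bayes' box, assuming a uniform prior over the grid:

```python
from math import comb

# Discretised Bayes' box for the disease-prevalence example: theta on a grid of
# 0.1 intervals, a uniform prior over the grid points, and a binomial likelihood
# for 3 disease-positive individuals out of 10.
grid = [i / 10 for i in range(11)]                 # theta = 0.0, 0.1, ..., 1.0
prior = [1 / len(grid)] * len(grid)                # discrete uniform prior
likelihood = [comb(10, 3) * t**3 * (1 - t)**7 for t in grid]

numerator = [p * l for p, l in zip(prior, likelihood)]
posterior = [n / sum(numerator) for n in numerator]

for t, post in zip(grid, posterior):
    print(f"theta = {t:.1f}  posterior ≈ {post:.3f}")
# The approximate posterior peaks at theta = 0.3, the same value as the likelihood.
```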
How does the calculation of the exact continuous posterior compare with the discretised Bayes' box?
The method to calculate the exact continuous posterior is identical to that in the discretised Bayes’ box except now we multiply two functions – one for the prior, the other for the likelihood.
What effect does choosing a flat prior have on the shape of the posterior?
The impact of using a flat prior is that the posterior is peaked at the same value of θ as the likelihood.
If we were uncertain about the proportion of individuals in a population with a particular disease, then we might specify a uniform prior. The use of a prior that has a constant value, p(θ) = constant, is attractive.
Why is it attractive?
because, in this case:
p(θ|data) = p(data|θ) × p(θ) / p(data)
∝ p(data|θ) × p(θ)
∝ p(data|θ),
and the shape of the posterior distribution is determined by the likelihood function. This is seen as a merit of uniform priors since they ‘let the data speak for itself’ through the likelihood.
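A quick numerical check of this point, reusing the disease example (3 positives out of 10) on a fine grid of θ values:

```python
from math import comb

# With a flat prior, the discretised posterior has exactly the same shape as the
# normalised likelihood, and its mode coincides with the maximum likelihood value.
grid = [i / 1000 for i in range(1001)]                    # fine grid over [0, 1]
lik = [comb(10, 3) * t**3 * (1 - t)**7 for t in grid]

flat_prior = [1.0] * len(grid)
post = [p * l for p, l in zip(flat_prior, lik)]
post = [x / sum(post) for x in post]
lik_norm = [x / sum(lik) for x in lik]

print(max(abs(a - b) for a, b in zip(post, lik_norm)))    # 0.0: identical shapes
print(grid[post.index(max(post))])                        # 0.3: the likelihood's peak
```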