Chapter 3: Probability Flashcards

1
Q

What is meant by a random variable?

A

In probability theory, we describe the behaviour of random variables. This is a statistical term for variables that associate different numeric values with each of the possible outcomes of some random process.

2
Q

What is meant by the term random in random variable?

A

By random here we do not mean the colloquial use of this term to mean something that is entirely unpredictable. A random process is simply a process whose outcome cannot be perfectly known ahead of time (it may nonetheless be quite predictable).

3
Q

Imagine that we enter a lottery, where we select a number from 1 to 100, to have a chance of winning $1000. We suppose that in the lottery only one ball is drawn and it is fair, meaning that all numbers are equally likely to win.

Describe what the probability distribution of the winning number would look like.

A

It is a discrete probability distribution, since the variable we measure (the winning number) is confined to a finite set of values. It would therefore look like a set of 100 bars of equal width and equal height (1/100 each), since all numbers are equally likely to win.

4
Q

Compare the probability distribution for the lottery number with the one for the following case: before test driving a second-hand car, we are uncertain about its value. From seeing pictures of the car, we might think that it is worth anywhere from $2000 to $4000, with all values being equally likely.

A

Since the range of possible values is (approximately) continuous, the graph depicts a probability density rather than a probability. It is a single rectangle spanning $2000 to $4000, with a height (probability density) of 1/2000.

5
Q

The aforementioned cases are both examples of valid probability distributions. So what are their defining properties?

A

o All values of the distribution must be real and non-negative.
o The sum (for discrete random variables) or integral (for continuous random variables) across all possible values of the random variable must be 1.

6
Q

How is this satisfied in the discrete lottery case?

A

\[ \sum_{i=1}^{100} \frac{1}{100} = 1 \]

i.e. the sum of 100 terms, each equal to 1/100, is 1.
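As a minimal sketch, both defining properties can be checked mechanically for the lottery. The names `lottery` and `is_valid_discrete` are illustrative choices, not something from the book; exact fractions are used so the sum is not affected by floating-point rounding.

```python
from fractions import Fraction

# Fair lottery: each of the numbers 1..100 has probability 1/100.
lottery = {n: Fraction(1, 100) for n in range(1, 101)}

def is_valid_discrete(dist):
    """Check both defining properties of a discrete probability distribution."""
    non_negative = all(p >= 0 for p in dist.values())
    sums_to_one = sum(dist.values()) == 1
    return non_negative and sums_to_one

print(is_valid_discrete(lottery))  # True
```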

7
Q

How is this satisfied for the continuous case of the second-hand car example?

A

All values of the distribution must be real and non-negative: the graph indicates that p(v) = 1/2000 ≥ 0 for 2000 ≤ v ≤ 4000.

The integral across all possible values of the random variable must be 1: since integration is essentially just working out the area underneath a curve, we can calculate the integral by appealing to the geometry of the graph. Since this is just a rectangle, we calculate the integral by multiplying its base by its height:

\[ \text{area} = \int_{2000}^{4000} \frac{1}{2000} \, dv = \frac{1}{2000} \times 2000 = 1 \]
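The same area can be approximated numerically, which is how the check would be done when the shape is not a simple rectangle. This is only a sketch: `car_density` and `integrate` are hypothetical helper names, and the midpoint rule is one of many possible integration schemes.

```python
def car_density(v):
    """Uniform density for the car's value: 1/2000 on [2000, 4000], else 0."""
    return 1 / 2000 if 2000 <= v <= 4000 else 0.0

def integrate(f, lower, upper, steps=10_000):
    """Midpoint-rule approximation to the integral of f over [lower, upper]."""
    width = (upper - lower) / steps
    return sum(f(lower + (k + 0.5) * width) for k in range(steps)) * width

total_area = integrate(car_density, 2000, 4000)
print(round(total_area, 6))  # 1.0
```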

8
Q

This definition may seem arbitrary or, for some readers, like well-trodden territory. Why is it important to note?

A

It is of central importance to Bayesian statistics. Bayesians like to work with, and produce, valid probability distributions, because only valid probability distributions can be used to describe uncertainty. The pursuit of this ideal underlies the majority of methods in applied Bayesian statistics, both analytic and computational.

9
Q

How would you calculate the probability that the winning number, X, is 3 in the discrete probability distribution for the lottery? How would you calculate the probability that it is 10 or less?

A

Easy!
Pr(X = 3) = 1/100

To calculate the probability that the winning number is 10 or less, we just sum the probabilities of it being each of {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}:

Pr(X ≤ 10) = 10 × 1/100 = 1/10
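A short sketch of the same two calculations, assuming the lottery is stored as a dictionary from number to probability (the variable names are illustrative):

```python
from fractions import Fraction

# Fair lottery: each of the numbers 1..100 has probability 1/100.
lottery = {n: Fraction(1, 100) for n in range(1, 101)}

# Probability of a single value: read it straight off the distribution.
pr_three = lottery[3]

# Probability of an event: sum the probabilities of the values it contains.
pr_ten_or_less = sum(p for n, p in lottery.items() if n <= 10)

print(pr_three, pr_ten_or_less)  # 1/100 1/10
```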

10
Q

How would you calculate the probability that the value of the second-hand car is $2500?

A

We might be tempted to conclude that Pr(value = $2500) = 1/2000. However, by the same logic, we would deduce that the probabilities of the car's value being each of {$2500, $2500.10, $2500.01, $2500.001} are also 1/2000. Furthermore, we could deduce the same probability for an infinite number of possible values, which if summed together would yield infinity rather than 1. To avoid this infinite sum, for a continuous random variable we always have Pr(θ = number) = 0.

11
Q

What is the solution to this problem regarding infinite sums in continuous distributions?

A

When we consider p(θ) for a continuous random variable, it turns out we should interpret its values as probability densities, not probabilities. We can use a continuous probability distribution to calculate the probability that a random variable lies within an interval of possible values.

12
Q

What is the equivalent of a sum when calculating a probability from a continuous distribution?

A

We use the continuous analogue of a sum: an integral. Calculating an integral is equivalent to calculating the area under a probability density curve. For the car example, we can calculate the probability that the car’s value lies between $2500 and $3000 by determining the rectangular area underneath the graph:

Pr(2500 ≤ value ≤ 3000) = 1/2000 (height) × 500 (base) = 1/4
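The rectangle calculation can be sketched as a small function. `pr_between` is a hypothetical helper that only works for this uniform density; the loop illustrates how the probability shrinks towards zero as the interval around $2500 narrows, which is why a single exact value has probability zero.

```python
from fractions import Fraction

DENSITY = Fraction(1, 2000)  # height of the uniform density on [2000, 4000]

def pr_between(low, high):
    """Probability that the car's value lies in [low, high] (rectangle area)."""
    low, high = max(low, 2000), min(high, 4000)  # clip to the supported range
    return DENSITY * max(high - low, 0)

print(pr_between(2500, 3000))  # 1/4

# Narrower and narrower intervals starting at 2500.
for width in (100, 1, Fraction(1, 100)):
    print(width, pr_between(2500, 2500 + width))
```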

13
Q

What is the difference between p(…) and Pr(…)?

A

We use Pr to state explicitly that the result is a probability, whereas p(value) is a probability density.

14
Q

What is meant by the base in the previous calculation?

A

From the book's example of crossing the ice, where you are certain to fall in:

For densities we must supply a volume, which provides the exchange rate to convert it into a probability. Note that the word volume is used for its analogy with three-dimensional solids, where we calculate the mass of an object by multiplying the density by its volume. Analogously, here we calculate the probability mass of an infinitesimal volume:

probability mass = probability density x volume

However, here a volume need not correspond to an actual three-dimensional volume in space, but to a unit of measurement across a parameter range of interest. In the above examples we use a length then an area as our volume unit, but in other cases it might be a volume, a percentage or even a probability.

15
Q

How can we hope to obtain a sample of numbers from our distribution, since they are all individually impossible?

A

An impossible event has a probability of zero, but the converse does not hold. When we use the word impossible, we mean that the event is not within our space of potential outcomes.

Imagine a sample of numbers from a standard normal distribution. Here the purely imaginary number i does not belong to the set of possible outcomes and hence has zero probability. Conversely, consider attempting to guess exactly the number that we sample from a standard normal distribution. Clearly, obtaining the number 3.142 here is possible – it does not lie outside of the range of the distribution – so it belongs to our potential outcomes. However, if we multiply our probability density by the volume corresponding to this single value, then we get zero because the volume element is of zero width. So we see that events that have a probability of zero can still be possible.

16
Q

How do we use Bayes’ rule differently for probabilities and probability densities?

A

While it is important to understand that probabilities and probability densities are not the same types of entity, the good news for us is that Bayes’ rule is the same for each.

Only the notation changes. We write Pr(θ = 1 | X = 1) when the data, X, and the parameter, θ, are discrete, so that Pr denotes a probability. We write p(θ = 1 | X = 1) when the data and parameter are continuous, so that p denotes a probability density.

17
Q

What is the mean of a distribution?

A

A mean, or expected value, of a distribution is the long-run average value that would be obtained if we sampled from it an infinite number of times.

18
Q

How does the method of calculating a mean depend on the distribution?

A

The method to calculate the mean of a distribution depends on whether it is discrete or continuous in nature. However, the concept is essentially the same in both cases. The mean is calculated as a weighted sum (for discrete random variables) or integral (for continuous variables) across all potential values of the random variable where the weights are provided by the probability distribution.

19
Q

Give the equations for calculating the mean of a discrete distribution and of a continuous distribution.

A

For a discrete random variable:

\[ E(X) = \sum_{\alpha} \alpha \, \Pr(X = \alpha) \]

For a continuous random variable:

\[ E(X) = \int_{\text{all } \alpha} \alpha \, p(\alpha) \, d\alpha \]

In the two expressions, α is any one of the discrete set, or continuum, of possible values for the random variable X, respectively. We use Pr in the first expression in (3.9) and p in the second, to indicate these are probabilities and probability densities, respectively.

20
Q

Therefore, how would you calculate the mean winning number in the lottery example?

A

(1 × 1/100) + (2 × 1/100) + … + (99 × 1/100) + (100 × 1/100) = 50.5

You can also demonstrate the long-run nature of the mean by computationally simulating many plays of the lottery. As the number of games played increases, the running mean gets closer to this value.
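A minimal simulation along these lines, using only Python's standard library. The seed and the numbers of games are arbitrary choices made here for illustration; the printed running means will differ with other seeds but should settle near 50.5.

```python
import random
from fractions import Fraction

# Exact mean: weighted sum of each number times its probability 1/100.
exact_mean = sum(n * Fraction(1, 100) for n in range(1, 101))
print(float(exact_mean))  # 50.5

# Simulated plays of the lottery; the seed is arbitrary, for repeatability.
rng = random.Random(0)
draws = [rng.randint(1, 100) for _ in range(100_000)]

# Running mean after increasing numbers of games.
for n_games in (10, 1_000, 100_000):
    running_mean = sum(draws[:n_games]) / n_games
    print(n_games, round(running_mean, 2))
```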

21
Q

How would you calculate the expected (or mean) value of the second-hand car?

A

This amounts to integrating the curve v × 1/2000 between $2000 and $4000. The region bounded by this curve and the axis can be broken up into a rectangular region (A) and a triangular region (B), so we calculate the total area by summing the individual areas (see figures):

area = {2000 × 1} (A) + {0.5 × 2000 × 1} (B) = 3000

This is the area under the graph of the car’s value multiplied by its probability density. A is the rectangle and B is the triangle. In both, 2000 is the width of the base along the x axis (4000 − 2000). In A, 1 is the height of the curve at v = 2000 (that is, 2000 × 1/2000). In B, 1 is the additional rise of the curve between v = 2000 and v = 4000 (from 1 to 2). The factor 0.5 in B comes from the formula for the area of a triangle: half the base times the height.
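The geometric answer can be cross-checked by numerically integrating value times density. This is a sketch under the same uniform assumption; `integrate` and `value_times_density` are illustrative names, and the midpoint rule happens to be essentially exact here because the integrand is a straight line.

```python
def integrate(f, lower, upper, steps=10_000):
    """Midpoint-rule approximation to the integral of f over [lower, upper]."""
    width = (upper - lower) / steps
    return sum(f(lower + (k + 0.5) * width) for k in range(steps)) * width

def value_times_density(v):
    """Integrand for the mean: the value v weighted by the density 1/2000."""
    return v * (1 / 2000)

mean_value = integrate(value_times_density, 2000, 4000)

# Geometric version: rectangle A plus triangle B.
rectangle_a = 2000 * 1
triangle_b = 0.5 * 2000 * 1

print(round(mean_value, 6), rectangle_a + triangle_b)  # 3000.0 3000.0
```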

21
Q

Comment on the generalisability of these examples

A

Life is often more complex than the examples encountered thus far. We often must reason about the outcomes of a number of processes, whose results may be interdependent. The next few examples involve considering the outcome of two measurements to introduce the mechanics of two-dimensional probability distributions. Fortunately, these rules do not become more complex when generalising to higher dimensional problems. This means that if the reader is comfortable with the following examples, then they should understand the majority of calculations involving probability distributions.

22
Q

Imagine that you are a horse racing aficionado and want to quantify the uncertainty in the outcome of two separate races. In each race there are two horses from a particular stable, called A and B. From their historical performance over 100 races, you notice that both horses often react the same way to the racing conditions. When horse A wins, it is more likely that, later in the day, B will also win, and vice versa, with similar interrelations for the losses; when A finds conditions tough, so does B. Wanting to flex your statistical muscle, you represent the historical race results by the two-dimensional probability distribution (see figures)

                 XA = 0 (lose)   XA = 1 (win)
XB = 0 (lose)    30/100          10/100
XB = 1 (win)     10/100          50/100

Does this distribution satisfy the requirements for a valid probability distribution? Show why or why not

A

Since all the values of the distribution are real and non-negative, this satisfies our first requirement. Since our distribution is composed of two discrete random variables, we must sum over the possible values of both to test if it is normalised:

\[ \sum_{i=0}^{1} \sum_{j=0}^{1} \Pr(X_A = i, X_B = j) = \frac{3}{10} + \frac{1}{10} + \frac{1}{10} + \frac{5}{10} = 1 \]

XA and XB are random variables which represent the race for horses A and B, respectively. Notice that since our situation considers the outcome of two random variables, we must index the probability, Pr(XA,XB), by both.

23
Q

How can we interpret the probability distribution shown in this table? Specifically how do we figure out the probability that both horses lose?

A

The probability that both horses lose (and hence both their random variables equal 0) is just read off from the top-left entry in the table, meaning:
Pr(XA = 0, XB = 0) = 30/100
This is similar for looking at any of these outcomes:
Pr(XA = 1, XB = 1) = 50/100
Pr(XA = 1, XB = 0) = 10/100
Pr(XA = 0, XB = 1) = 10/100

24
Q

Suppose that we measure the foot size and literacy test scores for a group of individuals. Both of these variables can be assumed to be continuous.

How many dimensions do we need to plot it?

A

Since this distribution is two-dimensional we need three dimensions to plot it – two dimensions for the variables and one dimension for the probability density. These three-dimensional plots are, however, a bit cumbersome to deal with, and so we prefer to use contour plots to graph two-dimensional continuous probability distributions (see figures.)

25
Q

How can we interpret contour plots?

A

In contour plots, we mark the set of positions where the value of the probability density function is constant as contour lines. The steepness of the function (the magnitude of its gradient) at a particular position in parameter space is, hence, indicated by the local density of contour lines: closely spaced lines mean the density is changing quickly.

26
Q

Notice that in the right-hand plot of Figure 3.7, the contour lines are diagonally oriented. What does this mean?

A

This means that there is a positive correlation between foot size and scores on the literacy test; as an individual’s foot size increases, so does their literacy score, on average. (This arises because children’s age is a confound: older children tend to have both larger feet and higher literacy scores.)

27
Q

Although in the horse racing example there are two separate races, each with an uncertain outcome, we can still consider the outcome of one race on its own. Suppose, for example, that we witness only the result for A.

What would be the probability distribution that describes this outcome?

A

A marginal probability distribution. To calculate this, we must average out the dependence on the other variable. Since we are interested only in the result of A, we can sum down the column values for B to give us the marginal distribution of A (see figures).

Mathematically, we can write down this rule for a discrete two-dimensional probability distribution as:

\[ \Pr(A = \alpha) = \sum_{\beta} \Pr(A = \alpha, B = \beta) \]
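As a sketch, the summation rule can be applied directly to the horse racing table. The dictionary layout, with keys (a, b) for the outcomes of A and B, and the function name `marginal` are assumptions made for this illustration.

```python
from fractions import Fraction

# Joint distribution Pr(XA = a, XB = b), keyed by (a, b); 0 = lose, 1 = win.
joint = {
    (0, 0): Fraction(30, 100), (1, 0): Fraction(10, 100),
    (0, 1): Fraction(10, 100), (1, 1): Fraction(50, 100),
}

def marginal(joint, axis):
    """Sum the joint over the other variable; axis 0 is A, axis 1 is B."""
    result = {}
    for outcome, p in joint.items():
        result[outcome[axis]] = result.get(outcome[axis], 0) + p
    return result

marginal_a = marginal(joint, 0)
print(marginal_a[0], marginal_a[1])  # 2/5 3/5
```

These values match the 40/100 and 60/100 that appear as Pr(XA) in the book's table.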

28
Q

How does this compare with how we calculate a marginal distribution for continuous random variables? Write out and explain the equation for this.

A

For continuous random variables we use the continuous analogue of a sum, an integral, to calculate the marginal distribution because the other variable can now equal any of a continuum of possible values:

\[ p_A(\alpha) = \int_{\text{all } \beta} p_{AB}(\alpha, \beta) \, d\beta \]

pAB(α,β) represents the joint probability distribution of random variables A and B, evaluated at (A = α, B = β). Similarly, pA(α) represents the marginal distribution of random variable A, evaluated at A = α. Although it is somewhat an abuse of notation, for simplicity, from now on we write pAB(α,β) as p(A,B) and pA(α) as p(A).

29
Q

Therefore how would you calculate marginal probability in the foot size and literacy test example if we want to summarise the distribution for literacy score, irrespective of foot size?

A

We can obtain this distribution by ‘integrating out’ the dependence on foot size:

\[ p(\text{score}) = \int_{0}^{30} p(\text{score}, \text{FS}) \, d\text{FS} \]

The result of carrying out the calculation in (3.17) is the distribution shown in the right-side graph in Figure 3.8. We have rotated this graph to emphasise that it is obtained by summing (really, integrating) across the joint density at each individual value of literacy score.

Similarly, we can obtain the marginal distribution for foot size by integrating the joint density with respect to literacy score. The resultant distribution is shown in the bottom graph of Figure 3.8.

30
Q

What is an alternative approach to estimating the marginal distribution of literacy test score?

A

Sampling from the joint distribution of literacy score and foot size. In particular, if we can generate independent samples from the joint distribution of literacy score and foot size, we can estimate the marginal distribution for each variable. To estimate these marginal distributions we ignore the observations of the variable not directly of interest and draw a histogram of the remaining samples (see figures). While not exact, the shape of this histogram is a good approximation of the marginal distribution if we have enough samples.
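The sampling approach can be sketched as follows. The joint model here is entirely made up for illustration (a shared "age" variable driving both foot size and score, with invented coefficients); it is not the distribution used in the book. The point is only the mechanics: draw joint samples, discard the variable that is not of interest, and bin the rest.

```python
import random
from collections import Counter

rng = random.Random(1)  # arbitrary seed, for repeatability

def sample_joint():
    """One (foot_size_cm, literacy_score) draw from a made-up joint model."""
    age = rng.uniform(5, 16)                      # hypothetical shared driver
    foot_size = 12 + 0.8 * age + rng.gauss(0, 1)  # illustrative numbers only
    score = 20 + 5 * age + rng.gauss(0, 8)
    return foot_size, score

samples = [sample_joint() for _ in range(20_000)]

# Marginal of literacy score: ignore foot size, then bin what remains.
bin_width = 10
histogram = Counter(int(score // bin_width) * bin_width for _, score in samples)

# Each bin's share of the samples approximates its marginal probability.
for left_edge in sorted(histogram):
    share = histogram[left_edge] / len(samples)
    print(f"{left_edge:>4} to {left_edge + bin_width:<4} {share:.3f}")
```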

31
Q

What kind of diagrams are an alternative way to think about marginal distributions? Explain

A

An alternative way to think about marginal distributions is using Venn diagrams. In a Venn diagram, the area of a particular event indicates its probability, and the rectangular area represents all the events that can possibly happen, so it has an area of 1. In Figure 3.10, we specify the events of horses A and B winning as sub-areas in the diagram. These areas overlap, indicating a region of joint probability, Pr(XA = 1, XB = 1). Using this diagram, it is straightforward to calculate the marginal probability of A or B winning: we find the area of the elliptic shape A or B, respectively (see figures).

32
Q

What is a conditional probability distribution?

A

In probability, when we observe one variable and want to update our uncertainty for another variable, we are seeking a conditional distribution. This is because we compute the probability distribution of one uncertain variable, conditional on the known value of the other(s).

33
Q

In the two dimensional examples described above, how many dimensions does the conditional distribution have?

A

In each case, we have reduced some of the uncertainty in the system by observing one of its characteristics. Hence, in the two-dimensional examples described above, the conditional distribution is one-dimensional because we are only now uncertain about one variable.

34
Q

What equation do we use to obtain the probability of one variable, conditional on the value of the other?

A

p(A | B) = p(A, B) / p(B)

p(A|B) refers to the probability (or probability density) of A occurring, given that B has occurred. On the right-hand side of this expression, p(B) is the marginal distribution of B, and p(A,B) is the joint probability that A and B both occur.

For the horses example, suppose that we observe that horse A wins. To calculate the probability that B also wins, we use:

Pr(XB = 1 | XA = 1) = Pr(XA = 1, XB = 1) / Pr(XA = 1) = (50/100) / (10/100 + 50/100) = 5/6
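A small sketch of this renormalisation, assuming the joint table is stored with keys (a, b) for the outcomes of A and B. The function name `conditional_b_given_a` is an illustrative choice.

```python
from fractions import Fraction

# Joint distribution Pr(XA = a, XB = b), keyed by (a, b); 0 = lose, 1 = win.
joint = {
    (0, 0): Fraction(30, 100), (1, 0): Fraction(10, 100),
    (0, 1): Fraction(10, 100), (1, 1): Fraction(50, 100),
}

def conditional_b_given_a(joint, a):
    """Pr(XB = b | XA = a): keep the entries with XA = a, divide by Pr(XA = a)."""
    pr_a = sum(p for (a_val, _), p in joint.items() if a_val == a)
    return {b: p / pr_a for (a_val, b), p in joint.items() if a_val == a}

given_a_wins = conditional_b_given_a(joint, 1)
print(given_a_wins[0], given_a_wins[1])  # 1/6 5/6
```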

35
Q

In the following table describing the probability of success:

                 XA = 0 (lose)   XA = 1 (win)
XB = 0 (lose)    30/100          10/100
XB = 1 (win)     10/100          50/100

How can we use this table to calculate conditional probability when we observe that A wins?

A

We reduce our solution space to only the column where XA = 1. We then renormalise the solution space to have a total probability of 1 by dividing each of its entries by the sum of the probabilities in that column:

                 XA = 0     XA = 1     Pr(XB | XA = 1)
XB = 0 (lose)    30/100     10/100     (10/100) / (60/100) = 1/6
XB = 1 (win)     10/100     50/100     (50/100) / (60/100) = 5/6
Pr(XA)           40/100     60/100

36
Q

When do we say that two events are dependent?

A

If there is a relationship between two random variables, we say that they are dependent. This does not necessarily mean causal dependence, as it is sometimes supposed, in that the behaviour of one random variable affects the outcome of another. It just means that the outcome of the first is informative for predicting the second.

37
Q

When are two events deemed disjoint?

A

If two events, A and B, are disjoint, then if one occurs, the other cannot.

38
Q

If two events are disjoint then are they dependent or independent?

A

It is often mistakenly believed that disjoint events are independent, but this is not true. Knowledge that event A has occurred provides significant information about whether B will: if A occurs, then we know for certain that B cannot. Disjoint events are therefore dependent.

39
Q

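Mathematically, what does it mean for two events to be independent?

A

If A and B are independent, knowing that B has occurred provides no additional information about A. Mathematically, this means that the conditional probability of A is equal to its marginal: Pr(A | B) = Pr(A). (Disjoint events, by contrast, satisfy Pr(A, B) = 0.)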

40
Q

How can you test whether two outcomes are independent?

A

Using the conditional probability rule, Pr(A | B) = Pr(A, B) / Pr(B), we can rewrite the independence condition Pr(A | B) = Pr(A) as:
Pr(A, B) / Pr(B) = Pr(A)

In other words, the ratio of the joint probability of A and B occurring to the marginal probability of B is the same as the overall probability of A. Equivalently, Pr(A, B) = Pr(A) × Pr(B).
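As a sketch, the test can be applied to every cell of a two-variable table by comparing the joint probability with the product of the marginals, which is equivalent to the ratio form whenever Pr(B) > 0. The function name `is_independent` and the `fair_coins` comparison table are hypothetical additions for illustration; the `horses` table is the one from the earlier cards.

```python
from fractions import Fraction

def is_independent(joint):
    """True if Pr(A, B) = Pr(A) * Pr(B) for every cell of a two-variable table."""
    pr_a, pr_b = {}, {}
    for (a, b), p in joint.items():
        pr_a[a] = pr_a.get(a, 0) + p  # marginal of A
        pr_b[b] = pr_b.get(b, 0) + p  # marginal of B
    return all(p == pr_a[a] * pr_b[b] for (a, b), p in joint.items())

# Horse racing table, keyed by (outcome of A, outcome of B).
horses = {
    (0, 0): Fraction(30, 100), (1, 0): Fraction(10, 100),
    (0, 1): Fraction(10, 100), (1, 1): Fraction(50, 100),
}

# Hypothetical comparison: two fair coin flips, which are independent.
fair_coins = {(a, b): Fraction(1, 4) for a in (0, 1) for b in (0, 1)}

print(is_independent(horses), is_independent(fair_coins))  # False True
```

For the horses, Pr(XA = 1, XB = 1) = 1/2 while Pr(XA = 1) × Pr(XB = 1) = 9/25, so the outcomes are dependent.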

41
Q

What is meant by the central limit theorem?

A

When we take the mean of a sample, it tends to lie near the centre of the distribution. This central tendency of the sample mean increases along with sample size, since extreme values then require more individual scores to be simultaneously extreme, which is less likely. Also, as our sample size increases, the distribution of the sample mean is an increasingly good fit to the normal distribution. This approximation, it turns out, becomes exact in the limit of an infinite sample size, a result known as the central limit theorem (CLT).
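A rough simulation can illustrate both effects, here using means of fair lottery draws (numbers 1 to 100) as the underlying scores. The seed, sample sizes and number of repeats are arbitrary choices for this sketch, and the "share within one standard deviation" check is only an informal indicator of normality (a normal distribution puts roughly 68% there).

```python
import random
import statistics

rng = random.Random(42)  # arbitrary seed, for repeatability

def sample_means(sample_size, n_repeats=5_000):
    """Means of n_repeats independent samples of fair lottery draws."""
    return [
        statistics.fmean([rng.randint(1, 100) for _ in range(sample_size)])
        for _ in range(n_repeats)
    ]

spread = {}
for sample_size in (1, 5, 20):
    means = sample_means(sample_size)
    spread[sample_size] = statistics.stdev(means)
    # Share of sample means within one standard deviation of the true mean 50.5.
    within = sum(abs(m - 50.5) <= spread[sample_size] for m in means) / len(means)
    print(sample_size, round(spread[sample_size], 2), round(within, 3))
```

The spread of the sample mean should shrink as the sample size grows, roughly in proportion to one over the square root of the sample size.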

42
Q

When is the central limit theorem applicable?

A

For practical purposes, the approximation is generally reasonable if the sample size is above about 20.

43
Q

Why is this just one version of the CLT?

A

There are, in fact, a number of central limit theorems. The above CLT applies to the average of independent, identically distributed random variables. However, there are also central limit theorems that apply under far less stringent conditions. This means that whenever an output is the result of the sum or average of a number of largely independent factors, it may be reasonable to assume it is normally distributed.