Lecture 8 Two theorems on conditional probability
Last time we met the conditional probability \(\mathbb P(B \mid A)\) of one event \(B\) given another event \(A\). In this lecture we will be looking at two very useful theorems about conditional probability, called the law of total probability and Bayes’ theorem – and they’re particularly powerful when used together.
8.1 Law of total probability
Example 8.1 My friend has three dice: a 4-sided dice, a 6-sided dice, and a 10-sided dice. He picks one of them at random, with each dice equally likely. What is the probability my friend rolls a 5?
If my friend were to tell me which dice he picked, then this question would be very easy! If we write \(D_4\), \(D_6\) and \(D_{10}\) to be the events that he picks the 4-sided, 6-sided, or 10-sided dice, then we know immediately that \[ \mathbb P(\text{roll 5} \mid D_4) = 0 \qquad \mathbb P(\text{roll 5} \mid D_6) = \tfrac16 \qquad \mathbb P(\text{roll 5} \mid D_{10}) = \tfrac{1}{10} . \] What we need is a way to combine the results for the different “sub-cases” into an overall answer.
Luckily, there exists just such a tool for this job! It’s called the “law of total probability” (also known as the “partition theorem”). The important point is to make sure that the different sub-cases cover all possibilities, but that only one of them happens at a time.
Definition 8.1 A set of events \(A_1, A_2, \dots, A_n\) is said to be a partition of the sample space \(\Omega\) if
- they are disjoint, in that \(A_i \cap A_j = \varnothing\) for all \(i \neq j\);
- they cover the whole space, in that \(A_1 \cup A_2 \cup \cdots \cup A_n = \Omega\).
Theorem 8.1 (Law of total probability) Let \(A_1, A_2, \dots, A_n\) be a partition, and \(B\) another event. Then \[ \mathbb P(B) = \sum_{i=1}^n \mathbb P(A_i) \, \mathbb P(B \mid A_i) . \]
So the law of total probability tells us we can add up the probabilities \(\mathbb P(B \mid A_i)\) for each of the sub-cases, provided we weight each one by the probability \(\mathbb P(A_i)\) of that sub-case occurring.
Proof. Since the \(A_i\) in the partition cover the whole space, we can split up \(B\) depending on which part of the partition it is in: \[ B = (B \cap A_1) \cup (B \cap A_2) \cup \cdots \cup (B \cap A_n) . \]
[I meant to draw a picture here, but didn’t get round to it – perhaps you’d like to draw your own?]
Since the \(A_i\) are disjoint, the union on the right is disjoint also. Therefore we can use Axiom 3 to get \[\begin{align*} \mathbb P(B) &= \mathbb P(B \cap A_1) + \mathbb P(B \cap A_2) + \cdots + \mathbb P(B \cap A_n) \\ &= \sum_{i=1}^n \mathbb P(B \cap A_i) . \end{align*}\]
But using the definition of conditional probability, each “summand” (term inside the sum) is \[ \mathbb P(B \cap A_i) = \mathbb P(A_i) \, \mathbb P(B \mid A_i) . \] The result follows.
Example 8.1 continued. Returning to our dice example, we see that \(\{D_4, D_6, D_{10}\}\) is indeed a partition, since my friend must choose exactly one of the three dice. So the law of total probability tells us that \[ \mathbb P(\text{roll 5}) = \mathbb P(D_4) \, \mathbb P(\text{roll 5} \mid D_4) + \mathbb P(D_6) \, \mathbb P(\text{roll 5} \mid D_6) + \mathbb P(D_{10}) \, \mathbb P(\text{roll 5} \mid D_{10}) . \]
We were told that all the dice were picked with equal probability, so \(\mathbb P(D_4) = \mathbb P(D_6) = \mathbb P(D_{10}) = \frac13\), and we’ve already calculated the individual conditional probabilities as \[ \mathbb P(\text{roll 5} \mid D_4) = 0 \qquad \mathbb P(\text{roll 5} \mid D_6) = \tfrac16 \qquad \mathbb P(\text{roll 5} \mid D_{10}) = \tfrac{1}{10} . \] Therefore, we have \[ \mathbb P(\text{roll 5}) = \tfrac13\times 0 + \tfrac13\times\tfrac16 + \tfrac13\times\tfrac1{10} = \tfrac{8}{90} \approx 0.089. \]
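If you like to check calculations on a computer, here is a quick illustrative Python sketch (the variable and function names are just made up for this example) that computes \(\mathbb P(\text{roll 5})\) by the law of total probability and compares it with a simple simulation.

```python
import random

# The sub-cases: the number of sides on each dice, each chosen with probability 1/3
dice = [4, 6, 10]
priors = {sides: 1 / 3 for sides in dice}

# P(roll 5 | D_n) is 1/n if the dice has a face showing 5, and 0 otherwise
def prob_roll_5_given(sides):
    return 1 / sides if sides >= 5 else 0.0

# Law of total probability: P(roll 5) = sum_i P(A_i) * P(roll 5 | A_i)
p_roll_5 = sum(priors[s] * prob_roll_5_given(s) for s in dice)
print(p_roll_5)  # 8/90, about 0.0889

# Simulation check: pick a dice at random, roll it, count how often we see a 5
trials = 100_000
hits = sum(random.randint(1, random.choice(dice)) == 5 for _ in range(trials))
print(hits / trials)  # should be close to 0.0889
```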
8.2 Bayes’ theorem
In this section, we will discuss an important result called Bayes’ theorem. Let’s first state and prove the result and work through an example; afterwards we’ll talk about two reasons why Bayes’ theorem is so important.
Theorem 8.2 (Bayes' theorem) For events \(A\) and \(B\) with \(\mathbb P(A), \mathbb P(B) > 0\), we have \[ \mathbb P(A \mid B) = \frac{\mathbb P(A) \,\mathbb P(B \mid A)}{\mathbb P(B)} . \]
Bayes’ theorem is thought to have first appeared in the writings of Rev. Thomas Bayes, a British church minister and mathematician, shortly after his death, in the 1760s. (Bayes’ work was significantly edited by Richard Price, another minister–mathematician, and many historians of mathematics think that Price deserves a good share of the credit.)
Proof. From the definition of conditional probability, we can write \(\mathbb P(A \cap B)\) in two different ways: we can write it as \[ \mathbb P(A \cap B) = \mathbb P(A) \, \mathbb P(B\mid A) , \] but we can also write it as \[ \mathbb P(A \cap B) = \mathbb P(B) \, \mathbb P(A\mid B) . \] Since these are two different ways of writing the same thing, we can equate them, to get \[ \mathbb P(A) \, \mathbb P(B\mid A) = \mathbb P(B) \, \mathbb P(A\mid B) . \] Dividing both sides by \(\mathbb P(B)\) gives the result.
Example 8.2 My friend again secretly picks the 4-sided, 6-sided, or 10-sided dice, each with probability \(\frac13\). He rolls that secret dice, and tells me he rolled a 5. What is the probability he picked the 6-sided dice?
This is asking us to calculate \(\mathbb P(D_6 \mid \text{roll 5})\). Bayes’ theorem tells us that \[ \mathbb P(D_6 \mid \text{roll 5}) = \frac{\mathbb P(D_6) \, \mathbb P(\text{roll 5} \mid D_6)}{\mathbb P(\text{roll 5})} = \frac{\frac13 \times \frac16}{\frac{8}{90}} = \tfrac{5}{8} , \] since we had calculated \(\mathbb P(\text{roll 5}) = \frac{8}{90}\) in the previous subsection.
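As a quick check on this arithmetic, here is another small illustrative Python sketch, applying Bayes’ theorem directly with the numbers from the dice example (again, the names are just for illustration).

```python
# Bayes' theorem for the dice example: P(D_6 | roll 5)
p_d6 = 1 / 3                 # prior P(D_6)
p_roll5_given_d6 = 1 / 6     # P(roll 5 | D_6)
p_roll5 = 8 / 90             # P(roll 5), from the law of total probability above

p_d6_given_roll5 = p_d6 * p_roll5_given_d6 / p_roll5
print(p_d6_given_roll5)      # 0.625, which is 5/8
```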
The first way to think about Bayes’ theorem is that it tells us how to relate \(\mathbb P(A \mid B)\) and \(\mathbb P(B \mid A)\). Remember that \(\mathbb P(A \mid B)\) and \(\mathbb P(B \mid A)\) are not the same thing! The conditional probability someone is under 40 given they are a Premiership footballer is very high, but the conditional probability someone is a Premiership footballer given they are under 40 is very low.
Bayes’ theorem, in this first view, is a useful technical result that helps us switch the order of a conditional probability from \(B\) given \(A\) to \(A\) given \(B\): we have \[ \mathbb P(A \mid B) = \frac{\mathbb P(A)}{\mathbb P(B)} \times \mathbb P(B \mid A) . \]
In the dice example, the probability \(\mathbb P(\text{roll 5} \mid D_6) = \frac16\) was very obvious, but Bayes’ theorem allowed us to reverse the conditioning, to find \(\mathbb P(D_6 \mid \text{roll 5}) = \frac58\) instead.
The second way to think about Bayes’ theorem is that it tells us how to update our beliefs as we acquire more evidence. That is, we might start by believing that the probability some event \(A\) will occur is \(\mathbb P(A)\). But then we find out that \(B\) has occurred, so we want to incorporate this knowledge and update our belief of the probability \(A\) will occur to \(\mathbb P(A \mid B)\), the conditional probability \(A\) will occur given this new evidence \(B\).
Bayes’ theorem, in this second view, tells us how to update from \(\mathbb P(A)\) to \(\mathbb P(A \mid B)\): we have \[ \mathbb P(A \mid B) = \mathbb P(A) \times \frac{\mathbb P(B \mid A)}{\mathbb P(B)} . \]
In the dice example, we initially believed there was a \(\mathbb P(D_6) = \frac13 \approx 0.333\) chance our friend had chosen the 6-sided dice. But when we heard that our friend had rolled a 5, we updated our belief: we now think there is a \(\mathbb P(D_6 \mid \text{roll 5}) = \frac58 = 0.625\) chance it was the 6-sided dice.
This second way of thinking about Bayes’ theorem is at the heart of Bayesian statistics. In Bayesian statistics, we start with a “prior” belief about a model, then, after collecting some data, we update, using Bayes’ theorem, to a “posterior” belief about the model. We will discuss Bayesian statistics much more in Week 10.
Quite often we use Bayes’ theorem and the law of total probability together. If we have a partition \(A_1, A_2, \dots, A_n\), perhaps representing some possible hypotheses, and we observe some evidence \(B\), then Bayes’ theorem tells us how likely each hypothesis is given the evidence: \[ \mathbb P(A_i \mid B) = \frac{\mathbb P(A_i) \,\mathbb P(B \mid A_i)}{\mathbb P(B)} . \] But this shared denominator \(\mathbb P(B)\) can be expanded using the law of total probability: \[ \mathbb P(B) = \sum_{j=1}^n \mathbb P(A_j) \,\mathbb P(B \mid A_j) . \] Putting these together, we get the following.
Theorem 8.3 Let \(\{A_1, A_2, \dots, A_n\}\) be a partition of a sample space and let \(B\) be another event. Then, for all \(i=1,2,\dots,n\), we have \[ \mathbb P(A_i \mid B) = \frac{\mathbb P(A_i) \,\mathbb P(B \mid A_i)}{\sum_{j=1}^n \mathbb P(A_j) \, \mathbb P(B \mid A_j)} . \]
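The formula in Theorem 8.3 is easy to turn into a general-purpose calculation. The Python sketch below is only illustrative (the function name `posterior` is my own invention): it takes the prior probabilities \(\mathbb P(A_i)\) and the likelihoods \(\mathbb P(B \mid A_i)\), and returns the probabilities \(\mathbb P(A_i \mid B)\), with the shared denominator computed by the law of total probability.

```python
def posterior(priors, likelihoods):
    """Theorem 8.3: P(A_i | B) = P(A_i) P(B | A_i) / sum_j P(A_j) P(B | A_j).

    `priors` holds P(A_i) for a partition A_1, ..., A_n, and
    `likelihoods` holds P(B | A_i) for the observed event B.
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)          # law of total probability: this is P(B)
    return [j / total for j in joint]

# Dice example: uniform prior over the three dice, likelihoods of rolling a 5
print(posterior([1/3, 1/3, 1/3], [0, 1/6, 1/10]))
# approximately [0.0, 0.625, 0.375], so P(D_6 | roll 5) = 5/8 and P(D_10 | roll 5) = 3/8
```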
This is essentially what we did with the dice example – although we split up the calculation into two separate parts rather than using this formula directly.
8.3 Diagnostic testing
Example 8.3 Members of the public are tested for a certain rare disease. About 2% of the population have the disease. The test is 95% accurate, in the following sense: if you have the disease, there’s a 95% chance you correctly get a positive test result; while if you don’t have the disease, there’s a 95% chance you correctly get a negative test result. Suppose you get a positive test result. What is the probability you have the disease?
The first thing we have to do is translate the words in the question into probability statements. Let \(D\) be the event you have the disease, so \(D^\mathsf{c}\) is the event you don’t have the disease, and let \(+\) be the event you get a positive result. Then the question tells us that
- \(\mathbb P(D) = 0.02\) and \(\mathbb P(D^\mathsf{c}) = 0.98\);
- \(\mathbb P({+} \mid D) = 0.95\);
- \(\mathbb P({+}\mid D^\mathsf{c}) = 0.05\);
- we want to find \(\mathbb P(D \mid {+})\).
So from Bayes’ theorem, we have \[ \mathbb P(D \mid {+}) = \frac{\mathbb P(D) \,\mathbb P({+} \mid D)}{\mathbb P(+)} = \frac{0.02 \times 0.95}{\mathbb P(+)} . \]
How can we find \(\mathbb P(+)\)? Well, importantly, \(D\) (you have the disease) and \(D^\mathsf{c}\) (you don’t) make up a partition – so we can use the law of total probability. We have \[\begin{align*} \mathbb P(+) &= \mathbb P(D) \,\mathbb P({+} \mid D)+\mathbb P(D^\mathsf{c}) \,\mathbb P({+} \mid D^\mathsf{c}) \\ &= 0.02 \times 0.95 + 0.98 \times 0.05 . \end{align*}\]
Putting the Bayes’ theorem result and the law of total probability result together, we get \[ \mathbb P(D \mid {+}) = \frac{0.02 \times 0.95}{0.02 \times 0.95 + 0.98 \times 0.05} \approx 0.28 .\]
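The same answer drops out of the illustrative `posterior` function sketched after Theorem 8.3, applied to the partition \(\{D, D^\mathsf{c}\}\):

```python
# Partition {D, D^c}: priors (0.02, 0.98), likelihoods P(+ | D) = 0.95, P(+ | D^c) = 0.05
print(posterior([0.02, 0.98], [0.95, 0.05]))
# approximately [0.279, 0.721], so P(D | +) is about 0.28
```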
So if you get a positive result on this 95%-accurate test, there’s still only about a 1 in 4 chance you actually have the disease.
Many people find this result surprising. It sometimes helps to put more concrete numbers on things. Suppose 1000 people get tested. On average, we expect about 20 of them to have the disease, and 980 of them not to have the disease. Of the 20 with the disease, on average 19 will correctly test positive, while 1 will test negative. Of the 980 without the disease, on average 931 will correctly test negative, but 49 will wrongly test positive. So of the \(19+49 = 68\) people with positive tests, only 19 of them actually have the disease, which is about 28%.
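This frequency argument is also easy to mirror in a simulation. The sketch below (again just illustrative) generates a large number of test-takers with the stated disease rate and test accuracy, then looks at what proportion of those who test positive actually have the disease.

```python
import random

random.seed(8)        # arbitrary seed, just so the run is reproducible
n = 1_000_000         # number of simulated test-takers

positives = 0
true_positives = 0
for _ in range(n):
    has_disease = random.random() < 0.02
    # the test gives the correct result with probability 0.95 either way
    tests_positive = random.random() < (0.95 if has_disease else 0.05)
    if tests_positive:
        positives += 1
        true_positives += has_disease

print(true_positives / positives)  # should be close to 0.28
```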
The key point is that the disease is rare – only 2% of people have it. So even though a positive test increases the likelihood you have the disease a lot (it becomes about 14 times more likely), it’s not enough to make it a very large probability.
Summary
- The law of total probability says that if \(A_1, A_2, \dots, A_n\) is a partition of the sample space (that is, exactly one of them occurs), then \[ \mathbb P(B) = \sum_{i=1}^n \mathbb P(A_i) \, \mathbb P(B \mid A_i) . \]
- Bayes’ theorem says that \({\displaystyle \mathbb P(A \mid B) = \frac{\mathbb P(A) \,\mathbb P(B \mid A)}{\mathbb P(B)} }\).
Recommended reading:
- Stirzaker, Elementary Probability, Sections 2.1 and 2.2.
- Grimmett and Welsh, Probability, Section 1.8.
On Problem Sheet 3, you should now be able to complete Questions A1, A2, B1–3 and C1.