Lecture 20 Bayesian statistics II

20.1 Beta–Bernoulli model

Recall that the Beta distribution \(\pi(\theta) \propto \theta^{\alpha - 1}(1-\theta)^{\beta - 1}\) makes a useful prior for a probability parameter \(\theta \in [0,1]\). Let’s look at our joke-shop coin again.

Example 20.1 After examining the coin, we decide to use a \(\text{Beta}(3,3)\) prior. This is symmetric with expectation \(\frac12\), and has (almost) the same variance \(0.036\) as our previous three-point prior. Again, we toss the coin three times and get three heads. How should we update our beliefs?

The prior is the Beta prior \(\pi(\theta) \propto \theta^2(1-\theta)^2\).

The likelihood of getting 3 heads out of 3 is \(p(\text{HHH} \mid \theta) = \theta^3\).

Using the formula \[ \text{posterior} \propto \text{prior} \times \text{likelihood} , \] we have \[ \pi(\theta \mid \text{HHH}) \propto \theta^2(1-\theta)^2 \times \theta^3 = \theta^5(1-\theta)^2 .\] But we recognise that this is (proportional to) the PDF for a Beta distribution again! This time, it’s \[ \theta \mid \text{HHH} \sim \text{Beta}(6, 3) . \]

The “posterior expectation” and “posterior variance” are \[ \mathbb E(\theta \mid \text{HHH}) = \frac{6}{6+3} = \frac{2}{3} = 0.67 \qquad \text{Var}(\theta \mid \text{HHH}) = \frac{\frac23\big(1-\frac23\big)}{6+3+1} = \frac{1}{45} = 0.022 . \]

Our expectation has increased quite a lot, from the prior expectation \(\frac12\) to the posterior expectation \(\frac23\). Our variance has decreased, from \(0.036\) to \(0.022\), because collecting data has allowed us to become more confident in our beliefs.
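As a quick numerical check – a sketch only, assuming Python with scipy is available – the posterior mean and variance of the \(\text{Beta}(6,3)\) distribution can be computed directly:

```python
from scipy.stats import beta

# Prior Beta(3, 3); data: 3 heads out of 3 tosses
alpha0, beta0 = 3, 3
heads, tosses = 3, 3

# Conjugate update: alpha increases by the number of heads,
# beta by the number of tails
posterior = beta(alpha0 + heads, beta0 + (tosses - heads))

print(posterior.mean())  # 0.6667 (= 2/3)
print(posterior.var())   # 0.0222 (= 1/45)
```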

Let’s try to generalise what we did here.

Consider a situation modelled by a Bernoulli likelihood, where \(X_1, X_2, \dots, X_n\) are IID \(\text{Bern}(\theta)\). The joint PMF is \[ p(\mathbf x \mid \theta) = \prod_{i=1}^n \theta^{x_i} (1-\theta)^{1 - x_i} = \theta^{\sum_i x_i} (1 - \theta)^{n-\sum_i x_i} = \theta^y (1 - \theta)^{n-y}, \] where we have written \(y = \sum_i x_i\) for the total number of successes.

Consider further using a \(\text{Beta}(\alpha, \beta)\) prior for \(\theta\), so that \[ \pi(\theta) = \frac{1}{B(\alpha, \beta)} \theta^{\alpha-1} (1-\theta)^{\beta - 1} \propto \theta^{\alpha-1} (1-\theta)^{\beta - 1} \] (Because we’re going to use the “posterior has to add up to 1” trick at the end, we’re free to drop constants whenever we want.)

This combination of a Beta prior and a Bernoulli likelihood is known as the Beta–Bernoulli model.

Suppose we collect data \(\mathbf x = (x_1, x_2, \dots, x_n)\), with \(y = \sum_i x_i\) successes. What now is the posterior distribution for \(\theta\) given this data?

Using Bayes’ theorem, we have \[\begin{align*} \pi(\theta \mid \mathbf x) &\propto \pi(\theta) p(\mathbf x \mid \theta) \\ &= \theta^{\alpha-1} (1-\theta)^{\beta - 1} \times \theta^y (1 - \theta)^{n-y} \\ &= \theta^{\alpha + y - 1} (1 - \theta)^{\beta + n - y - 1} . \end{align*}\] This is (proportional to) the PDF for a \(\text{Beta}(\alpha + y, \beta + n - y)\) distribution.

So we see that, like the prior, the posterior is also a Beta distribution, where the first parameter has gone from \(\alpha\) to \(\alpha + y\) and the second parameter has gone from \(\beta\) to \(\beta + (n-y)\). In other words, \(\alpha\) has increased by the number of successes, and \(\beta\) has increased by the number of failures. The expectation has gone from the prior expectation \[ \frac{\alpha}{\alpha + \beta} \] to the posterior expectation \[ \frac{\alpha + y}{\alpha + \beta + n} .\]

This is somewhere between the prior expectation \(\alpha/(\alpha + \beta)\) and the mean of the data \(y/n\). Indeed, it is a weighted average of the two: \[ \frac{\alpha + y}{\alpha + \beta + n} = \frac{\alpha + \beta}{\alpha + \beta + n} \cdot \frac{\alpha}{\alpha + \beta} + \frac{n}{\alpha + \beta + n} \cdot \frac{y}{n} . \] So the data drags our expectation from the prior expectation towards the mean of the data. The more data we get, the more the prior drops away and the more the data itself matters.
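As a small illustration – a sketch in Python, where the function name `beta_bernoulli_update` is made up for this example – the conjugate update and the pull towards the data mean look like this:

```python
def beta_bernoulli_update(alpha, beta, x):
    """Update a Beta(alpha, beta) prior given Bernoulli data x (a list of 0s and 1s)."""
    y = sum(x)   # number of successes
    n = len(x)   # number of trials
    return alpha + y, beta + (n - y)

# The joke-shop coin: Beta(3, 3) prior, then HHH
print(beta_bernoulli_update(3, 3, [1, 1, 1]))  # (6, 3)

# With much more data, the posterior expectation moves close to the data mean
a, b = beta_bernoulli_update(3, 3, [1] * 70 + [0] * 30)
print(a / (a + b))  # approximately 0.689, close to the data mean 0.7
```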

20.2 Beta–geometric model

Joke-shop example
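A sketch of the general calculation, following the same pattern as the Beta–Bernoulli model (this is the standard conjugate calculation, stated here under the assumption that each observation counts the number of tosses up to and including the first success): suppose \(X_1, X_2, \dots, X_n\) are IID \(\text{Geom}(\theta)\), so that \[ p(\mathbf x \mid \theta) = \prod_{i=1}^n (1-\theta)^{x_i - 1}\theta = \theta^n (1-\theta)^{\sum_i x_i - n} . \] Combining this with a \(\text{Beta}(\alpha, \beta)\) prior gives \[ \pi(\theta \mid \mathbf x) \propto \theta^{\alpha - 1}(1-\theta)^{\beta - 1} \times \theta^n (1-\theta)^{\sum_i x_i - n} = \theta^{\alpha + n - 1} (1-\theta)^{\beta + \sum_i x_i - n - 1} , \] which is (proportional to) the PDF of a \(\text{Beta}\big(\alpha + n, \, \beta + \sum_i x_i - n\big)\) distribution. As before, \(\alpha\) increases by the number of successes (one per observation) and \(\beta\) by the total number of failures.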

20.3 Modern Bayesian statistics

In these two lectures, we’ve given just a brief taster of Bayesian statistics. Bayesian statistics is a deep and complicated subject, and you may have the opportunity to find out a lot more about it later in your university career.

We have seen that in Bayesian statistics, one brings in a subjective “prior” based on previous beliefs and evidence, then updates this prior based on the data. This contrasts with the more traditional frequentist statistics, in which one uses only the data – no prior beliefs! – and judges to what extent the data is consistent or inconsistent with a hypothesis, without weighing in on how likely such a hypothesis is. (Frequentist statistics is the main subject studied in MATH1712 Probability and Statistics II.)

For a while, there was an occasionally fierce debate between “Bayesians” and “frequentists”. Frequentists thought that bringing subjective personal beliefs into things was unmathematical, while Bayesians thought that ignoring how plausible a hypothesis is before testing it was unscientific. The debate has now largely dissipated, and it is generally accepted that modern statisticians need to know about both frequentist and Bayesian methods.

In the two main examples of Bayesian statistics we have looked at – the Bernoulli likelihood and the normal likelihood – we ended up with a posterior in the same parametric family as the prior, just with different parameters. Such a prior is called a “conjugate prior”. Conjugate priors are very convenient and easy to work with. However, with more complicated likelihoods and more complicated priors – especially models with many parameters rather than a single one – calculating the posterior distribution can be very difficult. In particular, working out the constant of proportionality (even just approximately) and sampling from the posterior distribution are both very hard problems.

For this reason, Bayesian statistics was for a long time a minor area of statistics. However, increases in computer power in the 1980s made some of these problems more tractable, and Bayesian statistics has increased in importance and popularity since then.

There are still plenty of open problems in Bayesian statistics, and lots of these involve the computational side: finding algorithms that can efficiently calculate the normalising constants in posterior distributions or sample from those posterior distributions, especially when the parameter(s) have very high dimension.

Summary

  • In Bayesian statistics, we start with a prior distribution for a parameter \(\theta\), and update to a posterior distribution given the data \(\mathbf x\), through \(\pi(\theta \mid \mathbf x) \propto \pi(\theta)p(\mathbf x \mid \theta)\), or \(\text{posterior} \propto \text{prior} \times \text{likelihood}\).
  • The Beta distribution is a useful family of distributions to use as priors for probability parameters.
  • A Beta prior for a Bernoulli likelihood leads to a Beta posterior with different parameters.
  • A normal prior for the expectation of a normal likelihood with known variance leads to a normal posterior with different parameters.