Problem Sheet 5

This is Problem Sheet 5, which covers material from Lectures 22 to 27. You should work through all the questions on this problem sheet in advance of the problems class, which takes place in the lecture of Thursday 11 December. If you are stuck on any of the questions, you are welcome to discuss them with me in my office hours on Mondays at 1500.

This problem sheet is to help you practice material from the module. It is not for assessment and will not be marked.

Full solutions should be released on Friday 12 December.

Remember that Thursday 12 December at 1400 is also the deadline for the module coursework

1. The Gamma distribution \(X \sim \Gamma(m, \lambda)\) has PDF \[ f(x) = \frac{\lambda^m}{(m-1)!} \,x^{m-1}\,\mathrm{e}^{-\lambda x} . \] [Correction: There was a typo in this PDF earlier.] (This PDF can be evaluated with the dgamma() function in R.) The time between eruptions of a volcano, measured in years, is modelled as \(X \sim \Gamma(5, \theta)\), IID over eruptions, where \(\theta\) is unknown. A scientist beliefs about \(\theta\) are represented in a prior distribution \(\theta \sim \operatorname{Exp}(6)\).

The following times between eruptions are available in the historical record:

       56.55,  2.57, 29.97, 10.27,  4.32, 17.91, 51.98,  7.06, 11.40, 54.80

Sample from the posterior distribution for \(\theta\), using the random walk Metropolis algorithm. Explain how you chose the stepsize for your algorithm and how you checked it was a good choice.

Draw a histogram of the posterior distribution, and comment on how it differs from the prior.

Solution. The posterior distribution is \[ \pi(\theta \mid \mathbf x) \propto 6\mathrm{e}^{-6\theta} \times \prod_{i=1}^{10} x_i^{4} \, \mathrm{e}^{-\theta x_i} . \]

The code for the Metropolis random walk algorithm will be the following.

x <- c(56.55,  2.57, 29.97, 10.27,  4.32,
       17.91, 51.98,  7.06, 11.40, 54.80)

prior <- function(theta) dexp(theta, 6)
like  <- function(theta) prod(dgamma(x, 5, theta))
post  <- function(theta) prior(theta) * like(theta)

accept <- function(curr, prop) post(prop) / post(curr)
initial <- 1
n <- 1e5
sigma <- 0.25

MRW <- rep(0, n)
MRW[1] <- initial
for (i in 1:(n - 1)) {
  prop <- MRW[i] + rnorm(1, 0, sigma)
  if (prop < 0)                             MRW[i + 1] <- MRW[i]
  else if (runif(1) < accept(MRW[i], prop)) MRW[i + 1] <- prop
  else                                      MRW[i + 1] <- MRW[i]
}

hist(MRW[-(1:99)], probability = TRUE, xlim = c(0, 0.5))
curve(prior, add = TRUE, col = "blue", lwd = 2)

A little experimentation suggested \(\sigma = 0.2\) worked about right – \(\sigma = 0.5\) rejected too many changes. With my initial start point of 1, I found a short burn-in period was necessary, so my histogram starts with the 100th step of the random walk. (I could have just started from 0.2 instead, I suppose.)

We see that while the prior though any number between 0 and \(\tfrac12\), maybe even more, was plausible for \(\theta\), we can see the prior has become concentrated in the interval \([0.11, 0.32]\) or so, with values around 0.2 the most common. The posterior variance is much smaller than the prior, so we have gained certainty from the results.

2. [2019 exam, Question 4]

(a) Introducing any notation you use, state the Metropolis–Hastings algorithm for discrete state space, and explain the purpose of this algorithm.

Solution. The Metropolis–Hastings algorithm is a method to sample from a distribution \(\pi\) on a discrete space \(\mathcal S\) by setting up a Markov chain \((X_i)\) on \(\mathcal S\) with stationary distribution \(\pi\).

Suppose the Markov chain is at \(X_i = x\). The Metropolis–Hastings algorithm proposes a move to \(y\) with probability \(r(x, y)\), where \(\mathsf R = (r(x, y))\) is the transition matrix for an irreducible Markov chain on \(\mathcal S\). This move is accepted with probability \[ \alpha(x, y) = \min \left\{ \frac{\pi(y)\,r(y,x)}{\pi(x)\,r(x,y)},\, 1\right\} , \] meaning that \(X_{i+1} = y\), or rejected otherwise, meaning that \(X_{i+1} = X_i = x\).

After being run for a large number of steps (possibly with a burn-in period), the values \(X_i\) should be approximately distributed like \(\pi\).

(b) Using the Metropolis–Hastings algorithm, find a Markov chain with state space \(\mathbb N = \{1, 2, 3, \dots\}\) that has stationary distribution \(\mathbb P(X_n = x) = 2^{-x}\) for \(x \in \mathbb N\).

Solution. Although this isn’t the only way, I will use \(\mathsf R\) to be the transition matrix of the simple symmetric random walk on \(\mathbf Z\). Because this is symmetric, the acceptance probabilities are \[ \begin{align} \alpha(x, x+1) &= \min \left\{ \frac{\pi(x+1)}{\pi(x)},\, 1 \right\} = \min\big\{\tfrac12,\, 1\big\} = \tfrac 12 \\ \alpha(x, x-1) &= \min \left\{ \frac{\pi(x-1)}{\pi(x)},\, 1 \right\} = \min\{2,\, 1\} = 1, \end{align} \] except that \(\alpha(1, 0) = 1\).

I’ll choose the starting point \(X = 1\). From here, the Markov chain repeatedly proposes a move up 1 with probability \(\tfrac12\), which is accepted with probability \(\tfrac12\), or a move down 1 with probability \(\tfrac12\), which is always accepted – except for a move from 1 down to 0 which is always rejected.

(c) In a Bayesian setting, assume we have observations \(Z_1, X_2, \dots, Z_n\) from an \(\operatorname{N}(0, \sigma^2)\) distribution, where the prior distribution for \(\sigma^2\) is \(\operatorname{Exp}(1)\). Using the Metropolis–Hastings algorithm, describe in detail a method for sampling from the posterior distribution of \(\sigma^2\).

Solution The posterior distribution is given by \[ \begin{align} \pi(\sigma^2 \mid \mathbf Z) &\propto \pi(\sigma^2) \prod_{i=1}^n f(z_i \mid \sigma^2) \\ &= \mathrm{e}^{-\sigma^2} \prod_{i=1}^n \frac{1}{\sqrt{\sigma^2}} \,\exp\left(- \frac{Z_i^2}{2\sigma^2} \right) \\ &= \mathrm{e}^{-\sigma^2} \,\sigma^{-n/2} \,\exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n Z_i^2 \right) . \end{align} \]

I will choose as the Markov chain a symmetric Gaussian random walk on \(\mathbb R\), meaning that the Markov chain is \(X_{i+1} = X_i + \mathrm{N}(0, s^2)\). (The special case is known as the random walk Metropolis algorithm in continuous space.) Because this is symmetric, the acceptance probability is \[ \alpha(x, y) = \min \left\{ \frac{ \pi(x \mid \mathbf Z)}{\pi(y \mid \mathbf Z)},\,1\right\} , \] for \(y \geq 0\), where the constant of proportionality in the \(\pi\)s cancel out, of \(\alpha(x, y) = 0\) for \(y < 0\).

I would suggest starting from \(X_1\) between \(1\) (the prior variance) and the sample variance of the \(Z_i\) (the posterior variance in the large sample limit \(n \to \infty\)). I would suggest choosing the step size \(s\) by experiment, with smaller \(s\) for large \(n\), and vice versa.

(d) Let the set \(A \subset \mathbb R^2\) be given by \[ A = \big([0,3]\times[0,1]\big) \cup \big([0,1]\times [4,5]\big) \cup \big([4,5]\times[0,1]\big) \cup \big([4,5] \times [3,5]\big) . \] Consider the Metropolis–Hastings algorithm on \(\mathbb R^2\) with target density \[ \pi(\mathbf x) = \begin{cases} \frac17 & \text{if }\mathbf x \in A \\ 0 & \text{otherwise} \end{cases} \] for all \(x \in \mathbb R^2\), and where the proposals \(\mathbf Y_{i+1}\) are chosen uniformly distributed on a disk of radius \(r\) around the previous state – that is, \[ Y_{j+1} \sim \operatorname{U} \Big(\big\{ \mathbf y \in \mathbb R^2 : |\mathbf y - \mathbf X_i | \leq r \big\}\Big) . \] Assume the algorithm starts at \(\mathbf X_1 = (0.5, 0.5)\).

i. For every \(r > 0\), determine the stationary distribution of the resulting Markov chain.

ii. For what values of \(r\) does the algorithm work correctly?

Solution. To see what’s going on where, we need to draw a picture of the set \(A\).

The question is: for what \(r\) will it be possible to jump between the different blocks. We start from \((0.5, 0.5)\) which is in the bottom-left block. To jump from bottom-left to bottom-right, we need \(r > 1\). To jump from bottom-right to top-right, we need \(r > 2\). To jump from bottom-left to top-left, we need \(r > 3\).

(i) The distribution is uniform, so as long as the Markov chain can reach an area, the stationary distribution will be uniform there. So the stationary distribution is uniform on the following sets:

\(0 < r \leq 1\): \(\big([0,3]\times[0,1]\big)\)
\(1 < r \leq 2\): \(\big([0,3]\times[0,1]\big) \cup \big([4,5]\times[0,1]\big)\)
\(2 < r \leq 3\): \(\big([0,3]\times[0,1]\big) \cup \big([4,5]\times[0,1]\big) \cup \big([4,5] \times [3,5]\big)\)
\(r > 3\): \(A\)

(If \(r\) is only just above the boundary, convergence to the stationary distribution may be extremely slow.)

(ii) The stationary distribution is the desired distribution for \(r > 3\), which is the values for which the algorithm technically “works”, as \(n \to \infty\). To work “well”, \(r\) will need to be comfortably bigger than 3 (so that movement to and from the top-left is reasonably common) while not being so huge that most steps are to ridiculously far away.

3. [2017 exam, Question 5]

(a) Give the definition of a Markov chain with state space \(\mathcal S\).

Let \((Z_i)\) be a sequence of IID random variables whose distribution is known. Consider the stochastic processes \((U_i)\), \((V_i)\), \((W_i)\) defined by \[ \begin{align} U_i &= Z_1 + Z_2 + \cdots + Z_i \\ V_i &= \frac{1}{i} (Z_1 + Z_2 + \cdots + Z_i) \\ W_i &= \max \{Z_j : 1 \leq j \leq i-1\} + Z_i \end{align} \] for \(i = 1, 2, \dots\). Which of these processes are Markov chains? Justify your answers.

(b) Defining any notation you use, state the random walk Metropolis algorithm. What is the purpose of this algorithm?

(c) Use the Metropolis–Hastings algorithm to define a Markov chain \((X_n)\) on \(\mathcal S = \{0, 1, 2, \dots \}\) whose stationary distribution is the Poisson distribution \(\pi(x) = \mathrm{e}^{-\lambda} \lambda^x/x!\).

(d) Let \((X_i)\) be a Markov chain with values in \(\mathbb R\) and stationary distribution \(\pi\). Consider the estimator \[\widehat{\theta}_n^{\mathrm{MCMC}} = \frac{1}{n} \sum_{i=1}^n \phi(X_i) \] for \(\theta = \operatorname{\mathbb E}(\phi(X))\) for \(X\) having PDF \(\pi\). Assuming \(X_1\) has PDF \(\pi\), derive the result \[ \operatorname{Var} \big(\widehat{\theta}_n^{\mathrm{MCMC}}\big) \approx \frac{\operatorname{Var}(\phi(X))}{n} \left(1 + 2 \sum_{i=2}^\infty \operatorname{Corr}\big(\phi(X_1),\phi(X_i)\big) \right) . \]

4. Let \(\mathbf X = (X_1, \dots, X_n)\) be IID samples from a random variable \(X\) with cumulative distribution function \(F\). Let \(F^*\) be the empirical cumulative distribution function of the samples \(\mathbf X\).

(a) For fixed \(x\), show that \(\operatorname{\mathbb E}F^*(x) = F(x)\).

Solution. We have \[ F^*(x) = \frac{1}{n} \sum_{i=1}^n\mathbb{I}_{(-\infty, x]}(X_i) . \] So \[ \begin{multline} \operatorname{\mathbb E}F^*(x) = \mathbb E \left( \frac{1}{n} \sum_{i=1}^n\mathbb{I}_{(-\infty, x]}(X_i) \right) = \frac{1}{n} \sum_{i=1}^n \operatorname{\mathbb E} \mathbb{I}_{(-\infty, x]}(X_i) \\ = \frac{1}{n}\,\sum_{i=1}^n \mathbb P\big(X_i \in (\infty, x]\big) = \frac{1}{n}\,n\,\mathbb P\big(X_i \in (\infty, x]\big) = F(x) \end{multline} \]

(b) For fixed \(x\), show that \(\operatorname{Var}\big(F^*(x)\big) = \displaystyle\frac{1}{n}F(x)\big(1 - F(x)\big)\).

Solution. With the same notation as part (a), \[ \begin{multline} \operatorname{Var}\big(F^*(x)\big) = \operatorname{Var} \left( \frac{1}{n} \sum_{i=1}^n\mathbb{I}_{(-\infty, x]}(X_i) \right) = \frac{1}{n^2} \sum_{i=1}^n \operatorname{Var}\big(\mathbb{I}_{(-\infty, x]}(X_i)\big) \\ = \frac{1}{n^2}\,\sum_{i=1}^n F(x) \big(1 - F(x)\big) = \frac{1}{n^2}\,n\,F(x) \big(1 - F(x)\big) = \frac{1}{n} F(x) \big(1 - F(x)\big) \end{multline} \]

(c) For fixed \(x\) and \(y\) with \(y > x\), find \(\operatorname{Cov}\big(F^*(y), F^*(x)\big)\). (c) For fixed \(x\) and \(y\) with \(x \leq y\), find \(\operatorname{Cov}\big(F^*(y), F^*(x)\big)\).

Solution. Using the same notation again, \[ \begin{align} \operatorname{Cov}\big(F(x), F(y)\big) &= \operatorname{Cov} \left( \frac1n\sum_{i=1}^n \mathbb{I}_{(-\infty, x]}(X_i) , \frac1n\sum_{i=1}^n \mathbb{I}_{(-\infty, y]}(X_i) \right) \\ &= \frac{1}{n^2} \left( \sum_{i=1}^n \operatorname{Cov} \big( \mathbb{I}_{(-\infty, x]}(X_i), \mathbb{I}_{(-\infty, y]}(X_i)\big) + 2 \sum_{i<j} \operatorname{Cov} \big( \mathbb{I}_{(-\infty, x]}(X_i), \mathbb{I}_{(-\infty, y]}(X_j)\big) \right) \\ &= \frac{1}{n^2} \left(n \operatorname{Cov} \big( \mathbb{I}_{(-\infty, x]}(X_i), \mathbb{I}_{(-\infty, y]}(X_i)\big) + 0\right) \\ &= \frac{1}{n} \operatorname{Cov} \big( \mathbb{I}_{(-\infty, x]}(X_i), \mathbb{I}_{(-\infty, y]}(X_i)\big) \end{align} .\]

So we need to find that covariance. In particular, we need \[\operatorname{\mathbb E} \big( \mathbb{I}_{(-\infty, x]}(X_i) \mathbb{I}_{(-\infty, y]}(X_i)\big) = F(x),\] since the produce of indicator function is 1 if and only if both indicators are 1, which is if \(X_i \leq x\) and \(X_i \leq y\). Since \(x\) is the smaller, this is \(F(x)\).

Hence we have \[ \operatorname{Cov}\big(F^*(y), F^*(x)\big) = \frac{1}{n} \big(F(x) - F(x)F(y)\big) = \frac{1}{n}\,F(x) \big(1 - F(y)\big) . \]

5. A random variable \(X\) has bounded support, and a statistician wishes to estimate \(\theta = \max X\), the maximum value \(X\) can take. The statistician has access to 24 IID samples \(X_1, X_2, \dots, X_{24}\) from this random variable which can be read into R as follows:

   samples <- c(12.40,  2.99, 14.79,  3.59, 24.92,  4.11,
             22.68, 28.30,  0.90,  8.84, 24.17, 31.31,
             12.97, 12.43, 20.34, 14.75, 20.61, 10.93,
             17.59, 30.88, 28.13,  0.83, 17.43, 20.78)

(a) Explain why the plug-in estimator is \(\theta^* = \max \{X_j : j = 1, \dots, 24\}\), and find the \(\theta^*\) for this data.

Solution. This is basically just the definition of the plug-in estimator. (Not sure why I set that as part of the question…) The maximum value of \(X_i\) here is

samples <- c(12.40,  2.99, 14.79,  3.59, 24.92,  4.11,
             22.68, 28.30,  0.90,  8.84, 24.17, 31.31,
             12.97, 12.43, 20.34, 14.75, 20.61, 10.93,
             17.59, 30.88, 28.13,  0.83, 17.43, 20.78)
plugin <- max(samples)
plugin

[1] 31.31

So \(\theta^* = 31.31\).

(b) Use the bootstrap to estimate the bias of the plug-in estimator.

Solution. The idea is to pick 24 samples with replacement and look at how the maximum of those differs from the overall maximum 31.31.

boots <- 1e5
m <- length(samples)
bootests <- rep(0, boots)
for (k in 1:boots) {
  resample <- sample(samples, m, replace = TRUE)
  bootests[k] <- max(resample)
}
bias <- mean(bootests) - plugin
bias

[1] -0.5259627

(c) Improve the plug-in estimator by using your estimation of the bias to (approximately) “de-bias” the plug-in estimate.

Solution. We want to subtract the bias from the plugin estimator to debias it. In maths, this is \(\theta^* - (\overline{T^*} - \theta^*) = 2\theta^* - \overline{T^*}\). In code, this is

plugin - bias

[1] 31.83596