14  Rejection sampling

14.1 Rejection

In the last two lectures, we have taken standard uniform \(U \sim \operatorname{U}[0,1]\) random variables and applied a function to them, transforming them into a random variable \(X\) with some other distribution. One \(U\) gets turned into one \(X\). (Or, for the Box–Muller transform, two \(U\)s become two \(X\)s.)

So far, we have kept every sample we are given. But another way to get a different distribution is to throw out samples we don’t like and wait until we get a sample we do like. This is called rejection sampling.

Suppose we want not a \(\operatorname{U}[0,1]\) random variable but instead a \(\operatorname{U}[0,\tfrac12]\) random variable. One way we’ve already seen to do this is by the inverse transform method: simply multiply \(U\) by \(\tfrac12\). But we could also do this by rejection. We start with a proposed sample \(U \sim \operatorname{U}[0,1]\). If \(U \leq \tfrac12\), we “accept” the sample and keep it. But if \(U > \tfrac12\), we “reject” the sample – we throw it away and ask for a new one. We keep proposing samples until we accept one. It should be easy to convince yourself that we get a \(\operatorname{U}[0,\tfrac12]\) random variable this way. (But we’ll prove it later, if not.)
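Here is a minimal sketch of that procedure in R (written in vectorised form, so that one call proposes a whole batch of samples at once):

u_prop <- runif(1e5)              # proposed samples from U[0, 1]
u_acc  <- u_prop[u_prop <= 0.5]   # accept only those that are at most 1/2
length(u_acc) / 1e5               # roughly half the proposals survive
mean(u_acc)                       # should be close to 1/4, the mean of U[0, 1/2]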

The advantage of rejection sampling is that it can help us get samples from some distributions that we couldn’t access with the inverse transform method. The disadvantage is that it can be costly or slow, because we may have to reject lots of samples before finding enough that we can accept. The more often we reject samples, the slower the procedure will be.

Rejection sampling is particularly useful for sampling from a conditional distribution, such as the conditional distribution of \(Y\) given that \(Y \in A\): we simply accept a sample \(y\) if \(y \in A\) and reject it if not.

Example 14.1 Let \(Y \sim \operatorname{N}(0, 1)\). Suppose we wish to use Monte Carlo estimation to estimate \(\mathbb E(Y \mid Y \geq 1)\).

To do this, we will need samples from the conditional distribution \(Y \mid Y \geq 1\). So we accept proposed standard normal samples that are at least 1, and reject proposed samples that are less than 1.

There are two ways we could run this in practice. First, we could decide to take \(n\) proposal samples from \(Y\) and just see how many get accepted.

n_prop  <- 1e6                 # number of proposals
props   <- rnorm(n_prop)       # proposed samples from N(0, 1)
accepts <- props[props >= 1]   # keep only the proposals that are at least 1
length(accepts)
[1] 158692
MCest1 <- mean(accepts)
MCest1
[1] 1.525705

We end up accepting around 160,000 samples out of the 1,000,000 proposals we had to start with.
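As a quick check (not part of the lecture code), the proportion of acceptances should be close to \(\mathbb P(Y \geq 1) = 1 - \Phi(1) \approx 0.159\), and the quantity we are estimating has the exact value \(\mathbb E(Y \mid Y \geq 1) = \phi(1)/\big(1 - \Phi(1)\big) \approx 1.525\), so we can compare:

n_prop * pnorm(1, lower.tail = FALSE)    # expected number of acceptances
dnorm(1) / pnorm(1, lower.tail = FALSE)  # exact value of E(Y | Y >= 1)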

Second, we could keep proposing as many samples as needed until we reach some desired number of acceptances.

n_acc <- 1e5                 # desired number of accepted samples

samples <- rep(0, n_acc)     # storage for the accepted samples
count <- 0                   # total number of proposals used (for interest only)
for (i in 1:n_acc) {
  newsample <- 0
  while (newsample < 1) {    # keep proposing until we get a sample that is >= 1
    newsample <- rnorm(1)
    count <- count + 1
  }
  samples[i] <- newsample
}
count
[1] 628133
MCest2 <- mean(samples)
MCest2
[1] 1.526865

This required taking about 630,000 proposals to get 100,000 acceptances.

Here we used a “while” loop to keep taking samples until we got one that was not less than 1. The lines involving count were just so I could see how many proposals ended up being needed – they aren’t an integral part of the code.
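As another quick check (again not part of the lecture code): each proposal is accepted with probability \(\mathbb P(Y \geq 1) \approx 0.159\), so to get \(10^5\) acceptances we should expect to need about \(10^5/0.159 \approx 630{,}000\) proposals, matching what we observed. We will make this kind of calculation precise in Section 14.3.

n_acc / pnorm(1, lower.tail = FALSE)   # expected number of proposals needed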

14.2 Acceptance probability

So far, we have either always accepted or always rejected a proposed sample, depending on its value. But we could also accept some proposals only “sometimes”. Suppose we are already able to sample from some distribution \(Y\) (perhaps generated via the inverse transform method, for example). If we see the proposed sample \(Y = x\), we could accept it with some acceptance probability \(\alpha(x) \in [0,1]\). By allowing this acceptance function \(\alpha\) to take values other than just 0 and 1, we can control the distribution of the accepted samples more delicately.
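As a sketch of this general recipe (the function below is purely illustrative – its name and arguments are my own, not anything standard), a rejection sampler only needs a way to draw proposals and an acceptance function \(\alpha\):

# Illustrative helper: draw n_prop proposals with rproposal(), then accept
# each proposal x independently with probability alpha(x).
rejection_sample <- function(n_prop, rproposal, alpha) {
  props  <- rproposal(n_prop)                     # proposed samples from Y
  accept <- rbinom(n_prop, 1, alpha(props)) == 1  # accept with probability alpha(x)
  props[accept]                                   # return only the accepted samples
}

# For example, U[0, 1/2] by rejection from U[0, 1]:
# rejection_sample(1e5, runif, function(x) as.numeric(x <= 0.5))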

What is the distribution of an accepted sample \(X\)?

Well, using Bayes’ theorem, we have in the discrete case \[\mathbb P(X = x) = \mathbb P(Y = x \mid \text{accept}) = \frac{\mathbb P(Y = x)\,\mathbb P(\text{accept} \mid Y = x)}{\mathbb P(\text{accept})} = \frac{1}{Z}\,\alpha(x)\,\mathbb P(Y = x) ,\] where \(Z = \mathbb P(\text{accept})\) is the normalising constant. In the continuous case, with \(g\) the PDF of the proposal \(Y\) and \(f\) the PDF of the accepted \(X\), we have \[ f(x) = g(x \mid \text{accept}) = \frac{g(x)\,\mathbb P(\text{accept} \mid Y = x)}{\mathbb P(\text{accept})} = \frac{1}{Z}\,\alpha(x)\,g(x) , \] where \(Z = \mathbb P(\text{accept})\) again.
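Here is a small numerical check of the discrete formula, with an illustrative setup of my own: propose \(Y\) uniform on \(\{1, 2, \dots, 6\}\) and accept \(Y = x\) with probability \(\alpha(x) = x/6\). The accepted values should then have probabilities proportional to \(\alpha(x)\,\mathbb P(Y = x) \propto x\), that is \(\mathbb P(X = x) = x/21\).

y_prop <- sample(1:6, 1e6, replace = TRUE)         # Y uniform on {1, ..., 6}
y_acc  <- y_prop[rbinom(1e6, 1, y_prop / 6) == 1]  # accept Y = x with probability x/6
table(y_acc) / length(y_acc)                       # should be close to (1:6)/21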

Example 14.2 Suppose we wish to sample from the distribution \[ f(x) \propto \exp\big(-\tfrac12x^2\big)\,(\sin^2 x) . \tag{14.1}\] How can we do this?

Well, we can note that the PDF of the standard normal is \[ g(x) = \frac{1}{\sqrt{2\pi}}\,\exp\big(-\tfrac12x^2\big) \propto \exp\big(-\tfrac12x^2\big) \] and that \[ 0 \leq \sin^2 x \leq 1 .\] (Here, \(\sin^2 x\) means \((\sin x)^2\), by the way.) This means that, if we take proposals \(Y \sim \operatorname{N}(0,1)\), and then accept a proposed sample \(x\) with probability \(\alpha(x) = \sin^2 x\), that will give us the distribution in Equation 14.1.

n_prop  <- 1e6
props   <- rnorm(n_prop)                         # proposals from N(0, 1)
accepts <- props[runif(n_prop) <= sin(props)^2]  # accept x with probability sin^2(x)
length(accepts)
[1] 431663
hist(accepts, probability = TRUE, breaks = 50)
# the constant 0.92 is approximately 1/Z, the normalising constant of f
curve(
  0.92 * exp(-x^2 / 2) * sin(x)^2, add = TRUE, n = 1001,
  lwd = 2, col = "blue"
)

By rejecting lots of proposals with values near 0, we turned the unimodal (“one hump”) proposal distribution \(Y \sim \operatorname{N}(0,1)\) into this interesting bimodal (“two hump”) distribution.

Let’s explain the third line of that code – the one defining accepts – more carefully. We want to accept a proposal \(x\) with probability \(\sin^2 x\). We saw in Lecture 12 that we can simulate a Bernoulli\((p)\) distribution by taking the value 1 if \(U \leq p\) and taking 0 if \(U > p\). So in that line, we are accepting each proposed sample \(x_i\) if a standard uniform variate \(u_i\) satisfies \(u_i \leq \sin^2 x_i\).
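Equivalently (just as an aside), we could have drawn the accept/reject decisions directly as Bernoulli random variables:

accepts <- props[rbinom(n_prop, 1, sin(props)^2) == 1]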

Here, we ended up accepting about 430,000 of the 1,000,000 proposals.

In this example, we managed to sample from the PDF in Equation 14.1, \[ f(x) = \frac{1}{Z}\,\exp\big(-\tfrac12x^2\big)\,(\sin^2 x) ,\] even though we never found out what the normalising constant \(Z\) was. This idea – that we can sample from a distribution even if we only know it up to a multiplicative constant – is a very important one that will come up a lot later in this module.

We won’t go into that idea deeply now, but we briefly mention that it is very important in Bayesian statistics. In Bayesian statistics, the posterior distribution is often known only up to proportionality. That’s because we have \[\begin{align} \text{posterior} &\propto \text{prior}\times\text{likelihood} \\ \pi(\theta \mid x) &\propto\, \pi(\theta) \times p(x \mid \theta) . \end{align}\] It’s often very difficult to find the normalising constant in this expression – indeed, it can be impossible in practice. So being able to sample from such a posterior distribution without finding the constant is very important for Bayesian statisticians.
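To illustrate (a toy sketch of my own, not something we will rely on later): suppose the prior is \(\theta \sim \operatorname{U}[0,1]\) and we observe 7 successes in 10 Bernoulli\((\theta)\) trials, so the posterior satisfies \(\pi(\theta \mid x) \propto \theta^7 (1-\theta)^3\). We can sample from this posterior by rejection without ever computing its normalising constant: propose \(\theta\) from the prior, and accept with probability proportional to the likelihood.

# Toy Bayesian example (illustrative only): U[0,1] prior, 7 successes in 10 trials.
# The likelihood is maximised at theta = 0.7, so dividing by likelihood(0.7)
# gives an acceptance probability alpha(theta) that lies in [0, 1].
likelihood <- function(theta) theta^7 * (1 - theta)^3

theta_prop <- runif(1e6)                                  # proposals from the prior
theta_acc  <- theta_prop[
  runif(1e6) <= likelihood(theta_prop) / likelihood(0.7)  # accept with prob prop. to likelihood
]
hist(theta_acc, probability = TRUE, breaks = 50)          # approximates the posterior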

14.3 How many samples?

We have mentioned that the downside of rejection sampling is that we may have to take lots of proposed samples to get enough accepted ones. Or, conversely, we may not get enough accepted samples from a fixed number of proposed samples. Remember that the accuracy of Monte Carlo estimation, for example, depends on how many (accepted) samples we get – the mean-square error scales like \(1/n\) and the root-mean-square error like \(1/\sqrt{n}\). So it’s important to be able to get a lot of accepted samples \(n\).

Let’s examine these questions a bit more closely. Write \(a = \mathbb P(\text{accept})\) for the unconditional probability that a proposal gets accepted (that is, the a priori acceptance probability, before we have seen the value \(Y = x\) of the proposal). In the discrete case, this is \[ a = \mathbb P(\text{accept}) = \sum_x \mathbb P(Y = x)\,\alpha(x) , \] and in the continuous case this is \[ a = \mathbb P(\text{accept}) = \int_{-\infty}^{+\infty} g(x)\,\alpha(x)\,\mathrm{d}x . \] In both cases, this can be written more succinctly as \(a = \mathbb E\,\alpha(Y)\).

  • If we take \(N\) proposals, then the expected number of accepted samples is \(n = aN\).
  • If we keep taking proposals until getting \(n\) acceptances, then the expected number of proposals is \(N = n/a\).

If the number of proposals \(N\) is large and the unconditional acceptance probability \(a\) is not very close to 0, then the actual number of acceptances will likely be very close to \(aN\). This is justified by the law of large numbers (and the central limit theorem). Similarly, if the desired number of acceptances \(n\) is large and \(a\) is not very close to 0, then the number of required proposals will likely be very close to \(n/a\). Thus it is in our interest to make \(a\) as large as we can: this gives us the most acceptances, or requires the fewest proposals.
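For instance, in Example 14.2 the unconditional acceptance probability is \(a = \mathbb E\,\sin^2 Y\) with \(Y \sim \operatorname{N}(0,1)\), which works out to be \(\tfrac12(1 - \mathrm{e}^{-2}) \approx 0.43\). With \(N = 10^6\) proposals, the expected number of acceptances is therefore about 430,000, matching what we saw. A quick numerical check of the integral (not part of the lecture code):

integrate(function(x) dnorm(x) * sin(x)^2, -Inf, Inf)   # a = E[alpha(Y)], about 0.43
(1 - exp(-2)) / 2                                       # the exact value, for comparison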

We have to be careful when the unconditional acceptance probability is very close to \(0\). In that case, there can be a lot of variability in either the (very small) number of accepted samples or the (very large) number of proposals required. This unpredictability is another reason to avoid rejection sampling when the unconditional acceptance probability \(a\) is very small.
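To see this concretely (a quick illustration of my own): if proposals are independent, then with \(N\) proposals the number of acceptances is Binomial\((N, a)\), and the number of proposals needed for \(n\) acceptances is \(n\) plus a negative binomial number of rejections. When \(a\) is tiny, both are very variable.

a <- 1e-5   # a tiny unconditional acceptance probability
N <- 1e5    # number of proposals, so we expect aN = 1 acceptance

# Number of acceptances from N proposals: often 0, occasionally 2, 3, ...
table(rbinom(20, N, a))

# Proposals needed for n = 100 acceptances: mean n/a = 1e7, but a large spread
n <- 100
sd(n + rnbinom(1e4, size = n, prob = a))   # standard deviation about 1e6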

Next time. We look more closely at rejection sampling, and in particular at how we can target rejection sampling at a given distribution using the “envelope” method.

Summary:

  • In rejection sampling, we accept a proposed sample \(Y = x\) with probability \(\alpha(x)\).

  • If the PDF of a proposed sample is \(g\), then the PDF of an accepted sample is proportional to \(\alpha(x) \,g(x)\).

  • When the acceptance probability is low, rejection sampling can require a lot of proposed samples to get enough accepted samples.

Read more: Voss, An Introduction to Statistical Computing, Subsections 1.4.1 and 1.4.3.