\[\newcommand{\Exg}{\operatorname{\mathbb{E}}} \newcommand{\Ex}{\mathbb{E}} \newcommand{\Ind}{\mathbb{I}} \newcommand{\Var}{\operatorname{Var}} \newcommand{\Cov}{\operatorname{Cov}} \newcommand{\Corr}{\operatorname{Corr}} \newcommand{\ee}{\mathrm{e}}\]
1. Let \(X\) be uniform on \([-1,2]\).
(a) By hand, calculate the exact value of \(\Ex X^4\).
Solution.
\[\int_{-1}^2 x^4\,\frac{1}{2-(-1)}\,\mathrm{d}x = \tfrac13 \Big[\tfrac15x^5\Big]_{-1}^2 = \tfrac13\Big(\tfrac{32}{5}-\big(-\tfrac15\big)\Big) = \tfrac{33}{15} = \tfrac{11}{5} = 2.2\]
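(A quick numerical check of this hand calculation, my own addition: R's numerical integrator applied to \(x^4\) times the density \(\tfrac13\) on \([-1,2]\) should return 2.2.)
integrate(function(x) x^4 / 3, lower = -1, upper = 2)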
(b) Using R, calculate a Monte Carlo estimate for \(\Ex X^4\).
Solution. I used the R code
n <- 1e6
samples <- runif(n, -1, 2)
mean(samples^4)
[1] 2.200859
2. Let \(X\) and \(Y\) both be standard normal distributions. Compute a Monte Carlo estimate of \(\Exg \max\{X,Y\}\). (You may wish to investigate R’s pmax() function.)
Solution. By looking at ?pmax (or maybe searching on Google) I discovered that pmax() gives the “parallel maxima” of two (or more) vectors. That is, the first element of the result is the maximum of all the first elements of the vectors; the second element is the maximum of all the second elements; and so on.
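For example (my own illustration, not part of the original solution), pmax() applied to two short vectors takes the maximum position by position:
pmax(c(1, 5, 2), c(3, 0, 4))   # returns 3 5 4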
So I used the R code
n <- 1e6
xsamples <- rnorm(n)
ysamples <- rnorm(n)
mean(pmax(xsamples, ysamples))
[1] 0.5637339
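As a check (my own addition, not part of the original solution), the exact value can be found by hand: writing \(\max\{X,Y\} = \tfrac12(X+Y) + \tfrac12|X-Y|\) and using the fact that \(X - Y \sim \operatorname{N}(0,2)\), so \(\Exg|X-Y| = \sqrt{2}\,\sqrt{2/\pi} = 2/\sqrt{\pi}\), gives \[ \Exg\max\{X,Y\} = \tfrac12\Exg(X+Y) + \tfrac12\Exg|X-Y| = 0 + \frac{1}{\sqrt{\pi}} \approx 0.5642 , \] which is consistent with the Monte Carlo estimate above.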
3. You are trapped alone on an island. All you have with you is a tin can (radius \(r\)) and a cardboard box (side lengths \(2r \times 2r\)) that it fits snugly inside. You put the can inside the box [left picture].
When it starts raining, each raindrop that falls in the cardboard box might fall into the tin can [middle picture], or might fall into the corners of the box outside the can [right picture].
(a) Using R, simulate rainfall into the box. You may take units such that \(r = 1\). Estimate the probability \(\theta\) that a uniformly-random raindrop that falls in the cardboard box falls into the tin can also.
Solution. I set things up so that the box is \([-1, 1]^2\), centered at the origin. This means that the inside of the can is the set of points \((x,y)\) such that \(x^2 + y^2 \leq 1\).
n <- 1e6
rain_x <- runif(n, -1, 1)
rain_y <- runif(n, -1, 1)
in_box <- function(x, y) x^2 + y^2 <= 1
mean(in_box(rain_x, rain_y))
[1] 0.785823
(b) Calculate exactly the probability \(\theta\).
Solution. The area of the box is \(2r \times 2r = 4r^2\). The area of the can is \(\pi r^2\). So the probability a raindrop landing in the box lands in the can is \[ \frac{\text{area of can}}{\text{area of box}} = \frac{\pi r^2}{4r^2} = \frac{\pi}{4} \approx 0.785. \]
(c) You realise that, despite being trapped all alone on the island, you now have a way of approximating the value of \(\pi\). If you want to calculate \(\pi\) to 6 decimal places, roughly how many raindrops do you need to fall into the box?
Solution. The phrase “to 6 decimal places” isn’t a precise mathematical one. I’m going to interpret this as getting the root-mean-square error below \(10^{-6}\). If you interpret it slightly differently that’s fine – for example, getting the width of a 95% confidence interval below \(10^{-6}\) could be another, slightly stricter, criterion.
One could work this out by hand. Since the variance of a Bernoulli random variable is \(p(1-p)\), the mean-square error of our estimator is \[ \frac{\frac{\pi}{4}(1 - \frac{\pi}{4})}{n} \approx \frac{0.169}{n} . \] So we need \[n = \frac{0.169}{(10^{-6})^2} \approx 169 \text{ billion} . \]
That said, if we are trapped on our desert island, maybe we don’t know what \(\frac{\pi}{4}(1 - \frac{\pi}{4})\) is. In that case we could do this using the can and the box. Our estimate of the variance is
var_est <- var(in_box(rain_x, rain_y))
var_est / (1e-6)^2
[1] 168305380974
We will probably spend a long time waiting for that much rain!
4. Let \(h(x) = 1/(x + 0.1)\). We wish to estimate \(\int_0^5 h(x) \, \mathrm{d}x\) using a Monte Carlo method.
(a) Estimate the integral using \(X\) uniform on \([0,5]\).
Solution.
n <- 1e6
integrand <- function(x) 1 / (x + 0.1)
samples1 <- runif(n, 0, 5)
mean(5 * integrand(samples1))
[1] 3.929335
(In fact, the true answer is \(\log(5.1) - \log(0.1) = 3.932\), so it looks like this is working correctly.)
(b) Can you come up with a choice of \(X\) that improves on the estimate from (a)?
Solution. Let’s look at a graph of the integrand \(h\).
curve(
  integrand, col = "blue", lwd = 3,
  n = 1001, from = 0, to = 5,
  xlab = "x", ylab = "integrand h(x)", xlim = c(0, 5)
)
abline(h = 0)
We see that we get a much bigger contribution to the integral from values near 0. So a random variable that picks values nearer to 0 more often might have a chance of giving a more accurate result.
I decided to try an exponential distribution with rate 1, which should sample the smaller values of \(x\) more often.
pdf <- function(x) dexp(x, 1)
phi <- function(x) (integrand(x) / pdf(x)) * (x <= 5)

samples2 <- rexp(n, 1)
mean(phi(samples2))
[1] 3.935744
(I had to include x <= 5 in the expression for \(\phi\), because my exponential distribution will sometimes take samples above 5, but they should count as 0 in an estimate for the integral between 0 and 5.)
To see whether or not this was an improvement, I estimated the mean-square error.
var(5 * integrand(samples1)) / n
[1] 3.361979e-05
var(phi(samples2)) / n
[1] 6.073885e-06
I found that I had reduced the mean-square error by roughly a factor of 5.
5. When calculating a Monte Carlo estimate \(\frac{1}{n} \sum_{i=1}^n \phi(x_i)\), one might wish to first generate the \(n\) samples \((x_1, x_2, \dots, x_n)\) and store them, and only then, after all samples are generated, finally calculate the estimate. However, when \(n\) is extremely large, storing all \(n\) samples uses up a lot of space in a computer’s memory. Describe (in words, in R code, or in a mixture of the two) how the Monte Carlo estimate could be produced using much less memory.
Solution. The idea is to keep a “running total” of the \(\phi(x_i)\)s. Then we only have to store that running total, not all the samples. Once this has been done \(n\) times, then divide by \(n\) to get the estimate.
In R code, this might be something like
n <- 1e6

total <- 0
for (i in 1:n) {
  sample <- # sampling code for 1 sample
  total <- total + phi(sample)
}
MCest <- total / n
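As a concrete runnable version (my own sketch, re-using Question 1(b) as the example), here is the same idea applied to estimating \(\Ex X^4\) for \(X\) uniform on \([-1,2]\):
n <- 1e6
total <- 0
for (i in 1:n) {
  x <- runif(1, -1, 2)    # generate one sample at a time
  total <- total + x^4    # update the running total; the sample is then discarded
}
MCest <- total / n
This uses only a constant amount of memory, although the explicit loop is slower in R than the vectorised version from Question 1.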
6. Show that the indicator functions \(\mathbb I_A(X)\) and \(\mathbb I_B(X)\) have correlation 0 if and only if the events \(\{X \in A\}\) and \(\{X \in B\}\) are independent.
Solution. Recall that two random variables \(U\), \(V\) have correlation 0 if and only if their covariance \(\Cov(U,V) = \Ex UV - (\Ex U)(\Ex V)\) is 0 too.
We know that \(\Exg\Ind_A(X) = \mathbb P(X \in A)\) and \(\Exg \Ind_B(X) = \mathbb P(X \in B)\). What about \(\Exg \Ind_A(X) \Ind_B(X)\)? Well, \(\Ind_A(x) \Ind_B(x)\) is 1 if and only if both indicator functions equal 1, which is if and only if both \(x \in A\) and \(x \in B\). So \(\Exg \Ind_A(X) \Ind_B(X) = \mathbb P(X \in A \text{ and } X \in B)\).
So the covariance is \[ \Cov \big(\Ind_A(X), \Ind_B(X) \big) = \mathbb P(X \in A \text{ and } X \in B) - \mathbb P(X \in A)\, \mathbb P(X \in B) . \] If this is 0, then \(\mathbb P(X \in A \text{ and } X \in B) = \mathbb P(X \in A)\, \mathbb P(X \in B)\), which is precisely the definition of those two events being independent.
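As an optional illustration (my own addition, not part of the original solution), here is a quick simulation. For \(X\) uniform on \([0,1]\), the events \(A = \{X < \tfrac12\}\) and \(B = \{X \bmod \tfrac12 < \tfrac14\}\) are independent, each with probability \(\tfrac12\), so the sample correlation of the indicators should be close to 0.
n <- 1e6
x <- runif(n)
ind_A <- as.numeric(x < 0.5)           # indicator of A
ind_B <- as.numeric(x %% 0.5 < 0.25)   # indicator of B
cor(ind_A, ind_B)                      # should be close to 0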
7. Let \(X\) be an exponential distribution with rate 1.
(a) Estimate \(\mathbb EX^{2.1}\) using the standard Monte Carlo method.
Solution.
n <- 1e6
samples <- rexp(n, 1)
mean(samples^2.1)
[1] 2.204066
(b) Estimate \(\mathbb EX^{2.1}\) using \(X^2\) as a control variate. (You may recall that if \(Y\) is exponential with rate \(\lambda\) then \(\mathbb EY^2 = 2/\lambda^2\).)
Solution. We have \(\Ex X^2 = 2\). So, re-using the same sample as before (you don’t have to do this – you could take new samples), our R code is as follows.
mean(samples^2.1 - samples^2) + 2
[1] 2.198793
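(As a check, my own addition: for \(X \sim \operatorname{Exp}(1)\) we have \(\Ex X^k = \Gamma(k+1)\), so the true value is \(\Gamma(3.1)\), which gamma(3.1) in R gives as approximately 2.198, consistent with both estimates.)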
(c) Which method is better?
Solution. The better method is the one with the smaller mean-square error.
For the basic method,
var(samples^2.1) / n
[1] 2.82498e-05
For the control variate method,
var(samples^2.1 - samples^2) / n
[1] 6.958636e-07
So the control variates method is much, much better.
8. Let \(Z\) be a standard normal distribution. A statistician has been investigating Monte Carlo estimates of \(\mathbb EZ^k\) for different positive integers values of \(k\). Her colleague suggests using \(Z' = -Z\) as an antithetic variable. Without running any R code, explain whether or not this is a good idea (a) when \(k\) is even; (b) when \(k\) is odd.
Solution.
(a) When \(k\) is even, we have \(Z^k = (-Z)^k\). So the antithetic variables method just repeats each sample twice. This is obviously no benefit at all, and just wastes time. Indeed, we have perfect positive correlation \(\rho = +1\), which is the “worst-case scenario”.
(b) When \(k\) is odd, we have \(Z^k = -(-Z)^k\). In this case we know that \(\Ex Z^k = 0\), because the results for positive \(z\) exactly balance out those for negative \(z\), so no Monte Carlo sampling is necessary. If our statistician has somehow forgotten that, though, she will get a pleasant surprise! After just two samples, she will get the estimate \[\frac{1}{2} \big(Z_1^k + (-Z_1)^k \big) = \frac{1}{2} (Z_1^k - Z_1^k) = 0 ,\] thereby getting the result exactly right. Indeed, we have perfect negative correlation \(\rho = -1\), which is the “best-case scenario”.
\[\newcommand{\Exg}{\operatorname{\mathbb{E}}} \newcommand{\Ex}{\mathbb{E}} \newcommand{\Ind}{\mathbb{I}} \newcommand{\Var}{\operatorname{Var}} \newcommand{\Cov}{\operatorname{Cov}} \newcommand{\Corr}{\operatorname{Corr}} \newcommand{\ee}{\mathrm{e}}\]
1. [2018 exam, Question 4]
(a) Suppose it is desired to estimate the value of an integral \[ \theta = \int_0^1 h(x)\,\mathrm{d}x \] by Monte Carlo integration.
i. By splitting \(h(x) = \phi(x)f(x)\), where \(f\) is a probability density function, describe how the Monte Carlo method of estimating \(\theta\) works.
Solution. As suggested, we have \[ \theta = \int_0^1 h(x)\,\mathrm{d}x = \int_0^1 \phi(x)\,f(x)\,\mathrm{d}x = \Exg \phi(X) ,\] where \(X\) is a random variable with PDF \(f\). To estimate this, we sample \(X_1, \dots, X_n\) from \(X\), and use \[ \widehat\theta_n^{\mathrm{MC}} = \frac{1}{n} \sum_{i=1}^n \phi(X_i) . \]
ii. Let \(\widehat\theta_n\) be the Monte Carlo estimate based on \(n\) simulations. Find the expectation and variance of \(\widehat\theta_n\), as a function of \(n\).
Solution. As in lectures, we have \[ \Exg \widehat\theta_n^{\mathrm{MC}} = \Ex \left(\frac{1}{n} \sum_{i=1}^n \phi(X_i) \right) = \frac{1}{n}\,n\Exg\phi(X) = \Exg\phi(X) = \theta \] and \[ \Var\big(\widehat\theta_n^{\mathrm{MC}}\big) = \Var \left(\frac{1}{n} \sum_{i=1}^n \phi(X_i) \right) = \frac{1}{n^2}\,n\Var\big(\phi(X)\big) = \frac{1}{n}\Var\big(\phi(X)\big) . \]
iii. What guidelines can be given for the choice of \(f\) in practice?
Solution. First, \(f\) (or equivalently \(X\)) should be easy to sample from. Second, we want to minimise \(\Var(\phi(X))\), so should pick \(f\) approximately proportional to \(h\), so that \(\phi\) is roughly constant, and therefore has low variance.
In the absence of better options, \(X\) being uniform on \([0, 1]\) (so \(f(x) = 1\) on this interval) is often not a bad choice.
(b) Consider evaluating the integral \[ \int_0^1 x^2\,\mathrm{d}x \] by Monte Carlo integration using \(f(x) = \Ind_{[0,1]}(x)\). Write down the Monte Carlo estimator \(\widehat\theta_n\).
Solution. Since \(f(x) = 1\) on this interval, we must take \(\phi(x) = x^2\). Thus the estimator is \[\widehat\theta_n = \frac{1}{n} \sum_{i=1}^n X_i^2 ,\] where \(X_i \sim \operatorname{U}[0,1]\) are independent.
Explain how antithetic variables can be used in this situation, and justify briefly why their use here is guaranteed to improve efficiency.
Solution. Antithetic variables attempt to reduce the variance in Monte Carlo estimation by using pairs of variables \((X_i, X'_i)\) that both have the same distribution as \(X\), but where \(\phi(X)\) and \(\phi(X')\) are negatively correlated.
In this situation, if \(X_i \sim \operatorname{U}[0,1]\), then \(X'_i = 1 - X_i\) has this same distribution. The corresponding antithetic variable estimator is \[\widehat\theta_n^{\mathrm{AV}} = \frac{1}{n} \sum_{i=1}^{n/2} \big(X_i^2 + (1 - X_i)^2\big) .\]
An antithetic variables estimator always decreases the variance of a Monte Carlo estimator if the correlation (or, equivalently, the covariance) between \(\phi(X)\) and \(\phi(X')\) is negative. We saw in lectures that, if \(X \sim \operatorname{U}[0,1]\) and \(\phi\) is monotonically increasing, then \(\phi(X)\) and \(\phi(1 - X)\) have negative correlation. Since \(x^2\) is increasing on \([0,1]\), that is the case here.
For \(U \sim \operatorname{U}[0,1]\), use the results \[\Ex U^2 = \tfrac13 \qquad \Ex U^4 = \tfrac15 \qquad \Ex U^2(1-U)^2 = \tfrac{1}{30} \] to find the correlation between \(U^2\) and \((1-U)^2\). Hence, or otherwise, confirm that using antithetic variables reduces the variance of the Monte Carlo estimator by a factor of 8.
Solution. We need to find the correlation between \(U^2\) and \((1 - U)^2\), where \(U \sim \operatorname{U}[0,1]\). The covariance is \[ \operatorname{Cov}\big(U^2, (1-U)^2\big) = \Ex U^2(1-U)^2 - \big(\Ex U^2\big) \big(\Ex (1-U)^2\big) = \tfrac{1}{30} - \Big(\tfrac{1}{3}\Big)^2 = -\tfrac{7}{90} .\] The variances are both \[ \Var(U^2) = \Ex U^4 - \big(\Ex U^2\big)^2 = \tfrac15 - \Big(\tfrac{1}{3}\Big)^2 = \tfrac{4}{45} . \] Hence, the correlation is \[ \rho = \operatorname{Corr}\big(U^2, (1-U)^2\big) = \frac{-\frac{7}{90}}{\frac{4}{45}} = -\tfrac78 . \]
The variance of the standard Monte Carlo estimator is \(\frac{1}{n}\Var(\phi(X))\), while the variance of the antithetic variables estimator is \(\frac{1+\rho}{n}\Var(\phi(X))\). So the variance changes by a factor of \(1 + \rho\), which here is \(1 - \frac{7}{8} = \frac{1}{8}\). So the variance reduces by a factor of 8, as claimed.
(You may use without proof any results about the variance of antithetic variable estimates, but you should clearly state any results you are using.)
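(A quick numerical check of the factor-of-8 claim, my own addition rather than part of the exam solution: compare the two estimators' variances empirically, using \(n\) function evaluations in each case.)
n <- 1e6
u <- runif(n / 2)
var_mc <- var(runif(n)^2) / n                   # standard MC: Var(phi(X)) / n
var_av <- var((u^2 + (1 - u)^2) / 2) / (n / 2)  # antithetic: n/2 pairs
c(var_mc, var_av, var_mc / var_av)              # the ratio should be close to 8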
2. Let \(X \sim \operatorname{N}(0,1)\). Consider importance sampling estimation for the probability \(\theta = \mathbb P(3 \leq X \leq 4)\) using samples \(Y_i\) from the following sample distributions: (i) \(Y \sim \operatorname{N}(1,1)\); (ii) \(Y \sim \operatorname{N}(2,1)\); (iii) \(Y \sim \operatorname{N}(3.5,1)\); (iv) \(Y \sim 3 + \operatorname{Exp}(1)\).
Each of these four distributions gives rise to a different importance sampling method. Our aim is to compare the resulting estimates.
(a) For each of the four methods, estimate the variance \(\Var\big(\frac{f(Y)}{g(Y)}\,\phi(Y)\big)\). Which of these four methods gives the best results?
Solution. I used the following R code
n <- 1e5
phi <- function(x) (x >= 3) & (x <= 4)
pdf_x <- function(x) dnorm(x, 0, 1)

pdf_y1 <- function(x) dnorm(x, 1, 1)
samples_y1 <- rnorm(n, 1, 1)
var1 <- var((pdf_x(samples_y1) / pdf_y1(samples_y1)) * phi(samples_y1))

pdf_y2 <- function(x) dnorm(x, 2, 1)
samples_y2 <- rnorm(n, 2, 1)
var2 <- var((pdf_x(samples_y2) / pdf_y2(samples_y2)) * phi(samples_y2))

pdf_y3 <- function(x) dnorm(x, 3.5, 1)
samples_y3 <- rnorm(n, 3.5, 1)
var3 <- var((pdf_x(samples_y3) / pdf_y3(samples_y3)) * phi(samples_y3))

pdf_y4 <- function(x) dexp(x - 3, 1)
samples_y4 <- 3 + rexp(n, 1)
var4 <- var((pdf_x(samples_y4) / pdf_y4(samples_y4)) * phi(samples_y4))

signif(c(var1, var2, var3, var4), 3)
[1] 8.20e-05 1.39e-05 6.69e-06 1.92e-06
(For the fourth PDF, we used that the PDF of \(3 + Z\) is \(f_Z(z-3)\).)
We see that the fourth method \(3 + \operatorname{Exp}(1)\) is the most accurate.
(b) Determine a good estimate for \(\mathbb P(3 \leq X \leq 4)\), and discuss the accuracy of your estimate.
Solution. We’ll use the fourth method, and boost the number of samples to one million.
n <- 1e6
samples_y4 <- 3 + rexp(n, 1)   # redraw the samples at the larger sample size
ISest <- mean((pdf_x(samples_y4) / pdf_y4(samples_y4)) * phi(samples_y4))
IS_MSE <- var((pdf_x(samples_y4) / pdf_y4(samples_y4)) * phi(samples_y4)) / n
c(ISest, sqrt(IS_MSE))
[1] 1.316356e-03 1.387238e-06
So our estimate is \(\widehat\theta = 0.00132\). Since the RMSE is three orders of magnitude smaller than the estimate, the estimate is probably accurate to a couple of significant figures.
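(As a check, my own addition: the exact probability is available here via the normal CDF, \(\mathbb P(3 \leq X \leq 4) = \Phi(4) - \Phi(3)\), which pnorm(4) - pnorm(3) in R gives as about 0.001318, consistent with the estimate above.)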
(c) For each of the four methods, approximate how many samples from \(Y\) are required to reduce the root-mean-square error of the estimate of \(\mathbb P(3 \leq X \leq 4)\) to 1%?
Solution. To get the error to 1% means an absolute error of roughly \(\epsilon = 0.01\widehat\theta\). Then we know that the required number of samples is \[ n = \frac{\Var\big(\frac{f(Y)}{g(Y)}\phi(Y)\big)}{\epsilon^2} . \]
eps <- 0.01 * ISest
round(c(var1, var2, var3, var4) / eps^2)
[1] 473334 80024 38607 11106
3. [2017 exam, Question 3]
(a) Defining any notation you use, write down the basic Monte Carlo estimator \(\widehat\theta_n^{\mathrm{MC}}\) and the importance sampling estimator \(\widehat\theta_n^{\mathrm{IS}}\) for an expectation of the form \(\theta = \Exg \phi(X)\), where \(X\) is a random variable with probability density function \(f\).
What is the advantage of importance sampling over the standard Monte Carlo method?
Solution. The basic Monte Carlo estimator is \[ \widehat\theta_n^{\mathrm{MC}} = \frac{1}{n} \sum_{i=1}^n\phi(X_i) , \] where the \(X_i\) are independent random samples from \(X\).
The basic importance sampling estimator is \[ \widehat\theta_n^{\mathrm{IS}} = \frac{1}{n} \sum_{i=1}^n \frac{f(Y_i)}{g(Y_i)} \,\phi(Y_i) , \] where the \(Y_i\) are independent random samples from a random variable \(Y\) with probability density function \(g\). We must have \(g(y) > 0\) whenever \(f(y) > 0\).
The main advantage of the importance sampling estimator is that can reduce the variance of the estimator by oversampling the most important areas of \(y\), but then downweighting those samples. Another advantage is that importance sampling can be used when it is difficult to sample from \(X\).
(b) Prove that both the basic Monte Carlo and importance sampling estimates from part (a) are unbiased.
Solution. For the standard Monte Carlo estimator, \[ \Exg \theta_n^{\mathrm{MC}} = \Ex \left(\frac{1}{n} \sum_{i=1}^n \phi(X_i) \right) = \frac{1}{n}\,n\Exg\phi(X) = \Exg\phi(X) = \theta .\]
For the importance sampling estimator, first note that \[ \mathbb E \left(\frac{f(Y)}{g(Y)}\,\phi(Y)\right) = \int_{-\infty}^{+\infty} \frac{f(y)}{g(y)}\,\phi(y)\,g(y)\,\mathrm{d}y = \int_{-\infty}^{+\infty} \phi(y) \,f(y) \, \mathrm{d}y = \Exg\phi(X) , \] since \(f\) is the PDF of \(X\). Hence \[ \Exg \widehat\theta_n^{\mathrm{IS}} = \Exg \left(\frac{1}{n} \sum_{i=1}^n \frac{f(Y_i)}{g(Y_i)} \,\phi(Y_i) \right) = \frac{1}{n}\,n\,\Ex \left(\frac{f(Y)}{g(Y)}\,\phi(Y)\right) = \Exg \phi(X) = \theta. \]
Hence, both estimators are unbiased.
(c) Show that the variance of the importance sampling estimator is given by \[ \Var\big(\widehat\theta_n^{\mathrm{IS}}\big) = \frac{1}{n}\int_{-\infty}^{+\infty} \frac{f(y)^2 \phi(y)^2}{g(y)}\,\mathrm{d}y - \frac{1}{n}\big(\Exg \phi(X)\big)^2. \]
Solution. First, note that \[\begin{align} \Var \left(\frac{f(Y)}{g(Y)}\,\phi(Y)\right) &= \Ex \left(\frac{f(Y)}{g(Y)}\,\phi(Y)\right)^2 - \big(\Exg \phi(X)\big)^2 \\ &= \int_{-\infty}^{+\infty} \left(\frac{f(y)}{g(y)}\,\phi(y)\right)^2 g(y)\,\mathrm{d}y - \big(\Exg \phi(X)\big)^2\\ &= \int_{-\infty}^{+\infty} \frac{f(y)^2}{g(y)^2}\,\phi(y)^2 \,g(y)\,\mathrm{d}y - \big(\Exg \phi(X)\big)^2\\ &= \int_{-\infty}^{+\infty} \frac{f(y)^2}{g(y)}\,\phi(y)^2 \,\mathrm{d}y - \big(\Exg \phi(X)\big)^2 \end{align} \] Second, we have \[ \Var\big(\widehat\theta_n^{\mathrm{IS}}\big) = \Var \left(\frac{1}{n} \sum_{i=1}^n \frac{f(Y_i)}{g(Y_i)} \,\phi(Y_i) \right) = \frac{1}{n} \Var \left(\frac{f(Y)}{g(Y)}\,\phi(Y)\right) . \] Putting these together proves the result.
(d) Let \(X \sim \operatorname{N}(0,2)\) and \(a \in \mathbb R\). We want to estimate \(\theta = \Ex \big(\sqrt{2}\exp(-(X-a)^2/4)\big)\), using importance sampling with samples \(Y \sim \operatorname{N}(\mu, 1)\) for some \(\mu \in \mathbb R\). Using the result from part (c), or otherwise, show that in this case the variance of the importance sampling estimator \(\widehat\theta_n^{\mathrm{IS}}\) is given by \[ \Var\big(\widehat\theta_n^{\mathrm{IS}}\big) = \frac{\exp(\mu^2 - a\mu) - \theta^2}{n} . \] [Note: This equation has changed since an earlier version of the question.]
Solution. It’s clear, using part (c), that it will suffice to show that \[ \int_{-\infty}^{+\infty} \frac{f(y)^2 \phi(y)^2}{g(y)}\,\mathrm{d}y = \exp(\mu^2 - a\mu) . \] Here, we have \[ \begin{align} f(y) &= \frac{1}{\sqrt{4\pi}} \exp\big(-\tfrac14 y^2 \big) \\ g(y) &= \frac{1}{\sqrt{2\pi}} \exp\big(-\tfrac12 (y-\mu)^2 \big) \\ \phi(y) &= \sqrt{2} \exp\big(-\tfrac14 (y-a)^2 \big) . \end{align} \] Therefore, by a long and painful algebra slog, we have \[ \begin{align} \frac{f(y)^2 \phi(y)^2}{g(y)} &= \frac{\frac{1}{4\pi} \exp\big(-\tfrac12 y^2 \big)\times 2\exp\big(-\tfrac12 (y-a)^2 \big)}{\frac{1}{\sqrt{2\pi}} \exp\big(-\tfrac12 (y-\mu)^2 \big)} \\ &= \frac{1}{\sqrt{2\pi}} \exp \Big(-\tfrac12 \big(y^2 + (y-a)^2 - (y-\mu)^2 \big)\Big) \\ &= \frac{1}{\sqrt{2\pi}} \exp \Big(-\tfrac12 \big(y^2 - 2(a - \mu)y + a^2 - \mu^2\big)\Big) \\ &= \frac{1}{\sqrt{2\pi}} \exp \Big(-\tfrac12 \big((y - (a - \mu))^2 - (a- \mu)^2 + a^2 - \mu^2\big)\Big) \\ &= \frac{1}{\sqrt{2\pi}} \exp \Big(-\tfrac12 \big((y - (a - \mu))^2 + 2a\mu - 2\mu^2 \big)\Big) \\ &= \frac{1}{\sqrt{2\pi}} \exp \Big(-\tfrac12 \big(y - (a - \mu)\big)^2\Big)\exp(\mu^2 - a\mu), \end{align} \] where we ‘completed the square’ on the fourth line. Thus \[ \begin{align} \int_{-\infty}^{+\infty} \frac{f(y)^2 \phi(y)^2}{g(y)}\,\mathrm{d}y &= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp \Big(-\tfrac12 \big(y - (a - \mu)\big)^2\Big)\exp(\mu^2 - a\mu) \,\mathrm{d}y\\ &= \exp(\mu^2 - a\mu) \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp \Big(-\tfrac12 \big(y - (a - \mu)\big)^2\Big)\,\mathrm{d}y \\ &= \exp(\mu^2 - a\mu) , \end{align} \] since the big integral on the right is the integral of the PDF of a normal \(\operatorname{N}(a-\mu, 1)\) distribution, so equals 1.
Hence, we have proved the result.
For fixed \(n\) and \(a\), find the value of \(\mu\) for which the importance sampling estimator has the smallest mean-square error. Comment on the result.
Solution. Minimising this expression is equivalent to minimising \(\mu^2 - a\mu\). By differentiating with respect to \(\mu\) (or otherwise), we see that the minimum is at \(\mu = \tfrac12 a\). This seems sensible: the optimal proposal distribution is centred halfway between the centre of \(f\) (at 0) and the centre of \(\phi\) (at \(a\)), so it concentrates samples where the integrand \(\phi(y)f(y)\) is largest.
4. (Answer the following question “by hand”, without using R. You may check your answer with R, if you wish.)
(a) Consider the LCG with modulus \(m = 2^4 = 16\), multiplier \(a = 5\), and increment \(c = 8\). What is the period of this LCG when started from the seed (i) \(x_1 = 1\); (ii) \(x_1 = 2\)?
Solution. For (i), we have \[ \begin{align} x_1 &= 1 \\ x_2 &= (5 \times 1 + 8) \bmod 16 = 13 \bmod 16 = 13 \\ x_3 &= (5 \times 13 + 8) \bmod 16 = 73 \bmod 16 = 9 \\ x_4 &= (5 \times 9 + 8) \bmod 16 = 53 \bmod 16 = 5 \\ x_5 &= (5 \times 5 + 8) \bmod 16 = 33 \bmod 16 = 1 . \end{align}\] Here, \(x_5\) is a repeat of \(x_1\), so the period is \(5-1=4\).
For (ii), we have \[ \begin{align} x_1 &= 2 \\ x_2 &= (5 \times 2 + 8) \bmod 16 = 18 \bmod 16 = 2 . \end{align}\] This is an immediate repeat – the seed 2 is a fixed point of this LCG – so the period is 1.
(b) Consider the LCG with modulus \(m = 2^4 = 16\), multiplier \(a = 2\), and increment \(c = 4\). Start from the seed \(x_1 = 3\). (i) When do we first see a repeat output? (ii) What is the period?
Solution. We have \[ \begin{align} x_1 &= 3 \\ x_2 &= (2 \times 3 + 4) \bmod 16 = 10 \bmod 16 = 10 \\ x_3 &= (2 \times 10 + 4) \bmod 16 = 24 \bmod 16 = 8 \\ x_4 &= (2 \times 8 + 4) \bmod 16 = 20 \bmod 16 = 4 \\ x_5 &= (2 \times 4 + 4) \bmod 16 = 12 \bmod 16 = 12 \\ x_6 &= (2 \times 12 + 4) \bmod 16 = 28 \bmod 16 = 12 . \end{align}\]
(i) The first repeat is \(x_6\).
(ii) Since 12 is a fixed point of this LCG, the remainder of the sequence is 12 forever, with period 1.
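(The question allows a check in R. A quick sketch of such a check for part (b), my own addition: iterate the recursion directly and print the first six values, which should come out as 3, 10, 8, 4, 12, 12.)
x <- 3
for (i in 1:6) {
  print(x)               # prints x_1, ..., x_6
  x <- (2 * x + 4) %% 16
}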
5. Consider the following LCGs with modulus \(m = 2^8 = 256\):
(i) \(a = 31\), \(c = 47\);
(ii) \(a = 21\), \(c = 47\);
(iii) \(a = 129\), \(c = 47\).
(a) Without using a computer, work out which of these LCGs have a full period of 256.
Solution.
(i) Here, \(a\) is \(3 \bmod 4\), not \(1 \bmod 4\), so this does not have full period of 256.
(ii) Here, \(c\) is odd and \(a\) is \(1 \bmod 4\), so this does have full period of 256.
(iii) Here, \(c\) is odd and \(a\) is \(1 \bmod 4\), so this does have full period of 256.
(b) Which of these LCGs would make good pseudorandom number generators?
Solution. We will check the random appearance of the outputs using R.
lcg <- function(n, modulus, mult, incr, seed) {
  samples <- rep(0, n)
  samples[1] <- seed
  for (i in 1:(n - 1)) {
    samples[i + 1] <- (mult * samples[i] + incr) %% modulus
  }
  return(samples)
}
(i) This does not have full period, so it is unlikely to be a good PRNG. Let’s check.
m <- 2^8
seed <- 1
plot(lcg(m, m, 31, 47, seed))
The picture confirms that this does not look random, and in fact has a very short period of 16.
(ii) This does have full period, so is a candidate for a good PRNG if the output looks random.
plot(lcg(m, m, 21, 47, seed))
This mostly looks random, but there does appear to be a sort of thick diagonal line in the picture going from bottom left to top right. I might be happy to use this for casual statistical work – the lack of randomness does not seem super-serious – but I would avoid this for cryptographic purposes, for example.
(iii) This does have full period, so is a candidate for a good PRNG if the output looks random.
plot(lcg(m, m, 129, 47, seed))
Even though this has full period, there is a very clear non-random pattern. This is not appropriate for a PRNG.
6. Consider an LCG with modulus \(m = 2^{10} = 1024\), multiplier \(a = 125\), and increment \(c = 5\). Using R:
(a) Generate 1000 outputs in \(\{0, 1, \dots, 1023\}\) from this LCG, starting with the seed \(x_1 = 1\).
[This question originally said 200 outputs, not 1000. It’s fine if you answered that version, but the conclusions to the problem are less interesting that way.]
Solution. Using the same lcg() function from Question 5, we have
n <- 1000
m <- 2^10
seed <- 1
samples1 <- lcg(n, m, 125, 5, seed)
(b) Convert these 1000 outputs to pseudorandom uniform samples in \([0, 1]\).
Solution. To do this, we simply divide the samples by \(m\).
samples1 <- samples1 / m
(c) Using these samples, obtain a Monte Carlo estimate for \(\mathbb E\cos(U)\), where \(U \sim \operatorname{U}[0,1]\).
Solution.
mean(cos(samples1))
[1] 0.8418218
This is very close to the correct answer \(\sin 1 = 0.8414\).
(d) What is the root-mean-square error of your estimate?
Solution.
sqrt(var(cos(samples1)) / n)
[1] 0.004392217
(e) Repeat parts (a) to (d) for the LCG with the same \(m\), but now with multiplier \(a = 127\) and increment \(c = 4\).
Solution.
samples2 <- lcg(n, m, 127, 4, seed)
samples2 <- samples2 / m
mean(cos(samples2))
[1] 0.868257
sqrt(var(cos(samples2)) / n)
[1] 0.003899394
This does not seem very close to the correct answer \(\sin 1 = 0.841\), and the reported RMSE is too small to account for the error.
However, the problem is that this LCG is not actually a uniform (pseudo)random number generator – it has period 8.
plot(samples2)
Thus the estimator just keeps using the same 8 points over and over again. So this is actually estimating \(\mathbb E(\cos Y)\), where \(Y\) is uniform on the 8 points actually visited by the LCG. So while the correct answer is \(\Ex \cos U = \sin 1 = 0.841\), this is in fact estimating \[ \begin{multline} \frac{1}{8} \bigg(\cos \frac{1}{2^{10}} + \cos \frac{131}{2^{10}} + \cos \frac{257}{2^{10}} + \cos \frac{899}{2^{10}} \\ + \cos \frac{513}{2^{10}} + \cos \frac{643}{2^{10}} + \cos \frac{769}{2^{10}} + \cos \frac{387}{2^{10}}\bigg) = 0.868 \end{multline} \] (where the numerators are the 8 values visited by the LCG), which is not the correct answer.
\[\newcommand{\Exg}{\operatorname{\mathbb{E}}} \newcommand{\Ex}{\mathbb{E}} \newcommand{\Ind}{\mathbb{I}} \newcommand{\Var}{\operatorname{Var}} \newcommand{\Cov}{\operatorname{Cov}} \newcommand{\Corr}{\operatorname{Corr}} \newcommand{\ee}{\mathrm{e}}\]
1. Consider a discrete random variable that takes values \(1, 2, 3, 4.5\) with probabilities \(0.2, 0.2, 0.5, 0.1\) respectively. Write some R code that will sample from this distribution. (Your code may use the runif() function, but may not use the sample() function.) Check that a large sample from your code really does have the correct distribution.
Solution. There are various ways to do this. With only four outcomes, you can just write a lot of this “by hand”, in a way that wouldn’t be practical if the range were very large, but is pretty easy to write – that’s what I did.
rq1 <- function(n) {
  x <- c(1, 2, 3, 4.5)
  prob <- c(0.2, 0.2, 0.5, 0.1)
  cumprob <- cumsum(prob)
  unif <- runif(n)
  ifelse(unif <= cumprob[1], x[1],
    ifelse(unif <= cumprob[2], x[2],
      ifelse(unif <= cumprob[3], x[3], x[4])))
}
Let’s test its accuracy by drawing a bar plot of a large sample.
n <- 1e6
samples <- rq1(n)
plot(table(samples) / n)
Looks good to me.
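For a random variable with many more possible values, nested ifelse() calls become impractical, as noted above. A more scalable sketch (my own suggestion, not part of the original solution) uses findInterval() to locate which segment of the cumulative probabilities each uniform sample falls into:
rq1_general <- function(n, x, prob) {
  cumprob <- cumsum(prob)
  unif <- runif(n)
  # findInterval() counts how many dividing lines lie at or below each
  # uniform sample, so adding 1 gives the index of the chosen value
  x[findInterval(unif, cumprob) + 1]
}

rq1_general(10, c(1, 2, 3, 4.5), c(0.2, 0.2, 0.5, 0.1))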
2. The geometric distribution \(X \sim \operatorname{Geom}(p)\) represents the number of trials until the first success, where each trial succeeds independently with probability \(p\). The probability mass function of \(X\) is \[ p(x) = (1-p)^{x-1}p \qquad x = 1, 2, \dots. \]
(a) Show that the cumulative distribution function \(F\) of \(X\) is given by \[ F(x) = 1 - (1-p)^x \qquad x = 1, 2, \dots .\]
Solution. I can think of two ways to do this. The first way is to just sum the probabilities, using the formula \[ \sum_{z=0}^{x-1} a^z = \frac{1 - a^x}{1-a} \] for the sum of a geometric progression. We have \[ F(x) = \sum_{y=1}^x (1-p)^{y-1} p = p \sum_{z=0}^{x-1} (1-p)^z = p \frac{1 - (1-p)^{x}}{1 - (1-p)} = 1 - (1-p)^x , \] where in the second equality we shifted the index of the sum with \(z = y-1\).
Alternatively (and, in my opinion, better), one can think about what the geometric distribution means. The geometric probability \(p(x)\) is the probability the first success occurs on the \(x\)th trial. So the complement of the CDF, \(1 - F(x) = \mathbb P(X > x)\), is the probability the first success happens after the \(x\)th trial, which happens if and only if the first \(x\) trials are all failures. This is \[ 1 - F(x) = \mathbb P(X > x) = (1-p)^x \] because each of those \(x\) trials fails with probability \(1 - p\). This gives the answer.
(b) Write down a function – either in mathematical notation or in R code – that will transform a standard uniform random variable \(U\) into a geometric distribution. Try to make your function as simple as possible.
Solution. The “dividing lines” between the segments come at the values \(u\) where \(u = 1- (1-p)^x\). Actually, since we’re trying to make the function simple, we could use \(v = 1-u\) (since one minus a standard uniform is still standard uniform), with dividing lines at \(v = (1-p)^x\), which corresponds to \(x = \log v / \log(1-p)\). So the segment corresponding to a uniformly distributed \(v\) will correspond to this value rounded up to the next integer.
So we can take \(U \sim \operatorname{U}[0,1]\), and put \[ X = \left\lceil \frac{\log U}{\log(1-p)} \right\rceil . \]
Let’s check this with R code
rgeom2 <- function(n, p) {
  unif <- runif(n)
  ceiling(log(unif) / log(1 - p))
}
n <- 1e6
p <- 1/3
samples <- rgeom2(n, p)
plot(table(samples) / n, xlim = c(0, 10))
We could alternatively check the probabilities exactly.
true <- (1 - p)^(1:8 - 1) * p
emp <- table(samples)[1:8] / n
round(rbind(true, emp), 4)
1 2 3 4 5 6 7 8
true 0.3333 0.2222 0.1481 0.0988 0.0658 0.0439 0.0293 0.0195
emp 0.3334 0.2225 0.1476 0.0985 0.0659 0.0442 0.0292 0.0195
Looks good again.
3. Consider a Cauchy random variable \(X\) with probability density function \[ f(x) = \frac{1}{\pi(1 + x^2)} .\]
(a) Show that the cumulative distribution function of \(X\) is \[ F(x) = \frac12 + \frac{1}{\pi}\arctan x \]
Solution. The CDF is \[ \begin{multline} F(x) = \int_{-\infty}^x f(y)\,\mathrm{d}y = \frac{1}{\pi} \int_{-\infty}^x \frac{1}{1+y^2}\,\mathrm{d}y = \frac{1}{\pi} \big[\arctan y\big]_{-\infty}^x \\ = \frac{1}{\pi} \bigg( \arctan x - \Big(-\frac{\pi}{2}\Big)\bigg) = \frac{1}{\pi}\arctan x + \frac12, \end{multline}\] since \(\lim_{y\to-\infty} \arctan y = -\frac{\pi}{2}\).
(b) Write down a function that will transform a standard uniform random variable \(U\) into a Cauchy distribution.
Solution. We use the inverse transform method: write \(U = F(X)\) and invert. Here, we have \[ U = \frac12 + \frac{1}{\pi}\arctan X .\] Inverting gives \[X = \tan \bigg(\pi \Big(U - \frac12\Big) \bigg) .\]
This can be interpreted as \(X = \tan\Theta\) where \(\Theta\) is uniform between \(-\frac{\pi}{2}\) and \(\frac{\pi}{2}\).
(c) Using your answer to part (b), draw a histogram of samples from the Cauchy distribution in R.
Solution. My function for generating samples is the following:
rcauchy2 <- function(n) {
  unif <- runif(n)
  tan(pi * (unif - 1/2))
}
The Cauchy distribution is a very heavy-tailed distribution. So to make the histogram look sensible, I’m going to throw away the occasional very large or very small sample. If I didn’t do this, the histogram would probably just look like a single spike at 0. (You of course shouldn’t do this when performing Monte Carlo estimation – those rare very large or very small samples can be extremely important in calculations!)
n <- 1e6
samples <- rcauchy2(n)

samples_rest <- samples[abs(samples) < 12]
hist(samples_rest, probability = TRUE, ylim = c(0, 0.33), breaks = 48)
curve(1 / (pi * (1 + x^2)), add = TRUE, n = 1001, lwd = 2, col = "blue")
4. Let \(F\) be a cumulative distribution function and \(F^{-1}\) its inverse.
(a) Prove that \(F^{-1}\) is a non-decreasing function.
Solution. We recall the definition \(F^{-1}(u) = \min \{x : F(x) \geq u\}\). As \(u\) increases, the set \(A_u = \{x : F(x) \geq u \}\) gets smaller – more specifically, for \(u \leq v\), we have \(A_v \subseteq A_u\). Hence the minimum of the set \(A_u\) cannot be larger than the minimum of the set \(A_v\). Hence \(F^{-1}(u) = \min A_u \leq \min A_v = F^{-1}(v)\), as required.
(b) Show that \(X = F^{-1}(U)\) and \(X' = F^{-1}(1-U)\) have negative (or, rather, non-positive) correlation. You may use any results from the module, provided you state them clearly.
Solution. The relevant result here is Theorem 7.2. This says that if \(\phi\) is a non-decreasing function, then \(\phi(U)\) and \(\phi(1-U)\) have covariance – and therefore correlation – less than or equal to 0. Here, we use \(F^{-1}\) as the function \(\phi\).
5. Let \(X \sim \operatorname{Beta}(3, 2)\) be a Beta distribution with PDF \[ f(x) = 12 x^2(1-x) \qquad 0 \leq x \leq 1 . \] [Note: An earlier version of this question wrongly had the constant at the front as \(\frac{1}{12}\) instead of \(12\).]
Show how you could sample from \(X\) using envelope rejection sampling and an optimised value of the constant \(c\).
Solution. The obvious suggestion here is to take \(Y\) uniformly distributed on \([0, 1]\), so \(g(x) = 1\) (although you don’t have to choose that).
To find the optimal value of \(c\), we consider the maximum of \(f(x)\). We can find that by differentiating \[ f'(x) = \frac{\mathrm{d}}{\mathrm{d}x} (12x^2 - 12x^3) = (24x - 36x^2) , \] so the maximum is at \(x = \frac23\), where \(f(x) = \frac{16}{9}\). Therefore, we take \(c = \frac{16}{9}\).
Thus, our algorithm is to sample from a standard uniform, and then to accept with probability \[ \alpha(x) = \frac{12x^2(1-x)}{\frac{16}{9}} = \frac{27}{4}\,x^2(1-x) . \]
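A minimal sketch of this sampler in R (my own code, not part of the original solution): propose from \(\operatorname{U}[0,1]\) and accept with probability \(\alpha(x) = \frac{27}{4}x^2(1-x)\), repeating until acceptance.
rbeta32 <- function(n) {
  samples <- rep(NA, n)
  for (i in 1:n) {
    repeat {
      x <- runif(1)                        # proposal from g = U[0, 1]
      alpha <- (27 / 4) * x^2 * (1 - x)    # acceptance probability f(x) / (c g(x))
      if (runif(1) <= alpha) break
    }
    samples[i] <- x
  }
  samples
}
A histogram of rbeta32(1e5) can then be compared against the density \(12x^2(1-x)\) as a check.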
6. Consider sampling from the half-normal distribution \[ f(x) = \sqrt{\frac{2}{\pi}} \exp\big(-\tfrac12 x^2\big) \qquad x \geq 0 \] using envelope rejection sampling with an \(\operatorname{Exp}(\lambda)\) proposal distribution \[g(x) = \lambda \mathrm{e}^{-\lambda x} \qquad x \geq 0 .\] You wish to design your envelope rejection sampling algorithm so that the acceptance probability is as high as possible.
(a) For fixed \(\lambda\), show that the optimal value of \(c\) is \[c = \sqrt{\frac{2}{\pi}}\,\frac{\exp(\frac12\lambda^2)}{\lambda}.\]
Solution. This is very similar to Example 15.2 in Lecture 15. We have \[ \frac{f(x)}{g(x)} = \frac{\sqrt\frac{2}{\pi} \exp(-\tfrac12 x^2)}{\lambda \exp(-\lambda x)} = \sqrt\frac{2}{\pi} \,\frac{\exp \big(-\tfrac12 x^2 + \lambda x\big)}{\lambda} . \] We want to pick \(c\) to be the maximum value of this. The maximum occurs where \(-\frac12x^2 + \lambda x\) is maximised. By differentiating this and setting equal to 0, we get \(-x + \lambda = 0\), so \(x = \lambda\) and \[ c = \sqrt\frac{2}{\pi} \,\frac{\exp \big(-\tfrac12 \lambda^2 + \lambda^2\big)}{\lambda} = \sqrt\frac{2}{\pi} \,\frac{\exp \big(\tfrac12 \lambda^2\big)}{\lambda}, \] as required.
(b) Show that the optimal value of \(\lambda\) is \(\lambda = 1\).
Solution. Again, our goal is to choose \(\lambda\) to get \(c\) as small as possible, since \(1/c\) is the acceptance probability. Differentiating the expression from part (a) with respect to \(\lambda\) gives \[ \frac{\mathrm d}{\mathrm d\lambda} \,c = \sqrt\frac{2}{\pi} \frac{\exp \big(\tfrac12 \lambda^2\big)(\lambda^2 - 1)}{\lambda^2}.\] This is 0 when \(\lambda^2 = 1\), and since \(\lambda \geq 0\), the only solution is \(\lambda = 1\). (This can be easily checked to be a minimum by looking at a sketch of \(c\) against \(\lambda\), or differentiating twice, or just by thinking about the behaviour of \(c\) as a function of \(\lambda\).)
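With \(\lambda = 1\), a little algebra gives the acceptance probability \(f(x)/(cg(x)) = \exp(-\tfrac12(x-1)^2)\). A sketch of the resulting sampler (my own code, not part of the original solution):
rhalfnorm <- function(n) {
  samples <- rep(NA, n)
  for (i in 1:n) {
    repeat {
      x <- rexp(1, 1)                              # proposal from Exp(1)
      if (runif(1) <= exp(-(x - 1)^2 / 2)) break   # accept with prob f(x) / (c g(x))
    }
    samples[i] <- x
  }
  samples
}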
7. [2016 exam, Question 1] In this question, we consider generating samples from the distribution with probability density function \[ f_a(x) = \frac{1}{Z} \,\frac{1}{a(\cos x + 1) + x^2} \qquad x \geq 1,\] where \(a\) is a parameter and \[Z = \int_{1}^{\infty} \frac{1}{a(\cos x + 1) + x^2}\,\mathrm{d}x \] is a normalising constant.
(a) Introducing any notation you use, state the inverse transform method for random number generation.
Solution. Fix a cumulative distribution function (CDF) \(F(x) = \mathbb P(X \leq x)\). The inverse CDF is defined to be \(F^{-1}(u) = \min \{x : F(x) \geq u\}\). Then, for \(U \sim \operatorname{U}[0,1]\) a standard uniform random variable, \(X = F^{-1}(U)\) has CDF \(F\). We typically use this by writing \(U = F(X)\) and inverting to make \(X\) the subject.
(b) For \(a = 0\), explain how the inverse transform method can be used to generate samples with PDF \(f_0\).
Solution. For \(a = 0\), the PDF is \[ f_0(x) = \frac{1}{Z} \,\frac{1}{x^2} \qquad x \geq 1 .\]
We should start by finding the normalising constant \(Z\): it’s \[ Z = \int_1^\infty \frac{1}{x^2}\,\mathrm{d}x = \big[-x^{-1}\big]_1^\infty = 0 -(-1) = 1 , \] so the PDF is simply \(f_0(x) = 1/x^2\).
Next, we want to find the CDF \(F\): it’s \[ F(x) = \int_1^x f_0(y)\,\mathrm{d}y = \int_1^x \frac{1}{y^2}\,\mathrm{d}y = \big[-y^{-1}\big]_1^x = 1 - \frac{1}{x} . \]
Finally, to perform the inverse transform method, we write \(U = F(X)\) and invert. We have \(U = 1 - 1/x\), and so \(X = 1/(1-U)\).
(c) Introducing any notation you use, state the envelope rejection sampling method for random number generation.
(d) For \(a > 0\), explain how the envelope rejection sampling method can be used to generate samples with PDF \(f_a\).
(e) How does the efficiency of your method in part (d) change as the value of \(a\) increases? Justify your answer.