\[\newcommand{\Exg}{\operatorname{\mathbb{E}}} \newcommand{\Ex}{\mathbb{E}} \newcommand{\Ind}{\mathbb{I}} \newcommand{\Var}{\operatorname{Var}} \newcommand{\Cov}{\operatorname{Cov}} \newcommand{\Corr}{\operatorname{Corr}} \newcommand{\ee}{\mathrm{e}}\]
Today, we’ll start the first main topic of the module, which is called “Monte Carlo estimation”. But first, a bit about the subject as a whole.
“Statistical computing” – or “computational statistics” – refers to the branch of statistics that involves not attacking statistical problems merely with a pencil and paper, but rather by combining human ingenuity with the immense calculating powers of computers.
One of the big ideas here is simulation. Simulation is the idea that we can understand the properties of a random model not by cleverly working out the properties using theory – this is usually impossible for anything but the simplest “toy models” – but rather by running the model many times on a computer. From these many simulations, we can observe and measure things like the typical (or “expected”) behaviour, the spread (or “variance”) of the behaviour, and other things. This concept of simulation is at the heart of the module MATH5835M Statistical Computing.
In particular, we will look at Monte Carlo estimation. Monte Carlo is about estimating a parameter, expectation or probability related to a random variable by taking many samples of that random variable, then computing a relevant sample mean from those samples. We will study Monte Carlo in its standard “basic” form (Lectures 1–9), but also in the modern Markov chain Monte Carlo form (Lectures 17–23), which has become such a crucial part of Bayesian statistical analysis.
To run a simulation, one needs random numbers with the correct distributions. Random number generation (Lectures 10–16) will be an important part of this module. We will look first at how to generate randomness of any sort, and then how to get that randomness into the shape of the distributions we want.
When dealing with a very big data set, traditionally we want to “reduce the dimension” by representing it with a simple parametric model. For example, tens of thousands of datapoints might get reduced just to estimates of the parameters \(\mu\) and \(\sigma^2\) of a normal distribution. But with computational statistics, we don’t need to make such a simplification – we can do inference using the full details of the whole dataset itself, without making extra assumptions. A computational scheme that takes advantage of this idea is the bootstrap (Lectures 24–27).
MATH5835M Statistical Computing is a mathematics module that will concentrate on the mathematical ideas that underpin statistical computing. It is not a programming module that will go deeply into the practical issues of the most efficient possible coding of the algorithms we study. But we will want to investigate the behaviour of the methods we learn about and to explore their properties, so we will be doing some computer programming to help us do that. (We will be using the statistical programming language R, although one could just as easily have used Python or other similar languages.) As my PhD supervisor once told me: “You don’t really understand a mathematical algorithm until you’ve coded it up yourself.”
Let \(X\) be a random variable. We recall the expectation \(\Ex X\) of \(X\): if \(X\) is discrete with probability mass function (PMF) \(p\), then this is \[ \Ex X = \sum_x x\,p(x) ;\] while if \(X\) is continuous with probability density function (PDF) \(f\), then this is \[ \Ex X = \int_{-\infty}^{+\infty} x\,f(x)\,\mathrm{d}x . \] More generally, the expectation of a function \(\phi\) of \(X\) is \[ \Exg \phi(X) = \begin{cases} {\displaystyle \sum_x \phi(x)\,p(x)} & \text{for $X$ discrete}\\ {\displaystyle \int_{-\infty}^{+\infty} \phi(x)\,f(x)\,\mathrm{d}x} & \text{for $X$ continuous.} \end{cases}\] (This matches with the “plain” expectation when \(\phi(x) = x\).)
But how do we actually calculate an expectation like one of these? If \(X\) is discrete and can only take a small, finite number of values, we can simply add up the sum \(\sum_x \phi(x)\,p(x)\). Otherwise, we just have to hope that \(\phi\) and \(p\) or \(f\) are sufficiently “nice” that we can manage to work out the sum/integral using a pencil and paper. But while this is often the case in the sort of “toy example” one comes across in maths or statistics lectures, this is very rare in “real life”.
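For instance, in the small, finite discrete case the calculation really is just a short sum. Here is a minimal sketch in R, using a made-up example of my own (a fair six-sided die \(X\) with \(\phi(x) = x^2\)):

x <- 1:6                 # the possible values of X
p <- rep(1 / 6, 6)       # the PMF of a fair die: each value has probability 1/6
phi <- function(x) x^2   # the function phi
sum(phi(x) * p)          # E[phi(X)] = sum of phi(x) * p(x); here 91/6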
Monte Carlo estimation is the idea that we can get an approximate answer for \(\Ex X\) or \(\Exg \phi(X)\) if we have access to lots of samples from \(X\). For example, if we have access to \(X_1, X_2, \dots, X_n\), independent and identically distributed (IID) samples with the same distribution as \(X\), then we already know that the mean \[ \overline X = \frac{1}{n}(X_1 + X_2 + \cdots + X_n) = \frac{1}{n} \sum_{i=1}^n X_i \] is usually close to the expectation \(\Ex X\), at least if \(n\) is big. Similarly, the quantity \[ \frac{1}{n} \big(\phi(X_1) + \phi(X_2) + \cdots + \phi(X_n) \big) = \frac{1}{n} \sum_{i=1}^n \phi(X_i) \] should be close to \(\Exg \phi(X)\).
In this module we will write that \(X_1, X_2, \dots, X_n\) is a “random sample from \(X\)” to mean that \(X_1, X_2, \dots, X_n\) are IID with the same distribution as \(X\).
Definition 1.1 Let \(X\) be a random variable, \(\phi\) a function, and write \(\theta = \Exg\phi(X)\). Then the Monte Carlo estimator \(\widehat\theta_n^{\mathrm{MC}}\) of \(\theta\) is \[ \widehat{\theta}_n^{\mathrm{MC}} = \frac{1}{n} \sum_{i=1}^n \phi(X_i) , \] where \(X_1, X_2, \dots, X_n\) are a random sample from \(X\).
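In R, Definition 1.1 translates almost directly into code. The following is only an illustrative sketch (the function name mc_estimate and its arguments are my own choices, not anything built into R): rsampler should generate a random sample of size n from \(X\), and phi is the function \(\phi\).

mc_estimate <- function(n, rsampler, phi = identity) {
  samples <- rsampler(n)   # X_1, X_2, ..., X_n: a random sample from X
  mean(phi(samples))       # (1/n) * sum of phi(X_i)
}

mc_estimate(1e5, runif, phi = function(x) x^2)   # estimates E[U^2] = 1/3 for U ~ Uniform(0, 1)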
While general ideas for estimating using simulation go back a long time, the modern theory of Monte Carlo estimation was developed by the physicists Stanislaw Ulam and John von Neumann. Ulam (who was Polish) and von Neumann (who was Hungarian) moved to the US in the early 1940s to work on the Manhattan project to build the atomic bomb (as made famous by the film Oppenheimer). Later in the 1940s, they worked together in the Los Alamos National Laboratory continuing their research on nuclear weapons, where they used simulations on early computers to help them numerically solve difficult mathematical and physical problems.
The name “Monte Carlo” was chosen because the use of randomness to solve such problems reminded them of gamblers in the casinos of Monte Carlo, Monaco. Ulam and von Neumann also worked closely with another colleague Nicholas Metropolis, whose work we will study later in this module.
Let’s see some simple examples of Monte Carlo estimation using R.
Example 1.1 Let’s suppose we’ve forgotten the expectation of the exponential distribution \(X \sim \operatorname{Exp}(2)\) with rate 2. In this simple case, we could work out the answer using the PDF \(f(x) = 2\mathrm{e}^{-2x}\) as
\[ \mathbb E X = \int_0^\infty x\,2\mathrm{e}^{-2x}\,\mathrm{d}x \](and, without too much difficulty, get the answer \(\frac12\)). But instead, let’s do this the Monte Carlo way.
In R, we can use the rexp() function to get IID samples from the exponential distribution: the full syntax is rexp(n, rate), which gives n samples from an exponential distribution with rate rate. So our code here should be
n <- 100
samples <- rexp(n, 2)
MCest <- (1 / n) * sum(samples)
MCest

[1] 0.5378136
So our Monte Carlo estimate is 0.53781, to 5 decimal places.
To get a (hopefully) more accurate estimate, we can use more samples. We could also simplify the third line of this code by using the mean() function.
n <- 1e6
samples <- rexp(n, 2)
MCest <- mean(samples)
MCest

[1] 0.4994785
(In the first line, 1e6 is R code for the scientific notation \(1 \times 10^6\), or a million. I just picked this as “a big number, but where my code still only took a few seconds to run.”)
Our new Monte Carlo estimate is 0.49948, which is much closer to the true value of \(\frac12\).
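To see this improvement more directly, we could re-run the estimator for a range of sample sizes. This is just a quick sketch, and since the samples are random the exact numbers will differ on every run, but the estimates should typically get closer to \(\frac12\) as \(n\) grows.

for (n in c(100, 10000, 1000000)) {
  samples <- rexp(n, 2)                  # n IID samples from Exp(2)
  print(c(n = n, MCest = mean(samples))) # the Monte Carlo estimate for this n
}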
By the way: all R code “chunks” displayed in the notes should work perfectly if you copy-and-paste them into RStudio. (Indeed, these lecture notes should not appear unless the code runs correctly without errors.) I strongly encourage playing about with the code as a good way to learn this material and explore further!
Example 1.2 Let’s try another example. Let \(X \sim \operatorname{N}(1, 2^2)\) be a normal distribution with mean 1 and standard deviation 2. Suppose we want to find out \(\mathbb E(\sin X)\) (for some reason). While it might be possible to somehow calculate the integral \[ \mathbb E(\sin X) = \int_{-\infty}^{+\infty} (\sin x) \, \frac{1}{\sqrt{2\pi\times 2^2}} \exp\left(-\frac{(x - 1)^2}{2\times 2^2}\right) \, \mathrm{d} x , \] that looks extremely difficult to me.
Instead, a Monte Carlo estimation of \(\Exg(\sin X)\) is very straightforward. (Although we must remember that when using the rnorm() function to generate normally distributed variates, the third argument is the standard deviation, here \(2\), not the variance, here \(2^2 = 4\).)
n <- 1e6
samples <- rnorm(n, 1, 2)
MCest <- mean(sin(samples))
MCest

[1] 0.1129711
Our Monte Carlo estimate is 0.11297.
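As a sanity check (an aside, not part of the Monte Carlo method itself), R’s built-in integrate() function can evaluate this integral numerically; its answer should be close to the Monte Carlo estimate above.

# numerically integrate sin(x) times the N(1, 2^2) density
integrate(function(x) sin(x) * dnorm(x, mean = 1, sd = 2), lower = -Inf, upper = Inf)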
Next time: We look at more examples of things we can estimate using the Monte Carlo method.
Summary:
Statistical computing is about solving statistical problems by combining human ingenuity with computing power.
The Monte Carlo estimate of \(\Exg \phi(X)\) is \[ \widehat{\theta}_n^{\mathrm{MC}} = \frac{1}{n} \sum_{i=1}^n \phi(X_i) , \] where \(X_1, \dots, X_n\) are IID random samples from \(X\).
Monte Carlo estimation typically gets more accurate as the number of samples \(n\) gets bigger.
Read more: Voss, An Introduction to Statistical Computing, Section 3.1 and Subsection 3.2.1.