26 Bootstrap II
26.1 Bootstrap with a prediction interval
Recall where we had got to last time. We are interested in a statistic \(T = T(X_1, \dots, X_n)\), where \(X_1, \dots, X_n\) are IID copies of a random variable \(X\). We want to find out about the distribution of \(T\). But all we have to work with is \(X_1, \dots, X_m\), which are \(m\) IID samples from \(X\).
So the bootstrap procedure is to repeatedly choose \(n\) samples \(X^*_1, \dots, X^*_n\) from the empirical distribution \(X^*\) – or, equivalently, choose \(n\) of the values \(X_1, \dots, X_m\) with replacement, and calculate the statistic \(T^*_k = T(X^*_1, \dots, X^*_n)\). The distribution of the \(T^*_k\) should be similar to the distribution of \(T\).
In particular, to estimate \(\mathbb ET\) we can use the sample mean of the \(T^*_1, \dots, T^*_B\), and to estimate \(\operatorname{Var}(T)\), we can use the sample variance \(S^2\) of the \(T^*_k\).
What if we want to estimate a prediction interval – that is, an interval \([U, V]\) such that \(\mathbb P(U \leq T \leq V) \approx 1- \alpha\)?
There is a lazy way to do this, which is to hope that the \(T^*_k\) are approximately normally distributed. With \(\overline{T^*}\) as the sample mean and \(S^2\) as the sample variance of our observed statistics, we could take \[ \left[\overline{T^*} - z_{\alpha/2}\,S,\, \overline{T^*} + z_{\alpha/2}\,S\right] .\] But we can do better than this by taking the actual distribution of the \(T^*_k\), which might not be normal, into account.
Instead, we can take the sample quantiles of the \(T^*_k\). That is, put \(T^*_1, \dots, T^*_B\) in increasing order. Go \(\alpha/2\) of the way up the list to get the lower-\(\alpha/2\) sample quantile, and \(1 - \alpha/2\) of the way up the list to get the upper-\(\alpha/2\) sample quantile. These two values can be our prediction interval.
Example 26.1 Let’s go back to The Edit Room cafe queues from last time. The statistic in question was the maximum queue length from 5 visits. The data and our bootstrap samples were these:
queues <- c(rep(0, 11), rep(1, 5), rep(2, 7), rep(3, 3), 4, 5, 7, 11)
boots <- 1000
maxes <- rep(0, boots)
for (k in 1:boots) {
minisample <- sample(queues, 5, replace = TRUE)
maxes[k] <- max(minisample)
}
We saw that our estimates for the expectation and the variance of the statistic were the following:
max_mean <- mean(maxes)
max_var <- var(maxes)
c(max_mean, max_var)
[1]  4.91700 10.50462
A lazy 80% prediction interval under a normal assumption would be the following:
max_mean + qnorm(c(0.1, 0.9)) * sqrt(max_var)
[1] 0.7633857 9.0706143
But that seems a bit silly – our queue length isn’t going to be a real number with seven decimal places. Better is to use the actual quantiles of the statistics we evaluated.
quantile(maxes, c(0.1, 0.9))
10% 90%
  2  11
I usually get the interval \([2, 11]\) from this data. This much better reflects the actual data. It also takes into account the “skew” of the data (lots of values of 0, where there was no queue, but no negative values, of course), giving a tighter lower bound than the crude and inaccurate normal approximation.
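As a sanity check on the “go \(\alpha/2\) of the way up the sorted list” description above, here is a hand-rolled version of the same calculation (a sketch, not part of the original notes); note that quantile()’s default rule interpolates between order statistics, so the two can occasionally differ slightly.
sorted_maxes <- sort(maxes)            # bootstrap statistics in increasing order
c(sorted_maxes[ceiling(0.1 * boots)],  # value 10% of the way up the list
  sorted_maxes[ceiling(0.9 * boots)])  # value 90% of the way up the list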
26.2 Bootstrap for statistical inference
So we’ve seen two ideas here about what to do when we only have samples from a distribution (for example, some data measurements).
The first idea was that, to estimate something about the distribution, we use the plug-in estimator.
The second idea was that, to find out about properties of a statistic, we use bootstrap sampling.
But an estimator \(\widehat\theta\) is just a special type of statistic – one that is hoped to be close to some true parameter \(\theta\). So we can combine these two ideas. First, to estimate the true parameter \(\theta\), we use the plug-in estimator \(\theta^*\). Then we can find out things about that estimator \(\theta^*\) – for example, its bias, variance, mean-square error, or a confidence interval – using bootstrap resampling.
One way to think about this is the “two worlds” of bootstrap estimation:
The real world: We consider the data \(X_1, X_2, \dots, X_m\) as a random sample from the population; and due to that randomness, there is variability of the plug-in estimator \(\theta^*\) around the true value \(\theta\).
The bootstrap world: We treat the samples \(X_1, X_2, \dots, X_m\) as fixed; but there is variability of the bootstrap estimates \(T^*_k\) around the plug-in estimate \(\theta^*\), due to the sampling-with-replacement within the bootstrap procedure.
The key idea of bootstrapping is this: that the variability of \(\theta^*\) around \(\theta\) in the real world can be approximated by the variability of the \(T^*_k\)s around \(\theta^*\) in the bootstrap world.
26.3 Bootstrap estimation of bias
Suppose we have a parameter \(\theta\) we wish to estimate using a dataset \(X_1, X_2, \dots, X_m\). Our best “point estimate” for \(\theta\) is the plug-in estimator \(\theta^*\).
How might we estimate the bias of \(\theta^*\) as an estimate of \(\theta\)?
Well, the bias in “the real world” is \(\mathbb E \theta^* - \theta\), the expected value of the estimator \(\theta^*\) minus the true value \(\theta\). In the bootstrap world, the role of \(\theta\) is played by \(\theta^*\) and the role of \(\theta^*\) is played by the \(T_k^*\)s; so the bias \(\mathbb E \theta^* - \theta\) is estimated by \(\overline{T^*} - \theta^*\).
In other words, we repeatedly sample \(m\) points \(X_1^*, \dots, X_m^*\) with replacement from the dataset, and calculate the bootstrap statistic \(T^*_k = \theta^*(X^*_1, \dots, X^*_m)\). Then the estimate of the bias is \[\overline{T^*} - \theta^* = \frac{1}{B} \sum_{k=1}^B T^*_k - \theta^* . \]
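As a minimal sketch, this recipe can be wrapped in a small helper function – boot_bias() is a hypothetical name, not something from these notes – where statistic is any function of a sample, such as mean.
boot_bias <- function(data, statistic, B = 1e4) {
  m <- length(data)
  plugin <- statistic(data)                      # the plug-in estimate theta*
  boot_stats <- replicate(B, {
    resample <- sample(data, m, replace = TRUE)  # resample m points with replacement
    statistic(resample)                          # bootstrap statistic T*_k
  })
  mean(boot_stats) - plugin                      # estimated bias: mean of the T*_k minus theta*
}
For the sleep data in the next example, this could be called as boot_bias(sleep, mean).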
Example 26.2 Let’s talk through an example. Let’s suppose we want to estimate the average number of hours of sleep people have per night. That is, we want to know \(\theta = \mathbb EX\), where \(X\) is the sleep time of a person chosen at random from the entire population.
We have data on \(m = 1991\) people thanks to a survey by the US Centers for Disease Control and Prevention. (Credit: Scott, Data Science in R.) We can read this into R as follows.
sleep <- read.csv("https://bookdown.org/jgscott/DSGI/data/NHANES_sleep.csv")$SleepHrsNight
m <- length(sleep)
Our estimate for the mean sleep time \(\theta = \mathbb E X\) will be the plug-in estimator \(\theta^* = \mathbb E_* X^*\), which, as we have discussed before, is the sample mean \(\theta^* = \overline{X}\).
estimate <- mean(sleep)
estimate
[1] 6.878955
This is 6.8790 hours.
But we should check whether our estimator is likely to be biased or not. (Although, actually, in this case we do know the mean is an unbiased estimator of the expectation, so the answer is 0.)
We form our bootstrap statistics as follows:
set.seed(3)
boots <- 1e4
bootests <- rep(0, boots)
for (k in 1:boots) {
resample <- sample(sleep, m, replace = TRUE)
bootests[k] <- mean(resample)
}
est_exp <- mean(bootests)
est_exp
[1] 6.879461
This typically comes out very close to the original estimate. With the seed set to 3 for reproducibility, we get 6.8795. So our estimate of the bias is \[ \widehat{\operatorname{bias}}(\theta^*) = \widehat{\mathbb E(\theta^*)} - \theta^* = 6.8795 - 6.8790 = 0.0005 .\] This is very small (as we expected).
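With the R objects defined above, the same bias estimate can be computed directly, avoiding the rounding in the hand calculation (a one-line sketch; bias_est is just a name chosen here):
bias_est <- est_exp - estimate   # mean of the bootstrap estimates minus the plug-in estimate
bias_est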
Had we found some bias, it is often preferable to improve our estimator by subtracting that bias. The refined, or “debiased”, estimator would be \[ \theta^* - \widehat{\operatorname{bias}}(\theta) = \theta^* - \Big(\widehat{\mathbb E(\theta^*)} - \theta^*\Big) = 2\theta^* - \widehat{\mathbb E(\theta^*)} . \]
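Continuing the sketch above, the debiased estimator for the sleep data would be computed as follows (here it barely changes anything, since the estimated bias is tiny):
debiased <- estimate - bias_est   # theta* minus estimated bias, i.e. 2*theta* - mean of the T*_k
debiased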
26.4 Bootstrap estimation of MSE
What about the mean-square error of the plug-in estimator? Can we estimate the MSE? Well, the mean-square error in the real world is \[ \operatorname{MSE} = \operatorname{\mathbb E} \big(\theta^* - \theta\big)^2 . \] In the bootstrap world, we estimate this by \[ \overline{(T^* - \theta^*)^2} = \frac{1}{B} \sum_{k=1}^B \big(T^*_k - \theta^*\big)^2 , \] where \(\theta\) is replaced by \(\theta^*\) and \(\theta^*\) is replaced by the \(T^*_k\)s.
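A matching sketch, in the same style as the hypothetical boot_bias() above, averages the squared distances of the \(T^*_k\) from \(\theta^*\) (again, boot_mse() is an illustrative name only).
boot_mse <- function(data, statistic, B = 1e4) {
  m <- length(data)
  plugin <- statistic(data)                      # the plug-in estimate theta*
  boot_stats <- replicate(B, statistic(sample(data, m, replace = TRUE)))  # the T*_k
  mean((boot_stats - plugin)^2)                  # average of (T*_k - theta*)^2
}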
Example 26.3 We continue the sleep example.
MSEest <- (1 / boots) * sum((bootests - estimate)^2)
MSEest
[1] 0.0008791049
In the case of this simple parameter \(\theta = \mathbb{E}X\), we could also have estimated the MSE in the traditional Monte Carlo-style way: the sample mean is unbiased, so its MSE equals its variance \(\operatorname{Var}(X)/m\), which we estimate by the sample variance of the data divided by \(m\).
var(sleep) / m
[1] 0.0008717188
Reassuringly, these come out as extremely close to each other.
Next time, we will look at the more complicated question of estimating a confidence interval for the plug-in estimator.