26  Bootstrap II

26.1 Bootstrap with a prediction interval

Recall where we had got to last time. We are interested in a statistic $T = T(X_1, \dots, X_n)$, where $X_1, \dots, X_n$ are IID copies of a random variable $X$. We want to find out about the distribution of $T$. But all we have to work with is $X_1, \dots, X_m$, which are $m$ IID samples from $X$.

So the bootstrap procedure is to repeatedly choose $n$ samples $X^*_1, \dots, X^*_n$ from the empirical distribution $\hat{X}$ – or, equivalently, choose $n$ of the values $X_1, \dots, X_m$ with replacement – and calculate the statistic $T^*_k = T(X^*_1, \dots, X^*_n)$. The distribution of the $T^*_k$ should be similar to the distribution of $T$.

In particular, to estimate $\mathbb{E}T$ we can use the sample mean of the $T^*_1, \dots, T^*_B$, and to estimate $\operatorname{Var}(T)$, we can use the sample variance $S^2$ of the $T^*_k$.

What if we want to estimate a prediction interval – that is, an interval $[U, V]$ such that $\mathbb{P}(U \leq T \leq V) \approx 1 - \alpha$?

There is a lazy way to do this, which is to hope that the $T^*_k$ are approximately normally distributed. With $\bar{T}$ as the sample mean and $S^2$ as the sample variance of our observed statistics, we could take $[\bar{T} - z_{\alpha/2} S,\ \bar{T} + z_{\alpha/2} S]$. But we can do better than this by taking the actual distribution of the $T^*_k$, which might not be normal, into account.

Instead, we can take the sample quantiles of the $T^*_k$. That is, put $T^*_1, \dots, T^*_B$ in increasing order. Go $\alpha/2$ of the way up the list to get the lower $\alpha/2$ sample quantile, and $1 - \alpha/2$ of the way up the list to get the upper $\alpha/2$ sample quantile. These two values can be our prediction interval.
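As a concrete sketch (with hypothetical names: Tstar for the vector of $B$ bootstrapped statistics, and alpha for the error level), this can be done by hand with order statistics, or with R's built-in quantile() function:

B     <- length(Tstar)                        # Tstar: bootstrapped statistics (hypothetical name)
Tsort <- sort(Tstar)                          # put the T*_k in increasing order
lower <- Tsort[ceiling(B * alpha / 2)]        # alpha/2 of the way up the list
upper <- Tsort[ceiling(B * (1 - alpha / 2))]  # 1 - alpha/2 of the way up the list
c(lower, upper)

quantile(Tstar, c(alpha / 2, 1 - alpha / 2))  # equivalent, up to interpolation details

(quantile() interpolates between order statistics by default, so it can differ very slightly from the by-hand version.)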

Example 26.1 Let’s go back to The Edit Room cafe queues from last time. The statistic in question was the maximum queue length from 5 visits. The data and our bootstrap samples were these:

queues <- c(rep(0, 11), rep(1, 5), rep(2, 7), rep(3, 3), 4, 5, 7, 11)

boots <- 1000
maxes <- rep(0, boots)
for (k in 1:boots) {
  minisample <- sample(queues, 5, replace = TRUE)
  maxes[k] <- max(minisample)
}
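As an aside, the same resampling loop can be written in one line using R's replicate, if you prefer:

maxes <- replicate(boots, max(sample(queues, 5, replace = TRUE)))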

We saw that our estimates for the expectation and the variance of the statistic were the following:

max_mean <- mean(maxes)
max_var <- var(maxes)
c(max_mean, max_var)
[1]  4.91700 10.50462

A lazy 80% prediction interval under a normal assumption would be the following:

max_mean + qnorm(c(0.1, 0.9)) * sqrt(max_var)
[1] 0.7633857 9.0706143

But that seems a bit silly – our queue length isn’t going to be a real number with seven decimal places. Better is to use the actual quantiles of the statistics we evaluated.

quantile(maxes, c(0.1, 0.9))
10% 90% 
  2  11 

I usually get the interval $[2, 11]$ from this data. This much better reflects the actual data. It also takes into account the “skew” of the data (lots of values of 0, where there was no queue, but no negative values, of course), to give a tighter lower boundary than the crude and inaccurate normal approximation.

26.2 Bootstrap for statistical inference

So we’ve seen two ideas here about what to do when we only have samples from a distribution (for example, some data measurements).

  1. The first idea was that, to estimate something about the distribution, we can use the plug-in estimator.

  2. The second idea was that, to find out about properties of a statistic, we can use bootstrap sampling.

But an estimator $\hat{\theta}$ is just a special type of statistic – one that is hoped to be close to some true parameter $\theta$. So we can combine these two ideas. Suppose we have $m$ samples $X_1, X_2, \dots, X_m$.

  1. To estimate a parameter θ, use the plug-in estimator.

  2. To find out things about that parameter – for example, its bias, variance, mean-square error, or a confidence interval – use bootstrap sampling. (A generic sketch combining both steps follows this list.)
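As a rough sketch of how the two steps fit together in R – here est_fun is a hypothetical plug-in estimator (such as mean or median) and x is the vector of $m$ samples – one could write:

bootstrap_inference <- function(x, est_fun, boots = 1e4) {
  estimate <- est_fun(x)    # step 1: plug-in estimate of theta
  bootests <- replicate(
    boots,
    est_fun(sample(x, length(x), replace = TRUE))    # resample, then re-estimate
  )
  list(
    estimate = estimate,
    bias     = mean(bootests) - estimate,            # step 2: bootstrap bias estimate
    variance = var(bootests),                        # bootstrap variance estimate
    interval = quantile(bootests, c(0.025, 0.975))   # 95% bootstrap interval
  )
}

The sleep example in the next section carries out exactly these steps, with mean as the estimator.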

26.3 Bootstrap estimation of bias

Example 26.2 Let’s talk through an example. Suppose we want to estimate the average number of hours of sleep people have per night. That is, we want to know $\theta = \mathbb{E}X$, where $X$ is the (random) sleep time of a person chosen from the entire population.

We have data on m=1991 people thanks to a survey by the US Centers for Disease Control and Prevention. (Credit: Scott, Data Science in R.) We can read this into R as follows.

sleep <- read.csv("https://bookdown.org/jgscott/DSGI/data/NHANES_sleep.csv")$SleepHrsNight
m <- length(sleep)

Our estimate for the average sleep time $\theta = \mathbb{E}X$ will be the plug-in estimator $\hat{\theta} = \widehat{\mathbb{E}X}$, which, as we have discussed before, is the sample mean $\hat{\theta} = \bar{X}$.

estimate <- mean(sleep)
estimate
[1] 6.878955

This is a bit under 7 hours.

But we should check whether our estimator is likely to be biased or not. How can we estimate the bias (in order to check it’s close to 0)? Well, the bias is, of course, $\operatorname{bias}(\hat{\theta}) = \mathbb{E}(\hat{\theta}) - \theta$. How will we estimate each of those terms? Well, we’ve already estimated the second term $\theta$ by $\hat{\theta} = 6.8790$. And as for the first term, we have just discussed how to estimate that using bootstrapping.

Here our estimator/statistic uses all $m$ pieces of data, so, in our previous notation, $n$ and $m$ are the same: we will take $n = 1991$ samples from our $m = 1991$ pieces of data. Note that we are sampling with replacement, so we’re not just getting the same dataset every time.

set.seed(3)
boots <- 1e5
bootests <- rep(0, boots)
for (k in 1:boots) {
  resample <- sample(sleep, m, replace = TRUE)
  bootests[k] <- mean(resample)
}
est_exp <- mean(bootests)
est_exp
[1] 6.878924

This typically comes out very close to the original estimate. With the seed set as 3 for reproducibility, we get 6.878924. So our estimate of the bias is $\widehat{\operatorname{bias}}(\hat{\theta}) = \widehat{\mathbb{E}(\hat{\theta})} - \hat{\theta} = 6.878924 - 6.878955 = -0.000031$.

Note here that the bias $\operatorname{bias}(\hat{\theta}) = \mathbb{E}(\hat{\theta}) - \theta$ is estimated by $\widehat{\operatorname{bias}}(\hat{\theta}) = \widehat{\mathbb{E}(\hat{\theta})} - \hat{\theta}$. In the first, true, expression, $\hat{\theta}$ is playing the role of the estimator in the first term on the right-hand side; but in the second, estimated, expression, $\hat{\theta}$ is playing the role of (the estimate of) the true value in the second term on the right-hand side.

On this occasion our estimator turned out to be unbiased – or at least extremely close to unbiased.

But had we found some bias, it would often be preferable to improve our estimator by removing that bias. Our refined, or “debiased”, estimator would be $\hat{\theta} - \widehat{\operatorname{bias}}(\hat{\theta}) = \hat{\theta} - \bigl(\widehat{\mathbb{E}(\hat{\theta})} - \hat{\theta}\bigr) = 2\hat{\theta} - \widehat{\mathbb{E}(\hat{\theta})}$.
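In R, using the estimate and est_exp values computed above, this takes just a couple of lines:

bias_est <- est_exp - estimate      # estimated bias: bootstrap mean minus plug-in estimate
debiased <- 2 * estimate - est_exp  # debiased estimator: 2 theta-hat minus estimated E(theta-hat)

(Here the correction would be negligible, of course, since the estimated bias was so small.)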

26.4 Bootstrap estimation of variance and intervals

What about the error in our estimator? We measure that through the variance, because the variance equals the mean-square error for an unbiased estimator. (For a biased estimator, the MSE is the bias squared plus the variance – remember the theorem on MSE and bias from a long time ago.) Again, we can use the bootstrap.

Example 26.3 We continue the sleep example.

var(bootests)
[1] 0.0008780489

This is about 0.0009, which is a pretty good estimate for the variance.
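(Since we also have a bias estimate from Example 26.2, we could combine the two into a bootstrap estimate of the mean-square error – a quick sketch using the objects already defined:)

mse_est <- (est_exp - estimate)^2 + var(bootests)  # estimated bias^2 + variance

Here the squared bias is negligible, so this is essentially the variance again.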

However, this sample variance var(bootests) is calculated as $S^2 = \frac{1}{B-1} \sum_{k=1}^{B} (T^*_k - \bar{T})^2$. But it’s a bit strange to be looking at the squared difference from $\bar{T}$, a Monte Carlo-like estimate of $\theta$, when we have the plug-in estimate $\hat{\theta}$ available, which we would think is better. So it is preferable to use $\frac{1}{B-1} \sum_{k=1}^{B} (T^*_k - \hat{\theta})^2$. Here, this is the following:

(1 / (boots - 1)) * sum((bootests - estimate)^2)
[1] 0.0008780498

In this case, with plenty of samples and well-behaved data, the two values agree to eight decimal places. But on other occasions, with fewer samples and weirder data, the second version could be an improvement.