26 Bootstrap II

26.1 Bootstrap with a prediction interval

Recall where we had got to last time. We are interested in a statistic T = t(X1, X2, ..., Xn) of a sample, but we only have the data, not the distribution it came from. So the bootstrap procedure is to repeatedly choose resamples of size n from the data, with replacement, and to evaluate the statistic on each resample. In particular, to estimate the expectation or the variance of the statistic, we use the sample mean or the sample variance of the bootstrap values of the statistic.

What if we want to estimate a prediction interval – that is, an interval that contains the value of the statistic with some given probability, say 80%? There is a lazy way to do this, which is to hope that the statistic is roughly normally distributed, and to build a normal interval from the bootstrap mean and variance. Instead, we can take the sample quantiles of the bootstrap values of the statistic.

Example 26.1 Let’s go back to The Edit Room cafe queues from last time. The statistic in question was the maximum queue length from 5 visits. The data and our bootstrap samples were these:

queues <- c(rep(0, 11), rep(1, 5), rep(2, 7), rep(3, 3), 4, 5, 7, 11)
boots <- 1000
maxes <- rep(0, boots)
for (k in 1:boots) {
  minisample <- sample(queues, 5, replace = TRUE)
  maxes[k] <- max(minisample)
}
We saw that our estimates for the expectation and the variance of the statistic were the following:
max_mean <- mean(maxes)
max_var <- var(maxes)
c(max_mean, max_var)
[1] 4.91700 10.50462
A lazy 80% prediction interval under a normal assumption would be the following:
max_mean + qnorm(c(0.1, 0.9)) * sqrt(max_var)
[1] 0.7633857 9.0706143
But that seems a bit silly – our queue length isn’t going to be a real number with seven decimal places. Better is to use the actual quantiles of the statistics we evaluated.
quantile(maxes, c(0.1, 0.9))
10% 90%
2 11
I usually get the interval [2, 11].
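As a quick check (this is my own sketch, not part of the original analysis), we can estimate the coverage of the interval [2, 11] by drawing fresh sets of 5 visits from the data and counting how often the maximum lands inside it:

```r
# Sketch: estimate the coverage of the prediction interval [2, 11]
# for the maximum queue length over 5 visits.
queues <- c(rep(0, 11), rep(1, 5), rep(2, 7), rep(3, 3), 4, 5, 7, 11)
set.seed(1)
trials <- 10000
hits <- 0
for (k in 1:trials) {
  m5 <- max(sample(queues, 5, replace = TRUE))
  if (m5 >= 2 && m5 <= 11) hits <- hits + 1
}
hits / trials  # proportion inside the interval; at least 0.8
```

With discrete data like queue lengths, the quantile interval is typically conservative: the coverage comes out above the nominal 80%.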
26.2 Bootstrap for statistical inference
So we’ve seen two ideas here about what to do when we only have samples from a distribution (for example, some data measurements).
The first idea was that, to estimate something about the distribution, we use the plug-in estimator.
The second idea was that, to find out about properties of a statistic, we use bootstrap sampling.
But an estimator is itself a statistic of the data, so we can combine the two ideas:

To estimate a parameter, use the plug-in estimator.
To find out things about that estimator – for example, its bias, variance, mean-square error, or a confidence interval – use bootstrap sampling.
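This recipe can be sketched as a small helper function (the function name and arguments here are mine, for illustration only):

```r
# Sketch of the general recipe: evaluate an estimator on many
# bootstrap resamples of the data, each of the same size as the data.
bootstrap_estimates <- function(data, estimator, boots = 1000) {
  n <- length(data)
  ests <- rep(0, boots)
  for (k in 1:boots) {
    ests[k] <- estimator(sample(data, n, replace = TRUE))
  }
  ests
}

# Example: plug-in estimate of the mean, with bootstrap variance.
set.seed(1)
x <- c(2, 4, 4, 5, 7, 9)
plugin <- mean(x)                    # plug-in estimate
ests <- bootstrap_estimates(x, mean) # bootstrap values of the estimator
var(ests)                            # bootstrap estimate of its variance
```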
26.3 Bootstrap estimation of bias
Example 26.2 Let’s talk through an example. Let’s suppose we want to estimate the average number of hours sleep people have per night. That is, we want to know the expectation of the distribution of nightly hours of sleep.
We have data on m people from the NHANES survey:

sleep <- read.csv("https://bookdown.org/jgscott/DSGI/data/NHANES_sleep.csv")$SleepHrsNight
m <- length(sleep)
Our estimate for the average sleep time is the plug-in estimator, the sample mean:

estimate <- mean(sleep)
estimate
[1] 6.878955
This is a bit under 7 hours.
But we should check whether our estimator is likely to be biased or not. How can we estimate the bias (in order to check it’s close to 0)? Well, the bias is, of course, the expectation of the estimator minus the true value, and we can estimate that expectation by bootstrap sampling. Here our estimator/statistic uses all m data points, so each bootstrap resample should be of size m too.
set.seed(3)
boots <- 1e5
bootests <- rep(0, boots)
for (k in 1:boots) {
  resample <- sample(sleep, m, replace = TRUE)
  bootests[k] <- mean(resample)
}
est_exp <- mean(bootests)
est_exp
[1] 6.878924
This typically comes out as very close to the original estimate. With the seed set as 3 for reproducibility, we get 6.878924. So our estimate of the bias is est_exp - estimate, which is about -0.00003. Note here that we are estimating the bias by plugging in: the bootstrap mean stands in for the expectation of the estimator, and the original estimate stands in for the unknown true value.
On this occasion our estimator turned out to be unbiased – or at least extremely close to unbiased.
But had we found some bias, it is often preferable to improve our estimator by removing that bias. Our refined, or “debiased”, estimator would be the original estimate minus the estimated bias.
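In R, the bias estimate and the debiased estimator work out as follows. This is a sketch using the values printed above, hard-coded so it runs on its own:

```r
# Sketch: bootstrap bias correction, with the values printed above.
estimate <- 6.878955  # original plug-in estimate
est_exp  <- 6.878924  # bootstrap estimate of the estimator's expectation
bias_est <- est_exp - estimate   # estimated bias, about -3e-05
debiased <- estimate - bias_est  # equivalently, 2 * estimate - est_exp
debiased
```

Here the correction is tiny, as expected, since the sample mean is an unbiased estimator.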
26.4 Bootstrap estimation of variance and intervals
What about the error in our estimator? We measure that through the variance, because the variance equals the mean-square error for an unbiased estimator. (For a biased estimator, the MSE is the bias squared plus the variance – remember @thm-MSE-bias from a long time ago.) Again, we can use the bootstrap.
Example 26.3 We continue the sleep example.
var(bootests)
[1] 0.0008780489
This is about 0.0009, which is a pretty good estimate for the variance.
However, this sample variance measures deviations from the mean of the bootstrap estimates. Arguably it is better to measure deviations from the original estimate itself:

(1 / (boots - 1)) * sum((bootests - estimate)^2)
[1] 0.0008780498
In this case, with plenty of samples and well-behaved data, this is identical to 6 decimal places. But on other occasions, with fewer samples and weirder data, it could be an improvement.
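A natural follow-up (my addition, not in the original notes) is to turn the bootstrap variance into a standard error and a normal confidence interval for the mean sleep time, using the values printed above:

```r
# Sketch: standard error and a 95% normal confidence interval,
# using the values printed above, hard-coded so this runs on its own.
estimate <- 6.878955        # plug-in estimate of mean sleep hours
boot_var <- 0.0008780489    # bootstrap variance of the estimator
se <- sqrt(boot_var)        # standard error, about 0.03
estimate + qnorm(c(0.025, 0.975)) * se
```

This gives an interval of roughly 6.82 to 6.94 hours, so the estimate of just under 7 hours is quite precise.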