25 Plug-in estimation & Bootstrap I
Last time, we defined the empirical distribution of a dataset $x = (x_1, x_2, \dots, x_n)$: the distribution that places probability $1/n$ on each datapoint $x_i$.
25.1 The “plug-in” principle
There is one thing we didn’t get to last time: “plug-in estimation”.
Suppose now that $X_1, X_2, \dots, X_n$ are IID samples from a distribution with unknown PMF $p$, and that we want to estimate something about that distribution. Well, we could form the empirical distribution of the samples, with PMF $\hat{p}$, and use that in place of the unknown true distribution.
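As a quick reminder of what this object is in R, here is a minimal sketch; the dataset `x` is made up purely for illustration:

```r
x <- c(2, 5, 1, 3, 3, 8, 2)      # made-up data
emp_pmf <- table(x) / length(x)  # empirical PMF: proportion of each value
emp_pmf
```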
We have to be a bit careful here, because there are two levels of randomness at play. First, there is the fact that the samples $X_1, X_2, \dots, X_n$ are random IID samples from $p$. Once we have the samples $x_1, x_2, \dots, x_n$, that fixes the empirical PMF $\hat{p}$. Then a draw $X^*$ from the empirical distribution is itself a random variable, with PMF $\hat{p}$.
We will write $X^*$ for a random variable whose distribution is this empirical distribution, so that $X^*$ has PMF $\hat{p}$.
One way to estimate something about the random variable $X$ is to calculate the corresponding quantity for the empirical random variable $X^*$ instead – we “plug in” the empirical distribution where the true distribution should go.
This is easier to see if we take some examples.
Suppose we wanted to estimate the expectation $\mathbb{E}X$. The plug-in estimate is the expectation of $X^*$, which is $\mathbb{E}X^* = \sum_x x\,\hat{p}(x) = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}$, the sample mean.
Suppose we wanted to estimate the variance $\operatorname{Var}(X)$. The plug-in estimate is $\operatorname{Var}(X^*) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$ – note the $\frac{1}{n}$ here, rather than the $\frac{1}{n-1}$ of the usual sample variance.
Suppose we wanted to estimate a probability such as $\mathbb{P}(X \geq a)$. The plug-in estimate is $\mathbb{P}(X^* \geq a)$, the proportion of the datapoints that are at least $a$.
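A minimal R sketch of these three plug-in estimates, reusing the made-up dataset `x` from above (the threshold 3 is also invented, just for illustration):

```r
x <- c(2, 5, 1, 3, 3, 8, 2)  # same made-up data as above
n <- length(x)

mean(x)                   # plug-in estimate of the expectation E(X)
sum((x - mean(x))^2) / n  # plug-in estimate of Var(X): 1/n, not 1/(n - 1)
mean(x >= 3)              # plug-in estimate of P(X >= 3)
```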
25.2 The bootstrap set-up
OK, we’re now moving on from the empirical distribution to a slightly different but related topic: the bootstrap.
Suppose a statistician is interested in a statistic $T = t(X_1, X_2, \dots, X_n)$, where $X_1, X_2, \dots, X_n$ are IID samples from some distribution. For example (R sketches of these statistics follow the list):

- Suppose I pick a basketball squad of 12 players at random; what is their average height? Here, the underlying distribution is the distribution of basketball players’ heights, $n = 12$, and the statistic is the mean $T = \frac{1}{12}\sum_{i=1}^{12} X_i$.
- Suppose I visit The Edit Room cafe 5 times; what’s the longest queue I have to deal with? Here, the underlying distribution is the distribution of queue lengths at The Edit Room, $n = 5$, and the statistic is the maximum $T = \max\{X_1, \dots, X_5\}$.
- Suppose a supermarket distributor buys 1001 beef steaks; what is the median weight of the steaks? Here, the underlying distribution is the distribution of weights of steaks, $n = 1001$, and the statistic is the median $T = \operatorname{median}(X_1, \dots, X_{1001})$.
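Written as R functions, each statistic is just a function of a vector of $n$ samples; the function names here are invented for illustration:

```r
squad_mean_height <- function(x) mean(x)    # mean of n = 12 heights
longest_queue     <- function(x) max(x)     # max of n = 5 queue lengths
median_weight     <- function(x) median(x)  # median of n = 1001 steak weights
```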
The statistician is likely to be interested in properties of this statistic. For example, three of the most important things to know are:

- The expectation $\mathbb{E}T$ of the statistic.
- The variance $\operatorname{Var}(T)$ of the statistic – or related concepts like the standard deviation.
- A prediction interval $[a, b]$ for the statistic, such that $\mathbb{P}(a \leq T \leq b) = 0.95$, say.
Now, if the statistician knew the true distribution, these quantities could be worked out exactly – or at least estimated to any desired accuracy, by simulating many batches of $n$ samples from that distribution and evaluating the statistic on each batch. But in practice the true distribution is unknown: all we have is a dataset of samples from it.
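To illustrate the known-distribution case, here is a minimal simulation sketch under the purely hypothetical assumption that queue lengths follow a Poisson distribution with mean 2:

```r
# If we knew the true distribution, we could simulate the statistic directly.
# Hypothetical assumption: queue lengths ~ Poisson(2).
sims <- replicate(10000, max(rpois(5, lambda = 2)))
mean(sims)  # approximates E(T)
var(sims)   # approximates Var(T)
```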
Note that there are two numbers here: $n$, the number of samples the statistic is a function of (5 in the cafe example), and the number of datapoints in our dataset, which we will call $m$ (30 in the cafe data below). These two numbers need not be equal.
The bootstrap method is the following idea (a generic R sketch follows the list):
1. Take $n$ samples from the empirical distribution of $x_1, x_2, \dots, x_m$. This is equivalent to sampling $n$ of the values with replacement. Let’s call these samples $X_1^*, X_2^*, \dots, X_n^*$. Evaluate the statistic with these samples: $T^* = t(X_1^*, X_2^*, \dots, X_n^*)$.
2. Repeat step 1 many times; let’s say $B$ times. Keep taking $n$ of the samples with replacement and evaluating the statistic. We now have $B$ versions of that statistic: $T_1^*, T_2^*, \dots, T_B^*$.
3. Use these $B$ versions of the statistic to get a bootstrap estimate of the expectation, variance, or prediction interval. To estimate the expectation of the statistic $\mathbb{E}T$, use the sample mean of the evaluated statistics, $\bar{T}^* = \frac{1}{B}\sum_{j=1}^{B} T_j^*$. To estimate the variance $\operatorname{Var}(T)$, use the sample variance $\frac{1}{B-1}\sum_{j=1}^{B} \bigl(T_j^* - \bar{T}^*\bigr)^2$. We’ll come back to the prediction interval next time.
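As a single R function, the recipe might look like this minimal sketch; the names `bootstrap`, `statistic` and `B` are ours, not from any package:

```r
# Resample n datapoints with replacement, evaluate the statistic,
# and repeat B times; returns the B bootstrapped statistics.
bootstrap <- function(x, statistic, n, B = 1000) {
  replicate(B, statistic(sample(x, n, replace = TRUE)))
}
```

Bootstrap estimates of the expectation and variance of the statistic are then `mean()` and `var()` of the vector this returns.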
The bootstrap concept was discovered by the American statistician Bradley Efron in a hugely influential paper “Bootstrap methods: another look at the jackknife” in 1979. The name “bootstrap” comes from the phrase “to pull yourself up by your bootstraps”, which roughly means to make progress without any outside help, in a way that might initially seem impossible – similarly, the bootstrap manages to estimate properties of a statistic by just reusing the same set of samples over and over again. (The “jackknife” in the title of Efron’s paper was an earlier, simpler, less powerful idea along similar lines, named after the multipurpose jackknife tool.)
25.3 Bootstrap for expectation and variance
Example 25.1 Let’s take the cafe example above. The statistic in question is $T = \max\{X_1, X_2, X_3, X_4, X_5\}$, the longest queue over $n = 5$ visits.
The researcher visits The Edit Room on 30 random occasions and notes the following data.
| Queue length | 0 | 1 | 2 | 3 | 4 | 5 | 7 | 11 | Total |
|---|---|---|---|---|---|---|---|---|---|
| Number of occasions | 11 | 5 | 7 | 3 | 1 | 1 | 1 | 1 | 30 |
We start by taking 5 samples from the empirical distribution – that is, we choose 5 of the datapoints uniformly at random with replacement. Let’s say these are $x_1^*, x_2^*, \dots, x_5^*$; the first realisation of the statistic is then $T_1^* = \max\{x_1^*, \dots, x_5^*\}$.
We keep doing this many times – we pick five samples with replacement, and calculate their maximum. In R:
```r
# Enter the data: the 30 observed queue lengths from the table above
queues <- c(rep(0, 11), rep(1, 5), rep(2, 7), rep(3, 3), 4, 5, 7, 11)

# Resample 5 queue lengths with replacement and take the maximum,
# repeating 1000 times
boots <- 1000
maxes <- rep(0, boots)
for (k in 1:boots) {
  minisample <- sample(queues, 5, replace = TRUE)
  maxes[k] <- max(minisample)
}
```
This gives us 1000 realisations of the test statistic, which we can use to look at its distribution:
```r
dist <- table(maxes) / boots
dist
```
```
maxes
    0     1     2     3     4     5     7    11 
0.009 0.034 0.223 0.227 0.103 0.107 0.124 0.173 
```
```r
plot(dist)
```
We can also look at particular figures of interest. For example, the bootstrap estimates of the expectation $\mathbb{E}T$ and variance $\operatorname{Var}(T)$ of the statistic are the sample mean and sample variance of our 1000 realisations:

```r
c(mean(maxes), var(maxes))
```
```
[1] 4.87900 10.50687
```
Next time, we’ll look into bootstrap methods in more detail.