This R worksheet does not include assessed questions.You should complete this worksheet in Week 8. If you have difficulty with this worksheet, you can get help at the office hours drop-in sessions.
On the last worksheet, we looked at dealing with famous discrete random variables: the binomial, geometric, and Poisson distributions. In this worksheet, we will look at dealing with arbitrary discrete random variables, as defined by a probability mass function (PMF)
The key to dealing with a discrete random variable is to set up two things: the range of values the random variable can take, and the PMF itself.
Consider, for example, the following PMF for a random variable \(X\):
\(x\) | \(0\) | \(0.5\) | \(1\) | \(2\) | \(2.5\) | \(4\) |
---|---|---|---|---|---|---|
\(p_X(x)\) | \(0.1\) | \(0.2\) | \(0.2\) | \(0.3\) | \(0.1\) | \(0.1\) |
We want to store the \(x\) values in one vector, and the \(p_X(x)\) values in another vector. We can do that like this:
x <- c(0, 0.5, 1, 2, 2.5, 4 )
pmf <- c(0.1, 0.2, 0.2, 0.3, 0.1, 0.1)
(I find this easier to read if I add spaces to make the columns line up, but this is completely optional.)
It’s good practice to check that the PMF does indeed sum to 1:
sum(pmf)
[1] 1
We can find the probability of a given value of \(x\) by pulling out the relevant value of the PMF. We are used to using square brackets to pull out the elements of a vector. But more complicate expressions can be put inside the brackets, like this:
pmf[x == 0.5]
pulls out the value of the PMF that
corresponds to \(x = 0.5\); that is, of
\(p_X(0.5)\). (Note the double-equals
==
. Roughly speaking, in R, a single-equals =
means “is set to be equal to”, while a double-equals
==
means “is equal to”.)
pmf[x > 1.5]
pulls out all the values of the PMF
that corresponding to \(x > 1.5\).
It’s often helpful to use this together with the sum()
function, because sum(pmf[x > 1.5])
will give \(\mathbb P(X > 1.5)\). (Do you see
why?)
Relations that can be used this way are
==
: equals \(=\)>
: greater than \(>\)<
: less than \(>\)>=
: greater than or equal to \(\geq\)<=
: less than or equal to \(\leq\)!=
,
in
, &
, |
.Exercise 8.1. Let \(Y\) be the sum of fair dice rolls. So \(Y\) takes the values \((2, 3, \dots, 12)\) with probabilities \(\frac{1}{36}(1,2,3,4,5,6,5,4,3,2,1)\).
(a) Enter this random variable intoy
andpmf_y
. Do so as efficiently as possible, by taking advantage of the colon notation.
(b) What is \(\mathbb P(Y = 9)\)?
(c) What is \(\mathbb P(Y \leq 6)\)?
If we want to plot a picture of a PMF (like those in the Lecture 11
of the notes) we can use plot()
together with the optional
argument type = "h"
. Here, "h"
stands for
“histogram-like vertical lines”, which are most
appropriate for displaying PMFs. I also like to make the lines somewhat
thicker, with the lwd = ...
argument, and make sure that
the y-axis starts from 0.
So I would draw a graph for our previous PMF like this:
plot(x, pmf, type = "h", lwd = 4, ylim = c(0, max(pmf)))
Exercise 8.2. Draw a plot of the PMF for the dice random variable \(Y\) from Exercise 8.1. Draw the lines in blue. Make sure to label the axes appropriately.
We know that a CDF \(F(x) = \mathbb P(X
\leq x)\) can be calculated by summing up the PDF; that is, \[ F(x) = \sum_{y \leq x} p(x) . \] To
calculate this in R, it is helpful to use the cumulative sum function
cumsum()
. The cumulative sum of a vector
\[ (a, \quad b, \quad c, \quad d, \quad e)
\] is the vector of partial sums \[
(a, \quad a+b, \quad a+b+c, \quad a+b+c+d, \quad a+b+c+d+e) .
\]
So taking the cumulative sum of the PMF,
cumsum(pmf)
[1] 0.1 0.3 0.5 0.8 0.9 1.0
gives a vector of the values of the CDF at the values of \(x\).
However, the CDF also should be defined in between the values of
\(x\) in the range of \(X\), where it should behave like a “step
function”. We can create such a function using the
stepfun()
function. This
stepfun
requires two arguments:
So we form the CDF function with
cdf <- stepfun(x, c(0, cumsum(pmf)))
We can test that this CDF works between the values in the range of \(X\) by testing it:
cdf(1.62)
[1] 0.5
We can also plot the CDF with the plot()
function:
plot(cdf)
with other extra graphical commands, as required. The “blobs” are to let the reader know that at the jumps, the function takes the higher value.
Exercise 8.3. Form the CDF for the dice random variable from Exercise 8.1.
(a) What is \(F_Y(4.6)\)?
(b) What is \(\mathbb P(Y > 8.2)\)?
(c) Plot the CDF. Make sure to label your axes appropriately.
Recall that the definition of the expectation of a discrete random variable is \[ \mathbb EX = \sum_x x\, p(x) . \]
So to calculate this, we can simply use R’s sum()
function. We just want
sum(x * pmf)
[1] 1.55
That’s all there is to it!
Exercise 8.4. With the dice random variable again:
(a) Calculate the expectation \(\mathbb EY\).
(b) Calculate \(\mathbb EY^2\).
We can find a variance using either the definitional formula
\[ \text{Var}(X) = \mathbb E(X - \mu)^2 \]
or the computational formula
\[ \text{Var}(X) = \mathbb EX^2 - \mu^2 .\]
Using R in the same way, these are
mu <- sum(x * pmf)
sum((x - mu)^2 * pmf)
[1] 1.2725
sum(x^2 * pmf) - mu^2
[1] 1.2725
Note that we saved the expectation as an R object – here, called
mu
. This makes sure we keep maximum accuracy, and is much
better practice than simply re-typing in the value of the expectation by
hand.
Exercise 8.5. With the dice random variable again, calculate the variance of \(Y\).
Note that the function var()
, which we saw on R
Worksheet 2, is used to calculate the sample variance of a
dataset, where the dataset is represented a single vector. This is of no
use for finding the variance of a random variable represented by a pair
of vectors for the range and the PMF.
Finally, we may wish to sample from a discrete random variable. To do
this, we will use the sample()
function,
like this:
sample(x, 20, replace = TRUE, prob = pmf)
[1] 0.5 1.0 2.5 1.0 0.5 2.0 4.0 0.0 0.0 1.0 2.0 1.0 1.0 0.5 2.0 2.0 1.0 4.0 0.0 2.5
Here, the arguments are:
replace = TRUE
tells us we want to be allowed “repeats”
in our sample. If we want IID samples of a random variable, we always
want sampling with replacement. (For some sampling-without-replacement
probability problems, it might be appropriate to take
replace = FALSE
, which is R’s default for the
sample()
function.)prob = ...
tells R with what probabilities each value
of x
should be sampled. If this argument is omitted, R
takes each value to be equally likely.Exercise 8.6. Sample 10,000 simulated dice rolls from the random variable \(Y\). Calculate the mean of these simulations. Compare your answer with that from Exercise 8.4(a).