The geometric distribution starts from 0
You keep rolling a dice until you get a six: how many rolls does this take?
More generally, you have a sequence of “trials”, each of which succeeds independently with probability $p$ or fails with probability $1-p$. You keep running these trials until you get a success. How many trials does this take?
Mathematicians say that this number of trials follows a geometric distribution. But there’s actually a bit of disagreement about exactly what the geometric distribution is. There are two different conventions:
- Convention 1 is that the geometric distribution counts the number of trials up to and including the first success. So if I roll my dice and get three, one, two, two, six, then I rolled the dice 5 times altogether, including the final six, so $X = 5$. The possible numbers of total trials are $1, 2, 3, \dots$, starting from 1. The probability of performing exactly $x$ trials up to and including the first success is $p_1(x) = (1-p)^{x-1} \, p$ for $x = 1, 2, \dots$, since you need $x-1$ failures followed by the $x$th trial being a success.
- Convention 0 is that the geometric distribution counts the number of failures before the first success. So if I roll my dice and get three, one, two, two, six, then I rolled 4 non-sixes before the six, so $X = 4$. The possible numbers of failures are $0, 1, 2, \dots$, starting from 0. The probability of getting exactly $x$ failures before the first success is $p_0(x) = (1-p)^x \, p$ for $x = 0, 1, \dots$, since you need $x$ failures followed by a success.
A Convention-1 geometric distribution can be turned into a Convention-0 geometric distribution by subtracting 1; a Convention-0 geometric distribution can be turned into a Convention-1 geometric distribution by adding 1. So these aren’t, deep down, substantially different objects. But it is usually important that people know which convention you’re talking about.
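To make the difference concrete, here’s a minimal sketch of the two PMFs in R. (As we’ll see shortly, R’s built-in `dgeom` takes a side in this dispute.)

```r
p <- 1/6  # success probability: rolling a six

# Convention 1: P(X = x) for x = 1, 2, ... (trials up to and including the first success)
p1 <- function(x) (1 - p)^(x - 1) * p

# Convention 0: P(X = x) for x = 0, 1, ... (failures before the first success)
p0 <- function(x) (1 - p)^x * p

# R's dgeom follows Convention 0 ...
all.equal(p0(0:10), dgeom(0:10, prob = p))  # TRUE
# ... and Convention 1 is just Convention 0 shifted along by one
all.equal(p1(1:11), p0(0:10))               # TRUE
```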
Which convention is more popular?
- Convention 1 is used by:
  - Probability by Durrett; Probability by Grimmett and Welsh; Probability and Random Processes by Grimmett and Stirzaker; Introductory Probability by Grinstead and Snell; Probability and Computing by Mitzenmacher and Upfal; Introduction to Probability Models by Ross; Elementary Probability by Stirzaker; Weighing the Odds by Williams
  - My MATH1710 notes*; my successor’s MATH1700 notes*; Vittoria Silvestri’s Cambridge notes; Oliver Johnson’s Bristol notes
  - Every single one of the half-dozen colleagues I asked this week
  - Claude*; Microsoft Copilot
- Convention 0 is used by:
  - Introduction to Probability by Blitzstein and Hwang*
  - The statistical programming language R
  - My predecessor’s MATH1710 notes; Richard Weber’s Cambridge notes
  - Wolfram MathWorld
- Both conventions are given equal coverage by:
  - Wikipedia
  - ChatGPT; Google Gemini
(A * denotes that the source mentions the existence of the other convention.)
It seems that Convention 1 is more popular, perhaps almost overwhelmingly so. (Although actually Convention 0 did do a little better than I had expected.)
When I taught the geometric distribution I was a strong Convention 1-er, although I did mention that the language R uses Convention 0, which I said I found very annoying. In the lectures, I think I said something along the lines of: “When I’m King of the World, I will force everyone to use the convention where the geometric distribution is the number of trials up to and including the first success. That this is not universally recognised is just further evidence of the fallen state of Mankind.”
In this blogpost I want to admit I was wrong. Over the past year or so, I’ve had a Damascene conversion, and I’m now fully on-board with Convention 0. (You can see my doubts first starting to bloom towards the end of this earlier blogpost.) I want to explain why I now think that Convention 0 is better.
Why start from 1?
Before explaining why I changed my mind, let me try to recreate my former thought process about why Convention 1 might be preferable.
First, Convention 1 is the thing you actually want to know about. If I’m rolling a dice until getting a certain number, I want to know how many times I have to roll it altogether, not how many unsuccessful rolls I’ll have before the successful one.
Second, under Convention 1, the expected number of trials up to and including the first success is $1/p$, while under Convention 0, the expected number of failed trials is $1/p - 1 = (1-p)/p$. The first expression is neater – especially in the “$n$ equally likely outcomes, of which one is a success” setting, where the Convention 1 expectation is $n$ and the Convention 0 expectation is $n - 1$.
To put these together, suppose I’m rolling a $d$-sided dice until getting a particular number. It seems both more useful and more pleasant to say “on average it will take $d$ rolls to succeed” than to say “on average it will take $d-1$ failed rolls before succeeding”.
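Those two expectations are easy to check empirically in R, remembering that R’s `rgeom` counts failures:

```r
set.seed(1)
p <- 1/6
failures <- rgeom(1e6, prob = p)  # rgeom counts failures: Convention 0

mean(failures)      # ~ (1 - p)/p = 5, the Convention-0 expectation
mean(failures + 1)  # ~ 1/p = 6, the Convention-1 expectation
```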
Why start from 0?
I still have some sympathy with that point of view. But if we look at the mathematical properties of the two conventions, it’s clear that Convention 0 always has the nicer properties. Here are some examples I thought of.
1. Thinning. To thin a random variable $X$ by a probability $a$, we think of $X$ as representing a number of items, each of which is independently kept with probability $a$ or discarded with probability $1-a$.
   - Convention 0: Thinning a Convention-0 geometric distribution gives another Convention-0 geometric distribution but with a different value of $p$. (There’s a simulation check of this property, and of property 3, after the list.)
   - Convention 1: Thinning a Convention-1 geometric distribution gives a distribution not in any well-known family.
2. Compound Poisson. A compound Poisson distribution is a sum of Poisson-many IID copies of some distribution. We can think of this as receiving a Poisson number of deliveries, each of which contains an IID random number of items; the total number of items across all the deliveries then has a compound Poisson distribution.
   - Convention 0: A Convention-0 geometric distribution is compound Poisson where the compounded distribution is a logarithmic distribution.
   - Convention 1: A Convention-1 geometric distribution is not compound Poisson.
3. Mixed Poisson. A mixed Poisson distribution is a Poisson distribution where the rate parameter is itself chosen at random. We can think of the random rate parameter as how popular we are today, and the mixed Poisson distribution as the number of items we receive, which is Poisson conditional on our popularity.
   - Convention 0: A Convention-0 geometric distribution is mixed Poisson where the rate follows an exponential distribution.
   - Convention 1: A Convention-1 geometric distribution is not mixed Poisson.
4. Infinite divisibility. A random variable is infinitely divisible if, for any $n$, it can be written as the sum of $n$ IID copies of some random variable $Y_n$. It is called discrete infinitely divisible if $Y_n$ takes only non-negative integer values.
   - Convention 0: A Convention-0 geometric distribution is infinitely divisible and discrete infinitely divisible.
   - Convention 1: A Convention-1 geometric distribution is infinitely divisible but is not discrete infinitely divisible.
5. Factorial tilting. This one’s a bit more obscure. One way of defining the exponential tilting $X^{(s)}$ of $X$ is that the moment generating function $M$ of $X$ and the moment generating function $M^{(s)}$ of $X^{(s)}$ are related by $M^{(s)}(t) = M(t + s) / M(s)$. Jørgensen and Kokonendji define the “factorial tilting” $X^{[s]}$ as an alternative for discrete distributions, instead working with the factorial moment generating function $\Phi(t) = \mathbb E(1+t)^X$: the factorial moment generating function $\Phi$ of $X$ and the factorial moment generating function $\Phi^{[s]}$ of $X^{[s]}$ are related by $\Phi^{[s]}(t) = \Phi(t + s) / \Phi(s)$. This preserves, for example, the families of Poisson, Bernoulli, binomial and Hermite distributions.
   - Convention 0: The factorial tilting of a Convention-0 geometric distribution is another Convention-0 geometric distribution (for $s$ such that the factorial tilting exists).
   - Convention 1: The factorial tilting of a Convention-1 geometric distribution is not in any well-known family.
I think that’s 5–0 for Convention 0.
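In case you’d rather see simulations than take my word for it, here’s a quick R sketch checking properties 1 and 3 (the parameter values are arbitrary choices of mine):

```r
set.seed(1)
N <- 1e6
p <- 1/6; a <- 1/2

# Property 1 (thinning): keep each of geometric(p)-many items with probability a.
# Prediction: Convention-0 geometric again, with odds against success a*(1-p)/p,
# i.e. with success probability p / (p + a*(1-p)).
x <- rgeom(N, prob = p)                  # rgeom is Convention 0
thinned <- rbinom(N, size = x, prob = a)
p_new <- p / (p + a * (1 - p))
rbind(empirical = table(thinned)[1:5] / N,
      predicted = dgeom(0:4, prob = p_new))

# Property 3 (mixed Poisson): a Poisson whose rate is Exponential(p/(1-p))
# should again be Convention-0 geometric with parameter p.
mixed <- rpois(N, rexp(N, rate = p / (1 - p)))
rbind(empirical = table(mixed)[1:5] / N,
      predicted = dgeom(0:4, prob = p))
```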
Update: I thought of another: the Convention-0 geometric is the equilibrium distribution of the M/M/1 queue; I can’t think of any sensible queueing model for which a Convention-1 geometric distribution is the equilibrium distribution.
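Here’s a rough check of that claim too: a minimal Gillespie-style simulation of the M/M/1 queue (with rates $\lambda = 1$ and $\mu = 2$ picked arbitrarily), whose long-run queue-length distribution should be Convention-0 geometric with success probability $1 - \rho$:

```r
set.seed(1)
lambda <- 1; mu <- 2; rho <- lambda / mu  # arrival rate, service rate, traffic intensity
t_end <- 1e5
t <- 0; q <- 0                            # current time and queue length
time_in_state <- numeric(50)              # time spent with queue length 0, 1, 2, ...
while (t < t_end) {
  rate <- lambda + if (q > 0) mu else 0   # total rate of events from this state
  dt <- rexp(1, rate)
  time_in_state[q + 1] <- time_in_state[q + 1] + dt
  t <- t + dt
  q <- q + if (runif(1) < lambda / rate) 1 else -1  # arrival or departure
}

rbind(empirical = (time_in_state / t)[1:6],
      predicted = dgeom(0:5, prob = 1 - rho))
```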
A modest proposal
Actually, though, I want to go further. I don’t just want to convert everyone to Convention 0. More controversially still, I want to change the parameter of the geometric distribution. Rather than using the success probability $p$ as the parameter, I want to use the odds against success $\theta = (1 - p)/p$.
Why such a bizarre choice? To do this, I want to put the geometric distribution within the wider family of negative binomial distributions. A negative binomial distribution has two parameters: $n$ and $p$ (or $n$ and $\theta$, I will shortly argue). The negative binomial distribution, at least to us Convention 0ers, counts the number of failures before the $n$th success. So, for example, if you roll a dice until getting a six for the tenth time, the number of non-sixes you rolled en route is negative binomial with $n = 10$ and $p = \frac{1}{6}$ (or $\theta = 5$). Setting $n = 1$ recovers the geometric distribution in its Convention-0 form, so the Convention-0 geometric slots in nicely as the first and most important example in the bigger family of negative binomials.
(None of my Convention 1-loving colleagues were willing to bite the bullet and admit the negative binomial should be the number of trials up to and including the $n$th success, with minimum value $n$. So maybe they’re all secret Convention 0ers like me, deep down.)
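Here’s that dice example checked in R, with a little helper function of my own for the simulation. Note that R’s `dnbinom`, like its `dgeom`, counts failures, so no off-by-one corrections are needed:

```r
set.seed(1)
# Roll a fair dice until the tenth six, counting the non-sixes along the way
non_sixes_until_tenth_six <- function() {
  sixes <- 0; fails <- 0
  while (sixes < 10) {
    if (runif(1) < 1/6) sixes <- sixes + 1 else fails <- fails + 1
  }
  fails
}
x <- replicate(1e4, non_sixes_until_tenth_six())

mean(x)  # ~ n * theta = 10 * 5 = 50 non-sixes on average
# R's dnbinom uses Convention 0: x counts failures before the size-th success
c(empirical = mean(x == 50), predicted = dnbinom(50, size = 10, prob = 1/6))
```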
But it turns out that the negative binomial with $\theta$ as the odds against success behaves in a number of interesting ways as the “opposite” of the binomial distribution. The binomial distribution is the number of successes out of a fixed number $n$ of trials, each of which succeeds with probability $\theta$. Remember that for the binomial $\theta$ is the success probability, but for the negative binomial $\theta$ is the odds against success.
So what are these interesting “opposites”?
(I’ll be using the notation $n^{\underline{k}} = n(n-1)\cdots(n-k+1)$ for the falling factorial and $n^{\overline{k}} = n(n+1)\cdots(n+k-1)$ for the rising factorial.)
1. Probability mass function.
   - The PMF of the binomial distribution is $\displaystyle \binom{n}{x} \theta^x (1 - \theta)^{n-x}$, where $\binom{n}{x} = n^{\underline{x}} / x!$ is the binomial coefficient.
   - The PMF of the negative binomial distribution is $\displaystyle \left(\kern-0.4em\binom{n}{x}\kern-0.4em\right) \theta^{x} (1 + \theta)^{-n-x}$, where $\left(\kern-0.4em\binom{n}{x}\kern-0.4em\right) = n^{\overline{x}} / x!$ is the multiset coefficient.
2. Expectation.
   - The expectation of the binomial distribution is $n\theta$.
   - The expectation of the negative binomial distribution is $n\theta$.
3. Variance.
   - The variance of the binomial distribution is $n\theta(1-\theta)$.
   - The variance of the negative binomial distribution is $n\theta(1+\theta)$.
4. Factorial moments. The $k$th factorial moment is $\mathbb EX^{\underline{k}} = \mathbb EX(X-1)\cdots(X - k + 1)$.
   - The $k$th factorial moment of the binomial distribution is $n^{\underline{k}} \,\theta^k$.
   - The $k$th factorial moment of the negative binomial distribution is $n^{\overline{k}} \,\theta^k$.
5. Probability generating function. The probability generating function is $G_X(t) = \mathbb E\,t^X$.
   - The probability generating function of the binomial distribution is $(1 - \theta + \theta t)^n$.
   - The probability generating function of the negative binomial distribution is $(1 + \theta - \theta t)^{-n}$.
6. Thinning.
   - The $a$-thinning of a binomial distribution keeps $n$ the same but changes the success probability from $\theta$ to $a\theta$.
   - The $a$-thinning of a negative binomial distribution keeps $n$ the same but changes the odds against success from $\theta$ to $a\theta$.
All these “opposites” results are much more pleasant than they would be if the negative binomial (and therefore the geometric) were parameterised by the success probability $p$ where $p = 1/(1 + \theta)$.
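As a final sanity check, here’s a quick empirical confirmation of the expectation and variance “opposites” in R, with arbitrary values $n = 12$ and $\theta = 0.3$ (for a side-by-side comparison, $\theta$ must be at most 1, since it plays the role of a probability for the binomial):

```r
set.seed(1)
n <- 12; theta <- 0.3  # theta: success probability for the binomial,
                       # odds against success for the negative binomial
N <- 1e6
bin  <- rbinom(N, size = n, prob = theta)
nbin <- rnbinom(N, size = n, prob = 1 / (1 + theta))  # p = 1/(1 + theta)

# Expectations: n * theta for both
c(binomial = mean(bin), neg_binomial = mean(nbin), exact = n * theta)

# Variances: n * theta * (1 - theta) versus n * theta * (1 + theta)
c(empirical = var(bin),  predicted = n * theta * (1 - theta))
c(empirical = var(nbin), predicted = n * theta * (1 + theta))
```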