The Birthday Problem
And why it matters
Published: 2022-09-27
Itâs a well-known fact that all the best people are born in September. Yours truly is evidence of that. As my birthday approached this year, I found myself thinking on and utilizing one of the more memorable paradoxes from my probability & statistics class: The Birthday Problem.
How many people, chosen at random, would you need to put in a room for it to be likely that two persons share a birthday?
Intuition may initially lead us to consider the answer to be on the order of the number of days in the year. Maybe 200 people? 100?
Well, if the answer were that straightforward, it wouldnât be worth discussing. Spoiler alert; the answer is 23. It seems incredulous, doesnât it, that in a room of 23 people, thereâs a more than 50% chance that two people share the same birthday? The proof is a relatively straightforward combinatorics problem, and while Iâd love to document it here, the Wikipedia entry does a good job (if youâve done any discrete math/counting). Suffice it to say that the number starts becoming more believable when we remember that weâre comparing birthdays between every possible pair. With 23 people, there are 23 Ă 22 / 2 = 253 possible pairings â more than half the days in the year. If anyone does want to discuss the math, let me know - Iâd be happy to do so in a Twitter space or similar.
Why it matters
Why am I talking about this? Is this relevant in computer science, or am I just some nerd who
thinks about weird stuff on his birthday? Yes, and yes.
The birthday problem came up a few weeks ago when I was solving an issue at work, demonstrating its
usefulness.
The scenario:
I needed to generate a unique ID. Thatâs a common enough need. As developers, we often
need a unique ID, whether itâs to make a primary key in a database, to link to a user, to
create a URL shortener, to have a trace ID for logs, etc. In Golang, a common package to use is UUID. It follows specific standards and gives you a âUniversally Unique Identifierâ.
However, in this case, I couldnât use that package because the UUID generated was too long
for my needs.
One of the reasons UUIDs are so long â the string representation is 36 characters (or 32 in raw hex) â is that they follow a standard in generating 128-bit IDs. My use case didnât care about any of the UUID standards. I looked at the package and decided that I could simply replicate its process with fewer bits to be under the limit. But then the question becomes: how do I compare solutions? How do I know whatâs âgood enoughâ? You may be thinking, âWhy not just trim the UUID to the size you need?â But that raises similar questions about knowing how good the solution is (with a more complex answer). The critical functionality of UUIDs is their âuniqueness.â How can I be sure I wonât generate duplicates? How many IDs can I generate before a collision becomes likely? If I cut a UUID in half, does that mean my resulting ID is now half as good? Nope. The answer is much⌠much worse than that. But this is where the birthday problem steps in to answer all our questions.
Fundamentally, the birthday problem is about the likelihood of collisions (duplicates, in our case). It tells us how many IDs we can generate before duplicates become likely. For UUID (version 4), the number of IDs that needs to be generated to have a 50% probability of at least one collision is 2.71 quintillion (2.71 Ă 1018). To put that in relatable terms, the UUID package states:
Randomly generated UUIDs have 122 random bits. Oneâs annual risk of being hit by a meteorite is estimated to be one chance in 17 billion, which means the probability is about 0.00000000006 (6 Ă 10â11), equivalent to the odds of creating a few tens of trillions of UUIDs in a year and having one duplicate.
So then we get back to my problem: How to describe to my leads/reviewers how good my solution is compared to the standard UUID package. I leveraged the birthday problem to calculate the collision probabilities with fewer bits. After I concluded that work, I realized that the Wikipedia entry had already made a probability table. I was considering using 64 bits to randomly generate IDs. In terms of uniqueness, I would need to generate 5 billion IDs for a duplicate to be likely. Or, to paraphrase the UUID package:
Oneâs annual risk of dying by falling out of bed is estimated to be one chance in two million, which means the probability is about 5 Ă 10-7, equivalent to the odds of creating a few million IDs in a year and having one duplicate.
That was objectively more âuniquenessâ than our use case would ever warrant, and I
had the numbers to show it.
After generating the number ID, I also did some base conversion to further condense the string representation.
UUIDs use hexadecimal, but I had no reason not to use the rest of the alphabet. So the final ID was
in base 36.
I also thought of a few other approaches to generating IDs, such as being time-based. I wonât go into further detail here, but I created a library with various options. Iâm happy to discuss anything there as well.
Finally, if you want to see how the birthday problem relates to hashes, cryptography, and security, look into birthday attack.
As always, I hope this read has been helpful to you â or at least enjoyable.