Estimating your chances of winning the lottery with sampling
Statistical estimates can be fascinating, can’t they? By just sampling a few instances from a population, you can infer properties of that population such as the mean value or the variance. Likewise, under the right circumstances, it is possible to estimate the total size of the population, as I want to show you in this article.
I will use the example of drawing lottery tickets to estimate how many tickets there are in total, and hence calculate the likelihood of winning. More formally, this means to estimate the population size given a discrete uniform distribution. We will see different estimates and discuss their differences and weaknesses. In addition, I will point you to some other use cases this approach can be used in.
Playing the lottery
Let’s imagine I go to a state fair and buy some tickets in the lottery. As a data scientist, I want to know the probability of winning the main prize, of course. Let’s assume there is just a single ticket that wins the main prize. So, to estimate the likelihood of winning, I need to know the total number of lottery tickets N, as my chance of winning is 1/N then (or k/N, if I buy k tickets). But how can I estimate that N by just buying a few tickets (which are, as I saw, all losers)?
I will make use of the fact, that the lottery tickets have numbers on them, and I assume, that these are consecutive running numbers (which means, that I assume a discrete uniform distribution). Say I have bought some tickets and their numbers in order are [242,412,823,1429,1702]. What do I know about the total number of tickets now? Well, obviously there are at least 1702 tickets (as that is the highest number I have seen so far). That gives me a first lower bound of the number of tickets, but how accurate is it for the actual number of tickets? Just because the highest number I have drawn is 1702, that doesn’t mean that there are any numbers higher than that. It is very unlikely, that I caught the lottery ticket with the highest number in my sample.
However, we can make more out of the data. Let us think as follows: If we knew the middle number of all the tickets, we could easily derive the total number from that: If the middle number is m, then there are m-1 tickets below that middle number, and there are m+1 tickets above that. That is, the total number of tickets would be (m-1) + (m+1) + 1, (with the +1 being the ticket of number m itself), which is equal to 2m-1. We don’t know that middle number m, but we can estimate it by the mean or the median of our sample. My sample above has the (rounded) average of 922, which yields 2*922-1 = 1843. That is, from that calculation the estimated number of tickets is 1843.
That was quite interesting so far, as just from a few lottery ticket numbers, I was able to give an estimate of the total number of tickets. However, you may wonder if that is the best estimate we can get. Let me spoil you right away: It is not.
The method we used has some drawbacks. Let me demonstrate that to you with another example: Say we have the numbers [12,30,88], which leads us to 2*43–1 = 85. That means, the formula suggests there are 85 tickets in total. However, we have ticket number 88 in our sample, so this can not be true at all! There is a general problem with this method: The estimated N can be lower than the highest number in the sample. In that case, the estimate has no meaning at all, as we already know, that the highest number in the sample is a natural lower bound of the overall N.
A better approach: Using even spacing
Okay, so what can we do? Let us think in a different direction. The lottery tickets I bought have been sampled randomly from the distribution that goes from 1 to unknown N. My ticket with the highest number is number 1702, and I wonder, how far away is this from being the highest ticket at all. In other words, what is the gap between 1702 and N? If I knew that gap, I could easily calculate N from that. What do I know about that gap, though? Well, I have reason to assume that this gap is expected to be as big as all the other gaps between two consecutive tickets in my sample. The gap between the first and the second ticket should, on average, be as big as the gap between the second and the third ticket, and so on. There is no reason why any of those gaps should be bigger or smaller than the others, except for random deviation, of course. I sampled my lottery tickets independently, so they should be evenly spaced on the range of all possible ticket numbers. On average, the numbers in the range of 0 to N would look like birds on a power line, all having the same gap between them.
That means I expect N-1702 to equal the average of all the other gaps. The other gaps are 242–0=242, 412–242=170, 823–412=411, 1429–823=606, 1702–1429=273, which gives the average 340. Hence I estimate N to be 1702+340=2042. In short, this can be denoted by the following formula:
N = x + (x-k)/k
Here x is the biggest number observed (1702, in our case), and k is the number of samples (5, in our case). This is just a short form of calculating the average as we just did.
Let’s do a simulation
We just saw two estimates of the total number of lottery tickets. First, we calculated 2*m-1, which gave us 1843, and then we used the more sophisticated approach of x + (x-k)/k and obtained 2042. I wonder which estimation is more correct now? Are my chances of winning the lottery 1/1843 or 1/2042?
To show some properties of the estimates we just used, I did a simulation. I drew samples of different sizes k from a distribution, where the highest number is 2000, and that I did a few hundred times each. Hence we would expect that our estimates also return 2000, at least on average. This is the outcome of the simulation:
Probability densities of the different estimates for varying k. Note that the ground truth N is 2000. Image by author.
What do we see here? On the x-axis, we see the k, i.e. the number of samples we take. For each k, we see the distribution of the estimates based on a few hundred simulations for the two formulas we just got to know. The dark point indicates the mean value of the simulations each, which is always 2000, independent of the k. That is a very interesting point: Both estimates converge to the correct value if they are repeated an infinite number of times.
However, besides the common average, the distributions differ quite a lot. We see, that the formula 2*m-1 has higher variance, i.e. its estimates are far away from the real value more often than for the other formula. The variance has a tendency to decrease with higher k though. This decrease does not always hold perfectly, as this is just as simulation and is still subject to random influences. However, it is quite understandable and expected: The more samples I take, the more precise is my estimation. That is a very common property of statistical estimates.
We also see that the deviations are symmetrical, i.e. underestimating the real value is as likely as overestimating it. For the second approach, this symmetry does not hold: While most of the density is above the real mean, there are more and larger outliers below. How does that come? Let’s retrace how we computed that estimate. We took the biggest number in our sample and added the average gap size to that. Naturally, the biggest number in our sample can only be as big as the biggest number in total (the N that we…
Source link