Drawing From a Random Distribution in SQL | by Sami Abboud

From a probability density function to random samples
Photo by Moritz Kindler on Unsplash

There are various approaches to updating a reinforcement learning agent’s policy during each iteration. Recently, we began experimenting with replacing our current method with a Bayesian inference step. Some of the data workloads in our agent are written in SQL and executed on GCP’s BigQuery engine. We chose this stack because it offers scalable computational capabilities, ML packages, and a straightforward SQL interface.

The Bayesian inference step we wanted to implement utilizes one of the alternative parametrizations of a beta distribution. This means that we need to be able to draw samples from a beta distribution in SQL. While working on this, I realized that there are very few well-documented examples on how to draw random distributions in SQL. Hence, I am documenting my findings here.

Unfortunately, BigQuery does not have a built-in beta distribution or the capability to draw samples from any random distribution. My initial approach was to define the beta distribution in SQL, set the parameters, generate a random number between 0 and 1, and calculate the value of the function. However, this approach did not work as expected.

To find a solution, I decided to ask ChatGPT for help. I asked it how to create random draws from a beta distribution in BigQuery. However, the code provided by ChatGPT had a flaw. It drew two different x values for the presumed beta distribution probability density function (PDF). I fixed the code, made some query optimizations, and sampled 1,000 values. Here is the updated SQL code:

“`sql
WITH raw_data AS (
SELECT ‘user_a’ AS contact_id, 0.674 AS probability, 72 AS lifetime_messages_received
), parameters AS (
SELECT contact_id, probability * lifetime_messages_received AS alpha, (1.0 – probability) * lifetime_messages_received AS beta, RAND() AS x
FROM raw_data
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 1000)) AS draw_id
)
SELECT contact_id, ARRAY_AGG(POW(x, alpha – 1.0) * POW(1.0 – x, beta – 1)) AS beta_x
FROM parameters
GROUP BY contact_id
“`

However, when I compared the results of the SQL code with a trusted implementation using SciPy’s `beta.rvs()` function in Python, I noticed that the distributions were different. I realized that I had missed including the scaling constant in my SQL calculation. The beta distribution has a scaling constant that depends on the gamma function, which I did not account for.

The problem was that the gamma function does not have a closed-form expression, and BigQuery does not provide an approximation for it. Therefore, I decided to switch to Python for a more efficient experimentation process. The plan was to get the implementation right in Python and then translate it to SQL. However, I still needed a way to approximate the gamma function.

In Python, I implemented a manual draw from a beta distribution using the correct constant with the help of SciPy’s gamma function. I also realized that drawing from a random distribution means sampling from the inverse cumulative distribution function (CDF), not directly from the probability density function (PDF) as I had been doing.

After correcting my approach in Python, I compared the distribution drawn using my manual method with the distribution drawn using SciPy’s `beta.rvs()` function. The two distributions matched, indicating that the manual draw was successful.

Now that I had a working implementation in Python, I could return to SQL. Since BigQuery does not readily provide an implementation of the gamma function, I decided to draw from the logistic distribution instead (with parameters a=0 and b=1) for simplicity. However, you can adjust the parameters based on the PDF support of the distribution you wish to draw from.

I developed SQL code that samples from the logistic distribution using the inverse CDF (quantile function) approach. This approach involves generating a discrete representation of the PDF, computing a discrete CDF, and then drawing random samples using the inverse CDF. The SQL code can be adapted for other distributions where you can obtain a discrete PDF representation by sampling it at consistent intervals.

In conclusion, I have successfully implemented sampling from a random variable in SQL by approximating the inverse CDF. This approach can be used for various distributions, and it offers a way to draw random samples in BigQuery when direct random distribution sampling is not available.

Source link