Hey! I'm tickled to see this on HN. I'm the author. If you have any questions, just ask. I'll do my best to answer them here.
Be careful, the weight of Algorithm A by Efraimidis and Spirakis cannot be interpreted as the inclusion probability, and thus cannot be used in survey sampling to construct the Horvitz–Thompson estimator.
See "Remarks on some misconceptions about unequal probability sampling without replacement" by Yves Tillé.
Quoted from Tillé's conclusion: "There is a multitude of correct and fast method of sampling... there is no reason to use an incorrect method like weighted random sampling where we do not control the inclusion probabilities"
It's not clear to me how easy it is to implement the "multitude of correct and fast methods" in SQL, though. I'd love to see a reference implementation.
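For what it's worth, one of the classical methods that does control inclusion probabilities, Poisson sampling, is easy to express in SQL: include each row independently with probability pi_i = n * w_i / sum(w), capped at 1, and those pi_i are exactly what a Horvitz–Thompson estimator needs. A hedged sketch (not from Tillé's paper; table and column names are made up, and note the realized sample size is random under Poisson sampling):

-- Poisson sampling sketch: each row is kept independently with a
-- known inclusion probability pi = LEAST(1, n * weight / total_w),
-- here with expected sample size n = 10. The pi column is exactly
-- the inclusion probability, so 1/pi gives Horvitz-Thompson weights.
WITH tot AS (SELECT SUM(weight) AS total_w FROM items)
SELECT items.*,
       LEAST(1.0, 10 * weight / total_w) AS pi
FROM items, tot
WHERE random() < LEAST(1.0, 10 * weight / total_w);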
If you need to "extract a small, sample dataset from a larger PostgreSQL database while maintaining referential integrity" check https://github.com/mla/pg_sample
Is there something in the SQL standard that says functions are guaranteed to be executed more than once?
I swear that once I used something like random() and it was only executed once, making it useless for the task at hand. I had to use some trick to ensure it was executed for each row.
I may have used it in the `select` part. Dialect was Oracle's, from memory.
related: https://xkcd.com/221/
Postgres makes a distinction between IMMUTABLE, STABLE, and VOLATILE functions, with volatile functions being functions that - with the same arguments - can produce different results even within the same statement. Therefore VOLATILE functions will always be executed once per call.
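To make that concrete, volatility in Postgres is part of a function's declaration. A small sketch (the table and function names are made up):

-- random() is declared VOLATILE, so Postgres re-evaluates it for
-- every candidate row of the scan:
SELECT * FROM events WHERE random() < 0.01;

-- A user-defined function declares its volatility the same way:
CREATE FUNCTION jittered(x double precision) RETURNS double precision
AS $$ SELECT x + random() $$
LANGUAGE SQL VOLATILE;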
I'm not sure if this is part of the ANSI SQL standard.
It depends on the function and the SQL implementation. You can see in this simulator that `where rand() > rand()` evaluates row by row in MySQL but only once in SQL Server, so it's easy to get this stuff messed up: even if the code looks "equivalent", it really isn't.
https://onecompiler.com/mysql/42vq8s23b https://onecompiler.com/sqlserver/42vq8tz24
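For reference, the query behind both simulator links is essentially the one below; the divergence is purely in evaluation semantics:

-- Under MySQL semantics, RAND() is re-evaluated for every row, so
-- roughly half of the rows pass the filter. Under SQL Server
-- semantics, RAND() is evaluated once for the whole query, so the
-- comparison is constant across rows: either every row passes or
-- none does.
SELECT * FROM t WHERE RAND() > RAND();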
Thanks, that's a bit upsetting :-)
Indeed.
On systems with unfortunate evaluation semantics for `RAND`, you can generate fresh random values for each row by creating a function for that purpose and calling it on the primary key of each row. I provide one example in the article at:
https://blog.moertel.com/posts/2024-08-23-sampling-with-sql....
I'll include a copy here because it's short. It's for DuckDB and was created to let us generate a controllable number of fresh random values for each row:

-- Returns a pseudorandom fp64 number in the range [0, 1). The number
-- is determined by the given `key`, `seed` string, and integer `index`.
CREATE MACRO pseudorandom_uniform(key, seed, index)
AS (
  (HASH(key || seed || index) >> 11) * POW(2.0, -53)
);
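Not from the article itself, but a sketch of how such a macro might be applied; the table and column names are made up:

-- Hypothetical usage (DuckDB): a reproducible ~1% Bernoulli sample,
-- keyed on the primary key so that rerunning the query, or changing
-- the sampling rate, selects a consistent set of rows.
SELECT *
FROM users
WHERE pseudorandom_uniform(id, 'seed-2024', 1) < 0.01;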
`HASH` looks like a slow function ... does something like `rand() + rowid & 0` or `((rand() * p53 + rowid) % p53) / p53` work?
Generally, table scans dominate the cost of sampling, so evaluating a "slow" function once per row doesn't matter. What does matter is whether you can push filtering expressions down into the scans to eliminate I/O and decoding work early. Some systems have trouble pushing down RAND efficiently, which can make alternatives like the deterministic function I shared advantageous.
Had a good laugh; in my experience this is the normal response to differences between SQL implementations.
No.