Hey! I'm tickled to see this on HN. I'm the author. If you have any questions, just ask. I'll do my best to answer them here.
Be careful, the weight of Algorithm A by Efraimidis and Spirakis cannot be interpreted as the inclusion probability, and thus cannot be used in survey sampling to construct the Horvitz–Thompson estimator.
See "Remarks on some misconceptions about unequal probability sampling without replacement" by Yves Tillé.
Quoted from Tillé's conclusion: "There is a multitude of correct and fast method of sampling... there is no reason to use an incorrect method like weighted random sampling where we do not control the inclusion probabilities"
It's not clear to me how easy it is to implement the "multitude of correct and fast methods" in SQL, though. I'd love to see a reference implementation.
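For what it's worth, one of the classical methods that does control inclusion probabilities, Poisson sampling, is easy to express in SQL: include each row independently with probability pi_i = n * w_i / sum(w), capped at 1, and those pi_i are exactly what a Horvitz–Thompson estimator needs. A hedged sketch (not from Tillé's paper; table and column names are made up, and note the realized sample size is random under Poisson sampling):

-- Poisson sampling sketch: each row is kept independently with a
-- known inclusion probability pi = LEAST(1, n * weight / total_w),
-- here with expected sample size n = 10. The pi column is exactly
-- the inclusion probability, so 1/pi gives Horvitz-Thompson weights.
WITH tot AS (SELECT SUM(weight) AS total_w FROM items)
SELECT items.*,
       LEAST(1.0, 10 * weight / total_w) AS pi
FROM items, tot
WHERE random() < LEAST(1.0, 10 * weight / total_w);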
If you need to "extract a small, sample dataset from a larger PostgreSQL database while maintaining referential integrity" check https://github.com/mla/pg_sample
Is there something in the SQL standard that says functions are guaranteed to be executed more than once?
I swear that once I used something like random() and it was only executed once, making it useless for the task at hand. I had to use some trick to ensure it was executed for each row.
I may have used it in the `select` part. Dialect was Oracle's, from memory.
related: https://xkcd.com/221/
Postgres makes a distinction between IMMUTABLE, STABLE, and VOLATILE functions, with volatile functions being functions that - with the same arguments - can produce different results even within the same statement. Therefore VOLATILE functions will always be executed once per call.
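To make that concrete, volatility in Postgres is part of a function's declaration. A small sketch (the table and function names are made up):

-- random() is declared VOLATILE, so Postgres re-evaluates it for
-- every candidate row of the scan:
SELECT * FROM events WHERE random() < 0.01;

-- A user-defined function declares its volatility the same way:
CREATE FUNCTION jittered(x double precision) RETURNS double precision
AS $$ SELECT x + random() $$
LANGUAGE SQL VOLATILE;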
I'm not sure if this is part of the ANSI SQL standard.
It depends on the function and the SQL implementation. You can see in this simulator that `where rand() > rand()` evaluates row by row in MySQL but only once in SQL Server, so it's easy to get this stuff messed up: even if the code looks "equivalent", it really isn't.
https://onecompiler.com/mysql/42vq8s23b https://onecompiler.com/sqlserver/42vq8tz24
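For reference, the query behind both simulator links is essentially the one below; the divergence is purely in evaluation semantics:

-- Under MySQL semantics, RAND() is re-evaluated for every row, so
-- roughly half of the rows pass the filter. Under SQL Server
-- semantics, RAND() is evaluated once for the whole query, so the
-- comparison is constant across rows: either every row passes or
-- none does.
SELECT * FROM t WHERE RAND() > RAND();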
Thanks, that's a bit upsetting :-)
Indeed.
On systems with unfortunate evaluation semantics for `RAND`, you can generate fresh random values for each row by creating a function for that purpose and calling it on the primary key of each row. I provide one example in the article at:
https://blog.moertel.com/posts/2024-08-23-sampling-with-sql....
I'll include a copy here because it's short. It's for DuckDB and was created to let us generate a controllable number of fresh random values for each row:

-- Returns a pseudorandom fp64 number in the range [0, 1). The number
-- is determined by the given `key`, `seed` string, and integer `index`.
CREATE MACRO pseudorandom_uniform(key, seed, index)
AS (
  (HASH(key || seed || index) >> 11) * POW(2.0, -53)
);
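Not from the article itself, but a sketch of how such a macro might be applied; the table and column names are made up:

-- Hypothetical usage (DuckDB): a reproducible ~1% Bernoulli sample,
-- keyed on the primary key so that rerunning the query, or changing
-- the sampling rate, selects a consistent set of rows.
SELECT *
FROM users
WHERE pseudorandom_uniform(id, 'seed-2024', 1) < 0.01;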
`HASH` looks like a slow function ... does something like `rand() + rowid & 0` or `((rand() * p53 + rowid) % p53) / p53` work?
Generally, table scans dominate the cost of sampling, so evaluating a "slow" function once per row doesn't matter. What does matter is whether you can push filtering expressions down into the scans to eliminate I/O and decoding work early. Some systems have trouble pushing down RAND efficiently, which can make alternatives like the deterministic function I shared advantageous.
Had a good laugh; in my experience this is the normal response to differences between SQL implementations.
No.