I found that the paper takes about as long to read as the blog post and is more informative:
https://arxiv.org/pdf/2301.10191
It is about estimating the number of distinct elements in a stream. The algorithm is so simple that you can code it and play with it whilst you read the paper.
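Here's a minimal sketch of my reading of the algorithm, in Python (the buffer size s and the failure handling are my own choices for illustration, not the authors' reference code):

    import random

    def cvm_estimate(stream, s=1000):
        """Estimate the number of distinct elements in `stream`,
        storing at most `s` elements at any one time."""
        p = 1.0    # current sampling probability
        X = set()  # reservoir of sampled elements
        for a in stream:
            X.discard(a)               # X = X \ {a_i}, whether or not a_i is present
            if random.random() < p:
                X.add(a)               # keep a_i with probability p
            if len(X) == s:            # reservoir full: thin it out
                X = {x for x in X if random.random() < 0.5}
                p /= 2
                if len(X) == s:        # the paper outputs ⊥ (failure) here
                    raise RuntimeError("estimation failed")
        return len(X) / p

Comparing cvm_estimate(data) against len(set(data)) for various s is an easy way to see the space/accuracy trade-off while reading.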
The authors are explicit about the target audience and purpose for the algorithm: undergraduates and textbooks.
If you refer to the subtitle of the paper - An Algorithm for the (Text) Book - I think that is actually a reference to something Paul Erdős allegedly said: that some proofs are so elegant in their simplicity and beauty that they are "from The Book", as if representing some divine Platonic ideal.
Given that Knuth himself reviewed it, he might have remarked that this was one of those algorithms! Perhaps the authors decided to include it in the title as a not-so-humble brag (which would be well-earned if that's the case!)
edit: originally this comment said Knuth was the one who said this about some algorithms being from The Book, but that was my faulty memory.
From the abstract: "... All the current state-of-the-art algorithms are, however, beyond the reach of an undergraduate textbook owing to their reliance on the usage of notions such as pairwise independence and universal hash functions. We present a simple, intuitive, sampling-based space-efficient algorithm whose description and the proof are accessible to undergraduates with the knowledge of basic probability theory ...."
This is really a stretch; I don't find pairwise independence to be more advanced than applying a Chernoff bound. In this problem it'd mostly be the difference between using a Chebyshev-type bound and a Chernoff bound.
Pairwise independence is there to give the algorithm stronger guarantees and let it work with a simple hash function like ax+b; otherwise most existing algorithms could probably be simplified in the same way. The most similar algorithm is BJKST, which is almost identical except for specifying a sampling mechanism that requires less randomness.
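To make the comparison concrete, these are the two tail bounds in question (standard forms quoted from memory; for a sum Z of indicator variables with mean μ, and 0 < ε ≤ 1):

    % Chebyshev: needs only pairwise independence, via the variance
    \Pr[\,|Z - \mu| \ge \varepsilon\mu\,] \le \frac{\operatorname{Var}(Z)}{\varepsilon^2\mu^2}

    % Chernoff: needs mutually independent summands, but decays exponentially
    \Pr[\,|Z - \mu| \ge \varepsilon\mu\,] \le 2\exp\!\left(-\frac{\varepsilon^2\mu}{3}\right)

Pairwise independence is enough to control Var(Z), which is why the weaker bound still goes through with simple hash families.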
To someone who has worked on this type of thing before, it's like seeing people familiar with LLMs say "oh, yet another X-billion-parameter model on GitHub doing more or less the same thing".
The point is that the subtitle is pretty clearly intended to have a dual meaning; it wouldn't be phrased like that otherwise.
I thought The Book was an Erdos thing. I wonder who used it first.
I think you're right, I must have confused the two. I'll edit my comment to reduce the spread of misinformation.
"During a lecture in 1985, Erdős said, `You don't have to believe in God, but you should believe in The Book.`"
https://en.wikipedia.org/wiki/Proofs_from_THE_BOOK
I liked this part. They got Knuth to review it, and found mistakes. That's kind of cool, in its own way.
If you're saying it's just for "undergraduates and textbooks", as opposed to just being simple enough for them to use but not limited to them, would you mind explaining what makes it useful for undergrads but not for professionals?
From the abstract: "All the current state-of-the-art algorithms are, however, beyond the reach of an undergraduate textbook owing to their reliance on the usage of notions such as pairwise independence and universal hash functions. We present a simple, intuitive, sampling-based space-efficient algorithm whose description and the proof are accessible to undergraduates with the knowledge of basic probability theory."
That still only speaks to it being simple enough for students, not whether it's too simple for any other use vs. useful enough that students who learn it will spend the rest of their lives using it.
For example, word-processor software is commonly described as simple enough for children to use at school; that doesn't mean word processors are of no use to adults.
The reason it's too simple for most real-world use is that HyperLogLog is the "good" version of this technique (though it is harder to prove that it works).
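For the curious, the core of HyperLogLog fits in a few lines too; here's a toy sketch (it omits HLL's small- and large-range corrections, and the hash and register count are arbitrary choices of mine):

    import hashlib

    def hll_estimate(stream, b=10):
        m = 1 << b      # number of registers (2^b)
        M = [0] * m
        for item in stream:
            h = int.from_bytes(
                hashlib.blake2b(str(item).encode(), digest_size=8).digest(), "big")
            j = h & (m - 1)                      # low b bits pick a register
            w = h >> b                           # remaining 64-b bits
            rho = (64 - b) - w.bit_length() + 1  # position of the leftmost 1-bit
            M[j] = max(M[j], rho)
        alpha = 0.7213 / (1 + 1.079 / m)         # bias correction, valid for m >= 128
        return alpha * m * m / sum(2.0 ** -x for x in M)

The hard part isn't the code; it's the analysis, which is exactly where CVM wins on simplicity.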
My interpretation from the paper is that this algorithm is simpler than the other options but also worse, so in a professional context you'd use one of those instead.
I agree the paper is better than the blog post. One criticism I have of the CVM paper, though, is that it has a termination/early-exit condition, whereas Knuth's CVM notes (referenced elsewhere in this thread) instead just loop the halving step until the reservoir has room again. It seems like more work to explain the https://en.wikipedia.org/wiki/Up_tack than to just do the loop (a sketch of the difference follows below). [1]
[1] https://news.ycombinator.com/item?id=40388878
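Concretely, the loop variant I mean looks something like this (a sketch of my reading of Knuth's notes, not his actual pseudocode; the names are mine):

    import random

    def cvm_estimate_loop(stream, s=1000):
        p, X = 1.0, set()
        for a in stream:
            X.discard(a)                # X = X \ {a_i}
            if random.random() < p:
                X.add(a)
            while len(X) == s:          # halve until there is room: no ⊥ case
                X = {x for x in X if random.random() < 0.5}
                p /= 2
        return len(X) / p

No failure symbol to explain, at the cost of a slightly different analysis.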
On that note, I'm also unfamiliar with this \ operator notation, which is used without explanation.
Set difference.
Set X becomes X without element a_i. This holds whether or not a_i was in the set X before the step was taken.
That is conventional set subtraction notation. "Assign to X the same set minus all elements of the set {a_i}".
One example source, but it is pretty common in general: http://www.mathwords.com/s/set_subtraction.htm
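In code it's just a removal that is a no-op when the element is absent, e.g. in Python:

    X = {1, 2, 3}
    X = X - {2}   # set difference: X is now {1, 3}
    X = X - {9}   # subtracting an absent element leaves X unchanged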
I've known that symbol for decades but never knew its name - up-tack it is. Ta!
Doesn't seem like it. Seems like an algorithm (similar to other approximate cardinality estimation algorithms) with huge applicability.
From the abstract: "All the current state-of-the-art algorithms are, however, beyond the reach of an undergraduate textbook owing to their reliance on the usage of notions such as pairwise independence and universal hash functions. We present a simple, intuitive, sampling-based space-efficient algorithm whose description and the proof are accessible to undergraduates with the knowledge of basic probability theory."
That just says that this is also a simpler, more accessible algorithm, suitable even for undergraduate textbooks.
Not that it is useful only for textbooks, as a mere academic toy example - that would be something else entirely.
This is both accessible AND efficient.
The blog post was more than half padding. Good that the algorithm is so simple that it's hard to write a full-length blog post about it!
Link to abstract: https://arxiv.org/abs/2301.10191