Computer scientists invent an efficient new way to count

usgroup
23 replies
10h32m

I found the paper took about as long to read as the blog post and is more informative:

https://arxiv.org/pdf/2301.10191

It is about estimating the cardinality of a set of elements derived from a stream. The algorithm is so simple, you can code it and play with it whilst you read the paper.

The authors are explicit about the target audience and purpose for the algorithm: undergraduates and textbooks.

vanderZwan
7 replies
9h47m

If you refer to the subtitle of the paper - An Algorithm for the (Text) Book - I think that is actually a reference to something Paul Erdős allegedly said: that some proofs are so elegant in their simplicity and beauty that they are "from The Book", as if representing some divine Platonic ideal.

Given that Knuth himself reviewed it, he might have remarked that this was one of those algorithms! Perhaps the authors decided to include it in the title as a not-so-humble brag (which would be well-earned if that's the case!)

edit: originally this comment said Knuth was the one who said this about some algorithms being from The Book, but that was my faulty memory.

usgroup
2 replies
8h6m

From the abstract: "... All the current state-of-the-art algorithms are, however, beyond the reach of an undergraduate textbook owing to their reliance on the usage of notions such as pairwise independence and universal hash functions. We present a simple, intuitive, sampling-based space-efficient algorithm whose description and the proof are accessible to undergraduates with the knowledge of basic probability theory ...."

dchftcs
0 replies
5h56m

This is really twisting it; I don't find pairwise independence to be more advanced than applying a Chernoff bound. In this problem it'd mostly be the difference between using a Chebyshev-type bound or a Chernoff bound.

Pairwise independence is there to give the algorithm stronger guarantees and let it work with a simple hash function like ax+b; otherwise probably most existing algorithms can be simplified in the same way. The most similar algorithm is BJKST, which is almost identical except for specifying the sampling mechanism to require less randomness.

To someone who worked on this type of thing before, it's like seeing people familiar with LLMs say "oh yet another X-billion parameter model on github doing more or less the same".

Sharlin
0 replies
6h7m

The point is that the subtitle is pretty clearly intended to have a dual meaning; it wouldn't be phrased like that otherwise.

kibibu
2 replies
9h30m

I thought The Book was an Erdos thing. I wonder who used it first.

vanderZwan
0 replies
8h43m

I think you're right, I must have confused the two. I'll edit my comment to reduce the spread of misinformation.

cschmidt
0 replies
2h27m

I liked this part. They got Knuth to review it, and found mistakes. That's kind of cool, in its own way.

    We are deeply grateful to Donald E. Knuth for his thorough review, 
    which not only enhanced the quality of this paper (including fixing
    several errors) but has also inspired us for higher standards.

swores
4 replies
9h4m

"The authors are explicit about the target audience and purpose for the algorithm: undergraduates and textbooks."

If you're saying it's just for "undergraduates and textbooks", as opposed to just being simple enough for them to use but not limited to them, would you mind explaining what makes it useful for undergrads but not for professionals?

usgroup
2 replies
8h4m

From the abstract: "All the current state-of-the-art algorithms are, however, beyond the reach of an undergraduate textbook owing to their reliance on the usage of notions such as pairwise independence and universal hash functions. We present a simple, intuitive, sampling-based space-efficient algorithm whose description and the proof are accessible to undergraduates with the knowledge of basic probability theory."

swores
1 replies
8h0m

That still only speaks to it being simple enough for students, not whether it's too simple for any other use vs. useful enough that students who learn it will spend the rest of their lives using it.

For example, word processor software is commonly described as simple enough for children to use at school; that doesn't mean word processor software is of no use to adults.

adgjlsfhk1
0 replies
4h49m

the reason it's too simple for most real-world use is that HyperLogLog is the "good" version of this technique (but it's harder to prove that it works)

pfsalter
0 replies
8h12m

My interpretation from the paper is that this algorithm is simpler than other options but also worse. So in a professional context you'd use one of those instead

cb321
4 replies
5h52m

I agree the paper is better than the blog post, although one criticism I have of the CVM paper is that it uses a termination/exit condition where Knuth's CVM notes (referenced else-thread here) instead use a loop in the halving step to ensure space actually gets freed in the reservoir. It seems like more work to explain the https://en.wikipedia.org/wiki/Up_tack than to just do the loop. [1]

[1] https://news.ycombinator.com/item?id=40388878
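
For concreteness, here's a minimal Python sketch of that loop-based variant (my paraphrase of the idea, not Knuth's exact pseudocode; the threshold is passed in directly rather than derived from epsilon/delta):

    import random

    def cvm_loop_variant(stream, thresh=1000):
        # Same buffer-and-halve idea as CVM, but the halving step loops until the
        # buffer is strictly below the threshold, instead of a single purge that
        # returns the "up tack" failure symbol if nothing happened to be removed.
        p = 1.0
        X = set()
        for a in stream:
            X.discard(a)
            if random.random() < p:
                X.add(a)
            while len(X) >= thresh:
                X = {x for x in X if random.random() < 0.5}
                p /= 2
        return len(X) / p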

imglorp
2 replies
4h40m

On that note, I'm also unfamiliar with this \ operator notation which is used without explanation.

    X ← X \ {ai}

rocqua
0 replies
4h26m

Set difference.

Set X becomes X without element ai. This is the case whether or not ai was in the set X before the step was taken.
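
In Python terms, with a made-up X and ai:

    X = {1, 2, 3}
    ai = 2
    X = X - {ai}    # set difference: X is now {1, 3}
    X = X - {99}    # removing an element that isn't there is a no-op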

jtanderson
0 replies
4h29m

That is conventional set subtraction notation. "Assign to X the same set minus all elements of the set {a_i}".

One example source, but it is pretty common in general: http://www.mathwords.com/s/set_subtraction.htm

_a_a_a_
0 replies
1h39m

I've known that symbol for decades, never knew its name - up-tack it is. Ta!

coldtea
2 replies
9h13m

The authors are explicit about the target audience and purpose for the algorithm: undergraduates and textbooks.

Doesn't seem like it. Seems like an algorithm (similar to other approximate cardinality estimation algorithms) with huge applicability.

usgroup
1 replies
8h1m

From the abstract: "All the current state-of-the-art algorithms are, however, beyond the reach of an undergraduate textbook owing to their reliance on the usage of notions such as pairwise independence and universal hash functions. We present a simple, intuitive, sampling-based space-efficient algorithm whose description and the proof are accessible to undergraduates with the knowledge of basic probability theory."

coldtea
0 replies
5h30m

That just says that this is also a simpler and more accessible algorithm, suitable even for undergraduate textbooks.

Not that this is just useful for textbooks, a mere academic toy example, which would be something else entirely.

This is both accessible AND efficient.

resonious
0 replies
7h11m

The blog post was more than half padding. Good that the algorithm is so simple it's hard to write a full length blog post about it!

mudiadamz
22 replies
5h23m

Python implementation:

  def streaming_algorithm(A, epsilon, delta):
      # Initialize parameters
      p = 1
      X = set()
      thresh = math.ceil((12 / epsilon ** 2) * math.log(8 * len(A) / delta))

      # Process the stream
      for ai in A:
          if ai in X:
              X.remove(ai)
          if random.random() < p:
              X.add(ai)
          if len(X) == thresh:
              X = {x for x in X if random.random() >= 0.5}
              p /= 2
              if len(X) == thresh:
                  return '⊥'

      return len(X) / p


  # Example usage
  A = [1, 2, 3, 1, 2, 3]
  epsilon = 0.1
  delta = 0.01

  output = streaming_algorithm(A, epsilon, delta)
  print(output)

planede
9 replies
5h16m

  return '⊥'

what's this?

jbaber
4 replies
5h10m

In some symbolic logic classes, that character, "bottom", represents "false" and the flipped version, "top", means true.

Don't know what they're getting at in the code, though.

hollerith
2 replies
4h55m

In some symbolic logic classes, that character "bottom" represents "false"

That's unfortunate, because in the study of computer programming languages, it means "undefined" (raise an error).

tsss
1 replies
3h40m

Not always. It is also the uninhabited bottom type.

hollerith
0 replies
2h56m

My point is that there is a difference between a Python function's returning false and the function's raising an error, and sometimes the difference really matters, so it would be regrettable if logic teachers actually did use ⊥ to mean false because programming-language theorists use it to mean something whose only reasonable translation in the domain of practical programming is to raise an error.

I have no idea what your point is.

dentemple
0 replies
5h0m

Once again proving the need for comments in code. Especially for comments that are more useful than "initialize parameters"

martinsnow
2 replies
5h7m

An easy way to identify who copies code without understanding it.

mudiadamz
1 replies
5h1m

You can just replace it with something like: print ('Invalid thresh or something')

martinsnow
0 replies
1h19m

This however looks scary so an innocent copy/paste programmer wouldn't touch it.

rcarmo
0 replies
4h12m

An error condition. I decided to do away with it and take a small hit on the error by assuming the chances of the trimmed set being equal to the threshold are very small and that the error condition is effectively doing nothing.

I also changed the logic from == to >= to trigger unfailingly, and pass in the "window"/threshold to allow my code to work without internal awareness of the length of the iterable:

    from random import random

    def estimate_uniques(iterable, window_size=100):
        p = 1
        seen = set()

        for i in iterable:
            if i not in seen:
                seen.add(i)
            if random() > p:
                seen.remove(i)
            if len(seen) >= window_size:
                seen = {s for s in seen if random() < 0.5}
                p /= 2
        return int(len(seen) / p)
I also didn't like the possible "set thrashing" when an item is removed and re-added for high values of p, so I inverted the logic. This should work fine for any iterable.

ericmcer
6 replies
4h55m

I don't think there is a single variable name or comment in this entire code block that conveys any information. Name stuff well! Especially if you want random strangers to gaze upon your code in wonder.

foobarian
2 replies
4h45m

Speaking of, one of my favorite discoveries with Unicode is that there is a ton of code points acceptable for symbol identifiers in various languages that I just can't wait to abuse.

>> ᚨ=3

>> ᛒ=6

>> ᚨ+ᛒ

9

pragma_x
0 replies
1h56m

Ah yes, programming like the vikings intended.

8372049
0 replies
2h30m

You're even using a and b, very good.

tgv
1 replies
4h40m

The names are literally taken from the paper.

seaman1921
0 replies
1h25m

well the paper also contains the code so I doubt anyone who looked at the paper cares about this paste - for folks who did not read the paper this is not very readable

sneva
0 replies
4h37m

Name stuff well

OP is following the same variable names as the article. I prefer that over changing the variable names and then having to figure out which variable in the code maps to which one in the article.

rcarmo
2 replies
4h41m

That's not streaming if you're already aware of the length of the iterable.

axitanull
1 replies
1h50m

In python, you can simply substitute `A` with an iterable or generator object, which can be of unknown length.

mattkrause
0 replies
55m

But for this algorithm, you need to know the total length ("m") to set the threshold for the register purges.

Does it still work if you update m as you go?

sestep
1 replies
5h1m

Is this ChatGPT? Also, I feel like this would be more useful if it included import statements.

mudiadamz
0 replies
5h0m

  import math
  import random

melq
15 replies
10h25m

Estimating the number of unique elements in a set and counting the number of unique elements in a set are very different things. Cool method, bad headline.

dools
8 replies
9h44m

It's an approximation, not an estimation.

ranguna
2 replies
9h36m

Still very different things, no?

coldtea
1 replies
9h10m

It's the same thing at different degrees of accuracy. The goal is the same.

amelius
0 replies
7h58m

Still, counting things and counting unique things are two different procedures.

chrisweekly
2 replies
6h2m

For someone who's pretty well-versed in English, but not a math-oriented computer scientist, this seems like a distinction without a difference. Please remedy my ignorance.

lupire
1 replies
5h40m

My GP was wrong, but the words are different.

Estimation is a procedure that generates an estimate, which is a kind of approximation, while an approximation is a result value. They are different "types", as a computer scientist would say. An approximation is any value that is justifiably considered to be nearly exact. ("prox" means "near"; see also "proximate" and "proxy".)

Estimation is one way to generate an approximation. An estimate is a subtype of an approximation. There are non-estimation ways to generate an approximation. For example, take an exact value and round it to the nearest multiple of 100. That generates an approximation, but does not use estimation.

jameshart
0 replies
4h49m

I’m not sure the linguistic differences here are as cut and dried as you would like them to be. Estimate and approximate are both verbs, so you can derive nouns from them both for the process of doing the thing, and for the thing that results from such a process.

Estimation is the process of estimating. It produces an estimate.

Approximation is the process of approximating. It produces an approximation.

You can also derive adjectives from the verbs as well.

An estimate is an estimated value.

An approximation is an approximate value.

But you’re right that the ‘approximate’ terms make claims about the result - that it is in some way near to the correct value - while the ‘estimate’ derived terms all make a claim about the process that produced the result (ie that it was based on data that is known to be incomplete, uncertain, or approximate)

davidmurdoch
0 replies
6h34m

The authors of the article disagree with you.

blackkettle
0 replies
8h23m

Actually, my understanding is that it is an estimation because in the given context we don't know or cannot compute the true answer due to some kind of constraint (here memory or the size of |X|). An approximation is when we use a simplified or rounded version of an exact number that we actually know.

jameshart
5 replies
5h53m

They’re not very different things; the terms are used interchangeably in most contexts because in the real world all counting methods have some nonzero error rate.

We talk about ‘counting votes’ in elections, for example, yet when things are close we perform ‘recounts’ which we fully expect can produce slightly different numbers than the original count.

That means that vote counting is actually vote estimating, and recounting is just estimating with a tighter error bound.

I kind of think the mythology of the ‘countless stones’ (https://en.wikipedia.org/wiki/Countless_stones) is a sort of folk-reminder that you can never be too certain that you counted something right. Even something as big and solid and static as a standing stone.

The situations where counting is not estimating are limited to the mathematical, where you can assure yourself of exhaustively never missing any item or ever mistaking one thing’s identity for another’s.

lupire
3 replies
5h49m

Come on. There is a fundamental difference between trying to get an exactly correct answer and not trying to get one.

jameshart
2 replies
5h33m

It’s not a fundamental difference, it’s a fundamental constraint.

There are circumstances - and in real life those circumstances are very common - where you must accept that getting an exactly correct answer is not realistic. Yet nonetheless you want to ‘count’ things anyway.

We still call procedures for counting things under those circumstances ‘counting’.

The constraints on this problem (insufficient memory to remember all the unique items you encounter) are one such situation where even computerized counting isn’t going to be exact.

Levitz
1 replies
3h9m

I agree with you, but we are talking theory here. The algorithm doesn't count, it estimates.

You can make an algorithm that counts, you can make an algorithm that estimates, this is the second.

sangnoir
0 replies
1h21m

Estimation is counting with error bars.

Frankly, most of what you consider counting in your comment needs error bars - ask anyone who operated an all-cash cash-register how frequently end-of-day reconciliation didn't match the actual cash in the drawer (to the nearest dollar.)

The following is a list from my personal experience - of presumably precisely countable things that didn't turn out to be the case: the number of computers owned by an fairly large regional business, the number of (virtual) servers operated by a moderately sized team, the number of batteries sold in a financial year by a battery company.

samatman
0 replies
24m

the terms are used interchangeably in most contexts

Counting and estimating are not used interchangeably in most contexts.

because in the real world all counting methods have some nonzero error rate.

The possibility that the counting process may be defective does not make it an estimation.

We talk about ‘counting votes’ in elections, for example, yet when things are close we perform ‘recounts’ which we fully expect can produce slightly different numbers than the original count.

We talk about counting votes in elections because votes are counted. The fact that the process isn't perfect is a defect; this does not make it estimation.

That means that vote counting is actually vote estimating, and recounting is just estimating with a tighter error bound.

No. Exit polling is estimation. Vote counting is counting. Vote recounting is also counting, and does not necessarily impose a tighter error bound, nor necessarily derive a different number.

The situations where counting is not estimating are limited to the mathematical, where you can assure yourself of exhaustively never missing any item or ever mistaking one thing’s identity for another’s.

So like, computers? Regardless, this is wrong. Estimating something and counting it are not the same thing. Estimation has uncertainty, counting may have error.

This is like saying addition estimates a sum because you might get it wrong. It's just not true.

pixelmonkey
10 replies
11h38m

This algorithm seems to resemble HyperLogLog (and all its variants), which is also cited in the research paper. It uses the same insight, the estimation value of tracking whether we've hit a "run" of heads or tails, but flips the idea on its head (heh), which leads to the simpler algorithm described here: discarding memorized values on the basis of runs of heads/tails.

This also works especially well (that is, efficiently) in the streaming case, allowing you to keep something resembling a "counter" for the distinct elements, albeit with an error rate.

The benefit of HyperLogLog is that it behaves similarly to a hash set in some respects -- you can add items, count distinct them, and, importantly, merge two HLLs together (union), all the while keeping memory fixed to mere kilobytes even for billion-item sets. In distributed data stores, this is the trick behind Elasticsearch/OpenSearch cardinality agg, as well as behind Redis/Redict with its PFADD/PFMERGE/PFCOUNT.
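
For anyone who hasn't looked inside an HLL, here's a toy Python sketch of the register/merge idea (my own simplification, not a production implementation: the choice of b=12 is arbitrary and it skips the small-range and large-range corrections a real HLL applies, so it's badly biased for small counts):

    import hashlib

    def hash64(x):
        return int.from_bytes(hashlib.blake2b(repr(x).encode(), digest_size=8).digest(), "big")

    class ToyHLL:
        def __init__(self, b=12):
            self.b = b
            self.m = 1 << b              # number of registers (here 4096)
            self.reg = [0] * self.m

        def add(self, x):
            h = hash64(x)
            j = h >> (64 - self.b)                        # top b bits pick a register
            rest = h & ((1 << (64 - self.b)) - 1)         # remaining 64-b bits
            rank = (64 - self.b) - rest.bit_length() + 1  # leading zeros + 1
            self.reg[j] = max(self.reg[j], rank)

        def merge(self, other):
            # Union of the underlying sets: element-wise max of registers
            # (assumes both sketches were built with the same b).
            self.reg = [max(a, b) for a, b in zip(self.reg, other.reg)]

        def estimate(self):
            alpha = 0.7213 / (1 + 1.079 / self.m)         # bias constant for large m
            return alpha * self.m ** 2 / sum(2.0 ** -r for r in self.reg)

The merge being a plain element-wise max of the registers is what makes per-shard sketches so cheap to combine.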

I am not exactly sure how this CVM algorithm compares to HLL, but they got Knuth to review it, and they claim an undergrad can implement it easily, so it must be pretty good!

j-pb
4 replies
2h35m

It's a really interesting open problem to get the cost of these down so that they can be used to heuristically select the variable order for worst case optimal joins during evaluation.

It's somewhere on the back of my todo list, and I have the hunch that it would enable instance optimal join algorithms.

I've dubbed these the Atreides Family of Joins:

  - Jessicas Join: The cost of each variable is based on the smallest number of rows that might be proposed for that variable by each joined relation.
  - Pauls join: The cost of each variable is based on the smallest number of distinct values that will actually be proposed for that variable from each joined relation.
  - Letos join: The cost of each variable is based on the actual size of the intersection.
In a sense each of the variants can look further into the future.

I'm using the first and the second in a triplestore I built in Rust [1] and it's a lot faster than Oxigraph. But I suspect that the constant factors would make the third infeasible (for now).

1: https://github.com/triblespace/tribles-rust/blob/master/src/...

refset
3 replies
2h10m

Having read something vaguely related recently [0] I believe "Lookahead Information Passing" is the common term for this general idea. That paper discusses the use of bloom filters (not HLL) in the context of typical binary join trees.

Letos join

God-Emperor Join has a nice ring to it.

[0] "Simple Adaptive Query Processing vs. Learned Query Optimizers: Observations and Analysis" - https://www.vldb.org/pvldb/vol16/p2962-zhang.pdf

j-pb
2 replies
1h48m

Thanks for the interesting paper!

  We now formally define our _God-Emperor Join_ henceforth denoted join_ge...
Nice work with TXDB btw, it's funny how much impact Clojure, Datomic and Datascript had outside their own ecosystem!

Let me return the favour with an interesting paper [1] that should be especially relevant to the columnar data layout of TXDB. I'm currently building a succinct on-disk format with it [2], but you might be able to simply add some auxiliary structures to your arrow columns instead.

1: https://aidanhogan.com/docs/ring-graph-wco.pdf

2: https://github.com/triblespace/tribles-rust/blob/archive/src...

refset
1 replies
1h24m

> Nice work with TXDB btw

It's X.T. (as in 'Cross-Time' / https://xtdb.com), but thank you! :)

> 1: https://aidanhogan.com/docs/ring-graph-wco.pdf

Oh nice, I recall skimming this team's precursor paper "Worst-Case Optimal Graph Joins in Almost No Space" (2021) - seems like they've done a lot more work since though, so definitely looking forward to reading it:

> The conference version presented the ring in terms of the Burrows–Wheeler transform. We present a new formulation of the ring in terms of stable sorting on column databases, which we hope will be more accessible to a broader audience not familiar with text indexing

j-pb
0 replies
1h4m

My apologies! It's even more embarrassing given the fact that I looked up the website, knowing that I _always_ swap them after having written too many `tx-listen` in my career.

They expanded their work to wider relations and made the whole framework a lot more penetrable. I think they over-complicate things a bit with their forward-extension, so I'm keeping every column twice (still much better than all permutations), which in turn allows for ad-hoc cardinality estimation.

Also the 1-based indexing with inclusive ranges is really not doing them any favours. Most formulas become much more streamlined and simpler with 0-based indexing and exclusive ranges. (see `base_range` and `restrict_range` in [0])

0: https://github.com/triblespace/tribles-rust/blob/e3ad6f21cdc...

jalk
0 replies
2h16m

Iirc intersection requires the HLLs to have similar cardinality, otherwise the result is way off.

willvarfar
2 replies
6h14m

Just curious, dusting off my distant school memories :)

How do the HLL and CVM that I hear about relate to reservoir sampling which I remember learning?

I once had a job at a hospital (back when 'whiz kids' were being hired by pretty much every business) where I used reservoir sampling to make small subsets of records that were stored on DAT tapes.

michaelmior
1 replies
4h38m

I guess there is a connection in the sense that with reservoir sampling, each sample observed has an equal chance of remaining when you're done. However, if you have duplicates in your samples, traditional algorithms for reservoir sampling do not do anything special with duplicates. So you can end up with duplicates in your output with some probability.
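
For comparison, here's a minimal sketch of classic reservoir sampling (Algorithm R), where every item of the stream ends up in the sample with probability k/n without knowing the stream length n up front:

    import random

    def reservoir_sample(stream, k):
        sample = []
        for i, x in enumerate(stream):
            if i < k:
                sample.append(x)              # fill the reservoir first
            else:
                j = random.randrange(i + 1)   # uniform index in 0..i
                if j < k:
                    sample[j] = x             # replace a random slot
        return sample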

I guess maybe it's more interesting to look at the other way. How is the set of samples you're left with at the end of CVM related to the set of samples you get with reservoir sampling?

_a_a_a_
0 replies
1h42m

Was wondering the same, thanks for an answer.

imoverclocked
9 replies
11h26m

When do we stop calling this counting and start calling it estimation?

vlovich123
4 replies
11h13m

As soon as people start reading past the headline.

112233
3 replies
11h3m

tbh, the title (and introduction) did a lot to dissuade me from finishing the (really good) article. It was actually informative; why dress it up as SEO blogspam?

seanhunter
0 replies
10h47m

Presumably so it is optimised for search engines and people find it.

Publishers generally have good data about where their audience comes from. They wouldn't do this if it wasn't the best way they know of to maximise readership.

lupire
0 replies
5h50m

It's Quanta. Their mission is to make laypeople like math (not understand math), so they drown the math in sugar.

HDThoreaun
0 replies
10h43m

So they can get paid for their work? Are you giving them anything?

pmontra
1 replies
11h15m

Yes, even the subtitle states "a simple algorithm for estimating large numbers of distinct objects".

cubefox
0 replies
8h54m

This doesn't excuse the title though.

card_zero
0 replies
10h35m

Seems this is one of those things like UUIDs where we rely on it being very unlikely to be wrong, because statistics.

the accuracy of this technique scales with the size of the memory.

I wonder if that's proportional to the number of distinct items to count, though.

if the [memory] is so big that it fits all the words, then we can get 100% accuracy

Yes, but then the algorithm isn't being used any more, that's just normal counting.

They counted the distinct words in Hamlet with a memory size of 100 words, about 2.5% of the number to find, and got a result that was off by 2. If you do the same with the whole of Shakespeare, again using 2.5% of the memory needed to hold all the distinct words, is the accuracy better?

Anyway, this is limited to counting, and doesn't help list what the words are, though quickly counting them first is perhaps a way to speed up the task of actually finding them?

BlackFly
0 replies
8h29m

When it is actually impossible to count something and when the error between estimation and an exact answer is not significant the pedantic distinction is not helpful.

The same thing happens with measurement. No measurement is ever exact. If I said I was measuring a count, someone would probably correct to say that I am counting.

Common speech is like that.

drycabinet
7 replies
9h6m

Does finding the number of unique elements in a set actually require comparison of each element with everything else? Can't you use a hashtable? For every element, add it to the table (ignore if already exists), and finally, take a count of keys.
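
I.e., something like this (Python sketch), where the catch is just that the set grows with the number of distinct elements:

    def exact_distinct_count(stream):
        seen = set()        # memory grows with the number of distinct elements
        for x in stream:
            seen.add(x)     # duplicates are ignored automatically
        return len(seen)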

xwolfi
1 replies
8h53m

Imagine a million elements. How big must your hashtable be ? The article explains it very well, did you miss it ? It's a way to save memory.

But to be honest I implemented it, ran it on Hamlet, and it's very wrong, it's barely useful but maybe if you just need a vague idea...

Anduia
1 replies
8h40m

Using a hashtable is effective because you only compare elements within their hash buckets, not the entire set. However, they can become inefficient with very large datasets due to memory usage and processing time, which is where approximate counts shine.

datavirtue
0 replies
5h50m

This algorithm is still spinning a lot of random. I would guess that this is much less overhead than hashing but still seems like it could be significant.

theginger
0 replies
8h50m

That is fine when you have say 1 million values and only 1000 are unique. But when you have 1 million values and about 900 thousand are unique you are putting more or less the whole data set into memory.

jangxx
0 replies
8h54m

Using a hashtable is the "normal" approach mentioned in the article. It works of course, but requires memory to store each unique element (or their hashes). If you have less memory available, the described algorithm can still give a very good approximation.

ReleaseCandidat
0 replies
8h10m

Does finding the number of unique elements in a set

That's easy, that's all of them. Sorry, could not resist.

Yes, hashing is the usual method. In a sorted list you can compare to the following element.

pytness
6 replies
5h4m

Is it me or is the description of the algo wrong?

    > Round 1. Keep going through Hamlet, adding new words as you go. If you come to a word that’s already on your list, flip a coin again. If it’s tails, delete the word;
If I follow this description of "check if exists in list -> delete":

    if hash_set.contains(word) {
        if !keep_a_word(round) {
            hash_set.remove(word);
            continue;
        }
    } else {
        hash_set.insert(word.to_string());
    }
The algorithm runs for ~20 iterations:

    Total word count: 31955 | limit: 1000
    End Round: 20, word count: 737
    Unique word count: 7233
    Estimated unique word count: 772800512
But if I save the word first and then delete the same word:

    hash_set.insert(word.to_string());

    if !keep_a_word(round) {
        hash_set.remove(word);
        continue;
    }
It gets the correct answer:

    Total word count: 31955 | 1000
    End Round: 3, word count: 905
    Unique word count: 7233
    Estimated unique word count: 7240

josephernest
3 replies
4h45m

I got the same problem.

When implementing the exact method as described in quanta magazine (without looking at the arxiv paper), I always had estimates like 461746372167462146216468796214962164.

Then after reading the arxiv paper, I got the correct estimate, with this code (very close to mudiadamz's comment solution):

    import numpy as np
    L = np.random.randint(0, 3900, 30557)
    print(f"{len(set(L))=}")
    thresh = 100
    p = 1
    mem =  set()  
    for k in L:
        if k in mem:
            mem.remove(k)
        if np.random.rand() < p:
            mem.add(k)
        if len(mem) == thresh:
            mem = {m for m in mem if np.random.rand() < 0.5}
            p /= 2
    print(f"{len(mem)=} {p=} {len(mem)/p=}")

Or equivalently:

    import numpy as np
    L = np.random.randint(0, 3900, 30557)
    print(f"{len(set(L))=}")
    thresh = 100
    p = 1
    mem = []
    for k in L:
        if k not in mem:
            mem += [k]
        if np.random.rand() > p:
            mem.remove(k)
        if len(mem) == thresh:
            mem = [m for m in mem if np.random.rand() < 0.5]
            p /= 2
    print(f"{len(mem)=} {p=} {len(mem)/p=}")

Now I found the quanta magazine formulation problem. By reading:

Round 1. Keep going through Hamlet, adding new words as you go. If you come to a word that’s already on your list, flip a coin again. If it’s tails, delete the word; heads, and the word stays on the list. Proceed in this fashion until you have 100 words on the whiteboard. Then randomly delete about half again, based on the outcome of 100 coin tosses. That concludes Round 1.

we want to write:

    for k in L:
        if k not in mem:
            mem += [k]
        else:
            if np.random.rand() > p:
                mem.remove(k)
        if len(mem) == thresh:
            mem = [m for m in mem if np.random.rand() < 0.5]
            p /= 2
whereas it should be (correct):

    for k in L:
        if k not in mem:
            mem += [k]
        if np.random.rand() > p:    # without the else
            mem.remove(k)
        if len(mem) == thresh:
            mem = [m for m in mem if np.random.rand() < 0.5]
            p /= 2
Just this little "else" made it wrong!

Alexanfa
2 replies
4h38m

Quanta:

    Round 1. Keep going through Hamlet, adding new words as you go. If you come to a word that’s already on your list, flip a coin again. If it’s tails, delete the word; heads, and the word stays on the list.

To:

    Round 1. Keep going through Hamlet, but now flipping a coin for each word. If it’s tails, delete the word if it exists; heads, and add the word  if it's not already on the list.

Old edit:

    Round 1. Keep going through Hamlet, adding words but now flipping a coin immediately after adding it. If it’s tails, delete the word; heads, and the word stays on the list.

josephernest
1 replies
4h25m

adding words but now flipping a coin immediately after adding it

Edit: I thought your formulation was correct but not really:

We flip the coin after adding, but we also flip the coin even if we didn't add the word (because it was already there). This is subtle!

wrong:

    if k not in mem:
        mem += [k]
        if np.random.rand() > p:
            mem.remove(k)
wrong:

    if k not in mem:
        mem += [k]
    else:
        if np.random.rand() > p:
            mem.remove(k)
correct:

    if k not in mem:
        mem += [k]
    if k in mem:      # not the same as "else" here
        if np.random.rand() > p:
            mem.remove(k)
correct:

    if k not in mem:
        mem += [k]
    if np.random.rand() > p:
        mem.remove(k)

Alexanfa
0 replies
4h7m

Ah, I'm using a set instead of list so I just always add and then toss remove.

Alexanfa
1 replies
4h48m

Was just now solving it and came to see if others had the same issue. Yep, you are right.

    function generateRandomNumbers(c, n) {
      let randomNumbers = new Array(c);
      for (let i = 0; i < randomNumbers.length; i++) {
        let randomNumber = Math.floor(Math.random() * (n + 1));
        randomNumbers[i] = randomNumber;
      }
      return randomNumbers;
    }
    function run(w, wS, m, r) {
        function round(r) {
            while(wS.size < m) {
                const next = w.next()
                if (next.done) return true;
                wS.add(next.value)
                prune(next.value, r)
            }
            return false
        }
        function prune(v,r) {
            for (let i = 0; i < r; i++) {
                const flip = new Boolean(Math.round(Math.random()))
                if (flip == false) {
                    wS.delete(v)
                }
            }
        }
        function purge(wS) {
            const copy = new Set(wS)
            copy.forEach(ith=>{
                const flip = new Boolean(Math.round(Math.random()))
                if (flip == false) {
                    wS.delete(ith)
                }
            })
        }
        const done = round(r);
        if (!done) {
            purge(wS)
            return run(w, wS, r+1,m)
        }
        console.log(`Round ${r} done. ${wS.size} Estimate: ${wS.size / (1/Math.pow(2,r))}`)
    }
    const memory = 1000
    const words = generateRandomNumbers(3000000,15000)
    const w = words[Symbol.iterator]() // create an iterator
    const wS = new Set();
    run(w,wS, memory,0);

Alexanfa
0 replies
1h37m

Noticed an error;

    return run(w, wS, r+1,m)
Should be changed to:

    return run(w, wS, m, r+1)

matt3210
5 replies
11h16m

CS guys always wanting to throw away a good system and start from scratch.

coldtea
1 replies
9h6m

Yeah, if only we have stuck with those stone wheels from 20,000 B.C.

datavirtue
0 replies
5h40m

Well, they are easier to work on.

baq
0 replies
10h40m

Tell me you didn't read the article without telling me you didn't read the article.

...but actually, I think you didn't even read the headline...?

Miraltar
0 replies
11h11m

Which good system are we talking about ?

112233
0 replies
11h5m

Which good system are you referring to?

biscuit1v9
4 replies
11h11m

The trick, he said, is to rely on randomization.

When the space is full, press pause and flip a coin for each word. Heads, and the word stays on the list; tails, and you delete it.

I wasn't expecting to go that far: randomization. How can you verify if the answer is good? Only approximation, maybe..

HDThoreaun
1 replies
10h41m

Yes, the result is an estimation.

biscuit1v9
0 replies
9h11m

Thanks. Just wanted to be sure I didn't misunderstand.

8372049
0 replies
10h46m

Did you read the (entire) article?

akamoonknight
4 replies
11h9m

I don't know a word or phrase for this, but I really enjoy any examples of "thinking outside the box" like this because it's something I struggle with in my professional career. Learning not only the right ways to solve problems, but figuring out the questions to ask that make solving the problems you have easier or even in some cases possible. In this case, it's hey, we don't need exact numbers if we can define a probabilistic range given defined parameters. Other problems are gonna have other questions. I guess my hope is that if I see enough examples I'll be able to eventually internalize the thought process and apply it correctly.

wmwragg
2 replies
10h54m

I think it's generally thought of as "lateral thinking", Edward de Bono has written a few books about it you might find interesting.

coldtea
0 replies
9h11m

And some more commonplace words like "creativity" (as in "creative solution") etc. would apply.

bzuker
0 replies
5h43m

any particular one you'd recommend?

dentemple
0 replies
4h52m

To be fair, this was a university research team. Literally, a team of folks who can, all day everyday, iterate over a single topic using the Scientific Method.

If you were paid by a big company to sit at a whiteboard all day with a team of equally intelligent engineers, I'm sure you'd come up with SOMETHING that would look like an "outside the box" solution to the rest of the world.

However, most of us are paid to work the JIRA factory line instead, which limits the amount of time we can spend experimenting on just one single problem.

112233
4 replies
10h51m

After skimming Knuth's paper — does the algorithm work if values are hashed, that is, the "uniform deviate" is selected deterministically for each unique value of the stream?

devnonymous
3 replies
10h3m

Not sure which Knuth paper you're referring to but skimming through the article my understanding is this algorithm works /only/ if the values are hashable. IOW how else does one define unique/distinct values ?

hcs
0 replies
9h32m

I think the analysis relies on independent random "u"s, even for the same key.

dist-epoch
3 replies
9h19m

Do they assume any particular distribution of the items?

Otherwise Nassim Taleb would like a word with them.

coldtea
2 replies
9h8m

No, and Taleb is not relevant for this.

dist-epoch
1 replies
8h44m

It is if the distribution is extreme.

coldtea
0 replies
8h38m

If Hamlet was 100000000 times the word Hamlet and 1 time the word pancake, it would still give a good estimate, even if pancake gets counted as 0.

cb321
3 replies
6h37m

This Nim program might clarify some things for someone.

    import std/[random, math, sets, times]; type F = float
    
    template uniqCEcvm*(it; xfm; k=100, alpha=0.05): untyped =
      ## Return (unbiased estim. cardinality of `it`, 1-alpha CI ebound*(1 +- e)).
      var x1 = initHashSet[type(xfm)](k); var x  = x1.addr
      var x2 = initHashSet[type(xfm)](k); var y  = x2.addr
      var p  = 1.0          # p always 2^-i means could ditch FP
      var n  = 0            #..RNG & arith for bit shifts/masks.
      let t0 = epochTime()
      for e in it:
        inc n   #; let e = xfm # to compute randState.next here
        x[].excl e          # 'd be nice to reduce to 1 hash by
        if rand(1.0) < p:   #..excl w/prob 1 - p but then cannot
          x[].incl e        #..add new keys.  p -> 0 anyway.
        while x[].len == k: # KnuthLoop not CVMIf guards against
          for e in x[]:     #..unlikely case of freeing nothing.
            if rand(1.0) < 0.5: y[].incl e
          p *= 0.5
          swap x, y; y[].clear # ↓ e.bound conservative by ~15X
      (int(x[].len.F/p), sqrt(12.0*ln(8.0*n.F/alpha)/k.F),
       (epochTime() - t0)*1e9/n.F)
    
    when isMainModule:  # Takes nItem/trial dupMask space nTrial
      when defined danger: randomize()
      import std/[strutils, os, strformat]
      let n   =  if paramCount()>0: parseInt(paramStr(1)) else: 100000
      let msk = (if paramCount()>1: parseInt(paramStr(2)) else: 0xFFFF).uint64
      let k   =  if paramCount()>2: parseInt(paramStr(3)) else: 512 # 32KiB
      let m   =  if paramCount()>3: parseInt(paramStr(4)) else: 35
      var ex  = initHashSet[uint64]()
      var xs  = newSeq[uint64](n)
      for j in 1..m:
        xs.setLen 0; ex.clear
        for i in 1..n: xs.add randState.next and msk
        for e in xs: ex.incl e
        let (apx, err, t) = uniqCEcvm(xs, uint64, k)
        let ap = apx.F; let an = ex.len.F
        echo ex.len," ",apx,&" eb: {err:.4f} |x-a|/x: {abs(ap-an)/an:.4f} {t:.1f}ns"
First few lines of laptop output:

    51160 49152 eb: 0.6235 |x-a|/x: 0.0392 10.5ns
    51300 51840 eb: 0.6235 |x-a|/x: 0.0105 10.7ns
    51250 55168 eb: 0.6235 |x-a|/x: 0.0764 10.9ns
    51388 52352 eb: 0.6235 |x-a|/x: 0.0188 10.5ns
Ditching the floating point RNG & arithmetic might be able to lower that time per element cost to more like 5 ns or maybe 2GiB/s for 8B ints. The toggle back & forth between old & new HashSet is just to re-use the same L1 CPU cache memory.

Note also that the error bound in the CVM paper seems very conservative. My guess is that it could maybe be tightened by 15+X (at the scale of those example numbers), but that might also require some kind of special function evaluation. Anyone have any ideas on that?

Also note that since this is just estimating unique identities, you can always just count unique 64-bit hashes of keys (or maybe even 32-bit hashes depending on your accuracy needs and cardinalities) if you care about the key space/key compares are slow.

menthe
2 replies
5h18m

Frankly, who can read this!? I am not sure what's worse: the comments spanning multiple lines of code, the multiple instructions on a single line, or the apparent disconnect from the pseudo-code in the article.

V1ndaar
1 replies
3h31m

I would blame the majority of your criticism on the fact that HN is not the best place to read code. Also, syntax highlighting & basic familiarity with Nim helps.

His code is doing a few more things than necessary. The actual algorithm is inside the `uniqCEcvm` template. The `it` it receives is anything you can iterate over (a collection or an iterator). Multiple things in one line really only appear where they directly relate to the part at the beginning of the line.

The `when isMainModule` is Nim's way of Python's `if __name__ == "__main__"`. The entire part below that is really just a mini CL interface to bench different (random) examples. Final thing to note maybe, the last expression of a block (e.g. of the template here) will be returned.

And well, the style of comments is just personal preference of course. Whether you prefer to stay strictly below 80 cols or not, shrug.

I grant you that the usage of 2 sets + pointer access to them + swapping makes it harder to follow than necessary. But I assume the point of it was not on "how to write the simplest looking implementation of the algorithm as it appears in the paper". But rather to showcase a full implementation of a reasonably optimized version.

Here's a version (only the algorithm) following the paper directly:

    proc estimate[T](A: seq[T], ε, δ: float): float =
      let m = A.len
      var p = 1.0
      var thr = ceil(12.0 / ε^2 * log2(8*m.float / δ))
      var χ = initHashSet[T](thr.round.int)
      for i in 0 ..< m:
        χ.excl A[i]
        if rand(1.0) < p:
          χ.incl A[i]
        if χ.card.float >= thr:
          for el in toSeq(χ): # clean out ~half probabilistically
            if rand(1.0) < 0.5:
              χ.excl el
          p /= 2.0
          if χ.card.float >= thr:
            return -1.0
      result = χ.card.float / p

menthe
0 replies
2h13m

Thank you very much for having taken the time.. Your comments and function are both very helpful!

renonce
2 replies
7h45m

At first find this paragraph confusing:

Keep going through Hamlet, adding new words as you go. If you come to a word that’s already on your list, flip a coin again. If it’s tails, delete the word; heads, and the word stays on the list.

Why would you delete the word already on the list by flipping coins? Doesn't this reduce the accuracy by counting fewer words than expected? And will the word be added to the list later?

After thinking about it for a while and reading the paper, I've finally developed a good mental model for how this algorithm works as below, which should convince even a high schooler why the algorithm works:

1. You are given a streaming set of elements from [n] (a finite set of n distinct elements). Now let's give each element a random uniform real number in [0,1]. This helps us choose the elements we want: if we choose all elements with the number below 0.5, we get about half of the elements.

2. Assume for a moment we have unbounded memory. Now we maintain a table of already seen elements, and for each element we keep the last real number attached with that element: so if an element A appears three times with the number 0.3, 0.7, 0.6, we keep A with the number 0.6.

3. Our memory is bounded! So we keep only the elements below a threshold 2^-r, that is, 1, 0.5, 0.25, etc. So at first we will keep all elements, but when we don't have enough memory, we will need to filter existing elements according to the threshold. It's easy to see that when we decrease threshold, only elements already in memory will meet the criteria and continue to exist in memory. No other elements in the stream will be below threshold. Also note that whether an element is in memory depends only on its last occurence. It can exist in memory for a while, get dropped because a later element does not meet the threshold, and get back in.

4. We don't actually have a real number attached to every element! But we can pretend as if there is one. For each new element X from the stream, we replace the number in memory with its attached number, and we only care if its attached number is below 2^-r or not. If it is, it should be in our memory, and if it's not, it should be out. Once the number is in memory, it's a random number in [0,2^-r] and we care no further.

When increasing r, we only care about whether the number is in [0,2^{-r-1}]. This has exactly 1/2 probability. So each number has 1/2 probability of getting out, and 1/2 probability of being kept in memory.

5. Now it's easy to see that whether an element is in the memory depends solely on the real number attached to its last occurence. That is, each distinct element has exactly 2^-r probability of being in memory. 2^r multiplied by the number of elements in memory gives a good estimate of number of distinct elements.
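
To see why step 5 works, here's a tiny simulation of the unbounded-memory thought experiment (my own sketch; r is fixed by hand here instead of adapting to memory pressure):

    import random

    def thought_experiment(stream, r):
        last = {}                        # element -> uniform number of its LAST occurrence
        for x in stream:
            last[x] = random.random()
        kept = sum(1 for u in last.values() if u < 2 ** -r)
        return kept * 2 ** r             # each distinct element is kept with prob 2**-r

    stream = [random.randrange(5000) for _ in range(100_000)]
    print(len(set(stream)), thought_experiment(stream, r=4))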

They criticized previous approaches as relying on hash functions. A simplified version of the approach is as follows:

1. Make a hash function that maps distinct elements to different uniform real numbers. So all As compute to 0.3, all Bs compute to 0.7, etc.

2. Use that hash function to transform all elements. The same element maps to the same real number, and different elements map to different real numbers.

3. Keep a set of N smallest distinct real numbers. This works by inserting numbers from the stream to the set. It's easy to see that adding the same element multiple times has exactly no effect; it's either too big to be in the set or the same as an existing number in the set.

4. This gives the Nth smallest real number in the set as K. The number of distinct elements is approximated as N/K.

This algorithm is remarkable, but not in that it's efficient or it's new. Both algorithms using hash functions and algorithms without hash functions have been around and are optimal. I assume this is the first optimal algorithm without hash functions that is very easy to understand, even by undergraduates.
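
And here's a minimal sketch of that hash-based approach (often called KMV, "k minimum values"; the choice of n=256 and blake2b is arbitrary on my part):

    import hashlib

    def kmv_estimate(stream, n=256):
        kept = set()                    # the n smallest distinct hash values seen so far
        for x in stream:
            h = int.from_bytes(hashlib.blake2b(repr(x).encode(), digest_size=8).digest(), "big")
            kept.add(h / 2 ** 64)       # pseudo-uniform point in [0, 1)
            if len(kept) > n:
                kept.discard(max(kept)) # O(n) per eviction, fine for a sketch
        if len(kept) < n:
            return len(kept)            # saw fewer than n distinct values: exact
        return n / max(kept)            # n-th smallest value is K; estimate N/K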

majikandy
1 replies
6h41m

I had the same problem with the same paragraph and still don’t quite get it.

Unfortunately I struggle to follow the detailed explanation you gave… since you seem to understand it… can you confirm that they really do mean to throw away the word in the list they just found?

Eg

ACT I SCENE Elsinore A platform before the castle FRANCISCO at his post. Enter to him

If buffer max is 16, I am supposed to randomly half it first?

Act Scene Elsinore Platform Before The His Post

Now what? The next words are “Bernado Bernado who’s there”

Majromax
0 replies
5h22m

Yes, you're supposed to throw it away.

The key insight is that words should appear in your scratch space with equal probability, no matter how often they appear in the source text. If you have a scratch space of size one, then the sequence of "apple" x 16 + "banana" x 1 should have equal chances of the scratch space containing [apple] or [banana] at the end of the sequence, at least averaged over all 17 permutations.

One way to achieve this result is to make the scratch space memory-free. Rather than think of it as "remove the word with probability x", think of it as "always remove the word, then re-add it with probability (1-x)".
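
In code, the memory-free version of the update is just this (a tiny sketch, with p standing for the current keep probability):

    import random

    def process(word, seen, p):
        # Whether or not `word` was already in `seen`, after this call it is
        # in `seen` with probability exactly p, erasing any history.
        seen.discard(word)
        if random.random() < p:
            seen.add(word)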

ceving
2 replies
7h18m

Can some native speaker tell me: is "count" the new "guess"?

dekhn
0 replies
3h33m

It's estimation, not counting.

novaRom
1 replies
8h27m

I wish we had a theory connecting randomness with time/space complexity.

Intuitively I think the key to many computational limitations is making use of randomness.

kromem
1 replies
9h29m

Ah, so Thanos was just conducting a census.

rcarmo
0 replies
5h3m

I see what you did there.

ilya_m
1 replies
8h25m

"In a recent paper" - really? The paper first appeared in ESA 2022. The revised version (with some errors fixed) is from May 2023. To be fair, compared to the rate of progress in this area between Flajolet & Martin (1985) and the previous SOTA (Blasiok, 2018), a couple of years of sitting on this news is not a lot.

davidmurdoch
0 replies
4h47m

Sounds like this qualifies as "recent" to me.

gnfargbl
1 replies
7h25m

On the topic of counting things, I would like to mention this efficient and easily-implemented algorithm for finding the top-k items in a stream, which I think is perhaps not as well known as it should be:

A Simple Algorithm for Finding Frequent Elements in Streams and Bags

Karp, Shenker & Papadimitriou

https://www.cs.umd.edu/~samir/498/karp.pdf
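
From memory, the core of it is roughly the following (a Python sketch, not the paper's exact pseudocode; the returned candidates are a superset that a second pass over the stream would confirm):

    def frequent_candidates(stream, theta):
        # Candidates for elements occurring in more than a theta fraction of
        # the stream, using at most about 1/theta counters.
        counts = {}
        for x in stream:
            counts[x] = counts.get(x, 0) + 1
            if len(counts) > 1 / theta:
                # decrement every counter and drop the ones that hit zero
                counts = {k: c - 1 for k, c in counts.items() if c > 1}
        return counts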

vanderZwan
0 replies
5h8m

the top-k items in a stream

Hmm, this is phrased in a way that sounds different (to my ears) than the abstract, which says:

it is often desirable to identify from a very long sequence of symbols (or tuples, or packets) coming from a large alphabet those symbols whose frequency is above a given threshold

Your description suggests finding a fixed number of k items, with the guarantee that they will be the top ones. The abstract sounds like it determines an a priori unknown number of items that meet the criterion of having a particular value greater than k.

So "find the 100 oldest users" vs "find all users older than 30".

Am I misunderstanding you or the abstract? (English is not my first language)

eterevsky
1 replies
6h23m

Computer scientists invent a memory-efficient way to estimate the size of a subset

m3kw9
0 replies
4h43m

It seems fast too, as you can use fewer rounds of flips and get an estimate, meaning you may not need to go through the entire "book" to get an estimate of distinct words

deadeye
1 replies
2h53m

Am I correct in thinking the accuracy of this method has more to do with the distribution in the sample than the algo itself. Meaning, given a 100% unique list of items, how far off would this estimate be?

MarkusQ
0 replies
2h3m

It doesn't appear to be heavily dependent on the distribution. It mostly depends on how large your buffer is compared to the number of unique items in the list. If the buffer is the same size or larger, it would be exact. If it's half the size needed to hold all the unique items, you drop O(1 bit) of precision; at one quarter, O(2 bits), etc.

zero_k
0 replies
3m

I was involved with implementing the DNF volume counting version of this with the authors. You can see my blog post of it here:

https://www.msoos.org/2023/09/pepin-our-probabilistic-approx...

And the code here: https://github.com/meelgroup/pepin

Often, 30% of the time is spent in IO of reading the file, that's how incredibly fast this algorithm is. Crazy stuff.

BTW, Knuth contributed to the algo; Knuth's notes: https://cs.stanford.edu/~knuth/papers/cvm-note.pdf

He actually took time off (a whole month) from TAOCP to do this. Also, he is exactly as crazy good as you'd imagine. Just mind-blowing.

xpe
0 replies
38m

I have questions for Quanta Magazine -- this is probably a shot in the dark, but if anyone can ask them a few questions, here would be three of mine:

1. Have independent researchers evaluated the effectiveness of Quanta Magazine at their mission? [1]

2. In the case of this article, did the editors consider including visualization? If not, why not?

3. Does Quanta have interactive designers and/or data scientists on staff? How many? Why not more?

## Background & Biases

I'm trying to overcome my disappointment in Quanta by being curious about their mission, organization, and constraints. I am glad they exist, but sometimes your "closest allies" can be one's harshest critics.

I'm greatly troubled by the level of scientific, mathematical, and rational understanding among the denizens of the world. But for some reason, the state of science writing bothers me more. It would seem that I hold out hope that science writers could do better. This may be unfair, I admit.

Anyhow, rather than just bash Quanta for being, say, not as good as the best math blogs or YouTube channels (such as 3Blue1Brown), I really want to figure out (1) if I'm missing something; (2) if they are actively trying to improve; and (3) what we can all learn from their experience.

[1] From https://www.quantamagazine.org/about/ : "Quanta Magazine is an editorially independent online publication launched by the Simons Foundation in 2012 to enhance public understanding of science. Why Quanta? Albert Einstein called photons “quanta of light.” Our goal is to “illuminate science.”"

vitus
0 replies
5h24m

From the paper [0]:

We state the following well-known concentration bound, Chernoff bound, for completeness.

Which variant of the Chernoff bound is this? This is almost the (looser variant of the) multiplicative form, but it's not quite right (per the use of 1+delta instead of a single parameter beta). In particular, that bound is only guaranteed to hold for delta >= 0 (not beta = 1 + delta > 0 as asserted in the paper)

[0] https://arxiv.org/pdf/2301.10191

[1] https://en.wikipedia.org/wiki/Chernoff_bound#Multiplicative_...

edit: to be clear: I'm not sure at all whether this breaks the proof of correctness, although I'm having a bit of difficulty following the actual details (I think I'd need to work through the intermediate steps on paper).

unnouinceput
0 replies
8h6m

"Computer scientists invent an efficient new way to approximate counting large unique entries" - fixed the title

unbalancedevh
0 replies
3h0m

I wonder how the error changes if you start a new round just before finishing the stream, compared to if the last word in the stream just fills the buffer and ends a round.

thesz
0 replies
9h52m

HyperLogLog uses additions, it keeps sums. Thus, you can subtract one HLL sums from other. This is useful if stream supports deletion. Streams with deletions can be found in log-structured merge trees, for one example, so one can estimate count of distinct elements in all of the LSM tree hierarchy.

The algorithm in the paper does not allow for deletions.

Also, if one counts statistics of the stream of large elements (say, SHA-512 hashes, 64 bytes per hash), this algorithm requires some storage for elements from this stream, so memory requirement is O(table size * element size).

theelous3
0 replies
4h32m

Imagine that you’re sent to a pristine rainforest to carry out a wildlife census.

alt-f4

saulrh
0 replies
10h50m

Huh, that's a clever twist on reservoir sampling. Neat.

rep_lodsb
0 replies
8h29m

What if you’re Facebook, and you want to count the number of distinct users who log in each day, even if some of them log in from multiple devices and at multiple times?

Seems like a bad example of when this algorithm could be useful in practice.

If you already know you will want this info when designing the login process, it's simple: keep track of the date of last login for each account, and increment the unique user counter when the stored value is different from the current one.

And even if not, it should still be possible to "replay" a stream of login events from the database later to do this analysis. Unless maybe you already had years worth of data?

prerok
0 replies
3h7m

While the algorithm is interesting and I commend the authors for the algorithm, it should be noted that this is a guesstimate only. So, it's not really an efficient way to count distinct values but an efficient way to get an approximate count of distinct values.

ngrilly
0 replies
8h21m

It is quite amazing that after we had so many talented researchers working for decades on this problem, the authors were able to come up with such an elegant and simple solution. Another proof that research is very necessary and useful.

mring33621
0 replies
4h30m

Feels like reservoir sampling to me

minkzilla
0 replies
19m

There is a big practicality problem I see with this algorithm. The thresh defined in the paper relies on the length of the stream. It seems to me that in a scenario where you have a big enough data set to desire a solution that doesn't just store every unique value, you don't know the length. I did not make it through all the proofs, but I believe they use the fact that the defined threshold has the length in it to prove error bounds. If I were to use this in a scenario where I need to know error bounds, I would probably ballpark the length of my stream to estimate error bounds and then use the algorithm with a ballpark threshold depending on my system's memory.

Another practical thing is the "exception" if nothing is removed on line 6 in the original algorithm. This also seems needed for the proof, but you would not want it in production, though the chance of hitting it should be vanishingly small, so maybe worth the gamble?

Here is my faithful interpretation of the algorithm. And then a re-interpretation with some "practical" improvements that almost certainly make the provability of the correctness impossible.

    func CountUnique(scanner *bufio.Scanner, epsilon float64, delta float64, m int) int {

    X := make(map[string]bool)
    p := 1.0
    thresh := int(math.Ceil((12 / (epsilon * epsilon)) * math.Log(8*float64(m)/delta)))

    for scanner.Scan() {
        a := scanner.Text()
        delete(X, a)
        if rand.Float64() < p {
            X[a] = true
        }

        if len(X) == thresh {
            for key := range X {
                if rand.Float64() < 0.5 {
                    delete(X, key)
                }
            }
            p /= 2
            if len(X) == thresh {
                panic("Error")
            }
        }
    }

    return int(float64(len(X)) / p)
    }

  func CountUnique2(scanner *bufio.Scanner, thresh int) int {

     //threshold passed in, based on system memory / estimates
    X := make(map[string]bool)
    p := 1.0

    for scanner.Scan() {
        a := scanner.Text()
        delete(X, a)
        if rand.Float64() < p {
            X[a] = true
        }

        if len(X) >= thresh {  // >= instead of == and remove the panic below
            for key := range X {
                if rand.Float64() < 0.5 {
                    delete(X, key)
                }
            }
            p /= 2
        }
    }

    return int(float64(len(X)) / p)
    }

I tested it with Shakespeare's work. The actual unique word count is 71,595. With the second algorithm it is interesting to play with the threshold. Here are some examples.

threshold 1000 Mean Absolute Error: 2150.44 Root Mean Squared Error: 2758.33 Standard Deviation: 2732.61

threshold 2000 Mean Absolute Error: 1723.72 Root Mean Squared Error: 2212.74 Standard Deviation: 2199.39

threshold 10000 Mean Absolute Error: 442.76 Root Mean Squared Error: 556.74 Standard Deviation: 555.53

threshold 50000 Mean Absolute Error: 217.28 Root Mean Squared Error: 267.39 Standard Deviation: 262.84

klaussilveira
0 replies
5h6m

This actually comes at a good time. I'm currently refactoring a system that counts visits using inserts into a table, with a tuple of date and IP. I was planning to replace it with a HLL approach, but this is really interesting.

hum3hum3
0 replies
10h25m

The counting/estimation technique is rather like a floating point number: an integer exponent k and a mantissa of a population.

gh0stcloud
0 replies
8h59m

very interesting solution. A perfect example of how even some very complicated problems can sometimes have simple solutions. Will definitely try to write an implementation of this.

The only minor downside to this is that it's obviously not deterministic since it depends on chance. But for many applications where the dataset is so big it doesn't fit in memory, that's probably just a tiny rounding error anyway.

fnord77
0 replies
27m

With other count-distinct algorithms, you can do unions and intersections on the "sets". (theta sketches, even bloom filiters)

Can you do this with CVM?

burjui
0 replies
20m

The algorithm uses less memory, but more CPU time because of rather frequent deletions, so it's a tradeoff, not just generally better algorithm, as article may suggest.

brianhorakh
0 replies
50m

Wondering if this approach could be applied to CFD (computational fluid dynamics) methods to reduce the total volume of data points while still getting approximately close to the correct final answer?

arn3n
0 replies
7h3m

Does anyone else notice that quanta consistently has good illustrations and diagrams for their articles? Good visualizations and graphics can do extraordinary things for the understandability of a topic -- people like 3Blue1Brown understand this, and I'm glad quanta puts effort into it.

SoftTalker
0 replies
2h45m

New interview question: Tell me how you would estimate the number of unique words in Hamlet using only 100 elements.

Gbotex
0 replies
7h0m

nice read!