10-15 years ago, I found myself needing to regularly find the median of many billions of values, each parsed out of a multi-kilobyte log entry. MapReduce was what we were using for processing large amounts of data at the time. With MapReduce over that much data, you don't just want linear time, but ideally single pass, distributed across machines. Subsequent passes over much smaller amounts of data are fine.
It was a struggle until I figured out that knowledge of the precision and range of our data helped. These were timings, expressed in integer milliseconds. So they were non-negative, and I knew the 90th percentile was well under a second.
As the article mentions, finding a median typically involves something akin to sorting. With the above knowledge, bucket sort becomes available, with a slight tweak in my case. Even if the samples were floating point, the same approach could be used as long as an integer (or even fixed point) approximation that is very close to the true median is good enough, again assuming a known, relatively small range.
The idea is to build a dictionary where the keys are the timings in integer milliseconds and the values are a count of the keys' appearance in the data, i.e., a histogram of timings. The maximum timing isn't known, so to ensure the size of the dictionary doesn't get out of control, use the knowledge that the 90th percentile is well under a second and count everything over, say, 999ms in the 999ms bin. Then the dictionary will be limited to 2000 integers (keys in the range 0-999 and corresponding values) - this is the part that is different from an ordinary bucket sort. All of that is trivial to do in a single pass, even when distributed with MapReduce. Then it's easy to get the median from that dictionary / histogram.
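A minimal sketch of that capped-histogram idea in Python (the names and the in-memory Counter are mine for illustration; the real thing kept per-bucket counts inside MapReduce rather than a single dict):

    from collections import Counter

    CAP_MS = 999  # everything slower than this is counted in the top bucket

    def add_timing(hist: Counter, timing_ms: int) -> None:
        # One increment per log entry, clamped into the 0..CAP_MS range.
        hist[min(timing_ms, CAP_MS)] += 1

    def median_from_histogram(hist: Counter) -> int:
        # Walk the buckets in order until we've passed half of the total count.
        total = sum(hist.values())
        seen = 0
        for bucket in range(CAP_MS + 1):
            seen += hist[bucket]
            if 2 * seen >= total:
                return bucket
        return CAP_MS

With an even count this returns the lower middle bucket, which was close enough at integer-millisecond resolution.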
Did you actually need to find the true median of billions of values? Or would a value between the 49.9th and 50.1st percentiles suffice? Because the latter is much easier: sample 10,000 elements uniformly at random and take their median.
(I made the number 10,000 up, but you could do some statistics to figure out how many samples would be needed for a given level of confidence, and I don't think it would be prohibitively large.)
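For what it's worth, here's roughly the shape I had in mind, as a sketch (reservoir sampling is one way to get a uniform sample from a stream of unknown length; the 10,000 is still just an illustrative default):

    import random
    import statistics

    def approx_median(stream, sample_size=10_000, seed=0):
        # Keep a uniform random sample of the stream (Algorithm R), then take its median.
        rng = random.Random(seed)
        reservoir = []
        for i, x in enumerate(stream):
            if i < sample_size:
                reservoir.append(x)
            else:
                j = rng.randint(0, i)
                if j < sample_size:
                    reservoir[j] = x
        return statistics.median(reservoir)

Memory stays bounded and it's still a single pass over the stream.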
The kind of margin you indicate would have been plenty for our use cases. But, we were already processing all these log entries for multiple other purposes in a single pass (not one pass per thing computed). With this single pass approach, the median calculation could happen with the same single-pass parsing of the logs (they were JSON and that parsing was most of our cost), roughly for free.
Uniform sampling also wasn't obviously simple, at least to me. There were thousands of log files involved, coming from hundreds of computers. Any single log file only had timings from a single computer. What kind of bias would be introduced by different approaches to distributing those log files to a cluster for the median calculation? Once the solution outlined in the previous comment was identified, it seemed simpler than trying to understand whether we were talking about 49-51% or 40-50%. And if the margin was too big, restructuring our infra to allow different log file distribution algorithms would have been far more complicated.
Speaking of "single pass", one of the criticisms I have of the "enumerator" patterns in modern programming languages is that they encourage multiple passes.
As an example: computing the .min() and .max() of an enumerable is two passes even though it could be done with one pass.
I'd love to see a language embrace a more efficient style similar to how SQL does it, where you can elegantly request this as a single pass over the data: "SELECT min(x), max(x) FROM y"
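For comparison, the single-pass version is easy enough to write by hand (a sketch in Python, not any particular library's API); the point is that a language or standard library could offer this shape directly:

    def min_max(iterable):
        # One pass, tracking both extremes as we go.
        it = iter(iterable)
        try:
            lo = hi = next(it)
        except StopIteration:
            raise ValueError("min_max() arg is an empty iterable")
        for x in it:
            if x < lo:
                lo = x
            elif x > hi:
                hi = x
        return lo, hi

The annoyance is that nothing in the enumerator style nudges you toward writing it this way.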
What's wrong with doing it in two passes? N iterations each doing 2 operations is exactly the same cost as 2*N iterations each doing 1 operation. Because multiplication is commutative.
It’s more like 2(N + M), where M is the cost of loading the data from disk/memory. One pass is 2N + M.
Why go to the store and back twice to buy two things instead of buying two things in one trip? ;p
This is true, but a big advantage of the single pass method is data reuse. Instead of loading each element twice, you load each element once. Per bit, reading/writing to external memory is massively more energy intensive than any compute op. Orders of magnitude more.
Does C# have the plumbing for this built in? It's been 7 years since using it so I might not be remembering correctly.
IEnumerable<T>.Max and .Min are the same as they've always been, just significantly faster through the use of SIMD: https://github.com/dotnet/runtime/blob/ebbebaca1184940f06df6...
You could implement a similar (and simpler) fast path for types with contiguous memory by performing min and max per iteration instead.
what's preventing you from doing it in one pass?
Actually, seeking the bias numbers can be quite illuminating.
You're absolutely right! Some learning that we came to later that isn't unrelated to what you're saying... don't just look at metrics (in the case I've described above, it was timings of operations in a large system), but look at histograms for them. You should be able to explain why they have the shape they do. The distributions are so often multimodal, and understanding the modes helps you understand a lot more nuance about your system.
We were an infra-focused team building up a scalable data handling / computation platform in preparation to apply ML at a non-trivial scale. When we finally hired some people with deep stats knowledge 10 or 12 years ago, there was a lot of great learning for everyone about how each of our areas of expertise helped the others. I wish we'd found the right stats people earlier to help us understand things like this more deeply, more quickly.
Do you have a source for that claim?
I don't see how that could possibly be true... For example, if your original points are sampled from two Gaussians with centers at -100 and 100, of small but slightly different variance, then the true median can be anywhere between the two centers, and you may need a humongous number of samples to get anywhere close to it.
True, in that case any point between say -90 and 90 would be equally good as a median in most applications. But this does not mean that the median can be found accurately by your method.
Asymptotic properties of quantile estimators are widely studied [1]. The key is to have a sufficiently large sample size.
[1] Bahadur, R. R. (1966). A note on quantiles in large samples. Annals of Mathematical Statistics, 37, 577–580.
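Roughly, the standard asymptotic result here is that the sample p-quantile from n i.i.d. draws with density f is approximately normal around the true quantile q_p, with variance

    Var(q̂_p) ≈ p(1 − p) / (n · f(q_p)²)

so the error shrinks like 1/√n but is scaled by 1/f(q_p): if the density near the median is tiny (as in the two-Gaussians example upthread), the sample size needed for a given accuracy blows up accordingly.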
Yet, for any given distribution the required sample size can be arbitrarily large. Unless I've missed something, I don't see the relevance.
If you want the n-9s rate of failure (e.g., n=5, 99.999%) for a system with a power-law performance distribution, you could be waiting much more than a billion samples to see a failure.
E.g., 3E10 ms (30 bn samples) in a year at a 5-9s failure rate gives about 3E5 ms of failure (300 thousand samples) -- and a sampling strategy will see proportionally fewer.
the key word is “uniformly”. If your data is not uniformly distributed, then you just have to pick random values non-uniformly. There are many ways to do that, and once you have your algo you’ll be able to reliably find an approximation of the median much faster than you would find the actual median.
https://en.m.wikipedia.org/wiki/Non-uniform_random_variate_g...
I wasn't saying that you could get within 1% of the true median, I was saying you could find an element in the 49th to 51st percentile. In your example, the 49th percentile would be -90 and the 51st percentile would be 90, so the guarantee is that you would (with very high probability) find a number between -90 and 90.
That's a good point, though, that the 49th percentile and 51st percentile can be arbitrarily far from the median.
You can do dynamic sampling: e.g., take double the samples and see which decimal place of your estimate budges. Adjust accordingly.
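Something like this sketch (the starting size and the tolerance are made-up knobs):

    import random
    import statistics

    def median_by_doubling(values, start=1_000, tol=0.5, seed=0):
        # Double the sample size until the median estimate stops moving by more than tol.
        rng = random.Random(seed)
        n = start
        prev = statistics.median(rng.sample(values, min(n, len(values))))
        while n < len(values):
            n *= 2
            est = statistics.median(rng.sample(values, min(n, len(values))))
            if abs(est - prev) <= tol:
                return est
            prev = est
        return prev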
I was thinking the same thing.
In all use-cases I've seen a close estimate of the median was enough.
I’m not sure why you use a dictionary with keys 0…999, instead of an array indexed 0…999.
That's just a dict/map with less flexibility on the keys :D
I was using the term dictionary for illustration purposes. Remember, this was all in the context of MapReduce. Computation within MapReduce is built around grouping values by keys, which makes dictionaries a natural way to think about many MapReduce oriented algorithms, at least for me. The key/value pairs appear as streams of two-tuples, not as dictionaries or arrays.
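The shape, roughly, as map and reduce functions in Python (the "timing_ms" field name is made up; in reality the timing came out of the JSON parsing we were already doing for other purposes):

    import json

    CAP_MS = 999

    def parse_timing_ms(log_entry):
        # Stand-in for the real parsing: assume a JSON line with a "timing_ms" field.
        return int(json.loads(log_entry)["timing_ms"])

    def map_entry(log_entry):
        # Emit (bucket, 1); the framework groups the stream of two-tuples by key.
        yield min(parse_timing_ms(log_entry), CAP_MS), 1

    def reduce_bucket(bucket, counts):
        # Sum the counts for one bucket; together the outputs are the histogram.
        yield bucket, sum(counts)

Getting the median out of the resulting (bucket, count) pairs is then the cheap second step over a tiny amount of data.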
Where were you working? Sounds like you got lucky to work on some fun problems!
Sorry, but I'm trying to keep this account relatively anonymous to sidestep some of my issues with being shy.
But, you're right, I was lucky to work on a bunch of fun problems. That period, in particular, was pretty amazing. I was part of a fun, collaborative team working on hard problems. And management showed a lot of trust in us. We came up with some very interesting solutions, some by skill and some by luck, that set the foundation for years of growth that came after that (both revenue growth and technical platform growth).
Was this by any chance for generating availability metrics, and were you an intern at the time? The system sounds, ah, very familiar.
The metrics were about speed. And I was decades past my last internship at the time in question. But, as is so often the case, more than one of us may have been reinventing pretty similar wheels. :)
I'm not sure. But from the outside, it looks like what Prometheus does behind the scenes. It seems to me Prometheus works like that because it had a latency cap of around 10s on some systems I worked on, so requests above that limit all got reported as 10s, even though the true values could be higher. Interesting.