Ooh fun, Advent of Code chaser!
A fair comparison between languages should include the make and build times. I haven't used Java / Maven for years, and I'm reminded why, heading into minute 2 of downloads for './mvnw clean verify'.
I believe the whole thing can be done in 0.3 seconds with the following approach:
(Describing only the 'happy path' here - other paths can be made fast too, but will require different implementations)
* Since temperatures are only to 0.1 decimal points, we have a finite number of temperatures. ~400 temperatures will cover all common cases.
* We also have a finite number of place names. (~400)
* Just make a lookup table of all temps and all place names. (~160,000)
* autogenerate a state machine that will map each of the above 160,000 things, at any rotation within a 4 byte register, to a unique bin in a hash table. The state machine will have one 32 bit state register (16 bits to output, 16 to carry over to the next cycle) where every cycle a lookup in the state transition table is done, and the next 4 bytes of data XOR'ed on top.
* run through all the data, at RAM speed, incrementing counters for each state the machine ends up in. (there will only be 65k). These counters fit fully in cache.
* With just 32 bits of state, with AVX512 we can be running 512 copies of this in parallel if we like, per core! AKA, compute will not be the bottleneck.
* from the values of the counters, you can calculate the answers. 65k is a far smaller number than 1 billion, so you don't need to do this bit fast.
* for anything that doesn't map to a valid bin (higher/lower temperatures, unknown place names), just fallback to slow code (one state of the state machine can be reserved for 'escape to slow code'. Use these escapes for min/max handling too, since it will only happen a few thousand times).
* I think this approach can operate at RAM speed with just one core with AVX512, so no benefit in splitting across cores.
You don't need the look up table. All you are asked for is min/mean/max, which can all be computed in one pass without storing the data. All you need is a hash table with 400 entries, each holding 3 floats (running min, mean and max) and an int (count, for updating the running mean). That's just 16 bytes; if you use 16 bytes for the station name, you can fit everything under 16K.
IO will dominate the running time for this, and JSON parsing will be second.
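A minimal Java sketch of that per-station bookkeeping (illustrative only, not from any submission; it keeps a running sum rather than an incremental mean, which gives the same final mean):

    // One instance per station: running min, max, sum, and count; mean is derived at the end.
    final class StationStats {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0.0;
        long count = 0;

        void add(double temp) {
            if (temp < min) min = temp;
            if (temp > max) max = temp;
            sum += temp;
            count++;
        }

        double mean() {
            return sum / count;
        }
    }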
I don't have a CS background and when I eventually had to do interviews for Google, "Calculate mean/median/mode of temperatures" was the interview question I went with, intending to avoid BS leetcode*
I always worried it was too easy, but I'm heartened by how many comments miss the insight I always looked for and you named: you don't need to store a single thing.
I do wonder if it'll work here, at least as simply as the tracking vars you mention, with so many rows, and the implicit spirit of "you should be able to handle _all_ data", overflow might become a legitimate concern - ex. we might not be able to track mean as simply as maintaining sum + count vars.
* "oh sorry times up, but the right answer is a red black self balancing terenary tree recursive breadth-first search", ironically, was one of my Apple interviews.
Streaming calculation of the exact median with no storage at all is non-trivial at best in the general case, and I'm not aware of any way to calculate the mode at all. Any pointers to the methods you used in your interview answers?
If you came up with them on the fly, then... well, sign here for your bonus. Can you start Monday?
I am silly and wrote 'mode' and shouldn't have :P (wetware error: saw list of 3 items corresponding to leetcode and temperature dataset, my 3 were min/max/average, their 3 are mean/median/mode)
In general, median and mode are much harder than min/avg/max. You can't compute the former with constant memory in one pass (you can do approximate median, but not exact median).
(Here the temperature range is restricted, with only 1999 possible values (-99.9 to 99.9 in 0.1 increments), so you could do it in constant memory; you'd need something like 4*1999 bytes per unique place name.)
For the sum, overflow is not an issue if you use 64-bit integers. Parse everything to integers in tenths of a degree, and even if all 1 billion rows are 99.9 temperature for the same place name (worst possible case), you are very far from overflowing.
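To make the headroom concrete, here is a hedged sketch of the integer-tenths accumulator (illustrative names): even if all 10^9 rows were 99.9 for one station, the sum would be 999 * 10^9, roughly 10^12, far below Long.MAX_VALUE (~9.2 * 10^18).

    // Accumulate temperatures as integer tenths of a degree in a 64-bit sum.
    final class TenthsAccumulator {
        long sumTenths = 0;
        long count = 0;

        void add(int tenths) {        // e.g. "-12.3" is passed in as -123
            sumTenths += tenths;
            count++;
        }

        double mean() {
            return (sumTenths / 10.0) / count;
        }
    }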
Maintaining sum+count is more expensive than you'd imagine... Because to maintain the sum, you need to convert the numbers from ascii to decimal. And to do that you at least need to know the memory alignment of the number and the position of the decimal point.
All far more expensive than the state machine approach, which is pretty much 'alignment doesn't matter, ascii doesn't matter, we're gonna just handle all possible states that could end up in a 32 bit register'.
Merely finding the start/end of each line will use more computation (in the inner loop) than the approach I outlined. Let alone converting the number from ascii to a float, or looking up the place name in a hash table (oh, and the name is variable length, so you're gonna need to find how long the name is first).
I deleted two previous comments because I realized I misunderstood your proposal. I understand it better now, but I am still confused about something...
Your state machine would need at least 2*160,000 states (you need an extra bit to flag whether you have reached a newline in the last word and need to increment a counter or not), correct? And you are assuming the input is 4 bytes, so won't your transition table need (2^32)*2*160,000 ~ 10^15 entries (at least 3 bytes each)?
IO will dominate the running time for this, and JSON parsing will be second.
Memory bandwidth might dominate, but probably not I/O. The input file is ~12GB, the machine being used for the test has 32GB, and the fastest and slowest of five runs are discarded. The slowest run will usually be the first run (if the file is not already cached in memory), after which there should be little or no file I/O.
s/JSON/string/
Where does your ~160,000 come from? There's a billion rows, couldn't there be a different place on every row giving a billion place names?
There are only ~400 place names in the generated data. 160,000 = 400*400
The state machine generation would need to know all 400 - but it's easy enough to scan the first few hundred thousand rows to find them all, and have a fallback in case a name that has never been seen before shows up. (The fallback would be done by having the state machine jump to an 'invalid' state; at the end you check that that invalid state's counter is zero, and if it isn't, you redo everything with slow code.)
There are only ~400 place names in the generated data
The README says the maximum number of unique place names is 10 000, so you should probably design for the fast case to handle that many instead of just going by what is in the database that it happens to be measuring with at the moment.
This code is only fast in the 'happy' case (ie. ~400 unique places, with no odd distributions, temperatures with a std dev of 10 from a mean between 0 and 40). It is still correct in the 'unusual' cases, but would be slow because it would revert to fallback code.
I would ask for clarification about the rules - it isn't clear to me whether code that is special cased for the known 400 place names (even if it supports additional names via a slower path) would be considered valid:
Q: Can I make assumptions on the names of the weather stations showing up in the data set?
A: No, while only a fixed set of station names is used by the data set generator, any solution should work with arbitrary UTF-8 station names (for the sake of simplicity, names are guaranteed to contain no ; character).
The code wouldn't be special cased... it would run through the first 10,000 or so rows to find out the dataset to generate the state transition table. If any invalid entry is found later, you'll end up in an invalid state, and can detect that case and fall back to slow code (or even theoretically update the state transition table and return to the fast path).
Even simpler. Just have a Dictionary mapping place name to ID. The slow path is "I haven't seen this name before and I need to store it, making sure that I'm not racing any other thread doing the same."
You'd have an upper limit on how many place names you can have. But no hard coding of what they are.
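A minimal sketch of that dictionary (illustrative, not from any submission), with a lock-free fast path for already-seen names and a slow path only for the first occurrence of a name:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    final class StationIds {
        private final ConcurrentHashMap<String, Integer> ids = new ConcurrentHashMap<>();
        private final AtomicInteger next = new AtomicInteger();

        int idFor(String name) {
            Integer id = ids.get(name);              // fast path: name already known
            if (id != null) return id;
            // slow path: first time we see this name; computeIfAbsent handles the race
            return ids.computeIfAbsent(name, n -> next.getAndIncrement());
        }
    }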
I think this approach can operate at RAM speed with just one core with AVX512, so no benefit in splitting across cores.
A single core can't saturate RAM bandwidth. Cores are limited by memory parallelism and latency. Most modern x86 server chips can retire 2 SIMD loads per clock cycle (i.e. 32GB/s @ 1GHz using AVX2), so AVX-512 is not necessary to max out per-core bandwidth - but if you're loading from DRAM you'll likely cap out much, much earlier (closer to 10-16GB/s on commodity servers).
As long as the majority of your data spills into RAM, you'll see a massive throughput cliff on a single core. With large streaming use-cases, you'll almost always benefit from multi-core parallelism.
It's very easy to test: allocate a large block of memory (ideally orders of magnitude larger than your L3 cache), pre-fault it, so you're not stalling on page faults, and do an unrolled vector load (AVX2/AVX-512) in a tight loop.
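A rough Java-flavoured version of that test (a sketch only; the JIT may or may not vectorize the loop, so this measures what plain Java code will see rather than the hardware limit):

    public final class BandwidthProbe {
        public static void main(String[] args) {
            long[] buf = new long[512 * 1024 * 1024 / 8];      // 512 MB, far larger than L3
            for (int i = 0; i < buf.length; i++) buf[i] = i;   // pre-fault the pages

            long checksum = 0;
            long start = System.nanoTime();
            int passes = 10;
            for (int pass = 0; pass < passes; pass++) {
                for (int i = 0; i < buf.length; i++) checksum += buf[i];
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            double gigabytes = (double) passes * buf.length * 8 / 1e9;
            System.out.println((gigabytes / seconds) + " GB/s (checksum " + checksum + ")");
        }
    }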
A single core can't saturate RAM bandwidth.
I was also intrigued by the same statement and was wondering about it as well. For AMD I don't know but on Intel I believe that each L1 cache-miss is going to be routed through the line-fill-buffers (LFB) first before going to L2/L3. Since this LFB is only 10-12 entries large, this number practically becomes the theoretical limit on the number of (load) memory operations each core can issue at a time.
Most modern x86 server chips can retire 2 SIMD loads per clock cycle (i.e. 32GB/s @ 1GHz using AVX2),
I think in general yes, but it appears that Alder Lake and Sapphire Rapids have 3 SIMD load ports, so they can execute up to 3 AVX2 _mm256_load_si256 loads per clock cycle.
The goal is that none of the intermediate data spills - the only user of RAM is reading the input data.
But for that, you're right, the workload must be split across as many cores as necessary to keep DRAM maxed out or you're throwing away performance. Luckily this problem is easily split.
Go for it!
You nerd sniper you!
But more seriously, the JVM's support for vector intrinsics is very basic right now, and I think I'd spend far more time battling the JVM to output the code I want it to output than is fun. Java just isn't the right language if you need to superoptimize stuff.
Theoretically all of the above is super simple SIMD stuff, but I have a suspicion that SIMD scatter/gather (needed for the state lookup table) isn't implemented.
If you really want to nerd snipe, write an optimized version in a non-JVM language to compare to the fastest Java one. But that's kind of boring anyway, we've seen this play out a thousand times.
I still appreciate the intention of seeing how far we can squeeze performance of modern Java+JVM though. Too bad very few folks have the freedom to apply that at their day jobs though.
autogenerate a state machine that will map each of the above 160,000 things, at any rotation within a 4 byte register, to a unique bin in a hash table. The state machine will have one 32 bit state register (16 bits to output, 16 to carry over to the next cycle) where every cycle a lookup in the state transition table is done, and the next 4 bytes of data XOR'ed on top.
You are basically describing a perfect hash, right?
nearly but not quite - you can deliberately 'collide' when there is no data you need to keep going forward (the obvious example being when you end on a '\r' character). This is necessary to keep an upper bound on the state space.
Remember that in the extreme, a state could mean "city A, temp zero, city b" - "0\rB" so it isn't a direct mapping from state machine state to city.
run through all the data, at RAM speed,
Is the data given in RAM or do you have to wait for spinning rust I/O for some GB amount of data?
The test procedure is to run the code 5 times. Even if it isn't in cache the first time, it will be in the OS cache (ie. RAM) for the 2nd-5th runs. The test machine has 32GB RAM, while the data file is only 12GB, so it ought to be fully cached.
You have to read and parse the entire file to find the place names.
Isn't this simply bound by the speed of the disk? Surely none of the suggested optimizations (SIMD, multi-threading) are relevant.
It would also depend on how many different stations there are and what they are for the hash lookup, but seriously I doubt this will be anything measurable compared to I/O.
A few things:
Disk access can certainly be parallelized, and NVMe is blazing fast, so the bottleneck is more the CPU than the disk. There are systems built around modern hardware that realize this (redpanda.com, where I work, is one such example).
Parsing is a lot of the compute time and some SIMD tricks like SWAR for finding delimiters can be helpful. Stringzilla is a cool library if you want to see a clean implementation of these algorithms: https://github.com/ashvardanian/StringZilla
See my reply here about the file being cached completely in memory after the first run: https://news.ycombinator.com/item?id=38864034
I would hope that any reasonably performant implementation would be faster not only than NVMe, but also faster than CPU to RAM data transfers.
The AMD EPYC-Milan in the test server supports memory reads at 150 gigabytes/sec, but that's a 32 core machine, and our test only gets 8 of those cores, so we probably can't expect more than 37 gigabytes per second of read bandwidth.
The total file is ~12 gigabytes, so we should expect to be able to get the task done in 0.3 seconds or so.
There are only ~400 weather stations, so all the results fit in cache. It's unclear if it might be faster to avoid converting ascii to a float - it might be faster to add up all the ascii numbers, and figure out how to convert to a float only when you have the total.
The total file is ~12 gigabytes, so we should expect to be able to get the task done in 0.3 seconds or so.
How? EPYC-Milan features PCIe 4.0 and theoretical maximum throughput of x4 sequential read is ~8GB/s.
In bare-metal machines, this figure is closer to 5-6GB/s, and in cloud environments such as a CCX33, I imagine it could be even less.
So, unless I am missing something you'd need ~2 seconds only to read the contents of the ~12GB file.
This assumes the file is already in page cache, and therefore already in RAM. The 0.3 seconds is the time to get it from RAM to the CPU L3 cache.
That sounds like cheating. The exercise says to process a large file that's on disk. If the data is in RAM then that's a whole different game.
The benchmark is run 5 times without flushing file system caches. If you mmap it, it will still be in memory between process runs. The author intended this FWIW
If the cache is not flushed, it will still be in memory between runs whether you mmap it or not.
That depends a lot on the OS. I would not count on a 13 GB file being cached in RAM after having been read through once.
That sounds like a benchmarking flaw.
But then, writing good benchmarks is an art in itself.
Isn’t it run multiple times and the slowest is dropped? Why would you expect the file to be evicted from ram so quickly?
That makes sense now, thanks.
Are you sure that would give the same result? The first thing that struck me about this challenge is that there's going to be floating point precision issues.
All the data is only to one decimal place. You can shift the decimal place and do the whole thing with ints and get a 100% precise result.
You can, but will it be the same result? :) binary floating point can't represent 0.1 or 0.2 exactly.
To be fair, with only additions, and only a billion of them, you're probably fine working in floating point and rounding to one decimal in the end.
any reasonably performant implementation would be faster not only than NVMe
Saturating NVMe bandwidth is wicked hard if you don't know what you're doing. Most people think they're I/O bound because profiling hinted at some read() function somewhere, when in fact they're CPU bound in a runtime layer they don't understand.
Can you elaborate on the runtime layers and how to affect them?
It looks like all existing solutions are too slow by a huge factor then :)
Yes, this is an extremely trivial problem. Anybody who knows how to program in more than one language is going to find this silly. awk or perl would finish it before jit compilation gets started.
It's a trivial problem to implement, but not trivial to implement faster than others. Awk won't beat solutions that are clever about multiple read queues and parallel processing.
Please try it out in awk. It would be interesting to see whether awk can do this in a few seconds.
Daniel Lemire has an interesting talk on this: https://www.youtube.com/watch?v=wlvKAT7SZIQ. The key takeaway is that the disk is rarely the bottleneck.
It's an old bit of wisdom that's hard to stamp out. That's partly because anything that you don't understand below you in the runtime stack looks like "I/O" during profiling, so the misunderstanding perpetuates even among people who attempt to take a closer look, but don't know enough.
Isn't this simply bound by the speed of the disk?
Depends on the OS and filesystem being used. The input file is ~12GB, and it's being run 5 times on a machine with 32GB, so after the first run the file could be cached entirely in memory. For example, Linux using ext2 would probably cache the entire file after the first run, but using ZFS it probably would not.
I agree. From here: https://news.ycombinator.com/item?id=38865942
<< Just clarified this in the README: * Input value ranges are as follows:
- Station name: non null UTF-8 string of min length 1 character and max length 100 characters
- Temperature value: non null double between -99.9 (inclusive) and 99.9 (inclusive), always with one fractional digit >>
If we assume worst case entries: <100 char station name>;-99.9
That is 106 chars plus newline (1 or 2 chars). That could be 108 GB of data! And, I want to be so picky here: The README says "UTF-8" and "max length 100 characters". Eh, we could start a holy war here arguing about what a character means in UTF-8. Let's be generous and assume it means one (8-bit) byte (like C). I have good faith in the original author!
Some Googling tells me that the current fastest PCIe Gen5 NVMe drive is the Crucial T700, with a max sequential read of 12,400 MB/s. That is 8.7 seconds to read 108 GB of data. A few other comments mentioned that the sample file is only 12 GB, so that is 1 second, and it can be 100% cached by the Linux kernel for the 2nd and later reads. However, if the test file is greater than 32 GB, you are definitely screwed for file system caching. You will have to pay a multi-second "I/O tax" to read the file each time. This seems to far outweigh any computation time.
Thoughts?
Last, I still think this is a great challenge. I think I just found my next interview question. This question is easy to understand and program a basic implementation. Further discussion can be fractally complex about the myriad of optimizations that could be applied.
No trolling: Can you imagine if someone tries this exercise in Python? There are a goofy amount of optimizations to deal with huge data in that language thanks to SciPy, Pandas and other friends. I am sure someone can write an insanely performant solution. (I know, I know: The original challenge says no external dependencies.)
Clearly the fastest way to parse this is to load it all into RAM, and work backwards from the end. That has the advantage of putting the digits in ascending order, and then the delimiter, then the string... until EOF or NL is encountered.
Totally depends on your workload and hardware. I have a normal consumer SSD in my desktop and it can easily sustain 7GB/s (56gbps) as long as I only use 700GB of the 2TB available. A normal server has enough PCIe lanes for 15 SSDs like mine so a normal server's IO bandwidth is comparable to its memory bandwidth. More expensive servers have faster lanes (PCIe 5.0) and more of those lanes.
Anyway the OP's file only has 1B rows so it'll be ~1GB compressed which fits in memory after the first discarded run. IO bandwidth shouldn't matter in this scenario:
All submissions will be evaluated by running the program on a Hetzner Cloud CCX33 instance (8 dedicated vCPU, 32 GB RAM). The time program is used for measuring execution times, i.e. end-to-end times are measured. Each contender will be run five times in a row. The slowest and the fastest runs are discarded
Edit: The github repo says it's 12 GB uncompressed. That confirms that IO bandwidth doesn't matter
Interesting challenge, shame it's only Java. Can't wait till people start hand rolling their own JVM bytecode.
Check out the discussion[0], looks like there are submissions in several languages. Go, Rust, Python, and C++, to name a few
It looks like the problem is dominated by reading in the data file. Some fast solutions just read the whole file into memory.
Rather than read the file into memory, memory mapping can be used.
You wouldn't want to do this for a huge file. A very fast solution would use a small number of buffers and io_uring (or equivalent), keeping the page table and cache footprint small.
Yeah so I had a discussion on Twitter about this; turns out 12GB is small enough to fit into memory, and the author runs submissions 5 times in a row, so using direct IO actually hurts: the kernel cache is what ensures the file is in memory for the 4 runs after the first. I have a direct IO solution with SIMD string search and double parsing, just in C++ (using libraries). It runs in 6 seconds on my 24 core linux box (NVMe).
Code: https://github.com/rockwotj/1brc
Discussion on Filesystem cache: https://x.com/rockwotj/status/1742168024776430041?s=20
double parsing
In case you haven't noticed yet, the input format guarantees exactly one fractional digit, so you can read a single signed integer followed by `.` and one digit instead.
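Something like this is all the number parsing the format requires (a sketch assuming well-formed input, with no validation; `line` and `pos` are hypothetical names for the buffer and the offset of the temperature):

    // Parse "-?\d{1,2}\.\d" directly to integer tenths, e.g. "-99.9" -> -999.
    static int parseTenths(byte[] line, int pos) {
        int sign = 1;
        if (line[pos] == '-') { sign = -1; pos++; }
        int value = line[pos++] - '0';
        if (line[pos] != '.') {                     // optional second integer digit
            value = value * 10 + (line[pos++] - '0');
        }
        pos++;                                      // skip '.'
        value = value * 10 + (line[pos] - '0');     // the single fractional digit
        return sign * value;
    }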
could you just add the character values eg 49 for ascii 1, and then subtract off the offset once at the end instead of doing atoi on each line?
edit: doh that works for min and max but the average overflows.
Yes. I'm not sure it'll help, but def worth a try.
Yeah I missed this originally, and stuff could be faster with this assumption, without a full double parser. The fastest Java solution does some nearly branchless decoding for this.
Wow, that's pretty fast considering how simple main.cc looks. I do love c++. Nice use of coroutines, too.
I missed the "5 times in a row." If you do that, yeah, keeping the whole thing in memory is far better.
So you are basically at the mercy of the OS caching algorithm. That sounds like a bad plan for a benchmark. You are not measuring what you think you are (your code), you are measuring the OS caching policy.
Dumb question. With io_uring, how do you handle lines that straddle chunk boundaries? I'm asking since, AFAIU, the submitted requests are not guaranteed to be completed in order.
(The easiest I can think of is submitting reads for "overlapped" chunks, I'm not sure there is an easier way and I'm not sure of how much performance overhead there is to it.)
You have to handle it manually. Remember the partial lines at the beginning/end of your chunks and merge them when their mates become available.
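A simplified, in-order sketch of that bookkeeping (real io_uring pipelines have to match up out-of-order completions, but the splitting logic is the same):

    import java.util.Arrays;

    final class ChunkSplitter {
        private byte[] carry = new byte[0];

        // Returns only complete lines; the trailing partial line is held back
        // and prepended to the next chunk.
        byte[] completeLines(byte[] chunk, int len) {
            int lastNl = -1;
            for (int i = len - 1; i >= 0; i--) {
                if (chunk[i] == '\n') { lastNl = i; break; }
            }
            if (lastNl < 0) {                       // no newline at all: everything is carry
                carry = concat(carry, chunk, 0, len);
                return new byte[0];
            }
            byte[] out = concat(carry, chunk, 0, lastNl + 1);
            carry = Arrays.copyOfRange(chunk, lastNl + 1, len);
            return out;
        }

        private static byte[] concat(byte[] head, byte[] tail, int from, int to) {
            byte[] r = Arrays.copyOf(head, head.length + (to - from));
            System.arraycopy(tail, from, r, head.length, to - from);
            return r;
        }
    }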
What is the downside of memory mapping in this scenario? Shouldn't the page table properly handle the case of doing a single sequential read over a range of pages? Accessing the contents of a file doesn't seem like something caching would matter for. Do you mean that reading of sequential pages will keep adding to the cache compared to reading from a single page? That seems like a similar thing as before where they will be the first things gone from the cache anyways.
Shouldn't the page table properly handle the case of doing a single sequential read over a range of pages?
That's what I used to think, too. But the kernel ain't that smart.
Memory mapping (at least on Linux) isn't actually faster than reading the file manually. Especially if you use appropriately sized buffers.
(Of course, the five times in a row might mess with that.)
But would that really be faster when you need to read every byte of a file?
I thought memory mapping solved a different problem.
It would be a more interesting challenge if the data file were larger than the memory. I would love to see what people would come up with on some baby vm with 512 mb of ram.
Even more interesting would be small RAM, little local storage and a large file only available via network. I would like to see something other than HTTP, but realistically it would be HTTP.
Sixteen thousand Excel 97 spreadsheets?
For this problem, changing the file size to not fit in RAM doesn't really make the optimal solutions more interesting
Alternatively, "must be written in Java" can be interpreted to mean "must use the JVM to begin execution", and you can clearly spawn another process from Java...
Another rule was no external dependencies
I guess you could from Java itself write a new binary and then run that binary, but it would be against the spirit of the challenge.
There's sun.misc.Unsafe which is a builtin that gives you raw read/write. Turn that into ROP, mmap your native program, and now you're full native :)
Just use JNA to allocate an RWX buffer with mmap, unpack a bit of clever code into the buffer, and forge a native function pointer to call it :)
As far as I see the currently best performing solution [0] does not account for hash collisions and therefore probably generates wrong results if enough different cities are in the dataset. Or am I missing something?
[0] https://github.com/gunnarmorling/1brc/blob/main/src/main/jav...
Yes, you are right. This came up yesterday and indeed two solutions were violating the "must work for all station names" rule by relying on specific hash functions optimized for the specific data set, which I unfortunately missed during evaluation. I've just removed these entries from the leaderboard for the time being. Both authors are reworking their submissions and then they'll be added back.
[0] https://twitter.com/mtopolnik/status/1742652716919251052
The next obvious solution is to have something that works fast for non-colliding station names, but then falls back to another (slow) implementation for colliding station names.
It's very cheap to detect station name collisions - just sample ~10,000 points in the file, and hope to find at least one of each station. If you find fewer stations than the full run, you haven't yet seen them all - keep hunting. If you find two that collide, fall back to the slow method. Checking 0.001% of the dataset is cheap.
It seems like you have to run the slow collision-resistant thing over the whole file.
Collision detection is far cheaper. One method:
Simply add up all the bytes of all seen placenames while running your fast algorithm. This is effectively a checksum of all bytes of placename data.
Then at the end, calculate the 'correct' name-sum (which can be done cheaply). If it doesn't match, a collision occurred.
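A sketch of that check (names like Bin are hypothetical; as the next reply notes, a crafted input can defeat a plain byte sum, so treat it as a cheap sanity check rather than a proof):

    final class CollisionCheck {
        // Hypothetical per-bin bookkeeping kept by the fast path.
        record Bin(byte[] canonicalName, long count) {}

        // seenNameByteSum is accumulated in the hot loop: for every record,
        // add up the raw bytes of the name as it appears in the input.
        static boolean collisionFree(long seenNameByteSum, Bin[] bins) {
            long expected = 0;
            for (Bin b : bins) {
                if (b == null) continue;
                long nameSum = 0;
                for (byte by : b.canonicalName()) nameSum += by & 0xFF;
                expected += nameSum * b.count();
            }
            return expected == seenNameByteSum;
        }
    }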
You can design inputs that defeat this scheme though. I thought the point was to be correct for all valid inputs.
But how would you detect that a station name is colliding or not colliding? With a hash set?
Search for cuckoo hashing. It's a whole thing for data structures.
Have you considered having two datasets? The "development" dataset which is public, and the "competition" dataset which is private?
That might help against algorithms that are (accidentally, or on purpose) tuned to the specific dataset.
Yeah if you are releasing the competition dataset then it can and will 100% be gamed. What is to stop me from just hardcoding and printing the result without any computation?
Your code is submitted (per parent comment)
Sure, but who is manually checking every entry?
You don't need to check every entry, just the top few.
What are the constraints on station names - i.e. min length, max length, maximum number of different names?
Just clarified this in the README:
* Input value ranges are as follows:
- Station name: non null UTF-8 string of min length 1 character and max length 100 characters
- Temperature value: non null double between -99.9 (inclusive) and 99.9 (inclusive), always with one fractional digit
* Implementations must not rely on specifics of a given data set, e.g. any valid station name as per the constraints above and any data distribution (number of measurements per station) must be supported
Q: Can I make assumptions on the names of the weather stations showing up in the data set?
A: No, while only a fixed set of station names is used by the data set generator, any solution should work with arbitrary UTF-8 station names (for the sake of simplicity, names are guaranteed to contain no `;` character).
I'm unsure if it's intentional or not, but this essentially means that the submission should be correct for all inputs, but can and probably should be tuned for the particular input generated by `create_measurements.sh`. I can imagine submissions with a perfect hash function tuned for the given set of stations, for example.
Given this requirement, they would be wise to have the test data be different from the example data. That would prevent overfitting optimizations.
I'm not sure that it is overfitting to optimize your code for the test set. The requirement is just that you don't break compatibility for arbitrary names in the process.
This sort of thing shows up all the time in the real world - where, for example, 90% of traffic will hit one endpoint. Or 90% of a database is one specific table. Discovering and microoptimizing for the common case is an important skill.
That's why I was so unsure. I think, though, that it is generally better to have the input generator seeded and not disclose the chosen seed in advance, which prevents overfitting to a single input and yet encourages overfitting to a much bigger class of inputs.
That invites probing submissions to figure out a good hash for all names from that one seed.
I think having the names for performance tests public is fine. But there should be correctness tests on sets with random names.
Of course, I assume names are also randomly generated. But it might be the case that they are all ASCII-only, which can be exploited.
There should be no optimizations which rely on a-priori knowledge of the dataset. I.e. if you calculate a perfect hash function at runtime by inspecting the dataset, that's fine (not sure whether it's beneficial), but hard-coding it is not.
That is a tricky rule to enforce. For example, some solutions assume that the temperature values fit into an int, which could be interpreted as relying on a priori knowledge.
That's true. Some assumptions always have to be made.
In this case, we can assume place names that can be arbitrary strings and temperatures that occur naturally on earth. Optimizing around those constraints should be fine. We could probably limit the string length to something like 100 characters or so.
Bad assumptions would for example be to assume that all possible place names are found within the example data.
UTF-8 makes this far harder... But if you stick to the letter but not spirit of the rules, you can simply fall back to a slow implementation if you ever detect any byte > 127 (indicating a multibyte UTF-8 character).
Only if you validate the UTF-8 as being valid. If you just accept that it is you can treat it as just some bytes. Nothing in the spec I see requires processing UTF-8 as an actual Unicode string.
The easiest way to handle Unicode is to not handle it at all, and just shove it down the line. This is often even correct, as long as you don't need to do any string operations on it.
If the author wanted to play Unicode games, requiring normalization would be the way to go, but that would turn this into a very different challenge.
Since the tail of the line has a known format, I guess we are rescued by the fact that the last 0x3B is the semicolon, as the rest is just a decimal number. We can't know the first 0x3B byte is the semicolon since the place names are only guaranteed to not contain 0x3B but can contain 0x013B. So a parser should start from the rear of the line, read the number up to the semicolon, and then it can treat the place name as byte soup. Had two places shared a line, this challenge would have required real UTF parsing and been much harder.
I'm not sure scanning backwards helps here. Running in reverse you still need to look for a newline while scanning over a UTF-8 string which might plausibly contain a newline byte.
I'm no UTF-8 guru, but I think it might be possible to use this sort of springboard for skipping over multi-byte codepoints, since as far as I understand the upper bits of the first byte encode the length:
int lead = val & 0xFF;
if ((lead & 0xF8) == 0xF0) {        // 4 byte codepoint (lead byte 11110xxx)
    // skip 3 continuation bytes
}
else if ((lead & 0xF0) == 0xE0) {   // 3 byte codepoint (lead byte 1110xxxx)
    // skip 2 continuation bytes
}
else if ((lead & 0xE0) == 0xC0) {   // 2 byte codepoint (lead byte 110xxxxx)
    // skip 1 continuation byte
}
Yes, sorry, I'm so used to this sort of thing with the work I've been doing lately I forgot to lay it out like that. Thank you for filling it in for me.
I don't understand, it should be pretty easy. A rolling average with BigDecimal would probably be sufficient, but a scientific lib might be better for a rolling average of more than a hundred million numbers.
https://stackoverflow.com/questions/277309/java-floating-poi...
The difficulty is creating the fastest implementation. If you look at the results of the submissions so far you’ll see a big difference in duration, between 11 seconds and more than 4 minutes.
11 seconds seems pretty impressive for a 12GB file. Would be interesting to know what programming language could do it faster. For a database comparison you'd probably want to include loading the data into your database for a fair comparison.
Perl would do it quite fast and it has the benefit of accessing posix primitives directly.
A naive perl solution is really really slow compared to even the reference Java implementation. (I know, I've tried)
That's strange, you should be able to stream the file right into a tiny perl executable at the same speed as the bottlenecking hardware. The kernel will take care of all the logistics. You're probably trying to do too much explicitly. Just use a pipe. Perl should be done before Jit completes.
Using cat to redirect the file to /dev/null takes 18s on my machine (a low-end NUC). Just running a noop on the file in Perl (ie. feeding it into a `while (<>)` loop but not acting on the contents) takes ~2 minutes.
1B lines is a lot, and Java ain't a slouch.
Why are you using cat at all? Use a pipe. This isn't hard stuff. Don't use <>, feed the file into a scalar or array. it should only take a few seconds to process a billion lines.
https://www.perl.com/pub/2003/11/21/slurp.html/#:~:text=Anot....
If it isn't hard, then perhaps you could demonstrate with a complete perl program that you think should beat java.
I profiled my attempt, actually reading each line is the bottleneck.
Perl is always going to be much faster than Java at tasks like this. Use stdin and chomp() instead of reading each line explicitly.
This is really a small, trivial task for a perl script. Even with a billion lines this is nothing for a modern cpu and perl.
It’s easy to solve but even fizzbuzz becomes complicated if you want double digit GB/s output.
It's really not. We're talking about gigahertz CPUs and likely solid state storage that can stream many gb/s.. running through a perl script. There really isn't much that is faster than that.
This is just a map/reduce problem. Use Hadoop. It's Java, isn't it?
Why would I use Hadoop for such a small number of rows…?
1 billion is small for hadoop?
If it fits on one computer it's not a hadoop problem.
It fits on a dusty ten year old USB stick
Sounds like an awk problem tbh.
A Hadoop submission may help people realize that. But since you only have one machine to work with it should be obvious that you're not going to get any speed-up via divide and conquer.
Anything that fits in RAM on one machine is easily too small for Hadoop. In those cases, the overhead of Hadoop is going to make it get destroyed by a single beefy machine. The only times where this might not be the case is when you're doing a crazy amount of computation relative to the data you have.
Note that you can easily reach 1TB of RAM on (enterprise) commodity hardware now, and SSDs are pretty fast too.
Old but gold post from 2014: https://adamdrake.com/command-line-tools-can-be-235x-faster-...
Also from 2013: https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
~14GB file? it's on the small side for hadoop
No external dependencies may be used
The rule lawyer in me wants to spend the first run spinning up a background daemon that loads everything into memory, pins it there, and maybe even prefetches everything into cache as the subsequent runs perform basically a linear scan (you never have to pagemiss if you have an oracle!).
write a Java program for retrieving temperature measurement values from a text file and calculating the min, mean, and max temperature per weather station
Depending on how far you want to stretch it, doesn’t precomputing the result on 1st run count too? Or pre-parsing the numbers into a compact format you can just slurp directly into rolling sums for subsequent runs.
Not the slightest in the spirit of the competition, but not against the rules as far as I can tell.
Edit: if we don't like pre-computation, we can still play with fancy out-of-the-box tricks like pre-sorting the input, pre-parsing, compacting, aligning everything, etc.
Edit 2: while we're here, why not just patch the calculate_time script to return 0 seconds :) And return 9999 for competitors, for good measure
I think that would violate this rule
The computation must happen at application runtime, i.e. you cannot process the measurements file at build time (for instance, when using GraalVM) and just bake the result into the binary
Yeah, and I'd count caching results between runs into the same category, i.e. against the spirit of this challenge. Folks should keep in mind that the goal is to learn something new, rather than "winning" by cheating the rules.
Though the caching done by the OS (I/O for instance) is fair game
Yes, exactly. Retrieving the source data file from the page cache != storing and reloading precomputed results.
Ah I didn't actually follow the link to the github repo. I don't think that precludes pre-processing the input into something that needs no parsing, however, or spawning a background daemon that prefetches everything in the right order.
The first run is indeed runtime, not build time, so technically I think I'd still comply by pre-computing, sending that to the daemon (or just stashing it somewhere in /tmp, or, even crazier, patching the jarfile), and just printing it back out on the second run onwards.
I don't think that precludes pre-processing the input into something that needs no parsing
Why not preprocess it into a file that contains the answer? (-: My read of the description was just: you're meant to read the file as part of the run.
There is a real issue with providing the contestant the real exact file that will be used in the contest. Because there are 1e9 shades of gray between hard coding the correct answer in a single line program which doesn't even read the input, and processing the file as if we don't know what it contains.
This might become a contest of judging what is fair pre-computing and what is not.
That's why machine learning contests don't let participants see the final data.
Fix:
A run consists of:
* I generate random measurements with the provided script, which will be stored in a readonly file.
* I run your program to calculate the results (and time it).
* I run my baseline program and your entry is valid if my baseline program calculates the same results.
The filesystem has to be read-only as well, otherwise the first run can simply write the results to ~/.me_cheater and the program first reads that file before attempting to read the dataset.
I think the rules should stipulate each run is on its own tmpfs and all processes and pagecaches are cleared between runs.
Each contender will be run five times in a row. The slowest and the fastest runs are discarded. The mean value of the remaining three runs is the result for that contender and will be added to the leaderboard.
Shouldn’t he take the single fastest time, assuming file (and JDK) being in file cache is controlled for?
This is done to simulate real-world performance. Your binary is not the only binary in the system and other services may be running as well. So fastest time is the happiest path and slowest is the unluckiest.
The range of remaining three is what you expect to get 99% of the time on a real world system.
Your binary is not the only binary in the system and other services may be running as well.
Technically yes, but these days most of my machines are single purpose VMs; database/load balancer/app server/etc, so it still seems weird not to take the fastest.
Then, your VM is not the only VM sharing the bare metal. Same thing applies, only on a slightly higher level.
As long as you share the metal with other stuff (be it containers, binaries, VMs), there's always competition for resources, and your average time becomes your real world time.
Physical memory is not fungible like that across VMs. So, you can expect stuff loaded into memory to stay there unless the kernel inside your VM decides otherwise.
No, it isn't. VirtIO's ballooning device can "inflate" to pseudo-allocate memory in a VM and free physical memory for other guests.
Moreover, even if your file is in memory, you cannot reserve "memory controller bandwidth". A VM using tons of memory bandwidth, or a couple of cores in the same NUMA node as your VM, will inevitably cause some traffic and slow you down.
That VM is hardly "single purpose".
There's logrotate and other cleanup tasks, monitoring, DNS, a firewall, and more running on that server. No matter how much you offload to the host (or forego), there's always a kernel and supporting daemons running alongside or under your app.
I said "technically yes" :)
Any production where you're routinely scanning a text file with a million records is probably a batch process, and I'd be shocked if the usual performance wasn't much closer to the worst case than the average.
Why?
The author does explicitly want all the files to be cached in memory for the later runs: https://x.com/gunnarmorling/status/1742181941409882151?s=20
Just for fun. Speed testing awk vs Java.
awk -F';' '{
    station = $1
    temperature = $2
    sum[station] += temperature
    count[station]++
    if (count[station] == 1 || temperature < min[station]) {
        min[station] = temperature
    }
    if (count[station] == 1 || temperature > max[station]) {
        max[station] = temperature
    }
}
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"   # gawk: iterate stations in name order
    sep = "{"
    for (s in sum) {
        mean = sum[s] / count[s]
        printf "%s%s=%.1f/%.1f/%.1f", sep, s, min[s], mean, max[s]
        sep = ", "
    }
    print "}"
}' measurement.txt
I'd like to see it speed tested against an instance of Postgres using the file Foreign Data Wrapper https://www.postgresql.org/docs/current/file-fdw.html
CREATE EXTENSION file_fdw;
CREATE SERVER stations FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE records (
station_name text,
temperature float
) SERVER stations OPTIONS (filename 'path/to/file.csv', format 'csv', delimiter ';');
SELECT station_name, MIN(temperature) AS temp_min, AVG(temperature) AS temp_mean, MAX(temperature) AS temp_max
FROM records
GROUP BY station_name
ORDER BY station_name;
Man, Postgres is so cool and powerful.
Using FDWs on a daily basis I fully realize their power and appeal but at this exclamation I paused and thought - is this really how we think today? That reading a CSV file directly is a cool feature, state of the art? Sure, FDWs are much more than that, but I would assume we could achieve much more with Machine Learning, and not even just the current wave of LLMs.
Why not have the machine consider the data it is currently seeing (type, even actual values), think about what end-to-end operation is required, how often it needs to be repeated, make a time estimate (then verify the estimate, change it for the next run if needed, keep a history for future needs), choose one of methods it has at its disposal (index autogeneration, conversion of raw data, denormalization, efficient allocation of memory hierarchy, ...). Yeah, I'm not focusing on this specific one billion rows challenge but rather what computers today should be able to do for us.
Just modified the original post to add the file_fdw. Again, none of the instances (PG or ClickHouse) were optimised for the workload https://ftisiot.net/posts/1brows/
For some reason I'm impressed that the previous page in the documentation is https://www.postgresql.org/docs/current/earthdistance.html. Lotta fun stuff in Postgres these days.
Your sum variable could get pretty large, consider using the streaming mean function, something like: new_mean = ((n*old_mean)+temp)/(n+1)
Had this same thought as well. Definitely comes up when calculating variance this way in streaming data.
https://stats.stackexchange.com/a/235151/1036
I am not sure if `n*old_mean` is a good idea. Welford's is typically something like this inside the loop:
count += 1; delta = current - mean; mean += delta/count;
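As a minimal, illustrative Java sketch of that update:

    // Welford-style streaming mean: one pass, constant memory, no large running sum.
    final class RunningMean {
        private long count = 0;
        private double mean = 0.0;

        void add(double x) {
            count++;
            double delta = x - mean;
            mean += delta / count;
        }

        double mean() { return mean; }
    }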
Using ClickHouse local:
time clickhouse local -q "SELECT concat('{', arrayStringConcat(groupArray(v), ', '), '}')
FROM
(
SELECT concat(station, '=', min(t), '/', max(t), '/', avg(t)) AS v
FROM file('measurements.txt', 'CSV', 'station String, t Float32')
GROUP BY station
ORDER BY station ASC
)
SETTINGS format_csv_delimiter = ';', max_threads = 8" >/dev/null
real 0m15.201s
user 2m16.124s
sys 0m2.351
Most of the time is spent parsing the file.

The slowest and the fastest runs are discarded. The mean value of the remaining three runs is the result for that contender and will be added to the leaderboard.
I think it's better to discard the two slowest, or simply accept the fastest as the correct. There's (in my opinion) no good reason to discard the best runs.
This is a pretty standard measure called the Trimmed Mean: https://statisticsbyjim.com/basics/trimmed-mean/
For performance benchmarking the minimal runtime is typically the best estimator if the computations are identical, cause it measures perf w/o interrupts.
If the language is garbage collected, or if the test is randomized you obviously don't want to look at the minimum.
the minimal runtime is typically the best estimator
Depends what you’re estimating. The minimum is usually not representative of “real world” performance, which is why we use measures of central tendency over many runs for performance benchmarks.
Variability in software runtime arises mostly from other software running on the same system.
If you are looking for a real-world, whole-system benchmark (like a database or app server), then taking the average makes sense.
If you are benchmarking an individual algorithm or program and its optimisations, then taking the fastest run makes sense - that was the run with least external interference. The only exception might be if you want to benchmark with cold caches, but then you need to reset these carefully between runs as well.
If you don't think discarding the fastest is acceptable then why are you in favor of discarding the slowest?
Because there can be outliers in terms of slowness; perhaps the CPU gets a hiccup, the GC runs an abnormal number of extra times, the kernel decides to reschedule stuff, whatever - I don't know much about these things.
There can't be outliers in terms of fastness; the CPU doesn't accidentally run the program much faster.
But then again, what the hell do I know...
It's the same logic both ways. The OS runs 10 background tasks on average at any given time. On one of your runs it was doing 15 things, while on another it was doing 5 things. They are both outliers.
There are good reasons to discard the best runs. If you think of your system as having predictable behavior that's slowed down by background processing happening on the machine, it might make sense to use the best run. But if there are any intrinsic sources of non-determinism in the program (and it's far more likely than you might think), taking the best time is likely to be unrepresentative.
https://tratt.net/laurie/blog/2019/minimum_times_tend_to_mis... is a good piece on the subject.
first search result for averaging streaming data:
https://nestedsoftware.com/2018/03/20/calculating-a-moving-a...
so you just walk the file and read a chunk, update the averages and move on. the resource usage should be 0.000 nothing and speed should be limited by your disk IO.
actually i think you can also just average each chunk and then add it to existing data. like read N rows (say all have one location to keep it simple), average the data from the chunk, update/save min and max, move on to next chunk, do the same but now update the average by adding to existing/previously computed average and divide by two. the result will be the same - disk IO will be the most limiting aspect. this "challenge" is not really a challenge. there is nothing complicated about it. it just seems "cool" when you say "process 1 billion rows the fastest you can".
Wouldn't this end up reducing the weight of earlier records by repeatedly dividing them into smaller chunks?
I.e. avg of {22.5, 23, 24} = 23.17... But:
1. 22.5
2. (22.5 + 23)/2 = 22.75
3. (22.75 + 24)/2 = 23.375
you have to weight the previous results appropriately then it works
say you load 1 million records and you average them at 5.1. you then load another million and average them at 4.5. so (5.1 + 4.5) / 2 = 4.8. rinse and repeat. as long as you keep the amount of records processed per each run about the same, your numbers will not be skewed. only the last chunk will most likely be smaller and it will introduce a small rounding error, like if it has only 10k records instead of 1M. but still it is the simplest solution with a good enough outcome.
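In code, the general (unequal-chunk) version is just a weighted merge; the divide-by-two shortcut above is the special case where both chunks have the same size (a sketch):

    // e.g. mergeMeans(5.1, 1_000_000, 4.5, 1_000_000) ~= 4.8
    static double mergeMeans(double mean1, long n1, double mean2, long n2) {
        return (mean1 * n1 + mean2 * n2) / (n1 + n2);
    }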
Yes, just read a file in chunks and spread the math across cores. How many ways could you possibly implement that?? :)
Custom number parsing, minimising the number of memory allocations to not be punished by the garbage collector. All sorts of micro optimisations that make those solutions a terrible way to showcase a language (i.e. you can write much clearer and more concise code, but obviously slower).
Looking at some solutions they seem to include their own double parsing implementation. I built a home made serializer for csv files, I am using the default .net parsing functions and I find that parsing numbers/dates is by far the slowest part of the process on large files.
but… can I use pandas?
From a pure performance perspective it’d probably be quite slow in comparison to a dedicated implementation for this task. Especially as it wouldn’t fit in memory.
why wouldn't it fit into memory? the data set is only 12gb.
for my SQL implementation, I normalized the city names and temps to 2 bytes apiece, so it's even less!
It takes 330 seconds on a machine where the top OpenJDK version takes 5.75 secs and my .NET version takes 4.8 seconds. However it's just several lines of code and I used ChatGPT as a fun use case for how long it would take to have something working. It took around 5 mins. So if one needs to run this code only once, Pandas would be a huge win considering development time. Interestingly the RAM usage was much more than the 14 GB input file, probably 2-2.5x of that.
BTW asking ChatGPT to utilize all cores did not yield anything working in reasonable time.
hahaha... nice! Out of interest, could you share more about your .NET solution? (the specifics but not the code)
It's all on GitHub
Multiple languages: https://github.com/gunnarmorling/1brc/discussions My repo: https://github.com/buybackoff/1brc
looking at the sample code I’m quite glad I’ve never really touched Java. What a clunky language
Ok, I’ll bite - why?
It looks like they mostly work in JS, so maybe it's the type definitions that look clunky.
Java is explicit, yes. This is intentional :)
yea, I guess it's the lack of human-friendly default assumptions mixed with prescriptive design patterns. definitely an anachronistic design intent :(
it feels like the functionality to characters-typed ratio is quite low. at the same time, the code isn't much cleaner and there's a lot of mental overhead in deciphering the abstractions (in this case Collector).
Java is probably the language I use the most, but I hate how much boilerplate it has. Something I've found to help with that is using Java records instead of classes. A record is kind of like a struct, though I don't think it's exactly the same: there's no boilerplate, and something I quite like about them is that their fields are immutable (more specifically "private final"), which works well with more functional programming, something that can be quite clunky with classes in Java.
Anyway I'm not a software engineer so take what I say about software with a grain of salt. Also I feel like other java programmers will hate you if they need to read or use your code. For example, let's say you had a 2D point data type that is a record. Instead of things being like "pos.add(5)", if pos is a record you need to reassign it like "pos = pos.add(5)", where add returns a new pos object. Similar to how BigInteger works in java.
Anyway I love records because there’s zero boilerplate, it feels very similar to me to how structs are written in rust and go, or how classes are done in scala. I just never see anyone talk about it. I guess that might be because most people programming new projects on JVM are probably not programming in java?
Unfortunately I have no time to code this up but things that I would try to make it fast:
- Read with O_DIRECT (but compare it with the mmap approach). I know O_DIRECT gets much hate, but this could be one of the rare cases where it helps.
- Use a simple array with sentinel and linear search for the station names. This is dumb, but if the number of stations is small enough this could beat the hash. (In the back of my head I have a rule of thumb that linear search is faster for up to 1000 elements, but I'm not sure anymore where I got this from).
Read with O_DIRECT
apparently the challenge explicitly allows for warming up the os cache on a prior run, so O_DIRECT would be disadvantaged here as the working set fits in memory.
edit: also in my very simple tests a good hashmap can beat linear search already at <10 items. It depends a lot on the hash function cost.
beat linear search already at <10 items
The sources I found corroborate this. My 1000 elements rule of thumb is apparently way off.
linear search is never faster assuming your hash algorithm is cheap enough. However, if your list only contains 2 items, then your hash algorithm better be "Is first character > 'm'".
When you want real speed like this and are happy with insane complexity, code that writes and self-compiles a hash function based on the data can make sense.
I tried the linear search by station name in my first naive approach. Using a hashmap was at least 2-3x as fast with the ~415 distinct keys in the 1BRC dataset.
Single line solve using clickhouse-local or duckdb.
No external dependencies may be used
The Java runtime is an external dependency, isn’t it?
the whole premise is silly though. why would anyone use plain java to compute this when databases were built for this or at least are the most finely tuned for it
correct. like this:
time duckdb -list -c "select map_from_entries(list((name,x))) as result from (select name, printf('%.1f/%.1f/%.1f',min(value), mean(value),max(value)) as x from read_csv('measurements.txt', delim=';', columns={'name': 'varchar', 'value':'float'}) group by name order by name)"
takes about 20 seconds
This has been super fun. My current PR[1] is another 15% faster than my entry in the README. Sadly I haven't been able to make any progress on using SIMD to accelerate any part of it.
I think the issues with hashing could be easily covered by having the 500 city names in the test data also be randomly generated at test time. There is no way to ensure there aren't hash collisions without doing a complete comparison between the names.
1 billion rows, 500 unique values...
It becomes very possible to find an instance of each unique value, then runtime-design a hash algorithm where those 500 values don't collide.
Java allows self-modifying code, after all (and this challenge also allows native code, which can likewise be compiled at runtime).
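A hedged sketch of what "runtime-design a hash" could look like once the distinct names are known; the seed search and String.hashCode are stand-ins (real code would hash the raw name bytes):

// Hypothetical: search for a multiplier that makes a simple multiplicative
// hash collision-free for exactly the keys we have observed.
import java.util.*;

public class PerfectHashSearch {
    // Returns a seed such that the derived slot is unique for every key,
    // or -1 if none is found within the attempt budget.
    static long findSeed(List<String> keys, int tableBits) {
        int size = 1 << tableBits;
        boolean[] used = new boolean[size];
        Random rnd = new Random(42);
        for (int attempt = 0; attempt < 1_000_000; attempt++) {
            long seed = rnd.nextLong() | 1L;        // odd multiplier
            Arrays.fill(used, false);
            boolean ok = true;
            for (String k : keys) {
                int slot = (int) ((k.hashCode() * seed) >>> (64 - tableBits));
                if (used[slot]) { ok = false; break; }
                used[slot] = true;
            }
            if (ok) return seed;
        }
        return -1;
    }
}

Even with a seed that is collision-free for the known keys, an unseen key can still land in an occupied slot, so the fast path still needs at least one comparison against the stored name, which is the point the next comment makes.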
You have to compare the keys in order to figure out whether you have a hash collision or a new key. Without first scanning the entire file to learn the full list of keys, there isn't a way around it. And even determining that list of keys involves doing key comparisons for every row.
I think the optimal strategy would be to use the "reduce" step from MapReduce. Have threads that read portions of the file and add data to a "list", one for each unique name. Then this set of threads can process those lists. I don't think we need to sort, that'd be too expensive; a linear pass is enough. I can't see how we can use SIMD, since we want max/min, which mandates a linear pass anyway.
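Something like this shape, as a hedged sketch (parsing, chunking at newline boundaries, and the per-thread scan are all elided; Stats and combine are made-up names, and temperatures are assumed to be kept as integer tenths of a degree):

import java.util.*;

// Per-station running aggregate.
class Stats {
    long min = Long.MAX_VALUE, max = Long.MIN_VALUE, sum, count;
    void add(long t)    { min = Math.min(min, t); max = Math.max(max, t); sum += t; count++; }
    void merge(Stats o) { min = Math.min(min, o.min); max = Math.max(max, o.max); sum += o.sum; count += o.count; }
    double mean()       { return sum / 10.0 / count; }
}

class Combine {
    // Each thread fills its own Map over its own chunk of the file, so no
    // concurrent data structures are needed; folding ~400 keys per thread
    // at the end is negligible compared to the scan itself.
    static Map<String, Stats> combine(List<Map<String, Stats>> partials) {
        Map<String, Stats> total = new HashMap<>();
        for (Map<String, Stats> p : partials)
            p.forEach((name, s) -> total.merge(name, s, (a, b) -> { a.merge(b); return a; }));
        return total;
    }
}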
Agreed, the aggregations chosen here are embarrassingly parallel; you just keep the count so you can combine the means.
Would have been more interesting with something like median/k-th percentile, or some other aggregation not as easy.
Not sure if this is what you meant, but there are SIMD min/max instructions.
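For example, with Java's incubating Vector API you can reduce min/max over a block of already-parsed values. This is only a sketch under the assumption that a station's temperatures have been buffered into an int[] (which real solutions avoid doing), and it needs --add-modules jdk.incubator.vector:

import jdk.incubator.vector.*;

class SimdMinMax {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    // Returns {min, max} of the array, vectorized with a scalar tail.
    static int[] minMax(int[] temps) {
        int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
        int i = 0;
        for (; i <= temps.length - SPECIES.length(); i += SPECIES.length()) {
            IntVector v = IntVector.fromArray(SPECIES, temps, i);
            min = Math.min(min, v.reduceLanes(VectorOperators.MIN));
            max = Math.max(max, v.reduceLanes(VectorOperators.MAX));
        }
        for (; i < temps.length; i++) {            // scalar tail
            min = Math.min(min, temps[i]);
            max = Math.max(max, temps[i]);
        }
        return new int[] { min, max };
    }
}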
This would make for a really fun challenge in SQL, too.
May be of interest: https://benchmark.clickhouse.com
1:05 in PostgreSQL 16, but it was harder than I thought to saturate the CPUs and not be disk-bound. Also, I ran on GCP, not Hetzner, so the hardware may differ.
SQL in theory makes this trivial: it handles many of the big optimizations and, looking at the repo, cuts 1000+ LOC down to a handful. Modern SQL engines handle everything for you, which is the whole damned point of SQL. Any decent engine will handle parallelism, caching, I/O, etc. Some exotic engines can leverage a GPU, but for N = 1 billion I doubt the GPU will be faster. Here's the basic query:
SELECT city, MIN(temp), AVG(temp), MAX(temp) FROM temps GROUP BY 1 ORDER BY 1;
In practice, generic SQL engines like PostgreSQL bloat the storage, which is a Big Problem for queries like this. In my test, even with INT2 normalization (see below), pgsql took 37 bytes per record, which is insane (23+ bytes of overhead to support transactions: https://www.postgresql.org/docs/current/storage-page-layout....). The big trick is to use PostgreSQL arrays to store the data by city, which removes this overhead and reduces the table size from 34GB (doesn't fit in memory) to 2GB (which does).
The first optimization is to observe that the cardinality of cities is small and can be normalized into integers (INTEGER aka INT4), and that the temps can be as well (1 decimal of precision). Using SMALLINT (aka INT2) is probably not faster on modern CPUs, but it should use less RAM, which is better both for caching on smaller systems and for cache hit rate on all systems. NUMERIC generally isn't faster or tighter on most engines.
To see the query plan, use EXPLAIN:
postgres=# explain SELECT city, MIN(temp), AVG(temp), MAX(temp) FROM temps_int2 GROUP BY 1 ORDER BY 1 limit 5;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Limit (cost=13828444.05..13828445.37 rows=5 width=38)
-> Finalize GroupAggregate (cost=13828444.05..13828497.22 rows=200 width=38)
Group Key: city
-> Gather Merge (cost=13828444.05..13828490.72 rows=400 width=38)
Workers Planned: 2
-> Sort (cost=13827444.02..13827444.52 rows=200 width=38)
Sort Key: city
-> Partial HashAggregate (cost=13827434.38..13827436.38 rows=200 width=38)
Group Key: city
-> Parallel Seq Scan on temps_int2 (cost=0.00..9126106.69 rows=470132769 width=4)
JIT:
Functions: 8
Options: Inlining true, Optimization true, Expressions true, Deforming true
(13 rows)
Sigh, pg16 is still pretty conservative about parallelism, so let's crank it up.
SET max_parallel_workers=16; set max_parallel_workers_per_gather=16;
SET min_parallel_table_scan_size=0; set min_parallel_index_scan_size=0;
SET parallel_setup_cost = 0; -- Reduce the cost threshold for parallel execution
SET parallel_tuple_cost = 0.001; -- Lower the cost per tuple for parallel execution
postgres=# explain SELECT city, MIN(temp), AVG(temp), MAX(temp) FROM temps_int2 GROUP BY 1 ORDER BY 1 limit 5;
...
Workers Planned: 14
...
top(1) is showing that we're burying the CPU:
top - 10:09:48 up 22 min, 3 users, load average: 5.36, 1.87, 0.95
Tasks: 169 total, 1 running, 168 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.6 us, 4.8 sy, 0.0 ni, 7.8 id, 74.2 wa, 0.0 hi, 0.7 si, 0.0 st
MiB Mem : 32084.9 total, 258.7 free, 576.7 used, 31249.5 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 30902.4 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1062 postgres 20 0 357200 227952 197700 D 17.9 0.7 9:35.88 postgres: 16/main: postgres postgres [local] SELECT
1516 postgres 20 0 355384 91232 62384 D 17.6 0.3 0:08.79 postgres: 16/main: parallel worker for PID 1062
1522 postgres 20 0 355384 93336 64544 D 17.6 0.3 0:08.53 postgres: 16/main: parallel worker for PID 1062
1518 postgres 20 0 355384 90148 61300 D 17.3 0.3 0:08.53 postgres: 16/main: parallel worker for PID 1062
1521 postgres 20 0 355384 92624 63776 D 17.3 0.3 0:08.54 postgres: 16/main: parallel worker for PID 1062
1519 postgres 20 0 355384 90440 61592 D 16.6 0.3 0:08.58 postgres: 16/main: parallel worker for PID 1062
1520 postgres 20 0 355384 92732 63884 D 16.6 0.3 0:08.49 postgres: 16/main: parallel worker for PID 1062
1517 postgres 20 0 355384 91544 62696 D 16.3 0.3 0:08.55 postgres: 16/main: parallel worker for PID 1062
Interestingly, when we match workers to CPU cores, %CPU drops to 14%, i.e. we don't leverage the hardware.
OK, enough pre-optimization; here's the baseline:
postgres=# SELECT city, MIN(temp), AVG(temp), MAX(temp) FROM temps_int2 GROUP BY 1 ORDER BY 1 limit 5;
city | min | avg | max
------+-----+----------------------+------
0 | 0 | 276.0853550961625011 | 1099
1 | 0 | 275.3679265859715333 | 1098
2 | 0 | 274.6485567539599619 | 1098
3 | 0 | 274.9825584419741823 | 1099
4 | 0 | 275.0633718875598229 | 1097
(5 rows)
Time: 140642.641 ms (02:20.643)
I also tried to leverage a B-tree covering index (CREATE INDEX temps_by_city ON temps_int2 (city) INCLUDE (temp)), but it wasn't faster - I killed the job after 4 minutes. Yes, I checked that pg16 used a parallel index-only scan (SET random_page_cost = 0.0001; set min_parallel_index_scan_size = 0; set enable_seqscan = false; SET enable_parallel_index_scan = ON;) - top(1) shows ~1.7% CPU, suggesting that we were I/O bound.
Instead, to amortize the tuple overhead, we can store the data as arrays:
CREATE TABLE temps_by_city AS SELECT city, array_agg(temp) from temps_int2 group by city;
$ ./table_sizes.sh
...total... | 36 GB
temps_int2 | 34 GB
temps_by_city | 1980 MB
Yay, it now fits in RAM.
-- https://stackoverflow.com/a/18964261/430938, adding IMMUTABLE PARALLEL SAFE
CREATE OR REPLACE FUNCTION array_min(_data ANYARRAY) RETURNS NUMERIC AS $$
SELECT min(a) FROM UNNEST(_data) AS a
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;
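(The query below also uses array_avg and array_max, which aren't shown above; presumably they follow the same pattern as array_min, e.g.:)
CREATE OR REPLACE FUNCTION array_max(_data ANYARRAY) RETURNS NUMERIC AS $$
SELECT max(a) FROM UNNEST(_data) AS a
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;
CREATE OR REPLACE FUNCTION array_avg(_data ANYARRAY) RETURNS NUMERIC AS $$
SELECT avg(a) FROM UNNEST(_data) AS a
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;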
SET max_parallel_workers=16; set max_parallel_workers_per_gather=16;
SET min_parallel_table_scan_size=0; set min_parallel_index_scan_size=0;
SET parallel_setup_cost = 0; -- Reduce the cost threshold for parallel execution
SET parallel_tuple_cost = 0.001; -- Lower the cost per tuple for parallel execution
postgres=# create table tmp1 as select city, min(array_min(array_agg)), avg(array_avg(array_agg)), max(array_max(array_agg)) from temps_by_city group by 1 order by 1 ;
SELECT 1001
Time: 132616.944 ms (02:12.617)
postgres=# explain select city, min(array_min(array_agg)), avg(array_avg(array_agg)), max(array_max(array_agg)) from temps_by_city group by 1 order by 1 ;
QUERY PLAN
-------------------------------------------------------------------------------------------------
Sort (cost=338.31..338.81 rows=200 width=98)
Sort Key: city
-> Finalize HashAggregate (cost=330.14..330.66 rows=200 width=98)
Group Key: city
-> Gather (cost=321.52..322.64 rows=600 width=98)
Workers Planned: 3
-> Partial HashAggregate (cost=321.52..322.04 rows=200 width=98)
Group Key: city
-> Parallel Seq Scan on temps_by_city (cost=0.00..0.04 rows=423 width=34)
(9 rows)
Ah, only using 3 cores of the 8...
---
CREATE TABLE temps_by_city3 AS SELECT city, temp % 1000, array_agg(temp) from temps_int2 group by 1,2;
postgres=# explain select city, min(array_min(array_agg)), avg(array_avg(array_agg)), max(array_max(array_agg)) from temps_by_city3 group by 1 order by 1 ;
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Finalize GroupAggregate (cost=854300.99..854378.73 rows=200 width=98)
Group Key: city
-> Gather Merge (cost=854300.99..854348.73 rows=2200 width=98)
Workers Planned: 11
-> Sort (cost=854300.77..854301.27 rows=200 width=98)
Sort Key: city
-> Partial HashAggregate (cost=854290.63..854293.13 rows=200 width=98)
Group Key: city
-> Parallel Seq Scan on temps_by_city3 (cost=0.00..98836.19 rows=994019 width=34)
JIT:
Functions: 7
Options: Inlining true, Optimization true, Expressions true, Deforming true
(12 rows)
This 100% saturates the CPU and runs in ~65 secs (about 2x faster).
---
Creating the data
Here's a very fast, lousy first pass for PostgreSQL that works on most versions - on pg16 there's also random_normal(). I'm not on Hetzner, so I used GCP (c2d-standard-8 with 16 vCPU, 64GB RAM, 150GB disk, Ubuntu 22.04, and PostgreSQL 16):
CREATE TABLE temps_int2 (city int2, temp int2);
-- random()*random() is a cheap distribution. 1100 = 110 * 10 to provide one decimal of precision.
INSERT INTO temps_int2 (city, temp) SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp from generate_series(1,1e9)i;
This is an IO bound problem. Trying to "go faster" by using multiple threads or vectorisation isn't going to achieve much.
No, it is not. The file is smaller than the available RAM, so will be cached in subsequent runs.
You're not up to date with how fast IO can be these days. Nobody who cares primarily about performance uses spinning rust anymore.
I suppose it defeats the spirit of the game to, knowing your worst run is discarded, calculate the results on the first run by whatever slow method you want, save them somewhere useful, and just read the results and print them out on the following runs?
Or at the very least, convert the input into a more convenient binary format for the following runs.
I thought the same, but to ensure fairness, I would suggest that the application should run in a stateless container without internet access, and the infrastructure (Hetzner VM) should be recreated from scratch for each run to eliminate all caches.
You had 32GB rather than the expected 16GB of RAM, but for Java I doubt even 32GB is enough. Java was absolutely fine back in the day, but nowadays there are a lot more options than that poor old language.
Meaningless to talk about how much RAM you need for a JVM program as it will (obviously) depend on what the program is doing.
Anyone up for also doing this in Rust?
See the "Show & Tell", where folks are discussing solutions in different languages, including Rust: https://github.com/gunnarmorling/1brc/discussions/categories....
Is there a way to create this file without having Java?
This is for use on a BSD where Java may not work too well.
Thanks
You can run the bin/create-sample program from this C implementation here: https://github.com/dannyvankooten/1brc
It’s just the city names + averages from the official repository using a normal distribution to generate 1B random rows.
I suspect Java is not the fastest language for this. I’d love to see unofficial contenders tackle this challenge using different languages.
There are a handful of implementations in other languages already. Here’s mine in C99: https://github.com/dannyvankooten/1brc
I’ve also seen versions in Rust, Go, Python, Clickhouse and DuckDB. The discussions tab on the GitHub repo lists some of these.
How to make someone else do your homework for free.
This must be Decodable's engineering problem disguised as a challenge.
I've idly been considering how to score this with more languages, with far less memory. I'm thinking, buildpacks on a Raspberry Pi.
At the Czech Technical University's C course, we had a very similar assignment. The submissions of all the students were continuously evaluated on a leaderboard, and many spent tens of hours optimizing to get the extra points for a better grade (but mostly for status points).
Very fun challenge that nerd sniped me right away. Had to do a C version in standard C99 with POSIX threads. It[1] clocks in at just under 4 seconds on my AMD Ryzen 4800U Laptop CPU.
Should run about 10-20% faster than that on the mentioned Hetzner hardware.
- Since we only do one decimal of floating point precision it uses integer math right from the get-go.
- FNV-1a hash with linear probing and a load factor well under 0.5.
- Data file is mmap’d into memory.
- Data is processed in 8 totally separate chunks (no concurrent data structures) and then those aggregations are in turn aggregated when all threads have finished.
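The hashing part, sketched in Java rather than C for illustration (table size and byte[] key handling are simplified assumptions, not the actual implementation):

// FNV-1a + open addressing with linear probing. ~400 stations into 2048
// slots keeps the load factor well under 0.5.
final class FnvTable {
    private static final int SIZE = 2048;           // power of two
    private final byte[][] keys = new byte[SIZE][];

    static int fnv1a(byte[] key) {
        int h = 0x811c9dc5;                         // 32-bit FNV offset basis
        for (byte b : key) {
            h ^= (b & 0xff);                        // xor the byte in first...
            h *= 0x01000193;                        // ...then multiply by the FNV prime
        }
        return h;
    }

    int slotFor(byte[] key) {
        int i = fnv1a(key) & (SIZE - 1);
        while (keys[i] != null && !java.util.Arrays.equals(keys[i], key)) {
            i = (i + 1) & (SIZE - 1);               // linear probing
        }
        if (keys[i] == null) keys[i] = key;         // claim an empty slot for a new station
        return i;
    }
}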
Any elixir devs in here find this funny?
Comparing solutions from the Rstats world: https://twitter.com/RandVegan/status/1742721195781267650
Would offering a cash prize increase or decrease the quality of submissions?
Would have been nice to accept other JVM languages like Scala, Clojure, Kotlin, etc and compare against Java, which I expect to have slightly better performance.
This can turn into a nice benchmark if done in multiple languages.
Java build times are very fast. You are just measuring your internet speed here.
(Also, gradle is faster as a build tool for incremental compilation)
Gradle is faster than what? Than maven? Maybe. But not than Go or Cargo.
What step are we talking about? javac itself is absolutely on the same order of magnitude of speed as Go per LOC, while Rust is significantly slower (which makes sense: the latter is a properly optimizing compiler with heavy static analysis, while the former two just spew out Java bytecode/machine code).
Gradle with a daemon is also pretty fast; you are probably just used to some complex project with hundreds of dependencies and are comparing it to a Cargo file with a single dependency.
Not really when it has to run on a cold JVM. And before it warms up, it’s already done.
Then you are in the territory of keeping the compiler process alive between runs, but that is a memory hog and also not always reliable (gradle often decides to run a fresh daemon for whatever reason).
So theoretically, in lab conditions yes, in practice no.
Everybody repeats that, but there is surprisingly little evidence in the form of benchmarks. I can see rustc compile over 50k LOC per second on average on my M2, which is roughly the same order of magnitude as Java. And checking without building is faster.
Javac doesn't run in JVM. It is a C (C++?) application.
https://github.com/openjdk/jdk/tree/master/src/jdk.compiler/...,
Wiki citation for good measure: https://en.wikipedia.org/wiki/Javac.
Maybe you're thinking of Hotspot, which is written in C++.
Just a nitpick, but the static analysis in Rust doesn’t account for most of why compilation is slow, as proven by the fact that cargo check is much faster than cargo build.
Yeah, I know. I believe it’s mostly the amount of LLVM IR that is the problem (e.g. each generic instantiation outputs by default a new copy of roughly the same code), isn’t it?
I've yet to see a project where the Gradle daemon a) does anything useful and b) is actually used by gradle itself (instead of seemingly doing everything from scratch; no idea what it does in the seconds it takes to start up).
This. Gradle wastes tremendous amounts of time on non-compile tasks. So even if Java takes a few seconds to compile, the whole build is often much, much longer. Interestingly, my impression is that maven is significantly more snappy.
Itself.
Implicit:
I mean, yes, saying Java builds are fast does seem a bit “rolls eyes, yes technically by loc when the compiler is actually running” …but, ^_^! let’s not start banging on about how great cargo/rust compile times are… they’re really terrible once procedural macros are used, or sys dependencies invoke some heinous c dependency build like autoconf and lots of crates do… and there’s still a subpar incremental compilation story.
So, you know. Eh. Live and let live. Gradle isn’t that bad.
So maybe I was just unlucky with gradle and lucky with cargo. A project that is a mixture of 20k LoC Scala and 300k LoC Java took 6 minutes to compile, and incremental was still several tens of seconds. Cargo/Rust cold compiles a project of 1M LoC (all dependencies) in about 1:30 on the same hardware and incremental is like 2-4 seconds.
As for procedural macros - yes, they can be slow to compile, but so are Java annotation processors.
Scala is a much different case: it has one of the most advanced type systems, many implicit scoping rules, etc. Even a Java change that affects the Scala codebase could result in a slow incremental build. Also, the build system may not have been as modular as it could have been.
In my experience, Java annotations get processed very fast.
Well, that's a bit of a stretch. It is definitely one of the most complex type systems, due to the interactions between nominal subtyping and type classes, which get a bit hairy. But in terms of what the type system can express or prove, it's both inferior to Rust in some respects (it cannot express lifetimes or exclusive access) and superior in others (HKTs, path-dependent types). Let's say they are close.
That’s what I meant mostly. Though if I’m being honest, I’m not familiar with the implementation of either language’s type system.
Scala is the problem here. Scala has several issues filed for slow compilation.
300K LOC Java should be done within ~1.5 minutes flat from zero. Can be even faster if you are running maven daemon for example. Or even within ~30 seconds if everything is within one module and javac is only invoked once.
That depends on the project and its scope.
I don't even think it's faster than Maven.
Gradle and fast should not be in the same sentence.
I have no idea what gradle and its daemon do, except spend orders of magnitude more time starting up than running the actual build.
You're much better off running javac directly. Then it's fast.
Then someone on your team failed to learn the bare minimum needed to write a sane gradle file and is putting imperative stuff into the configuration phase.
For big projects, compiling only what’s necessary is a huge time win.
Then, in over 15 years and 8 major versions, Gradle has failed to make a system that doesn't require "learning to write sane gradle files" just to ensure it's only two orders of magnitude, and not five, slower than it could be.
Has nothing to do with the slowness that is gradle.
It is, and no one is arguing with that
It is your own failure if you blame the tool rather than yourself for not having learnt it.
A tool that:
- in 15 years is still slow by default
- in 15 years could not acquire any sensible defaults that shouldn't need extra work to configure
- is written in an esoteric language with next to zero tools to debug and introspect
- randomly changes and moves things around every couple of years for no discernible reason, while making zero impact on reducing its slowness
- has no configuration specification to speak of because the config is written in that same esoteric programming language
And its zealots blame its failures on its users
Build systems are a complex problem. Sure, you can compare it to cargo's or Go's build system, etc., but those forget about 90% of the problem and come up with something hard-coded for the happy-path Rust/Go project, with zero non-native dependencies, all downloaded from the repository. This is not a bad thing, don't get me wrong. But it is not enough to compile, say, a browser.
Meanwhile, there are only a couple of truly general build systems, all of them quite complex, as you can't get away with less. Gradle and Bazel are pretty much the only truly generic build systems (I'm sure there are a couple more, but not many). You can actually compile a mixed Java, C++, whatever project with Gradle, for example, just fine, with proper parallelism, caching, etc.
As for your points:
Absolutely false. A 3-4 line gradle build file for Java will be fast, will parallelize correctly, and will maximally reuse previously built artifacts.
It's literally 4 lines for a minimal project, maybe even less. I'm getting the feeling that you have zero idea what you're talking about.
I somewhat agree, though nowadays Kotlin is also an alternative, with strong typing and proper auto-complete; the only trade-off is a slightly slower first compile of the config file. Also, both can be debugged with a normal Java debugger.
There is some truth to this. But the speed has definitely improved.
Kotlin is statically typed. Also, the API has okayish documentation, people just like to copy-paste everything and wonder why it fails. You wouldn’t be able to fly a plane either from watching 3 sitcoms showing a cockpit, would you?
I've never seen this on any project that utilizes gradle. Every time, without fail, it's a multi-second startup of gradle that eventually invokes something that is actually fast: javac.
Only 7 years of dealing with Java projects, both simple and complex.
Kotlin being statically typed has literally nothing to do with a specification.
Because Gradle doesn't have a configuration. It has a Groovy/Kotlin module that grew organically, haphazardly, and without any long-term plan. So writing gradle configuration is akin to reverse-engineering what the authors intended to do with a particular piece of code or API.
"okayish" is a synonim for bad. You'd think that a 15-year project in its 8th major version would have excellent API documentation
So, a project with no specification to speak of and an "okayish" API, that is slow as molasses by default out of the gate [1], keeps blaming its users for its own failures, because somehow it's a plane and not a glorified, overengineered Makefile.
[1] A reminder, if you will, that gradle only managed to add configuration caching to speed up its startup time in version 8, after 15 years of development.
They are trying to make Gradle better by implementing a sort of "Maven on top of it": declarative build configuration.
But for the time being, Maven is way easier to configure and understand.
It could be, if the tool were a work of good design and engineering. But not when the tool's claim to fame is that it doesn't use XML to describe the build config like Maven/Ant do.
There are not many alternatives that are truly general build tools (not limited to single language/repository, etc), and properly parallelize and cache.
So, besides gradle and bazel, what else is there?
But the run time is slow, is that what you want to convey?
No, I would even go as far as to say that a naive, non-microbenchmark java program can often perform better than a naive, low-level language one.
...very relative
Don't need maven
Maven is just a build tool. It creates a single uberjar, which you run with the JDK directly - you don't ship the Maven project.
But the real meaning is that you're not allowed to use external libraries, rather than build-tool-related dependencies.
If you do that, you should also include programming time, and divide both by the number of runs the code will have over its lifetime. Also add an appropriate fraction of the time needed to learn to program.
In such a challenge, that likely would make a very naive version win.
Apart from being impractical, I think that would be against the idea behind this challenge.
Not if you give a realistic number for an actual database of this scale. Tens of hours of dev time isn't much after you divide it by a few million.
Why would you clean?
“I discard cache and it is so slow, aarghhh”