The One Billion Row Challenge

vessenes
36 replies
11h26m

Ooh fun, Advent of Code chaser!

A fair comparison between languages should include the make and build times. I haven't used Java / Maven for years, and I'm reminded why, heading into minute 2 of downloads for './mvnw clean verify'.

kaba0
30 replies
11h22m

Java build times are very fast. You are just measuring your internet speed here.

(Also, gradle is faster as a build tool for incremental compilation)

pkolaczk
16 replies
11h15m

Gradle is faster than what? Than maven? Maybe. But not than Go or Cargo.

kaba0
7 replies
10h10m

What step are we talking about? Javac itself is absolutely on the same order of magnitude speed as Go per LoC, while Rust is significantly slower (which makes sense; the latter is a properly optimizing compiler with heavy static analysis, while the former two just spew out Java bytecode/machine code).

Gradle with a daemon is also pretty fast; you are probably just used to some complex project with hundreds of dependencies and are comparing it to a Cargo file with a single dependency.

pkolaczk
2 replies
7h23m

Javac itself is absolutely on the same order of magnitude speed as Go per LoC

Not really when it has to run on a cold JVM. And before it warms up, it’s already done.

Then you are in the territory of keeping the compiler process alive between runs, but that is a memory hog and also not always reliable (gradle often decides to spin up a fresh daemon for whatever reason).

So theoretically, in lab conditions, yes; in practice, no.

while Rust is significantly slower

Everybody repeats that, but there is surprisingly little evidence in the form of benchmarks. I can see rustc compiles over 50k LoC per second on average on my M2, which is roughly the same order of magnitude as Java. And checking without building is faster.

blackoil
1 replies
5h16m

Javac doesn't run on the JVM. It is a C (C++?) application.

hyperpape
0 replies
56m

https://github.com/openjdk/jdk/tree/master/src/jdk.compiler/...,

Wikipedia citation for good measure: https://en.wikipedia.org/wiki/Javac.

Maybe you're thinking of Hotspot, which is written in C++.

umanwizard
1 replies
9h7m

Just a nitpick, but the static analysis in Rust doesn’t account for most of why compilation is slow, as proven by the fact that cargo check is much faster than cargo build.

kaba0
0 replies
7h23m

Yeah, I know. I believe it’s mostly the amount of LLVM IR that is the problem (e.g. each generic instantiation emits a new copy of roughly the same code by default), isn’t it?

troupo
1 replies
7h21m

Gradle with a daemon is also pretty fast

I've yet to see a project where the Gradle daemon a) does anything useful and b) is actually used by gradle itself (instead of seemingly doing everything from scratch; no idea what it does in the seconds it takes to start up).

pkolaczk
0 replies
7h13m

This. Gradle wastes tremendous amounts of time on non-compile tasks. So even if Java takes a few seconds to compile, the whole build is often much, much longer. Interestingly, my impression is that maven is significantly more snappy.

wokwokwok
5 replies
11h6m

Itself.

gradle is faster as a build tool for incremental compilation

Implicit:

…than it is building from scratch, where it needs to download lots of stuff

I mean, yes, saying Java builds are fast does seem a bit “rolls eyes, yes, technically by LoC when the compiler is actually running” …but, ^_^! let’s not start banging on about how great cargo/rust compile times are… they’re really terrible once procedural macros are used, or sys dependencies invoke some heinous C dependency build like autoconf, as lots of crates do… and there’s still a subpar incremental compilation story.

So, you know. Eh. Live and let live. Gradle isn’t that bad.

pkolaczk
4 replies
7h29m

So maybe I was just unlucky with gradle and lucky with cargo. A project that is a mixture of 20k LoC Scala and 300k LoC Java took 6 minutes to compile, and incremental was still several tens of seconds. Cargo/Rust cold compiles a project of 1M LoC (all dependencies) in about 1:30 on the same hardware and incremental is like 2-4 seconds.

As for procedural macros - yes, they can be slow to compile, but so are Java annotation processors.

kaba0
2 replies
7h9m

Scala is a very different case; it has one of the most advanced type systems, many implicit scoping rules, etc. Even a Java change that affects the Scala codebase could result in a slow incremental build. Also, the build system may not have been as modular as it could have been.

In my experience, Java annotations get processed very fast.

pkolaczk
1 replies
5h49m

it has one of the most advanced type systems

Well, that's a bit of a stretch. It is definitely one of the most complex type systems due to interactions between nominal sub-typing and type classes which is a bit hairy. But in terms of what this type system can express or prove, it's both inferior (cannot express lifetimes or exclusive access) and superior (HKTs, path dependent types) to Rust in some aspects. Let's say they are close.

kaba0
0 replies
3h46m

It is definitely one of the most complex type systems due to interactions between nominal sub-typing and type classes

That’s what I meant mostly. Though if I’m being honest, I’m not familiar with the implementation of either language’s type system.

lenkite
0 replies
3h42m

Scala is the problem here. Scala has several issues filed for slow compilation.

300K LOC Java should be done within ~1.5 minutes flat from zero. Can be even faster if you are running maven daemon for example. Or even within ~30 seconds if everything is within one module and javac is only invoked once.

wiseowise
0 replies
10h50m

That depends on the project and its scope.

troupo
0 replies
7h20m

I don't even think it's faster than Maven.

troupo
9 replies
9h12m

Gradle and fast should not be in the same sentence.

I have no idea what gradle and its daemon do except spending orders of magnitude more time starting up than running the actual build.

You're much better off running javac directly. Then it's fast.

kaba0
8 replies
7h28m

Then someone on your team failed to learn the bare minimum to write a sane gradle file, and is putting imperative stuff into the configuration phase.

For big projects, compiling only what’s necessary is a huge time win.

troupo
7 replies
7h23m

Then someone on your team failed to learn the bare minimum to write a sane gradle file

Then in over 15 years and 8 versions, Gradle itself failed to make a system that doesn't require "learning to write sane gradle files" just to be two orders of magnitude slower than it could be, rather than five.

and is putting imperative stuff into the configuration phase

Has nothing to do with the slowness that is gradle.

For big projects, compiling only what’s necessary is a huge time win.

It is, and no one is arguing with that.

kaba0
6 replies
7h7m

It is your own failure if you blame the tool instead of yourself for not having learnt it.

troupo
3 replies
5h35m

A tool that:

- in 15 years is still slow by default

- in 15 years could not acquire any sensible defaults that shouldn't need extra work to configure

- is written in an esoteric language with next to zero tools to debug and introspect

- randomly changes and moves things around every couple of years for no discernible reason, while making zero impact on reducing its slowness

- has no configuration specification to speak of because the config is written in that same esoteric programming language

And its zealots blame its failures on its users

kaba0
2 replies
3h34m

Build systems are a complex problem. Like, you can compare it to cargo/go’s build system etc., but those forget about 90% of the problem, and come up with something hard-coded for the happy-path Rust/Go project, with zero non-native dependencies, all downloaded from the package repository. This is not a bad thing, don’t get me wrong. But it is not enough to compile, say, a browser.

Meanwhile, there are only a couple, truly general build systems, all of them quite complex — as you can’t get away with less. Gradle and bazel are pretty much the only truly generic build systems (I’m sure there are a couple more, but not many). You can actually compile a mixed java, c++, whatever project with gradle, for example, just fine, with proper parallelism, caching, etc.

As for your points:

in 15 years is still slow by default

Absolutely false. A 3-4 line Gradle build file for Java will be fast, and will correctly parallelize and maximally utilize previously built artifacts.

in 15 years could not acquire any sensible defaults that shouldn't need extra work to configure

It’s literally 4 lines for a minimal project, maybe even less. I’m getting the feeling that you have zero idea what you’re talking about.

is written in an esoteric language with next to zero tools to debug and introspect

I somewhat agree, though nowadays Kotlin is also an alternative, with strong typing and proper auto-complete; the only trade-off is a slightly slower first compile of the config file. Also, both can be debugged with a normal Java debugger.

randomly changes and moves things around every couple of years for no discernible reason

There is some truth to this. But the speed has definitely improved.

has no configuration specification

Kotlin is statically typed. Also, the API has okayish documentation, people just like to copy-paste everything and wonder why it fails. You wouldn’t be able to fly a plane either from watching 3 sitcoms showing a cockpit, would you?

troupo
1 replies
2h51m

Absolutely false. A 3-4 line Gradle build file for Java will be fast, and will correctly parallelize and maximally utilize previously built artifacts.

I've never seen this on any project that utilizes gradle. Every time, without fail, it's a multi-second startup of gradle that eventually invokes something that is actually fast: javac.

I’m getting the feeling that you have zero idea what you’re talking about.

Only 7 years dealing with java projects, both simple and complex

> has no configuration specification

Kotlin is statically typed.

Kotlin being statically typed has literally nothing to do with a specification.

Because Gradle doesn't have a configuration format. It has a Groovy/Kotlin module that grew organically, haphazardly, and without any long-term plan. So writing Gradle configuration is akin to reverse-engineering what the authors intended to do with a particular piece of code or API.

Also, the API has okayish documentation

"okayish" is a synonim for bad. You'd think that a 15-year project in its 8th major version would have excellent API documentation

people just like to copy-paste everything and wonder why it fails. You wouldn’t be able to fly a plane either from watching 3 sitcoms

So, a project with no specification to speak of, with "okayish" API that is slow as molasses by default out of the gate [1] keeps blaming its users for its own failure because somehow it's a plane, and not a glorified overengineered Makefile.

[1] A reminder, if you will, that gradle managed to come up with caching configs to speed up its startup time only in version 8, after 15 years of development.

krzyk
0 replies
1h16m

They are trying to make Gradle better by implementing a "Maven over it": a declarative build configuration.

But for the time being, Maven is way easier to configure and understand.

geodel
1 replies
3h30m

It could be, if the tool were a work of good design and engineering. But not when the tool's claim to fame is that it doesn't use XML to describe the build config like Maven/Ant do.

kaba0
0 replies
2h59m

There are not many alternatives that are truly general build tools (not limited to single language/repository, etc), and properly parallelize and cache.

So, besides gradle and bazel, what else is there?

sixlelyt321
1 replies
10h25m

But run time is slow, is that what you want to convey?

kaba0
0 replies
10h8m

No, I would even go as far as to say that a naive, non-microbenchmark java program can often perform better than a naive, low-level language one.

occams_chainsaw
0 replies
11h19m

...very relative

mrkeen
1 replies
11h17m

Don't need maven

No external dependencies may be used

chii
0 replies
10h59m

maven is just a build tool. It creates a single uberjar which you run with the JDK directly - you don't ship the maven project.

But the real meaning is that you're not allowed to use external libraries, rather than build tool related dependency.

Someone
1 replies
8h57m

A fair comparison between languages should include the make and build times

if you do that, you should also include programming time, and divide both by the number of runs the code will have over its lifetime. Also add an appropriate fraction of the time needed to learn to program.

In such a challenge, that likely would make a very naive version win.

Apart from being impractical, I think that would be against the idea behind this challenge.

Dylan16807
0 replies
7h27m

In such a challenge, that likely would make a very naive version win.

Not if you give a realistic number for an actual database of this scale. Tens of hours of dev time isn't much after you divide it by a few million.

wiseowise
0 replies
10h52m

Why would you clean?

“I discard cache and it is so slow, aarghhh”

londons_explore
28 replies
4h22m

I believe the whole thing can be done in 0.3 seconds with the following approach:

(Describing only the 'happy path' here - other paths can be made fast too, but will require different implementations)

* Since temperatures are only given to one decimal place, we have a finite number of temperatures; ~400 values will cover all common cases.

* We also have a finite number of place names. (~400)

* Just make a lookup table of all temps and all place names (~400 × ~400 = ~160,000 entries).

* autogenerate a state machine that will map each of the above 160,000 things, at any rotation within a 4 byte register, to a unique bin in a hash table. The state machine will have one 32 bit state register (16 bits to output, 16 to carry over to the next cycle) where every cycle a lookup in the state transition table is done, and the next 4 bytes of data XOR'ed on top. (A scalar sketch of this loop follows the list.)

* run through all the data, at RAM speed, incrementing counters for each state the machine ends up in. (there will only be 65k). These counters fit fully in cache.

* With just 32 bits of state, with AVX512 we can be running 512 copies of this in parallel if we like, per core! AKA, compute will not be the bottleneck.

* from the values of the counters, you can calculate the answers. 65k is a far smaller number than 1 billion, so you don't need to do this bit fast.

* for anything that doesn't map to a valid bin (higher/lower temperatures, unknown place names), just fallback to slow code (one state of the state machine can be reserved for 'escape to slow code'. Use these escapes for min/max handling too, since it will only happen a few thousand times).

* I think this approach can operate at RAM speed with just one core with AVX512, so no benefit in splitting across cores.
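
A scalar Java sketch of one possible reading of that inner loop. The `transition` table (indexed by the 16 carried bits, generated at startup from the observed station names) and the exact bit layout are my assumptions, not the author's spec, and the table-sizing question raised downthread applies:

    // Hedged sketch: low 16 bits of `state` carry over and index the
    // table; high 16 bits select the counter bin. `transition` and
    // `counters` are assumed to be built elsewhere (not shown).
    static void scan(byte[] data, int[] transition, int[] counters) {
        int state = 0;
        for (int i = 0; i + 4 <= data.length; i += 4) {
            int next4 = (data[i] & 0xFF)
                    | (data[i + 1] & 0xFF) << 8
                    | (data[i + 2] & 0xFF) << 16
                    | (data[i + 3] & 0xFF) << 24;
            // one table lookup per cycle, next 4 bytes XOR'ed on top
            state = transition[state & 0xFFFF] ^ next4;
            counters[state >>> 16]++;
        }
    }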

toth
9 replies
2h45m

You don't need the lookup table. All you are asked for is min/mean/max, which can all be computed in one pass without storing the data. All you need is a hash table with 400 entries, each holding 3 floats (running min, mean and max) and an int (count, for updating the running mean). That's just 16 bytes; with 16 bytes for the station name, you can fit everything under 16K.

IO will dominate the running time for this, and JSON parsing will be second.
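
A minimal sketch of that per-station accumulator in Java, using integer tenths of a degree (as suggested elsewhere in the thread) so a billion rows can't overflow a long; parsing and the hash-table wiring are omitted:

    // Hedged sketch: one of these per station. The sum of tenths fits
    // easily in a long: 1e9 rows * 999 max ~ 1e12, far below Long.MAX_VALUE.
    final class Stats {
        int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
        long sum = 0, count = 0;

        void add(int tenths) {
            if (tenths < min) min = tenths;
            if (tenths > max) max = tenths;
            sum += tenths;
            count++;
        }

        double mean() { return sum / 10.0 / count; }
    }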

refulgentis
4 replies
2h26m

I don't have a CS background and when I eventually had to do interviews for Google, "Calculate mean/median/mode of temperatures" was the interview question I went with, intending to avoid BS leetcode*

I always worried it was too easy, but I'm heartened by how many comments miss the insight I always looked for and you named: you don't need to store a single thing.

I do wonder if it'll work here, at least as simply as the tracking vars you mention, with so many rows, and the implicit spirit of "you should be able to handle _all_ data", overflow might become a legitimate concern - ex. we might not be able to track mean as simply as maintaining sum + count vars.

* "oh sorry times up, but the right answer is a red black self balancing terenary tree recursive breadth-first search", ironically, was one of my Apple interviews.

CamperBob2
1 replies
1h5m

Streaming calculation of the exact median with no storage at all is non-trivial at best in the general case, and I'm not aware of any way to calculate the mode at all. Any pointers to the methods you used in your interview answers?

If you came up with them on the fly, then... well, sign here for your bonus. Can you start Monday?

refulgentis
0 replies
12m

I am silly and wrote 'mode' and shouldn't have :P (wetware error: saw list of 3 items corresponding to leetcode and temperature dataset, my 3 were min/max/average, their 3 are mean/median/mode)

toth
0 replies
1h8m

In general, median and mode are much harder than min/avg/max. You can't compute the former with constant memory in one pass (you can do approximate median, but not exact median).

(Here there is a restricted range for the temperature, with only 1999 possible values (-99.9 to 99.9 in 0.1 increments), so you could do it in constant memory; you'd need something like 4*1999 bytes per unique place name.)

For the sum, overflow is not an issue if you use 64-bit integers. Parse everything to integers in tenths of a degree, and even if all 1 billion rows are a 99.9 temperature for the same place name (worst possible case), you are very far from overflowing.
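
A minimal sketch of that constant-memory exact median, assuming the stated range (tenths of a degree in [-99.9, 99.9], i.e. 1999 buckets):

    // Hedged sketch: exact median from a fixed histogram, where hist[i]
    // counts observations of (i - 999) tenths of a degree.
    static double median(int[] hist) { // hist.length == 1999
        long total = 0;
        for (int c : hist) total += c;
        long target = (total + 1) / 2; // lower median when count is even
        long seen = 0;
        for (int i = 0; i < hist.length; i++) {
            seen += hist[i];
            if (seen >= target) return (i - 999) / 10.0;
        }
        throw new IllegalStateException("empty histogram");
    }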

londons_explore
0 replies
2h5m

Maintaining sum+count is more expensive than you'd imagine... Because to maintain the sum, you need to convert the numbers from ascii to decimal. And to do that you at least need to know the memory alignment of the number and the position of the decimal point.

All far more expensive than the state machine approach which is pretty much 'alignment doesn't matter, ascii doesn't matter, we're gonna just handle all possible states that could end up in a 32 bit register').

londons_explore
1 replies
2h9m

Merely finding the start/end of each line will use more computation (in the inner loop) than the approach I outlined. Let alone converting the number from ASCII to a float, or looking up the place name in a hash table (oh, and the name is variable length, so you're going to need to find how long the name is first).

toth
0 replies
46m

I deleted two previous comments because I realized I misunderstood your proposal. I understand it better now, but I am still confused about something...

Your state machine would need at least 2*160,000 states (you need an extra bit to flag whether you have reached a newline in the last word and need to increment a counter or not), correct? And you are assuming the input is 4 bytes, so won't your transition table need (2^32)*2*160,000 ~ 10^15 entries (at least 3 bytes each)?

usefulcat
0 replies
2h27m

IO will dominate the running time for this, and JSON parsing will be second.

Memory bandwidth might dominate, but probably not I/O. The input file is ~12GB, the machine being used for the test has 32GB, and the fastest and slowest of five runs are discarded. The slowest run will usually be the first run (if the file is not already cached in memory), after which there should be little or no file I/O.

toth
0 replies
2h38m

s/JSON/string/

jodrellblank
3 replies
3h52m

Where does your ~160,000 come from? There's a billion rows, couldn't there be a different place on every row giving a billion place names?

londons_explore
1 replies
3h48m

There are only ~400 place names in the generated data. 160,000 = 400*400

The state machine generation would need to know all 400 - but it's easy enough to scan the first few hundred thousand rows to find them all, and have a fallback in case a name is seen that has never been seen before. (The fallback would be done by having the state machine jump to an 'invalid' state; at the end you check that that invalid state's counter is zero, and if it isn't, you redo everything with slow code.)

tzs
0 replies
2h21m

There are only ~400 place names in the generated data

The README says the maximum number of unique place names is 10 000, so you should probably design for the fast case to handle that many instead of just going by what is in the database that it happens to be measuring with at the moment.

londons_explore
0 replies
3h44m

This code is only fast in the 'happy' case (ie. ~400 unique places, with no odd distributions, temperatures with a std dev of 10 from a mean between 0 and 40). It is still correct in the 'unusual' cases, but would be slow because it would revert to fallback code.

floobertoober
2 replies
3h34m

I would ask for clarification about the rules - it isn't clear to me whether code that is special cased for the known 400 place names (even if it supports additional names via a slower path) would be considered valid:

Q: Can I make assumptions on the names of the weather stations showing up in the data set?

A: No, while only a fixed set of station names is used by the data set generator, any solution should work with arbitrary UTF-8 station names (for the sake of simplicity, names are guaranteed to contain no ; character).

londons_explore
1 replies
3h28m

The code wouldn't be special cased... it would run through the first 10,000 or so rows to find out the dataset to generate the state transition table. If any invalid entry is found later, you'll end up in an invalid state, and can detect that case and fall back to slow code (or even theoretically update the state transition table and return to the fast path).

btilly
0 replies
2h29m

Even simpler. Just have a Dictionary mapping place name to ID. The slow path is "I haven't seen this name before and I need to store it, making sure that I'm not racing any other thread doing the same."

You'd have an upper limit on how many place names you can have. But no hard coding of what they are.

dist1ll
2 replies
3h24m

I think this approach can operate at RAM speed with just one core with AVX512, so no benefit in splitting across cores.

A single core can't saturate RAM bandwidth. Cores are limited by memory parallelism and latency. Most modern x86 server chips can retire 2 SIMD loads per clock cycle (i.e. 32GB/s @ 1GHz using AVX2), so AVX-512 is not necessary to max out per-core bandwidth - but if you're loading from DRAM you'll likely cap out much, much earlier (closer to 10-16GB/s on commodity servers).

As long as the majority of your data spills into RAM, you'll see a massive throughput cliff on a single core. With large streaming use-cases, you'll almost always benefit from multi-core parallelism.

It's very easy to test: allocate a large block of memory (ideally orders of magnitude larger than your L3 cache), pre-fault it, so you're not stalling on page faults, and do an unrolled vector load (AVX2/AVX-512) in a tight loop.
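
Plain Java can't issue explicit vector loads, but a crude version of that probe (stream over a buffer much larger than L3 and time it) looks like the sketch below; expect it to undershoot what hand-written AVX code measures:

    // Hedged sketch: rough single-thread read-bandwidth probe. The fill
    // pre-faults the pages, as suggested above.
    public final class BandwidthProbe {
        public static void main(String[] args) {
            long[] buf = new long[1 << 27]; // 1 GiB, far larger than L3
            java.util.Arrays.fill(buf, 1L);
            long t0 = System.nanoTime();
            long sum = 0;
            for (long x : buf) sum += x;
            double secs = (System.nanoTime() - t0) / 1e9;
            System.out.printf("%.2f GB/s (checksum %d)%n",
                    buf.length * 8L / secs / 1e9, sum);
        }
    }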

menaerus
0 replies
1h17m

A single core can't saturate RAM bandwidth.

I was also intrigued by the same statement and was wondering about it as well. For AMD I don't know, but on Intel I believe each L1 cache miss is routed through the line-fill buffers (LFB) before going to L2/L3. Since the LFB has only 10-12 entries, that practically becomes the limit on the number of (load) memory operations each core can have in flight at a time.

Most modern x86 server chips can retire 2 SIMD loads per clock cycle (i.e. 32GB/s @ 1GHz using AVX2),

I think in general yes, but it appears that the load units on Alder Lake and Sapphire Rapids have 3 SIMD load ports, so they can execute up to 3 SIMD AVX2 mm256_load_si256 loads per clock cycle.

londons_explore
0 replies
3h11m

The goal is that none of the intermediate data spills - the only user of RAM is reading the input data.

But for that, you're right, the workload must be split across as many cores as necessary to keep DRAM maxed out or you're throwing away performance. Luckily this problem is easily split.

Jaxan
2 replies
3h56m

Go for it!

londons_explore
1 replies
3h31m

You nerd sniper you!

But more seriously, the JVM's support for vector intrinsics is very basic right now, and I think I'd spend far more time battling the JVM to output the code I want it to output than is fun. Java just isn't the right language if you need to superoptimize stuff.

Theoretically all of the above is super simple SIMD stuff, but I have a suspicion that SIMD scatter/gather (needed for the state lookup table) isn't implemented.

switchbak
0 replies
25m

If you really want to nerd snipe, write an optimized version in a non-JVM language to compare to the fastest Java one. But that's kind of boring anyway, we've seen this play out a thousand times.

I still appreciate the intention of seeing how far we can squeeze the performance of modern Java+JVM, though. Too bad very few folks have the freedom to apply that at their day jobs.

gpderetta
1 replies
2h30m

autogenerate a state machine that will map each of the above 160,000 things, at any rotation within a 4 byte register, to a unique bin in a hash table. The state machine will have one 32 bit state register (16 bits to output, 16 to carry over to the next cycle) where every cycle a lookup in the state transition table is done, and the next 4 bytes of data XOR'ed on top.

You are basically describing a perfect hash, right?

londons_explore
0 replies
1h57m

nearly but not quite - you can deliberately 'collide' when there is no data you need to keep going forward (the obvious example being when you end on a '\r' character). This is necessary to keep an upper bound on the state space.

Remember that in the extreme, a state could mean "city A, temp zero, city b" - "0\rB" so it isn't a direct mapping from state machine state to city.

dmoy
1 replies
3h31m

run through all the data, at RAM speed,

Is the data given in RAM, or do you have to wait for spinning-rust I/O for some GB amount of data?

londons_explore
0 replies
3h27m

The test procedure is to run the code 5 times. Even if it isn't in cache the first time, it will be in the OS cache (ie. RAM) for the 2nd-5th runs. The test machine has 32GB RAM, while the data file is only 12GB, so it ought to be fully cached.

spullara
0 replies
2h9m

You have to read and parse the entire file to find the place names.

mgaunard
26 replies
8h52m

Isn't this simply bound by the speed of the disk? Surely none of the suggested optimizations (SIMD, multi-threading) are relevant.

It would also depend on how many different stations there are and what they are, for the hash lookup, but seriously, I doubt this will be anything measurable compared to I/O.

rockwotj
16 replies
8h42m

A few things:

Disk access can certainly be parallelized, and NVMe is blazing fast, so the bottleneck is more the CPU than the disk. There are systems built around modern hardware that realize this (redpanda.com, where I work, is one such example)

Parsing is a lot of the compute time and some SIMD tricks like SWAR for finding delimiters can be helpful. Stringzilla is a cool library if you want to see a clean implementation of these algorithms: https://github.com/ashvardanian/StringZilla

See my reply here about the file being cached completely in memory after the first run: https://news.ycombinator.com/item?id=38864034

londons_explore
15 replies
7h55m

I would hope that any reasonably performant implementation would be faster not only than NVMe, but also faster than CPU to RAM data transfers.

The AMD EPYC-Milan in the test server supports memory reads at 150 gigabytes/sec, but that's a 32 core machine, and our test only gets 8 of those cores, so we probably can't expect more than 37 gigabytes per second of read bandwidth.

The total file is ~12 gigabytes, so we should expect to be able to get the task done in 0.3 seconds or so.

There are only ~400 weather stations, so all the results fit in cache. It's unclear whether it might be faster to avoid converting ASCII to a float - it might be faster to add up all the ASCII numbers and figure out how to convert to a float only when you have the total.

menaerus
8 replies
5h49m

The total file is ~12 gigabytes, so we should expect to be able to get the task done in 0.3 seconds or so.

How? EPYC-Milan features PCIe 4.0 and theoretical maximum throughput of x4 sequential read is ~8GB/s.

In bare-metal machines, this figure is rather closer to 5-6GB/s, and in cloud environments such as CCX33, I imagine it could be even less.

So, unless I am missing something you'd need ~2 seconds only to read the contents of the ~12GB file.

londons_explore
7 replies
4h59m

This assumes the file is already in page cache, and therefore already in RAM. The 0.3 seconds is the time to get it from RAM to the CPU L3 cache.

mgaunard
5 replies
4h16m

That sounds like cheating. The exercise says to process a large file that's on disk. If the data is in RAM then that's a whole different game.

rockwotj
3 replies
4h1m

The benchmark is run 5 times without flushing file system caches. If you mmap it, it will still be in memory between process runs. The author intended this FWIW

PaulDavisThe1st
1 replies
2h44m

If the cache is not flushed, it will still be in memory between runs whether you mmap it or not.

marginalia_nu
0 replies
1h10m

That depends a lot on the OS. I would not count on a 13 GB file being cached in RAM after having been read through once.

mgaunard
0 replies
1h47m

That sounds like a benchmarking flaw.

But then, writing good benchmarks is an art in itself.

brrrrrm
0 replies
4h2m

Isn’t it run multiple times and the slowest is dropped? Why would you expect the file to be evicted from ram so quickly?

menaerus
0 replies
3h55m

That makes sense now, thanks.

vintermann
2 replies
7h28m

Are you sure that would give the same result? The first thing that struck me about this challenge is that there's going to be floating point precision issues.

londons_explore
1 replies
6h45m

All the data is only to one decimal place. You can shift the decimal place and do the whole thing with ints and get a 100% precise result.

vintermann
0 replies
6h31m

You can, but will it be the same result? :) binary floating point can't represent 0.1 or 0.2 exactly.

To be fair, with only additions, and only a billion of them, you're probably fine working in floating point and rounding to one decimal in the end.

codeflo
1 replies
6h14m

any reasonably performant implementation would be faster not only than NVMe

Saturating NVMe bandwidth is wicked hard if you don't know what you're doing. Most people think they're I/O bound because profiling hinted at some read() function somewhere, when in fact they're CPU bound in a runtime layer they don't understand.

ykonstant
0 replies
43m

Can you elaborate on the runtime layers and how to affect them?

anonymoushn
0 replies
5h44m

It looks like all existing solutions are too slow by a huge factor then :)

lessbergstein
2 replies
6h17m

Yes, this is an extremely trivial problem. Anybody who knows how to program in more than one language is going to find this silly. awk or perl would finish it before jit compilation gets started.

viraptor
0 replies
6h9m

It's a trivial problem to implement, but not trivial to implement faster than others. Awk won't beat solutions that are clever about multiple read queues and parallel processing.

Jaxan
0 replies
3h14m

Please try it out in awk. It would be interesting to see whether awk can do this in a few seconds.

D4ckard
1 replies
8h7m

Daniel Lemire has an interesting talk on this: https://www.youtube.com/watch?v=wlvKAT7SZIQ. The key takeaway is that the disk is rarely the bottleneck.

codeflo
0 replies
6h11m

It's an old bit of wisdom that's hard to stamp out. That's partly because anything that you don't understand below you in the runtime stack looks like "I/O" during profiling, so the misunderstanding perpetuates even among people who attempt to take a closer look, but don't know enough.

usefulcat
0 replies
2h20m

Isn't this simply bound by the speed of the disk?

Depends on the OS and filesystem being used. The input file is ~12GB, and it's being run 5 times on a machine with 32GB, so after the first run the file could be cached entirely in memory. For example, Linux using ext2 would probably cache the entire file after the first run, but using ZFS it probably would not.

throwaway2037
0 replies
3h2m

I agree. From here: https://news.ycombinator.com/item?id=38865942

<< Just clarified this in the README: * Input value ranges are as follows:

- Station name: non null UTF-8 string of min length 1 character and max length 100 characters

- Temperature value: non null double between -99.9 (inclusive) and 99.9 (inclusive), always with one fractional digit >>

If we assume worst-case entries: <100 char station name>;-99.9

That is 106 chars plus newline (1 or 2 chars). That could be 108 GB of data! And, I want to be picky here: the README says "UTF-8" and "max length 100 characters". Eh, we could start a holy war arguing about what a character means in UTF-8. Let's be generous and assume it means one (8-bit) byte (like C). I have good faith in the original author!

Some Googling tells me that the current fastest PCIe Gen5 NVMe drive is Crucial T700 w/ Sequential read (max., MB/s) @ 12,400. That is 8.7 seconds to read 108 GB of data. A few other comments mentioned that the sample file is only 12 GB, so that is 1 second, and 100% can be cached by Linux kernel for 2nd and later reads. However, if the test file is greater than 32 GB, you are definitely screwed for file system caching. You will have to pay a multi-second "I/O tax" to read the file each time. This seems to far outweigh any computation time.

Thoughts?

Last, I still think this is a great challenge. I think I just found my next interview question. This question is easy to understand and program a basic implementation. Further discussion can be fractally complex about the myriad of optimizations that could be applied.

No trolling: Can you imagine if someone tries this exercise in Python? There are a goofy amount of optimizations to deal with huge data in that language thanks to SciPy, Pandas and other friends. I am sure someone can write an insanely performant solution. (I know, I know: The original challenge says no external dependencies.)

mikewarot
0 replies
3h17m

Clearly the fastest way to parse this is to load it all into RAM, and work backwards from the end. That has the advantage of putting the digits in ascending order, and then the delimiter, then the string... until EOF or NL is encountered.

NavinF
0 replies
8h19m

Totally depends on your workload and hardware. I have a normal consumer SSD in my desktop and it can easily sustain 7GB/s (56gbps) as long as I only use 700GB of the 2TB available. A normal server has enough PCIe lanes for 15 SSDs like mine so a normal server's IO bandwidth is comparable to its memory bandwidth. More expensive servers have faster lanes (PCIe 5.0) and more of those lanes.

Anyway the OP's file only has 1B rows so it'll be ~1GB compressed which fits in memory after the first discarded run. IO bandwidth shouldn't matter in this scenario:

All submissions will be evaluated by running the program on a Hetzner Cloud CCX33 instance (8 dedicated vCPU, 32 GB RAM). The time program is used for measuring execution times, i.e. end-to-end times are measured. Each contender will be run five times in a row. The slowest and the fastest runs are discarded

Edit: The github repo says it's 12 GB uncompressed. That confirms that IO bandwidth doesn't matter

worthless-trash
25 replies
12h35m

Interesting challenge, shame it's only Java. Can't wait till people start hand-rolling their own JVM bytecode.

kriskrunch
20 replies
12h24m

Check out the discussion[0], looks like there are submissions in several languages. Go, Rust, Python, and C++, to name a few

[0] https://github.com/gunnarmorling/1brc/discussions

Animats
19 replies
12h13m

It looks like the problem is dominated by reading in the data file. Some fast solutions just read the whole file into memory.

fny
15 replies
12h9m

Rather than read the file into memory, memory mapping can be used.

pclmulqdq
12 replies
11h43m

You wouldn't want to do this for a huge file. A very fast solution would use a small number of buffers and io_uring (or equivalent), keeping the page table and cache footprint small.

rockwotj
7 replies
11h21m

Yeah, so I had a discussion on Twitter about this: it turns out 12GB is small enough to fit into memory, and the author evaluates submissions by running a solution 5 times in a row, so using direct IO actually hurts, because the kernel cache is what keeps the file in memory for the 4 runs after. I have a direct IO solution with SIMD string search and double parsing, just in C++ (using libraries). It runs in 6 seconds on my 24-core Linux box (NVMe).

Code: https://github.com/rockwotj/1brc

Discussion on Filesystem cache: https://x.com/rockwotj/status/1742168024776430041?s=20

lifthrasiir
3 replies
11h9m

double parsing

In case you haven't noticed yet, the input format guarantees exactly one fractional digit, so you can read a single signed integer followed by `.` and one digit instead.
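
A minimal sketch of exploiting that guarantee: parse straight to integer tenths, with no general double parser (the byte-slice framing is an assumption):

    // Hedged sketch: parses e.g. "-12.3" (bytes b[pos..end)) to -123,
    // relying on the guaranteed single fractional digit.
    static int parseTenths(byte[] b, int pos, int end) {
        boolean neg = b[pos] == '-';
        if (neg) pos++;
        int v = 0;
        for (; pos < end; pos++) {
            if (b[pos] != '.') v = v * 10 + (b[pos] - '0');
        }
        return neg ? -v : v;
    }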

fragmede
1 replies
8h5m

could you just add the character values, e.g. 49 for ASCII '1', and then subtract off the offset once at the end, instead of doing atoi on each line?

edit: doh, that works for min and max but the average overflows.

gpvos
0 replies
7h35m

Yes. I'm not sure it'll help, but definitely worth a try.

rockwotj
0 replies
10h29m

Yeah, I missed this originally; stuff could be faster with this assumption, without a full double parser. The fastest Java solution does some nearly branchless decoding for these.

winrid
0 replies
9h21m

Wow, that's pretty fast considering how simple main.cc looks. I do love c++. Nice use of coroutines, too.

pclmulqdq
0 replies
11h15m

I missed the "5 times in a row." If you do that, yeah, keeping the whole thing in memory is far better.

alain94040
0 replies
45m

So you are basically at the mercy of the OS caching algorithm. That sounds like a bad plan for a benchmark. You are not measuring what you think you are (your code), you are measuring the OS caching policy.

xoranth
1 replies
9h25m

Dumb question. With io_uring, how do you handle lines that straddle between chunks? I'm asking since, AFAIU, the submitted requests are not guaranteed to be completed in order.

(The easiest I can think of is submitting reads for "overlapped" chunks, I'm not sure there is an easier way and I'm not sure of how much performance overhead there is to it.)

moonchild
0 replies
9h19m

You have to handle it manually. Remember the partial lines at the beginning/end of your chunks and merge them when their mates become available.
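
A minimal Java sketch of that merge for in-order chunks (out-of-order completions additionally need the chunk index to pair head and tail fragments); processLines() is a hypothetical stand-in for the per-line work:

    // Hedged sketch: keep the trailing partial line of each chunk and
    // glue it onto the front of the next one.
    static byte[] carry = new byte[0];

    static void onChunk(byte[] chunk, int len) {
        int lastNl = len - 1;
        while (lastNl >= 0 && chunk[lastNl] != '\n') lastNl--;
        if (lastNl < 0) {            // no newline at all: keep buffering
            carry = concat(carry, chunk, len);
            return;
        }
        processLines(concat(carry, chunk, lastNl + 1)); // complete lines only
        carry = java.util.Arrays.copyOfRange(chunk, lastNl + 1, len);
    }

    static byte[] concat(byte[] a, byte[] b, int bLen) {
        byte[] r = java.util.Arrays.copyOf(a, a.length + bLen);
        System.arraycopy(b, 0, r, a.length, bLen);
        return r;
    }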

charcircuit
1 replies
11h16m

What is the downside of memory mapping in this scenario? Shouldn't the page table properly handle the case of doing a single sequential read over a range of pages? Accessing the contents of a file doesn't seem like something caching would matter for. Do you mean that reading of sequential pages will keep adding to the cache compared to reading from a single page? That seems like a similar thing as before where they will be the first things gone from the cache anyways.

eru
0 replies
10h9m

Shouldn't the page table properly handle the case of doing a single sequential read over a range of pages?

That's what I used to think, too. But the kernel ain't that smart.

eru
0 replies
10h13m

Memory mapping (at least on Linux) isn't actually faster than reading the file manually. Especially if you use appropriately sized buffers.

(Of course, the five times in a row might mess with that.)

capitol_
0 replies
11h47m

But would that really be faster when you need to read every byte of a file?

I thought memory mapping solved a different problem.

somat
2 replies
10h8m

It would be a more interesting challenge if the data file were larger than the memory. I would love to see what people would come up with on some baby vm with 512 mb of ram.

Even more interesting would be small ram, little local storage and a large file only available via network, I would like to see something other than http but realistically it would be http.

throwaway167
0 replies
7h22m

Sixteen thousand Excel 97 spreadsheets?

anonymoushn
0 replies
5h36m

For this problem, changing the file size to not fit in RAM doesn't really make the optimal solutions more interesting

userbinator
3 replies
11h49m

Alternatively, "must be written in Java" can be interpreted to mean "must use the JVM to begin execution", and you can clearly spawn another process from Java...

yardstick
1 replies
11h35m

Another rule was no external dependencies

I guess you could from Java itself write a new binary and then run that binary, but it would be against the spirit of the challenge.

sweetjuly
0 replies
10h32m

There's sun.misc.Unsafe which is a builtin that gives you raw read/write. Turn that into ROP, mmap your native program, and now you're full native :)

nneonneo
0 replies
11h7m

Just use JNA to allocate an RWX buffer with mmap, unpack a bit of clever code into the buffer, and forge a native function pointer to call it :)

planb
14 replies
10h0m

As far as I see the currently best performing solution [0] does not account for hash collisions and therefore probably generates wrong results if enough different cities are in the dataset. Or am I missing something?

[0] https://github.com/gunnarmorling/1brc/blob/main/src/main/jav...

gunnarmorling
13 replies
9h47m

Yes, you are right. This came up yesterday and indeed two solutions were violating the "must work for all station names" rule by relying on specific hash functions optimized for the specific data set, which I unfortunately missed during evaluation. I've just removed these entries from the leaderboard for the time being. Both authors are reworking their submissions and then they'll be added back.

[0] https://twitter.com/mtopolnik/status/1742652716919251052

londons_explore
5 replies
7h45m

The next obvious solution is to have something that works fast for non-colliding station names, but then falls back to another (slow) implementation for colliding station names.

It's very cheap to detect station name collisions - just sample ~10,000 points in the file and hope to find at least one of each station. If you find fewer stations than a full run would, you haven't yet seen them all; keep hunting. If you find two that collide, fall back to the slow method. Checking 0.001% of the dataset is cheap.

anonymoushn
2 replies
5h46m

It seems like you have to run the slow collision-resistant thing over the whole file.

londons_explore
1 replies
5h1m

Collision detection is far cheaper. One method:

Simply add up all the bytes of all seen placenames while running your fast algorithm. This is effectively a checksum of all bytes of placename data.

Then at the end, calculate the 'correct' name-sum (which can be done cheaply). If it doesn't match, a collision occurred.

anonymoushn
0 replies
4h53m

You can design inputs that defeat this scheme though. I thought the point was to be correct for all valid inputs.

posix86
1 replies
6h26m

But how would you detect that a station name is colliding or not colliding? With a hash set?

viraptor
0 replies
6h6m

Search for cuckoo hashing. It's a whole thing for data structures.

rkangel
4 replies
5h25m

Have you considered having two datasets? The "development" dataset which is public, and the "competition" dataset which is private?

That might help against algorithms that are (accidentally, or on purpose) tuned to the specific dataset.

paxys
3 replies
3h37m

Yeah if you are releasing the competition dataset then it can and will 100% be gamed. What is to stop me from just hardcoding and printing the result without any computation?

refulgentis
2 replies
2h29m

Your code is submitted (per parent comment)

paxys
1 replies
1h41m

Sure, but who is manually checking every entry?

946789987649
0 replies
1h14m

You don't need to check every entry, just the top few.

thomaswue
1 replies
8h24m

What are the constraints on station names - i.e. min length, max length, maximum number of different names?

gunnarmorling
0 replies
6h27m

Just clarified this in the README:

* Input value ranges are as follows:

- Station name: non null UTF-8 string of min length 1 character and max length 100 characters

- Temperature value: non null double between -99.9 (inclusive) and 99.9 (inclusive), always with one fractional digit

* Implementations must not rely on specifics of a given data set, e.g. any valid station name as per the constraints above and any data distribution (number of measurements per station) must be supported

lifthrasiir
13 replies
12h23m

Q: Can I make assumptions on the names of the weather stations showing up in the data set?

A: No, while only a fixed set of station names is used by the data set generator, any solution should work with arbitrary UTF-8 station names (for the sake of simplicity, names are guaranteed to contain no `;` character).

I'm unsure if it's intentional or not, but this essentially means that the submission should be correct for all inputs, but can and probably should be tuned for the particular input regenerated by `create_measurements.sh`. I can imagine submissions with a perfect hash function tuned for the given set of stations, for example.

Hugsun
7 replies
11h59m

Given this requirement, they would be wise to have the test data be different from the example data. That would prevent overfitting optimizations.

josephg
6 replies
11h44m

I'm not sure that it is overfitting to optimize your code for the test set. The requirement is just that you don't break compatibility for arbitrary names in the process.

This sort of thing shows up all the time in the real world - where, for example, 90% of traffic will hit one endpoint. Or 90% of a database is one specific table. Discovering and microoptimizing for the common case is an important skill.

lifthrasiir
2 replies
11h39m

That's why I was so unsure. I think though, that it is generally better to have the input generator seeded and do not disclose the chosen seed in advance, which prevents overfitting to a single input and yet encourages overfitting to a much bigger class of inputs.

rocqua
1 replies
9h9m

That invites probing submissions to figure out a good hash for all names from that one seed.

I think having the names for performance tests public is fine. But there should be correctness tests on sets with random names.

lifthrasiir
0 replies
7h48m

Of course, I assume names are also randomly generated. But it might be the case that they are all ASCII-only, which can be exploited.

gunnarmorling
2 replies
9h39m

There should be no optimizations which rely on a-priori knowledge of the dataset. I.e., if you calculate a perfect hash function at runtime by inspecting the dataset, that's fine (not sure whether it's beneficial), but hard-coding it is not.

johndough
1 replies
9h12m

That is a tricky rule to enforce. For example, some solutions assume that the temperature values fit into an int, which could be interpreted as relying on a priori knowledge.

Hugsun
0 replies
4h11m

That's true. Some assumptions always have to be made.

In this case, we can assume place names that can be arbitrary strings and temperatures that occur naturally on earth. Optimizing around those constraints should be fine. We could probably limit the string length to something like 100 characters or so.

Bad assumptions would for example be to assume that all possible place names are found within the example data.

londons_explore
4 replies
7h13m

UTF-8 makes this far harder... But if you stick to the letter but not spirit of the rules, you can simply fall back to a slow implementation if you ever detect any byte > 127 (indicating a multibyte UTF-8 character).
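
Detecting that escape hatch is cheap in Java, where bytes are signed: any UTF-8 lead or continuation byte (>= 0x80) shows up as negative. A trivial sketch:

    // Hedged sketch: true iff the name is pure ASCII; otherwise take
    // the slow-but-correct path.
    static boolean isAscii(byte[] name, int from, int to) {
        for (int i = from; i < to; i++) {
            if (name[i] < 0) return false; // byte >= 0x80
        }
        return true;
    }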

jerf
3 replies
6h39m

Only if you validate the UTF-8 as being valid. If you just accept that it is you can treat it as just some bytes. Nothing in the spec I see requires processing UTF-8 as an actual Unicode string.

The easiest way to handle Unicode is to not handle it at all, and just shove it down the line. This is often even correct, as long as you don't need to do any string operations on it.

If the author wanted to play Unicode games, requiring normalization would be the way to go, but that would turn this into a very different challenge.

alkonaut
2 replies
3h54m

Since the tail of the line has a known format, I guess we are rescued by the fact that the last 0x3B is the semicolon, as the rest is just a decimal number. We can't know the first 0x3B byte is the semicolon, since the place names are only guaranteed to not contain 0x3B but can contain 0x013B. So a parser should start from the rear of the line and read the number up to the semicolon; then it can treat the place name as byte soup. Had two places shared a line, this challenge would have required real UTF-8 parsing and been much harder.
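
A minimal sketch of that rear-first parse; since the temperature tail is plain ASCII, the backward scan never has to decode the name at all:

    // Hedged sketch: everything after the last ';' is the ASCII
    // temperature, everything before it is treated as byte soup.
    static int findSemicolon(byte[] line, int start, int end) {
        int i = end - 1;
        while (i > start && line[i] != ';') i--;
        return i; // name = line[start..i), temp = line[i+1..end)
    }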

marginalia_nu
0 replies
41m

I'm not sure scanning backwards helps here. Running in reverse you still need to look for a newline while scanning over a UTF-8 string, which might plausibly contain a newline byte.

I'm no UTF-8 guru, but I think it might be possible to use this as a sort of springboard for skipping over multi-byte codepoints, since as far as I understand the upper bits of the first byte encode the length:

    int lead = val & 0xFF; // first byte of the sequence, as an unsigned value

    if ((lead & 0xF8) == 0xF0) { // 4 byte codepoint (11110xxx)
      // skip 3 continuation bytes
    }
    else if ((lead & 0xF0) == 0xE0) { // 3 byte codepoint (1110xxxx)
      // skip 2 continuation bytes
    }
    else if ((lead & 0xE0) == 0xC0) { // 2 byte codepoint (110xxxxx)
      // skip 1 continuation byte
      // (note: the mask must be 0xE0 here, or lead bytes 0xD0-0xDF are missed)
    }

jerf
0 replies
1h59m

Yes, sorry, I'm so used to this sort of thing with the work I've been doing lately I forgot to lay it out like that. Thank you for filling it in for me.

lessbergstein
11 replies
12h16m

I don't understand; it should be pretty easy. A rolling average with BigDecimal would probably be sufficient, but a scientific lib might be better for a rolling average over more than a hundred million numbers.

https://stackoverflow.com/questions/277309/java-floating-poi...

ako
8 replies
11h53m

The difficulty is creating the fastest implementation. If you look at the results of the submissions so far you’ll see a big difference in duration, between 11 seconds and more than 4 minutes.

11 seconds seems pretty impressive for a 12GB file. Would be interesting to know what programming language could do it faster. For a database comparison you’d probably want to include loading the data into your database, for a fair comparison.

lessbergstein
7 replies
10h16m

Perl would do it quite fast and it has the benefit of accessing posix primitives directly.

gerikson
6 replies
10h1m

A naive perl solution is really really slow compared to even the reference Java implementation. (I know, I've tried)

lessbergstein
5 replies
9h21m

That's strange; you should be able to stream the file right into a tiny perl executable at the same speed as the bottlenecking hardware. The kernel will take care of all the logistics. You're probably trying to do too much explicitly. Just use a pipe. Perl should be done before the JIT completes.

gerikson
2 replies
6h36m

Using cat to redirect the file to /dev/null takes 18s on my machine (a low-end NUC). Just running a noop on the file in Perl (ie. feeding it into a `while (<>)` loop but not acting on the contents) takes ~2 minutes.

1B lines is a lot, and Java ain't a slouch.

lessbergstein
1 replies
6h31m

Why are you using cat at all? Use a pipe. This isn't hard stuff. Don't use <>; feed the file into a scalar or array. It should only take a few seconds to process a billion lines.

https://www.perl.com/pub/2003/11/21/slurp.html/#:~:text=Anot....

sobellian
0 replies
4h29m

If it isn't hard, then perhaps you could demonstrate with a complete perl program that you think should beat java.

gerikson
1 replies
9h6m

I profiled my attempt, actually reading each line is the bottleneck.

lessbergstein
0 replies
6h34m

Perl is always going to be much faster than Java at tasks like this. Use stdin and chomp() instead of reading each line explicitly.

This is really a small, trivial task for a perl script. Even a billion lines is nothing for a modern CPU and perl.

petters
1 replies
10h32m

It’s easy to solve but even fizzbuzz becomes complicated if you want double digit GB/s output.

lessbergstein
0 replies
6h27m

It's really not. We're talking about gigahertz CPUs and likely solid-state storage that can stream many GB/s, running through a perl script. There really isn't much that is faster than that.

scumola
10 replies
12h31m

This is just a map/reduce problem. Use Hadoop. It's Java, isn't it?

baq
8 replies
12h22m

Why would I use Hadoop for such a small number of rows…?

jalino23
7 replies
11h59m

1 billion is small for hadoop?

rapsey
3 replies
11h55m

If it fits on one computer it's not a hadoop problem.

quickthrower2
0 replies
11h15m

It fits on a dusty ten year old USB stick

badgersnake
0 replies
10h47m

Sounds like an awk problem tbh.

8organicbits
0 replies
10h42m

A Hadoop submission may help people realize that. But since you only have one machine to work with it should be obvious that you're not going to get any speed-up via divide and conquer.

chmod775
1 replies
10h6m

Anything that fits in RAM on one machine is easily too small for Hadoop. In those cases, the overhead of Hadoop is going to make it get destroyed by a single beefy machine. The only times where this might not be the case is when you're doing a crazy amount of computation relative to the data you have.

Note that you can easily reach 1TB of RAM on (enterprise) commodity hardware now, and SSDs are pretty fast too.

Old but gold post from 2014: https://adamdrake.com/command-line-tools-can-be-235x-faster-...

makapuf
0 replies
7h14m

brokensegue
0 replies
11h57m

~14GB file? it's on the small side for hadoop

kriskrunch
0 replies
12h28m

No external dependencies may be used

e63f67dd-065b
10 replies
10h44m

The rule lawyer in me wants to spend the first run spinning up a background daemon that loads everything into memory, pins it there, and maybe even prefetches everything into cache as the subsequent runs perform basically a linear scan (you never have to pagemiss if you have an oracle!).

write a Java program for retrieving temperature measurement values from a text file and calculating the min, mean, and max temperature per weather station

Depending on how far you want to stretch it, doesn’t precomputing the result on 1st run count too? Or pre-parsing the numbers into a compact format you can just slurp directly into rolling sums for subsequent runs.

Not the slightest in the spirit of the competition, but not against the rules as far as I can tell.

Edit: if we don't like pre-computation, we can still play with fancy out-of-the-box tricks like pre-sorting the input, pre-parsing, compacting, aligning everything, etc.

Edit 2: while we're here, why not just patch the calculate_time script to return 0 seconds :) And return 9999 for competitors, for good measure

yellow_lead
5 replies
10h37m

I think that would violate this rule

The computation must happen at application runtime, i.e. you cannot process the measurements file at build time (for instance, when using GraalVM) and just bake the result into the binary

gunnarmorling
2 replies
9h43m

Yeah, and I'd count caching results between runs into the same category, i.e. against the spirit of this challenge. Folks should keep in mind that the goal is to learn something new, rather than "winning" by cheating the rules.

cm2187
1 replies
9h15m

Though the caching done by the OS (I/O for instance) is fair game

gunnarmorling
0 replies
9h6m

Yes, exactly. Retrieving the source data file from the page cache != storing and reloading precomputed results.

e63f67dd-065b
1 replies
10h29m

Ah, I didn't actually follow the link to the GitHub repo. I don't think that precludes pre-processing the input into something that needs no parsing, however, or spawning a background daemon that prefetches everything in the right order.

The first run is indeed runtime, not build time, so technically I think pre-computation still counts: do it on the first run, send the result to the daemon (or just stash it somewhere in /tmp, or, even crazier, patch the jarfile), and just print it back out from the second run onwards.

robertlagrant
0 replies
8h26m

I don't think that precludes pre-processing the input into something that needs no parsing

Why not preprocess it into a file that contains the answer? (-: My read of the description was just: you're meant to read the file as part of the run.

camkego
2 replies
9h49m

There is a real issue with providing contestants the exact file that will be used in the contest. Because there are 1e9 shades of gray between hard-coding the correct answer in a single-line program which doesn't even read the input, and processing the file as if we don't know what it contains.

This might become a contest of judging what is fair pre-computing and what is not.

That's why machine learning contests don't let participants see the final data.

londons_explore
1 replies
7h25m

Fix:

A run consists of:

  * I generate random measurements with the provided script, which will be stored in a readonly file.

  * I run your program to calculate the results (and time it).

  * I run my baseline program and your entry is valid if my baseline program calculates the same results.

lelanthran
0 replies
3h5m

The filesystem has to be read-only as well, otherwise the first run can simply write the results to ~/.me_cheater and the program first reads that file before attempting to read the dataset.

rocqua
0 replies
9h13m

I think the rules should stipulate each run is on its own tmpfs and all processes and pagecaches are cleared between runs.

pbh101
9 replies
12h19m

Each contender will be run five times in a row. The slowest and the fastest runs are discarded. The mean value of the remaining three runs is the result for that contender and will be added to the leaderboard.

Shouldn’t he take the single fastest time, assuming file (and JDK) being in file cache is controlled for?

bayindirh
7 replies
11h38m

This is done to simulate real-world performance. Your binary is not the only binary in the system and other services may be running as well. So fastest time is the happiest path and slowest is the unluckiest.

The range of the remaining three is what you expect to get 99% of the time on a real-world system.

fragmede
5 replies
10h45m

Your binary is not the only binary in the system and other services may be running as well.

Technically yes, but these days most of my machines are single purpose VMs; database/load balancer/app server/etc, so it still seems weird not to take the fastest.

bayindirh
2 replies
10h43m

Then, your VM is not the only VM sharing the bare metal. Same thing applies, only on a slightly higher level.

As long as you share the metal with other stuff (be it containers, binaries, VMs), there's always competition for resources, and your average time becomes your real world time.

vinay_ys
1 replies
10h22m

Physical memory is not fungible like that across VMs. So you can expect stuff loaded into memory to stay there unless the kernel inside the VM decides otherwise.

bayindirh
0 replies
10h18m

No, it isn't. VirtIO's ballooning device can "inflate" to pseudo-allocate memory in a VM and free physical memory for other guests.

Moreover, even if your files are in memory, you cannot reserve "memory controller bandwidth". A VM using tons of memory bandwidth on a couple of cores in the same NUMA node as your VM will inevitably cause some traffic and slow you down.

berkes
1 replies
9h53m

That VM is hardly "single purpose".

There's logrotate and other cleanup tasks, monitoring, DNS, a firewall, and plenty more running on that server. No matter how much you offload to the host (or forgo), there's always a kernel and supporting daemons running alongside or under your app.

fragmede
0 replies
8h9m

I said "technically yes" :)

cwillu
0 replies
11h21m

Any production system where you're routinely scanning a text file with a million records is probably a batch process, and I'd be shocked if the usual performance wasn't much closer to the worst case than the average.

rockwotj
0 replies
11h16m

Why?

The author does explicitly want all the files to be cached in memory for the later runs: https://x.com/gunnarmorling/status/1742181941409882151?s=20

solarized
8 replies
8h57m

Just for fun. Speed testing awk vs Java.

  awk -F';' '{
      station = $1
      temperature = $2 + 0
      sum[station] += temperature
      count[station]++
      if (count[station] == 1 || temperature < min[station]) {
          min[station] = temperature
      }
      if (count[station] == 1 || temperature > max[station]) {
          max[station] = temperature
      }
  }
  END {
      n = asorti(count, stations)  # gawk: sort station names alphabetically
      printf "{"
      for (i = 1; i <= n; i++) {
          s = stations[i]
          mean = sum[s] / count[s]
          printf "%s=%.1f/%.1f/%.1f%s", s, min[s], mean, max[s], (i < n ? ", " : "}\n")
      }
  }' measurement.txt

arkh
4 replies
8h28m

I'd like to see it speed tested against an instance of Postgres using the file Foreign Data Wrapper https://www.postgresql.org/docs/current/file-fdw.html

    CREATE EXTENSION file_fdw;
    CREATE SERVER stations FOREIGN DATA WRAPPER file_fdw;
    CREATE FOREIGN TABLE records (
      station_name text,
      temperature float
    ) SERVER stations OPTIONS (filename 'path/to/file.csv', format 'csv', delimiter ';');
    SELECT station_name, MIN(temperature) AS temp_min, AVG(temperature) AS temp_mean, MAX(temperature) AS temp_max
    FROM records
    GROUP BY station_name
    ORDER BY station_name;

nathancahill
1 replies
8h14m

Man, Postgres is so cool and powerful.

aargh_aargh
0 replies
6h31m

Using FDWs on a daily basis, I fully realize their power and appeal, but at this exclamation I paused and thought: is this really how we think today? That reading a CSV file directly is a cool feature, state of the art? Sure, FDWs are much more than that, but I would assume we could achieve much more with machine learning, and not even just the current wave of LLMs.

Why not have the machine consider the data it is currently seeing (type, even actual values), think about what end-to-end operation is required and how often it needs to be repeated, make a time estimate (then verify the estimate, change it for the next run if needed, keep a history for future needs), and choose one of the methods it has at its disposal (index autogeneration, conversion of raw data, denormalization, efficient allocation of the memory hierarchy, ...). Yeah, I'm not focusing on this specific one billion rows challenge, but rather on what computers today should be able to do for us.

ftisiot
0 replies
6h6m

Just modified the original post to add the file_fdw. Again, none of the instances (PG or ClickHouse) were optimised for the workload https://ftisiot.net/posts/1brows/

DiggyJohnson
0 replies
1h49m

For some reason I'm impressed that the previous page in the documentation is https://www.postgresql.org/docs/current/earthdistance.html. Lotta fun stuff in Postgres these days.

jackhalford
1 replies
7h4m

Your sum variable could get pretty large; consider using a streaming mean update instead, something like: new_mean = ((n*old_mean)+temp)/(n+1)

apwheele
0 replies
4h3m

Had this same thought as well. Definitely comes up when calculating variance this way in streaming data.

https://stats.stackexchange.com/a/235151/1036

I am not sure if `n*old_mean` is a good idea. Welford's is typically something like this inside the loop:

count += 1; delta = current - mean; mean += delta/count;
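
In Java that update is just (a minimal sketch, names mine):

    // Welford-style streaming mean: single pass, no unbounded running sum.
    final class StreamingMean {
        private long count = 0;
        private double mean = 0.0;

        void add(double value) {
            count++;
            mean += (value - mean) / count;  // mean stays bounded throughout
        }

        double current() { return mean; }
    }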

Algunenano
0 replies
5h57m

Using ClickHouse local:

    time clickhouse local -q "SELECT concat('{', arrayStringConcat(groupArray(v), ', '), '}')
    FROM
    (
        SELECT concat(station, '=', min(t), '/', max(t), '/', avg(t)) AS v
        FROM file('measurements.txt', 'CSV', 'station String, t Float32')
        GROUP BY station
        ORDER BY station ASC
    )
    SETTINGS format_csv_delimiter = ';', max_threads = 8" >/dev/null

    real 0m15.201s
    user 2m16.124s
    sys 0m2.351

Most of the time is spent parsing the file

JohnKemeny
8 replies
10h21m

The slowest and the fastest runs are discarded. The mean value of the remaining three runs is the result for that contender and will be added to the leaderboard.

I think it's better to discard the two slowest, or simply accept the fastest as the correct. There's (in my opinion) no good reason to discard the best runs.

mayank
3 replies
9h51m

This is a pretty standard measure called the Trimmed Mean: https://statisticsbyjim.com/basics/trimmed-mean/

stabbles
1 replies
9h44m

For performance benchmarking, the minimal runtime is typically the best estimator if the computations are identical, because it measures performance without interrupts.

If the language is garbage collected, or if the test is randomized you obviously don't want to look at the minimum.

mayank
0 replies
9h9m

the minimal runtime is typically the best estimator

Depends what you’re estimating. The minimum is usually not representative of “real world” performance, which is why we use measures of central tendency over many runs for performance benchmarks.

ProblemFactory
0 replies
9h11m

Variability in software runtime arises mostly from other software running on the same system.

If you are looking for a real-world, whole-system benchmark (like a database or app server), then taking the average makes sense.

If you are benchmarking an individual algorithm or program and its optimisations, then taking the fastest run makes sense - that was the run with least external interference. The only exception might be if you want to benchmark with cold caches, but then you need to reset these carefully between runs as well.

paxys
2 replies
1h39m

If you don't think discarding the fastest is acceptable then why are you in favor of discarding the slowest?

JohnKemeny
1 replies
1h15m

Because there can be outliers in terms of slowness; perhaps the CPU gets a hiccup, the GC runs an abnormal number of extra times, the kernel decides to reschedule stuff, whatever; I don't know much about these things.

There can't be outliers in terms of fastness; the CPU doesn't accidentally run the program much faster.

But then again, what the hell do I know...

paxys
0 replies
44m

It's the same logic both ways. The OS runs 10 background tasks on average at any given time. On one of your runs it was doing 15 things, while on another it was doing 5 things. They are both outliers.

hyperpape
0 replies
5h7m

There are good reasons to discard the best runs. If you think of your system as having predictable behavior that's slowed down by background processing happening on the machine, it might make sense to use the best run. But if there are any intrinsic sources of non-determinism in the program (and it's far more likely than you might think), taking the best time is likely to be unrepresentative.

https://tratt.net/laurie/blog/2019/minimum_times_tend_to_mis... is a good piece on the subject.

hknmtt
7 replies
9h25m

first search result for averaging streaming data:

https://nestedsoftware.com/2018/03/20/calculating-a-moving-a...

so you just walk the file, read a chunk, update the averages and move on. the resource usage should be next to nothing and speed should be limited by your disk IO.

hknmtt
3 replies
8h37m

actually i think you can also just average each chunk and then fold it into the existing data. like read N rows (say all have one location, to keep it simple), average the data from the chunk, update/save min and max, move on to the next chunk, do the same but now update the average by adding it to the previously computed average and dividing by two. the result will be the same - disk IO will be the most limiting aspect. this "challenge" is not really a challenge. there is nothing complicated about it. it just seems "cool" when you say "process 1 billion rows the fastest you can".

sudodudeo
2 replies
7h5m

Wouldn't this end up reducing the weight of earlier records by repeatedly halving their contribution?

I.e. avg of {22.5, 23, 24} = 23.17... But:

1. 22.5

2. (22.5 + 23)/2 = 22.75

3. (22.75 + 24)/2 = 23.375

jdthedisciple
0 replies
4h16m

you have to weight the previous results appropriately, then it works

hknmtt
0 replies
4h0m

say you load 1 million records and you average them at 5.1. you then load another million and average them at 4.5. so (5.1 + 4.5) / 2 = 4.8. rinse and repeat. as long as you keep the number of records processed per run about the same, your numbers will not be skewed. only the last chunk will most likely be smaller and it will introduce a small rounding error, like if it has only 10k records instead of 1M. but still it is the simplest solution with a good enough outcome.
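
(for unequal chunks, the general fix is to carry the count along and merge weighted means; a minimal java sketch, types mine:)

    // Merge two chunk summaries; each mean is weighted by its row count,
    // so the combined mean is exact regardless of chunk sizes.
    record ChunkStats(long count, double mean) {
        ChunkStats merge(ChunkStats other) {
            long n = count + other.count;
            return new ChunkStats(n, (count * mean + other.count * other.mean) / n);
        }
    }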

winrid
1 replies
9h15m

Yes, just read a file in chunks and spread the math across cores. How many ways could you possibly implement that?? :)

cm2187
0 replies
9h5m

Custom number parsing, minimising the number of memory allocations so as not to be punished by the garbage collector. All sorts of micro-optimisations that make those solutions a terrible way to showcase a language (i.e. you can write much clearer and more concise code, but obviously slower).

cm2187
0 replies
9h17m

Looking at some solutions, they seem to include their own double-parsing implementation. I built a home-made serializer for CSV files using the default .NET parsing functions, and I find that parsing numbers/dates is by far the slowest part of the process on large files.
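
For this challenge's fixed format (always exactly one decimal digit, no exponent), a hand-rolled parser can skip general-purpose float parsing entirely. A minimal sketch in Java, assuming that format (my example, not any particular entry):

    // Parses e.g. "-12.3" into tenths (-123); assumes exactly one decimal
    // digit and no exponent, as in the challenge's measurements file.
    static int parseTenths(byte[] buf, int start, int end) {
        int i = start;
        boolean negative = buf[i] == '-';
        if (negative) i++;
        int value = 0;
        for (; i < end; i++) {
            if (buf[i] != '.') {             // skip the decimal point
                value = value * 10 + (buf[i] - '0');
            }
        }
        return negative ? -value : value;
    }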

justinl33
6 replies
11h4m

but… can I use pandas?

tempay
2 replies
10h48m

From a pure performance perspective it’d probably be quite slow in comparison to a dedicated implementation for this task. Especially as it wouldn’t fit in memory.

fragmede
1 replies
10h42m

why wouldn't it fit into memory? the data set is only 12gb.

asah
0 replies
6h4m

for my SQL implementation, I normalized the city names and temps to 2 bytes apiece, so it's even less!

https://news.ycombinator.com/item?id=38866073

buybackoff
2 replies
9h52m

It takes 330 seconds on a machine where the top OpenJDK version takes 5.75 secs and my .NET version takes 4.8 seconds. However, it's just several lines of code, and I used ChatGPT as a fun test of how long it would take to get something working. It took around 5 mins. So if one needs to run this code only once, Pandas would be a huge win considering development time. Interestingly, the RAM usage was much more than the 14 GB input file, probably 2-2.5x of that.

BTW asking ChatGPT to utilize all cores did not yield anything working in reasonable time.

justinl33
1 replies
8h29m

hahaha... nice! Out of interest, could you share more about your .NET solution? (the specifics but not the code)

buybackoff
0 replies
8h25m
brrrrrm
5 replies
12h11m

looking at the sample code I’m quite glad I’ve never really touched Java. What a clunky language

wiseowise
4 replies
9h40m

Ok, I’ll bite - why?

winrid
1 replies
9h29m

It looks like they mostly work in JS, so maybe it's the type definitions that look clunky.

Java is explicit, yes. This is intentional :)

brrrrrm
0 replies
3h0m

yea, I guess it's the lack of human-friendly default assumptions mixed with prescriptive design patterns. definitely an anachronistic design intent :(

brrrrrm
1 replies
3h6m

it feels like the functionality-to-characters-typed ratio is quite low. at the same time, the code isn't much cleaner and there's a lot of mental overhead in deciphering the abstractions (in this case Collector).

estiaan
0 replies
1h12m

Java is probably the language I use the most, but I hate how much boilerplate it has. Something I've found to help with that is using Java records instead of classes. A record is kind of like a struct, though I don't think it's exactly the same: there's no boilerplate, and something I quite like about them is that their fields are immutable (more specifically, "private final"), which works well with more functional programming, something that can be quite clunky with classes in Java.

Anyway, I'm not a software engineer, so take what I say about software with a grain of salt. Also, I feel like other Java programmers will hate you if they need to read or use your code. For example, let's say you had a 2D point data type that is a record: instead of "pos.add(5)" mutating pos, you need to reassign it like "pos = pos.add(5)", where add returns a new point object. Similar to how BigInteger works in Java.

Anyway, I love records because there's zero boilerplate; it feels very similar to how structs are written in Rust and Go, or how classes are done in Scala. I just never see anyone talk about them. I guess that might be because most people starting new projects on the JVM are probably not programming in Java?
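
For example, a minimal sketch of that 2D point as a record:

    // A record gives constructor, equals/hashCode and accessors for free;
    // fields are final, so operations return a new value instead of mutating.
    record Point(double x, double y) {
        Point add(double d) {
            return new Point(x + d, y + d);
        }
    }

    // Usage: reassign rather than mutate, like BigInteger:
    // Point pos = new Point(1, 2);
    // pos = pos.add(5);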

weinzierl
4 replies
7h17m

Unfortunately I have no time to code this up but things that I would try to make it fast:

- Read with O_DIRECT (but compare it with the mmap approach). I know O_DIRECT gets much hate, but this could be one of the rare cases where it helps.

- Use a simple array with sentinel and linear search for the station names. This is dumb, but if the number of stations is small enough this could beat the hash. (In the back of my head I have a rule of thumb that linear search is faster for up to 1000 elements, but I'm not sure anymore where I got this from).
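
A minimal Java sketch of the sentinel variant (capacity and names mine):

    // Linear search with a sentinel: the probe key is written one slot past
    // the end, so the scan loop needs no bounds check. Only sensible while
    // the number of distinct stations stays small.
    final class StationTable {
        private final String[] names = new String[10_000]; // capacity >> station count
        private int size = 0;

        int indexOf(String key) {
            names[size] = key;                 // sentinel
            int i = 0;
            while (!names[i].equals(key)) i++;
            if (i == size) size++;             // key was new: sentinel becomes an entry
            return i;
        }
    }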

gpderetta
1 replies
4h29m

Read with O_DIRECT

apparently the challenge explicitly allow for warming up the os cache on a prior run, so O_DIRECT would be disadvantaged here as the working set fits in memory.

edit: also in my very simple tests a good hashmap can beat linear search already at <10 items. It depends a lot on the hash function cost.

weinzierl
0 replies
3h20m

beat linear search already at <10 items

The sources I found corroborate this. My 1000 elements rule of thumb is apparently way off.

londons_explore
0 replies
7h10m

linear search is never faster assuming your hash algorithm is cheap enough. However, if your list only contains 2 items, then your hash algorithm better be "Is first character > 'm'".

When you want real speed like this and are happy with insane complexity, code that writes and self-compiles a hash function based on the data can make sense.

dvko
0 replies
4m

I tried the linear search by station name in my first naive approach. Using a hashmap was at least 2-3x as fast with the ~415 distinct keys in the 1BRC dataset.

gigatexal
4 replies
9h20m

Single line solve using clickhouse-local or duckdb.

lars_francke
2 replies
8h8m

No external dependencies may be used
tatersolid
0 replies
6h19m

The Java runtime is an external dependency, isn’t it?

gigatexal
0 replies
6h29m

the whole premise is silly though. why would anyone use plain java to compute this when databases were built for this or at least are the most finely tuned for it

uwemaurer
0 replies
7h30m

correct. like this:

    time duckdb -list -c "select map_from_entries(list((name,x))) as result
    from (select name, printf('%.1f/%.1f/%.1f', min(value), mean(value), max(value)) as x
          from read_csv('measurements.txt', delim=';', columns={'name': 'varchar', 'value': 'float'})
          group by name order by name)"

takes about 20 seconds

spullara
2 replies
9h30m

This has been super fun. My current PR[1] is another 15% faster than my entry in the README. Sadly I haven't been able to make any progress on using SIMD to accelerate any part of it.

I think the issues with hashing could be easily covered by having the 500 city names in the test data also be randomly generated at test time. There is no way to ensure there aren't hash collisions without doing a complete comparison between the names.

[1] https://github.com/gunnarmorling/1brc/pull/56

londons_explore
1 replies
7h21m

1 billion rows, 500 unique values...

It becomes very possible to find an instance of each unique value, then runtime-design a hash algorithm where those 500 values don't collide.

Java allows self modifying code after all (and this challenge also allows native code, which can also be compiled-at-runtime)
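
A sketch of that idea in plain Java, without actual codegen (a simpler variant: searching for hash parameters rather than compiling a hash function; names mine). Once the key set is known, hunt for a multiplier that makes a trivial multiplicative hash collision-free in a table comfortably larger than the key count:

    import java.util.Set;

    // Finds an odd multiplier m such that (key.hashCode() * m) >>> (64 - tableBits)
    // maps the given keys with no collisions. For ~500 keys, tableBits = 16
    // (65536 slots) typically succeeds within a handful of tries.
    static long findCollisionFreeMultiplier(Set<String> keys, int tableBits) {
        for (long m = 1; m < (1L << 32); m += 2) {          // odd multipliers only
            boolean[] used = new boolean[1 << tableBits];
            boolean ok = true;
            for (String key : keys) {
                int slot = (int) ((key.hashCode() * m) >>> (64 - tableBits));
                if (used[slot]) { ok = false; break; }      // collision: try next m
                used[slot] = true;
            }
            if (ok) return m;
        }
        throw new IllegalStateException("no collision-free multiplier found");
    }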

spullara
0 replies
17m

You have to compare the keys in order to figure out whether you have a hash collision or a new key. Without first scanning the entire file to learn the full list of keys, there isn't a way around it. And even determining that list of keys involves doing key comparisons for every row.

jimmyed
2 replies
9h25m

I think the optimal strategy would be to use the "reduce" step in mapreduce. Have threads that read portions of the file and add data to a "list", 1 for each unique name. Then, this set of threads can "process" these lists. I don't think we need to sort, that'd be too expensive, just a linear pass would be good. I can't see how we can do SIMD since we want max/min which mandate a linear pass anyway.
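
The usual shape of that reduce step in Java keeps running aggregates per station instead of materializing lists (a minimal sketch, one partial map per thread; types mine):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Per-station running aggregate. Each thread fills its own map over its
    // chunk of the file, so no concurrent data structures are needed.
    final class Agg {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0;
        long count = 0;

        void add(double t) {
            if (t < min) min = t;
            if (t > max) max = t;
            sum += t;
            count++;
        }

        void merge(Agg o) {
            if (o.min < min) min = o.min;
            if (o.max > max) max = o.max;
            sum += o.sum;
            count += o.count;
        }
    }

    // Reduce step: fold each thread's partial map into one global map.
    static Map<String, Agg> reduce(List<Map<String, Agg>> partials) {
        Map<String, Agg> global = new HashMap<>();
        for (Map<String, Agg> partial : partials) {
            partial.forEach((station, agg) ->
                global.merge(station, agg, (a, b) -> { a.merge(b); return a; }));
        }
        return global;
    }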

qsort
0 replies
9h13m

Agreed, the aggregations chosen here are embarrassingly parallel; you just keep the count to aggregate means.

Would have been more interesting with something like median/k-th percentile, or some other aggregation that isn't as easy to parallelize.

dist-epoch
0 replies
8h57m

Not sure if this what you meant, but there are SIMD min/max instructions.

https://www.felixcloutier.com/x86/phminposuw
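
In Java these are reachable through the incubating Vector API (jdk.incubator.vector; run with --add-modules jdk.incubator.vector). A minimal sketch of a SIMD max reduction:

    import jdk.incubator.vector.DoubleVector;
    import jdk.incubator.vector.VectorOperators;
    import jdk.incubator.vector.VectorSpecies;

    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    // Lanewise max over SIMD-width chunks, then reduce lanes; scalar tail loop.
    static double simdMax(double[] values) {
        DoubleVector acc = DoubleVector.broadcast(SPECIES, Double.NEGATIVE_INFINITY);
        int i = 0;
        for (; i <= values.length - SPECIES.length(); i += SPECIES.length()) {
            acc = acc.max(DoubleVector.fromArray(SPECIES, values, i));
        }
        double max = acc.reduceLanes(VectorOperators.MAX);
        for (; i < values.length; i++) {
            if (values[i] > max) max = values[i];
        }
        return max;
    }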

drusepth
2 replies
10h52m

This would make for a really fun challenge in SQL, too.

winrid
0 replies
9h28m
asah
0 replies
6h8m

1:05 in PostgreSQL 16 but it was harder than I thought to saturate the CPUs and not be disk-bound. Also, I ran on GCP not Hetzner, so maybe different hardware.

SQL in theory makes this trivial, handles many of the big optimizations and looking at the repo, cuts 1000+ LOC down to a handful. Modern SQL engines handle everything for you, which is the whole damned point of SQL. Any decent engine will handle parallelism, caching, I/O, etc. Some exotic engines can leverage GPU but for N=1 billion, I doubt GPU will be faster. Here's the basic query:

   SELECT city, MIN(temp), AVG(temp), MAX(temp) FROM temps GROUP BY 1 ORDER BY 1;
In practice, generic SQL engines like PostgreSQL bloat the storage which is a Big Problem for queries like this - in my test, even with INT2 normalization (see below), pgsql took 37 bytes per record which is insane (23+ bytes of overhead to support transactions: https://www.postgresql.org/docs/current/storage-page-layout....). The big trick is to use PostgreSQL arrays to store the data by city, which removes this overhead and reduces the table size from 34GB (doesn't fit in memory) to 2GB (which does).

The first optimization is to observe that the cardinality of cities is small and can be normalized into integers (INTEGER aka INT4), and that the temps can as well (1 decimal of precision). Using SMALLINT (aka INT2) is probably not faster on modern CPUs but should use less RAM, which is better for both caching on smaller systems and cache hitrate on all systems. NUMERIC generally isn't faster or tighter on most engines.

To see the query plan, use EXPLAIN:

   postgres=# explain SELECT city, MIN(temp), AVG(temp), MAX(temp) FROM temps_int2 GROUP BY 1 ORDER BY 1 limit 5;
                                                  QUERY PLAN
   ---------------------------------------------------------------------------------------------------------------
   Limit  (cost=13828444.05..13828445.37 rows=5 width=38)
     ->  Finalize GroupAggregate  (cost=13828444.05..13828497.22 rows=200 width=38)
         Group Key: city
         ->  Gather Merge  (cost=13828444.05..13828490.72 rows=400 width=38)
               Workers Planned: 2
               ->  Sort  (cost=13827444.02..13827444.52 rows=200 width=38)
                     Sort Key: city
                     ->  Partial HashAggregate  (cost=13827434.38..13827436.38 rows=200 width=38)
                           Group Key: city
                           ->  Parallel Seq Scan on temps_int2  (cost=0.00..9126106.69 rows=470132769 width=4)
 JIT:
   Functions: 8
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(13 rows)

Sigh, pg16 is still pretty conservative about parallelism, so let's crank it up.

   SET max_parallel_workers=16; set max_parallel_workers_per_gather=16;
   SET min_parallel_table_scan_size=0; set min_parallel_index_scan_size=0;
   SET parallel_setup_cost = 0; -- Reduce the cost threshold for parallel execution
   SET parallel_tuple_cost = 0.001; -- Lower the cost per tuple for parallel execution

   postgres=# explain SELECT city, MIN(temp), AVG(temp), MAX(temp) FROM temps_int2 GROUP BY 1 ORDER BY 1 limit 5;
   ...
               Workers Planned: 14
   ...
top(1) is showing that we're burying the CPU:

   top - 10:09:48 up 22 min,  3 users,  load average: 5.36, 1.87, 0.95
   Tasks: 169 total,   1 running, 168 sleeping,   0 stopped,   0 zombie
   %Cpu(s): 12.6 us,  4.8 sy,  0.0 ni,  7.8 id, 74.2 wa,  0.0 hi,  0.7 si,  0.0 st
   MiB Mem :  32084.9 total,    258.7 free,    576.7 used,  31249.5 buff/cache
   MiB Swap:      0.0 total,      0.0 free,      0.0 used.  30902.4 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   1062 postgres  20   0  357200 227952 197700 D  17.9   0.7   9:35.88 postgres: 16/main: postgres postgres [local] SELECT
   1516 postgres  20   0  355384  91232  62384 D  17.6   0.3   0:08.79 postgres: 16/main: parallel worker for PID 1062
   1522 postgres  20   0  355384  93336  64544 D  17.6   0.3   0:08.53 postgres: 16/main: parallel worker for PID 1062
   1518 postgres  20   0  355384  90148  61300 D  17.3   0.3   0:08.53 postgres: 16/main: parallel worker for PID 1062
   1521 postgres  20   0  355384  92624  63776 D  17.3   0.3   0:08.54 postgres: 16/main: parallel worker for PID 1062
   1519 postgres  20   0  355384  90440  61592 D  16.6   0.3   0:08.58 postgres: 16/main: parallel worker for PID 1062
   1520 postgres  20   0  355384  92732  63884 D  16.6   0.3   0:08.49 postgres: 16/main: parallel worker for PID 1062
   1517 postgres  20   0  355384  91544  62696 D  16.3   0.3   0:08.55 postgres: 16/main: parallel worker for PID 1062
interestingly, when we match workers to CPU cores, %CPU drops to 14%, i.e. we don't fully leverage the hardware.

OK enough pre-optimization, here's the baseline:

   postgres=# SELECT city, MIN(temp), AVG(temp), MAX(temp) FROM temps_int2 GROUP BY 1 ORDER BY 1 limit 5;
    city | min |         avg          | max
   ------+-----+----------------------+------
       0 |   0 | 276.0853550961625011 | 1099
       1 |   0 | 275.3679265859715333 | 1098
       2 |   0 | 274.6485567539599619 | 1098
       3 |   0 | 274.9825584419741823 | 1099
       4 |   0 | 275.0633718875598229 | 1097
   (5 rows)

   Time: 140642.641 ms (02:20.643)
I also tried to leverage a B-tree covering index (CREATE INDEX temps_by_city ON temps_int2 (city) INCLUDE (temp)) but it wasn't faster - I killed the job after 4 minutes. Yes, I checked that pg16 used a parallel index-only scan (SET random_page_cost = 0.0001; set min_parallel_index_scan_size = 0; set enable_seqscan = false; SET enable_parallel_index_scan = ON;) - top(1) shows ~1.7% CPU, suggesting that we were I/O bound.

Instead, to amortize the tuple overhead, we can store the data as arrays:

   CREATE TABLE temps_by_city AS SELECT city, array_agg(temp) from temps_int2 group by city;

   $ ./table_sizes.sh
   ...total...   | 36 GB
   temps_int2    | 34 GB
   temps_by_city | 1980 MB
Yay, it now fits in RAM.

   -- https://stackoverflow.com/a/18964261/430938 adding IMMUTABLE PARALLEL SAFE
   CREATE OR REPLACE FUNCTION array_min(_data ANYARRAY) RETURNS NUMERIC AS $$
       SELECT min(a) FROM UNNEST(_data) AS a
   $$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;
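   -- array_avg and array_max (used below) are assumed to be defined analogously.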

   SET max_parallel_workers=16; set max_parallel_workers_per_gather=16;
   SET min_parallel_table_scan_size=0; set min_parallel_index_scan_size=0;
   SET parallel_setup_cost = 0; -- Reduce the cost threshold for parallel execution
   SET parallel_tuple_cost = 0.001; -- Lower the cost per tuple for parallel execution

   postgres=# create table tmp1 as select city, min(array_min(array_agg)), avg(array_avg(array_agg)), max(array_max(array_agg)) from temps_by_city group by 1 order by 1 ;
   SELECT 1001
   Time: 132616.944 ms (02:12.617)

   postgres=# explain select city, min(array_min(array_agg)), avg(array_avg(array_agg)), max(array_max(array_agg)) from temps_by_city group by 1 order by 1 ;
                                           QUERY PLAN
   -------------------------------------------------------------------------------------------------
   Sort  (cost=338.31..338.81 rows=200 width=98)
   Sort Key: city
   ->  Finalize HashAggregate  (cost=330.14..330.66 rows=200 width=98)
         Group Key: city
         ->  Gather  (cost=321.52..322.64 rows=600 width=98)
               Workers Planned: 3
               ->  Partial HashAggregate  (cost=321.52..322.04 rows=200 width=98)
                     Group Key: city
                     ->  Parallel Seq Scan on temps_by_city  (cost=0.00..0.04 rows=423 width=34)
(9 rows)

Ah, only using 3 cores of the 8...

---

   CREATE TABLE temps_by_city3 AS SELECT city, temp % 1000, array_agg(temp) from temps_int2 group by 1,2;

   postgres=# explain select city, min(array_min(array_agg)), avg(array_avg(array_agg)), max(array_max(array_agg)) from temps_by_city3 group by 1 order by 1 ;
                                               QUERY PLAN
   ---------------------------------------------------------------------------------------------------------
   Finalize GroupAggregate  (cost=854300.99..854378.73 rows=200 width=98)
   Group Key: city
   ->  Gather Merge  (cost=854300.99..854348.73 rows=2200 width=98)
         Workers Planned: 11
         ->  Sort  (cost=854300.77..854301.27 rows=200 width=98)
               Sort Key: city
               ->  Partial HashAggregate  (cost=854290.63..854293.13 rows=200 width=98)
                     Group Key: city
                     ->  Parallel Seq Scan on temps_by_city3  (cost=0.00..98836.19 rows=994019 width=34)
 JIT:
   Functions: 7
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(12 rows)

This 100% saturates the CPU and runs in ~65 secs (about 2x faster).

---

Creating the data

Here's a very fast, lousy first pass for PostgreSQL that works on most versions - for pg16, there's random_normal(). I'm not on Hetzner, so I used GCP (c2d-standard-8 with 8 vCPU, 32 GB RAM, and a 150 GB disk; ubuntu 22.04) and PostgreSQL 16:

   CREATE TABLE temps_int2 (city int2, temp int2);

   -- random()*random() is a cheap distribution. 1100 = 110 * 10 to provide one decimal of precision.
   INSERT INTO temps_int2 (city, temp) SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp from generate_series(1,1e9)i;

Xophmeister
2 replies
8h44m

This is an IO bound problem. Trying to "go faster" by using multiple threads or vectorisation isn't going to achieve much.

frant-hartm
0 replies
8h27m

No, it is not. The file is smaller than the available RAM, so it will be cached in subsequent runs.

brazzy
0 replies
8h27m

You're not up to date with how fast IO can be these days. Nobody who cares primarily about performance uses spinning rust anymore.

wokwokwok
1 replies
10h55m

I suppose it defeats the spirit of the game to, knowing your worst run is discarded, calculate the results on the first run by whatever slow method you want, save them somewhere useful, and just read the results and print them out on the following runs?

Or at the very least, convert the input into a more convenient binary format for the following runs.

tutfbhuf
0 replies
10h22m

I thought the same, but to ensure fairness, I would suggest that the application should run in a stateless container without internet access, and the infrastructure (Hetzner VM) should be recreated from scratch for each run to eliminate all caches.

sixlelyt321
1 replies
10h27m

The machine has 32GB of RAM rather than the expected 16GB, but for Java I doubt even 32GB is enough. Java was absolutely good in the old days, but nowadays there are a lot more options than that poor thing.

EdwardDiego
0 replies
9h53m

Meaningless to talk about how much RAM you need for a JVM program as it will (obviously) depend on what the program is doing.

mgoetzke
1 replies
9h48m

Anyone up to also do this in Rust ?

gunnarmorling
0 replies
9h46m

See the "Show & Tell", where folks are discussing solutions in different languages, including Rust: https://github.com/gunnarmorling/1brc/discussions/categories....

jmclnx
1 replies
1h37m

Is there a way to create this file without having java ?

This is for use on a BSD where Java may not work too well.

Thanks

dvko
0 replies
7m

You can run the bin/create-sample program from this C implementation here: https://github.com/dannyvankooten/1brc

It’s just the city names + averages from the official repository using a normal distribution to generate 1B random rows.

divbzero
1 replies
1h5m

I suspect Java is not the fastest language for this. I’d love to see unofficial contenders tackle this challenge using different languages.

dvko
0 replies
12m

There are a handful of implementations in other languages already. Here’s mine in C99: https://github.com/dannyvankooten/1brc

I’ve also seen versions in Rust, Go, Python, Clickhouse and DuckDB. The discussions tab on the GitHub repo lists some of these.

wakasaka
0 replies
3m

how to make someone else do your homework for free

unobatbayar
0 replies
8h14m

This must be Decodable's engineering problem disguised as a challenge.

pronoiac
0 replies
10h52m

I've idly been considering how to score this with more languages, with far less memory. I'm thinking, buildpacks on a Raspberry Pi.

kebsup
0 replies
8h55m

In the C course at the Czech Technical University, we had a very similar assignment. The submissions of all the course students were continuously evaluated on a leaderboard, and many spent tens of hours optimizing to get the extra points for a better grade (but mostly for status points).

dvko
0 replies
16m

Very fun challenge that nerd sniped me right away. Had to do a C version in standard C99 with POSIX threads. It[1] clocks in at just under 4 seconds on my AMD Ryzen 4800U Laptop CPU.

Should run about 10-20% faster than that on the mentioned Hetzner hardware.

- Since we only do one decimal of floating point precision it uses integer math right from the get-go.

- FNV-1a hash with linear probing and a load factor well under 0.5.

- Data file is mmap’d into memory.

- Data is processed in 8 totally separate chunks (no concurrent data structures) and then those aggregations are in turn aggregated when all threads have finished.

1: https://github.com/dannyvankooten/1brc
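
For reference, the FNV-1a step itself is tiny; a Java rendering of it (the repo's version is C):

    // FNV-1a over a byte slice: XOR each byte in, then multiply by the prime.
    static int fnv1a(byte[] data, int start, int end) {
        int hash = 0x811c9dc5;              // 32-bit FNV offset basis
        for (int i = start; i < end; i++) {
            hash ^= data[i] & 0xff;
            hash *= 0x01000193;             // 32-bit FNV prime
        }
        return hash;
    }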

daftego
0 replies
50m

Any elixir devs in here find this funny?

countrymile
0 replies
10h24m

Comparing solutions from the Rstats world: https://twitter.com/RandVegan/status/1742721195781267650

brador
0 replies
7h54m

Would offering a cash prize increase or decrease the quality of submissions?

atbpaca
0 replies
4h35m

Would have been nice to accept other JVM languages like Scala, Clojure, Kotlin, etc., and compare them against Java, which I expect to have slightly better performance.

DeathArrow
0 replies
3h1m

This can turn into a nice benchmark if done in multiple languages.