
The One Billion Row Challenge in Go: from 1m45s to 4s in nine solutions

jonahx
39 replies
12h57m

Nice post. Interesting that the fastest Java beats the fastest Go, though they are close:

    AY fastest Go version 2.90s 36.2
    TW fastest Java version 0.953s 110
I would have expected Go to win. That JVM works pretty well...

threeseed
11 replies
12h16m

The JVM has always been on par with, if not often faster than, hand-written C code.

Go's advantage has always been that it is good enough at a lot of things.

Mawr
7 replies
11h32m

Please, you've been reading too much PR from the Java side and not looking at benchmarks and real-world performance enough. What you're claiming is inherently not possible, cherry-picked benchmarks notwithstanding.

threeseed
4 replies
11h5m

Can you explain why it's not technically possible?

JVM has had decades of experience at optimally translating bytecode to machine code and can take advantage of SIMD, AVX etc when needed. Most hand-written C code is far from optimal.

10000truths
2 replies
10h26m

C compilers also have decades of experience optimally translating C code into machine code, and they are arguably more capable of emitting SIMD (good luck trying to use cutting edge AVX-512 intrinsics like vpopcntdq with the JVM). The fact is that there is nothing a JIT compiler can do that an AOT compiler can't do, but in the case of AOT, the resources spent compiling the code are amortized to effectively 0, whereas that resource cost is borne upon every program startup for a JIT engine.

pjmlp
0 replies
9h41m

C compilers only get one opportunity to do that, at compile time. If the developer was lucky with the data set used to train the PGO output, maybe the outcome is greatly improved.

Modern JVMs not only have a JIT able to use actual production data, they are able to cache PGO data between execution runs and reach an optimal set of heuristics throughout execution time.

And on Android, those PGO files are even shared between devices via Play Store.

neonsunset
0 replies
9h15m

That's not necessarily true regarding the JIT vs AOT split. I'm mostly going off how the divergence in available optimizations is starting to look in .NET after the introduction of Native AOT, with light research into LLVM and various optimization-adjacent Rust crates.

In particular, with JIT, you are able to initialize certain readonly data once and then, on recompilation to a more optimized version, bake such data as JIT constants right into the emitted machine code. This is not possible with AOT. The same applies to all kinds of in-runtime profiling/analysis and recompilation to incorporate a profile collected from this exact run of an application. JIT also offers the ability to load modules dynamically in the form of bytecode without having to have a strict machine-level ABI, only the bytecode one, which allows for efficient generics that cross modules, as well as cross-module function inlining. And last but not least, there is no need to pick the least common denominator in supported hardware features, as the code can be compiled to use the latest features provided by hardware like AVX512.

On the other hand, pure AOT means a frozen world, which allows the compiler to know the exact types and paths the code can take, performing exact devirtualization and much more aggressive preinitialization of code that accepts constant data. It also means bigger leeway in the time the compiler can spend on optimizing code. Historically, GCC and LLVM have been more advanced than their JIT counterparts because of different tradeoffs that favour absolute performance of the emitted code, as well as simply more man-hours invested in developing them (e.g. .NET punches above its weight class despite being worked on by a smaller team vs OpenJDK or LLVM).

dzaima
0 replies
54m

A couple weeks ago I managed to get a nice setup for viewing Java JIT disassembly[1], and did a bunch of ad-hoc tests on OpenJDK 21 (and some on whatever build of 23 was on openjdk.org). They do manage to vectorize a decent number of vectorizable loops, but semi-often miss some improvements, and some trivial things didn't vectorize at all (most amusingly, a loop summing an int[] to an int didn't get vectorized, while summing an int[] to a long did). Scalar codegen is also quite mediocre.

GCC & clang, in comparison, often produce assembly which I'd consider optimal, or close to it, for vectorizable loops, and are noticeably better than OpenJDK for scalar code.

[1]: it's a bit messy to use but: https://github.com/dzaima/grr#java-jit

vanviegen
1 replies
11h19m

Sure it's possible. The JVM can do guided optimizations at run time. There is no such thing for native executables.

pjmlp
0 replies
9h40m

And as I mentioned in another comment, you can even cache PGO data between executions, so there is no need to always start from zero.

ffsm8
1 replies
11h27m

I think the reason this misconception is so widespread is that there is a grain of truth in it: almost everyone sees Java as synonymous with gigantic frameworks like Spring, Quarkus, etc.

In Go you've got your standard libraries, and these are generally quicker than the Java equivalents simply because they do less in the lifecycle of the operation.

This lets Java do funky stuff like enabling full JVM/code tracing just by adding a JAR file at runtime. But it does come with a performance penalty.

pjmlp
0 replies
9h45m

Which is a reason why dynamic loading of agents now requires being enabled explicitly.

_ph_
0 replies
9h46m

One has to differentiate here a bit. Java JIT technology has become really great: highly optimized native code generation which hugely benefits from the ability to use live profiling data to optimize the code. This is why it often beats static compilers at generating faster code. Static compilers can only optimize for the range of possible data; the JIT can optimize based on the data actually presented to the program.

On the downside, there are quite a few features of the Java language and the JVM which often make programs slow: a lot of details of the object model, the lack of value classes, JIT compilation which takes time at startup, etc. Also, a lot of Java libraries are pretty heavyweight.

Go is quite different here. It is statically compiled, which allows for fast program startup, and the language model makes it quite easy to naively write programs that perform reasonably fast. The downside is that the compiler doesn't optimize as heavily as other static compilers, in exchange for fast compilation. However, the ability to use profiling data to optimize compilation was recently added.

dsff3f3f3f
9 replies
12h54m

The Java version does additional optimizations that his Go version doesn't do and he mentions that at the end of the post. The Java version is really optimized and is an interesting read.

avinassh
4 replies
12h45m

    The Java version is really optimized and is an interesting read.

Is there any similar blog post on the Java optimisations?

dsff3f3f3f
0 replies
12h40m

Not that I know of. I just looked at the code and the commit history but a more in depth article would certainly be interesting.

neonsunset
2 replies
11h12m

The more accurate statement would be that the Go implementation is incapable of accessing optimizations that exist in Java, and then Java is incapable of optimizations performed by the C# and C++ implementations.

See https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-...

dsff3f3f3f
1 replies
10h47m

Go is perfectly capable of all of the additional optimizations that are in the fastest Java implementation that is linked in the article.

neonsunset
0 replies
8h4m

Until some time ago Go did not even have inlining profitability logic in the compiler and could only inline leaf functions, something which is worse than JVM implementations could do in...2005 or so I think?

Are you sure?

AtlasBarfed
0 replies
9h57m

Well the Go guy probably didn't read it because "java doesn't interest him"

neonsunset
6 replies
12h19m

Still loses to .NET. On the reference host, Java is still closer to the 1.7-2s ballpark (and has to use awkward SWAR to get there), while the fastest solution in C# is 1.2s, beating C++ (the code can be ported, however).

But yes, "I expected Go to win..." is exactly the core of the problem here. Same as with e.g. Swift, which people expect to perform on the level of Rust, when it is even slower than Go. The intuition caused by common misconceptions just does not correspond to reality sadly.

pjmlp
3 replies
9h48m

It only goes to show how much cargo cult there is in adopting these languages in hipster circles.

hnlmorg
2 replies
9h16m

No. What it actually demonstrates is that people didn’t read the source material properly.

The Java and Go versions use different optimisations. There’s nothing stopping either language from using the same optimisations as the other. It just wasn’t something their respective authors cared to try in their respective exercises.

neonsunset
1 replies
8h1m

There, however, is something stopping Go from using optimizations present in Java or C#/C++/Rust: the lack of a SIMD API (short of dropping down to hand-written assembly) and an overall much weaker compiler. This puts a much greater burden on the programmer to match the performance while staying with Go.

hnlmorg
0 replies
6h46m

    There, however, is something stopping Go from using optimizations present in Java or C#/C++/Rust

    ...

    This puts much greater burden on the programmer to match the performance while staying with Go.

Your second statement contradicts your first. You're not stopped from using SIMD in Go. There are in fact several 3rd party libraries out there to use SIMD. It's just not part of the standard library. So you can still use SIMD in Go without writing Go's dialect of assembly.

It's also worth noting that SIMD isn't even due to drop into std in C++ until C++26. At the moment you either have to use experimental or a 3rd party library.

You're also missing the point of these examples. Nobody writing these exercises is trying to claim that all languages are equal. And they're certainly not trying to write idiomatic code either. They're just fun little exercises demonstrating the degree of optimisation one can go through. You wouldn't want to write code like that in the examples in all but the tiniest of scenarios.

It's silly the number of people over-analysing this, what is essentially just a game, and then arguing it's "proof" of their biases towards different programming languages.

kaba0
1 replies
9h19m

    beating C++

Source for that?

threatofrain
4 replies
11h36m

I don't think Go has ever demonstrated that it deserved to be thought of as casually faster than Java.

iraqmtpizza
1 replies
9h32m

Anyone judging Java performance based on starting up a JVM, running for 30 seconds, and shutting down is brain damaged. I'm 99 percent certain this benchmark doesn't use Epsilon GC either lol

    not using GraalVM for Java

LOL k

    no implementations use multithreading

ah, the cherry on top. this benchmark is literally perfectly useless

Comma2976
0 replies
7h44m

And yet, people will refer to some other synthetic benchmark when arguing that the average Java project, which doesn't disable GC nor use unsafe reads et al, is comparable.

Can't have and eat it too

kaba0
0 replies
7h33m

So you are comparing a feature-packed enterprise framework with a raw http server ping-ponging hello world?

Also, what even does the last line mean? Go in general is significantly more verbose than Java.

parkcedar
2 replies
12h45m

The fastest Java version is even beating his baseline of `cat` to `/dev/null`

timeagain
0 replies
11h47m

More proof that the JVM is space-age future technology.

JohnBooty
0 replies
11h19m

Yes, though it's also worth noting that the fastest solutions are all doing their work in parallel which is not a thing for `cat`.

tutfbhuf
0 replies
16m

Your expectation is correct. The Java version is more tuned. Here https://github.com/dhartunian/1brcgo/?tab=readme-ov-file#lea..., you can find a version that runs in 1.108s and is almost 3x better than the one you quoted. I think one can reduce it even further. In the end, it depends on how fast the executable can boot up and execute the code. At some point, the JVM will lose because it takes quite some time just to initialize the JVM, whereas Go executables can boot up very fast. Here you can see a comparison of Hello World programs: https://github.com/qznc/hello-benchmark?tab=readme-ov-file#h.... The JVM takes a whopping 596 ms to boot up and print Hello World, whereas Go just requires 149 ms.

Mawr
0 replies
11h35m

That's only a valid comparison if the "fastest Java" and "fastest Go" implementations are either the same or at the limit of what each language allows.

The more interesting comparison anyway is performance of the straightforward, idiomatic code, since that's what we all write 99% of the time.

Here's the key insight from the article: "Processing the input file in parallel provides a huge win over r1, taking the time from 1 minute 45 seconds to 24.3 seconds. For comparison, the previous “optimised non-parallel” version, solution 7, took 25.8 seconds. So for this case, parallelisation is a bit faster than optimisation – and quite a bit simpler."

faizshah
13 replies
11h22m

I was curious how long it would take with Polars (for scale), apparently 33s: https://github.com/Butch78/1BillionRowChallenge/tree/main

I’m kind of interested in the opposite problem, what is the simplest solution using a well known library/db that approaches the fastest hand optimized solution to this problem?

sharno
5 replies
11h14m

That's the question worth asking imo. I was wondering how fast the idiomatic Java solution is.

kaba0
0 replies
9h25m

Is that the same hardware? Otherwise it doesn’t say much.

KingOfCoders
1 replies
8h44m

Java is often slightly faster than Go, has similar constructs (perhaps an older, better-optimized map implementation), perhaps a better GC (older, more optimized), though I don't think the GC is a challenge here, and has slower startup times. So I'd say roughly the same as the idiomatic Go version?

kaba0
0 replies
7h38m

Java's GC is just incomparably better; practically every piece of research on the topic is written for Java. Though it's true that without value types, Java does rely slightly more on it.

gigatexal
3 replies
10h47m

Where's the source data? I'd like to attempt ingesting and processing it with DuckDB.

nhinck3
0 replies
10h13m

In the original 1BRC, it's a python script that generates the data.

geysersam
1 replies
9h15m

Sounds very reasonable. In the blog post, about 20s were shaved off by assuming we don't need complicated string parsing. An off-the-shelf library can't make that assumption, so it will always have to pay the extra cost.

392
0 replies
7h0m

True in general, but some (especially libs aimed at larger datasets processed in batch) are taking advantage of benchmarks like this to do things like:

- Try the fast way; if it works, great.

- Try the slow way if the above fails.

This makes the slow path 2x slower at worst (and you can advise always using the slow way with optional params), but the fast path can be 10x faster.

JensRantil
13 replies
12h49m

Second article I'm reading on implementing this in Go. Since the temperatures are in the range [-99.9, 99.9] with a tenth of a degree of precision (~2k values), I am surprised that no one has implemented parsing of the numbers using a prepopulated lookup table. That should probably speed things up.

I submitted a github issue on this for the other implementation I looked at here[1].

[1] https://github.com/shraddhaag/1brc/issues/2
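
For illustration, here is a rough sketch (hypothetical, not from either repo) of what such a prepopulated lookup table could look like in Go: build it once from every valid encoding of -99.9 to 99.9, then a reading becomes a single lookup keyed on its raw bytes packed into a uint64. Whether this actually beats a few arithmetic operations is exactly what the replies below debate.

    package main

    import "fmt"

    // buildTempLUT maps the raw byte encoding of every valid reading
    // (packed big-endian into a uint64) to its value in tenths of a degree.
    func buildTempLUT() map[uint64]int16 {
        lut := make(map[uint64]int16, 2000)
        for t := -999; t <= 999; t++ {
            s := fmt.Sprintf("%.1f", float64(t)/10) // e.g. "-12.3"
            var key uint64
            for _, b := range []byte(s) {
                key = key<<8 | uint64(b)
            }
            lut[key] = int16(t)
        }
        return lut
    }

    func main() {
        lut := buildTempLUT()
        var key uint64
        for _, b := range []byte("42.1") {
            key = key<<8 | uint64(b)
        }
        fmt.Println(lut[key]) // 421 tenths, i.e. 42.1
    }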

pillusmany
7 replies
12h35m

He already uses custom parsing.

How do you search in the lookup table? If you are thinking of a hash map it will be slower than the few operations of his custom parser.
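
For context, the custom parse in question is only a handful of operations per reading; a sketch of the general shape (my approximation of the technique, not the post's exact code):

    // parseTemp converts b ("4.5", "-12.3", ...) to tenths of a degree,
    // assuming the 1BRC format of exactly one fractional digit.
    func parseTemp(b []byte) int32 {
        i := 0
        neg := false
        if b[0] == '-' {
            neg = true
            i++
        }
        var t int32
        if b[i+1] == '.' { // single integer digit: d.d
            t = int32(b[i]-'0')*10 + int32(b[i+2]-'0')
        } else { // two integer digits: dd.d
            t = int32(b[i]-'0')*100 + int32(b[i+1]-'0')*10 + int32(b[i+3]-'0')
        }
        if neg {
            return -t
        }
        return t
    }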

treyd
6 replies
12h16m

If you're clever about how you initially take apart the input you can just do a single memory load at a computed offset.

praptak
5 replies
11h35m

I don't see how you can make computing the offset faster than just parsing the number.

PeterisP
3 replies
9h36m

You could have much simpler parsing than ordinary parsing if you know/assume that you definitely have a valid number from -99.9 to 99.9.

For example, you could find whether it starts with a '-' and where the delimiter is to know the length of the number string representation (that's the "simpler parsing" part), and then don't do any work at all to decode the number, simply use these 1-5 bytes of that string (without sign and separator) directly as an index into a very large very sparse memory region in which all the valid values are pre-populated with the proper result.

You'd need to allocate 4 or 8 terabytes of virtual memory address space for that lookup table, but you'll touch only a tiny fraction of the memory pages, so it doesn't require an unacceptable amount of physical memory.

heavenlyblue
1 replies
8h39m

You don't need to cover all bits of the values, just the 10 numeric values that can pass a bounds check. That reduces the space to only 10K elements. With some bit shifting (and pre-validation) that should easily reduce the search space.

nwellnhof
0 replies
6h19m

Creating a LUT index with bit-shifting is essentially the same as parsing into an integer. Even if the LUT fits in L1 cache, I doubt it would be faster. If it doesn't fit, it's certainly slower.

akvadrako
0 replies
9h6m

Whether that is faster would seem to depend on whether you can get most lookups from the L2 cache. Otherwise you're waiting for main memory, which is a few hundred cycles. Even with multiple loads in parallel, it would be hard to beat arithmetic.

https://specbranch.com/posts/lookup-tables/

pletnes
0 replies
6h58m

Take the bytes of the number, «-9.7», and interpret as an integer? That’s a 4-byte int -> array index. (Haven’t tried but…)

KingOfCoders
3 replies
8h40m

Wouldn't you need a fixed length of the temps?

  00.1
 -10.3
 
or

  0.1 (with an ending space)
so you can look up 5 bytes in the map? (+/-, two digits, dot, one digit)

codegladiator
2 replies
8h29m

You can create a perfect hash based on the presence of at least 4 characters. The perfect hash is precalculated based on the possible inputs (-99.9 to 99.9 in bytes). The hash is the usual hash = hash*seed + byte. The "seed" is chosen so that there is no clash (you can find a static seed in a single brute force from 1 to 1m in < 1 min).
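
A sketch of that brute force as I read it (table sizes and seed range are my assumptions): hash every valid reading with hash = hash*seed + byte and keep the first seed that maps them all to distinct slots.

    package main

    import "fmt"

    // findSeed returns a seed for which hash = hash*seed + byte places every
    // valid reading (-99.9..99.9 in tenths) into a distinct slot of a table
    // of the given size.
    func findSeed(tableSize uint32) (uint32, bool) {
        var inputs [][]byte
        for t := -999; t <= 999; t++ {
            inputs = append(inputs, []byte(fmt.Sprintf("%.1f", float64(t)/10)))
        }
        for seed := uint32(1); seed < 1_000_000; seed++ {
            used := make(map[uint32]bool, len(inputs))
            ok := true
            for _, in := range inputs {
                var h uint32
                for _, b := range in {
                    h = h*seed + uint32(b)
                }
                if used[h%tableSize] {
                    ok = false
                    break
                }
                used[h%tableSize] = true
            }
            if ok {
                return seed, true
            }
        }
        return 0, false
    }

    func main() {
        // Grow the table until a collision-free seed turns up.
        for size := uint32(1 << 18); size <= 1<<22; size <<= 1 {
            if seed, ok := findSeed(size); ok {
                fmt.Printf("table size %d, seed %d\n", size, seed)
                return
            }
        }
        fmt.Println("no seed found in range")
    }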

KingOfCoders
1 replies
4h56m

I thought the lookup would just be a byte tree, not a hash. Wouldn't a hash, with its calculation, defeat the purpose of being faster than parsing a number?

The idea would be: you have a tree of all values from 0.0 to 99.9 and then just use the bytes to iterate the tree (e.g. in an array) to come up with the int value of e.g. 45.5.

codegladiator
0 replies
4h14m

Parsing a number involves an addition and a multiplication per digit for each row. If you do a precalculated perfect hash, then the per-row multiplications can be avoided. (You need to read each byte anyway.)

K0nserv
0 replies
11h49m

I did it with custom parsing[0] and treated the numbers as 16-bit integers; the representation in the file is not a constant number of bytes, which complicates the table approach. If you end up computing a hash, I think it might be slower than just doing the equivalent parsing I do, and a four-byte constant table would be very large and mostly empty. Maybe a trie would be good.

0: https://github.com/k0nserv/brc/blob/main/src/main.rs#L279

nottorp
10 replies
11h33m

I have a feeling that a naive implementation in Java would be a lot worse than a naive implementation in Go so optimizing matters more there.

Had to parse csvs in Java on a very memory constrained system once... we ended up cutting out a feature because it wasn't worth it.

speedgoose
2 replies
9h2m

Golang is actually not as efficient as Java in quite a few benchmarks.

~~Using LLVM isn’t a magic solution to perform better than something relying on the JVM.~~

Here is a source: https://sites.google.com/view/energy-efficiency-languages

TwentyPosts
1 replies
8h28m

Huh? Go doesn't use LLVM though, where did you get the idea that it does? That's part of why its compile times are so fast.

speedgoose
0 replies
7h40m

You are right, it doesn’t use LLVM. No idea where I got the idea, I was confidently wrong.

cangeroo
2 replies
11h22m

Regarding Java, it probably could be done with arrays and object reuse (arenas). But it's slightly less ergonomic, and the ecosystem isn't designed for it, so you'd have to implement your own memory-efficient parser.

nottorp
1 replies
10h19m

Yep, but it wasn't a critical feature and we were in a rush, so the feature was killed instead.

    Depends on which Java implementation is used.

... if you have a choice. It was a port of AOSP, so we didn't. In any case it wasn't the JVM's fault; the device just had very little RAM.

syspec
0 replies
8h14m

Sounds like a skill issue

masklinn
1 replies
9h54m

Depends what you call “naive”, but the “idiomatic Java solution” from last week’s post (https://questdb.io/blog/billion-row-challenge-step-by-step/) clocked in at 71 seconds, or 1:11. And just running it on graal was down to 66.

“Very memory constrained” would be a massive factor here, 1BRC is not really constrained (let alone very much so), it has 1 billion rows on a 32GB machine.

nottorp
0 replies
9h17m

Gigabytes? It was a while ago and I had megabytes for the whole OS :)

Anyway, it's just a fun memory now.

pjmlp
0 replies
10h59m

Depends on which Java implementation is used.

People keep forgetting Java is like C and C++, plenty of implementations to choose from, each with its own approach to JIT, AOT, GC and escape analysis.

michalsustr
10 replies
9h13m

    the fastest, heavily-optimised Java solution runs in just under a second on my machine

I don't understand how this is possible. The file in question is 13 GB, while the fastest commonly available SSDs top out around 12,400 MB/s. Am I missing something?

mainde
1 replies
9h3m

I think this bit in the baseline section applies to the Java one too:

    Note that that's a best-of-five measurement, so I'm allowing the file to be cached. Who knows whether Linux will allow all 13GB to be kept in disk cache, though presumably it does, because the first time it took closer to 6 seconds.

nsteel
0 replies
8h35m

Yeah, I assumed that. Which makes the parallel version improvements still interesting, but surely it's very artificial. You can't process all the data at the same time if you don't have it all yet.

dietr1ch
1 replies
9h6m

Benchmarking this gets tricky once you realize that the file might be entirely cached if the computer has enough RAM.

klyrs
0 replies
1h44m

Hah, we use different computers. I was thinking, this might be tricky if you aren't careful to keep the file entirely cached in RAM.

whyever
0 replies
9h7m

My guess: If you run the benchmark several times, the OS will cache the file in RAM.

maximus-decimus
0 replies
4h35m

If he runs the test multiple times, the file should all be cached in RAM after the first run.

gozzoo
0 replies
9h9m

most of the file remains in the OS disk cache after the first run

codegladiator
0 replies
8h34m

There are no disk file read times in the original rules; the file is in memfs.

chii
0 replies
9h6m

That might be sustained single-threaded read performance.

What if the method of access was concurrent from different parts of the file, and was operating system cached?

Keyframe
0 replies
1h9m

The actual 1BRC runs with data on a RAM disk; he could've/should've done the same.

avinassh
8 replies
13h10m

    I'm in the same ballpark as Alexander Yastrebov's Go version. His solution looks similar to mine: break the file into chunks, use a custom hash table (he even uses FNV hashing), and parse temperatures as integers. However, he uses memory-mapped files, which I'd ruled out for portability reasons – I'm guessing that's why his is a bit faster.

I am curious, can it be made even faster than this?
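
For reference, FNV (mentioned in the quoted passage) is small enough to write inline; one common variant, 32-bit FNV-1a, looks like this over a station name (the standard library's hash/fnv does the same, with a bit of interface overhead):

    // fnv1a computes the 32-bit FNV-1a hash of a station name; the two
    // constants are the standard FNV offset basis and prime.
    func fnv1a(name []byte) uint32 {
        const (
            offset32 = 2166136261
            prime32  = 16777619
        )
        h := uint32(offset32)
        for _, b := range name {
            h ^= uint32(b)
            h *= prime32
        }
        return h
    }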

junto
3 replies
11h16m

Wow, that's pretty damn fast. Has C# made some improvements in the past years, or do they have some other advantages?

pjmlp
0 replies
9h39m

His blog posts go back all the way to .NET 5, for those curious about deep dives on the performance improvements made in each release.

pjmlp
0 replies
10h56m

Yes to both.

.NET team has been doubling down on performance improvements, people forget CLR also has features to support C like languages (hence Managed C++ and C++/CLI), and many of those capabilities are now surfaced into C# as well.

scotty79
0 replies
9h1m

Instead of a hash table I'd try a sort of "eager" trie inside stack-allocated memory, so I can find the slot for a given station's stats after parsing the minimal number of characters that differentiate that station from the others.

pram
0 replies
12h39m

From looking at the final code, it's probably the performance of copy() that's the biggest hurdle.

krallja
0 replies
12h53m

There's a layer of pointer indirection when using slices in Go; you may be able to eke out some time by moving to arrays on the stack.

neonsunset
5 replies
12h16m

The effort and long time it took Go to get to something that is 3-6x slower than other, better languages should be an important reminder to everyone assuming it belongs to the same weight class as Rust, C# or Java.

donor20
1 replies
11h49m

But isn’t the Java version unrolling loops? This seems like some effort on the Java side.

dsff3f3f3f
0 replies
11h43m

The fast Java version is using all the same optimizations as this Go version and then some. It's significantly more complicated.

Mawr
1 replies
11h42m

That you put Rust among those languages says it all. Do some basic research.

neonsunset
0 replies
11h6m

Oh, and what does the basic research you speak of constitute? Surely you looked at the ASM emitted by compilers for these languages and the HPC-adjacent APIs each of them offers? No? Then let me tell you - Go is pretty much consigned to using its special flavour of non-portable, bespoke, hand-written ASM, which is the only way to access the SIMD instructions necessary to achieve optimal hardware utilization in this benchmark. This takes a lot of effort and skill, so, as you may have noticed, if you can't do it, Go simply cannot come close to the better options you can see on the benchmark chart.

And yet, this is something that can be trivially done in C#, C++ and Rust (albeit C# has the best UX, with the cross-platform SIMD API introduced in .NET 7, and C++ a close second with its own take on this being in preview). Java, OTOH, manages to be in the same category by having an extremely advanced JIT that gives it comparable codegen quality even though it lacks a comparable SIMD API for now (Panama vectors are problematic currently), so benchmark implementations using it are forced to do SWAR.

My main gripe is of course the extremely common misconception about Go's speed, which it just does not have the moment you write anything sufficiently advanced or want to express a particular problem in a terser way than writing thousands of open-coded loops.

kitd
0 replies
9h53m

If you read the article, you'll see he doesn't attempt the optimizations that helped those other languages get 3-6x faster. Your snark is wasted.

michae2
5 replies
10h56m

For anyone looking for more examples of 1BRC in Go, we had a friendly competition at work and collected the results here: https://github.com/dhartunian/1brcgo/

In addition to the loop-unrolling and bit-twiddling tricks that also show up in the fastest Java and C++ versions, some Go-specific things I learned were:

- unsafe.Pointer can be used to read memory without bounds checks

- many functions in the bytes and bits packages in the standard library are written in assembly

- debug.SetGCPercent and SetMemoryLimit to turn off GC

- runtime.LockOSThread to lock a goroutine to a thread

- print is slightly faster than fmt.Printf (but writes to stderr)
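
A sketch of how two of those tricks look in practice (assumed usage, not code from the linked repo): turning the GC off up front, and reading a byte without the compiler's bounds check via unsafe.

    package main

    import (
        "fmt"
        "math"
        "runtime/debug"
        "unsafe"
    )

    func init() {
        debug.SetGCPercent(-1)              // disable the GC entirely
        debug.SetMemoryLimit(math.MaxInt64) // effectively no soft memory limit
    }

    // loadByte reads buf[i] with no bounds check; correctness is entirely on
    // the caller, which is both the point and the danger of unsafe here.
    func loadByte(buf []byte, i int) byte {
        return *(*byte)(unsafe.Add(unsafe.Pointer(unsafe.SliceData(buf)), i))
    }

    func main() {
        fmt.Println(loadByte([]byte("Hamburg;12.0\n"), 8)) // 49, i.e. '1'
    }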

benhoyt
4 replies
10h30m

Oh, I'd missed those solutions, thanks. You guys got way more hard core than I did -- nice work! Looking forward to reading the code for those solutions this week.

Update: for reference, Jason Chu's solution (https://github.com/dhartunian/1brcgo/blob/494eabd6ea958cc193...) seems to be the fastest on my machine, and runs in about 1.3s!

markoman
2 replies
10h19m

Could you say why you find using memory-mapped files to be a portability issue? Thanks.

benhoyt
0 replies
10h13m

Well, I guess it's more that the standard library doesn't have a cross-platform way to access them, not that memory-mapped files themselves can't be done on (say) Windows. It looks like there's a fairly popular 3rd party package that supports at least Linux, macOS, and Windows: https://github.com/edsrzf/mmap-go
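
For concreteness, the non-portable stdlib route looks roughly like this on Linux/macOS (a sketch; the third-party package above wraps the per-OS equivalents of calls like these):

    package main

    import (
        "log"
        "os"
        "syscall"
    )

    func main() {
        f, err := os.Open("measurements.txt")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        fi, err := f.Stat()
        if err != nil {
            log.Fatal(err)
        }

        // Map the whole file read-only; data is a []byte over its contents.
        data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
            syscall.PROT_READ, syscall.MAP_SHARED)
        if err != nil {
            log.Fatal(err)
        }
        defer syscall.Munmap(data)

        log.Printf("mapped %d bytes", len(data))
    }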

WesolyKubeczek
0 replies
8h45m

Not all filesystems support memory-mapped files equally, and for some that do, the support comes with caveats and could be slower than non-memory-mapped access.

michae2
0 replies
10h23m

I think we all ended up using unsafe, though there were some solutions without mmap. It would have been interesting if we had adhered to the same constraints you did!

afiodorov
4 replies
10h31m

My first instinct would be to spin up a local Postgres and keep station data there. A lot of the solutions assume we have enough RAM to keep the stats per station; however, that's a bad assumption when dealing with a lot of data.

masklinn
1 replies
9h48m

This is not a general solution, it’s a specific problem with a specific data shape, and processes specifically 1 billion rows on a 32GB machine.

pletnes
0 replies
6h52m

And this is why file io is not the bottleneck - all the data is in RAM/disk cache.

Alifatisk
1 replies
5h5m

    A lot of the solutions assume we have enough ram to keep the stats per station

I'll give this example in Ruby but you'll get the point: what you are mentioning is an issue if you choose to use File.read(), because it opens the file and reads its entire contents into RAM. But this can be avoided by using File.foreach(), which streams the rows one at a time and uses much less RAM.

afiodorov
0 replies
3h54m

What I am saying is that the solutions presented assume we have N stations which are way less than 1_000_000_000.

Worst case scenario for RAM would be if each line contained a unique station. In this case we'd have to allocate 1_000_000_000 * sizeof(stats) in RAM to contain the result of the computation.

So most of the solutions assume sufficient RAM ahead of reading the file.

In the first solution:

    type stats struct {
        min, max, sum float64
        count         int64
    }

would take up 32GB for the stats alone for all 1E9 stations, and that's ignoring space needed for each station's name!
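
A quick sanity check of that arithmetic (a throwaway snippet, using the same layout as the struct quoted above):

    package main

    import (
        "fmt"
        "unsafe"
    )

    type stats struct {
        min, max, sum float64
        count         int64
    }

    func main() {
        const stations = 1_000_000_000
        per := unsafe.Sizeof(stats{}) // 32 bytes per entry
        fmt.Println(per, "bytes per entry")
        fmt.Println(uint64(per)*stations/1_000_000_000, "GB for 1e9 stations") // 32
    }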

m3kw9
3 replies
12h33m

I’d just pay my way to 4s by upgrading hw

JohnBooty
1 replies
11h2m

You can't just throw hardware at this one to get to 4s. At least not in 2024.

The author's naive single-threaded Go solution took 1m45s on an "amd64 laptop with fast SSD drive and 32GB of RAM."

So, you'd need something 25x faster than his setup in terms of single-threaded performance. Let us know when you've upgraded your hardware to the equivalent of a 75ghz AMD processor with memory and SSD bandwidth to match!

lmeyerov
0 replies
10h31m

The nice thing about a GPU soln (ex: python dataframes in cudf, just a few loc) is these generally come down to your IO bandwidth, like a single 2GB/s SSD to a 16-32 GB/s PCIe to 1-2 GPUs running crazy fast. And then buy more cheap SSDs to chain together before buying more/better GPUs :)

jiggawatts
0 replies
11h2m

... how?

There aren't any generally-available CPUs that are substantially faster today than were available ten years ago. Maybe double the speed per core, triple at best.

After that, throwing more cores at it also rapidly runs out of steam because parallel code has its own overheads. Any shared state instantly kills performance, no matter the language. Very clever tricks have to be used to get decent scaling past 64 hardware threads (32 cores), and going past 256 is surprisingly difficult. You start having to worry about NUMA, IRQ steering, and core pinning. Bandwidth gets to be an issue, even to L3 and L4 cache, let alone out to main memory.

This notion that you can just "dial up" hardware performance to infinity as a fix for any amount of developer laziness needs to die.

bbkane
3 replies
12h49m

I found this super interesting - especially as all the data I've written code to manipulate has been small enough that I haven't needed to optimize my code, so I've never had to think in this direction.

I think my favorite part was the very first section, where he got baseline measurements with `cat`, `wc`, and friends. I wouldn't have thought to do that, and it's such an easy way to get perspective on what's "reasonable".

timetopay
0 replies
9h6m

A few months ago, I had to quickly bang out a script to output about 20 million lines of text, each the output of a hash function. My naive solution took more than a few minutes - simple optimizations such as batching writes every 10k lines cut the time significantly. Threading would have helped quite a bit as well.
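
The same idea sketched in Go (a hypothetical equivalent, not the original script): buffer the output so it's flushed in large batches rather than one syscall per line.

    package main

    import (
        "bufio"
        "crypto/sha256"
        "fmt"
        "os"
        "strconv"
    )

    func main() {
        // One large buffer instead of a write per line.
        w := bufio.NewWriterSize(os.Stdout, 1<<20)
        defer w.Flush()
        for i := 0; i < 20_000_000; i++ {
            sum := sha256.Sum256([]byte(strconv.Itoa(i)))
            fmt.Fprintf(w, "%x\n", sum)
        }
    }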

latchkey
0 replies
12h16m

I hate to "me too", but you also nailed that analysis.

jasonwatkinspdx
0 replies
1h39m

It also underscores just how insane raw disk bandwidth is on modern ssds. Most software is not designed around a world where you have gigabytes a second of sequential scan on a laptop.

satvikpendem
2 replies
9h48m

Nice. I wonder how Rust would fare, given that it has no garbage collector.

tjpnz
0 replies
9h9m

You can disable GC in Go, but I don't think it will improve on 4s.

nicois
2 replies
12h33m

It would be interesting to know how effective Profile Guided Optimisation is here.

neonsunset
0 replies
11h13m

It is only mildly effective because of how anemic the Go compiler is. And even then it's extremely limited. If you want to see actually good implementations, look into what the OpenJDK HotSpot and .NET JIT compilers do with runtime profiling and recompilation (.NET calls it Dynamic PGO).

benhoyt
0 replies
11h8m

Unfortunately it doesn't seem to help at all, I think mainly because (at present) Go's PGO basically inlines hot functions, and the important code here is all in one big function.

camgunz
2 replies
9h40m

I love the nerdery around 1BRC. My axe to grind is that unless you do dangerous stuff DBs are just as fast, less complicated, and more resilient to data updates than application code [0]. Do more in the database!

0: https://geraldonit.com/2024/01/31/1-billion-row-challenge-in...

riku_iki
0 replies
7m

Actually, that guy can now be sued; per the TOS it is illegal to publish benchmarks for Oracle DB.

giantrobot
0 replies
3m

I agree with doing more in the database, you're closer to the data (disk/disk cache/L2 cache) than the application code is. At the same time I get really nervous around doing work in the database because you have to be really disciplined that the in-database code (functions, views, etc) match the code in source control. Also that your testing/QA database contains all the same code and enough test data to actually exercise the performance bounds of that code.

With application code I can easily step through it in a debugger and verify the deployed code matches what's in the repo. It's more difficult to do in the database because it requires more organizational discipline.

worldwidelies
1 replies
8h24m

I’d like to see a 1 trillion row challenge.

tonymet
1 replies
12h48m

I saw the custom hash table, but why was the built-in map slow?

sethammons
0 replies
7h58m

I believe it is because he did not preallocate the map, so as it grew, it reallocated multiple times. Just a guess.
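
For reference, preallocating just means passing a capacity hint to make; a tiny sketch (the 10,000 figure is an assumed upper bound on distinct stations, not from the post):

    type stats struct {
        min, max, sum float64
        count         int64
    }

    // Sized up front so the map never rehashes while growing.
    var statsByName = make(map[string]*stats, 10_000)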

hoten
1 replies
8h26m

    I find this report confusing. Why does if items[hashIndex].key == nil show as taking 5.01s, but the call to bytes.Equal shows as only 390ms. Surely a slice lookup is much cheaper than a function call? If you are a Go performance expert and can help me interpret it, I'm all ears!

These two lines are both conditionals, so the time reported is sensitive to branch mispredictions. If the timings are not intuitive based on the complexity of the associated lines, then it may be explained by the data being not very predictable and the branch predictor having a bad time.

benhoyt
0 replies
8h7m

Yeah, someone emailed me privately after they'd dug into this. They mentioned that "items[hashIndex]" was a significant source of cache misses. They used "perf record -e cache-misses" and found it was the largest source of cache misses. They also found (by digging into the assembly) that the "bytes.Equal" line has some sort of source-line attribution issue.

fizx
1 replies
10h51m

It's worth noting that if you're messing around with large text files from the CLI, awk, grep, etc. will be an order of magnitude faster if you opt out of Unicode parsing.

I'm pretty confident adding LC_ALL=C to the awk solution would get it easily under a minute.

benhoyt
0 replies
7h49m

It's a good point, but I believe because of the "-b" option (characters as bytes), using LC_ALL=C doesn't make a difference.

Alifatisk
0 replies
5h8m

How come Node performed this well? Any specific reason?

thangalin
0 replies
11h26m

Back in 2010, I used PostgreSQL for a web app that queried 270 million rows of climate data from Environment Canada:

https://www.youtube.com/watch?v=10KEr3sEG80

I wanted to see how the temperature was changing over time for specific regions using a map-based interface. The following chart was particularly eye-opening:

https://www.youtube.com/watch?v=iEtvf9xzRB4&t=164s

The software won a couple of awards and was heavily optimized to produce reports in under a minute. Kudos to the author for getting a parse time of a billion records down to mere seconds.

sireat
0 replies
9h27m

Those super optimized solutions are fun to read about.

However, in real life I would never assume that a million rows of text all have valid data in a specific format, much less a billion rows.

Thus a slower but more robust solution would be more realistic.

fullstackchris
0 replies
8h57m

Performance is great, but I would imagine the parallelized version requires a significantly higher minimum amount of RAM than the non-parallelized ones... He claims that each more performant solution costs less compute than the one before it, but in the case of parallelization it's just the same amount of compute in a shorter amount of time, right?

WesolyKubeczek
0 replies
8h53m

I love the author’s step-by-step approach as very often it so happens that a hyper-optimized solution may be overfitted to the exact dataset it’s operating on. In each step, the tradeoffs are being explained: what we gain, but also what we lose by stripping the functionality away.

Alifatisk
0 replies
5h3m

Is there a collection of all the languages that have attempted this challenge? I know comparing and pitting languages against each other is somewhat useless, but it is still interesting to me.

1vuio0pswjnm7
0 replies
10h50m

How about kdb+ or shakti