Nice post. Interesting that the fastest Java beats the fastest Go, though they are close:
AY fastest Go version 2.90s 36.2
TW fastest Java version 0.953s 110
I would have expected Go to win. That JVM works pretty good...
I was curious how long it would take with Polars (for scale), apparently 33s: https://github.com/Butch78/1BillionRowChallenge/tree/main
I’m kind of interested in the opposite problem: what is the simplest solution using a well-known library/DB that approaches the fastest hand-optimized solution to this problem?
That’s the question worth asking imo. I was wondering how fast the idiomatic Java solution is.
Is that the same hardware? Otherwise it doesn’t say much.
Java is often slightly faster than Go; it has similar constructs (its Map is older and perhaps better optimized) and perhaps a better GC (also older and more optimized), though I don't think the GC is the challenge here; it has slower startup times. So I'd say roughly the same as the idiomatic Go version?
Java’s GC is just incomparably better; essentially all the research on the topic is written for Java. Though it’s true that without value types, Java does rely somewhat more on it.
On my machine the simple/idiomatic "baseline" Java solution (https://github.com/gunnarmorling/1brc/blob/main/src/main/jav...) takes 2m0s, compared to 1m45s for my simple Go version.
Where’s the source data? I’d like to attempt ingesting this and processing it with DuckDB.
Instructions on how to create it can be found here:
https://github.com/gunnarmorling/1brc?tab=readme-ov-file#run...
The Python version:
https://github.com/gunnarmorling/1brc/blob/main/src/main/pyt...
In the original 1BRC, it's a python script that generates the data.
Here’s a thread on results with DuckDB; I don’t mean to discourage you from taking a shot at it at all, though: https://github.com/gunnarmorling/1brc/discussions/39
Sounds very reasonable. In the blog post, about 20s were shaved off by assuming we don't need complicated string parsing. An off-the-shelf library can't make that assumption, so it will always have to pay the extra cost.
True in general, but some (especially libs aimed at larger datasets processed in batch) are taking advantage of benchmarks like this to do things like:
- Try the fast way; if it works, great.
- If that fails, try the slow way.
This makes the slow path 2x slower at worst (and you can advise always using the slow way via optional params), but the fast path can be 10x faster.
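To make that concrete, here's a rough Go sketch of the fast-path/slow-path split for the 1BRC temperature format; the function names are mine, not from the article or any particular library:

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// parseTempFast assumes a well-formed reading like "-12.3" with exactly one
// digit after the dot and returns tenths of a degree; ok is false when the
// assumption doesn't hold.
func parseTempFast(b []byte) (tenths int64, ok bool) {
	neg := false
	if len(b) > 0 && b[0] == '-' {
		neg, b = true, b[1:]
	}
	// Expect "D.D" or "DD.D": at least 3 bytes with a '.' before the last digit.
	if len(b) < 3 || b[len(b)-2] != '.' {
		return 0, false
	}
	var n int64
	for i := 0; i < len(b); i++ {
		if i == len(b)-2 {
			continue // skip the '.'
		}
		c := b[i]
		if c < '0' || c > '9' {
			return 0, false
		}
		n = n*10 + int64(c-'0')
	}
	if neg {
		n = -n
	}
	return n, true
}

// parseTemp tries the fast path first and only pays for full float parsing
// when the input doesn't match the expected shape.
func parseTemp(b []byte) (int64, error) {
	if t, ok := parseTempFast(b); ok {
		return t, nil
	}
	f, err := strconv.ParseFloat(string(b), 64)
	return int64(math.Round(f * 10)), err
}

func main() {
	for _, s := range []string{"12.3", "-0.7", "7", "nonsense"} {
		t, err := parseTemp([]byte(s))
		fmt.Println(s, "->", t, err)
	}
}
```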
I’m surprised at the poor performance of python here. For reference there are several very brief R examples which are just 2-3 seconds. Eg http://blog.schochastics.net/posts/2024-01-08_one-billion-ro...
Second article I'm reading on implementing this in Go. Since the temperatures are in the range [-99.9, 99.9] with a tenth of precision (~2k values), I am surprised that no one has implemented parsing of the numbers using a prepopulated lookup table. It should probably speed things up.
I submitted a github issue on this for the other implementation I looked at here[1].
He already uses custom parsing.
How do you search in the lookup table? If you are thinking of a hash map it will be slower than the few operations of his custom parser.
If you're clever about how you initially take apart the input you can just do a single memory load at a computed offset.
I don't see how you can make computing the offset faster than just parsing the number.
You could have much simpler parsing than ordinary parsing if you know/assume that you definitely have a valid number from -99.9 to 99.9.
For example, you could find whether it starts with a '-' and where the delimiter is, to know the length of the number's string representation (that's the "simpler parsing" part). Then don't do any work at all to decode the number: simply use the 1-5 bytes of that string (without sign and separator) directly as an index into a very large, very sparse memory region in which all the valid values are pre-populated with the proper result.
You'd need to allocate 4 or 8 terabytes of virtual memory address space for that lookup table, but you'll touch only a tiny fraction of the memory pages, so it doesn't require an unacceptable amount of physical memory.
You don't need to cover all bits of the values, just the 10 numeric values that can pass a bounds check. That reduces the space to only 10K elements. With some bit shifting (and pre-validation) that should easily reduce the search space.
Creating a LUT index with bit-shifting is essentially the same as parsing into an integer. Even if the LUT fits in L1 cache, I doubt it would be faster. If it doesn't fit, it's certainly slower.
If that is faster would seem to depend on if you can get most lookups from the L2 cache. Otherwise you're waiting for main memory, which is a few hundred cycles. Even with multiple loads in parallel, it would be hard to beat arithmetic.
Take the bytes of the number, «-9.7», and interpret as an integer? That’s a 4-byte int -> array index. (Haven’t tried but…)
Wouldn't you need a fixed length of the temps?
00.1
-10.3
or 0.1 (with an ending space)
so you can look up 5 bytes in the map? (+/-, two digits, dot, one digit)
You can create a perfect hash based on the presence of at least 4 characters. The perfect hash is precalculated based on the possible inputs (-99.9 to 99.9 in bytes). The hash is the usual hash = byte*seed + hash. The seed is chosen so that there is no clash (you can find a static seed with a single brute force from 1 to 1M in < 1 min).
I thought the lookup would just be a byte tree, not a hash. Wouldn't a hash with its calculation defeat the purpose of being faster than parsing a number?
The idea would be, you have a tree of all values of 0.0 to 99.9 and then just use the bytes to iterate the tree (e.g. in an array) to come up with the int value of e.g. 45.5
Parsing a number involves an (addition + multiplication) * (number of digits) for each row. If you use a precalculated perfect hash, the multiplication per row can be avoided (you need to read each byte anyway).
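A hedged sketch of that brute-force seed search in Go, assuming a generously sized table (the hard part in practice is shrinking the table while staying collision-free); the table size and seed range are arbitrary choices here:

```go
package main

import "fmt"

// hashWithSeed folds each byte in with one multiply and one add, the shape
// of hash the comment above describes.
func hashWithSeed(b []byte, seed uint64) uint64 {
	var h uint64
	for _, c := range b {
		h = h*seed + uint64(c)
	}
	return h
}

func abs(x int) int {
	if x < 0 {
		return -x
	}
	return x
}

func main() {
	// Every possible reading from -99.9 to 99.9 in steps of 0.1, as text.
	var inputs [][]byte
	for t := -999; t <= 999; t++ {
		sign := ""
		if t < 0 {
			sign = "-"
		}
		a := abs(t)
		inputs = append(inputs, []byte(fmt.Sprintf("%s%d.%d", sign, a/10, a%10)))
	}

	// A generous table makes a collision-free seed easy to find; a tighter
	// table means a much longer search (or no suitable seed at all).
	const tableBits = 22
	mask := uint64(1)<<tableBits - 1

seedLoop:
	for seed := uint64(1); seed < 1_000_000; seed++ {
		seen := make(map[uint64]bool, len(inputs))
		for _, in := range inputs {
			slot := hashWithSeed(in, seed) & mask
			if seen[slot] {
				continue seedLoop // collision: try the next seed
			}
			seen[slot] = true
		}
		fmt.Println("collision-free seed:", seed)
		return
	}
	fmt.Println("no collision-free seed found in range")
}
```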
I did it with custom parsing[0] and treated the numbers as 16-bit integers; the representation in the file is not a constant number of bytes, which complicates the table approach. If you end up computing a hash I think it might be slower than just doing the equivalent parsing I do, and a four-byte constant table will be very large and mostly empty. Maybe a trie would be good.
0: https://github.com/k0nserv/brc/blob/main/src/main.rs#L279
I have a feeling that a naive implementation in Java would be a lot worse than a naive implementation in Go so optimizing matters more there.
Had to parse csvs in Java on a very memory constrained system once... we ended up cutting out a feature because it wasn't worth it.
Golang is actually not as efficient as Java in quite a few benchmarks.
~~Using LLVM isn’t a magic solution to perform better than something relying on the JVM.~~
Here is a source: https://sites.google.com/view/energy-efficiency-languages
Huh? Go doesn't use LLVM though, where did you get the idea that it does? That's part of why its compile times are so fast.
You are right, it doesn’t use LLVM. No idea where I got the idea, I was confidently wrong.
Regarding Java, it probably could be done with arrays and object reuse (arenas). But it's slightly less ergonomic. And the ecosystem isn't designed for it, so you'd have to implement your own memory-efficient parser.
Yep, but it wasn't a critical feature and we were in a rush, so the feature was killed instead.
Depends on which Java implementation is used.
... if you have a choice. It was a port of AOSP, so we didn't. In any case it wasn't the jvm's fault, the device just had very little ram.
Sounds like a skill issue
Depends what you call “naive”, but the “idiomatic Java solution” from last week’s post (https://questdb.io/blog/billion-row-challenge-step-by-step/) clocked in at 71 seconds, or 1:11. And just running it on graal was down to 66.
“Very memory constrained” would be a massive factor here; 1BRC is not really constrained (let alone very much so), it's 1 billion rows on a 32GB machine.
Gigabytes? It was a while ago and I had megabytes for the whole OS :)
Anyway, it's just a fun memory now.
Depends on which Java implementation is used.
People keep forgetting Java is like C and C++, plenty of implementations to choose from, each with its own approach to JIT, AOT, GC and escape analysis.
For what it's worth, on my machine the simple/idiomatic "baseline" Java solution (https://github.com/gunnarmorling/1brc/blob/main/src/main/jav...) takes 2m0s, compared to 1m45s for my simple Go version. So Go is a bit better for the naive version.
the fastest, heavily-optimised Java solution runs in just under a second on my machine
I don’t understand how this is possible. The file in question is 13GB, while the fastest commonly available SSDs do 12400 MB/s. Am I missing something?
I think this bit in the baseline section applies to the Java one too
"Note that that’s a best-of-five measurement, so I’m allowing the file to be cached. Who knows whether Linux will allow all 13GB to be kept in disk cache, though presumably it does, because the first time it took closer to 6 seconds."
Yea, I assumed that. Which makes the parallel version improvements still interesting, but surely it's very artificial. You can't process all the data at the same time if you don't have it all yet.
Benchmarking this gets tricky once you realize that the file might be entirely cached if the computer has enough RAM.
Hah, we use different computers. I was thinking, this might be tricky if you aren't careful to keep the file entirely cached in RAM.
My guess: If you run the benchmark several times, the OS will cache the file in RAM.
If he runs the test multiple times, the file should all be cached in RAM after the first run.
most of the file remains in the OS disk cache after the first run
There are no disk file read times in the original rules; the file is in memfs.
That might be sustained single-threaded read performance.
What if the method of access was concurrent from different parts of the file, and was operating system cached?
The actual 1BRC runs with data on a RAM disk; he could've/should've done the same.
I’m in the same ballpark as Alexander Yastrebov’s Go version. His solution looks similar to mine: break the file into chunks, use a custom hash table (he even uses FNV hashing), and parse temperatures as integers. However, he uses memory-mapped files, which I’d ruled out for portability reasons – I’m guessing that’s why his is a bit faster.
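For reference, FNV-1a (the hash family mentioned here, also available in the standard library as hash/fnv) is only a few lines when hand-rolled; the constants below are the standard 64-bit FNV-1a parameters:

```go
package main

import "fmt"

// Standard 64-bit FNV-1a parameters.
const (
	fnvOffset64 uint64 = 14695981039346656037
	fnvPrime64  uint64 = 1099511628211
)

// fnv1a hashes a byte slice: XOR the byte in, then multiply by the prime.
func fnv1a(b []byte) uint64 {
	h := fnvOffset64
	for _, c := range b {
		h ^= uint64(c)
		h *= fnvPrime64
	}
	return h
}

func main() {
	fmt.Printf("%#x\n", fnv1a([]byte("Hamburg")))
}
```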
I am curious, can it be made even faster than this?
Dunno about Go, but most c# solutions are around 2s and under https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-...
Wow, that’s pretty damn fast. Has C# made some improvements in the past years, or does it have some other advantages?
Stephen Toub wrote a book-length blog post about all the performance improvements made in .NET 8 [1]. Add to that the option to compile to native.
[1] https://devblogs.microsoft.com/dotnet/performance-improvemen...
His blog posts go back all the way to .NET 5, for those curious about a deep dive into the performance improvements made in each release.
Yes to both.
.NET team has been doubling down on performance improvements; people forget the CLR also has features to support C-like languages (hence Managed C++ and C++/CLI), and many of those capabilities are now surfaced in C# as well.
Instead of a hash table I'd try a sort of "eager" trie inside stack-allocated memory, so I can find the slot for a given station's stats after parsing only the minimal number of characters that differentiate that station from the others.
From looking at the final code, the performance of copy() is probably the biggest hurdle.
There's a layer of pointer indirection when using slices in Go, you may be able to eke out some time by moving to arrays on the stack.
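A toy illustration of that indirection, with hypothetical field names; whether it actually wins anything here would need profiling:

```go
// A slice header is a pointer+len+cap that points at a separate backing
// array (an extra pointer chase); a fixed-size array lives inline in the
// struct or on the stack. Field names here are illustrative only.
package main

import "fmt"

type statsWithSlice struct {
	key []byte // slice header -> separate backing array
}

type statsWithArray struct {
	key    [32]byte // stored inline, no pointer chase
	keyLen uint8
}

func main() {
	var a statsWithArray
	n := copy(a.key[:], "Hamburg")
	a.keyLen = uint8(n)
	fmt.Println(string(a.key[:a.keyLen]))

	s := statsWithSlice{key: []byte("Hamburg")}
	fmt.Println(string(s.key))
}
```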
The effort and long time it took Go to get to something that is 3-6x slower than other, better languages should be an important reminder to everyone assuming it belongs in the same weight class as Rust, C# or Java.
But isn’t the Java version unrolling loops? This seems like some effort on the Java side.
The fast Java version is using all the same optimizations as this Go version and then some. It's significantly more complicated.
That you put Rust among those languages says it all. Do some basic research.
Oh, and what does the basic research you speak of consist of? Surely you looked at the ASM emitted by the compilers for these languages and the HPC-adjacent APIs each of them offers? No? Then let me tell you: Go is pretty much consigned to using its special flavour of non-portable, bespoke hand-written ASM, which is the only way to access the SIMD instructions necessary to achieve optimal hardware utilization in this benchmark. This takes a lot of effort and skill, so, as you may have noticed, if you can't do it, Go simply cannot come close to the better options you see on the benchmark chart.
And yet, this is something that can be trivially done in C#, C++ and Rust (albeit C# has the best UX with the cross-platform SIMD API introduced in .NET 7, with C++ a close second with its own take on this being in preview). Java OTOH manages to be in the same category by having an extremely advanced JIT that allows it to have comparable codegen quality even though it lacks a comparable SIMD API for now (Panama vectors are problematic currently), so benchmark implementations using it are forced to do SWAR.
My main gripe is of course an extremely common misconception about Go's speed which it just does not have the moment you write anything sufficiently advanced or want to express a particular problem in a terser way than writing thousands of open coded loops.
If you read the article, you'll see he doesn't attempt the optimizations that helped those other languages get to 3-6x faster. Your snark is wasted.
For anyone looking for more examples of 1BRC in Go, we had a friendly competition at work and collected the results here: https://github.com/dhartunian/1brcgo/
In addition to the loop-unrolling and bit-twiddling tricks that also show up in the fastest Java and C++ versions, some Go-specific things I learned were (a small sketch of a couple of these follows the list):
- unsafe.Pointer can be used to read memory without bounds checks
- many functions in the bytes and bits packages in the standard library are written in assembly
- debug.SetGCPercent and SetMemoryLimit to turn off GC
- runtime.LockOSThread to lock a goroutine to a thread
- print is slightly faster than fmt.Printf (but writes to stderr)
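A minimal sketch of a couple of the knobs above, all from the standard library; whether they help depends entirely on the workload:

```go
package main

import (
	"fmt"
	"math"
	"runtime"
	"runtime/debug"
)

func main() {
	debug.SetGCPercent(-1)              // disable automatic GC cycles
	debug.SetMemoryLimit(math.MaxInt64) // effectively no soft memory limit
	runtime.LockOSThread()              // keep this goroutine on one OS thread
	defer runtime.UnlockOSThread()

	fmt.Println("GC disabled for the duration of the run")
}
```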
Oh, I'd missed those solutions, thanks. You guys got way more hard core than I did -- nice work! Looking forward to reading the code for those solutions this week.
Update: for reference, Jason Chu's solution (https://github.com/dhartunian/1brcgo/blob/494eabd6ea958cc193...) seems to be the fastest on my machine, and runs in about 1.3s!
Could you say why you find using memory-mapped files to be a portability issue? Thanks.
Well, I guess it's more that the standard library doesn't have a cross-platform way to access them, not that memory-mapped files themselves can't be done on (say) Windows. It looks like there's a fairly popular 3rd party package that supports at least Linux, macOS, and Windows: https://github.com/edsrzf/mmap-go
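Rough sketch of mapping the measurements file with that third-party package (github.com/edsrzf/mmap-go); I believe its API looks roughly like this, but treat the exact call signatures and the file name as assumptions:

```go
package main

import (
	"fmt"
	"os"

	mmap "github.com/edsrzf/mmap-go"
)

func main() {
	f, err := os.Open("measurements.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// data is a []byte backed directly by the mapped file.
	data, err := mmap.Map(f, mmap.RDONLY, 0)
	if err != nil {
		panic(err)
	}
	defer data.Unmap()

	fmt.Println("mapped", len(data), "bytes")
}
```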
Not all filesystems support memory-mapped files equally, and for some that do, the support comes with caveats and could be slower than non-memory-mapped access.
I think we all ended up using unsafe, though there were some solutions without mmap. It would have been interesting if we had adhered to the same constraints you did!
My first instinct would be to spin up a local Postgres and keep station data there. A lot of the solutions assume we have enough RAM to keep the stats per station; however, that's a bad assumption when dealing with a lot of data.
This is not a general solution, it’s a specific problem with a specific data shape, and processes specifically 1 billion rows on a 32GB machine.
And this is why file io is not the bottleneck - all the data is in RAM/disk cache.
"A lot of the solutions assume we have enough RAM to keep the stats per station"
I’ll give this example in Ruby, but you’ll get the point: what you are mentioning is an issue if you choose to use File.read(), because it opens the file and reads its whole content into RAM. But this can be avoided if you use File.foreach(), because it streams each line instead, which uses much less RAM.
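The Go equivalent of that streaming approach, as a minimal sketch (file name assumed): read a line at a time with bufio.Scanner instead of slurping the whole file into memory with os.ReadFile.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	f, err := os.Open("measurements.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 1024*1024), 1024*1024) // room for long lines
	lines := 0
	for scanner.Scan() {
		_ = scanner.Bytes() // one line at a time, without the trailing newline
		lines++
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
	fmt.Println("lines:", lines)
}
```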
What I am saying is that the solutions presented assume we have N stations, where N is far less than 1_000_000_000.
Worst case scenario for RAM would be if each line contained a unique station. In this case we'd have to allocate 1_000_000_000 * sizeof(stats) in RAM to contain the result of the computation.
So most of the solutions assume sufficient RAM ahead of reading the file.
In the first solution:

    type stats struct {
        min, max, sum float64
        count         int64
    }

would take up 32GB for the stats alone across all 1e9 stations (32 bytes per entry), and that's ignoring the space needed for each station's name!
I’d just pay my way to 4s by upgrading hw
You can't just throw hardware at this one to get to 4s. At least not in 2024.
The author's naive single-threaded Go solution took 1m45s on an "amd64 laptop with fast SSD drive and 32GB of RAM."
So, you'd need something 25x faster than his setup in terms of single-threaded performance. Let us know when you've upgraded your hardware to the equivalent of a 75GHz AMD processor with memory and SSD bandwidth to match!
The nice thing about a GPU soln (ex: python dataframes in cudf, just a few loc) is these generally come down to your IO bandwidth, like a single 2GB/s SSD to a 16-32 GB/s PCIe to 1-2 GPUs running crazy fast. And then buy more cheap SSDs to chain together before buying more/better GPUs :)
... how?
There aren't any generally-available CPUs that are substantially faster today than were available ten years ago. Maybe double the speed per core, triple at best.
After that, throwing more cores at it also rapidly runs out of steam because parallel code has its own overheads. Any shared state instantly kills performance, no matter the language. Very clever tricks have to be used to get decent scaling past 64 hardware threads (32 cores), and going past 256 is surprisingly difficult. You start having to worry about NUMA, IRQ steering, and core pinning. Bandwidth gets to be an issue, even to L3 and L4 cache, let alone out to main memory.
This notion that you can just "dial up" hardware performance to infinity as a fix for any amount of developer laziness needs to die.
I found this super interesting - especially as all the data I've written code to manipulate has been small enough that I haven't needed to optimize my code, so I've never had to think in this direction.
I think my favorite part was the very first section, where he got baseline measurements with `cat`, `wc`, and friends. I wouldn't have thought to do that and it's such an easy way to get a perspective on what's "reasonable".
A few months ago, I had to quickly bang out a script to output about 20 million lines of text, each the output of a hash function. My naive solution took more than a few minutes - simple optimizations such as writing every 10k lines cut the time significantly. Threading would have helped quite a bit as well.
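The same buffering idea expressed in Go terms, as a rough sketch (the hash and record count are placeholders, not the commenter's actual script): wrap the output in a bufio.Writer so each line lands in a buffer instead of triggering its own write syscall.

```go
package main

import (
	"bufio"
	"crypto/sha256"
	"fmt"
	"os"
)

func main() {
	w := bufio.NewWriterSize(os.Stdout, 1<<20) // 1MB output buffer
	defer w.Flush()

	// Emit 20 million hashed lines through the buffer.
	for i := 0; i < 20_000_000; i++ {
		sum := sha256.Sum256([]byte(fmt.Sprintf("record-%d", i)))
		fmt.Fprintf(w, "%x\n", sum)
	}
}
```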
I hate to "me too", but you also nailed that analysis.
It also underscores just how insane raw disk bandwidth is on modern ssds. Most software is not designed around a world where you have gigabytes a second of sequential scan on a laptop.
Nice. I wonder how Rust would fare, given that it has no garbage collector.
You can disable GC for Golang, but I don't think it will improve on 4s.
There are a few Rust solutions in the "Show and Tell" linked above, for example this fairly readable one at 15.5s: https://github.com/gunnarmorling/1brc/discussions/57 A comment above referencing Python "polars" actually has Rust polars, std, and SIMD solutions as well (SIMD was fastest, but less readable for a hobbyist like me).
EDIT: https://github.com/Butch78/1BillionRowChallenge/tree/main
It would be interesting to know how effective Profile Guided Optimisation is here.
It is only mildly effective because of how anemic the Go compiler is, and even then it's extremely limited. If you want to see actually good implementations, look into what the OpenJDK HotSpot and .NET JIT compilers do with runtime profiling and recompilation (.NET calls it Dynamic PGO).
Unfortunately it doesn't seem to help at all, I think mainly because (at present) Go's PGO basically inlines hot functions, and the important code here is all in one big function.
I love the nerdery around 1BRC. My axe to grind is that unless you do dangerous stuff, DBs are just as fast, less complicated, and more resilient to data updates than application code [0]. Do more in the database!
0: https://geraldonit.com/2024/01/31/1-billion-row-challenge-in...
Actually, that guy can now be sued; per the TOS it is illegal to publish benchmarks for Oracle DB.
I agree with doing more in the database, you're closer to the data (disk/disk cache/L2 cache) than the application code is. At the same time I get really nervous around doing work in the database because you have to be really disciplined that the in-database code (functions, views, etc) match the code in source control. Also that your testing/QA database contains all the same code and enough test data to actually exercise the performance bounds of that code.
With application code I can easily step through it in a debugger and verify the deployed code matches what's in the repo. It's more difficult to do in the database because it requires more organizational discipline.
I’d like to see a 1 trillion row challenge.
One exists https://blog.coiled.io/blog/1trc.html
I saw the custom hash table, but why was the built-in map slow?
I believe it is because he did not pre-allocate the map, so as it grew, it reallocated multiple times. Just a guess.
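For concreteness, the pre-allocation being guessed at looks like this in Go; the stats struct is the one quoted earlier in the thread, and the 10,000 hint is an arbitrary guess at the station count:

```go
package main

import "fmt"

type stats struct {
	min, max, sum float64
	count         int64
}

func main() {
	// Without a capacity hint the map starts tiny and repeatedly grows and
	// rehashes as entries are added; the hint avoids that.
	stationStats := make(map[string]*stats, 10_000)
	stationStats["Hamburg"] = &stats{min: 12.3, max: 12.3, sum: 12.3, count: 1}
	fmt.Println(len(stationStats))
}
```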
I find this report confusing. Why does if items[hashIndex].key == nil show as taking 5.01s, but the call to bytes.Equal shows as only 390ms? Surely a slice lookup is much cheaper than a function call? If you are a Go performance expert and can help me interpret it, I’m all ears!
These two lines are both conditionals, so the time reported is sensitive to branch mispredictions. If the timings are not intuitive based on the complexity of the associated lines, then it may be explained by the data being not very predictable and the branch predictor having a bad time.
Yeah, someone emailed me privately after they'd dug into this. They mentioned that "items[hashIndex]" was a significant source of cache misses. They used "perf record -e cache-misses" and found it was the largest source of cache misses. They also found (by digging into the assembly) that the "bytes.Equal" line has some sort of source-line attribution issue.
It's worth noting that if you're messing around with large text files from the CLI, awk, grep, etc. will be an order of magnitude faster if you opt out of Unicode parsing.
I'm pretty confident adding LC_ALL=C to the awk solution would get it easily under a minute.
It's a good point, but I believe because of the "-b" option (characters as bytes), using LC_ALL=C doesn't make a difference.
There's a nodejs version which takes 23s
How come Node performed this well? Any specific reason?
Back in 2010, I used PostgreSQL for a web app that queried 270 million rows of climate data from Environment Canada:
https://www.youtube.com/watch?v=10KEr3sEG80
I wanted to see how the temperature was changing over time for specific regions using a map-based interface. The following chart was particularly eye-opening:
https://www.youtube.com/watch?v=iEtvf9xzRB4&t=164s
The software won a couple of awards and was heavily optimized to produce reports in under a minute. Kudos to the author for getting a parse time of a billion records down to mere seconds.
Those super optimized solutions are fun to read about.
However, in real life I would never assume a million rows of text all have valid data in a specific format, much less a billion rows.
Thus a slower but more robust solution would be more realistic.
Performance is great, but I would imagine the parallelized version requires a significantly higher minimum amount of RAM than the non-parallelized ones... He claims that each more performant solution costs less compute than the one before it, but in the case of parallelization it's the same amount of compute in just a shorter amount of time, right?
I love the author’s step-by-step approach as very often it so happens that a hyper-optimized solution may be overfitted to the exact dataset it’s operating on. In each step, the tradeoffs are being explained: what we gain, but also what we lose by stripping the functionality away.
Is there a collection of all the languages that have attempted this challenge? I know comparing and competing languages is somewhat useless, but it is still interesting to me.
How about kdb+ or shakti?
The JVM has always been on par with, if not often faster than, hand-written C code.
Go's advantage has always been that it is good enough at a lot of things.
Please, you've been reading too much PR from the Java side and not looking at benchmarks and real-world performance enough. What you're claiming is inherently not possible, cherry-picked benchmarks notwithstanding.
Can you explain why it's not technically possible?
The JVM has had decades of experience at optimally translating bytecode to machine code and can take advantage of SIMD, AVX, etc. when needed. Most hand-written C code is far from optimal.
C compilers also have decades of experience optimally translating C code into machine code, and they are arguably more capable of emitting SIMD (good luck trying to use cutting edge AVX-512 intrinsics like vpopcntdq with the JVM). The fact is that there is nothing a JIT compiler can do that an AOT compiler can't do, but in the case of AOT, the resources spent compiling the code are amortized to effectively 0, whereas that resource cost is borne upon every program startup for a JIT engine.
C compilers only have one opportunity to do that, at compile time; if the developer was lucky with the data set used to train the PGO output, maybe the outcome is greatly improved.
Modern JVMs not only have a JIT able to use actual production data, they are able to cache PGO data between execution runs and converge on an optimal set of heuristics over execution time.
And on Android, those PGO files are even shared between devices via Play Store.
That's not necessarily true regarding the JIT vs AOT split. I'm mostly going off of how the divergence in available optimizations is starting to look in .NET after the introduction of Native AOT, plus light research into LLVM and various optimization-adjacent Rust crates.
In particular, with JIT, you are able to initialize certain readonly data once, and then, on recompilation to a more optimized version, bake such data as JIT constants right into emitted machine code. This is not possible with AOT. Same applies for all kinds of in-runtime profiling/analysis and recompilation to incorporate a collected profile according to this exact run of an application. JIT also offers the ability to load modules dynamically in the form of bytecode without having to have a strict machine-level ABI, only the bytecode one, which allows for efficient generics that cross modules, as well as cross-module function inlining. And last but not least - there is no need to pick the least common denominator in supported hardware features as the code can be compiled to use the latest features provided by hardware like AVX512.
On the other hand, pure AOT means a frozen world, which allows the compiler to know the exact types and paths the code can take, performing exact devirtualization and much more aggressive preinitialization of code that accepts constant data. It also means bigger leeway in the time the compiler can spend on optimizing code. Historically, GCC and LLVM have been more advanced than their JIT counterparts because of different tradeoffs favouring absolute performance of the emitted code, as well as simply a higher number of man hours invested in developing them (e.g. .NET punches above its weight class despite being worked on by a smaller team vs OpenJDK or LLVM).
A couple weeks ago I managed to get a nice setup for viewing Java JIT disassembly[1], and did a bunch of ad-hoc tests on OpenJDK 21 (and some on whatever build of 23 was on openjdk.org). They do manage to vectorize a decent number of vectorizable loops, but semi-often miss some improvements, and some trivial things didn't vectorize at all (most amusingly, a loop summing an int[] into an int didn't get vectorized, while summing an int[] into a long did). Scalar code codegen is also quite mediocre.
GCC & clang, in comparison, often produce assembly which I'd consider optimal, or close to it, for vectorizable loops, and are noticably better than OpenJDK for scalar code.
[1]: it's a bit messy to use but: https://github.com/dzaima/grr#java-jit
Sure it's possible. The JVM can do guided optimizations at run time. There is no such thing for native executables.
And as I mentioned in another comment, you can even cache PGO data between executions, so there is no need to always start from zero.
I think the reason why this misconception is so widespread is that there is a grain of truth in it: almost everyone sees Java as synonymous with gigantic frameworks like Spring, Quarkus, etc.
In Go you've got your standard libraries; these are generally quicker than the Java equivalents simply because they do less in the lifecycle of the operation.
This lets Java do funky stuff like enabling full jvm/code tracing just by adding a jar file at runtime. But it does come with a performance penalty.
Which is a reason why dynamic loading of agents now requires being enabled explicitly.
One has to differentiate here a bit. Java JIT technology has become really great: highly optimized native code generation which hugely benefits from the ability to use live profiling data to optimize the code. This is why it often beats static compilers at generating faster code. Static compilers can only optimize for the range of possible data; the JIT can optimize based on the data actually presented to the program.
On the downside, there are quite a few features of the Java language and the JVM which often make programs slow, like a lot of details of the object model, the lack of value classes, JIT compiling which takes time on startup, etc. Also, a lot of Java libraries are pretty heavyweight.
Go is quite different here. It is statically compiled, which allows for fast program startup, and the language model makes it quite easy to naively write programs that perform reasonably fast. The downside is that the compiler is static and, in favour of fast compilation, not as heavily optimizing as other static compilers. However, the ability to use profiling data to optimize compilation was recently added.
The Java version does additional optimizations that his Go version doesn't do and he mentions that at the end of the post. The Java version is really optimized and is an interesting read.
Is there any similar blog post on the Java optimisations?
https://news.ycombinator.com/item?id=39467885 but it looks like there have been improvements since.
https://questdb.io/blog/billion-row-challenge-step-by-step/
Not that I know of. I just looked at the code and the commit history but a more in depth article would certainly be interesting.
Cliff Click did a walkthrough of his solution on YouTube: https://youtu.be/NJNIbgV6j-Y?si=Wj97f-Imw5nfIzF7
The more accurate statement would be that the Go implementation is incapable of accessing optimizations that exist in Java, and Java in turn is incapable of optimizations performed by the C# and C++ implementations.
See https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-...
Go is perfectly capable of all of the additional optimizations that are in the fastest Java implementation that is linked in the article.
Until some time ago Go did not even have inlining profitability logic in the compiler and could only inline leaf functions, something which is worse than JVM implementations could do in...2005 or so I think?
Are you sure?
Well the Go guy probably didn't read it because "java doesn't interest him"
Still loses to .NET. On the reference host, Java is still closer to the 1.7-2s ballpark (and has to use awkward SWAR to get there), while the fastest solution in C# is 1.2s, beating C++ (the code can be ported, however).
But yes, "I expected Go to win..." is exactly the core of the problem here. Same as with e.g. Swift, which people expect to perform on the level of Rust, when it is even slower than Go. The intuition caused by common misconceptions just does not correspond to reality sadly.
It only goes to show how much cargo cult is there in adopting these languages in hipster circles.
No. What it actually demonstrates is that people didn’t read the source material properly.
The Java and Go versions use different optimisations. There’s nothing stopping either language from using the same optimisations as the other. It just wasn’t something their respective authors cared to try in their respective exercises.
There is, however, something stopping Go from using optimizations present in Java or C#/C++/Rust: the lack of a SIMD API short of hand-writing assembly, and an overall much weaker compiler. This puts a much greater burden on the programmer to match the performance while staying with Go.
...
Your second statement contradicts your first. You're not stopped from using SIMD in Go. There are in fact several 3rd party libraries out there to use SIMD. It's just not part of the standard library. So you can still use SIMD in Go without writing Go's dialect of assembly.
It's also worth noting that SIMD isn't even due to drop into std in C++ until C++26. At the moment you either have to use experimental or a 3rd party library.
You're also missing the point of these examples. Nobody writing these exercises is trying to claim that all languages are equal. And they're certainly not trying to write idiomatic code either. They're just fun little exercises demonstrating the degree of optimisation one can go through. You wouldn't want to write code like the examples in anything but the tiniest of scenarios.
It's silly how many people are over-analysing what is essentially just a game, and then arguing it's "proof" of their biases towards different programming languages.
Source for that?
https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-...
I don't think Go has ever demonstrated that it deserved to be thought of as casually faster than Java.
https://github.com/attractivechaos/plb2/blob/master/README.m...
Synthetic benchmarks aside, I think as far as average (Spring Boots of the world) code goes, Go beats Java almost every time, often in fewer lines than the usual pom.xml
Anyone judging Java performance based on starting up a JVM, running for 30 seconds, and shutting down is brain damaged. I'm 99 percent certain this benchmark doesn't use Epsilon GC either lol
LOL k
ah, the cherry on top. this benchmark is literally perfectly useless
And yet, people will refer to some other synthetic benchmark when arguing that the average Java project, which doesn't disable GC or use unsafe reads et al., is comparable.
Can't have your cake and eat it too
So you are comparing a feature-packed enterprise framework with a raw http server ping-ponging hello world?
Also, what even does the last line mean? Go in general is significantly more verbose than Java.
The fastest Java version is even beating his baseline of `cat` to `/dev/null`
More proof that the JVM is space-age future technology.
Yes, though it's also worth noting that the fastest solutions are all doing their work in parallel which is not a thing for `cat`.
Your expectation is correct. The Java version is more tuned. Here https://github.com/dhartunian/1brcgo/?tab=readme-ov-file#lea..., you can find a version that runs in 1.108s and is almost 3x better than the one you quoted. I think one can reduce it even further. In the end, it depends on how fast the executable can boot up and execute the code. At some point, the JVM will lose because it takes quite some time just to initialize the JVM, whereas Go executables can boot up very fast. Here you can see a comparison of Hello World programs: https://github.com/qznc/hello-benchmark?tab=readme-ov-file#h.... The JVM takes a whopping 596 ms to boot up and print Hello World, whereas Go requires just 149 ms.
That's only a valid comparison if the "fastest Java" and "fastest Go" implementations are either the same or at the limit of what each language allows.
The more interesting comparison anyway is performance of the straightforward, idiomatic code, since that's what we all write 99% of the time.
Here's the key insight from the article: "Processing the input file in parallel provides a huge win over r1, taking the time from 1 minute 45 seconds to 24.3 seconds. For comparison, the previous “optimised non-parallel” version, solution 7, took 25.8 seconds. So for this case, parallelisation is a bit faster than optimisation – and quite a bit simpler."
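To illustrate the shape of that parallel win, here's a stripped-down sketch: one reader goroutine cuts the file into newline-aligned chunks and a worker per CPU consumes them. Real solutions merge per-station min/mean/max maps at the end; this sketch just counts lines to stay short, and the file name is assumed.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"runtime"
	"sync"
)

func main() {
	f, err := os.Open("measurements.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	chunks := make(chan []byte, runtime.NumCPU())
	results := make(chan int)

	// One worker per CPU; each processes whole chunks independently.
	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			total := 0
			for chunk := range chunks {
				total += bytes.Count(chunk, []byte{'\n'}) // stand-in for real per-station work
			}
			results <- total
		}()
	}
	go func() { wg.Wait(); close(results) }()

	// Reader: emit ~16MB chunks, carrying any partial last line into the next
	// chunk so every chunk ends on a newline boundary. A final line without a
	// trailing newline is ignored in this sketch.
	buf := make([]byte, 16*1024*1024)
	leftover := []byte{}
	for {
		n, err := f.Read(buf)
		if n > 0 {
			data := append(append([]byte{}, leftover...), buf[:n]...)
			cut := bytes.LastIndexByte(data, '\n') + 1
			leftover = append([]byte{}, data[cut:]...)
			chunks <- data[:cut]
		}
		if err == io.EOF {
			break
		} else if err != nil {
			panic(err)
		}
	}
	close(chunks)

	// Merge the per-worker results.
	lines := 0
	for r := range results {
		lines += r
	}
	fmt.Println("lines:", lines)
}
```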