Nice post. Interesting that the fastest Java beats the fastest Go, though they are close:
AY fastest Go version 2.90s 36.2
TW fastest Java version 0.953s 110
I would have expected Go to win. That JVM works pretty good...
I was curious how long it would take with Polars (for scale), apparently 33s: https://github.com/Butch78/1BillionRowChallenge/tree/main
I’m kind of interested in the opposite problem: what is the simplest solution using a well-known library/DB that approaches the fastest hand-optimized solution to this problem?
That’s the question worth asking imo. I was wondering how fast the idiomatic Java solution is.
Is that the same hardware? Otherwise it doesn’t say much.
Java is often slightly faster than Go; it has similar constructs (its Map is older and perhaps better optimized) and perhaps a better GC (also older and more optimized), though I don't think the GC is the challenge here; it has slower startup times. So I'd say roughly the same as the idiomatic Go version?
Java’s GC is just incomparably better; essentially all the research on the topic is written for Java. Though it’s true that without value types, Java does rely somewhat more on it.
On my machine the simple/idiomatic "baseline" Java solution (https://github.com/gunnarmorling/1brc/blob/main/src/main/jav...) takes 2m0s, compared to 1m45s for my simple Go version.
Where’s the source data? I’d like to attempt ingesting this and processing it with DuckDB.
Instructions on how to create it can be found here:
https://github.com/gunnarmorling/1brc?tab=readme-ov-file#run...
The Python version:
https://github.com/gunnarmorling/1brc/blob/main/src/main/pyt...
In the original 1BRC, it's a python script that generates the data.
Here’s a thread on results with DuckDB; I don’t mean to discourage you from taking a shot at it at all, though: https://github.com/gunnarmorling/1brc/discussions/39
Sounds very reasonable. In the blog post, about 20s were shaved off by assuming we don't need complicated string parsing. An off-the-shelf library can't make that assumption, so it will always have to pay the extra cost.
True in general, but some (especially libs aimed at larger datasets processed in batch) are taking advantage of benchmarks like this to do things like:
- Try the fast way; if it works, great.
- If that fails, try the slow way.
This makes the slow path 2x slower at worst (and you can advise always using the slow way via optional params), but the fast path can be 10x faster.
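To make that concrete, here's a rough Go sketch of the fast-path/slow-path split for the 1BRC temperature format; the function names are mine, not from the article or any particular library:

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// parseTempFast assumes a well-formed reading like "-12.3" with exactly one
// digit after the dot and returns tenths of a degree; ok is false when the
// assumption doesn't hold.
func parseTempFast(b []byte) (tenths int64, ok bool) {
	neg := false
	if len(b) > 0 && b[0] == '-' {
		neg, b = true, b[1:]
	}
	// Expect "D.D" or "DD.D": at least 3 bytes with a '.' before the last digit.
	if len(b) < 3 || b[len(b)-2] != '.' {
		return 0, false
	}
	var n int64
	for i := 0; i < len(b); i++ {
		if i == len(b)-2 {
			continue // skip the '.'
		}
		c := b[i]
		if c < '0' || c > '9' {
			return 0, false
		}
		n = n*10 + int64(c-'0')
	}
	if neg {
		n = -n
	}
	return n, true
}

// parseTemp tries the fast path first and only pays for full float parsing
// when the input doesn't match the expected shape.
func parseTemp(b []byte) (int64, error) {
	if t, ok := parseTempFast(b); ok {
		return t, nil
	}
	f, err := strconv.ParseFloat(string(b), 64)
	return int64(math.Round(f * 10)), err
}

func main() {
	for _, s := range []string{"12.3", "-0.7", "7", "nonsense"} {
		t, err := parseTemp([]byte(s))
		fmt.Println(s, "->", t, err)
	}
}
```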
I’m surprised at the poor performance of python here. For reference there are several very brief R examples which are just 2-3 seconds. Eg http://blog.schochastics.net/posts/2024-01-08_one-billion-ro...
Second article I'm reading on implementing this in Go. Since the temperatures are in the range [-99.9, 99.9] with a tenth of precision (~2k values), I am surprised that no one has implemented parsing of the numbers using a prepopulated lookup table. It should probably speed things up.
I submitted a github issue on this for the other implementation I looked at here[1].
He already uses custom parsing.
How do you search in the lookup table? If you are thinking of a hash map it will be slower than the few operations of his custom parser.
If you're clever about how you initially take apart the input you can just do a single memory load at a computed offset.
I don't see how you can make computing the offset faster than just parsing the number.
You could have much simpler parsing than ordinary parsing if you know/assume that you definitely have a valid number from -99.9 to 99.9.
For example, you could find whether it starts with a '-' and where the delimiter is, to know the length of the number's string representation (that's the "simpler parsing" part). Then don't do any work at all to decode the number: simply use the 1-5 bytes of that string (without sign and separator) directly as an index into a very large, very sparse memory region in which all the valid values are pre-populated with the proper result.
You'd need to allocate 4 or 8 terabytes of virtual memory address space for that lookup table, but you'll touch only a tiny fraction of the memory pages, so it doesn't require an unacceptable amount of physical memory.
You don't need to cover all bits of the values, just the 10 numeric values that can pass a bounds check. That reduces the space to only 10K elements. With some bit shifting (and pre-validation) that should easily reduce the search space.
Creating a LUT index with bit-shifting is essentially the same as parsing into an integer. Even if the LUT fits in L1 cache, I doubt it would be faster. If it doesn't fit, it's certainly slower.
If that is faster would seem to depend on if you can get most lookups from the L2 cache. Otherwise you're waiting for main memory, which is a few hundred cycles. Even with multiple loads in parallel, it would be hard to beat arithmetic.
Take the bytes of the number, «-9.7», and interpret as an integer? That’s a 4-byte int -> array index. (Haven’t tried but…)
Wouldn't you need a fixed length of the temps?
00.1
-10.3
or 0.1 (with an ending space)
so you can look up 5 bytes in the map? (+/-, two digits, dot, one digit)
You can create a perfect hash based on the presence of at least 4 characters. The perfect hash is precalculated based on the possible inputs (-99.9 to 99.9 in bytes). The hash is the usual hash = byte*seed + hash. The seed is chosen so that there is no clash (you can find a static seed with a single brute force from 1 to 1M in < 1 min).
I thought the lookup would just be a byte tree, not a hash. Wouldn't a hash with its calculation defeat the purpose of being faster than parsing a number?
The idea would be, you have a tree of all values of 0.0 to 99.9 and then just use the bytes to iterate the tree (e.g. in an array) to come up with the int value of e.g. 45.5
Parsing a number involves an (addition + multiplication) * (number of digits) for each row. If you use a precalculated perfect hash, the multiplication per row can be avoided (you need to read each byte anyway).
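A hedged sketch of that brute-force seed search in Go, assuming a generously sized table (the hard part in practice is shrinking the table while staying collision-free); the table size and seed range are arbitrary choices here:

```go
package main

import "fmt"

// hashWithSeed folds each byte in with one multiply and one add, the shape
// of hash the comment above describes.
func hashWithSeed(b []byte, seed uint64) uint64 {
	var h uint64
	for _, c := range b {
		h = h*seed + uint64(c)
	}
	return h
}

func abs(x int) int {
	if x < 0 {
		return -x
	}
	return x
}

func main() {
	// Every possible reading from -99.9 to 99.9 in steps of 0.1, as text.
	var inputs [][]byte
	for t := -999; t <= 999; t++ {
		sign := ""
		if t < 0 {
			sign = "-"
		}
		a := abs(t)
		inputs = append(inputs, []byte(fmt.Sprintf("%s%d.%d", sign, a/10, a%10)))
	}

	// A generous table makes a collision-free seed easy to find; a tighter
	// table means a much longer search (or no suitable seed at all).
	const tableBits = 22
	mask := uint64(1)<<tableBits - 1

seedLoop:
	for seed := uint64(1); seed < 1_000_000; seed++ {
		seen := make(map[uint64]bool, len(inputs))
		for _, in := range inputs {
			slot := hashWithSeed(in, seed) & mask
			if seen[slot] {
				continue seedLoop // collision: try the next seed
			}
			seen[slot] = true
		}
		fmt.Println("collision-free seed:", seed)
		return
	}
	fmt.Println("no collision-free seed found in range")
}
```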
I did it with custom parsing[0] and treated the numbers as 16-bit integers; the representation in the file is not a constant number of bytes, which complicates the table approach. If you end up computing a hash I think it might be slower than just doing the equivalent parsing I do, and a four-byte constant table will be very large and mostly empty. Maybe a trie would be good.
0: https://github.com/k0nserv/brc/blob/main/src/main.rs#L279
I have a feeling that a naive implementation in Java would be a lot worse than a naive implementation in Go so optimizing matters more there.
Had to parse csvs in Java on a very memory constrained system once... we ended up cutting out a feature because it wasn't worth it.
Golang is actually not as efficient as Java in quite a few benchmarks.
~~Using LLVM isn’t a magic solution to perform better than something relying on the JVM.~~
Here is a source: https://sites.google.com/view/energy-efficiency-languages
Huh? Go doesn't use LLVM though, where did you get the idea that it does? That's part of why its compile times are so fast.
You are right, it doesn’t use LLVM. No idea where I got the idea, I was confidently wrong.
Regarding Java, it probably could be done with arrays and object reuse (arenas). But it's slightly less ergonomic. And the ecosystem isn't designed for it, so you'd have to implement your own memory-efficient parser.
Yep, but it wasn't a critical feature and we were in a rush, so the feature was killed instead.
Depends on which Java implementation is used.
... if you have a choice. It was a port of AOSP, so we didn't. In any case it wasn't the jvm's fault, the device just had very little ram.
Sounds like a skill issue
Depends what you call “naive”, but the “idiomatic Java solution” from last week’s post (https://questdb.io/blog/billion-row-challenge-step-by-step/) clocked in at 71 seconds, or 1:11. And just running it on graal was down to 66.
“Very memory constrained” would be a massive factor here; 1BRC is not really constrained (let alone very much so), it's 1 billion rows on a 32GB machine.
Gigabytes? It was a while ago and I had megabytes for the whole OS :)
Anyway, it's just a fun memory now.
Depends on which Java implementation is used.
People keep forgetting Java is like C and C++, plenty of implementations to choose from, each with its own approach to JIT, AOT, GC and escape analysis.
For what it's worth, on my machine the simple/idiomatic "baseline" Java solution (https://github.com/gunnarmorling/1brc/blob/main/src/main/jav...) takes 2m0s, compared to 1m45s for my simple Go version. So Go is a bit better for the naive version.
the fastest, heavily-optimised Java solution runs in just under a second on my machine
I don’t understand how this is possible. The file in question is 13GB, while the fastest commonly available SSDs do 12400 MB/s. Am I missing something?
I think this bit in the baseline section applies to the Java one too
"Note that that’s a best-of-five measurement, so I’m allowing the file to be cached. Who knows whether Linux will allow all 13GB to be kept in disk cache, though presumably it does, because the first time it took closer to 6 seconds."
Yea, I assumed that. Which makes the parallel version improvements still interesting, but surely it's very artificial. You can't process all the data at the same time if you don't have it all yet.
Benchmarking this gets tricky once you realize that the file might be entirely cached if the computer has enough RAM.
Hah, we use different computers. I was thinking, this might be tricky if you aren't careful to keep the file entirely cached in RAM.
My guess: If you run the benchmark several times, the OS will cache the file in RAM.
If he runs the test multiple times, the file should all be cached in RAM after the first run.
most of the file remains in the OS disk cache after the first run
There are no disk file read times in the original rules; the file is in memfs.
That might be sustained single-threaded read performance.
What if the method of access was concurrent from different parts of the file, and was operating system cached?
The actual 1BRC runs with data on a RAM disk; he could've/should've done the same.
I’m in the same ballpark as Alexander Yastrebov’s Go version. His solution looks similar to mine: break the file into chunks, use a custom hash table (he even uses FNV hashing), and parse temperatures as integers. However, he uses memory-mapped files, which I’d ruled out for portability reasons – I’m guessing that’s why his is a bit faster.
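For reference, FNV-1a (the hash family mentioned here, also available in the standard library as hash/fnv) is only a few lines when hand-rolled; the constants below are the standard 64-bit FNV-1a parameters:

```go
package main

import "fmt"

// Standard 64-bit FNV-1a parameters.
const (
	fnvOffset64 uint64 = 14695981039346656037
	fnvPrime64  uint64 = 1099511628211
)

// fnv1a hashes a byte slice: XOR the byte in, then multiply by the prime.
func fnv1a(b []byte) uint64 {
	h := fnvOffset64
	for _, c := range b {
		h ^= uint64(c)
		h *= fnvPrime64
	}
	return h
}

func main() {
	fmt.Printf("%#x\n", fnv1a([]byte("Hamburg")))
}
```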
I am curious, can it be made even faster than this?
Dunno about Go, but most c# solutions are around 2s and under https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-...
Wow, that’s pretty damn fast. Has C# made some improvements in the past years, or does it have some other advantages?
Stephen Toub wrote a book-length blog post about all the performance improvements made in .NET 8 [1]. Add to that the option to compile to native.
[1] https://devblogs.microsoft.com/dotnet/performance-improvemen...
His blog posts go back all the way to .NET 5, for those curious about a deep dive into the performance improvements made in each release.
Yes to both.
.NET team has been doubling down on performance improvements; people forget the CLR also has features to support C-like languages (hence Managed C++ and C++/CLI), and many of those capabilities are now surfaced in C# as well.
Instead of a hash table I'd try a sort of "eager" trie inside stack-allocated memory, so I can find the slot for a given station's stats after parsing only the minimal number of characters that differentiate that station from the others.
From looking at the final code, the performance of copy() is probably the biggest hurdle.
There's a layer of pointer indirection when using slices in Go, you may be able to eke out some time by moving to arrays on the stack.
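A toy illustration of that indirection, with hypothetical field names; whether it actually wins anything here would need profiling:

```go
// A slice header is a pointer+len+cap that points at a separate backing
// array (an extra pointer chase); a fixed-size array lives inline in the
// struct or on the stack. Field names here are illustrative only.
package main

import "fmt"

type statsWithSlice struct {
	key []byte // slice header -> separate backing array
}

type statsWithArray struct {
	key    [32]byte // stored inline, no pointer chase
	keyLen uint8
}

func main() {
	var a statsWithArray
	n := copy(a.key[:], "Hamburg")
	a.keyLen = uint8(n)
	fmt.Println(string(a.key[:a.keyLen]))

	s := statsWithSlice{key: []byte("Hamburg")}
	fmt.Println(string(s.key))
}
```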
The effort and long time it took Go to get to something that is 3-6x slower than other, better languages should be an important reminder to everyone assuming it belongs in the same weight class as Rust, C# or Java.
But isn’t the Java version unrolling loops? This seems like some effort on the Java side.
The fast Java version is using all the same optimizations as this Go version and then some. It's significantly more complicated.
That you put Rust among those languages says it all. Do some basic research.
Oh, and what does the basic research you speak of consist of? Surely you looked at the ASM emitted by the compilers for these languages and the HPC-adjacent APIs each of them offers? No? Then let me tell you: Go is pretty much consigned to using its special flavour of non-portable, bespoke hand-written ASM, which is the only way to access the SIMD instructions necessary to achieve optimal hardware utilization in this benchmark. This takes a lot of effort and skill, so, as you may have noticed, if you can't do it, Go simply cannot come close to the better options you see on the benchmark chart.
And yet, this is something that can be trivially done in C#, C++ and Rust (albeit C# has the best UX with the cross-platform SIMD API introduced in .NET 7, with C++ a close second with its own take on this being in preview). Java OTOH manages to be in the same category by having an extremely advanced JIT that allows it to have comparable codegen quality even though it lacks a comparable SIMD API for now (Panama vectors are problematic currently), so benchmark implementations using it are forced to do SWAR.
My main gripe is of course an extremely common misconception about Go's speed which it just does not have the moment you write anything sufficiently advanced or want to express a particular problem in a terser way than writing thousands of open coded loops.
If you read the article, you'll see he doesn't attempt the optimizations that helped those other languages get to 3-6x faster. Your snark is wasted.
For anyone looking for more examples of 1BRC in Go, we had a friendly competition at work and collected the results here: https://github.com/dhartunian/1brcgo/
In addition to the loop-unrolling and bit-twiddling tricks that also show up in the fastest Java and C++ versions, some Go-specific things I learned were (a small sketch of a couple of these follows the list):
- unsafe.Pointer can be used to read memory without bounds checks
- many functions in the bytes and bits packages in the standard library are written in assembly
- debug.SetGCPercent and SetMemoryLimit to turn off GC
- runtime.LockOSThread to lock a goroutine to a thread
- print is slightly faster than fmt.Printf (but writes to stderr)
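A minimal sketch of a couple of the knobs above, all from the standard library; whether they help depends entirely on the workload:

```go
package main

import (
	"fmt"
	"math"
	"runtime"
	"runtime/debug"
)

func main() {
	debug.SetGCPercent(-1)              // disable automatic GC cycles
	debug.SetMemoryLimit(math.MaxInt64) // effectively no soft memory limit
	runtime.LockOSThread()              // keep this goroutine on one OS thread
	defer runtime.UnlockOSThread()

	fmt.Println("GC disabled for the duration of the run")
}
```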
Oh, I'd missed those solutions, thanks. You guys got way more hard core than I did -- nice work! Looking forward to reading the code for those solutions this week.
Update: for reference, Jason Chu's solution (https://github.com/dhartunian/1brcgo/blob/494eabd6ea958cc193...) seems to be the fastest on my machine, and runs in about 1.3s!
Could you say why you find using memory-mapped files to be a portability issue? Thanks.
Well, I guess it's more that the standard library doesn't have a cross-platform way to access them, not that memory-mapped files themselves can't be done on (say) Windows. It looks like there's a fairly popular 3rd party package that supports at least Linux, macOS, and Windows: https://github.com/edsrzf/mmap-go
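Rough sketch of mapping the measurements file with that third-party package (github.com/edsrzf/mmap-go); I believe its API looks roughly like this, but treat the exact call signatures and the file name as assumptions:

```go
package main

import (
	"fmt"
	"os"

	mmap "github.com/edsrzf/mmap-go"
)

func main() {
	f, err := os.Open("measurements.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// data is a []byte backed directly by the mapped file.
	data, err := mmap.Map(f, mmap.RDONLY, 0)
	if err != nil {
		panic(err)
	}
	defer data.Unmap()

	fmt.Println("mapped", len(data), "bytes")
}
```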
Not all filesystems support memory-mapped files equally, and for some that do, the support comes with caveats and could be slower than non-memory-mapped access.
I think we all ended up using unsafe, though there were some solutions without mmap. It would have been interesting if we had adhered to the same constraints you did!
My first instinct would be to spin up a local Postgres and keep station data there. A lot of the solutions assume we have enough RAM to keep the stats per station; however, that's a bad assumption when dealing with a lot of data.
This is not a general solution, it’s a specific problem with a specific data shape, and processes specifically 1 billion rows on a 32GB machine.
And this is why file io is not the bottleneck - all the data is in RAM/disk cache.
"A lot of the solutions assume we have enough RAM to keep the stats per station"
I’ll give this example in Ruby, but you’ll get the point: what you are mentioning is an issue if you choose to use File.read(), because it opens the file and reads its whole content into RAM. But this can be avoided if you use File.foreach(), because it streams each line instead, which uses much less RAM.
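The Go equivalent of that streaming approach, as a minimal sketch (file name assumed): read a line at a time with bufio.Scanner instead of slurping the whole file into memory with os.ReadFile.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	f, err := os.Open("measurements.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 1024*1024), 1024*1024) // room for long lines
	lines := 0
	for scanner.Scan() {
		_ = scanner.Bytes() // one line at a time, without the trailing newline
		lines++
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
	fmt.Println("lines:", lines)
}
```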
What I am saying is that the solutions presented assume we have N stations, where N is far less than 1_000_000_000.
Worst case scenario for RAM would be if each line contained a unique station. In this case we'd have to allocate 1_000_000_000 * sizeof(stats) in RAM to contain the result of the computation.
So most of the solutions assume sufficient RAM ahead of reading the file.
In the first solution:

    type stats struct {
        min, max, sum float64
        count         int64
    }

would take up 32GB for the stats alone across all 1e9 stations (32 bytes per entry), and that's ignoring the space needed for each station's name!
I’d just pay my way to 4s by upgrading hw
You can't just throw hardware at this one to get to 4s. At least not in 2024.
The author's naive single-threaded Go solution took 1m45s on an "amd64 laptop with fast SSD drive and 32GB of RAM."
So, you'd need something 25x faster than his setup in terms of single-threaded performance. Let us know when you've upgraded your hardware to the equivalent of a 75GHz AMD processor with memory and SSD bandwidth to match!
The nice thing about a GPU soln (ex: python dataframes in cudf, just a few loc) is these generally come down to your IO bandwidth, like a single 2GB/s SSD to a 16-32 GB/s PCIe to 1-2 GPUs running crazy fast. And then buy more cheap SSDs to chain together before buying more/better GPUs :)
... how?
There aren't any generally-available CPUs that are substantially faster today than were available ten years ago. Maybe double the speed per core, triple at best.
After that, throwing more cores at it also rapidly runs out of steam because parallel code has its own overheads. Any shared state instantly kills performance, no matter the language. Very clever tricks have to be used to get decent scaling past 64 hardware threads (32 cores), and going past 256 is surprisingly difficult. You start having to worry about NUMA, IRQ steering, and core pinning. Bandwidth gets to be an issue, even to L3 and L4 cache, let alone out to main memory.
This notion that you can just "dial up" hardware performance to infinity as a fix for any amount of developer laziness needs to die.
I found this super interesting - especially as all the data I've written code to manipulate has been small enough that I haven't needed to optimize my code, so I've never had to think in this direction.
I think my favorite part was the very first section, where he got baseline measurements with `cat`, `wc`, and friends. I wouldn't have thought to do that and it's such an easy way to get a perspective on what's "reasonable".
A few months ago, I had to quickly bang out a script to output about 20 million lines of text, each the output of a hash function. My naive solution took more than a few minutes - simple optimizations such as writing every 10k lines cut the time significantly. Threading would have helped quite a bit as well.
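The same buffering idea expressed in Go terms, as a rough sketch (the hash and record count are placeholders, not the commenter's actual script): wrap the output in a bufio.Writer so each line lands in a buffer instead of triggering its own write syscall.

```go
package main

import (
	"bufio"
	"crypto/sha256"
	"fmt"
	"os"
)

func main() {
	w := bufio.NewWriterSize(os.Stdout, 1<<20) // 1MB output buffer
	defer w.Flush()

	// Emit 20 million hashed lines through the buffer.
	for i := 0; i < 20_000_000; i++ {
		sum := sha256.Sum256([]byte(fmt.Sprintf("record-%d", i)))
		fmt.Fprintf(w, "%x\n", sum)
	}
}
```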
I hate to "me too", but you also nailed that analysis.
It also underscores just how insane raw disk bandwidth is on modern ssds. Most software is not designed around a world where you have gigabytes a second of sequential scan on a laptop.
Nice. I wonder how Rust would fare, given that it has no garbage collector.
You can disable GC for Golang, but I don't think it will improve on 4s.
There are a few Rust solutions in the "Show and Tell" linked above, for example this fairly readable one at 15.5s: https://github.com/gunnarmorling/1brc/discussions/57 A comment above referencing Python "polars" actually has Rust polars, std, and SIMD solutions as well (SIMD was fastest, but less readable for a hobbyist like me).
EDIT: https://github.com/Butch78/1BillionRowChallenge/tree/main
It would be interesting to know how effective Profile Guided Optimisation is here.
It is only mildly effective because of how anemic the Go compiler is, and even then it's extremely limited. If you want to see actually good implementations, look into what the OpenJDK HotSpot and .NET JIT compilers do with runtime profiling and recompilation (.NET calls it Dynamic PGO).
Unfortunately it doesn't seem to help at all, I think mainly because (at present) Go's PGO basically inlines hot functions, and the important code here is all in one big function.
I love the nerdery around 1BRC. My axe to grind is that unless you do dangerous stuff, DBs are just as fast, less complicated, and more resilient to data updates than application code [0]. Do more in the database!
0: https://geraldonit.com/2024/01/31/1-billion-row-challenge-in...
Actually, that guy can now be sued; per the TOS it is illegal to publish benchmarks for Oracle DB.
I agree with doing more in the database, you're closer to the data (disk/disk cache/L2 cache) than the application code is. At the same time I get really nervous around doing work in the database because you have to be really disciplined that the in-database code (functions, views, etc) match the code in source control. Also that your testing/QA database contains all the same code and enough test data to actually exercise the performance bounds of that code.
With application code I can easily step through it in a debugger and verify the deployed code matches what's in the repo. It's more difficult to do in the database because it requires more organizational discipline.
I’d like to see a 1 trillion row challenge.
One exists https://blog.coiled.io/blog/1trc.html
I saw the custom hash table, but why was the built-in map slow?
I believe it is because he did not pre-allocate the map, so as it grew, it reallocated multiple times. Just a guess.
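For concreteness, the pre-allocation being guessed at looks like this in Go; the stats struct is the one quoted earlier in the thread, and the 10,000 hint is an arbitrary guess at the station count:

```go
package main

import "fmt"

type stats struct {
	min, max, sum float64
	count         int64
}

func main() {
	// Without a capacity hint the map starts tiny and repeatedly grows and
	// rehashes as entries are added; the hint avoids that.
	stationStats := make(map[string]*stats, 10_000)
	stationStats["Hamburg"] = &stats{min: 12.3, max: 12.3, sum: 12.3, count: 1}
	fmt.Println(len(stationStats))
}
```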
I find this report confusing. Why does if items[hashIndex].key == nil show as taking 5.01s, but the call to bytes.Equal shows as only 390ms? Surely a slice lookup is much cheaper than a function call? If you are a Go performance expert and can help me interpret it, I’m all ears!
These two lines are both conditionals, so the time reported is sensitive to branch mispredictions. If the timings are not intuitive based on the complexity of the associated lines, then it may be explained by the data being not very predictable and the branch predictor having a bad time.
Yeah, someone emailed me privately after they'd dug into this. They mentioned that "items[hashIndex]" was a significant source of cache misses. They used "perf record -e cache-misses" and found it was the largest source of cache misses. They also found (by digging into the assembly) that the "bytes.Equal" line has some sort of source-line attribution issue.
It's worth noting that if you're messing around with large text files from the CLI, awk, grep, etc. will be an order of magnitude faster if you opt out of Unicode parsing.
I'm pretty confident adding LC_ALL=C to the awk solution would get it easily under a minute.
It's a good point, but I believe because of the "-b" option (characters as bytes), using LC_ALL=C doesn't make a difference.
There's a nodejs version which takes 23s
How come Node performed this well? Any specific reason?
Back in 2010, I used PostgreSQL for a web app that queried 270 million rows of climate data from Environment Canada:
https://www.youtube.com/watch?v=10KEr3sEG80
I wanted to see how the temperature was changing over time for specific regions using a map-based interface. The following chart was particularly eye-opening:
https://www.youtube.com/watch?v=iEtvf9xzRB4&t=164s
The software won a couple of awards and was heavily optimized to produce reports in under a minute. Kudos to the author for getting a parse time of a billion records down to mere seconds.
Those super optimized solutions are fun to read about.
However, in real life I would never assume a million rows of text all have valid data in a specific format, much less a billion rows.
Thus a slower but more robust solution would be more realistic.
Performance is great, but I would imagine the parallelized version requires a significantly higher minimum amount of RAM than the non-parallelized ones... He claims that each more performant solution costs less compute than the one before it, but in the case of parallelization it's the same amount of compute in just a shorter amount of time, right?
I love the author’s step-by-step approach as very often it so happens that a hyper-optimized solution may be overfitted to the exact dataset it’s operating on. In each step, the tradeoffs are being explained: what we gain, but also what we lose by stripping the functionality away.
Is there a collection of all the languages that have attempted this challenge? I know comparing and competing languages is somewhat useless, but it is still interesting to me.
How about kdb+ or shakti?
The JVM has always been on par with, if not often faster than, hand-written C code.
Go's advantage has always been that it is good enough at a lot of things.
Please, you've been reading too much PR from the Java side and not looking at benchmarks and real-world performance enough. What you're claiming is inherently not possible, cherry-picked benchmarks notwithstanding.
Can you explain why it's not technically possible?
The JVM has had decades of experience at optimally translating bytecode to machine code and can take advantage of SIMD, AVX, etc. when needed. Most hand-written C code is far from optimal.
C compilers also have decades of experience optimally translating C code into machine code, and they are arguably more capable of emitting SIMD (good luck trying to use cutting edge AVX-512 intrinsics like vpopcntdq with the JVM). The fact is that there is nothing a JIT compiler can do that an AOT compiler can't do, but in the case of AOT, the resources spent compiling the code are amortized to effectively 0, whereas that resource cost is borne upon every program startup for a JIT engine.
C compilers only have one opportunity to do that, at compile time; if the developer was lucky with the data set used to train the PGO output, maybe the outcome is greatly improved.
Modern JVMs not only have a JIT able to use actual production data, they are able to cache PGO data between execution runs and converge on an optimal set of heuristics over execution time.
And on Android, those PGO files are even shared between devices via Play Store.
That's not necessarily true regarding the JIT vs AOT split. I'm mostly going off of how the divergence in available optimizations is starting to look in .NET after the introduction of Native AOT, plus light research into LLVM and various optimization-adjacent Rust crates.
In particular, with JIT, you are able to initialize certain readonly data once, and then, on recompilation to a more optimized version, bake such data as JIT constants right into emitted machine code. This is not possible with AOT. Same applies for all kinds of in-runtime profiling/analysis and recompilation to incorporate a collected profile according to this exact run of an application. JIT also offers the ability to load modules dynamically in the form of bytecode without having to have a strict machine-level ABI, only the bytecode one, which allows for efficient generics that cross modules, as well as cross-module function inlining. And last but not least - there is no need to pick the least common denominator in supported hardware features as the code can be compiled to use the latest features provided by hardware like AVX512.
On the other hand, pure AOT means a frozen world, which allows the compiler to know the exact types and paths the code can take, performing exact devirtualization and much more aggressive preinitialization of code that accepts constant data. It also means bigger leeway in the time the compiler can spend on optimizing code. Historically, GCC and LLVM have been more advanced than their JIT counterparts because of different tradeoffs favouring absolute performance of the emitted code, as well as simply a higher number of man hours invested in developing them (e.g. .NET punches above its weight class despite being worked on by a smaller team vs OpenJDK or LLVM).
A couple weeks ago I managed to get a nice setup for viewing Java JIT disassembly[1], and did a bunch of ad-hoc tests on OpenJDK 21 (and some on whatever build of 23 was on openjdk.org). They do manage to vectorize a decent number of vectorizable loops, but semi-often miss some improvements, and some trivial things didn't vectorize at all (most amusingly, a loop summing an int[] into an int didn't get vectorized, while summing an int[] into a long did). Scalar code codegen is also quite mediocre.
GCC & clang, in comparison, often produce assembly which I'd consider optimal, or close to it, for vectorizable loops, and are noticably better than OpenJDK for scalar code.
[1]: it's a bit messy to use but: https://github.com/dzaima/grr#java-jit
Sure it's possible. The JVM can do guided optimizations at run time. There is no such thing for native executables.
And as I mentioned in another comment, you can even cache PGO data between executions, so there is no need to always start from zero.
I think the reason why this misconception is so widespread is that there is a grain of truth in it: almost everyone sees Java as synonymous with gigantic frameworks like Spring, Quarkus, etc.
In Go you've got your standard libraries; these are generally quicker than the Java equivalents simply because they do less in the lifecycle of the operation.
This lets Java do funky stuff like enabling full jvm/code tracing just by adding a jar file at runtime. But it does come with a performance penalty.
Which is a reason why dynamic loading of agents now requires being enabled explicitly.
One has to differentiate here a bit. Java JIT technology has become really great: highly optimized native code generation which hugely benefits from the ability to use live profiling data to optimize the code. This is why it often beats static compilers at generating faster code. Static compilers can only optimize for the range of possible data; the JIT can optimize based on the data actually presented to the program.
On the downside, there are quite a few features of the Java language and the JVM which often make programs slow, like a lot of details of the object model, the lack of value classes, JIT compiling which takes time on startup, etc. Also, a lot of Java libraries are pretty heavyweight.
Go is quite different here. It is statically compiled, which allows for fast program startup, and the language model makes it quite easy to naively write programs that perform reasonably fast. The downside is that the compiler is static and, in favour of fast compilation, not as heavily optimizing as other static compilers. However, the ability to use profiling data to optimize compilation was recently added.
The Java version does additional optimizations that his Go version doesn't do and he mentions that at the end of the post. The Java version is really optimized and is an interesting read.
Is there any similar blog post on the Java optimisations?
https://news.ycombinator.com/item?id=39467885 but it looks like there have been improvements since.
https://questdb.io/blog/billion-row-challenge-step-by-step/
Not that I know of. I just looked at the code and the commit history but a more in depth article would certainly be interesting.
Cliff Click did a walkthrough of his solution on YouTube: https://youtu.be/NJNIbgV6j-Y?si=Wj97f-Imw5nfIzF7
The more accurate statement would be that the Go implementation is incapable of accessing optimizations that exist in Java, and Java in turn is incapable of optimizations performed by the C# and C++ implementations.
See https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-...
Go is perfectly capable of all of the additional optimizations that are in the fastest Java implementation that is linked in the article.
Until some time ago Go did not even have inlining profitability logic in the compiler and could only inline leaf functions, something which is worse than JVM implementations could do in...2005 or so I think?
Are you sure?
Well the Go guy probably didn't read it because "java doesn't interest him"
Still loses to .NET. On the reference host, Java is still closer to the 1.7-2s ballpark (and has to use awkward SWAR to get there), while the fastest solution in C# is 1.2s, beating C++ (the code can be ported, however).
But yes, "I expected Go to win..." is exactly the core of the problem here. Same as with e.g. Swift, which people expect to perform on the level of Rust, when it is even slower than Go. The intuition caused by common misconceptions just does not correspond to reality sadly.
It only goes to show how much cargo cult is there in adopting these languages in hipster circles.
No. What it actually demonstrates is that people didn’t read the source material properly.
The Java and Go versions use different optimisations. There’s nothing stopping either language from using the same optimisations as the other. It just wasn’t something their respective authors cared to try in their respective exercises.
There is, however, something stopping Go from using optimizations present in Java or C#/C++/Rust: the lack of a SIMD API short of hand-writing assembly, and an overall much weaker compiler. This puts a much greater burden on the programmer to match the performance while staying with Go.
...
Your second statement contradicts your first. You're not stopped from using SIMD in Go. There are in fact several 3rd party libraries out there to use SIMD. It's just not part of the standard library. So you can still use SIMD in Go without writing Go's dialect of assembly.
It's also worth noting that SIMD isn't even due to drop into std in C++ until C++26. At the moment you either have to use experimental or a 3rd party library.
You're also missing the point of these examples. Nobody writing these exercises is trying to claim that all languages are equal. And they're certainly not trying to write idiomatic code either. They're just fun little exercises demonstrating the degree of optimisation one can go through. You wouldn't want to write code like the examples in anything but the tiniest of scenarios.
It's silly how many people are over-analysing what is essentially just a game, and then arguing it's "proof" of their biases towards different programming languages.
Source for that?
https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-...
I don't think Go has ever demonstrated that it deserved to be thought of as casually faster than Java.
https://github.com/attractivechaos/plb2/blob/master/README.m...
Synthetic benchmarks aside, I think as far as average (Spring Boots of the world) code goes, Go beats Java almost every time, often in fewer lines than the usual pom.xml
Anyone judging Java performance based on starting up a JVM, running for 30 seconds, and shutting down is brain damaged. I'm 99 percent certain this benchmark doesn't use Epsilon GC either lol
LOL k
ah, the cherry on top. this benchmark is literally perfectly useless
And yet, people will refer to some other synthetic benchmark when arguing that the average Java project, which doesn't disable GC or use unsafe reads et al., is comparable.
Can't have your cake and eat it too
So you are comparing a feature-packed enterprise framework with a raw http server ping-ponging hello world?
Also, what even does the last line mean? Go in general is significantly more verbose than Java.
The fastest Java version is even beating his baseline of `cat` to `/dev/null`
More proof that the JVM is space-age future technology.
Yes, though it's also worth noting that the fastest solutions are all doing their work in parallel which is not a thing for `cat`.
Your expectation is correct. The Java version is more tuned. Here https://github.com/dhartunian/1brcgo/?tab=readme-ov-file#lea..., you can find a version that runs in 1.108s and is almost 3x better than the one you quoted. I think one can reduce it even further. In the end, it depends on how fast the executable can boot up and execute the code. At some point, the JVM will lose because it takes quite some time just to initialize the JVM, whereas Go executables can boot up very fast. Here you can see a comparison of Hello World programs: https://github.com/qznc/hello-benchmark?tab=readme-ov-file#h.... The JVM takes a whopping 596 ms to boot up and print Hello World, whereas Go requires just 149 ms.
That's only a valid comparison if the "fastest Java" and "fastest Go" implementations are either the same or at the limit of what each language allows.
The more interesting comparison anyway is performance of the straightforward, idiomatic code, since that's what we all write 99% of the time.
Here's the key insight from the article: "Processing the input file in parallel provides a huge win over r1, taking the time from 1 minute 45 seconds to 24.3 seconds. For comparison, the previous “optimised non-parallel” version, solution 7, took 25.8 seconds. So for this case, parallelisation is a bit faster than optimisation – and quite a bit simpler."
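To illustrate the shape of that parallel win, here's a stripped-down sketch: one reader goroutine cuts the file into newline-aligned chunks and a worker per CPU consumes them. Real solutions merge per-station min/mean/max maps at the end; this sketch just counts lines to stay short, and the file name is assumed.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"runtime"
	"sync"
)

func main() {
	f, err := os.Open("measurements.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	chunks := make(chan []byte, runtime.NumCPU())
	results := make(chan int)

	// One worker per CPU; each processes whole chunks independently.
	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			total := 0
			for chunk := range chunks {
				total += bytes.Count(chunk, []byte{'\n'}) // stand-in for real per-station work
			}
			results <- total
		}()
	}
	go func() { wg.Wait(); close(results) }()

	// Reader: emit ~16MB chunks, carrying any partial last line into the next
	// chunk so every chunk ends on a newline boundary. A final line without a
	// trailing newline is ignored in this sketch.
	buf := make([]byte, 16*1024*1024)
	leftover := []byte{}
	for {
		n, err := f.Read(buf)
		if n > 0 {
			data := append(append([]byte{}, leftover...), buf[:n]...)
			cut := bytes.LastIndexByte(data, '\n') + 1
			leftover = append([]byte{}, data[cut:]...)
			chunks <- data[:cut]
		}
		if err == io.EOF {
			break
		} else if err != nil {
			panic(err)
		}
	}
	close(chunks)

	// Merge the per-worker results.
	lines := 0
	for r := range results {
		lines += r
	}
	fmt.Println("lines:", lines)
}
```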