For what it's worth, I ported the sum example to pure Python.
def sum(depth, x):
    if depth == 0:
        return x
    else:
        fst = sum(depth-1, x*2+0)  # adds the fst half
        snd = sum(depth-1, x*2+1)  # adds the snd half
        return fst + snd

print(sum(30, 0))
Under pypy3 it executes in 0m4.478s, single-threaded. Under Python 3.12 it executes in 1m42.148s, again single-threaded. I mention that because you include benchmark information:
CPU, Apple M3 Max, 1 thread: 3.5 minutes
CPU, Apple M3 Max, 16 threads: 10.26 seconds
GPU, NVIDIA RTX 4090, 32k threads: 1.88 seconds
The Bend single-threaded version has been running for 42 minutes on my laptop, is consuming 6GB of memory, and still hasn't finished (12th Gen Intel(R) Core(TM) i7-1270P, Ubuntu 24.04). That seems to be an incredibly slow interpreter. Has this been tested or developed on anything other than Macs / aarch64? I appreciate this is early days, but it's hard to get excited about what seems to be incredibly slow performance from a really simple example you give. If the simple stuff is slow, what does that mean for the complicated stuff?
If I get a chance tonight, I'll re-run it with `-s` argument, see if I get anything helpful.
Running for 42 minutes is most likely a bug. Yes, we haven't done much testing outside of the M3 Max yet. I'm aware it is 2x slower on non-Apple CPUs. We'll work on that.
For the `sum` example, Bend has a huge disadvantage, because it is allocating 2 IC nodes for each numeric operation, while Python is not. This is obviously terribly inefficient. We'll avoid that soon (just like HVM1 did); it just hasn't been implemented in HVM2 yet.
Note that most of the work behind Bend went into making the parallel evaluator correct. Running closures and unrestricted recursion on GPUs is extremely hard. We've just finished that part, so there has been basically zero effort on micro-optimizations. HVM2's codegen is still abysmal. (And I was very clear about it in the docs!)
That said, please try comparing the Bitonic Sort example, where both are doing the same amount of allocations. I think it will give a much fairer idea of how Bend will perform in practice. HVM1 used to be 3x slower than GHC on a single core, which isn't bad. HVM2 should get to that point in the not-so-distant future.
Now, I totally acknowledge that these "this is still bad but we promise it will get better!!" claims can be underwhelming, and I understand if you don't take my word for it. But I actually believe that, with the foundation set, these micro-optimizations will be the easiest part, and performance will skyrocket from here. In any case, we'll keep working on making it better, and reporting the progress as milestones are reached.
While that's true, Python would be using big integers (PyLongObject) for most of the computations, meaning every number gets allocated on the heap.
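To make that concrete, here is a quick check you can run (exact byte counts vary by CPython version and platform, so treat them as ballpark):

import sys

# Every Python int is a heap-allocated PyLongObject, and it grows as the
# value needs more internal digits.
for value in (0, 1, 2**30, 2**60):
    print(value, sys.getsizeof(value), "bytes")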
If we use a Python implementation that would avoid this, like PyPy or Cython, the results change significantly:
That's on an M2 Pro. I also imagine the result in Bend would not be correct since it only supports 24-bit integers, meaning it'd overflow quite quickly when summing up to 2^30, is that right? [Edit: just noticed the previous comment had already mentioned pypy]
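For reference, the exact value is easy to get in closed form, and, assuming wrap-around on overflow for a 24-bit integer (my assumption, I haven't checked what Bend actually does), the truncated result is easy to predict too:

# Closed-form check of what sum(30, 0) should return.
n = 1 << 30                # the recursion enumerates the leaves 0 .. 2^30 - 1
exact = n * (n - 1) // 2   # sum of 0 .. 2^30 - 1
print(exact)               # 576460751766552576, which needs about 60 bits
print(exact % (1 << 24))   # the same sum reduced mod 2^24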
Do you know why? As far as I can tell, HVM has no aarch64/Apple-specific code. Could it be because Apple Silicon has wider decode blocks?
I don't think anyone wants to rain on your parade, but extraordinary claims require extraordinary evidence.
The work you've done in Bend and HVM sounds impressive, but I feel the benchmarks need more evaluation/scrutiny. Since your main competitor would be Mojo and not Python, comparisons to Mojo would be nice as well.
The only claim I made is that it scales linearly with cores. Nothing else!
I'm personally putting a LOT of effort into making our claims as accurate and truthful as possible, in every single place. Documentation, website, demos. I spent hours in meetings to make sure everything is correct. Yet sometimes it feels that no matter how much effort I put in, people will just find ways to misinterpret it.
We published the real benchmarks, checked and double checked. And then you complained some benchmarks are not so good. Which we acknowledged, and provided causes, and how we plan to address them. And then you said the benchmarks need more evaluation? How does that make sense in the context of them being underwhelming?
We're not going to compare to Mojo or other languages, specifically because it generates hate.
Our only claim is:
HVM2 is the first version of our Interaction Combinator evaluator that runs with linear speedup on GPUs. Running closures on GPUs required a colossal amount of correctness work, and we're reporting this milestone. Moreover, we finally managed to compile a Python-like language to it. That is all that is being claimed, and nothing else. The codegen is still abysmal and single-core performance is bad - that's our next focus. If anything else was claimed, it wasn't us!
The only claim I made is that it scales linearly with cores. Nothing else!
The other link on the front page says:
"Welcome to the Parallel Future of Computation"
Scaling with cores is synonymous with parallelism.
"Future" has some mild speed implications but it sounds like you're doing reasonably there, bug nonwithstanding.
It also has "Not yet" implications ...
I've always taken 'Welcome to the Future' as the thing being presented is futuristic and exists now in the present. Not 'in the future we will welcome you to the future' - while that is a nice sentiment it's utterly useless. To point out the obvious - of course futuristic things exist in the future and of course I have to wait for the future to happen.
But it literally says we believe it is the future of parallel computing! If it was faster than GCC today, we would've written present :')
I think people might interpret something claiming to be the "Future of Parallel Computing" as something that is just waiting on adoption. Perhaps "Towards the Future of Parallel Computing"...
Do you think calling your project parallel is what people have an issue with or do you think it's that you're calling your project the future of parallel computation when it doesn't perform anywhere close to what already exists?
That's true, you never mentioned Python or alternatives in your README. I guess I got Mandela'ed by the comments on Hacker News, so my bad on that.
People are naturally going to compare the timings and the function you cite to what's available to the community right now, though; that's the only way we can picture its performance in real-life tasks.
Mojo launched comparing itself to Python and didn't generate much hate, it seems, but I digress
In any case, I hope Bend and HVM can continue to improve even further. It's always nice to see projects like these, especially from another Brazilian.
Thanks, and I apologize if I got defensive. It is just that I put so much effort into being truthful, double-checking, and putting disclaimers everywhere about every possible misinterpretation. Hell, this is right behind the install instructions:
Yet people still misinterpret. It is frustrating because I don't know what I could've done better.
Don't optimize for minimum hate, optimize for actionable feedback and ignore the haters. Easier said than done, though.
Remember you don't need comment trolls on your team, and you'll go insane taking them seriously. Focus on piquing the interest of motivated language nerds. I personally would have really appreciated a "look, we're still 10x (or whatever) slower than Python, so now I need all the help I can get working on the codegen, etc." This would have given me quick perspective on why this milestone is meaningful.
I agree...
Just a note: we are NOT 10x slower than Python. I think a lot of people got the wrong message from this thread. HVM is actually quite fast already. It is just that, on this specific program, Python was doing no allocations, while HVM was allocating a lot.
If you compare programs that do the same amount of allocation, HVM already outperforms not just Python but even compiled languages like Haskell/GHC, due to using all cores. See the Bitonic Sort example. And that's extremely relevant, because real-world programs in high-level languages allocate a lot!
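If you want a rough feel for that kind of workload without installing anything, here is a plain-Python sketch of the classic recursive bitonic sort (not the actual Bend example, just a similarly allocation-heavy recursive program; input length must be a power of two):

def bitonic_sort(xs, ascending=True):
    # Sort each half in opposite directions, then merge the bitonic sequence.
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    first = bitonic_sort(xs[:mid], True)
    second = bitonic_sort(xs[mid:], False)
    return bitonic_merge(first + second, ascending)

def bitonic_merge(xs, ascending):
    # Compare-and-swap across the halves, then merge each half recursively.
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    lo, hi = list(xs[:mid]), list(xs[mid:])
    for i in range(mid):
        if (lo[i] > hi[i]) == ascending:
            lo[i], hi[i] = hi[i], lo[i]
    return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)

print(bitonic_sort([7, 3, 6, 1, 4, 0, 5, 2]))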
I think I made a huge mistake of using a "sum" example on the GUIDE, since it is actually one of the few specific programs where HVM will do poorly today.
Introducing novel ideas and making strong statements will almost always generate some anger and denial.
https://paulgraham.com/useful.html
Don't worry about it. Keep at it, this is a very cool project.
FWIW, on HN people are inherently going to try to actually use your project, so if it's meant to be (long term) a faster way to run X, people will evaluate it against that implicit benchmark.
I think the (hidden) reasoning is that it is really easy to have speedups with slow interpreters. However, getting speedups in high-performance level programs is quite hard, mainly due to micro-optimisations.
That's where the comparison to Python comes from: getting speedup on slow interpreters is not very _relevant_. Now if your interpreter has the same optimisations as Python (or v8 or JVM), even a small fraction of what you show would be impressive.
Having said this, the work your team did is a really challenging engineering feat (and with a lot more potential). But I do not believe the current speedups will hold once the interpreter/compiler has the level of optimisation that exists in other languages. And while you do not claim it, people expect that.
I think the issue is that there is the implicit claim that this is faster than some alternative. Otherwise what's the point?
If you add some disclaimer like "Note: Bend is currently focused on correctness and scaling. On an absolute scale it may still be slower than single threaded Python. We plan to improve the absolute performance soon." then you won't see these comments.
Also this defensive tone does not come off well:
Right below install instructions, on Bend's README.md:
Second paragraph of Bend's GUIDE.md:
Limitations section of HVM2's paper:
Yeah exactly. I read most of the readme and watched the demo, but I'm not interested in installing it so I missed this. I would recommend moving this to the first section in its own paragraph.
I understand you might not want to focus on this but it's important information and not a bad thing at all.
That's great feedback actually, thank you.
We'll add the disclaimer before the install instructions instead!
Relatedly, the homepage itself doesn't make it obvious it's still alpha, or not ready, or not actually going to speed up your code this moment - claims like "automatically achieves near-ideal speedup, up to 1000+ threads" - the point is that it parallelizes code, but the word speedup makes it sound like my code will get 1000x faster.
I think you can avoid this kind of criticism by setting expectations better - just plastering a banner at the top saying that it’s in early stage development and not optimized, but that the future is bright, for example. The current headline saying it’s the “parallel future of computation” isn’t really enough to make people understand that the future isn’t here yet.
Same goes for the README, the fact that it’s not production ready per-se really ought to be at the top to set people’s expectations properly IMO, since a lot of people will not read the whole wall of text and just jump straight into trying it out once they’re on your GitHub page.
They’re critical since they are led to have much higher expectations than what actually exists today.
That said, this is a cool project and I wish you the best in making it really good!
It is not in alpha, nor is it "not ready". You can use it in production today, if you want to. It is just not fast. That is different. CPython is still 100x slower than C, and is widely deployed in practice.
Identifying what's parallelizable is valuable in the world of language theory, but pure functional languages are as trivial as it gets, so that research isn't exactly ground-breaking.
And you're just not fast enough for anyone doing HPC, where the problem is not identifying what can be parallelized, but figuring out how to make the most of the hardware, i.e. the codegen.
This approach is valuable because it abstracts away certain complexities for the user, allowing them to focus on the code itself. I found it especially beneficial for users who are not willing to learn functional languages or parallelize code in imperative languages. HPC specialists might not be the current target audience, and code generation can always be improved over time, and I trust based on the dev comments that it will be.
Perhaps you can add: "The codegen is still abysmal and single-core performance is bad - that's our next focus." as a disclaimer on the main page/videos/etc. This provides more context about what you claim and also very important what you don't (yet) claim.
Naive question: do you expect the linear scaling to hold with those optimisations to single core performance, or would performance diverge from linear there pending further research advancements?
from reply below:
I just want to say: don't stop. There will always be some people who don't notice or acknowledge the effort to be precise and truthful. But others will. For me, this attitude elevates the project to something I will be watching.
Thank you. I understand that in such an early iteration of a language there are going to be lots of bugs.
This seems like a very, very cool project and I really hope it or something like it is successful at making utilizing the GPU less cumbersome.
The claim from the website is "automatically achieves near-ideal speedup".
I think you were being absolutely precise, but I want to give a tiny bit of constructive feedback anyway:
In my experience, to not be misunderstood it is more important to understand the state of mind/frame of reference of your audience, than to be utterly precise.
The problem is, if you have been working on something for a while, it is extremely hard to understand how the world looks to someone who has never seen it (1).
The second problem is that when you hit a site like Hacker News your audience is impossibly diverse, and there isn't any one state of mind.
When I present research, it always takes many iterations of reflecting on both points to get to a good place.
(1) https://xkcd.com/2501/
Bitonic sort runs in 0m2.035s. Transpiled to C and compiled, it takes 0m0.425s.
That sum example, transpiled to C and compiled, takes 1m12.704s, so it looks like it's just the VM case that is having serious issues of some description!
I have no dog in this fight, but feel compelled to defend the authors here. Recursion does not test compute, rather it tests the compiler's/interpreter's efficiency at standing up and tearing down the call stack.
Clearly this language is positioned at using the gpu for compute-heavy applications and it's still in its early stages. Recursion is not the target application and should not be a relevant benchmark.
It's the author's own benchmark, but it shows the system doesn't work. Recursion isn't hard to eliminate for pure functions; there's nothing inherently wrong with it.
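For example, here is a sketch of the same sum with the recursion replaced by an explicit work list (same amount of work, just no call stack):

def sum_iterative(depth, x):
    # Walk the same binary tree as the recursive version, using our own stack.
    total = 0
    stack = [(depth, x)]
    while stack:
        d, v = stack.pop()
        if d == 0:
            total += v
        else:
            stack.append((d - 1, v * 2 + 0))
            stack.append((d - 1, v * 2 + 1))
    return total

print(sum_iterative(30, 0))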
The authors made an obviously false claim: that this is an extremely fast implementation, when it's not. That's really poor form.
Okay, no. I know I called out performance in my post, but that was just from my observations. It surprised me to see something be that much slower than pure Python. If you show me a near-Python code example in a new language, as someone who mostly writes Python code, I'm going to go and write it in Python and see how it compares performance-wise.
The authors never made any kind of false claims at all. You're reading a lot in to both their README and my post.
They've updated the README for a bit of clarity, but even re-reading the README as it was when I looked this morning (and even a few from before) it hasn't claimed to be fast. The claims are all related to the features that it does have, around parallelisation.
I don't get it. Are people on HN so completely ignorant of even the most basic CS101 concepts?
That's not how complexity theory works at all. If your constants are massive you can make anything look like it scales well.
Imagine a compiler that adds in a very large sleep that's inversely proportional to the number of cores. There you go. My compiler now emits code that scales linearly. Of course it's very slow. Where do I pick up my Turing award?
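A toy version of that in Python, just to make the point concrete (the sleep constant is invented):

import time

# A deliberately useless "parallel" function: a large constant cost that
# shrinks with the core count, so wall-clock time scales roughly linearly
# with cores even though no real work is being parallelized.
def fake_parallel_sum(n, cores):
    time.sleep(8.0 / cores)
    return n * (n - 1) // 2

for cores in (1, 2, 4, 8):
    start = time.perf_counter()
    fake_parallel_sum(1 << 30, cores)
    print(cores, "cores:", round(time.perf_counter() - start, 2), "s")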
I never thought that CS education was this bad.
Where did he claim it is fast? As far as I can see the only claim is that it scales linearly with cores. Which it actually seems to do.
Why `+0`, is this not a pointless no-op?
Yes, but when looking at the source it's more obvious this is a repeating pattern.
"Hey, I'm accessing the 0th element here, just want to make that clear"
Without the +0, that statement looks disconnected from the +1 even though conceptually it's the same.
Say somebody adds some special marker/tombstone/whatever into element 0 and now all those additions need to be bumped up by one. Someone else may go and see the +1, +2, +3 and just change them to +2, +3, +4, etc., while completely missing the lone variable by itself, as it's visually dissimilar.
I've usually seen it used in longer lists of statements. It also keeps everything lined up formatting-wise.
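A made-up example of the kind of longer run where that reads nicely (field names and layout invented for illustration):

data = bytes(range(16))
base = 4

# The explicit offsets keep the lines visually parallel, so a later shift
# (say, everything moves up by one) touches every line in the same way.
magic   = data[base + 0]
version = data[base + 1]
flags   = data[base + 2]
length  = data[base + 3]
print(magic, version, flags, length)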
Python is really bad at recursion (part of why it's not appropriate for functional programming), so perhaps an unfair benchmark?
A Pythonic implementation would use loops and mutation.
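For instance, here is a sketch of a loop-based version, using the fact that the recursion just enumerates the leaf values x*2^depth + i for i in 0..2^depth-1:

def sum_loop(depth, x):
    # Enumerate the leaves of the recursion tree directly, no call stack.
    base = x << depth
    total = 0
    for i in range(1 << depth):
        total += base + i
    return total

print(sum_loop(30, 0))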
"Thread" term in GPUs and CPUs mean different things, it's more like a SIMD lane in GPUs. A bit like ISPC can compile your code so there's eg 32 invocations of your function per CPU thread running on the same time (if you're using 16-bit datums on AVX512), and you could have 2048 executions going on after multiplying up 32 cores * 2 SMT threads/core * 32 compiler executions.