For what it's worth, I ported the sum example to pure Python.
def sum(depth, x):
    if depth == 0:
        return x
    else:
        fst = sum(depth-1, x*2+0)  # adds the fst half
        snd = sum(depth-1, x*2+1)  # adds the snd half
        return fst + snd

print(sum(30, 0))
Under pypy3 it executes in 0m4.478s, single-threaded. Under Python 3.12 it executes in 1m42.148s, again single-threaded. I mention that because you include benchmark information:
CPU, Apple M3 Max, 1 thread: 3.5 minutes
CPU, Apple M3 Max, 16 threads: 10.26 seconds
GPU, NVIDIA RTX 4090, 32k threads: 1.88 seconds
The Bend single-threaded version has been running for 42 minutes on my laptop, is consuming 6GB of memory, and still hasn't finished (12th Gen Intel(R) Core(TM) i7-1270P, Ubuntu 24.04). That seems to be an incredibly slow interpreter. Has this been tested or developed on anything other than Macs / aarch64? I appreciate this is early days, but it's hard to get excited about what seems to be incredibly slow performance from a really simple example you give. If the simple stuff is slow, what does that mean for the complicated stuff?
If I get a chance tonight, I'll re-run it with `-s` argument, see if I get anything helpful.
Running for 42 minutes is most likely a bug. Yes, we haven't done much testing outside of the M3 Max yet. I'm aware it is 2x slower on non-Apple CPUs. We'll work on that.
For the `sum` example, Bend has a huge disadvantage, because it is allocating 2 IC nodes for each numeric operation, while Python is not. This is obviously terribly inefficient. We'll avoid that soon (just like HVM1 did); it just hasn't been implemented in HVM2 yet.
Note that most of the work behind Bend went into making the parallel evaluator correct. Running closures and unrestricted recursion on GPUs is extremely hard. We've just finished that part, so there has been basically zero effort on micro-optimizations. HVM2's codegen is still abysmal. (And I was very clear about it in the docs!)
That said, please try comparing the Bitonic Sort example, where both are doing the same amount of allocations. I think it will give a much fairer idea of how Bend will perform in practice. HVM1 used to be 3x slower than GHC on a single core, which isn't bad. HVM2 should get to that point in the not-so-distant future.
Now, I totally acknowledge that these "this is still bad but we promise it will get better!!" claims can be underwhelming, and I understand if you don't take my word for it. But I actually believe that, with the foundation set, these micro-optimizations will be the easiest part, and performance will skyrocket from here. In any case, we'll keep working on making it better, and reporting the progress as milestones are reached.
While that's true, Python would be using big integers (PyLongObject) for most of the computations, meaning every number gets allocated on the heap.
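To make that concrete, here is a quick check you can run (exact byte counts vary by CPython version and platform, so treat them as ballpark):

import sys

# Every Python int is a heap-allocated PyLongObject, and it grows as the
# value needs more internal digits.
for value in (0, 1, 2**30, 2**60):
    print(value, sys.getsizeof(value), "bytes")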
If we use a Python implementation that would avoid this, like PyPy or Cython, the results change significantly:
That's on an M2 Pro. I also imagine the result in Bend would not be correct since it only supports 24-bit integers, meaning it'd overflow quite quickly when summing up to 2^30, is that right? [Edit: just noticed the previous comment had already mentioned pypy]
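For reference, the exact value is easy to get in closed form, and, assuming wrap-around on overflow for a 24-bit integer (my assumption, I haven't checked what Bend actually does), the truncated result is easy to predict too:

# Closed-form check of what sum(30, 0) should return.
n = 1 << 30                # the recursion enumerates the leaves 0 .. 2^30 - 1
exact = n * (n - 1) // 2   # sum of 0 .. 2^30 - 1
print(exact)               # 576460751766552576, which needs about 60 bits
print(exact % (1 << 24))   # the same sum reduced mod 2^24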
Do you know why? As far as I can tell, HVM has no aarch64/Apple-specific code. Could it be because Apple Silicon has wider decode blocks?
I don't think anyone wants to rain on your parade, but extraordinary claims require extraordinary evidence.
The work you've done in Bend and HVM sounds impressive, but I feel the benchmarks need more evaluation/scrutiny. Since your main competitor would be Mojo and not Python, comparisons to Mojo would be nice as well.
The only claim I made is that it scales linearly with cores. Nothing else!
I'm personally putting a LOT of effort into making our claims as accurate and truthful as possible, in every single place. Documentation, website, demos. I spent hours in meetings to make sure everything is correct. Yet sometimes it feels that no matter how much effort I put in, people will just find ways to misinterpret it.
We published the real benchmarks, checked and double checked. And then you complained some benchmarks are not so good. Which we acknowledged, and provided causes, and how we plan to address them. And then you said the benchmarks need more evaluation? How does that make sense in the context of them being underwhelming?
We're not going to compare to Mojo or other languages, specifically because it generates hate.
Our only claim is:
HVM2 is the first version of our Interaction Combinator evaluator that runs with linear speedup on GPUs. Running closures on GPUs required a colossal amount of correctness work, and we're reporting this milestone. Moreover, we finally managed to compile a Python-like language to it. That is all that is being claimed, and nothing else. The codegen is still abysmal and single-core performance is bad - that's our next focus. If anything else was claimed, it wasn't us!
The only claim I made is that it scales linearly with cores. Nothing else!
The other link on the front page says:
"Welcome to the Parallel Future of Computation"
Scaling with cores is synonymous with parallelism.
"Future" has some mild speed implications but it sounds like you're doing reasonably there, bug nonwithstanding.
It also has "Not yet" implications ...
I've always taken 'Welcome to the Future' as the thing being presented is futuristic and exists now in the present. Not 'in the future we will welcome you to the future' - while that is a nice sentiment it's utterly useless. To point out the obvious - of course futuristic things exist in the future and of course I have to wait for the future to happen.
But it literally says we believe it is the future of parallel computing! If it was faster than GCC today, we would've written present :')
I think people might interpret something claiming to be the "Future of Parallel Computing" as something that is just waiting on adoption. Perhaps "Towards the Future of Parallel Computing"...
Do you think calling your project parallel is what people have an issue with or do you think it's that you're calling your project the future of parallel computation when it doesn't perform anywhere close to what already exists?
That's true, you never mentioned Python or alternatives in your README. I guess I got Mandela'ed by the comments on Hacker News, so my bad on that.
People are naturally going to compare the timings and the function you cite to what's available to the community right now, though; that's the only way we can picture its performance in real-life tasks.
Mojo launched comparing itself to Python and didn't generate much hate, it seems, but I digress
In any case, I hope Bend and HVM can continue to improve even further. It's always nice to see projects like these, especially from another Brazilian.
Thanks, and I apologize if I got defensive. It is just that I put so much effort into being truthful, double-checking, and putting disclaimers everywhere about every possible misinterpretation. Hell, this is right behind the install instructions:
Yet people still misinterpret. It is frustrating because I don't know what I could've done better.
Don't optimize for minimum hate, optimize for actionable feedback and ignore the haters. Easier said than done, though.
Remember you don't need comment trolls on your team, and you'll go insane taking them seriously. Focus on piquing the interest of motivated language nerds. I personally would have really appreciated a "look, we're still 10x (or whatever) slower than Python, so now I need all the help I can get working on the codegen, etc." This would have given me quick perspective on why this milestone is meaningful.
I agree...
Just a note: we are NOT 10x slower than Python. I think a lot of people got the wrong message from this thread. HVM is actually quite fast already. It is just that, on this specific program, Python was doing no allocations, while HVM was allocating a lot.
If you compare programs that do the same amount of allocation, HVM already outperforms not just Python but even compiled languages like Haskell/GHC, due to using all cores. See the Bitonic Sort example. And that's extremely relevant, because real-world programs in high-level languages allocate a lot!
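If you want a rough feel for that kind of workload without installing anything, here is a plain-Python sketch of the classic recursive bitonic sort (not the actual Bend example, just a similarly allocation-heavy recursive program; input length must be a power of two):

def bitonic_sort(xs, ascending=True):
    # Sort each half in opposite directions, then merge the bitonic sequence.
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    first = bitonic_sort(xs[:mid], True)
    second = bitonic_sort(xs[mid:], False)
    return bitonic_merge(first + second, ascending)

def bitonic_merge(xs, ascending):
    # Compare-and-swap across the halves, then merge each half recursively.
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    lo, hi = list(xs[:mid]), list(xs[mid:])
    for i in range(mid):
        if (lo[i] > hi[i]) == ascending:
            lo[i], hi[i] = hi[i], lo[i]
    return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)

print(bitonic_sort([7, 3, 6, 1, 4, 0, 5, 2]))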
I think I made a huge mistake of using a "sum" example on the GUIDE, since it is actually one of the few specific programs where HVM will do poorly today.
Introducing novel ideas and making strong statements will almost always generate some anger and denial.
https://paulgraham.com/useful.html
Don't worry about it. Keep at it, this is a very cool project.
FWIW, on HN people are inherently going to try to actually use your project, so if it's meant to be (long term) a faster way to run X, people will evaluate it against that implicit benchmark.
I think the (hidden) reasoning is that it is really easy to have speedups with slow interpreters. However, getting speedups in high-performance level programs is quite hard, mainly due to micro-optimisations.
That's where the comparison to Python comes from: getting speedup on slow interpreters is not very _relevant_. Now if your interpreter has the same optimisations as Python (or v8 or JVM), even a small fraction of what you show would be impressive.
Having said this, the work your team did is a really challenging engineering feat (and with a lot more potential). But I do not believe the current speedups will hold once the interpreter/compiler has the level of optimisation that exists in other languages. And while you do not claim it, people expect that.
I think the issue is that there is the implicit claim that this is faster than some alternative. Otherwise what's the point?
If you add some disclaimer like "Note: Bend is currently focused on correctness and scaling. On an absolute scale it may still be slower than single threaded Python. We plan to improve the absolute performance soon." then you won't see these comments.
Also this defensive tone does not come off well:
Right below install instructions, on Bend's README.md:
Second paragraph of Bend's GUIDE.md:
Limitations section of HVM2's paper:
Yeah exactly. I read most of the readme and watched the demo, but I'm not interested in installing it so I missed this. I would recommend moving this to the first section in its own paragraph.
I understand you might not want to focus on this but it's important information and not a bad thing at all.
That's great feedback actually, thank you.
We'll add the disclaimer before the install instructions instead!
Relatedly, the homepage itself doesn't make it obvious it's still alpha, or not ready, or not actually going to speed up your code this moment - claims like "automatically achieves near-ideal speedup, up to 1000+ threads" - the point is that it parallelizes code, but the word speedup makes it sound like my code will get 1000x faster.
I think you can avoid this kind of criticism by setting expectations better - just plastering a banner at the top saying that it’s in early stage development and not optimized, but that the future is bright, for example. The current headline saying it’s the “parallel future of computation” isn’t really enough to make people understand that the future isn’t here yet.
Same goes for the README, the fact that it’s not production ready per-se really ought to be at the top to set people’s expectations properly IMO, since a lot of people will not read the whole wall of text and just jump straight into trying it out once they’re on your GitHub page.
They’re critical since they are led to have much higher expectations than what actually exists today.
That said, this is a cool project and I wish you the best in making it really good!
It is not in alpha, nor is it "not ready". You can use it in production today, if you want to. It is just not fast. That is different. CPython is still 100x slower than C, and is widely deployed in practice.
Identifying what's parallelizable is valuable in the world of language theory, but pure functional languages are as trivial as it gets, so that research isn't exactly ground-breaking.
And you're just not fast enough for anyone doing HPC, where the problem is not identifying what can be parallelized, but figuring out how to make the most of the hardware, i.e. the codegen.
This approach is valuable because it abstracts away certain complexities for the user, allowing them to focus on the code itself. I found it especially beneficial for users who are not willing to learn functional languages or parallelize code in imperative languages. HPC specialists might not be the current target audience, and code generation can always be improved over time, and I trust based on the dev comments that it will be.
Perhaps you can add: "The codegen is still abysmal and single-core performance is bad - that's our next focus." as a disclaimer on the main page/videos/etc. This provides more context about what you claim and also very important what you don't (yet) claim.
Naive question: do you expect the linear scaling to hold with those optimisations to single core performance, or would performance diverge from linear there pending further research advancements?
from reply below:
I just want to say: don't stop. There will always be some people who don't notice or acknowledge the effort to be precise and truthful. But others will. For me, this attitude elevates the project to something I will be watching.
Thank you. I understand that in such an early iteration of a language there are going to be lots of bugs.
This seems like a very, very cool project and I really hope it or something like it is successful at making utilizing the GPU less cumbersome.
The claim from the website is "automatically achieves near-ideal speedup".
I think you were being absolutely precise, but I want to give a tiny bit of constructive feedback anyway:
In my experience, to not be misunderstood it is more important to understand the state of mind/frame of reference of your audience, than to be utterly precise.
The problem is, if you have been working on something for a while, it is extremely hard to understand how the world looks to someone who has never seen it (1).
The second problem is that when you hit a site like Hacker News your audience is impossibly diverse, and there isn't any one state of mind.
When I present research, it always takes many iterations of reflecting on both points to get to a good place.
(1) https://xkcd.com/2501/
Bitonic sort runs in 0m2.035s. Transpiled to C and compiled, it takes 0m0.425s.
That sum example, transpiled to C and compiled, takes 1m12.704s, so it looks like it's just the VM case that is having serious issues of some description!
I have no dog in this fight, but feel compelled to defend the authors here. Recursion does not test compute, rather it tests the compiler's/interpreter's efficiency at standing up and tearing down the call stack.
Clearly this language is positioned at using the gpu for compute-heavy applications and it's still in its early stages. Recursion is not the target application and should not be a relevant benchmark.
It's the author's own benchmark, but it shows the system doesn't work. Recursion isn't hard to eliminate for pure functions; there's nothing inherently wrong with it.
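For example, here is a sketch of the same sum with the recursion replaced by an explicit work list (same amount of work, just no call stack):

def sum_iterative(depth, x):
    # Walk the same binary tree as the recursive version, using our own stack.
    total = 0
    stack = [(depth, x)]
    while stack:
        d, v = stack.pop()
        if d == 0:
            total += v
        else:
            stack.append((d - 1, v * 2 + 0))
            stack.append((d - 1, v * 2 + 1))
    return total

print(sum_iterative(30, 0))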
The authors made an obviously false claim: that this is an extremely fast implementation, when it's not. That's really poor form.
Okay, no. I know I called out performance in my post, but that was just from my observations. It surprised me to see something be that much slower than pure Python. If you show me a near-Python code example in a new language, as someone who mostly writes Python code, I'm going to go and write it in Python and see how it compares performance-wise.
The authors never made any kind of false claims at all. You're reading a lot in to both their README and my post.
They've updated the README for a bit of clarity, but even re-reading the README as it was when I looked this morning (and even a few from before) it hasn't claimed to be fast. The claims are all related to the features that it does have, around parallelisation.
I don't get it. Are people on HN so completely ignorant of even the most basic CS101 concepts?
That's not how complexity theory works at all. If your constants are massive you can make anything look like it scales well.
Imagine a compiler that adds in a very large sleep that's inversely proportional to the number of cores. There you go. My compiler now emits code that scales linearly. Of course it's very slow. Where do I pick up my Turing award?
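A toy version of that in Python, just to make the point concrete (the sleep constant is invented):

import time

# A deliberately useless "parallel" function: a large constant cost that
# shrinks with the core count, so wall-clock time scales roughly linearly
# with cores even though no real work is being parallelized.
def fake_parallel_sum(n, cores):
    time.sleep(8.0 / cores)
    return n * (n - 1) // 2

for cores in (1, 2, 4, 8):
    start = time.perf_counter()
    fake_parallel_sum(1 << 30, cores)
    print(cores, "cores:", round(time.perf_counter() - start, 2), "s")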
I never thought that CS education was this bad.
Where did he claim it is fast? As far as I can see the only claim is that it scales linearly with cores. Which it actually seems to do.
Why `+0`, is this not a pointless no-op?
Yes, but when looking at the source it's more obvious this is a repeating pattern.
"Hey, I'm accessing the 0th element here, just want to make that clear"
Without the +0, that statement looks disconnected from the +1 even though conceptually it's the same.
Say somebody adds some special marker/tombstone/whatever into element 0 and now all those additions need to be bumped up by one. Someone else may go and see the +1, +2, +3 and just change them to +2, +3, +4, etc., while completely missing the lone variable by itself, as it's visually dissimilar.
I've usually seen it used in longer lists of statements. It also keeps everything lined up formatting-wise.
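A made-up example of the kind of longer run where that reads nicely (field names and layout invented for illustration):

data = bytes(range(16))
base = 4

# The explicit offsets keep the lines visually parallel, so a later shift
# (say, everything moves up by one) touches every line in the same way.
magic   = data[base + 0]
version = data[base + 1]
flags   = data[base + 2]
length  = data[base + 3]
print(magic, version, flags, length)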
Python is really bad at recursion (part of why it's not appropriate for functional programming), so perhaps an unfair benchmark?
A Pythonic implementation would use loops and mutation.
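For instance, here is a sketch of a loop-based version, using the fact that the recursion just enumerates the leaf values x*2^depth + i for i in 0..2^depth-1:

def sum_loop(depth, x):
    # Enumerate the leaves of the recursion tree directly, no call stack.
    base = x << depth
    total = 0
    for i in range(1 << depth):
        total += base + i
    return total

print(sum_loop(30, 0))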
"Thread" term in GPUs and CPUs mean different things, it's more like a SIMD lane in GPUs. A bit like ISPC can compile your code so there's eg 32 invocations of your function per CPU thread running on the same time (if you're using 16-bit datums on AVX512), and you could have 2048 executions going on after multiplying up 32 cores * 2 SMT threads/core * 32 compiler executions.