Fast and Portable Llama2 Inference on the Heterogeneous Edge

bugglebeetle
13 replies
2d12h

Wow, this is a “holy shit” moment for Rust in AI applications if this works as described. Also, so long Mojo!

EDIT:

Looks like I’m wrong, but I appreciate getting schooled by all the HNers with low-level expertise. Lots to go and learn about now.

hnfong
6 replies
2d12h

It's "just" a port of GGML (written in C++) to wasm with some additional Rust code.

bugglebeetle
5 replies
2d12h

Right, but if the port achieves performance gains over GGML, which is already highly performant, that’s a) wild and b) a signal to move further GGML development into Rust, no?

cozzyd
1 replies
2d11h

As far as I understand, only the "driver" code is in rust. Everything else is just C++ compiled to WASM. Maybe it's slightly better to have the driver code be in rust than python or scheme or whatever, but I imagine C++ would be basically equivalent (and.... you wouldn't have to go through the trouble of compiling to WASM which likely loses significant performance).

kamray23
0 replies
2d8h

That's what I find weird here. The bit of the code written in Rust is almost comically tiny, and the rest is just C++ that someone else already wrote, compiled to WASM. I think comparing this to a Python wrapper for the same code would show a very minimal difference in performance, because the majority of the time is spent in the inference code, and formatting the prompt string really isn't that complex of a task. I just don't see what advantage Rust provides here other than the fact that it's a language you can compile to WASM so that you have one binary.

tomalbrc
0 replies
2d9h

There is no mention of it running faster than the original llama.cpp; if anything it is slower.

brrrrrm
0 replies
2d12h

ML has extremely predictable and heavily optimized routines. Languages that can target hardware ISA all tend to have comparable perf and there’s no reason to think Rust would offer much.

Nevin1901
0 replies
2d12h

How would wasm/rust be more performant over c++? I’m not sure the wasm version can take advantage of avx/metal.

Edit: the wasm installer does take advantage by installing plugins.

Unless you’re talking about performance on devices where those two weren’t a thing anyways.

est
3 replies
2d11h

> this is a “holy shit” moment for Rust in AI applications

Yeah because I realized the 2MB is just a wrapper that reads stdin and offloads everything to wasi-nn API.

> The core Rust source code is very simple. It is only 40 lines of code. The Rust program manages the user input, tracks the conversation history, transforms the text into the llama2’s chat template, and runs the inference operations using the WASI NN API.

You can do the same using Python with fewer lines of code and maybe smaller executable size.
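
For reference, the whole thing boils down to roughly this shape (my own rough sketch, not the project's actual code; `infer` is a stand-in for the WASI-NN set-input/compute/get-output calls, and the exact chat-template details may differ):

    use std::io::{self, Write};

    // Stand-in for the WASI-NN backend call (hypothetical): the real program
    // hands the prompt to a loaded GGML graph and reads the completion back.
    fn infer(prompt: &str) -> String {
        format!("<model reply to a {}-byte prompt>", prompt.len())
    }

    fn main() {
        let system = "You are a helpful assistant.";
        let mut history: Vec<(String, String)> = Vec::new(); // (user, assistant) turns

        loop {
            print!("You: ");
            io::stdout().flush().unwrap();
            let mut line = String::new();
            if io::stdin().read_line(&mut line).unwrap() == 0 {
                break; // EOF
            }
            let user = line.trim().to_string();

            // Rebuild the llama2 chat template from the whole history each turn.
            let mut prompt = format!("<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n");
            for (u, a) in &history {
                prompt.push_str(&format!("{u} [/INST] {a} </s><s>[INST] "));
            }
            prompt.push_str(&format!("{user} [/INST]"));

            let answer = infer(&prompt);
            println!("Bot: {answer}");
            history.push((user, answer));
        }
    }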

gumby
2 replies
2d11h

Pretty damning if 40 lines of rust to read stdin generates a 2 MB binary!

lakpan
1 replies
2d10h

Presumably that also accounts for the WASM itself

gpderetta
0 replies
2d3h

Indeed. I hope it does include the WASM VM.

blovescoffee
0 replies
2d12h

No it's not. This does nothing to minimize the size of the models that inference is being run on. It's cool for edge applications, kind of. And Rust is already a go-to tool for edge.

3Sophons
0 replies
2d12h

yeah excited to see how this will evolve. BTW, maybe give it a try on your Mac and see how it performs.

FL33TW00D
11 replies
2d11h

This is just wrapping llama.cpp, right? I’m sorry but I’m pretty tired of projects wrapping x.cpp.

I’ve been developing a Rust + WebGPU ML framework for the past 6 months. I’ve learned quickly how impressive the work by GG is.

It’s early stages but you can check it out here:https://www.ratchet.sh/https://github.com/FL33TW00D/whisper-turbo

hluska
4 replies
2d10h

You ripped on someone else’s work and promoted your own in the same comment?? You need to seriously reflect upon your ethics.

FL33TW00D
3 replies
2d9h

I appreciate the work that went into slimming this binary down, but it's a ~negligible amount of work compared to llama.cpp itself.

HN is inundated with posts doing xyz on top of the x.cpp community. Whilst I appreciate it is exciting - I wish more people would explore the low-level themselves! We can be much more creative in this new playground.

spiderfarmer
0 replies
2d9h

Why not both.

rgbrgb
0 replies
2d1h

Isn’t doing Stuff on top of it what it’s for? llama.cpp is explicitly highly portable systems code with bindings in many languages.

meiraleal
0 replies
2d5h

> We can be much more creative in this new playground.

You are not being creative gatekeeping, this behavior is quite old.

tansan
2 replies
2d9h

Who's GG?

europeanNyan
0 replies
2d9h

0xDEADFED5
0 replies
2d4h

Me1000
1 replies
2d

This is really cool! Thank you for sharing! Excited to follow your progress!

FL33TW00D
0 replies
1d16h

Thank you!

stavros
0 replies
2d8h

Can you elaborate on what you find impressive? I know nothing about this stuff so I can't appreciate it.

jasonjmcghee
9 replies
2d11h

Confused about the title rewrite from “Fast and Portable Llama2 Inference on the Heterogeneous Edge” which more clearly communicates what this article is about - a wasm version of llama.cpp.

I feel like editorializing to highlight the fact that it’s 2MB and runs on a Mac misses some of the core aspects of the project and write-up.

PUSH_AX
3 replies
2d9h

Now I’m confused, because neither of the titles _clearly_ communicate that it’s a wasm version of llama.cpp in my opinion.

It would probably be helpful to use the words “wasm” and “llama” to achieve that

stavros
2 replies
2d8h

"Run LlaMA 2 on WASM in 2 MB RAM"

This has the added advantage of being completely gibberish to someone outside tech.

Edit: wait, it's not RAM, the binary is just 2 MB. That's disappointing.

wongarsu
0 replies
2d3h

The article is complete gibberish to someone outside tech, so if the role of the title is to describe the article to its intended audience yours is a lot better.

Of course if you intend to communicate to non-tech people that you write relevant cutting-edge articles, then choosing a title like "Fast and Portable Llama2 Inference on the Heterogeneous Edge" does the job much better. Maybe even add the words sustainable and IoT somewhere.

grumpy_tired
0 replies
2d6h

Beautiful. I wish all titles on HN were this concise.

doubloon
2 replies
2d9h

Well, it requires Nvidia, so maybe it's not actually portable.

threeseed
0 replies
2d8h

It also works with Metal hence why they mention it runs on Mac.

3Sophons
0 replies
2d7h

I think 'portable' in the article refers to the software's ability to run across various operating systems or environments, rather than to its hardware dependencies. This means that while the software can be installed and run on different OSs, certain hardware-specific optimizations (like those for Nvidia GPUs using CUDA) are still necessary to achieve the best performance.

dang
0 replies
1d22h

Thanks - a mod replaced the title last night. (Submitted title was "Run LLMs on my own Mac fast and efficient! Only 2 MBs.")

Submitters: "Please use the original title, unless it is misleading or linkbait; don't editorialize." - https://news.ycombinator.com/newsguidelines.html

3Sophons
0 replies
2d11h

OK... it should be "run LLMs on my own devices with a 2MB portable app" then?

reidjs
8 replies
2d12h

Can I run this offline on my iPhone? That would be like having basic internet search regardless of reception. Could come in handy when camping

woadwarrior01
3 replies
2d7h

I have a successful-ish commercial iOS app[0] for that. I'd originally built it using ggml, and then subsequently ported it to be based on mlc-llm when I found it.

[0]: https://apps.apple.com/us/app/private-llm/id6448106860

JKCalhoun
2 replies
2d3h

Says MacOS 13 when I followed your link. Too bad I'm still on MacOS 12. (Is there a reason to require MacOS 13?)

woadwarrior01
1 replies
1d20h

No specific reason, but SwiftUI improved tremendously between macOS 12 and 13, and I use a couple of the newer SwiftUI features. Also, if I could go back, I’d rather not support Intel Macs. I’d built the original version of the app on an Intel Mac 6 months ago, but the performance difference between Intel Macs and Apple Silicon Macs for LLM inference with Metal is night and day. Apple won’t let me drop support for Intel Macs now, so I’ll begrudgingly support it.

JKCalhoun
0 replies
1d13h

Too bad about SwiftUI (not being as good on 12), but that's fair.

SparkyMcUnicorn
1 replies
2d12h

I got this project[0] running on a Pixel. Looks like it works on some iPhones/iPads as well.

[0]: https://github.com/mlc-ai/mlc-llm

simonw
0 replies
2d10h

Yeah I've been using their iPhone app for a while - it works great, though it does make the phone run pretty hot while it's outputting tokens!

https://llm.mlc.ai/#ios

throwaway154
0 replies
2d4h

You'd probably be better off downloading an edition of wikipedia for that purpose. Entropy, and stuff.

3Sophons
0 replies
2d12h

You can run it on a variety of Linux, Mac and Windows based devices, including the Raspberry Pi and most laptops / servers you might have. But you still need a few GBs of memory in order to fit the model itself.

diimdeep
8 replies
2d12h

I do not see the point of using this instead of directly using llama.cpp.

3Sophons
5 replies
2d11h

llama.cpp typically needs to be compiled separately for each operating system and architecture (Windows, macOS, Linux, etc.), which is less portable.

Also, the article mentions the use of hardware acceleration on devices with heterogeneous hardware accelerators. This implies that the Wasm-compiled program can efficiently utilize different hardware resources (like GPUs and specialized AI chips) across various devices. A direct C++ implementation might require specific optimizations or versions for each type of hardware to achieve similar performance.

diimdeep
2 replies
2d11h

> Wasm-compiled program can efficiently utilize different hardware resources (like GPUs and specialized AI chips) across various devices

I do not buy it, but maybe I am ignorant of progress being made there.

> A direct C++ implementation might require specific optimizations or versions for each type of hardware to achieve similar performance.

Because I do not buy the previous one, I do not buy that similar performance can be achieved there painlessly (without extra developer time), or that a wasm runtime is capable of achieving it.

zmmmmm
1 replies
2d10h

So the magic (or sleight of hand, if you prefer) seems to be in

> You just need to install the WasmEdge with the GGML plugin.

And it turns out that all these plugins are native & specific to the acceleration environment as well. But this has to happen after it lands in its environment so your "portable" application is now only portable in the sense that once it starts running it will bootstrap itself by downloading and installing native platform-specific code from the internet. Whether that is a reasonable thing for an "edge" application to do I am not sure.

kamray23
0 replies
2d8h

Basically, WASM is now what the JVM was in 2000. It's portable because it is.

tomalbrc
0 replies
2d9h

Just use Cosmopolitan at this point.

pjmlp
0 replies
2d10h

Where have I seen this WORA before, including for C and C++?

WASM does not provide access to hardware acceleration on devices with heterogeneous hardware accelerators, even its SIMD bytecodes are a subset of what most CPUs are capable of.

kelseyfrog
1 replies
2d11h

Hint: the Rewrite-it-in-Rust economy's currency isn't actually running things.

gumby
0 replies
2d11h

The crypto of programming languages?

wokwokwok
5 replies
2d8h

Mmm…

The wasi-nn that this relies on (https://github.com/WebAssembly/wasi-nn) is a proposal that relies on sending arbitrary chunks to some vendor implementation. The API is literally like: set input, compute, set output.

…and that is totally non portable.
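
To make that concrete, the entire host-side surface looks roughly like this (a stand-alone sketch against a made-up trait, since the real binding names and signatures vary by crate and runtime):

    // Made-up trait mirroring the shape of the wasi-nn proposal; not the real
    // bindings, just an illustration of how little the host actually sees.
    struct Graph;
    struct Ctx;

    trait WasiNnLike {
        fn load(&self, model_blob: &[u8], encoding: &str, target: &str) -> Graph;
        fn init_execution_context(&self, graph: &Graph) -> Ctx;
        fn set_input(&self, ctx: &mut Ctx, index: u32, tensor: &[u8]);
        fn compute(&self, ctx: &mut Ctx);
        fn get_output(&self, ctx: &Ctx, index: u32, out: &mut Vec<u8>) -> usize;
    }

    // The host passes opaque bytes in and gets opaque bytes out; whether the
    // vendor plugin can actually execute that blob on this hardware is
    // entirely its problem, which is the portability complaint above.
    fn run(nn: &impl WasiNnLike, model_blob: &[u8], prompt: &[u8]) -> Vec<u8> {
        let graph = nn.load(model_blob, "ggml", "auto");
        let mut ctx = nn.init_execution_context(&graph);
        nn.set_input(&mut ctx, 0, prompt);
        nn.compute(&mut ctx);
        let mut out = Vec::new();
        let _bytes = nn.get_output(&ctx, 0, &mut out);
        out
    }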

The reason this works is because it’s relying on the abstraction already implemented in llama.cpp that allows it to take a gguf model and map it to multiple hardware targets, which you can see has been lifted as-is into WasmEdge here: https://github.com/WasmEdge/WasmEdge/tree/master/plugins/was...

So..

> Developers can refer to this project to write their machine learning application in a high-level language using the bindings, compile it to WebAssembly, and run it with a WebAssembly runtime that supports the wasi-nn proposal, such as WasmEdge.

Is total rubbish; no, you can’t.

This isn’t portable.

It’s not sandboxed.

It’s not a HAL.

If you have a wasm binary you might be able to run it if the version of the runtime you’re using happens to implement the specific ggml backend you need, which it probably doesn’t… because there’s literally no requirement for it to do so.

…and if you do, you’re just calling the llama.cpp ggml code, so it’s as safe as that library is…

There’s a lot of “so portable” and “such rust” talk in this article which really seems misplaced; this doesn’t seem to have the benefits of either of those two things.

Let’s imagine you have some new hardware with a WASI runtime on it, can you run your model on it? Does it have GPU support?

Well, turns out the answer is “go and see if llama.cpp compiles on that platform with GPU support and if the runtime you’re using happens have a ggml plugin in it and happens to have a copy of that version of ggml vendored in it, and if not, then no”.

..at which point, wtf are you even using WASI for?

Cross platform GPU support is hard, but this… I dunno. It seems absolutely ridiculous.

Imagine if webGPU was just “post some binary chunk to the GPU and maybe it’ll draw something or whatever if it’s the right binary chunk for the current hardware.”

That’s what this is.

ikurei
3 replies
2d5h

Could you please elaborate on the security implications?

wokwokwok
2 replies
2d2h

It’s as secure as any C++ backend that performs no input validation.

Ie. whatever memory safety or sandbox you had from using wasm or rust is gone when you use it.

jart
1 replies
2d1h

The llama.cpp author thinks security is "very low priority and almost unnecessary": https://github.com/ggerganov/llama.cpp/pull/651#pullrequestr... So I'm not sure why a sandbox would bundle llama.cpp and claim to be secure. They would need more evidence than this to make such a claim.

halyconWays
0 replies
16h17m

This user was caught stealing code and banned from llama.cpp by its creator: https://news.ycombinator.com/item?id=35411909

anentropic
0 replies
2d4h

Thanks for clarifying, I was wondering where they were getting GPU support in WASM from...

hnarayanan
5 replies
2d12h

If a large part of the size is essentially the trained weights of a model, how can one reduce the size by orders of magnitude (without losing any accuracy)?

3Sophons
3 replies
2d12h

Hello, you might be talking about reducing the size of the model itself (i.e., the trained weights) by orders of magnitude without losing accuracy; that's indeed a different challenge. But the article discusses reducing the inference app size by 100x.

hnarayanan
2 replies
2d12h

Oh. Did not think that was even a goal.

3Sophons
1 replies
2d12h

I guess making it portable is still quite important?

hnarayanan
0 replies
2d8h

I am not trying to troll. I genuinely don’t see why a few MB on some binary matter when the models are multiple GB large. This is why I fundamentally misunderstood the article: my brain was looking for the other number going down, as that’s genuinely a barrier for edge devices.

rgbrgb
0 replies
2d12h

I don't think you can reduce size without losing accuracy (though I think quantized GGUFs are great). But the 2 MB size here is a reference to the program size not including a model. It looks like it's a way to run llama.cpp with wasm + a rust server that runs llama.cpp.

I like the tiny llama.cpp/examples/server and embed it in FreeChat, but always happy for more tooling options.

Edit: Just checked, the arm64/x86 executable I embed is currently 4.2 MB. FreeChat is 12.1 MB but the default model is ~3 GB so I'm not really losing sleep over 2 MB.

[0]: https://github.com/ggerganov/llama.cpp/tree/master/examples/...

oersted
3 replies
2d6h

I'm all for Rust and WASM, but if you look at the code it's just 150 lines of a basic Rust command-line script. All the heavy lifting is done by a single line passing the model to the WASI-NN backend, which in this case is provided by the WasmEdge runtime, which incidentally is C++, not Rust.

Rust is bringing zero advantage here really, the backend could be called from Python or anything else.

whywhywhywhy
2 replies
2d3h

Seems like the advantage it is bringing is in bundling; shipping Python and PyTorch into something an end user can double-click and run is currently a complete mess.

Of course the actual high powered code is C++ in both cases but shipping 2+GB and 10s of thousands of files just to send some instructions to that C++ could benefit from being one 2MB executable instead.

wrsh07
0 replies
2d2h

If you replaced the rust with any other language (including python) you shouldn't need pytorch because the rust code is using ggml (which is cpp)

oersted
0 replies
2d3h

Yes that makes sense.

I am not familiar enough with llama.cpp, but from what I see they have mostly copy-pasted it into WasmEdge for the WASI-NN implementation.

Surely a simple compiled binary of llama.cpp is better than Rust compiled to WASM plus the WasmEdge runtime binary wrapping the same llama.cpp.

It wouldn't be more portable either, all the heterogeneous hardware acceleration support is part of llama.cpp not WasmEdge.

I guess theoretically if the WASI-NN proposal is standardized, other WASM runtimes could implement their own backends. It is a decent abstraction to cleanly expand portability and for optimizing for specific instrastructure.

But at this point it doesn't have much to do with Rust or WASM. It's just the same old concept of portability via bytecode runtimes like the JVM or, indeed, the Python interpreter with native extensions (libraries).

ed
2 replies
2d12h

Whoa! Great work. To other folks checking it out, it still requires downloading the weights, which are pretty large. But they essentially made a fully portable, no-dependency llama.cpp, in 2mb.

If you're an app developer this might be the easiest way to package an inference engine in a distributable file (the weights are already portable and can be downloaded on-demand — the inference engine is really the part you want to lock down).
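
For example, a launcher can stay tiny by shipping no weights and just checking for them at startup; a minimal sketch with hypothetical paths and no real download logic:

    use std::path::PathBuf;

    // Hypothetical cache location; a real launcher would make this configurable
    // and fetch the GGUF on demand before handing it to the engine.
    fn model_path() -> PathBuf {
        PathBuf::from("models/llama-2-7b-chat.Q4_K_M.gguf")
    }

    fn main() {
        let path = model_path();
        if path.exists() {
            println!("loading weights from {}", path.display());
            // ...pass the file to the bundled inference engine here...
        } else {
            eprintln!(
                "model not found at {}; download a GGUF build and place it there",
                path.display()
            );
        }
    }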

kristianp
0 replies
2d10h

It might be more helpful if the title says 2MB of wasm. But as you say, the weights dwarf that.

andy99
0 replies
2d

The `main` file that llama.cpp builds is 1.2MB on my machine. The 2MB size isn't anything particularly impressive. Targeting wasm makes it more portable; otherwise there isn't some special extra compactness here.

rjzzleep
1 replies
2d11h

Is there any detailed info on how a 4090 + ryzen 7840 compares to any of the new Apple offerings with 64GB or more unified RAM?

renewiltord
0 replies
2d10h

No. You just have to try it. Anecdotally, I can fit a larger Llama on my M1 Max with 64 GiB than my 3090 with 24 GiB.

est
1 replies
2d11h

> The core Rust source code is very simple. It is only 40 lines of code. The Rust program manages the user input, tracks the conversation history, transforms the text into the llama2’s chat template, and runs the inference operations using the WASI NN API.

TL;DR a 2MB executable that reads stdin and calls WASI-NN

isoprophlex
0 replies
2d10h

"Rust is the language of AGI."

Oh Rust Evangelism Strike Force, never change

dkga
1 replies
2d11h

Very cool, but unless I missed it could someone please explain why not just compile a Rust application? Is the Wasm part needed for the GPU acceleration (whatever the user GPU is?)

bouke
0 replies
2d11h

I suppose wasm provides the portability between platforms. Compile once, run everywhere.

behnamoh
1 replies
2d10h

The way things are going, we'll see more efficient and faster methods to run transformer arch on edge, but I'm afraid we're approaching the limit because you can't just rust your way out of the VRAM requirements, which is the main bottleneck in loading large-enough models. One might say "small models are getting better, look at Mistral vs. llama 2", but small models are also approaching their capacity (there's only so much you can put in 7b parameters).
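
Rough napkin math for why no wrapper language changes this (a trivial sketch of weight memory alone; real usage adds KV cache and runtime overhead):

    // Napkin math: weight memory alone (decimal GB), ignoring KV cache and overhead.
    fn main() {
        for (name, params) in [("7B", 7.0e9_f64), ("13B", 13.0e9), ("70B", 70.0e9)] {
            let fp16_gb = params * 2.0 / 1e9; // 2 bytes per weight
            let q4_gb = params * 0.5 / 1e9;   // ~4 bits per weight
            println!("{name}: fp16 ≈ {fp16_gb:.1} GB, 4-bit ≈ {q4_gb:.1} GB");
        }
    }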

I don't know man, this approach to AI doesn't "feel" like it'll lead to AGI—it's too inefficient.

danielbln
0 replies
2d8h

I think we have plenty of headroom with MoE systems, dynamically loading LoRAs and such, even with the small models.

anon23432343
1 replies
2d9h

So you need 2MB for sending an API call to the edge?

Okaayyyy...

kamray23
0 replies
2d8h

This is offline.

anentropic
1 replies
2d4h

> the Mac OS build of the GGML plugin uses the Metal API to run the inference workload on M1/M2/M3’s built-in neural processing engines

I don't think that's accurate (someone please correct me...)

GGML's use of the Metal API means it runs on the M1/2/3 GPU and not the neural engine

Which is all good, but for sake of being pedantic...

btown
0 replies
2d3h

Not pedantic at all! https://github.com/ggerganov/llama.cpp/discussions/336 is a (somewhat rambling) discussion on whether it would even be worthwhile to use the neural engine specifically, beyond the GPU.

tomalbrc
0 replies
2d9h

> No wonder Elon Musk said that Rust is the language of AGI.

What.

thih9
0 replies
2d6h

> the binary application (only 2MB) is completely portable across devices with heterogeneous hardware accelerators.

What does “heterogeneous hardware accelerators” mean in practice?

syrusakbary
0 replies
2d9h

Congrats on the work... it's an impressive demo!

It may be worth researching adding support for it to the Wasmer WebAssembly runtime [1]. (Note: I work at Wasmer!)

[1]: https://wasmer.io/

rowanG077
0 replies
2d8h

I don't think you can call anything wasm efficient.

nigma
0 replies
2d9h

I hate this kind of clickbait marketing suggesting the project delivers 1/100 of the size or 100x-35000x the speed of other solutions because it uses a different language for a wrapper around the core library, while completely neglecting the tooling and community expertise built around other solutions.

First of all, the project is based on llama.cpp[1], which does the heavy work of loading and running multi-GB model files on the GPU/CPU, and the inference speed is not limited by the wrapper choice (there are other wrappers in Go, Python, Node, Rust, etc., or one can use llama.cpp directly). The size of the binary is also not that important when common quantized model files are often in the range of 5GB-40GB and require a beefy GPU or a motherboard with 16-64GB of RAM.

[1]: https://github.com/ggerganov/llama.cpp

hedgehog
0 replies
2d11h

It looks like this is Rust for the application wrapped around a WASM port of llama.cpp that in turn uses an implementation of WASI-NN for the actual NN compute. It would be interesting to see how this compares to TFLite, the new stuff in the PyTorch ecosystem, etc.

gvand
0 replies
2d10h

The binary size is not really important in this case; llama.cpp should not be that far from this. What matters, as we all know, is how much GPU memory we need.

danielEM
0 replies
2d10h

I'm getting lost in all that.

Using llama.cpp and mlc-llm, both on my 2-year-old mobile Ryzen APU with 64GB of RAM. The first does not use the GPU at all; I tried plenty of options, nothing worked, but llama 34B runs - painfully slowly, but it does work. The second works on top of Vulkan; I didn't take any precise measurements, but its limit looks like 32GB of RAM (so no llama 34B). It does offload the CPU, though unfortunately performance seems similar to the CPU (that is my perception, I didn't take any measurements here either).

So ... will I get any benefit from switching to the rust/webassembly version???

classified
0 replies
2d7h

How is it still fast if it was compiled to WASM?

antirez
0 replies
2d6h

Linkbait at its finest. But it's true that the Python AI stack sucks big time.