Fast and Portable Llama2 Inference on the Heterogeneous Edge

bugglebeetle
13 replies
2d12h

Wow, this is a “holy shit” moment for Rust in AI applications if this works as described. Also, so long Mojo!

EDIT:

Looks like I’m wrong, but I appreciate getting schooled by all the HNers with low-level expertise. Lots to go and learn about now.

hnfong
6 replies
2d12h

It's "just" a port of GGML (written in C++) to wasm with some additional Rust code.

bugglebeetle
5 replies
2d12h

Right, but if the port achieves performance gains over GGML, which is already highly performant, that’s a) wild and b) a signal to move further GGML development into Rust, no?

cozzyd
1 replies
2d11h

As far as I understand, only the "driver" code is in rust. Everything else is just C++ compiled to WASM. Maybe it's slightly better to have the driver code be in rust than python or scheme or whatever, but I imagine C++ would be basically equivalent (and.... you wouldn't have to go through the trouble of compiling to WASM which likely loses significant performance).

kamray23
0 replies
2d8h

That's what I find weird here. The bit of the code written in Rust is almost comically tiny, and the rest is just C++ that someone else already wrote, compiled to WASM. I think comparing this to a Python wrapper for the same code would show a very minimal difference in performance, because the majority of the time is spent in the inference code, and formatting the prompt string really isn't that complex of a task. I just don't see what advantage Rust provides here other than the fact that it's a language you can compile to WASM so that you have one binary.

tomalbrc
0 replies
2d9h

There is no mention of it running faster than the original llama.cpp; if anything it is slower.

brrrrrm
0 replies
2d12h

ML has extremely predictable and heavily optimized routines. Languages that can target hardware ISA all tend to have comparable perf and there’s no reason to think Rust would offer much.

Nevin1901
0 replies
2d12h

How would wasm/rust be more performant over c++? I’m not sure the wasm version can take advantage of avx/metal.

Edit: the wasm installer does take advantage by installing plugins.

Unless you’re talking about performance on devices where those two weren’t a thing anyways.

est
3 replies
2d11h

> this is a “holy shit” moment for Rust in AI applications

Yeah because I realized the 2MB is just a wrapper that reads stdin and offloads everything to wasi-nn API.

> The core Rust source code is very simple. It is only 40 lines of code. The Rust program manages the user input, tracks the conversation history, transforms the text into the llama2’s chat template, and runs the inference operations using the WASI NN API.

You can do the same using Python with fewer lines of code and maybe smaller executable size.
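
For reference, the whole thing boils down to roughly this shape (my own rough sketch, not the project's actual code; `infer` is a stand-in for the WASI-NN set-input/compute/get-output calls, and the exact chat-template details may differ):

    use std::io::{self, Write};

    // Stand-in for the WASI-NN backend call (hypothetical): the real program
    // hands the prompt to a loaded GGML graph and reads the completion back.
    fn infer(prompt: &str) -> String {
        format!("<model reply to a {}-byte prompt>", prompt.len())
    }

    fn main() {
        let system = "You are a helpful assistant.";
        let mut history: Vec<(String, String)> = Vec::new(); // (user, assistant) turns

        loop {
            print!("You: ");
            io::stdout().flush().unwrap();
            let mut line = String::new();
            if io::stdin().read_line(&mut line).unwrap() == 0 {
                break; // EOF
            }
            let user = line.trim().to_string();

            // Rebuild the llama2 chat template from the whole history each turn.
            let mut prompt = format!("<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n");
            for (u, a) in &history {
                prompt.push_str(&format!("{u} [/INST] {a} </s><s>[INST] "));
            }
            prompt.push_str(&format!("{user} [/INST]"));

            let answer = infer(&prompt);
            println!("Bot: {answer}");
            history.push((user, answer));
        }
    }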

gumby
2 replies
2d11h

Pretty damning if 40 lines of rust to read stdin generates a 2 MB binary!

lakpan
1 replies
2d10h

Presumably that also accounts for the WASM itself

gpderetta
0 replies
2d3h

Indeed. I hope it does include the WASM VM.

blovescoffee
0 replies
2d12h

No it's not. This does nothing to minimize the size of the models that inference is being run on. It's cool for edge applications, kind of. And Rust is already a go-to tool for edge.

3Sophons
0 replies
2d12h

yeah excited to see how this will evolve. BTW, maybe give it a try on your Mac and see how it performs.

FL33TW00D
11 replies
2d11h

This is just wrapping llama.cpp, right? I’m sorry but I’m pretty tired of projects wrapping x.cpp.

I’ve been developing a Rust + WebGPU ML framework for the past 6 months. I’ve learned quickly how impressive the work by GG is.

It’s early stages but you can check it out here:https://www.ratchet.sh/https://github.com/FL33TW00D/whisper-turbo

hluska
4 replies
2d10h

You ripped on someone else’s work and promoted your own in the same comment?? You need to seriously reflect upon your ethics.

FL33TW00D
3 replies
2d9h

I appreciate the work that went into slimming this binary down, but it's a ~negligible amount of work compared to llama.cpp itself.

HN is inundated with posts doing xyz on top of the x.cpp community. Whilst I appreciate it is exciting - I wish more people would explore the low-level themselves! We can be much more creative in this new playground.

spiderfarmer
0 replies
2d9h

Why not both.

rgbrgb
0 replies
2d1h

Isn’t doing Stuff on top of it what it’s for? llama.cpp is explicitly highly portable systems code with bindings in many languages.

meiraleal
0 replies
2d5h

> We can be much more creative in this new playground.

You are not being creative gatekeeping, this behavior is quite old.

tansan
2 replies
2d9h

Who's GG?

europeanNyan
0 replies
2d9h

0xDEADFED5
0 replies
2d4h

Me1000
1 replies
2d

This is really cool! Thank you for sharing! Excited to follow your progress!

FL33TW00D
0 replies
1d16h

Thank you!

stavros
0 replies
2d8h

Can you elaborate on what you find impressive? I know nothing about this stuff so I can't appreciate it.

jasonjmcghee
9 replies
2d11h

Confused about the title rewrite from “Fast and Portable Llama2 Inference on the Heterogeneous Edge” which more clearly communicates what this article is about - a wasm version of llama.cpp.

I feel like editorializing to highlight the fact that it’s 2MB and runs on a Mac misses some of the core aspects of the project and write-up.

PUSH_AX
3 replies
2d9h

Now I’m confused, because neither of the titles _clearly_ communicate that it’s a wasm version of llama.cpp in my opinion.

It would probably be helpful to use the words “wasm” and “llama” to achieve that

stavros
2 replies
2d8h

"Run LlaMA 2 on WASM in 2 MB RAM"

This has the added advantage of being completely gibberish to someone outside tech.

Edit: wait, it's not RAM, the binary is just 2 MB. That's disappointing.

wongarsu
0 replies
2d3h

The article is complete gibberish to someone outside tech, so if the role of the title is to describe the article to its intended audience yours is a lot better.

Of course if you intend to communicate to non-tech people that you write relevant cutting-edge articles, then choosing a title like "Fast and Portable Llama2 Inference on the Heterogeneous Edge" does the job much better. Maybe even add the words sustainable and IoT somewhere.

grumpy_tired
0 replies
2d6h

Beautiful. I wish all titles on HN were this concise.

doubloon
2 replies
2d9h

Well, it requires Nvidia, so maybe it's not actually portable.

threeseed
0 replies
2d8h

It also works with Metal hence why they mention it runs on Mac.

3Sophons
0 replies
2d7h

I think 'portable' in the article refers to the software's ability to run across various operating systems or environments, rather than to its hardware dependencies. This means that while the software can be installed and run on different OSs, certain hardware-specific optimizations (like those for Nvidia GPUs using CUDA) are still necessary to achieve the best performance.

dang
0 replies
1d22h

Thanks - a mod replaced the title last night. (Submitted title was "Run LLMs on my own Mac fast and efficient! Only 2 MBs.")

Submitters: "Please use the original title, unless it is misleading or linkbait; don't editorialize." - https://news.ycombinator.com/newsguidelines.html

3Sophons
0 replies
2d11h

OK... it should be "run LLMs on my own devices with a 2MB portable app" then?

reidjs
8 replies
2d12h

Can I run this offline on my iPhone? That would be like having basic internet search regardless of reception. Could come in handy when camping

woadwarrior01
3 replies
2d7h

I have a successful-ish commercial iOS app[0] for that. I'd originally built it using ggml, and then subsequently ported it to be based on mlc-llm when I found it.

[0]: https://apps.apple.com/us/app/private-llm/id6448106860

JKCalhoun
2 replies
2d3h

Says MacOS 13 when I followed your link. Too bad I'm still on MacOS 12. (Is there a reason to require MacOS 13?)

woadwarrior01
1 replies
1d20h

No specific reason, but SwiftUI improved tremendously between macOS 12 and 13, and I use a couple of the newer SwiftUI features. Also, if I could go back, I’d rather not support Intel Macs. I’d built the original version of the app on an Intel Mac 6 months ago, but the performance difference between Intel Macs and Apple Silicon Macs for LLM inference with Metal is night and day. Apple won’t let me drop support for Intel Macs now, so I’ll begrudgingly support it.

JKCalhoun
0 replies
1d13h

Too bad about SwiftUI (not being as good on 12), but that's fair.

SparkyMcUnicorn
1 replies
2d12h

I got this project[0] running on a Pixel. Looks like it works on some iPhones/iPads as well.

[0]: https://github.com/mlc-ai/mlc-llm

simonw
0 replies
2d10h

Yeah I've been using their iPhone app for a while - it works great, though it does make the phone run pretty hot while it's outputting tokens!

https://llm.mlc.ai/#ios

throwaway154
0 replies
2d4h

You'd probably be better off downloading an edition of wikipedia for that purpose. Entropy, and stuff.

3Sophons
0 replies
2d12h

You can run it on a variety of Linux, Mac and Windows based devices, including the Raspberry Pi and most laptops / servers you might have. But you still need a few GBs of memory in order to fit the model itself.

diimdeep
8 replies
2d12h

I do not see the point of using this instead of directly using llama.cpp.

3Sophons
5 replies
2d11h

llama.cpp typically needs to be compiled separately for each operating system and architecture (Windows, macOS, Linux, etc.), which is less portable.

Also, the article mentions the use of hardware acceleration on devices with heterogeneous hardware accelerators. This implies that the Wasm-compiled program can efficiently utilize different hardware resources (like GPUs and specialized AI chips) across various devices. A direct C++ implementation might require specific optimizations or versions for each type of hardware to achieve similar performance.

diimdeep
2 replies
2d11h

> Wasm-compiled program can efficiently utilize different hardware resources (like GPUs and specialized AI chips) across various devices

I do not buy it, but maybe I am ignorant of progress being made there.

> A direct C++ implementation might require specific optimizations or versions for each type of hardware to achieve similar performance.

Because I do not buy the previous one, I do not buy that similar performance can be achieved there painlessly (without extra developer time), or that a wasm runtime is capable of achieving it.

zmmmmm
1 replies
2d10h

So the magic (or sleight of hand, if you prefer) seems to be in

> You just need to install the WasmEdge with the GGML plugin.

And it turns out that all these plugins are native & specific to the acceleration environment as well. But this has to happen after it lands in its environment so your "portable" application is now only portable in the sense that once it starts running it will bootstrap itself by downloading and installing native platform-specific code from the internet. Whether that is a reasonable thing for an "edge" application to do I am not sure.

kamray23
0 replies
2d8h

Basically, WASM is now what the JVM was in 2000. It's portable because it is.

tomalbrc
0 replies
2d9h

Just use Cosmopolitan at this point.

pjmlp
0 replies
2d10h

Where have I seen this WORA before, including for C and C++?

WASM does not provide access to hardware acceleration on devices with heterogeneous hardware accelerators, even its SIMD bytecodes are a subset of what most CPUs are capable of.

kelseyfrog
1 replies
2d11h

Hint: the Rewrite-it-in-Rust economy's currency isn't actually running things.

gumby
0 replies
2d11h

The crypto of programming languages?

wokwokwok
5 replies
2d8h

Mmm…

The wasi-nn that this relies on (https://github.com/WebAssembly/wasi-nn) is a proposal that relies on sending arbitrary chunks to some vendor implementation. The API is literally like: set input, compute, set output.

…and that is totally non portable.
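
To make that concrete, the entire host-side surface looks roughly like this (a stand-alone sketch against a made-up trait, since the real binding names and signatures vary by crate and runtime):

    // Made-up trait mirroring the shape of the wasi-nn proposal; not the real
    // bindings, just an illustration of how little the host actually sees.
    struct Graph;
    struct Ctx;

    trait WasiNnLike {
        fn load(&self, model_blob: &[u8], encoding: &str, target: &str) -> Graph;
        fn init_execution_context(&self, graph: &Graph) -> Ctx;
        fn set_input(&self, ctx: &mut Ctx, index: u32, tensor: &[u8]);
        fn compute(&self, ctx: &mut Ctx);
        fn get_output(&self, ctx: &Ctx, index: u32, out: &mut Vec<u8>) -> usize;
    }

    // The host passes opaque bytes in and gets opaque bytes out; whether the
    // vendor plugin can actually execute that blob on this hardware is
    // entirely its problem, which is the portability complaint above.
    fn run(nn: &impl WasiNnLike, model_blob: &[u8], prompt: &[u8]) -> Vec<u8> {
        let graph = nn.load(model_blob, "ggml", "auto");
        let mut ctx = nn.init_execution_context(&graph);
        nn.set_input(&mut ctx, 0, prompt);
        nn.compute(&mut ctx);
        let mut out = Vec::new();
        let _bytes = nn.get_output(&ctx, 0, &mut out);
        out
    }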

The reason this works is because it’s relying on the abstraction already implemented in llama.cpp that allows it to take a gguf model and map it to multiple hardware targets, which you can see has been lifted as-is into WasmEdge here: https://github.com/WasmEdge/WasmEdge/tree/master/plugins/was...

So..

> Developers can refer to this project to write their machine learning application in a high-level language using the bindings, compile it to WebAssembly, and run it with a WebAssembly runtime that supports the wasi-nn proposal, such as WasmEdge.

Is total rubbish; no, you can’t.

This isn’t portable.

It’s not sandboxed.

It’s not a HAL.

If you have a wasm binary you might be able to run it if the version of the runtime you’re using happens to implement the specific ggml backend you need, which it probably doesn’t… because there’s literally no requirement for it to do so.

…and if you do, you’re just calling the llama.cpp ggml code, so it’s as safe as that library is…

There’s a lot of “so portable” and “such rust” talk in this article which really seems misplaced; this doesn’t seem to have the benefits of either of those two things.

Let’s imagine you have some new hardware with a WASI runtime on it, can you run your model on it? Does it have GPU support?

Well, turns out the answer is “go and see if llama.cpp compiles on that platform with GPU support and if the runtime you’re using happens have a ggml plugin in it and happens to have a copy of that version of ggml vendored in it, and if not, then no”.

..at which point, wtf are you even using WASI for?

Cross platform GPU support is hard, but this… I dunno. It seems absolutely ridiculous.

Imagine if webGPU was just “post some binary chunk to the GPU and maybe it’ll draw something or whatever if it’s the right binary chunk for the current hardware.”

That’s what this is.

ikurei
3 replies
2d5h

Could you please elaborate on the security implications?

wokwokwok
2 replies
2d2h

It’s as secure as any C++ backend that performs no input validation.

Ie. whatever memory safety or sandbox you had from using wasm or rust is gone when you use it.

jart
1 replies
2d1h

The llama.cpp author thinks security is "very low priority and almost unnecessary": https://github.com/ggerganov/llama.cpp/pull/651#pullrequestr... So I'm not sure why a sandbox would bundle llama.cpp and claim to be secure. They would need more evidence than this to make such a claim.

halyconWays
0 replies
16h17m

This user was caught stealing code and banned from llama.cpp by its creator: https://news.ycombinator.com/item?id=35411909

anentropic
0 replies
2d4h

Thanks for clarifying, I was wondering where they were getting GPU support in WASM from...

hnarayanan
5 replies
2d12h

If a large part of the size is essentially the trained weights of a model, how can one reduce the size by orders of magnitude (without losing any accuracy)?

3Sophons
3 replies
2d12h

Hello, you might be talking about reducing the size of the model itself (i.e., the trained weights) by orders of magnitude without losing accuracy; that's indeed a different challenge. But the article discusses reducing the inference app size by 100x.

hnarayanan
2 replies
2d12h

Oh. Did not think that was even a goal.

3Sophons
1 replies
2d12h

I guess making it portable is still quite important?

hnarayanan
0 replies
2d8h

I am not trying to troll. I genuinely don’t see why a few MB on some binary matter when the models are multiple GB large. This is why I fundamentally misunderstood the article: my brain was looking for the other number going down, as that’s genuinely a barrier for edge devices.

rgbrgb
0 replies
2d12h

I don't think you can reduce size without losing accuracy (though I think quantized GGUFs are great). But the 2 MB size here is a reference to the program size not including a model. It looks like it's a way to run llama.cpp with wasm + a rust server that runs llama.cpp.

I like the tiny llama.cpp/examples/server and embed it in FreeChat, but always happy for more tooling options.

Edit: Just checked, the arm64/x86 executable I embed is currently 4.2 MB. FreeChat is 12.1 MB but the default model is ~3 GB so I'm not really losing sleep over 2 MB.

[0]: https://github.com/ggerganov/llama.cpp/tree/master/examples/...

oersted
3 replies
2d6h

I'm all for Rust and WASM, but if you look at the code it's just 150 lines of a basic Rust command-line script. All the heavy lifting is done by a single line passing the model to the WASI-NN backend, which in this case is provided by the WasmEdge runtime, which incidentally is C++, not Rust.

Rust is bringing zero advantage here really, the backend could be called from Python or anything else.

whywhywhywhy
2 replies
2d3h

Seems like the advantage it is bringing is in bundling; shipping Python and PyTorch into something an end user can double-click and run is currently a complete mess.

Of course the actual high powered code is C++ in both cases but shipping 2+GB and 10s of thousands of files just to send some instructions to that C++ could benefit from being one 2MB executable instead.

wrsh07
0 replies
2d2h

If you replaced the rust with any other language (including python) you shouldn't need pytorch because the rust code is using ggml (which is cpp)

oersted
0 replies
2d3h

Yes that makes sense.

I am not familiar enough with llama.cpp, but from what I see they have mostly copy-pasted it into WasmEdge for the WASI-NN implementation.

Surely a simple compiled binary of llama.cpp is better than Rust compiled to WASM plus the WasmEdge runtime binary wrapping the same llama.cpp.

It wouldn't be more portable either, all the heterogeneous hardware acceleration support is part of llama.cpp not WasmEdge.

I guess theoretically if the WASI-NN proposal is standardized, other WASM runtimes could implement their own backends. It is a decent abstraction to cleanly expand portability and for optimizing for specific instrastructure.

But at this point it doesn't have much to do with Rust or WASM. It's just the same old concept of portability via bytecode runtimes like the JVM or, indeed, the Python interpreter with native extensions (libraries).

ed
2 replies
2d12h

Whoa! Great work. To other folks checking it out, it still requires downloading the weights, which are pretty large. But they essentially made a fully portable, no-dependency llama.cpp, in 2mb.

If you're an app developer this might be the easiest way to package an inference engine in a distributable file (the weights are already portable and can be downloaded on-demand — the inference engine is really the part you want to lock down).
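
For example, a launcher can stay tiny by shipping no weights and just checking for them at startup; a minimal sketch with hypothetical paths and no real download logic:

    use std::path::PathBuf;

    // Hypothetical cache location; a real launcher would make this configurable
    // and fetch the GGUF on demand before handing it to the engine.
    fn model_path() -> PathBuf {
        PathBuf::from("models/llama-2-7b-chat.Q4_K_M.gguf")
    }

    fn main() {
        let path = model_path();
        if path.exists() {
            println!("loading weights from {}", path.display());
            // ...pass the file to the bundled inference engine here...
        } else {
            eprintln!(
                "model not found at {}; download a GGUF build and place it there",
                path.display()
            );
        }
    }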

kristianp
0 replies
2d10h

It might be more helpful if the title says 2MB of wasm. But as you say, the weights dwarf that.

andy99
0 replies
2d

The `main` file that llama.cpp builds is 1.2MB on my machine. The 2MB size isn't anything particularly impressive. Targeting wasm makes it more portable; otherwise there isn't some special extra compactness here.

rjzzleep
1 replies
2d11h

Is there any detailed info on how a 4090 + ryzen 7840 compares to any of the new Apple offerings with 64GB or more unified RAM?

renewiltord
0 replies
2d10h

No. You just have to try it. Anecdotally, I can fit a larger Llama on my M1 Max with 64 GiB than my 3090 with 24 GiB.

est
1 replies
2d11h

> The core Rust source code is very simple. It is only 40 lines of code. The Rust program manages the user input, tracks the conversation history, transforms the text into the llama2’s chat template, and runs the inference operations using the WASI NN API.

TL;DR a 2MB executable that reads stdin and calls WASI-NN

isoprophlex
0 replies
2d10h

"Rust is the language of AGI."

Oh Rust Evangelism Strike Force, never change

dkga
1 replies
2d11h

Very cool, but unless I missed it could someone please explain why not just compile a Rust application? Is the Wasm part needed for the GPU acceleration (whatever the user GPU is?)

bouke
0 replies
2d11h

I suppose wasm provides the portability between platforms. Compile once, run everywhere.

behnamoh
1 replies
2d10h

The way things are going, we'll see more efficient and faster methods to run transformer arch on edge, but I'm afraid we're approaching the limit because you can't just rust your way out of the VRAM requirements, which is the main bottleneck in loading large-enough models. One might say "small models are getting better, look at Mistral vs. llama 2", but small models are also approaching their capacity (there's only so much you can put in 7b parameters).
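
Rough napkin math for why no wrapper language changes this (a trivial sketch of weight memory alone; real usage adds KV cache and runtime overhead):

    // Napkin math: weight memory alone (decimal GB), ignoring KV cache and overhead.
    fn main() {
        for (name, params) in [("7B", 7.0e9_f64), ("13B", 13.0e9), ("70B", 70.0e9)] {
            let fp16_gb = params * 2.0 / 1e9; // 2 bytes per weight
            let q4_gb = params * 0.5 / 1e9;   // ~4 bits per weight
            println!("{name}: fp16 ≈ {fp16_gb:.1} GB, 4-bit ≈ {q4_gb:.1} GB");
        }
    }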

I don't know man, this approach to AI doesn't "feel" like it'll lead to AGI—it's too inefficient.

danielbln
0 replies
2d8h

I think we have plenty of headroom with MoE systems, dynamically loading LoRAs and such, even with the small models.

anon23432343
1 replies
2d9h

So you need 2MB for sending an API call to the edge?

Okaayyyy...

kamray23
0 replies
2d8h

This is offline.

anentropic
1 replies
2d4h

> the Mac OS build of the GGML plugin uses the Metal API to run the inference workload on M1/M2/M3’s built-in neural processing engines

I don't think that's accurate (someone please correct me...)

GGML's use of the Metal API means it runs on the M1/2/3 GPU and not the neural engine

Which is all good, but for sake of being pedantic...

btown
0 replies
2d3h

Not pedantic at all! https://github.com/ggerganov/llama.cpp/discussions/336 is a (somewhat rambling) discussion on whether it would even be worthwhile to use the neural engine specifically, beyond the GPU.

tomalbrc
0 replies
2d9h

> No wonder Elon Musk said that Rust is the language of AGI.

What.

thih9
0 replies
2d6h

> the binary application (only 2MB) is completely portable across devices with heterogeneous hardware accelerators.

What does “heterogeneous hardware accelerators” mean in practice?

syrusakbary
0 replies
2d9h

Congrats on the work... it's an impressive demo!

It may be worth researching adding support for it to the Wasmer WebAssembly runtime [1]. (Note: I work at Wasmer!)

[1]: https://wasmer.io/

rowanG077
0 replies
2d8h

I don't think you can call anything wasm efficient.

nigma
0 replies
2d9h

I hate this kind of clickbait marketing suggesting the project delivers 1/100 of the size or 100x-35000x the speed of other solutions because it uses a different language for a wrapper around the core library, while completely neglecting the tooling and community expertise built around other solutions.

First of all, the project is based on llama.cpp[1], which does the heavy work of loading and running multi-GB model files on the GPU/CPU, and the inference speed is not limited by the wrapper choice (there are other wrappers in Go, Python, Node, Rust, etc., or one can use llama.cpp directly). The size of the binary is also not that important when common quantized model files are often in the range of 5GB-40GB and require a beefy GPU or a motherboard with 16-64GB of RAM.

[1]: https://github.com/ggerganov/llama.cpp

hedgehog
0 replies
2d11h

It looks like this is Rust for the application wrapped around a WASM port of llama.cpp that in turn uses an implementation of WASI-NN for the actual NN compute. It would be interesting to see how this compares to TFLite, the new stuff in the PyTorch ecosystem, etc.

gvand
0 replies
2d10h

The binary size is not really important in this case; llama.cpp should not be that far from this. What matters, as we all know, is how much GPU memory we need.

danielEM
0 replies
2d10h

I'm getting lost in all that.

Using llama.cpp and mlc-llm, both on my 2-year-old mobile Ryzen APU with 64GB of RAM. The first does not use the GPU at all; I tried plenty of options, nothing worked, but llama 34B runs - painfully slowly, but it does work. The second works on top of Vulkan; I didn't take any precise measurements, but its limit looks like 32GB of RAM (so no llama 34B). It does offload the CPU, though unfortunately performance seems similar to the CPU (that is my perception, I didn't take any measurements here either).

So ... will I get any benefit from switching to the rust/webassembly version???

classified
0 replies
2d7h

How is it still fast if it was compiled to WASM?

antirez
0 replies
2d6h

Linkbait at its finest. But it's true that the Python AI stack sucks big time.