High-Speed Large Language Model Serving on PCs with Consumer-Grade GPUs

Took me a while to understand what their "hot" and "cold" neurons meant, since most of the ML I do has no such notion. And their paper doesn't directly define it (or I missed it).

After some thought, it does make sense for ReLU: half of the function is constant, so you can call a neuron "cold" if its ReLU-ed output is often 0. So I checked whether ReLU is common in LLMs; the original llama doesn't use ReLU. But after (re-)reading the GitHub, it actually only works on ReLU models. Turns out that there is a group of people "fine-tuning" (I would rather call that re-training, since you start by breaking the model?) models to use ReLU to allow for that sparsity:
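To make the "often 0" idea concrete, here is a minimal sketch, not taken from the paper or the repo. It measures how often each neuron of a toy ReLU layer fires on random inputs and labels it hot or cold. The layer sizes, the biases, the 0.5 cut-off and all function names are assumptions chosen for illustration.

```python
import random


def relu(x):
    return x if x > 0.0 else 0.0


def activation_rates(weights, biases, inputs):
    """Fraction of inputs for which each neuron's ReLU output is non-zero."""
    counts = [0] * len(weights)
    for x in inputs:
        for j, (w, b) in enumerate(zip(weights, biases)):
            pre = sum(wi * xi for wi, xi in zip(w, x)) + b
            if relu(pre) > 0.0:
                counts[j] += 1
    return [c / len(inputs) for c in counts]


def label_neurons(rates, hot_threshold=0.5):
    """Hypothetical rule: a neuron is 'hot' if it fires on at least half the inputs."""
    return ["hot" if r >= hot_threshold else "cold" for r in rates]


rng = random.Random(0)
dim, n_neurons, n_samples = 8, 6, 200
weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_neurons)]
# Assumed biases: strongly positive neurons fire almost always, strongly negative ones rarely.
biases = [4.0, 4.0, 0.0, -4.0, -4.0, -4.0]
inputs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_samples)]

rates = activation_rates(weights, biases, inputs)
labels = label_neurons(rates)
for j, (r, lab) in enumerate(zip(rates, labels)):
    print(f"neuron {j}: fires on {r:.0%} of inputs -> {lab}")
```

With a non-ReLU activation the output is almost never exactly 0, which is why this kind of counting does not carry over directly.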

So this is sadly not applicable to any model you can find on the internet, but it sounds like great progress anyway. Possibly this might shift the trade-offs back toward bigger models with "less ideal" activations. Also I'm curious what the legal impact would be (since the USA and EU refer to a model's FLOPs/number of parameters... How do you compute that with sparsity? Do you average?)

I think a possible avenue for future research in that area is keeping the original activation (like llama keeping SwiGLU), but using quantization to define "hot" and "cold" neurons in terms of saturation regions. (For example, saying that at 8 bits this activation function, below -1, is equivalent to its value at -infinity, and thus this is a cold neuron.)
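A rough sketch of that saturation idea, under stated assumptions: it uses SiLU (the gate activation inside SwiGLU), a symmetric int8 quantizer with an assumed clipping range of [-8, 8], and a simple scan for the input below which the quantized output is exactly 0. None of this comes from PowerInfer; the function names and the scale are hypothetical, and the cut-off depends entirely on the chosen scale.

```python
import math


def silu(x):
    """SiLU / swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def quantize_int8(y, scale):
    """Symmetric int8 quantization: round to the nearest step, clamp to [-127, 127]."""
    q = round(y / scale)
    return max(-127, min(127, q))


def saturation_threshold(scale, lo=-20.0, step=0.01):
    """Scan upward from `lo`; return the last input whose quantized SiLU is still 0."""
    x = lo
    last_zero = None
    while x < 0.0:
        if quantize_int8(silu(x), scale) != 0:
            break
        last_zero = x
        x += step
    return last_zero


# Assumption: activation outputs are clipped to [-8, 8] before quantization.
scale = 8.0 / 127.0
threshold = saturation_threshold(scale)
print(f"inputs below about {threshold:.2f} quantize to 0 at this scale")
```

Under these assumptions the cut-off lands well below -1, so where the "cold" region starts would have to be derived from the quantization scheme rather than fixed by hand.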

Also I'm curious what the legal impact would be (since the USA and EU refer to a model's FLOPs/number of parameters... How do you compute that with sparsity? Do you average?)

How/when did these types of regulations come about? This feels like an insane thing to have to keep in mind while developing.

The EU messed up with the GDPR: they should have implemented it at least a decade earlier and ignored the lobby which led to the cookie banner instead of an outright ban on tracking for all but a tiny number of purposes. Such a ban would have had a negligible financial impact on the tech industry but would have had huge privacy rewards.

They're trying to get in early on AI so as not to make the same mistake again. Which might result in them making the opposite mistake.

I don't have the study at hand but this was proven false: the impact was negligible (% points) as the fundamentals are extremely good for the big platforms. Take FB and Google: they already have extremely strong (and legitimate) profiles of users without following you around the web.

That is a huge caveat to leave out of a readme, especially one that claims llama compatibility.

They don’t make that claim as far as I can tell. Just that they support llama2 models.

Well it's not really support of llama 2 if it has to be extensively finetuned to "convert" the model.

"We utilize PowerInfer for inference"

Everyone compares against llama.cpp because it's easy mode. Llama.cpp is slow! Everyone should know this. They should compare against exllamav2 or other optimized implementations.

What do you recommend that is faster that I can package into an app for distribution?

I have packaged exllamav2 (plus a lot of other stuff) into an app for distribution here:

I used pyinstaller. It was difficult because Python makes these things difficult. But it works. It does require an Nvidia GPU. MLC-LLM is another option that might be easier to package and potentially able to run on AMD.

Oh yeah, I want it to work on AMD/Intel/NVIDIA and macOS, even iOS/Android.

I've been following MLC-LLM as well. Right now I am just using JS/WASM from Huggingface, but later I will want something more performant.

Yeah if you want maximum performance on multiple platforms you'll probably have to package multiple frameworks. Llama.cpp might be a decently fast option on Apple Silicon, I'm not sure of the state of the art there.

In this case they're comparing against llama.cpp because the code is literally a modification of llama.cpp. I'm not talking about using the ggml lib for matrix calculations, it's literally using the llama.cpp main.cpp and other normal llama.cpp code. It's a fork. It is directly comparable. [Review] Merge PowerInfer with llama.cpp mainline #4543 "The x11 speedup is kind of cherrypicked because the llama.cpp GPU code for Falcon 40b is just not well-optimized."

Thanks for pointing that out, I didn't notice that. That makes sense.

I still think a comparison with exllamav2 or other optimized inference library would make sense too.

ExLlama is GPU only right? This speedup is for GPU + CPU split use cases.

Oh I see, they are running a 40B model unquantized, whereas exllamav2 would have to use 4-bit quantization to fit. Given the quality of 4-bit quantization these days and the speed boost it provides I question the utility of running unquantized for serving purposes.

I see they have a 4-bit benchmark lower down in the page. That's where they ought to compare against exllamav2.

Yeah but exllama doesn't do grammars so I'm stuck with llama.cpp

Running uncensored Mixtral on this would be really nice. More than 3 bits quantized for 4090.

Dual GPUs should be considered a normal/consumer-grade setup; hopefully they'll add support soon. At 4 bits that's enough, with plenty of space left for context.

This whole thing is a fork of llama.cpp; also hoping it'll all go upstream sooner or later.

4090s aren't really normal either. How many people have dual GPUs? I don't think it helps much with games last I checked so you'd only buy 2 for AI.

It's more about what's possible to build. Dual 4090 or 3090 is possible to set up without hassle. Beyond that, not really, because it'd be above a home power socket's rating, wouldn't fit on the board or in the case, etc.

It's true you can also build a dual A6000 setup with 48+48 = 96GB VRAM, but that's a $10k+ setup just for GPUs of a legacy generation.

There’s the physical hassle. It was very difficult for me to fit 1 3090 in my case.

Downvoters, care to comment? Uncensored LLM versions typically perform better (at least on benchmarks) than their "lobotomized" or aligned counterparts.

For example, the parent commenter could have talked about the specific attributes of that model that make it superior. I personally am aware that Mixtral is one of the best performing models right now, but is everyone else? Also, does Mixtral need to be uncensored? I've used vanilla Mistral for some...interesting...prompts and had no issues with it moralizing at me.

I mean, does it need to? Not every comment has to be a plethora of hidden information. Sometimes people are just excited.

Yeah, so they demo a bigger model on an RTX 4090 with 24 GB VRAM. Granted an implementation of sparse activations with the Mixture of Experts could be non-trivial, I think it’s a brilliant move, that could potentially allow for even, e.g., CPU only processing and/or much cheaper GPU processing… Mixtral technically already has neural network controlled sparse activations, but like the Inception meme says: we must go deeper…

Since they mentioned they’re working on Mistral-7B, I’d like to note that my GPU-only implementation of Mistral uses slightly over 5GB of VRAM:

Runs pretty good on most consumer-grade GPUs, but so far it only supports Windows OS.

This looks really really interesting. Any idea whether it would run on a laptop with an Intel Core i7?

VRAM is king. If you have the VRAM to hold the model parameters, you can run it.

Yes, I run it on CPU using LM Studio. It's very fast.

The performance on integrated GPUs is not stellar, but it should work.

On my AMD Ryzen 5 5600u with dual-channel DDR4, I’m getting 2 tokens/second. My friend with Intel Core i3 and single-channel memory was getting 1 token/second.

Try ollama; it only needs about 4GB and it uses llama.cpp.

From my understanding, this implementation needs some amount of knowledge about the model itself to determine which parts to place in system memory and which parts to place in GPU memory. Can this ideally be computed automatically, or will future models have some sort of interface for placement algorithms like this to help automate it? If the algorithm needs to be adapted for each model architecture, it's going to be a lot of work to maintain this project.

That sounds about right. They provide a script to combine their "Predictor" weights with the original models, but I don't see anything obvious on the front page of the GitHub repo about how to create those weights.

A 10x speed improvement is really impressive. If this kind of improvement is reproducible across other models, then presumably identifying hot and cold neurons for inference optimization should go on to become a normal part of the model development process.

Like JVM "hot spots," or JIT optimization.

Or profile guided optimization.

How much of a speed increase do we get on CPU-only configurations? Has anyone tested it in such cases?

CPU-only is impractical for most use cases, and this will only become more true over time as models become larger. The mediocre perf/$ and perf/watt make it not worth the effort.

Might be worth it in a datacenter, especially if it's operating other servers alongside (perhaps I/O bound web serving or something); perf/$ does matter, definitely, but the state of the art is moving quickly (getting faster/more efficient) and optimizing some models for CPU is still relevant IMO.

This architecture is specifically aimed at optimizing GPU use.

"Power*" made me think of Microsoft, so I was almost expecting this to be Windows-specific. (PowerShell, PowerPoint, Power BI, Power Apps, Power Automate... I'm probably forgetting some.)

PowerToys are probably the original (going back to PowerToys for Windows 95)


PowerPoint existed in the late 80s, I think, although Microsoft acquired it from what I understand.

This is super cool.

For all the love llama.cpp gets, its method of dGPU offloading (prompt processing on GPU and then just splitting the model down the middle) is relatively simple. But it's interesting that there even is so much "activation sparsity" to take advantage of. The traditional thinking in ML is that memory access is very random.

Hopefully the "cold" neurons eventually get offloaded to the IGP instead?

The only thing I could think of on the question of Apple Silicon and Metal is that they think they could still split out the cold neurons to the CPU/Accelerate and the hot ones on the GPU and utilize both. The speedup is likely less if there is already no copying of data between GPU/CPU and using the unified memory. Still, it would be great if you could use even more of the capabilities of the chip simultaneously. In order to avoid thermal throttling they should use the efficiency cores only (I think this is what game mode does).

That doesn't make much sense to me. The GPU's task energy is so much lower than even the E-cores', and AFAIK the GPU's compute isn't even fully utilized for local inference.

Hybrid CPU/GPU Utilization: Seamlessly integrates memory/computation capabilities of CPU and GPU for a balanced workload and faster processing.

Does this mean that it runs on both the CPU and the GPU at the same time, and is faster than a CPU-only or a GPU-only implementation on the same device?

edit: when running on integrated GPUs, can this benefit from the improved communication between CPU and GPU?

GPU-only will be faster if you have enough VRAM.

But if you want to run a model that requires more VRAM than you have, the current approach is to use llama.cpp and specify n_gpu_layers. That works, but is slower than GPU-only.

OP claims to be 10x as fast as llama.cpp in the case when you can't fit the whole model in VRAM.
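For reference, the n_gpu_layers approach mentioned above boils down to picking how many whole layers fit in VRAM. A back-of-the-envelope sketch follows, assuming all layers are the same size and using made-up numbers; the helper name and the reserve for KV cache and scratch buffers are hypothetical, and this is not llama.cpp's own logic.

```python
GIB = 1024 ** 3


def layers_that_fit(model_bytes, n_layers, vram_bytes, reserve_bytes):
    """Estimate how many equally sized layers fit in VRAM after setting aside a reserve."""
    usable = max(0, vram_bytes - reserve_bytes)
    # Integer arithmetic: floor(usable / (model_bytes / n_layers)).
    fit = usable * n_layers // model_bytes
    return min(n_layers, fit)


# Hypothetical numbers: a 40 GiB model with 60 layers on a 24 GiB card,
# keeping 3 GiB free for the KV cache and scratch buffers.
n_gpu_layers = layers_that_fit(40 * GIB, 60, 24 * GIB, 3 * GIB)
print(n_gpu_layers)  # 31
```

The remaining layers run on the CPU in full, which is the part a per-neuron hot/cold split is meant to improve on.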

We have tested PowerInfer on the following platforms:

x86-64 CPU (with AVX2 instructions) on Linux

x86-64 CPU and NVIDIA GPU on Linux

Apple M Chips on macOS (As we do not optimize for Mac, the performance improvement is not significant now.)

And new features coming soon:

Mistral-7B model

Metal backend for sparse inference on macOS

Also worth mentioning the downloadable llama2 models, and the file.

This will be really cool once there's the ability to generate the sparse predictor files for arbitrary models rather than just the 4 they've done it with. Looking through the page and code it doesn't seem like the tools to do that step are included. Guess I'll wait on this one a bit. Hopefully these features will be merged back into llama.cpp as options eventually since this is based on the normal llama.cpp code (ie, not just using the ggml matrix lib).

Scale-free network topology enables a crude but effective split of neurons into hot and cold classes—hot neurons at home on the GPU and larger numbers of cold neurons that benefit from more memory on the CPU. Clever!
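As an illustration of that split, here is a minimal greedy sketch: sort neurons by how often they activate and fill a GPU memory budget with the most frequently active ones, leaving the rest on the CPU. This is an assumed simplification for discussion, not PowerInfer's actual placement policy; the activation rates, the per-neuron size and the budget are invented numbers.

```python
def place_neurons(activation_rates, bytes_per_neuron, gpu_budget_bytes):
    """Greedy split: the most frequently active neurons go to the GPU until the budget is full."""
    order = sorted(
        range(len(activation_rates)),
        key=lambda i: activation_rates[i],
        reverse=True,
    )
    capacity = gpu_budget_bytes // bytes_per_neuron
    gpu = sorted(order[:capacity])
    cpu = sorted(order[capacity:])
    return gpu, cpu


# Hypothetical per-neuron activation rates (fraction of tokens on which each neuron fires).
rates = [0.95, 0.02, 0.40, 0.88, 0.01, 0.10]
gpu_ids, cpu_ids = place_neurons(rates, bytes_per_neuron=1024, gpu_budget_bytes=2048)

# Share of all activations that would be served from GPU memory under this split.
gpu_share = sum(rates[i] for i in gpu_ids) / sum(rates)
print(gpu_ids, cpu_ids, f"{gpu_share:.0%}")
```

In this toy case a third of the neurons covers most of the activations, which is the skew the hot/cold split relies on.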

"This distribution indicates that a small subset of neurons, termed hot neurons, are consistently activated across inputs, while the majority, cold neurons, vary based on specific inputs. PowerInfer exploits such an insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers."


All the "consumer grade GPUs" terminology makes it seem like you could run it on a variety of models, but like so many of these posts, is this a 4090 exclusive?

This sounds like it uses the same techniques as the ones described in the "LLM in a Flash" paper posted yesterday? If so, cool to see an implementation of these techniques running models on non-Apple GPUs.

It’s not too much faster than exllama2 with flash attention, no?