
High-Speed Large Language Model Serving on PCs with Consumer-Grade GPUs

phh
29 replies
1d2h

Took me a while to understand what their "hot" and "cold" neurons meant, since in most of the ML I do there is no such notion, and their paper doesn't directly define it (or I missed it).

After some thought, it does make sense with ReLU, because half of the function is constant, so you can say a neuron is "cold" if its ReLU-ed output is often 0. So I checked whether ReLU was common in LLMs: the original llama doesn't use ReLU. But after (re-)reading the GitHub page, this actually only works on ReLU models. It turns out there is a group of people "fine-tuning" (I would rather call that re-training, since you start by breaking the model?) models to use ReLU to allow for that sparsity: https://huggingface.co/SparseLLM

So this is sadly not applicable to just any model you can find on the internet, but it sounds like great progress anyway. Possibly this might shift the trade-offs back toward bigger models with "less ideal" activations. Also I'm curious what the legal impacts would be (since the USA and EU refer to a model's FLOPs/number of parameters... How do you compute it with sparsity? Do you average?)

I think a possible avenue for future research in this area is keeping the original activation (like llama keeping SwiGLU), but using quantization to define "hot" and "cold" neurons via saturation regions (for example, saying that below -1.0 at 8 bits this activation function is equivalent to -infinity, and thus the neuron is cold).
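To make that concrete, a rough sketch (my own illustration, not anything from the paper; the threshold and cutoff values are made up) of profiling activations over a calibration set and labeling neurons hot or cold by how often they sit in a saturated region:

    import numpy as np

    # Hypothetical sketch: call a neuron "cold" if its pre-activation falls in a
    # region where the activation is effectively constant at 8-bit precision most
    # of the time. Threshold and cutoff are illustrative, not from the paper.
    def classify_neurons(pre_acts, sat_threshold=-1.0, cold_fraction=0.9):
        """pre_acts: (num_samples, num_neurons) values from a calibration corpus."""
        saturated = pre_acts < sat_threshold        # output ~constant below this
        frac_saturated = saturated.mean(axis=0)     # per-neuron frequency
        cold = frac_saturated >= cold_fraction      # rarely informative -> CPU
        return ~cold, cold                          # (hot, cold) masks

    # Fake calibration data: 4096 tokens through a layer with 11008 neurons.
    pre_acts = np.random.randn(4096, 11008)
    hot, cold = classify_neurons(pre_acts)
    print(f"hot: {hot.sum()}, cold: {cold.sum()}")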

boredumb
24 replies
1d1h

Also I'm curious what the legal impacts would be (since the USA and EU refer to a model's FLOPs/number of parameters... How do you compute it with sparsity? Do you average?).

How/when did these types of regulations come about? This feels like an insane thing to have to keep in mind while developing.

radicalbyte
15 replies
1d

The EU messed up with the GDPR: they should have implemented it at least a decade earlier and ignored the lobbying that led to the cookie banner instead of an outright ban on tracking for all but a tiny number of purposes. Such a ban would have had a negligible financial impact on the tech industry but huge privacy rewards.

They're trying to get in early on AI so as not to make the same mistake again. Which might result in them making the opposite mistake.

quocanh
14 replies
20h31m

A tiny, negligible impact on the industry (except cutting advertising revenue in half, but who cares? What do ads pay for anyways?)

Nextgrid
12 replies
17h48m

What do ads pay for anyways?

Making the world a worse place? If you look carefully you’ll realize most of the harms and negative effects of technology are due to it being primarily funded by advertising and trying to maximize ad revenue.

jayd16
9 replies
16h18m

Ads seem less harmful than, say, mobile game rewards (gambling). Plenty of dark patterns in the paid space too. Banning ads would not be a panacea.

Const-me
4 replies
15h54m

Mobile games are only harmful to a relatively tiny group of addicted gamers, while internet ads have very serious consequences acting on society as a whole.

I don't think mobile gaming companies have the potential to destroy the free press, or negatively affect the mental health of a wide population of teenagers, or invade the privacy of billions of people. They simply don't have the scale for any of that.

slimsag
2 replies
12h49m

Ads are harmful, no doubt, but I do not think they are more harmful than the normalization of gambling in our society.

'I watched an ad, and then [my entire life was destroyed]' is quite hard to imagine, unless it's an ad for an MLM, crypto, entrepreneurship scam, or gambling.

On the other hand, I absolutely know people who started out in soft gambling who then proceeded to throw their life (and sometimes families) away trying to catch the next high with higher and higher stakes gambling until they lost everything, and then some.

We also don't really know the impact gambling is going to have in the near future. Loot boxes, online gambling, internet celebrity gambling, etc. really only became popular around ~2010 or later, and the kids who have been growing up with low-risk gambling as a daily accessible thing on their iPads have not come into adulthood yet.

vlovich123
1 replies
12h31m

Not an either or situation. We should do both.

slimsag
0 replies
12h10m

The parent comment downplayed the importance of mobile gaming/gambling. I simply rebutted.

livrem
0 replies
3h13m

Mobile games are only harmful to a relatively tiny group of addicted gamers, while internet ads have very serious consequences acting on society as a whole

It is still unethical to even play "free"-to-play games. You are entertained at the expense of a small group of addicts who are often spending more money than they can afford, and, at least in many games, just being logged in helps create a nicer environment that lures those people in. If you are not there to be a whale, you are there to be the lure for them. Playing might not be harmful to you, but you are being harmful to the addicts.

genman
2 replies
15h16m

I see this non-argument again and again on HN. Yes, if you get robbed but not killed, that is a better outcome than getting killed, but this doesn't make robbery good by any measure.

jayd16
0 replies
7h52m

The claim was that the majority of tech's ills are caused by ads. By leaving that statement without analysis we're blind to other problems.

8n4vidtmkvmk
0 replies
14h7m

But what if you make the punishment for robbery harsher than for murder? Maybe people start killing you after robbing you to get a lesser sentence. It happens in some parts of the world: if they accidentally hit you with their car, they'll run over you again to finish the job, because if you sue or go after them it'll be really bad for them. The point is we have to be careful about how we regulate things, or we can shift them in an even worse direction.

vlovich123
0 replies
12h32m

All those mobile games frequently require advertising in the first place to reach their customers/victims. We should definitely ban a lot of the dark patterns, which would coincidentally improve AAA games that use similar patterns (e.g. increasing gameplay duration through grinding mechanics).

quocanh
1 replies
10h9m

And the largest benefit of modern technology comes from the fact that so much of it is "free" (ad-supported). Without ads, there would simply be no effect at all.

Jensson
0 replies
8h37m

Wikipedia, Stack Overflow, and forums like Reddit, chat, and the like are the biggest benefits of the internet, and they are very cheap to run; you could fund them with donations. Reddit is more expensive than it has to be because they're trying to pivot to more ads and media, but a text forum is very cheap.

The biggest benefits of ad-supported tech are search and video; the rest would be better without ads. Reddit would be a better place if they didn't chase ad revenue, etc. In those cases, chasing revenue makes the user experience worse instead of better.

radicalbyte
0 replies
1h11m

I don't have the study at hand, but this was proven false: the impact was negligible (a few percentage points), as the fundamentals are extremely good for the big platforms. Take FB and Google: they already have extremely strong (and legitimate) profiles of users without following you around the web.

phh
7 replies
1d

How/when did these types of regulations come about?

I can't say much about the US. As I see it, the EU pretty much copied the US on that part. There was nothing related to computation in the EU's AI Act drafts until a few months ago; it was purely about "what kind of data processing are you allowed to do?"

alchemist1e9
6 replies
22h22m

Politely, what the hell are you talking about? Who is telling anyone what they can or cannot compute?

iamjackg
4 replies
22h13m

US:

https://www.whitehouse.gov/briefing-room/presidential-action...

"Until such technical conditions are defined, the Secretary shall require compliance with these reporting requirements for:

          (i)   any model that was trained using a quantity of computing power greater than 1026 integer or floating-point operations, or using primarily biological sequence data and using a quantity of computing power greater than 1023 integer or floating-point operations[...]"
EU:

https://thefuturesociety.org/wp-content/uploads/2023/12/EU-A...

geon
2 replies
21h44m

1026

1023

Should be 10^26 and 10^23.

alchemist1e9
1 replies
20h57m

Probably I did this wrong, but I'm getting an approximation of 300K H100s completing that in a month. At least they chose something fairly large, it seems. Not sure how LoRA or other incremental training is handled.

sbierwagen
0 replies
17h27m

Depends on which spec you used, since the law doesn't specify the floating point width. If you used FP8 ops on the H100 SXM then a single GPU would hit the limit in 25265285497.72612 seconds. 300,000 GPUs would pass 10^26 FP8 ops in 23 hours.
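Roughly, the arithmetic (assuming the spec-sheet ~3958 TFLOPS FP8-with-sparsity peak for the H100 SXM, and ignoring real-world utilization):

    THRESHOLD_OPS = 1e26            # reporting threshold from the executive order
    H100_FP8_OPS_PER_SEC = 3958e12  # spec-sheet FP8 Tensor Core peak w/ sparsity

    seconds_one_gpu = THRESHOLD_OPS / H100_FP8_OPS_PER_SEC
    print(seconds_one_gpu)                      # ~2.5e10 s, roughly 800 years
    print(seconds_one_gpu / 300_000 / 3600)     # ~23 hours across 300,000 GPUs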

hayley-patton
0 replies
10h58m

Are they trying to bring back SIMD-within-a-register? Though that only gives you ~one order of magnitude doing packed 4-bit stuff with 64-bit GPRs. And perhaps fixed-point, sign-exponent and posits are unregulated.

cyanydeez
0 replies
21h36m

anyone with a functional government.

brucethemoose2
2 replies
1d1h

That is a huge caveat to leave out of a readme, especially one that claims llama compatibility.

ShamelessC
1 replies
16h9m

They don’t make that claim as far as I can tell. Just that they support llama2 models.

brucethemoose2
0 replies
11h12m

Well it's not really support of llama 2 if it has to be extensively finetuned to "convert" the model.

acqq
0 replies
1d1h

Indeed

https://huggingface.co/SparseLLM/ReluFalcon-40B

"We utilize PowerInfer for inference"

modeless
9 replies
1d1h

Everyone compares against llama.cpp because it's easy mode. Llama.cpp is slow! Everyone should know this. They should compare against exllamav2 or other optimized implementations.

sroussey
3 replies
1d

What do you recommend that is faster that I can package into an app for distribution?

modeless
2 replies
1d

I have packaged exllamav2 (plus a lot of other stuff) into an app for distribution here: https://apps.microsoft.com/detail/9NC624PBFGB7

I used pyinstaller. It was difficult because Python makes these things difficult. But it works. It does require an Nvidia GPU. MLC-LLM is another option that might be easier to package and potentially able to run on AMD.

sroussey
1 replies
1d

Oh yeah, I want it to work on AMD/Intel/NVIDIA and macOS, even iOS/Android.

I've been following MLC-LLM as well. Right now I am just using JS/WASM from Huggingface, but later I will want something more performant.

modeless
0 replies
23h39m

Yeah if you want maximum performance on multiple platforms you'll probably have to package multiple frameworks. Llama.cpp might be a decently fast option on Apple Silicon, I'm not sure of the state of the art there.

superkuh
1 replies
20h52m

In this case they're comparing against llama.cpp because the code is literally a modification of llama.cpp. I'm not talking about using the ggml lib for matrix calculations, it's literally using the llama.cpp main.cpp and other normal llama.cpp code. It's a fork. It is directly comparable.

https://github.com/ggerganov/llama.cpp/pull/4543 [Review] Merge PowerInfer with llama.cpp mainline #4543

https://github.com/ggerganov/llama.cpp/discussions/4534#disc... "The x11 speedup is kind of cherrypicked because the llama.cpp GPU code for Falcon 40b is just not well-optimized."

modeless
0 replies
19h11m

Thanks for pointing that out, I didn't notice that. That makes sense.

I still think a comparison with exllamav2 or other optimized inference library would make sense too.

nulld3v
1 replies
1d1h

ExLlama is GPU only right? This speedup is for GPU + CPU split use cases.

modeless
0 replies
1d1h

Oh I see, they are running a 40B model unquantized, whereas exllamav2 would have to use 4-bit quantization to fit. Given the quality of 4-bit quantization these days and the speed boost it provides I question the utility of running unquantized for serving purposes.

I see they have a 4-bit benchmark lower down in the page. That's where they ought to compare against exllamav2.

avereveard
0 replies
22h31m

Yeah but exllama doesn't do grammars so I'm stuck with llama.cpp

Also, apparently exllama has a few side effects on coherence: https://www.reddit.com/r/LocalLLaMA/comments/17w57eu/llm_for...

127
9 replies
1d2h

Running uncensored Mixtral on this would be really nice, at more than 3-bit quantization on a 4090.

mirekrusin
3 replies
1d2h

Dual GPUs should be considered a normal/consumer-grade setup; hopefully they'll add support soon. At 4 bits it's enough, with plenty of room for context.

This whole thing is a fork of llama.cpp; I'm also hoping it'll all go upstream sooner or later.

8n4vidtmkvmk
2 replies
14h2m

4090s aren't really normal either. How many people have dual GPUs? I don't think it helps much with games, last I checked, so you'd only buy two for AI.

mirekrusin
1 replies
10h49m

It's more about what's possible to build. A dual 4090 or 3090 setup is possible without hassle. Beyond that, not really, because it'd exceed a home power socket's rating, wouldn't fit on the board or in the case, etc.

It's true you can also build a dual A6000 setup with 48+48 = 96 GB VRAM, but that's $10k+ just for GPUs from the previous generation.

kridsdale1
0 replies
2h23m

There's the physical hassle. It was very difficult for me to fit one 3090 in my case.

eurekin
2 replies
1d2h

Downvoters care to comment? Uncensored LLM versions typically perform better (at least on benchmarks) than their "lobotomized" or aligned counterparts.

infotainment
1 replies
1d1h

Probably because the parent comment didn't contain much of substance. "Oh, I'd love to see this with [insert my favorite model here]" doesn't really add a lot to the discussion.

For example, the parent commenter could have talked about the specific attributes of that model that make it superior. I personally am aware that Mixtral is one of the best performing models right now, but is everyone else? Also, does Mixtral need to be uncensored? I've used vanilla Mistral for some...interesting...prompts and had no issues with it moralizing at me.

lannisterstark
0 replies
11h58m

I mean, does it need to? Not every comment has to be a plethora of hidden information. Sometimes people are just excited.

legel
0 replies
1d

Yeah, so they demo a bigger model on an RTX 4090 with 24 GB VRAM. Granted, an implementation of sparse activations with a Mixture of Experts could be non-trivial, but I think it's a brilliant move that could potentially allow for, e.g., CPU-only processing and/or much cheaper GPU processing… Mixtral technically already has neural-network-controlled sparse activations, but like the Inception meme says: we must go deeper…

Const-me
5 replies
18h58m

Since they mentioned they’re working on Mistral-7B, I’d like to note that my GPU-only implementation of Mistral uses slightly over 5GB of VRAM: https://github.com/Const-me/Cgml

Runs pretty well on most consumer-grade GPUs, but so far it only supports Windows.

m1sta_
3 replies
12h47m

This looks really really interesting. Any idea whether it would run on a laptop with an Intel Core i7?

ru552
0 replies
3h17m

VRAM is king. If you have the VRAM to hold the model parameters, you can run it.

MacsHeadroom
0 replies
10h31m

Yes, I run it on CPU using LLMStudio. It's very fast.

Const-me
0 replies
5h14m

The performance on integrated GPUs is not stellar, but it should work.

On my AMD Ryzen 5 5600u with dual-channel DDR4, I’m getting 2 tokens/second. My friend with Intel Core i3 and single-channel memory was getting 1 token/second.

v3ss0n
0 replies
11h26m

Try ollama; it only needs about 4 GB. It uses llama.cpp.

jupp0r
3 replies
1d2h

From my understanding, in this implementation some knowledge of the model itself is needed to determine which parts to place in system memory vs. which parts to place in GPU memory. Can this ideally be computed automatically, or will future models have some sort of interface for placement algorithms like this to help automate it? If the algorithm needs to be adapted for each model architecture, it's going to be a lot of work to maintain this project.

loudmax
2 replies
1d1h

That sounds about right. They provide a script to combine their "Predictor" weights with the original models, but I don't see anything obvious on the front page of the GitHub repo about how to create those weights.

A 10x speed improvement is really impressive. If this kind of improvement is reproducible across other models, then presumably identifying hot and cold neurons for inference optimization will become a normal part of the model development process.
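As a sketch of what that step might look like (not PowerInfer's actual predictor or placement algorithm, just an illustration with made-up numbers): take per-neuron activation frequencies from a profiling run and greedily pin the hottest rows of each weight matrix to the GPU until a VRAM budget runs out, leaving the rest on the CPU.

    import numpy as np

    def split_hot_cold(act_freq, row_bytes, vram_budget_bytes):
        """act_freq: (num_neurons,) how often each neuron fired during profiling.
        Returns (gpu_rows, cpu_rows) index arrays."""
        order = np.argsort(-act_freq)                      # hottest first
        max_gpu_rows = int(vram_budget_bytes // row_bytes)
        return order[:max_gpu_rows], order[max_gpu_rows:]

    # Illustrative numbers: fp16 rows of a 4096-wide layer, 64 MiB budget.
    act_freq = np.random.rand(11008)                       # fake profiling stats
    gpu_rows, cpu_rows = split_hot_cold(act_freq, row_bytes=4096 * 2,
                                        vram_budget_bytes=64 * 2**20)
    print(len(gpu_rows), len(cpu_rows))                    # 8192 on GPU, 2816 on CPU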

thelastparadise
1 replies
22h22m

Like JVM "hot spots," or JIT optimization.

jupp0r
0 replies
21h19m

Or profile guided optimization.

ekianjo
3 replies
1d2h

How much of a speed increase do we get on CPU-only configurations? Has anyone tested it in such cases?

NavinF
1 replies
1d1h

CPU-only is impractical for most use cases, and this will only become more true over time as models become larger. The mediocre perf/$ and perf/watt make it not worth the effort.

hobobaggins
0 replies
15h45m

Might be worth it in a datacenter, especially if it's running alongside other servers (perhaps I/O-bound web serving or something); perf/$ definitely matters, but the state of the art is moving quickly (getting faster/more efficient) and optimizing some models for CPU is still relevant IMO.

ComputerGuru
0 replies
1d2h

This architecture is specifically aimed at optimizing GPU use.

coder543
3 replies
1d3h

"Power*" made me think of Microsoft, so I was almost expecting this to be Windows-specific. (PowerShell, PowerPoint, Power BI, Power Apps, Power Automate... I'm probably forgetting some.)

HPsquared
1 replies
1d2h

PowerToys are probably the original (going back to PowerToys for Windows 95)

Edit: https://socket3.wordpress.com/2016/10/22/using-windows-95-po...

coder543
0 replies
1d2h

PowerPoint existed in the late 80s, I think, although Microsoft acquired it from what I understand.

brucethemoose2
2 replies
1d3h

This is super cool.

For all the love llama.cpp gets, its method of dGPU offloading (prompt processing on the GPU and then just splitting the model down the middle) is relatively simple. But it's interesting that there even is so much "activation sparsity" to take advantage of. The traditional thinking in ML is that memory access is very random.

Hopefully the "cold" neurons eventually get offloaded to the IGP instead?

Also, it's curious that they are considering a Metal kernel. I thought the performance advantage came from the hybrid memory pool... it seems like that would only help old AMD Macs, unless I am missing something?

sroussey
1 replies
1d

The only thing I could think of on the question of Apple Silicon and Metal is that they could still split out the cold neurons to the CPU (via Accelerate) and the hot ones to the GPU and utilize both. The speedup is likely smaller if there is already no copying of data between GPU and CPU thanks to unified memory. Still, it would be great if you could use even more of the chip's capabilities simultaneously. To avoid thermal throttling they should use the efficiency cores only (I think this is what Game Mode does).

brucethemoose2
0 replies
9h31m

That doesn't make much sense to me. The GPU's task energy is so much lower than even the E cores', and AFAIK the GPU's compute isn't even fully utilized for local inference.

nextaccountic
1 replies
22h4m

Hybrid CPU/GPU Utilization: Seamlessly integrates memory/computation capabilities of CPU and GPU for a balanced workload and faster processing.

Does this mean that it runs on both the CPU and GPU at the same time, and is faster than a CPU-only or GPU-only implementation on the same device?

edit: when running on integrated GPUs, can this benefit from the improved communication between CPU and GPU?

rahimnathwani
0 replies
19h43m

GPU-only will be faster if you have enough VRAM.

But if you want to run a model that requires more VRAM than you have, the current approach is to use llama.cpp and specify n_gpu_layers. That works, but is slower than GPU-only.

OP claims to be 10x as fast as llama.cpp in the case when you can't fit the whole model in VRAM.
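For comparison, the current llama.cpp-style partial offload the parent describes looks roughly like this via the llama-cpp-python bindings (the model path and layer count are placeholders): whole transformer layers go to the GPU and the remainder runs on the CPU, with no notion of hot/cold neurons within a layer.

    from llama_cpp import Llama

    # Offload only as many whole layers as fit in VRAM; the rest run on the CPU.
    llm = Llama(
        model_path="./falcon-40b.Q4_K_M.gguf",  # hypothetical local GGUF file
        n_gpu_layers=20,                        # tune to your VRAM
        n_ctx=2048,
    )
    out = llm("Q: What is activation sparsity? A:", max_tokens=64)
    print(out["choices"][0]["text"])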

EwanG
1 replies
1d3h

The important stuff from the readme (if you're not looking to tinker with it directly):

We have tested PowerInfer on the following platforms:

x86-64 CPU (with AVX2 instructions) on Linux

x86-64 CPU and NVIDIA GPU on Linux

Apple M Chips on macOS (As we do not optimize for Mac, the performance improvement is not significant now.)

And new features coming soon:

Mistral-7B model

Metal backend for sparse inference on macOS

rahimnathwani
0 replies
19h42m

Also worth mentioning the downloadable llama2 models, and the convert.py file.

superkuh
0 replies
21h1m

This will be really cool once there's the ability to generate the sparse predictor files for arbitrary models rather than just the 4 they've done it with. Looking through the page and code, it doesn't seem like the tools to do that step are included. Guess I'll wait on this one a bit. Hopefully these features will be merged back into llama.cpp as options eventually, since this is based on the normal llama.cpp code (i.e., not just using the ggml matrix lib).

robwwilliams
0 replies
14h23m

Scale-free network topology enables a crude but effective split of neurons into hot and cold classes—hot neurons at home on the GPU and larger numbers of cold neurons that benefit from more memory on the CPU. Clever!

peter_d_sherman
0 replies
57m

"This distribution indicates that a small subset of neurons, termed hot neurons, are consistently activated across inputs, while the majority, cold neurons, vary based on specific inputs. PowerInfer exploits such an insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers."

Brilliant!

causality0
0 replies
21h16m

All the "consumer-grade GPUs" terminology makes it seem like you could run it on a variety of GPUs, but like so many of these posts, is this a 4090 exclusive?

PoignardAzur
0 replies
3h20m

This sounds like it uses the same techniques as the ones described in the "LLM in a Flash" paper posted yesterday? If so, cool to see an implementation of these techniques running models on non-Apple GPUs.

ComputerGuru
0 replies
1d2h

It’s not too much faster than exllama2 with flash attention, no?