Show HN: Speeding up LLM inference 2x times (possibly)

xrd
17 replies
22h22m

My takeaway is that this proves what was said on the recent latent.space podcast with David Luan from Adept.

https://www.latent.space/p/adept

"I think is going to commoditize a lot of the regular LLMs and soon regular multimodal models."

In other words, if you train your own models, you will not get to take advantage of breakthroughs like this that start with open models (like Mistral).

All the advantages are going towards the open models and this is an existential risk for OpenAI and other closed model companies.

quadrature
14 replies
22h5m

Maybe, but there's nothing that stops OpenAI from stealing these tricks.

refulgentis
7 replies
21h53m

It's extremely unlikely anyone will take any of this.

Quick take:

- it was 15x slower than llama.cpp when I used Apple's new proprietary ML framework on my MacBook

- So I made it possible to skip arbitrary amounts of work.

- I identified an arbitrary tradeoff that seems arbitrarily good to me.

- I've confirmed this by making GPT-4 write some prompt with questions. Then I had the normal version answer, and the "skip arbitrary work" version answer, and it LGTM.

- So I threw it up on GitHub, then on HN with the title "LLM inference 2x faster (possibly)", and people missed: [on my laptop] [in the ML framework I'm forcing myself to use] [based on an eval I made up] [based on an eval where I am the evaluator]

This *really* shouldn't have the title it does, very misleading.

Author, please feel free to correct me; I'm sorry for not taking the time to find a gentler way to communicate this. I hope you kick ass. You did put "possibly" in a parenthetical, but it's carrying the weight of the world here - people just see "LLM 2x faster". That's why everyone is spinning off into grand speculation land, which I also see you valiantly commenting to dissuade.

xrd
3 replies
21h28m

I think you are commenting on the "stealing" reply and not my original comment. And, I think you are making my point stronger.

OpenAI could easily (and will) put out a blog post or tweet saying "next models do inference 2.5x faster!" kolinko did that, or maybe he didn't and someone else put words in his mouth. I don't really care: I can validate and test your comments here (and they are great!) and I can try his experiments myself.

I cannot do that against "GPT-5-2.5x faster (c) 2024"-42B (because it isn't released publicly yet). Putting a paper and some vague ideas on arXiv isn't really doing much these days except adding to the confusion. Truly open work like kolinko is doing is really exciting, and it feels like it can only be done against truly open models like Mistral.

Oh wait, Mistral isn't fully open either (ducks...).

observationist
2 replies
20h23m

There used to be a class of software called freeware - you could download and use it without restriction, you just couldn't resell it, or have the source to modify it. Llama and similar models are like freeware - an inscrutable binary blob crafted to work with other software, except instead of a VM or native OS environment, you have llama.cpp or similar software that runs the AI model.

Mistral is open source, in that you can do anything the Apache license allows you to do, even package it into your own product and resell or modify it. We're missing the dataset details and the source of the software that produces the model, similar to not having access to the operating system and the special compiler software used to build a program. That's not a huge deal, because people don't have the resources to make use of those large datasets or the Mistral training software, which is likely highly tailored to their own training and development pipeline, and wouldn't do much good for anyone without at least a pod of A100s of their own.

"Weights available" and other terms are being thrown around, and Meta and the like are calling their stuff "open", but that use of the term bears little resemblance to the use of the word by the open source community.

The public Mistral models have open source licenses. The model can be used like open source software. The terms are permissive and free, requiring only attribution. Meta's license scheme is novel and not open, with arbitrary lawyerese and absolutely, 100% will bite someone in the ass when the threshold between "annoying to sue" and "profitable to sue" gets exceeded by someone using Llama in a way that's technically incorrect. Right now, Meta wants the goodwill more than they want a couple million dollars chasing a couple dozen startups.

If the model doesn't have an open source license, it's not open. It might be freeware. Llama is freeware. You can, technically, do whatever you want to it, but try to not attract too much notice or be too successful with it.

Mistral, by using Apache licensing, couldn't go after you even if they wanted to, unless you do something deliberately stupid.

oceanplexian
1 replies
17h39m

If the model doesn't have an open source license, it's not open.

Actually, OSS comes with tons of strings attached that make the term "open" dubious. And there are many ways they could come after you legally. Apache, GPL, etc. all have terms and conditions: you have to contribute back X, Y, and Z, agree to our manifesto, and so on.

The only truly free license is MIT. Go build a billion dollar business, change it however you want with no strings attached and you can express the license terms in a single paragraph.

observationist
0 replies
3h41m

    Apache is stricter because it requires you to list all the modifications you make to the original software. A copy of the license, with copyright and attribution notices, must be included in the source code and any modifications. A specific section on contributions outlines the conditions for accepting external contributions to the project. The MIT license requires a copyright notice, but it does not have explicit requirements for the license or attribution notices.
source: https://www.mend.io/blog/top-10-apache-license-questions-ans...

Apache 2.0 is almost as permissive as MIT, and better suited for some cases. I love the MIT license, as it allows the most freedom for users and creators, across the board. Apache 2.0 is second best, but might be better in a formalized organization that wants more structure and documentation requirements.

kolinko
2 replies
21h38m

The whole inference pipeline is slow, but it's the matrix multiplications that count. They work reliably on all the MacBooks that I tested - at 50% effort they match the speed of state-of-the-art matrix multiplications, and at 25% they are twice as fast.

Apple's MPS matrix multiplications are comparable in speed to llama.cpp and the other implementations. When I was doing tests, I compared the llama.cpp benchmarks ( https://github.com/ggerganov/llama.cpp/discussions/4167 ) to Apple's MPS - they match very closely. And then I compared Apple's MPS to my results.

Even if the end results show that the models somehow break (which they might at Q8), there is no other implemented method right now that gives you such speedups with matrices at 25% sparsity. The usual methods only break even with full matrix multiplication around the 15% mark, and show speed improvements under 10% (as far as I know, but I'm new to the field, so I wait to be corrected).

As for the other metrics - I hope to get help from the community to get the implementation done properly. So far it's been 3 months of work, 12 hours a day - even during Easter - to get this version going. It is as far as I can push it without community support, which I'm happy to say I've received over the last hour.

Also, I'm not sure what you'd expect really. A full production-ready system on day one? From a solo developer? Seriously? :)

Let's get the flame war going! :D

refulgentis
1 replies
21h29m

Nah it's good work, you'll be able to flame me in a couple weeks...months?..too when I ship my yawn-worthy yet another llama.cpp / OpenAI wrapper. :p

I'd love this knob, particularly in llama.cpp - inference is a bit too slow on Android, 6 tkn/s for 3B. I just can't stand it when people don't actually read anything but the title and go crazy overboard - like, how are we in a thread where people are saying "oh this confirms local models will definitely win like I heard on a podcast" and "big bad OpenAI will steal this"?

kolinko
0 replies
20h57m

Hahah thanks, although I was hoping for a flame to get adrenaline flowing to push through the night :D

I also hope there will be an extra knob - or more like knobs, because effort can be regulated smoothly layer by layer, token by token, matrix by matrix. Think more like an equalizer, not a volume control :)

The biggest question right now is how (if) it will perform with Q8 and with smaller models. The risk is that the quality dropoff will show up closer to 40-60% at Q8, negating the performance gains.

littlestymaar
5 replies
21h0m

Exactly - that's the problem with the current state of open models: the players that keep their secret sauce keep an edge over the people doing things in the open, while benefiting from all of their work without contributing back.

kolinko
3 replies
20h2m

That was the claim with a lot of the software in the past, but open source won in many places in the end.

iwontberude
1 replies
18h38m

At the end of the day, if their profit margins aren't good, it doesn't matter whether their competition is open source or not (which is often where OSS wins). I think we are seeing that AI isn't the slam dunk for increasing productivity, or we would see companies like UiPath being profitable. I don't think we've seen anyone net a profit on AI software, and the only company that has been investing since at least 2017, Apple, gets zero credit for their contributions and commercialization of the tech. I think about how Amazon abandoned its AI-powered, checkout-free tech because the margin of error stayed stubbornly high for too long. The clock is ticking on the industry, and some players, like Apple, have already found it isn't profitable (well, not to their standards of 60% return on investment).

kolinko
0 replies
10h58m

Depends on the field.

The project from this thread would have taken me an impossible amount of time without GPT. Even the page itself would have taken me twice as long to put together - the charts were done by pasting the source data into GPT and having it write matplotlib code for me to chart them, and the equations were originally written by GPT as well, because I wasn't familiar with MathJax.

Ditto with the code - a bunch of it was written by GPT originally, since this is my first Swift/Metal project. I kept telling it what I wanted to do in Python, and it kept rewriting it in Swift/Metal until I learned the latter.

The name "effort" was also invented by GPT. Originally, internally, I was using "quant" but that would be confused with quantization. I considered "perc" from percentage - but that's ugly. I described the project to GPT, and it suggested "effort" as a metric.

As for self-checkout - in Poland we have Żabka Nano, which is still going, and seems more solid than Amazon's, but of course time will tell :)

littlestymaar
0 replies
7h48m

but open source won in many places in the end.

The biggest companies in the world are running closed-source software for profit that is built on open source foundations while barely contributing back, so it's really not the counter-argument you think it is. And it's no wonder we're seeing open source companies going for source-available licenses now (Redis, HashiCorp) or other kinds of restrictions (Red Hat), because they were helpless against the parasitic behavior of the big bad wolves.

Hugsun
0 replies
16h54m

In these fast moving early times of LLMs, they can maintain this advantage with simple things like proprietary datasets and greater compute.

The difference in quality between the best model that can be made with such proprietary utilities and without is likely to decrease over time as open datasets of greater quality are published and the field matures.

The difference in quality, and the number of competitors, is ultimately what pays the bills: the harsher the competition, the less money there will be for each individual company to maintain its possibly dwindling proprietary edge.

The greater access to compute is an edge companies will likely hold for a while. It will be interesting to see how much open models will be able to catch up and how great of an edge proprietary models will maintain.

kolinko
0 replies
21h52m

In this case though, the algorithm should be just as useful to closed models as to open models. There is nothing there that is optimised specifically for Mistral - aside from the hard-coded dimensions in multiple places in the code :D

Having said that, it's awesome to have open source models out there, and I hope they ultimately win.

Havoc
0 replies
18h44m

That seems fundamentally flawed to me. Closed providers can copy techniques from open models but not vice versa.

To me that reads as: closed will always be a step ahead, not an existential risk to OpenAI.

renonce
5 replies
13h36m

Essentially the algorithm is about pruning parameters on the fly and making the weight matrix sparse by setting least significant weights to zero, where "least significant" is defined in terms of the rank of the absolute value of the pruned weight within its group?

A search for model pruning turns up many results, including https://arxiv.org/abs/2305.11627 which discusses "magnitude pruning" as a baseline, and refers to https://arxiv.org/pdf/2301.00774.pdf, which asserts in the introduction:

First, as shown in Figure 1, SparseGPT can induce uniform layerwise sparsity of up to 60% in e.g. the 175-billion-parameter variant of the OPT family (Zhang et al., 2022), with minor accuracy loss. By contrast, the only known one-shot baseline which easily extends to this scale, Magnitude Pruning (Hagiwara, 1994; Han et al., 2015), preserves accuracy only until 10% sparsity, and completely collapses beyond 30% sparsity.

I don't like how these papers boast about their own methods by poorly implementing a baseline and describing everything in lots of mathematical jargon - the OP's blog post is a breeze by comparison, making the method accessible to people with very little background.

kolinko
4 replies
11h50m

Thanks! The last full month was all about making sure all the research is as replicable and trustworthy as I could make it. The original implementation was extremely inefficient, and even once I had the Metal/GPU mmul op working fast, I spent a lot of time bringing the rest of the implementation as close to llama.cpp as possible to allow for easier benchmarking.

As for the approaches - it seems the papers you mention do this statically, and didn't produce an algorithm for actually speeding up computations with their 20-50% results - which was a large part of the difficulty here. I'll probably take some time off one day and finally do a thorough review of the literature.

I want to add a citations page soon with these and some other papers people have posted in the comments. I expect sooner or later someone will find this algorithm written up by someone :)

During development I asked GPT-4 and googled, trying to find information about this method - everything seemed either static, or focused on removing full dimensions/layers ad hoc with eventual retraining. I didn't find anything matching this idea exactly.

bravura
2 replies
11h5m

Thank you. If you want to contribute novel research, a thorough literature search of prior work is essential. And hopefully a comparison to previous approaches.

If you didn't find these works in your literature search, I encourage you to improve this skill, because it's important and valuable.

kolinko
1 replies
10h7m

True - it will be easier next time, because new LLMs allow for better searching through papers.

Having said that - the tradeoff with getting into a new field is that there are 50 years of history to dig through, and by the time you know exactly what you're looking for, you don't really need to find it anymore. Although a better math framework or another approach would help a bit for sure.

I think in the future it will be way easier with better LLM searches. GPT fell short in finding some stuff this time, but it was already very useful in giving me an overview of the field.

llama_person
0 replies
1h20m

One more opinion here: if you're looking around and nobody else has something obviously working, it's ok not to do a thorough search through hundreds of papers hoping to find someone else who came up with the idea. Get it working first, show it off, and if someone comes claiming it was their idea, respectfully acknowledge the prior work. But there's a gigantic gap between "wrote it in a paper" and "actually got it working".

Also, awesome job! Thanks for writing up some seriously cool engineering.

kanwisher
5 replies
23h34m

Could you break down a bit more why you can skip so many calculations?

kolinko
2 replies
23h20m

It’s explained in detail here:

https://kolinko.github.io/effort/equations.html

Long story short - I kind of sort them and pick only the top % that would give the highest result.

One part is choosing them though - I think this was done before in some papers. But the second part was an implementation of multiplication that is efficient on both GPUs and CPUs when choosing weights almost at will.

All explained on the site, but I just got feedback that it may not be easy enough to read, so I'll push it through GPT for grammar fixes soon :) It's also a bit complicated as an algorithm.
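
If it helps, here is a toy numpy sketch of just the numerics of that idea - rank every scalar product by magnitude and keep only the top "effort" fraction of them. This is only an illustration of the approximation, not the actual bucketMul kernel: the real thing precomputes sorted weights and probes so it can skip memory reads, which is where the speedup comes from.

    import numpy as np

    def approx_matvec(v, W, effort=0.25):
        # "Sort them and pick the top %": keep only the largest |v[i] * W[i, j]|
        # products and treat the rest as zero. Dense simulation, no speedup.
        products = v[:, None] * W
        k = max(1, int(effort * products.size))
        cutoff = np.partition(np.abs(products).ravel(), -k)[-k]   # k-th largest magnitude
        return np.where(np.abs(products) >= cutoff, products, 0.0).sum(axis=0)

    rng = np.random.default_rng(0)
    v = rng.standard_normal(4096).astype(np.float32)
    W = rng.standard_normal((4096, 4096)).astype(np.float32)
    exact = v @ W
    approx = approx_matvec(v, W, effort=0.25)
    # random data here - real weight/activation pairs behave differently
    print(exact @ approx / (np.linalg.norm(exact) * np.linalg.norm(approx)))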

throwaway2562
1 replies
23h0m

Hmmm. Very interesting. I wonder if you could speed up your clever approximation still further with this approach https://arxiv.org/pdf/2205.09120.pdf

kolinko
0 replies
20h42m

Nah, sadly this most likely will not work with Q1/Q1.5 implementations - initially I was playing with Monte Carlo approximations (before I arrived at bucketMul), and the convergence was very slow for binary/ternary networks.

Or, in simpler terms - if you have just ones, zeroes, and minus ones, you can remove the zeroes from the calculations, but that's it. There's no good method to figure out which ones are more important than the others.

Also, there are no bits left to store positional information when bucketing.

There are some paths that could be explored in this fashion, but it would require a redesign of the algorithm from the ground up.

punnerud
1 replies
23h9m

A bit like calculating the fastest route for a car: you probably don't need to calculate the distances for the opposite side of the Earth if you're not going to drive there. Then multiply that by a billion, but the optimization still holds.

kolinko
0 replies
11h17m

Nice analogy, thanks! Yes - you still need a map, because one day you need this road, another day you need info about another road, but you're not using info about all the roads all of the time.

dartos
5 replies
1d

Thank you for this really cool and open contribution!

I will be watching llama.cpp closely for them to implement this!

I’ve been looking for ways to speed up CPU inference and I really like this idea of “effort”

kolinko
4 replies
1d

Hahah, thanks! It was a marathon to develop this, and I'm glad it reached the front page.

The name was proposed by chatgpt :) It claims it doesn't recognise this approach - so there is a chance it's really a new thing.

I want to reach out to llama.cpp and the others - I hope it gets implemented. I considered just writing a patch for llama.cpp, but C++ and the scale of that project were beyond me.

As for CPU inference - it should speed that up just as well. And thanks to the fact that it can load just a fraction of the weights (e.g. 70%, skipping the least important ones), it should now be possible to run models on less VRAM than before (still, Q8 needs to be implemented though).

Funnily enough - when I tried comparing benchmarks to llama.cpp, I couldn't find speeds for 7B/FP16 on an MB Air 16GB, because it's impossible to run with regular methods. It is possible with Effort.

Ditto - I was running a full-precision, but cropped, Mixtral on my 96GB M2, even though it usually takes 114GB of RAM. I just loaded 75% of the weights, and it was working smoothly. (Before I messed something up with the implementation - it now produces crap output and needs a fix.)

indymike
1 replies
23h44m

It is possible with Effort.

"All things are possible with enough effort." -- Dad.

kolinko
0 replies
23h30m

Hahaha :)

dhruvdh
1 replies
1d

I would imagine the importance of weights depends on the prompt. How do you decide which weights are important?

kolinko
0 replies
1d

Yeah, that is the point more or less - it dynamically chooses the weights, layer by layer, depending on the internal state.

A slightly technical explanation here: https://kolinko.github.io/effort/equations.html

brrrrrm
5 replies
22h33m

do you have a simple python impl? :)

kolinko
2 replies
22h16m

It originally started as a fork of Recmo's cria, a pure numpy llama implementation :)

https://github.com/recmo/cria

Took a whole night to compute a few tokens, but I used it to do the first tests.

Also, my friend pasted the paper to claude and it produced a working basic impl instantly :D

But in all seriousness - I think an MLX implementation would be doable, or a wrapper around the Metal GPU functionality.

kwikiel
1 replies
22h6m

Will share python implementation soon as a kind of executable pseudo code which then can be ported to any platform.

This project is kind of like the ultimate nerdsnipe, as the math is quite simple - you don't need a PhD to understand it, and actually implementing things would teach you linear algebra faster than just mindlessly doing exercise sets.

kolinko
0 replies
20h30m

Haha yes :) Publish it, Kacper!

The project is a nerdsnipe for math geeks, because there are multiple small things that beg to be proven / described by math there. For example - what's the tradeoff between the number of bits we lose by embedding position vs the bits of information that we gain by knowing which bucket a weight belongs to?

In other words - is it possible that when storing weights in the bucketed form we can actually end up having a higher precision than using a regular form? For Q8 we get just 4 bits to store the weight (and 1 bit for sign, and 3 bits for location), but these 4 bits need to express numbers from a smaller range than before.
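
To make that bit budget concrete, here's a toy packing of the 1+3+4 split into a byte - just the arithmetic, not the actual on-disk format, which may differ:

    def pack_weight(sign_bit, position, magnitude):
        # 1 bit sign | 3 bits position within a bucket | 4 bits magnitude -> one byte
        assert 0 <= sign_bit <= 1 and 0 <= position < 8 and 0 <= magnitude < 16
        return (sign_bit << 7) | (position << 4) | magnitude

    def unpack_weight(b):
        return (b >> 7) & 0b1, (b >> 4) & 0b111, b & 0b1111

    print(unpack_weight(pack_weight(1, 5, 12)))  # (1, 5, 12)

So the open question is whether 4 magnitude bits over a narrower per-bucket range can carry roughly as much information as a plain 8-bit weight over the full range.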

brrrrrm
1 replies
17h14m

Not sure why I got downvoted for asking this - seems like a reasonable thing in the ML space.

anyway here's a numerical simulation written in PyTorch for those who want to consider this algo on their projects before fully integrating: https://gist.github.com/bwasti/78d7058aad7f42dc893d906c877b7...

happy hacking

kolinko
0 replies
11h4m

If you got downvoted, it was briefly. That was a good question. PyTorch/numpy is awesome for initial testing.

uhlo
4 replies
20h51m

Okay now add a small model that decides how much effort is needed in each inference step and we are good to go

kolinko
3 replies
20h48m

Yes! That would be awesome. Especially since there are ~32*6 independent effort settings for every single token.

I tested the most basic implementation, with a flat effort setting for all the muls, but I bet the results could be pushed even further with such an approach. Or even with just doing some ML to figure out which layer/matrix needs more and which less effort.
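
To make the "equalizer" idea concrete, here is a hypothetical sketch of what a per-layer, per-matrix schedule could look like instead of one flat knob (the counts follow the ~32*6 figure above; the values are made up):

    # Hypothetical effort "equalizer": one setting per (layer, matrix) instead of a single global value.
    N_LAYERS = 32
    MATS_PER_LAYER = 6

    effort = {(layer, mat): 0.25              # flat default, as in the current implementation
              for layer in range(N_LAYERS)
              for mat in range(MATS_PER_LAYER)}

    # e.g. first-layer attention projections seem to tolerate much lower effort
    for mat in (0, 1, 2):                     # stand-ins for wq/wk/wv
        effort[(0, mat)] = 0.05

    def effort_for(layer, mat):
        return effort[(layer, mat)]

A small learned model (or even simple per-matrix statistics) could then fill in this table instead of hand-tuning it.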

uhlo
1 replies
20h43m

Great work! One thing: it seems the hugging face link doesn't work... I get a 404

kolinko
0 replies
11h6m

It should work now :)

mentos
0 replies
8h14m

If this is within reach of doing, it sounds like the joke might be worth exploring?

byyoung3
4 replies
23h13m

I think this is the overall idea behind the MoE LLM models right? MoE just expands upon this idea by learning which sets of weights to use

kolinko
1 replies
23h7m

I’d say that this expands on MoE really - MoE chooses dynamically which groups of weights may be needed, but it’s whole matrixes. Here it’s ingle weights.

Also, this works on top of MoE beautifully - most of the development and testing was done on Mixtral and it was getting (anecdotally) even better results - getting down to 15-18% effort before seriously losing quality.

I decided to release the Mistral version, but Mixtral was fully operational a few commits back :)

Also, the cool thing - because you can load only, say, the top 70% of weights, I was running Mixtral at full precision on my 96GB MacBook - there were no benchmarks for this in other implementations, because they need to load the full model into memory.

The real question is Q8 performance - I haven't implemented it fully so far.

anentropic
0 replies
6h23m

Could you add a git tag for the last commit where Mixtral was working?

huac
1 replies
20h3m

It also feels similar to mixture of depths (https://arxiv.org/abs/2404.02258).

Being able to apply this post-training is pretty cool though, makes it easier to use across a wider range of setups.

kolinko
0 replies
11h7m

Yes! I like that, and I saw that paper last weekend iirc. I think these MoD/MoE and other similar methods are highly compatible, and in a similar style.

I was originally afraid that this method wouldn't be compatible with MoE and the other methods, but fortunately, at least for Mixtral, there seems to be an amazing synergy.

By the way, other tasks have higher priority now, but there is an interesting observation about MoE. In MoE you get two experts chosen, and each expert has a different weight attached to it - e.g. expert 1 has 75% weight, and expert 2 has 25% weight. Perhaps this could allow scaling the effort to give 75% effort to one expert and 25% to the other. There are some issues there due to the non-linearity of the layers, but perhaps there is something to it.

queuebert
3 replies
23h42m

Please, please, please call the final product Halfwit.

Seriously though, this is a very interesting biologically inspired idea, since not all neuronal pathways fire all the time.

It seems to follow that, if you can predict which weights you won't need, then you should be able to compress the model architecture permanently, at least for certain use cases.

kolinko
0 replies
23h16m

Haha halfwit! I’m waiting for such a fork.

As for predicting the weights - not necessarily. It seems most weights are being used, just not all the time. Kind of like that saying that humans use just 5% of their brain - perhaps they do, but it's a different 5% at different times.

Interestingly, Effort works just as well on MoE, if not better. I did most of the development on Mixtral and I think it got down to 15-20% effort before losing quality, but there is some sort of bug right now that prevents inference on Mixtral.

It’s on a todo to fix, but I didn’t want to delay the release because of it.

LorenDB
0 replies
21h38m

Effortless would be another great name (since you are literally reducing effort to get speed). OK, maybe not "great", but "an option if you're going for puns".

HPsquared
0 replies
21h39m

Halfweight. Half the weights, half the wait, half the wit.

gsuuon
3 replies
23h2m

Nice writeup! Very curious about the performance per VRAM with this compared to just quantization. Any plans to implement a cross platform version?

kolinko
2 replies
22h20m

Per VRAM - not much better, because it still uses all the weights, just not all the time.

I mean - it can also load fewer weights, but quality seems to degrade quickly after dropping more than 20-30% of the weights.

In other words - this algorithm decouples inference time from VRAM use.

Having said that, I’m curious as well if using effort you can get better results on Q8 cropped to 75% than on Q6.

But it’s still probably a few weeks to get the implementation polished enough to be well tested.

AnthonyMouse
1 replies
20h17m

Having said that, I’m curious as well if using effort you can get better results on Q8 cropped to 75% than on Q6.

This is what I wanted to ask. This seems like the same kind of optimization as quantization, sacrificing a bit of quality for performance by discarding some data. So then the question is, which is better, and how do they combine?

You could even potentially get different results at different points in the scale. Maybe Q8 cropped to 75% isn't better than Q6 but Q4 cropped to 75% is better than Q3, or vice versa.

kolinko
0 replies
11h30m

Below Q8 I think it will combine poorly - bucketMul needs to keep a few bits for an index. That can be offset by the fact that the numbers in each bucket come from a limited range. My intuition is that at Q8 this trades off roughly evenly - meaning bucketed Q8 may store as much info as regular Q8 - but it will be more difficult with lower quantizations, and cross the boundary of impossible somewhere past Q5. I don't have the math to back this up, but perhaps some information theoretician could pick up the calculations.

You could say that these are divergent paths for future development (if the results hold for Q8) - perhaps you can crop Q8 models to Q6 sizes and the inference speeds of Q2/Q4. On the other hand, the wildly optimistic scenario is that bucketMul speeds will overcome even Q1, with dynamic computation pruning - token by token and layer by layer, by having a separate small network that chooses effort levels based on the input data (someone already proposed this in the thread).

For now, the most important thing is fixing the bugs, doing more serious tests for FP16, and showing how the charts look for Q8. Q8 especially is the biggest unknown, although the initial results are hopeful.

bigcat12345678
3 replies
1d

“““ So instead, let's flip the matrix, sort the elements row-wise, and revisit the multiplications from that direction.

This is called a Compressed Sparse Row (CSR) format by the smart people. To do the multiplication now, we take, say, the 1 from the vector, multiply it by 256, and add it into the output vector at the 3rd row. And so on.

Now, let's see what happens if we truncate the last column - the one with the lowest values. ”””

How does CSR work with reduced numbers multiplication?

kolinko
2 replies
1d

Can you rephrase the question? Not sure I get it.

bigcat12345678
1 replies
20h1m

I could not tell from the text how multiplication with the CSR format works in the context of the optimization.

The key missing piece for me is how to utilize the CSR format to find the numbers to multiply - in other words, how does the CSR format help with picking the numbers to multiply with the vector?

kolinko
0 replies
19h49m

Ah, I get the question now.

If I understand CSR correctly, it stores indices and values as a list. I store them as a list too - that's why the comparison is there. The difference is that with CSR you store, say, 15% of the values from a given row. I store all of them, but use only the first X% of them. The X% varies and depends on the input vector.

They are stored sorted, from the highest absolute value to the lowest.
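
Roughly, as a numpy sketch of that layout (illustrative only - a flat X% per row here, while the real code picks the cutoff per row based on the input, and the Metal kernel works on buckets rather than single weights):

    import numpy as np

    def precompute(W):
        # W: (d_in, d_out). For each row i (all weights touched by v[i]), store the
        # output indices sorted by descending |weight|, plus the reordered values.
        order = np.argsort(-np.abs(W), axis=1)
        W_val = np.take_along_axis(W, order, axis=1)
        return order, W_val

    def approx_matvec(v, W_idx, W_val, effort=0.25):
        d_in, d_out = W_val.shape
        k = max(1, int(effort * d_out))    # read only the first k (largest) weights per row
        out = np.zeros(d_out, dtype=W_val.dtype)
        for i in range(d_in):
            # scatter-add v[i] times the top-k weights of row i into their output positions
            np.add.at(out, W_idx[i, :k], v[i] * W_val[i, :k])
        return out

With effort=1.0 this reproduces v @ W exactly; lower effort just stops reading each row early.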

It's after midnight so my explanations may not be too good now, but I hope the pseudocode on the page and the examples explain it slightly better.

I'll be fixing grammar / typos, and asking ChatGPT to rewrite the page text for me tomorrow to make it more readable :)

a2code
3 replies
20h26m

If it sounds too good to be true, it probably is not.

If removing weights improves some metrics, that may be a clue that the model is not optimal in some sense.

mijoharas
0 replies
17h42m

The metric that's improved is computation speed, and it's achieved by essentially changing the computation (by not performing some computations that likely don't have a large impact on the results).

Given that it's a different computation, you could argue that Mistral+effort is a new model with the improved metric of quality per amount of computation performed.

Otherwise - given that for every different input there's a separate set of weights in the model that are excluded - I don't think you could conclude from this (if it holds up, etc.) that the base model is not optimal.

In a similar sense, quantization improves the "quality per model size" metric, but I don't think people are arguing that Mistral is less optimal than quantized Mistral (unless you're speaking about literally that metric). On the other hand, if you're targeting that metric specifically, then it would make perfect sense to say that quantized Mistral is more optimal for it.

I guess it comes down to optimality being dependent on the metric you're looking at, and there being many things you might want to optimise for.

To note again, if this technique holds up, it's better than model distillation (just getting rid of some of the weights), because for some inputs those weights could matter, and this technique should (iiuc) account for that somewhat. To me, this is what it sounds like you're referring to when saying:

If removing weights improves some metrics, that may be a clue that the model is not optimal in some sense

kolinko
0 replies
20h16m

The algorithm still uses all the weights, just not all the time - it just skips weights when they are not important for a given input vector.

Also, approximation methods, as a field, are not new and they have shown their use.

Having said all that, extraordinary claims require extraordinary evidence - that's why I hedge the messaging. It's "probably" until we get serious tests going.

addandsubtract
0 replies
17h20m

Wait, does the second "it" refer to the true part? Because traditionally, it refers to the "too good" expression. So you'd say, "it _is_ too good to be true".

kolinko
1 replies
1d

Similar theme, but they skip whole neurons; in my case it goes down to the level of single weights.

From my experiments, skipping whole neurons (so whole rows/columns of matrices) didn't allow for such good results. In my case, 30-50% of neurons are skipped entirely at 15% effort, but the rest are still used partially.

There are a few papers on a similar theme that a friend sent me this morning - I plan to add them to the citations section.

toisanji
0 replies
23h52m

awesome, looking forward to seeing how your results perform. I tested powerinfer on smaller models and didn't see large performance gains.

kolinko
2 replies
1d

Here to answer questions!

A friend of mine has published the original link on HN before me, so I hope a double post won't be an issue :)

kolinko
0 replies
10h57m

Thx! Although I'm happy you took your time and now I can brag until the end of time that I got 2 out of 5 top slots for a moment :D

gcy
2 replies
23h44m

Could you explain the pseudocode on your equations page? Is the second approxMul call a typo (also the capitalized V)?

    def precompute(W):
        W = W.T
        probes = get_probes(W)
        W_idx, W_val = sortMatrixRows(W)

    def approxMul(v, W_idx, W_val, probes):
        cutoff_chart = v * probes
        cutoff = topK(cutoff_chart, effort)
        approxMul(V, W_idx, W_val, cutoff)

kolinko
1 replies
20h37m

oh, thanks, I fixed it. No idea what I meant there originally.

There are still a few typos on the page, I'll be fixing them tomorrow - it's midnight now, and my mental batteries are slowly drying out :)

yousif_123123
1 replies
19h3m

I know this doesn't retrain, but I wonder if approaches like this plus quantization could get back any "lost" quality with some post training.

It's great to see, and to have a sense of how much better performance and cost are likely to be in the future.

I know it's fun to work on, but I also want to say THANK YOU for developing open source.

spencerchubb
0 replies
16h6m

At first glance, that sounds like it could work. From what I've read, there seem to be two main ways to regain some quality with quantization: post-training approaches, which happen after training, and quantization-aware training, which quantizes during training but leaves the activations and gradients in full precision.

wayoverthecloud
1 replies
15h37m

I was wondering if something similar could be done for CNNs too.

kolinko
0 replies
11h21m

A friend who first heard of the method immediately suggested it might work, perhaps even better, in diffusion models.

It is really a drop-in replacement for the regular matrix multiplication. The data structure is a bit more painful to work with (you have three datasets, not just one to represent a weight matrix), but it shouldn't be too difficult for devs of the existing inference engines to implement it for a test.

Half of my challenge was that I wasn't knowledgeable enough to just patch llama.cpp or MLX and use their engines with bucketMul. That's why I opted for making my own - still not sure if it was a good choice to build everything from the ground up, although I'm proud of the name :)

Finally - the basic math behind the approximation suggests that this should work with all the models - the cosine similarity score is 0.99 until the magical 25% mark for most of the matrices I tried. It can vary within a model though - e.g. in Llama, on the first layer, wq/wv/wk could easily be approximated with 5% effort, whereas some deeper layers at 25% had just a 0.90 cos score - still seemingly enough to not lose coherency within the model.
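
That check is easy to reproduce - something along these lines, with the truncation done densely (layer_weights is a hypothetical dict of weight matrices you'd load from a checkpoint yourself; ideally v would be a real activation vector rather than random):

    import numpy as np

    def cos_sim(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def approx_matvec(v, W, effort):
        # dense stand-in for the approximation: keep only the largest |v[i] * W[i, j]| products
        scores = np.abs(v[:, None] * W)
        cutoff = np.quantile(scores, 1.0 - effort)
        return ((scores >= cutoff) * v[:, None] * W).sum(axis=0)

    def sweep(layer_weights, efforts=(0.5, 0.25, 0.15, 0.05), seed=0):
        rng = np.random.default_rng(seed)
        for name, W in layer_weights.items():
            v = rng.standard_normal(W.shape[0]).astype(W.dtype)
            exact = v @ W
            print(name, [round(cos_sim(exact, approx_matvec(v, W, e)), 3) for e in efforts])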

spencerchubb
1 replies
20h51m

I love this line in the gpu implementation section.

"Readers fresh to GPU programming may ask now - how does it work?

Readers experienced with GPU programming may ask - how the hell does it work?"

kolinko
0 replies
20h0m

Haha thanks! :) As far as I understand, I had to implement the memory reads and some other things in the opposite way to what is considered the proper approach.

Would love to have that code reviewed by someone who actually knows stuff about Metal - this is my first gpu programming attempt

kolinko
0 replies
20h27m

Ah yes, forgot to make the repo public. Thanks a ton for pointing it out and writing the comment - I'd have missed it if you hadn't pointed it out on HN.

marmaduke
1 replies
21h19m

Having used CSR, it's not surprising, and some newer formats like block ELL might have more mechanical sympathy, since they avoid uncoalesced reads / gathers, though the code is trickier.

kolinko
0 replies
20h51m

Oh, nice to finally bump into someone who has experience with CSR!

bucketMul has few uncoalesced reads, and it uses a different data structure than regular CSR - it's described here: https://kolinko.github.io/effort/bucketmul.html It splits each matrix row into 16 parts and chooses which ones are necessary to read. The writes are fully linear.

Not sure if I speak sense though, it's getting a bit late today, and it's been a long day ;)

hatthew
1 replies
19h30m

This seems similar to semi-structured (aka 2:4) sparsity, may be worth explicitly comparing. As far as I can tell by skimming, your technique:

- is optimized for Apple silicon
- ~2x speed at 75% sparsity
- dynamic, depends on input, applied at runtime
- can choose amount of sparsity

And 2:4 semi-structured sparsity:

- is optimized for GPUs with sparse tensor cores (NVIDIA Ampere and beyond)
- ~2x speed at 50% sparsity
- static, applied to the model at rest
- probably worse results than your technique at 50% sparsity

The interesting comparison I'd want to see is semi-structured sparsity results (50% sparsity, 2x speedup) vs your results at 75% sparsity (2x speedup).

kolinko
0 replies
19h17m

Thanks for checking it out!

I also can’t wait to have more tests done.

Also, I chose Apple Silicon because it was easier for me to develop on. It is possible that the algorithm would also have good performance on other architectures.

globally_unique
1 replies
22h18m

That looks like fantastic stuff. I just want to point out that the 15ms delay looks similar to 60Hz vsync (16.7ms) - if you are updating the screen once per token, maybe that's causing a sync somehow?

kolinko
0 replies
21h50m

Nah, that's not it, I measure the CPU & GPU work separately, and 15ms happens between the kernel invocations. It also happens when I don't print out text.

Thanks for the idea though! I treat it as the first community contribution :D

coolvision
1 replies
23h37m

how does it compare to 8-bit/4-bit quantization in terms of speed/accuracy?

kolinko
0 replies
23h27m

Hard to say for now - I'm curious as well, but I've used simpler tests so far because of the implementation issues - most test suites are geared towards testing models, not model implementations.

I didn’t want to wait any longer with the release, but better tests will be coming soon I hope. Anecdotally, I think 30% effort should be comparable to Q8z

More importantly, this algorithm should work on top of Q8. The quality is not yet certain though - I could use help with the implementation.

brrrrrm
1 replies
15h10m

I've been trying to think about how you'd amp up the batch size here. It's a bit tricky since the memory access would be way higher, but I think you can actually still save on compute by chunking things up in a clever way to utilize the tensor cores.

kolinko
0 replies
11h38m

What would definitely help would be a better understanding of how the GPU handles myVal - does it go to the cache, threadgroup memory, or somewhere else? Or perhaps a decompilation of the kernel to see what's really going on there. If we knew, we could calculate the optimal parameters for a given architecture.

There is also another approach I tried - I've kept it in wildMul.swift/metal. It's more by the book: splitting the work so that one thread keeps track of only one or two output dimensions when going through the buckets. I didn't manage to get it working at a decent speed though.

The idea would be to make sure that one simdgroup reads 4/8 buckets in a batch (so still ~8-16 consecutive bytes which is considered minimum for an optimal read on Apple Silicon) and splits each single bucket among 4-8 threads. One thread might then need to remember only 2-4 dims in myVal, and this might stay in internal GPU registers.

Not sure if I'm explaining this all correctly though.

Having said this - the priority now will be to remove the overhead, to get the whole inference time closer to just the multiplication speed, but above all, fixing the bugs causing suboptimal results at 100% (and breaking Mixtral), and doing Q8. The algorithm needs to work at Q8 to be really usable.

bick_nyers
1 replies
22h10m

I'm wondering if you could sort the two inputs, add the indices for the multiplications together, then take the largest of that.

In your 0.1 example, 1000 gets index 2, and 0.1 index 0, combines to 2. This will tie with the 1*8, but I think it would even out with larger vector lengths.

Edit: I could be wrong but I think you can precompute the indices for the weights in advance without a prompt, then you won't need to perform those sorts at runtime.

kolinko
0 replies
11h12m

That was one of the ideas I was also considering originally! I don't remember why I skipped it though - it may have been because I tried the published one and it worked well enough during the first tries.

As for the indices for the weights - not sure if I get what you mean, but the weights are sorted as part of the precomputation. Sorting is not done at runtime, because that would kill any efficiency - the moment you need to read all the weights, you've lost, because it's the memory reads, not the computations, that matter. If you can skip one memory read by doing 5-10 multiplications, it's a good tradeoff.

avereveard
1 replies
1d

We need this in llama.cpp - it seems somewhat stable down to 40% effort.

kolinko
0 replies
1d

Yes! I hope, if it’s proven, that it will be implemented into the main inference engines.

40% effort is only a bit faster than a full base multiplication, but I hope both the speed and the accuracy could be improved further.