My takeaway is that this proves what was said on the recent latent.space podcast with David Luan from Adept.
https://www.latent.space/p/adept
"I think [open source] is going to commoditize a lot of the regular LLMs and soon regular multimodal models."
In other words, if you train your own models, you will not get to take advantage of breakthroughs like this that start with open models (like Mistral).
All the advantages are going towards the open models and this is an existential risk for OpenAI and other closed model companies.
Maybe, but there's nothing that stops OpenAI from stealing these tricks.
It's extremely unlikely anyone will take any of this.
Quick take:
- it was 15x slower than llama.cpp when I used Apple's new proprietary ML framework on my MacBook
- So I made it possible to skip arbitrary amounts of work.
- I identified an arbitrary tradeoff that seems arbitrarily good to me.
- I've confirmed this by making GPT-4 write a prompt with some questions. Then I had the normal version answer, and the "skip arbitrary work" version answer, and it LGTM.
- So I threw it up on GitHub, then on HN with the title "LLM inference 2x faster (possibly)", and people missed: [on my laptop] [in the ML framework I'm forcing myself to use] [based on an eval I made up] [based on an eval where I am the evaluator]
This *really* shouldn't have the title it does, very misleading.
Author, please feel free to correct me; I'm sorry for not taking the time to find a gentler way to communicate this. I hope you kick ass. You did put "possibly" in a parenthetical, but it's carrying the weight of the world here: people just see "LLM 2x faster". That's why everyone is spinning off into grand speculation land, which I also see you valiantly commenting to dissuade.
I think you are commenting on the "stealing" reply and not my original comment. And, I think you are making my point stronger.
OpenAI could easily (and will) put out a blog post or tweet saying "next models do inference 2.5x faster!" Koliko did that, or maybe he didn't and someone else put words in his mouth. I don't really care: I can validate and test your comments here (and they are great!) and I can try his experiments myself.
I cannot do that against "GPT-5-2.5x faster (c) 2024"-42B (because it isn't released publicly yet). Putting a paper and some vague ideas on arXiv isn't really doing much these days except adding to the confusion. Truly open work like koliko is doing is really exciting, and feels like it can only be done against truly open models like Mistral.
Oh wait, Mistral isn't fully open either (ducks...).
There used to be a class of software called freeware - you could download and use it without restriction, you just couldn't resell it, or have the source to modify it. Llama and similar models are like freeware - an inscrutable binary blob crafted to work with other software, except instead of a VM or native OS environment, you have llama.cpp or similar software that runs the AI model.
Mistral is open source in that you can do anything the Apache license allows, even package it into your own product and resell or modify it. We're missing the dataset details and the source to the software that produces the model, similar to having a binary without access to the operating system's source and its special compiler tooling. That's not a huge deal, because people don't have the resources to make use of those large datasets or of Mistral's training software, which is likely highly tailored to their own training and development pipeline and wouldn't do much good for anyone without at least a pod of A100s of their own.
"Weights available" and other terms are being thrown around, and Meta and the like are calling their stuff "open", but that use of the term bears little resemblance to how the open source community uses the word.
The public Mistral models have open source licenses. The model can be used like open source software. The terms are permissive and free, requiring only attribution. Meta's license scheme is novel and not open, with arbitrary lawyerese, and it absolutely, 100% will bite someone in the ass when the threshold between "annoying to sue" and "profitable to sue" gets crossed by someone using Llama in a way that's technically incorrect. Right now, Meta wants the goodwill more than they want to spend a couple million dollars chasing a couple dozen startups.
If the model doesn't have an open source license, it's not open. It might be freeware. Llama is freeware. You can, technically, do whatever you want to it, but try to not attract too much notice or be too successful with it.
Mistral, by using Apache licensing, couldn't go after you even if they wanted to, unless you do something deliberately stupid.
Actually, OSS comes with tons of strings attached that make the term "open" dubious, and there are many ways they could come after you legally. Apache, GPL, etc. all have terms and conditions: you have to contribute back X, Y, and Z, agree to our manifesto, and so on.
The only truly free license is MIT. Go build a billion-dollar business, change it however you want with no strings attached, and you can express the license terms in a single paragraph.
Apache 2.0 is almost as permissive as MIT, and better suited for some cases. I love the MIT license, as it allows the most freedom for users and creators across the board. Apache 2.0 is second best, but might be better in a formalized organization that wants more structure and formalized documentation requirements.
Inference as a whole is slow, but it's the matrix multiplications that count. They work reliably on all the MacBooks I tested: at 50% effort they match the speed of state-of-the-art matrix multiplications, and at 25% they are twice as fast.
Apple's MPS matrix multiplications are comparable in speed to llama.cpp's. When testing, I compared the llama.cpp benchmarks ( https://github.com/ggerganov/llama.cpp/discussions/4167 ) to Apple's MPS, and they match very closely. Then I compared Apple's MPS to my results.
Even if the end results show that the models somehow break (which they might at Q8), there is no other implemented method right now that would give you such speedups with matrices at 25% sparsity. The usual methods break even with full matrix multiplications around the 15% mark, and show speed improvements under 10% (as far as I know, but I'm new to the field, so I wait to be corrected).
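For readers trying to picture what "25% effort" means: here is a toy NumPy sketch of the general idea of skipping most of the multiplications in a matrix-vector product. This is not the project's actual Metal implementation; the function name `effort_matvec` and the specific selection strategy (keeping the largest-magnitude input entries) are illustrative assumptions, not the author's algorithm.

```python
import numpy as np

def effort_matvec(W, x, effort=1.0):
    """Approximate y = W @ x using only the top `effort` fraction of
    input entries by magnitude; the remaining columns of W are skipped."""
    n = x.shape[0]
    k = max(1, int(round(effort * n)))     # how many entries to keep
    keep = np.argsort(np.abs(x))[-k:]      # indices of the largest |x| values
    return W[:, keep] @ x[keep]            # multiply only those columns

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 256))
x = rng.standard_normal(256)

exact = W @ x
approx = effort_matvec(W, x, effort=0.25)  # ~25% of the multiplications
# At effort=1.0 the result matches the full product; at 0.25 it is only
# an approximation, and the interesting question is how much quality
# the surrounding model actually loses.
```

The sketch does a quarter of the multiply-adds at `effort=0.25`; whether that translates into wall-clock speedups depends entirely on the kernel implementation, which is where the real work in the project lies.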
As for the other metrics, I hope to get help from the community to get the implementation done properly. So far it's been 3 months of work, 12 hours a day, even during Easter, to get this version going. It is as far as I can push it without community support, which I'm happy to have received over the last hour.
Also, I'm not sure what you'd expect, really. A full production-ready system on day one? From a solo developer? Seriously? :)
Let's get the flame war going! :D
Nah, it's good work. You'll be able to flame me in a couple weeks... months?... too, when I ship my yawn-worthy yet-another llama.cpp / OpenAI wrapper. :p
I'd love this knob, particularly in llama.cpp; inference is a bit too slow on Android, 6 tkn/s for a 3B model. I just can't stand it when people don't actually read anything but the title and go crazy overboard. Like, how are we in a thread where people are saying "oh, this confirms local models will definitely win, like I heard on a podcast" and "big bad OpenAI will steal this"?
Hahah thanks, although I was hoping for a flame to get adrenaline flowing to push through the night :D
I also hope there will be an extra knob, or rather knobs, because effort can be regulated smoothly layer by layer, token by token, matrix by matrix. Think of it more like an equalizer than a volume control :)
The biggest question right now is how (or if) it will perform with Q8 and with smaller models. The risk is that at Q8 the quality drop-off will show up closer to 40-60% effort, negating the performance gains.
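The "equalizer, not a volume control" idea can be sketched as a per-layer effort schedule. Everything here is hypothetical: the 0.0-1.0 scale, the function name, and the hunch that first and last layers deserve full effort are illustrative assumptions, not something the project has validated.

```python
def effort_schedule(n_layers, base=0.25, first=1.0, last=1.0):
    """Build a per-layer effort schedule: full effort on the first and
    last layers, reduced effort (`base`) everywhere in between, on the
    (unvalidated) hunch that the edges are more quality-sensitive."""
    sched = [base] * n_layers
    sched[0] = first
    sched[-1] = last
    return sched

# An 8-layer model: one knob per layer instead of a single global dial.
print(effort_schedule(8))  # [1.0, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 1.0]
```

The same shape extends to per-token or per-matrix knobs; the schedule just becomes a 2D or 3D table instead of a list.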
Exactly. That's the problem with the current state of things with open models: the players that keep their secret sauce maintain an edge over the people doing things in the open, while benefiting from all of their work without contributing back.
That was the claim about a lot of software in the past, but in the end open source won in many places.
At the end of the day, if their profit margins aren't good, it doesn't matter whether their competition is open source or not (which is often where OSS wins). I think we are seeing that AI isn't the slam dunk for increasing productivity, or we would see companies like UiPath being profitable. I don't think we've seen anyone net a profit on AI software, and the only company that has been investing since at least 2017, Apple, gets zero credit for its contributions and commercialization of the tech. I think about how Amazon abandoned its AI-powered, checkout-free tech because the margin of error stayed stubbornly high for too long. The clock is ticking on the industry, and some players, like Apple, have already found it isn't profitable (well, not to their standard of a 60% return on investment).
Depends on the field.
The project from the thread would have taken me an impossible amount of time without GPT. Even the page itself would have taken me twice as long to create: the charts were done by pasting source data into GPT and having it write matplotlib code to chart them, and the equations were originally written by GPT as well, because I wasn't familiar with MathJax.
Ditto with the code: a bunch of it was written by GPT originally, since this is my first Swift/Metal project. I kept telling it what I wanted to do in Python, and it kept rewriting it in Swift/Metal until I learned the latter.
The name "effort" was also invented by GPT. Originally, internally, I was using "quant", but that would be confused with quantization. I considered "perc", from percentage, but that's ugly. I described the project to GPT, and it suggested "effort" as a metric.
As for self-checkout: in Poland we have Żabka Nano, which is still going and seems more solid than Amazon's attempt, but of course time will tell :)
The biggest companies in the world are running closed-source software for profit that builds on open source foundations while barely contributing back, so it's really not the counter-argument you think it is. And it's no wonder we're now seeing open source companies going for source-available licenses (Redis, HashiCorp) or other kinds of restrictions (Red Hat): they were helpless against the parasitic behavior of the big bad wolves.
In these fast-moving early days of LLMs, they can maintain this advantage through simple things like proprietary datasets and greater compute.
The difference in quality between the best model that can be made with such proprietary utilities and without is likely to decrease over time as open datasets of greater quality are published and the field matures.
The difference in quality, and the number of competitors, ultimately pays the bills; the harsher the competition, the less money there will be for each individual company to maintain its possibly dwindling proprietary edge.
The greater access to compute is an edge companies will likely hold for a while. It will be interesting to see how much open models will be able to catch up and how great of an edge proprietary models will maintain.
In this case, though, the algorithm should be just as useful to closed models as to open ones. There is nothing in it that is optimised specifically for Mistral - aside from the hard-coded dimensions in multiple places in the code :D
Having said that, it's awesome to have open source models out there, and I hope they ultimately win.
That seems fundamentally flawed to me. Closed providers can copy techniques from open models but not vice versa.
To me that reads as "closed will always be a step ahead", not as an existential risk to OpenAI.