This paper details the model that's been in the wild for approximately a month now. Mixtral 8x7B is very, very good. It uses roughly 13B active parameters per token (about 47B total), and it ranks much higher than comparably sized models in community evaluations like https://www.reddit.com/r/LocalLLaMA/comments/1916896/llm_com.... Ravenwolf notes that the model does slightly better in his testing than its benchmark scores would suggest, and that matches my experience. It's surprisingly good for a model of its size, and a very capable daily driver on a Mac for chat, code input and other uses.
Something that has come to light since the release of the weights, and isn't mentioned in this paper, is that it looks fairly likely the 8 experts were all seeded from Mistral 7B and subsequently diverged during training. This has spurred a lot of experimentation in the local LLM community with cloning existing models as a cheap way to generate experts.
It was generally assumed that training an 8x7B network would be roughly as much work as training eight separate 7B networks, but that seems not to have been true for Mistral, which is super interesting.
There's still a lot of rapid innovation happening in this space, with papers like CALM from DeepMind this week and a lot of ad hoc experimental layer combining happening in the wild (see, e.g., Goliath-120B). I think we're likely to see some pretty interesting architectural improvements in the LLM space this year.
CALM seems to point the way to a next step after MoE, and models like Goliath suggest that even a really lazy version of the same idea (no learned linear combination, just literally alternating layers at full weight) can be very impactful. Overall I think we will see really strong models that are performant on consumer hardware in 2024, likely in the first half of the year.
I've had excellent results with Mixtral too - it's genuinely impressive. Only problem is that it's a relatively big model that's difficult to run with full GPU inference on consumer hardware (vs the 7b/13b models people typically use).
So far, the main consumer platform capable of running it without 'ruining' the quality of its output through heavy quantization is the newer Apple Silicon Macs with unified memory - generally >=48GB. It can apparently be done on 32 or 36GB, but there's not much headroom.
Edit: As coder543 points out, yes - you can run it without the more lossy levels of quantization on multi-GPU setups, provided they have enough combined VRAM.
Mixtral works great at 3-bit quantization. It fits onto a single RTX 3090 and runs at about 50 tokens/s. The output quality is not "ruined" at all.
For the amount of money you're talking about, you could also buy two 3090s (~$750 each on eBay) and have 48GB of VRAM to run with less quantization at full speed.
M-series Macs are surprisingly flexible platforms, but they're not "the only" consumer platform that can do Mixtral.
That was my experience as well - the 3-bit version is pretty good.
I also tried the 2-bit version, which was disappointing.
However, there is a new 2-bit approach in the works[1] (merged yesterday) which performs surprisingly well for Mixtral 8x7B Instruct with 2.10 bits per weight (12.3 GB model size).
[1] https://github.com/ggerganov/llama.cpp/pull/4773
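As a rough sanity check on how bits-per-weight translates to file size, here's a back-of-envelope calculation (assuming Mixtral's ~46.7B total parameters and ignoring the per-block scale overhead that quantized formats carry, so real files will differ a little):

    # rough model-file size from parameter count and bits per weight
    # assumes ~46.7B total params for Mixtral 8x7B; ignores format overhead
    params = 46.7e9
    for bpw in (2.10, 3.0, 4.0, 5.5, 8.0):
        gb = params * bpw / 8 / 1e9
        print(f"{bpw} bpw -> ~{gb:.1f} GB")

The 2.10 bpw case works out to roughly 12.3 GB, which lines up with the figure quoted from the PR.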
I could only run the 2-bit (Q2) quantization on my 32GB M2 Pro. I was a little disappointed, but I look forward to trying the new approach you linked. For now I just use Mistral's hosted API and a third-party hosting service.
After trying the various options for running locally, I have settled on just using Ollama - it's really convenient and easy, and its serve API lets me use various LLMs from several different (mostly Lisp) programming languages.
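To illustrate what I mean by using the serve API from any language, here's a minimal sketch in Python (the same pattern works from Lisp or anything else that can POST JSON; it assumes Ollama is running locally on its default port and that the mixtral model has already been pulled):

    import json, urllib.request

    # Ollama's local REST API: POST a prompt, read back the completion
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({
            "model": "mixtral",   # assumes `ollama pull mixtral` has been run
            "prompt": "Summarize the Mixtral 8x7B architecture in two sentences.",
            "stream": False,      # one JSON object instead of a token stream
        }).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])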
With excellent resources from Hugging Face, tool providers, etc., I hope the user-facing experience for running LLMs gets simplified even further: enter your hardware specs and see the available models filtered down to what will actually run on your setup. We are really close to being there.
Off topic: I hope I don't sound too lazy, but I am retired (in the last 12 years before retirement I managed a deep learning team at Capital One and worked for a while at Google and three other AI companies), and I only allocate about 2 hours a day to experimenting with LLMs, so I like to be efficient with my time.
Ollama[1] + Ollama WebUI[2] is a killer combination for offline/fully local LLMs. It takes all the pain out of getting LLMs going. Both projects are rapidly adding functionality, including the recent addition of multimodal support.
[1] https://github.com/jmorganca/ollama
[2] https://github.com/ollama-webui/ollama-webui
That is a very interesting discussion. It's weird to me that the quantization code wasn't required to be in the same PR. Ika is apparently also already talking about a slightly larger 2.31 bpw quantization.
Fair enough. I did put 'ruining' in quotes for a reason - I haven't compared output between Q3 and the Q4_K_M I use, but you do generally sacrifice output quality at more aggressive quantization levels.
And you're right, you can run it on a multi-GPU setup if you're so inclined.
You can also run at 4-bit quantization, offloading ~27 of the 33 layers to the GPU, and that gives me about 25 tokens/s. I think that's about the same speed you get out of an M1 Max running at 4-bit? I'm not sure about the newer M2 or M3 Max chips, though - Googling around, I didn't immediately find clear benchmarks for those.
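If anyone wants to reproduce that kind of partial offload, here's a minimal sketch using the llama-cpp-python bindings (not necessarily my exact setup; the GGUF path is a placeholder and n_gpu_layers should be tuned to your VRAM):

    from llama_cpp import Llama

    # load a 4-bit Mixtral GGUF, keeping ~27 of the layers on the GPU
    llm = Llama(
        model_path="./mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=27,  # layers offloaded to the GPU; the rest run on the CPU
        n_ctx=4096,       # context window
    )
    out = llm("[INST] Explain mixture-of-experts routing briefly. [/INST]", max_tokens=200)
    print(out["choices"][0]["text"])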
Just as another data point, a CPU-only setup with Q5_K_M would give you roughly 4 tokens per second on a Ryzen laptop (Dell Inspiron 7415 upgraded to 64 GB of RAM).
Nice - that's still pretty solid... although on a more typical 3060 or 3070 with less VRAM available, I probably wouldn't expect numbers quite that good.
My 14" M1 Max does around 30t/s on Mixtral Q4_K_M.
Have you tried the 2x 3090 setup? Using NVLink or SLI?
I use 2x 3090, no NVLink though. I'd read that it doesn't help much, but I'm not well read on whatever improvement it does give.
I have not personally gotten to test things that way.
I tried it, with NVlink. The speedup during inference is negligible. You'll probably benefit more during training.
So you don't see significantly worse performance from 3-bit quantized models compared to 4-bit? Every 7B/13B model I've tried gave much worse responses at 3-bit and below, whereas the difference from 4-bit to 6-bit or even 8-bit is more subtle.
Mixtral is a larger model, so maybe that makes it more tolerant of that level of quantization? I’ve been impressed with 3-bit Mixtral, but I haven’t done a ton of side by sides against 4-bit because I haven’t felt the need.
Could you share what you use to run it on a single 3090? I'd love to try it!
Ollama has been by far the easiest way for me, either on Linux directly (as I do now) or under WSL2.
Has Mixtral been ported to MLX already? Any write-ups?
Here's the direct link. I can confirm that Mixtral-8x7B-v0.1 works on an M2 Ultra 128GB via MLX and is easy to set up (the longest part is just downloading the weights):
https://github.com/ml-explore/mlx-examples/tree/main/llms/mi...
We'll have a tutorial soon (next week) on combining/composing with Reexpress to add uncertainty quantification (and to use it for semantic search). A link will be here: Tutorial 6: https://re.express/guide.html
I'm running it on an M2 Max with 96GB, and have plenty of room to spare. And it's fast. Faster than I can get responses from ChatGPT.
How many tokens/s? Which quantization? If you could test Q4_K_M and Q3_K_M, it would be interesting to hear how the M2 Max does!
No quantization (8_0). The full 48GB model. As for token count, I haven't tested it on more than 200 or so.
Isn’t 8_0 8-bit quantization?
Not to my knowledge. But because the unified memory doubles as VRAM for the onboard GPU, normal GPU acceleration can access the entire model even if it's 50+ GB. That's why ASi Macs are currently the holy grail for at-home inferencing, and also why projects like llama.cpp focus so much on ASi above all else, and why so many UIs release for macOS first before other operating systems. Certain Mac models offer up to 192GB of unified memory.
But that's not a MacBook. And a MacBook with an M3 Max and 128GB of RAM is almost 8000€.
Considering how inaccessible and expensive 128GB worth of pro-level cards is, that is, believe it or not, a good price.
Not to mention that's 128GB for a single GPU. No need to shard or split.
Yes it has, actually: https://github.com/ml-explore/mlx-examples. It's right in the main repo. NB, I haven't tried this; I'm using llama.cpp with a non-K-quant quantization on my MBP.
I have, and I don't consider MLX to be production-ready. I've tested it on M1 Max and M1 Ultra (128GB) machines. It's completely non-deterministic in its resource consumption: sometimes it uses the GPU fully, sometimes it gets seemingly stuck while processing, sometimes the GPU throttles.
However, there's one curious thing: llama.cpp _always_ leads to GPU throttling on Apple Silicon (e.g. the M1 Max GPU will drop from 1200MHz to around 700MHz) and then fully saturates it. In the rare cases where I could get MLX to stay on the GPU, it was able to keep it at the maximum clock rate. However, the unpredictable pauses and seemingly unoptimized prompt processing make it hard to pick a winner in end-to-end tokens/s.
Many options for running Mistral models in your terminal using LLM:
https://simonwillison.net/2023/Dec/18/mistral/
I liked "Using Llamafile’s OpenAI API endpoint" described there, using Justine Tunney's llamafiles for Mixtral, but the article link is out of date, as the models have been replaced with newer: https://huggingface.co/jartine
Three 4060 Ti 16GB cards (there are single-slot models) come to around $1500. I think it's possible to get a consumer system that's cheaper than a 48GB Mac.
Yep. Edited my post to reflect as much. The MBP makes a rather nice portable package though.
Mixtral is good but those Ravenwolf benchmarks are meaningless. It’s like some random dude trying to reinvent MMLU without any rigor or consistency and in German. Dataset contamination is a problem, but not one that’s solved by folkloric evaluation of LLMs by people asking for tips on a subreddit.
I don't think they're meaningless; they have a few benefits:
1) He doesn't have an ax to grind / an LLM to pimp out, so he's relatively even-handed
2) He uses the same (secret) test data for each model, so his testing is resistant to cherry-picking/finetuning on tests
3) He likes weirdo role-play prompting, so he has a very good sense of the edges of refusal and alignment tuning
4) He picks up stuff well before it hits the only other fair testing I know of, the chat arena
5) I think asking stuff in German is at worst neutral, and at best useful for testing capacity in edge cases.
Practically speaking, his 'preferred' non-giant models, Nous-Capybara-34B and Mixtral, are both excellent in comparison with some of the others he looks at, and good recommendations.
That said, I'd like to see a test suite that GPT-4 fails at, or at least struggles with. And it would save him a lot of time if he could put together something automated; it's clearly a lot of effort to hand-test all those models.
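Something automated wouldn't even need to be fancy. A minimal sketch (assuming a local OpenAI-compatible server like the ones discussed above, and a private JSON file of question/expected-keyword pairs, both hypothetical here) could look like:

    import json
    from openai import OpenAI

    # any OpenAI-compatible local endpoint (llamafile, etc.) works here
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

    # private test set: [{"question": "...", "expected_keywords": ["...", ...]}, ...]
    tests = json.load(open("private_tests.json"))

    score = 0
    for t in tests:
        resp = client.chat.completions.create(
            model="local-model",  # placeholder; local servers usually ignore this
            messages=[{"role": "user", "content": t["question"]}],
        )
        answer = resp.choices[0].message.content.lower()
        # crude keyword scoring; a serious harness would want better grading
        score += all(k.lower() in answer for k in t["expected_keywords"])

    print(f"{score}/{len(tests)} answers contained all expected keywords")

Keyword matching is obviously a blunt instrument, but even that would make results reproducible across model updates.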
Any tests that are unfalsifiable and can't be reproduced are meaningless when it comes to gauging the performance of LLMs (and most other things). I could also post on a subreddit saying I have a secret set of German tests that may or may not exist and that I like these models, but that does nothing to advance the science of evaluating these things. If you want to evaluate human preferences, you can use Chatbot Arena, which can be gamed, but at least it reflects more than what one guy says is true. And this is with me agreeing that Nous-Capybara is also a good model. But don't take my word for it, because it's not worth very much!
I think we agree on almost everything about this - which is to say, probably both you and I think the Wolfram Ravenwolf tests are just about useful enough to indicate whether or not a model might be worth downloading, but certainly not enough to justify spending money, say, or planning around. So, yes, I'm with you.
I agree that better ways to evaluate models would be super useful, and benchmarks like MMLU and whatever comes next will continue to be helpful ("real" science). It also seems like there may even be some benefit to models training to 'ace the test' more broadly, which is interesting and ties into some educational theories about teaching humans.
However, one area where open tests can't excel is this "fair" evaluation arena - and I do think private tests have some value there, to the extent that they can demonstrate utility and maintain trust. I make no claims about German sex role-play being a good or bad starting point for that, though.
I think there's room for private tests, but those should probably be done by some kind of independent standards body. LLMs are an incredible advancement in machine learning research, but we diminish that when we let these kinds of wholly unscientific evaluations predominate, especially in open source, where so much great work is otherwise being done to combat the monopoly tactics of big tech.
I agree with all of that except using German, because a model might be better at German without necessarily being better overall.
For example, I'm pretty sure Mistral models are better at French, so a French-only benchmark would be advantageous for them.
If you want to compare all models, it's better to use English. As it stands, his benchmark mostly shows which models are better at German.
That being said, it's still a very welcome benchmark.
I find it baffling that anyone would take these benchmarks seriously. The methodology is not transparent, and some of the tests are completely unfair, like those in German or about specific niche German topics. The author readily acknowledges that these tests reflect his personal interests, which is totally fair. But that they would rise to the top of that subreddit, and now HN, as a general measure of quality is indicative of the lack of reliable benchmarks out there.
It’s all just people jockeying for influence and hype, like crypto before it.
I'm really curious if/when we'll see MoE models based on even smaller models like Phi-2.
At this small a scale, what would the benefits be of something like an 8x2B as opposed to just moving to a 7B?
For the output performance it delivers (quality and tokens/second), Mixtral 8x7B seems to require relatively little VRAM compared to a dense model of similar quality. But it is still hard to make it fit, even with a lot of quantization, within the VRAM of most discrete consumer GPUs. Perhaps a smaller base model like Phi-2 could bring the VRAM requirements down, while the MoE structure brings the output quality up relative to plain Phi-2.
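Very rough back-of-envelope numbers (treating an 8x2B as at most 8 x 2B total is a crude upper bound, since attention and embedding weights are shared across experts):

    # crude 4-bit memory estimates; parameter counts are approximate
    def gb_at_4bit(params_b):
        return params_b * 1e9 * 4 / 8 / 1e9

    for name, params_b in [("Mixtral 8x7B (~46.7B total)", 46.7),
                           ("hypothetical 8x2B (<=16B total)", 16.0),
                           ("dense 7B", 7.0)]:
        print(f"{name}: ~{gb_at_4bit(params_b):.0f} GB at 4-bit")

So an 8x2B-style model could plausibly fit within the 8-16GB of VRAM typical of consumer cards at 4-bit, which is the appeal.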
I'm looking forward to all the hardware announcements. It certainly looks like purpose-built on-device acceleration of LLMs for consumers is coming.