This paper details the model that's been in the wild for approximately a month now. Mixtral 8x7B is very, very good. It uses roughly 13B active parameters per token (about 47B total), and it ranks much higher than comparably sized models in community evaluations like https://www.reddit.com/r/LocalLLaMA/comments/1916896/llm_com.... Ravenwolf notes that the model does slightly better in his testing than its benchmark scores would suggest, and that matches my experience. It's surprisingly good for a model of its size, and a very capable daily driver on a Mac for chat, code input and other uses.
Something that has come to light since the release of the weights, and isn't mentioned in this paper, is that it looks fairly likely the 8 experts were all seeded from Mistral 7B and subsequently diverged during training. This has spurred a lot of experimentation in the local LLM community with cloning existing models as a cheap way to generate experts.
It was generally assumed that training an 8x7B network would be roughly as much work as training eight separate 7B networks, but that seems not to have been true for Mistral, which is super interesting.
There's still a lot of rapid innovation happening in this space, with papers like CALM from DeepMind this week and a lot of ad hoc experimental layer combining happening in the wild (see, e.g., Goliath-120B). I think we're likely to see some pretty interesting architectural improvements in the LLM space this year.
CALM seems to point the way to a next step after MoE, and models like Goliath suggest that even a really lazy version of the same idea (no learned linear combination, just literally alternating layers at full weight) can be very impactful. Overall I think we will see really strong models that are performant on consumer hardware in 2024, likely in the first half of the year.
I've had excellent results with Mixtral too - it's genuinely impressive. Only problem is that it's a relatively big model that's difficult to run with full GPU inference on consumer hardware (vs the 7b/13b models people typically use).
So far, the main consumer platform capable of running it without 'ruining' the quality of its output through heavy quantization is the newer Apple Silicon Macs with unified memory - generally >=48GB. It can apparently be done on 32 or 36GB, but there's not much headroom.
Edit: As coder543 points out, yes - you can run it without the more lossy levels of quantization on multi-GPU setups, provided they have enough combined VRAM.
Mixtral works great at 3-bit quantization. It fits onto a single RTX 3090 and runs at about 50 tokens/s. The output quality is not "ruined" at all.
For the amount of money you're talking about, you could also buy two 3090s (~$750 each on eBay) and have 48GB of VRAM to run with less quantization at full speed.
M-series Macs are surprisingly flexible platforms, but they're not "the only" consumer platform that can do Mixtral.
That was my experience as well - the 3-bit version is pretty good.
I also tried the 2-bit version, which was disappointing.
However, there is a new 2-bit approach in the works[1] (merged yesterday) which performs surprisingly well for Mixtral 8x7B Instruct with 2.10 bits per weight (12.3 GB model size).
[1] https://github.com/ggerganov/llama.cpp/pull/4773
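As a rough sanity check on how bits-per-weight translates to file size, here's a back-of-envelope calculation (assuming Mixtral's ~46.7B total parameters and ignoring the per-block scale overhead that quantized formats carry, so real files will differ a little):

    # rough model-file size from parameter count and bits per weight
    # assumes ~46.7B total params for Mixtral 8x7B; ignores format overhead
    params = 46.7e9
    for bpw in (2.10, 3.0, 4.0, 5.5, 8.0):
        gb = params * bpw / 8 / 1e9
        print(f"{bpw} bpw -> ~{gb:.1f} GB")

The 2.10 bpw case works out to roughly 12.3 GB, which lines up with the figure quoted from the PR.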
I could only run the 2-bit (Q2) quantization on my 32GB M2 Pro. I was a little disappointed, but I look forward to trying the new approach you linked. For now I just use Mistral's hosted API and a third-party hosting service.
After trying the various options for running locally, I have settled on just using Ollama - it's really convenient and easy, and its serve API lets me use various LLMs from several different (mostly Lisp) programming languages.
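To illustrate what I mean by using the serve API from any language, here's a minimal sketch in Python (the same pattern works from Lisp or anything else that can POST JSON; it assumes Ollama is running locally on its default port and that the mixtral model has already been pulled):

    import json, urllib.request

    # Ollama's local REST API: POST a prompt, read back the completion
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({
            "model": "mixtral",   # assumes `ollama pull mixtral` has been run
            "prompt": "Summarize the Mixtral 8x7B architecture in two sentences.",
            "stream": False,      # one JSON object instead of a token stream
        }).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])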
With excellent resources from Hugging Face, tool providers, etc., I hope the user-facing experience for running LLMs gets simplified even further: enter your hardware specs and see the available models filtered down to what will actually run on your setup. We are really close to being there.
Off topic: I hope I don't sound too lazy, but I am retired (in the last 12 years before retirement I managed a deep learning team at Capital One and worked for a while at Google and three other AI companies), and I only allocate about 2 hours a day to experimenting with LLMs, so I like to be efficient with my time.
Ollama[1] + Ollama WebUI[2] is a killer combination for offline/fully local LLMs. It takes all the pain out of getting LLMs going. Both projects are rapidly adding functionality, including the recent addition of multimodal support.
[1] https://github.com/jmorganca/ollama
[2] https://github.com/ollama-webui/ollama-webui
That is a very interesting discussion. It's weird to me that the quantization code wasn't required to be in the same PR. Ika is apparently also already talking about a slightly larger 2.31 bpw quantization.
Fair enough. I did put 'ruining' in quotes for a reason - I haven't compared output between Q3 and the Q4_K_M I use, but you do generally sacrifice output quality at more aggressive quantization levels.
And you're right, you can run it on a multi-GPU setup if you're so inclined.
You can also run at 4-bit quantization, offloading ~27 of the 33 layers to the GPU, and that gives me about 25 tokens/s. I think that's about the same speed you get out of an M1 Max running at 4-bit? I'm not sure about the newer M2 or M3 Max chips, though - Googling around, I didn't immediately find clear benchmarks for those.
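If anyone wants to reproduce that kind of partial offload, here's a minimal sketch using the llama-cpp-python bindings (not necessarily my exact setup; the GGUF path is a placeholder and n_gpu_layers should be tuned to your VRAM):

    from llama_cpp import Llama

    # load a 4-bit Mixtral GGUF, keeping ~27 of the layers on the GPU
    llm = Llama(
        model_path="./mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=27,  # layers offloaded to the GPU; the rest run on the CPU
        n_ctx=4096,       # context window
    )
    out = llm("[INST] Explain mixture-of-experts routing briefly. [/INST]", max_tokens=200)
    print(out["choices"][0]["text"])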
Just as another data point, a CPU-only setup with Q5_K_M would give you roughly 4 tokens per second on a Ryzen laptop (Dell Inspiron 7415 upgraded to 64 GB of RAM).
Nice - that's still pretty solid... although on a more typical 3060 or 3070 with less VRAM available, I probably wouldn't expect numbers quite that good.
My 14" M1 Max does around 30t/s on Mixtral Q4_K_M.
Have you tried the 2x 3090 setup? Using NVLink or SLI?
I use 2x 3090, no NVLink though. I'd read that it doesn't help much, but I'm not well read on whatever improvement it does give.
I have not personally gotten to test things that way.
I tried it, with NVlink. The speedup during inference is negligible. You'll probably benefit more during training.
So you don't see significantly worse performance from 3-bit quantized models compared to 4-bit? Every 7B/13B model I've tried gave much worse responses at 3-bit and below, whereas the difference from 4-bit to 6-bit or even 8-bit is more subtle.
Mixtral is a larger model, so maybe that makes it more tolerant of that level of quantization? I’ve been impressed with 3-bit Mixtral, but I haven’t done a ton of side by sides against 4-bit because I haven’t felt the need.
Could you share what you use to run it on a single 3090? I'd love to try it!
Ollama has been by far the easiest way for me, either on Linux directly (as I do now) or under WSL2.
Has Mixtral been ported to MLX already? Any write-ups?
Here's the direct link. I can confirm that Mixtral-8x7B-v0.1 works on an M2 Ultra 128GB via MLX and is easy to set up (the longest part is just downloading the weights):
https://github.com/ml-explore/mlx-examples/tree/main/llms/mi...
We'll have a tutorial soon (next week) on combining/composing with Reexpress to add uncertainty quantification (and to use it for semantic search). A link will be here: Tutorial 6: https://re.express/guide.html
I'm running it on an M2 Max with 96GB, and have plenty of room to spare. And it's fast. Faster than I can get responses from ChatGPT.
How many tokens/s? Which quantization? If you could test Q4_K_M and Q3_K_M, it would be interesting to hear how the M2 Max does!
No quantization (8_0). The full 48GB model. As for token count, I haven't tested it on more than 200 or so.
Isn’t 8_0 8-bit quantization?
Not to my knowledge. But because the unified memory doubles as VRAM for the onboard GPU, normal GPU acceleration can access the entire model even if it's 50+ GB. That's why ASi Macs are currently the holy grail for at-home inferencing, and also why projects like llama.cpp focus so much on ASi above all else, and why so many UIs release for macOS first before other operating systems. Certain Mac models offer up to 192GB of unified memory.
But that's not a MacBook. And a MacBook with an M3 Max and 128GB of RAM is almost 8000€.
Considering how inaccessible and expensive 128GB worth of pro-level cards is, that is, believe it or not, a good price.
Not to mention that's 128GB for a single GPU. No need to shard or split.
Yes it has, actually: https://github.com/ml-explore/mlx-examples. It's right in the main repo. NB, I haven't tried this; I'm using llama.cpp with a non-K-quant quantization on my MBP.
I have, and I don't consider MLX to be production-ready. I've tested it on M1 Max and M1 Ultra (128GB) machines. It's completely non-deterministic in its resource consumption: sometimes it uses the GPU fully, sometimes it gets seemingly stuck while processing, sometimes the GPU throttles.
However, there's one curious thing: llama.cpp _always_ leads to GPU throttling on Apple Silicon (e.g. the M1 Max GPU will drop from 1200MHz to around 700MHz) and then fully saturates it. In the rare cases where I could get MLX to stay on the GPU, it was able to keep it at the maximum clock rate. However, the unpredictable pauses and seemingly unoptimized prompt processing make it hard to pick a winner in end-to-end tokens/s.
Many options for running Mistral models in your terminal using LLM:
https://simonwillison.net/2023/Dec/18/mistral/
I liked "Using Llamafile’s OpenAI API endpoint" described there, using Justine Tunney's llamafiles for Mixtral, but the article link is out of date, as the models have been replaced with newer: https://huggingface.co/jartine
Three 4060 Ti 16GB cards (there are single-slot models) come to around $1500. I think it's possible to get a consumer system that's cheaper than a 48GB Mac.
Yep. Edited my post to reflect as much. The MBP makes a rather nice portable package though.
Mixtral is good but those Ravenwolf benchmarks are meaningless. It’s like some random dude trying to reinvent MMLU without any rigor or consistency and in German. Dataset contamination is a problem, but not one that’s solved by folkloric evaluation of LLMs by people asking for tips on a subreddit.
I don't think they're meaningless; they have a few benefits:
1) He doesn't have an ax to grind / an LLM to pimp out, so he's relatively even-handed
2) He uses the same (secret) test data for each model, so his testing is resistant to cherry-picking/finetuning on tests
3) He likes weirdo role-play prompting, so he has a very good sense of the edges of refusal and alignment tuning
4) He picks up stuff well before it hits the only other fair testing I know of, the chat arena
5) I think asking stuff in German is at worst neutral, and at best useful for testing capacity in edge cases.
Practically speaking, his 'preferred' non-giant models, Nous-Capybara-34B and Mixtral, are both excellent in comparison with some of the others he looks at, and good recommendations.
That said, I'd like to see a test suite that GPT-4 fails at, or at least struggles with. And it would save him a lot of time if he could put together something automated; it's clearly a lot of effort to hand-test all those models.
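Something automated wouldn't even need to be fancy. A minimal sketch (assuming a local OpenAI-compatible server like the ones discussed above, and a private JSON file of question/expected-keyword pairs, both hypothetical here) could look like:

    import json
    from openai import OpenAI

    # any OpenAI-compatible local endpoint (llamafile, etc.) works here
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

    # private test set: [{"question": "...", "expected_keywords": ["...", ...]}, ...]
    tests = json.load(open("private_tests.json"))

    score = 0
    for t in tests:
        resp = client.chat.completions.create(
            model="local-model",  # placeholder; local servers usually ignore this
            messages=[{"role": "user", "content": t["question"]}],
        )
        answer = resp.choices[0].message.content.lower()
        # crude keyword scoring; a serious harness would want better grading
        score += all(k.lower() in answer for k in t["expected_keywords"])

    print(f"{score}/{len(tests)} answers contained all expected keywords")

Keyword matching is obviously a blunt instrument, but even that would make results reproducible across model updates.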
Any tests that are unfalsifiable and can't be reproduced are meaningless when it comes to gauging the performance of LLMs (and most other things). I could also post on a subreddit saying I have a secret set of German tests that may or may not exist and that I like these models, but that does nothing to advance the science of evaluating these things. If you want to evaluate human preferences, you can use Chatbot Arena, which can be gamed, but at least it reflects more than what one guy says is true. And this is with me agreeing that Nous-Capybara is also a good model. But don't take my word for it, because it's not worth very much!
I think we agree on almost everything about this - which is to say, probably both you and I think the Wolfram Ravenwolf tests are just about useful enough to indicate whether or not a model might be worth downloading, but certainly not enough to justify spending money, say, or planning around. So, yes, I'm with you.
I agree that better ways to evaluate models would be super useful, and benchmarks like MMLU and whatever comes next will continue to be helpful ("real" science). It also seems like there may even be some benefit to models training to 'ace the test' more broadly, which is interesting and ties into some educational theories about teaching humans.
However, one area where open tests can't excel is this "fair" evaluation arena - and I do think private tests have some value there, to the extent that they can demonstrate utility and maintain trust. I make no claims about German sex role-play being a good or bad starting point for that, though.
I think there's room for private tests, but those should probably be done by some kind of independent standards body. LLMs are an incredible advancement in machine learning research, but we diminish that when we let these kinds of wholly unscientific evaluations predominate, especially in open source, where so much great work is otherwise being done to combat the monopoly tactics of big tech.
I agree with all of that except using German, because a model might be better at German without necessarily being better overall.
For example, I'm pretty sure Mistral models are better at French, so a French-only benchmark would be advantageous for them.
If you want to compare all models, it's better to use English. As it stands, his benchmark mostly shows which models are better at German.
That being said, it's still a very welcome benchmark.
I find it baffling that anyone would take these benchmarks seriously. The methodology is not transparent, and some of the tests are completely unfair, like those in German or about specific niche German topics. The author readily acknowledges that these tests reflect his personal interests, which is totally fair. But that they would rise to the top of that subreddit, and now HN, as a general measure of quality is indicative of the lack of reliable benchmarks out there.
It’s all just people jockeying for influence and hype, like crypto before it.
I'm really curious if/when we'll see MoE models based on even smaller models like Phi-2.
At this small a scale, what would the benefits be of something like an 8x2B as opposed to just moving to a 7B?
For the output performance it delivers (quality and tokens/second), Mixtral 8x7B seems to require relatively little VRAM compared to a dense model of similar quality. But it is still hard to make it fit, even with a lot of quantization, within the VRAM of most discrete consumer GPUs. Perhaps a smaller base model like Phi-2 could bring the VRAM requirements down, while the MoE structure brings the output quality up relative to plain Phi-2.
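Very rough back-of-envelope numbers (treating an 8x2B as at most 8 x 2B total is a crude upper bound, since attention and embedding weights are shared across experts):

    # crude 4-bit memory estimates; parameter counts are approximate
    def gb_at_4bit(params_b):
        return params_b * 1e9 * 4 / 8 / 1e9

    for name, params_b in [("Mixtral 8x7B (~46.7B total)", 46.7),
                           ("hypothetical 8x2B (<=16B total)", 16.0),
                           ("dense 7B", 7.0)]:
        print(f"{name}: ~{gb_at_4bit(params_b):.0f} GB at 4-bit")

So an 8x2B-style model could plausibly fit within the 8-16GB of VRAM typical of consumer cards at 4-bit, which is the appeal.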
I'm looking forward to all the hardware announcements. It certainly looks like purpose-built on-device acceleration of LLMs for consumers is coming.