
Mixtral 8x7B: A sparse Mixture of Experts language model

vessenes
45 replies
1d14h

This paper details the model that's been in the wild for approximately a month now. Mixtral 8x7B is very, very good. It's roughly 13B in active parameter size, and it's ranked much, much higher than competitively sized models by, e.g., https://www.reddit.com/r/LocalLLaMA/comments/1916896/llm_com.... Ravenwolf notes that the model does slightly better in practice than its benchmark results suggest, and that matches my experience. It's surprisingly good for a model of its size, and a very capable daily driver on a Mac for chat, coding help and other uses.

Something that has come to light since the release of the weights, and that is not mentioned in this paper, is that it looks fairly likely the 8 experts were all seeded from Mistral 7B and subsequently diverged. This has generated a lot of experimentation in the local LLM community with cloning models as a way to cheaply generate experts.

It was generally thought likely that training an 8x7B network would be as much work as training 8 7B networks, but this seems not to have been true for Mistral, which is super interesting.

There's still a lot of rapid innovation happening in this space. With papers like CALM from DeepMind this week, and a lot of ad hoc experimental layer combining happening in the wild (see, e.g., Goliath-120B), I think we're likely to see some pretty interesting architectural improvements this year in the LLM space.

CALM seems to point the way to a next step after MoE, and models like Goliath seem to indicate that even a really, really lazy version of CALM (no linear layer combination, just literally alternating layers at full weights) can be very impactful. Overall I think we will see really, really strong models that are performant on consumer hardware in 2024, likely in the first half of the year.

Casteil
32 replies
1d14h

I've had excellent results with Mixtral too - it's genuinely impressive. Only problem is that it's a relatively big model that's difficult to run with full GPU inference on consumer hardware (vs the 7b/13b models people typically use).

So far, the main consumer platform capable of running it without 'ruining' the quality of its output (with high levels of quantization) is the newer Apple Silicon Macs with unified memory - generally >=48GB. It can apparently be done on 32 or 36GB, but there's not much headroom.

Edit: As coder543 points out, yes - you can run it without more lossy levels of quantization on multi-GPU setups, provided they have enough combined VRAM.

coder543
16 replies
1d13h

Mixtral works great at 3-bit quantization. It fits onto a single RTX 3090 and runs at about 50 tokens/s. The output quality is not "ruined" at all.
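
For anyone wanting to reproduce this, a minimal llama.cpp invocation looks roughly like the sketch below. The GGUF filename is an assumption on my part - substitute whichever 3-bit quant you actually downloaded - and -ngl 99 just offloads every layer to the GPU:

  # Sketch: run a 3-bit Mixtral GGUF fully offloaded to a single 24GB card.
  ./main -m mixtral-8x7b-instruct-v0.1.Q3_K_M.gguf -ngl 99 -c 4096 \
         -p "[INST] Explain mixture-of-experts in two sentences. [/INST]"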

For the amount of money you're talking about, you could also buy two 3090s (~$750 each on eBay) and have 48GB of VRAM to run with less quantization at full speed.

M-series Macs are surprisingly flexible platforms, but they're not "the only" consumer platform that can do Mixtral.

bekantan
3 replies
1d6h

The output quality is not "ruined" at all.

That was my experience as well - the 3-bit version is pretty good.

I also tried the 2-bit version, which was disappointing.

However, there is a new 2-bit approach in the works[1] (merged yesterday) which performs surprisingly well for Mixtral 8x7B Instruct with 2.10 bits per weight (12.3 GB model size).

[1] https://github.com/ggerganov/llama.cpp/pull/4773

mark_l_watson
1 replies
1d3h

I could only run the 2-bit (q2) quantization on my 32GB M2 Pro. I was a little disappointed, but I look forward to trying the new approach you linked. I just use Mistral's own hosted service and a 3rd-party hosting service for now.

After trying the various options for running locally, I have settled on just using Ollama - really convenient and easy, and its serve API lets me use various LLMs from several different (mostly Lisp) programming languages.
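
The serve API is plain HTTP, which is why it's so easy to call from any language; a minimal sketch with curl (assuming you've already pulled the mixtral model and Ollama is listening on its default port):

  curl http://localhost:11434/api/generate -d '{
    "model": "mixtral",
    "prompt": "Summarize the Mixtral 8x7B paper in one paragraph.",
    "stream": false
  }'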

With excellent resources from Hugging Face, tool providers, etc., I hope the user-facing experience for running LLMs is simplified even further: enter your hardware specs and get the available models filtered by what runs on your setup. Really, we are close to being there.

Off topic: I hope I don’t sound too lazy, but I am retired (in the last 12 years before retirement I managed a deep learning team at Capital One, worked for a while at Google and three other AI companies) and I only allocate about 2 hours a day to experiment with LLMs so I like to be efficient with my time.

Casteil
0 replies
1d1h

Ollama[1] + Ollama WebUI[2] is a killer combination for offline/fully local LLMs. Takes all the pain out of getting LLMs going. Both projects are rapidly adding functionality including recent addition of multimodal support.

[1] https://github.com/jmorganca/ollama

[2] https://github.com/ollama-webui/ollama-webui

coder543
0 replies
1d3h

That is a very interesting discussion. Weird to me that the quantization code wasn’t required to be in the same PR. Ika is also already talking about a slightly higher 2.31bpw quantization, apparently.

Casteil
3 replies
1d13h

Fair enough. I did put 'ruining' in quotes for a reason - I haven't compared output between Q3 and the Q4_K_M that I use, but you do generally sacrifice output quality at more aggressive quantization levels.

And you're right, you can run it on a multi-GPU setup if you're so inclined.

coder543
2 replies
1d13h

You can also choose to run at 4-bit quantization, offloading ~27 out of 33 layers to the GPU, and that runs at about 25 tokens/s for me. I think that's about the same speed as you get out of an M1 Max running at 4 bits? Although I'm not sure about the newer M2 or M3 Max chips. Googling around, I didn't immediately see clear benchmarks for those.
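
For reference, with llama.cpp that partial offload is just the -ngl flag; a rough sketch (the GGUF filename is assumed, and the layer count should be whatever fits your VRAM):

  # Offload 27 of the 33 layers to the GPU and keep the rest on the CPU.
  ./main -m mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf -ngl 27 \
         -p "[INST] Write a limerick about VRAM. [/INST]"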

patrakov
0 replies
1d11h

Just as another data point, a CPU-only setup with Q5_K_M would give you roughly 4 tokens per second on a Ryzen laptop (Dell Inspiron 7415 upgraded to 64 GB of RAM).

Casteil
0 replies
1d13h

Nice - that's still pretty solid, although on a more typical 3060 or 3070 with less VRAM available, I probably wouldn't expect numbers quite that good.

My 14" M1 Max does around 30t/s on Mixtral Q4_K_M.

3abiton
3 replies
1d10h

Have you tried the 2x 3090 setup? Using nvlink or SLI?

kwerk
0 replies
23h6m

I use 2x 3090, no NVLink though. I'd read that it doesn't help that much, but I'm not well read on what improvement it does give.

coder543
0 replies
1d4h

I have not personally gotten to test things that way.

Tepix
0 replies
23h30m

I tried it, with NVlink. The speedup during inference is negligible. You'll probably benefit more during training.

eyegor
1 replies
1d11h

So you don't see significantly worse performance on 3-bit quantized models compared to 4-bit? Every 7B/13B model I tried quantized gave much worse responses at 3-bit and below, whereas the difference from 4-bit to 6- or even 8-bit is more subtle.

coder543
0 replies
1d11h

Mixtral is a larger model, so maybe that makes it more tolerant of that level of quantization? I’ve been impressed with 3-bit Mixtral, but I haven’t done a ton of side by sides against 4-bit because I haven’t felt the need.

chpatrick
1 replies
1d8h

Could you share what you use to run it on a single 3090? I'd love to try it!

coder543
0 replies
1d4h

ollama has been by far the easiest way for me, either on Linux directly (as I do now) or WSL2.
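
If anyone wants to try it, the whole setup is basically two commands (this is the install one-liner from their site, from memory, so double-check it there):

  # Install Ollama on Linux or inside WSL2, then pull and chat with Mixtral.
  curl -fsSL https://ollama.ai/install.sh | sh
  ollama run mixtral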

ignoramous
12 replies
1d13h

the newer Apple Silicon Macs with unified memory

Mixtral has been ported to MLX already? Write-ups, if any?

reexpressionist
4 replies
1d4h

Here's the direct link, and I can confirm that Mixtral-8x7B-v0.1 works on an M2 Ultra 128GB via MLX and is easy to set up (the longest part is just downloading the weights):

https://github.com/ml-explore/mlx-examples/tree/main/llms/mi...

We'll have a tutorial soon (next week) on combining/composing with Reexpress to add uncertainty quantification (and to use it for semantic search). A link will be here: Tutorial 6: https://re.express/guide.html

teilo
3 replies
1d4h

I'm running it on an M2 Max with 96GB, and have plenty of room to spare. And it's fast. Faster than I can get responses from ChatGPT.

coder543
2 replies
1d3h

How many tokens/s? Which quantization? If you could test Q4_K_M and Q3_K_M, it would be interesting to hear how the M2 Max does!

teilo
1 replies
19h4m

No quantization (8_0). The full 48GB model. As for token count, I haven't tested it on more than 200 or so.

pilotneko
0 replies
17h48m

Isn’t 8_0 8-bit quantization?

LoganDark
3 replies
1d13h

Not to my knowledge. But because the unified memory doubles as VRAM for the onboard GPU, normal GPU acceleration can access the entire model even if it's 50+ GB. That's why ASi Macs are currently the holy grail for at-home inferencing, and also why projects like llama.cpp focus so much on ASi above all else, and why so many UIs release for macOS first before other operating systems. Certain Mac models offer up to 192GB of unified memory.

mkesper
2 replies
1d10h

But that's not a MacBook. And a MacBook with an M3 Max and 128GB of RAM is almost 8,000€.

zarzavat
1 replies
1d5h

Considering how inaccessible and expensive 128GB worth of pro-level cards is, that is, believe it or not, a good price.

LoganDark
0 replies
18h59m

Not to mention that's 128GB for a single GPU. No need to shard or split.

vessenes
1 replies
1d12h

Yes it has, actually: https://github.com/ml-explore/mlx-examples. It's right in the main repo. NB, I haven't tried this, I'm using llama.cpp with a non-K-quant quantization on my MBP.

summarity
0 replies
1d8h

I have and don't consider MLX to be production ready. I've tested it on M1Max and M1Ultra (128) machines. It's completely non-deterministic in its resource consumption, sometimes using the GPU fully, sometimes getting seemingly stuck while processing, sometimes the GPU throttles.

However, there's one curious thing: llama.cpp _always_ leads to GPU throttling on Apple Silicon (e.g. the M1Max GPU will go from 1200MHz to around 700MHz), and then fully saturates it. In the rare cases I could get MLX to stay on the GPU, it was able to keep it at the maximum clock rate. However the unpredictable pauses and seemingly unoptimized prompt processing makes it hard to pick a winner in end-to-end tokens/s

Terretta
0 replies
1d6h

Many options for running Mistral models in your terminal using LLM:

https://simonwillison.net/2023/Dec/18/mistral/

I liked "Using Llamafile’s OpenAI API endpoint" described there, using Justine Tunney's llamafiles for Mixtral, but the article link is out of date, as the models have been replaced with newer: https://huggingface.co/jartine

lithiumii
1 replies
1d13h

Three 4060 Ti 16GB cards (there are single-slot models) run around $1,500. I think it's possible to get a consumer system that's cheaper than a 48GB Mac.

Casteil
0 replies
1d13h

Yep. Edited my post to reflect as much. The MBP makes a rather nice portable package though.

bugglebeetle
7 replies
1d13h

Mixtral is good but those Ravenwolf benchmarks are meaningless. It’s like some random dude trying to reinvent MMLU without any rigor or consistency and in German. Dataset contamination is a problem, but not one that’s solved by folkloric evaluation of LLMs by people asking for tips on a subreddit.

vessenes
4 replies
1d12h

I don't think they're meaningless; they have a few benefits:

1) He doesn't have an ax to grind / an LLM to pimp out, so he's relatively even-handed

2) He uses the same (secret) test data for each model, so his testing is resistant to cherry-picking/finetuning on tests

3) He likes weirdo role-play prompting, so he has a very good sense of the edges of refusal and alignment tuning

4) He picks up stuff well before it hits the only other fair testing I know of, the chat arena

5) I think asking stuff in German is at worst neutral, and at best useful for testing capacity in edge cases.

Practically speaking, his 'preferred' non-giant models, Nous-Capybara-34B and Mixtral, are both excellent in comparison with some of the others he looks at, and good recommendations.

That said, I'd like to see a test suite that GPT-4 fails at, or at least struggles with. And it would save him a lot of time if he could get something automated together; it's clearly a lot of effort to hand-test all those models.

bugglebeetle
2 replies
1d12h

Any tests that are unfalsifiable and can't be reproduced are meaningless when it comes to gauging the performance of LLMs (and most other things). I could also post on a subreddit and say I have a secret set of German tests that may or may not exist and that I like these models, but that does nothing to advance the science of evaluating these things. If you want to evaluate human preferences, you can use chatbot arena, which can be gamed, but at least reflects more than what one guy says to be true. And this is with me agreeing that Nous-Capybara is also a good model. But don't take my word for it, because it's not worth very much!

vessenes
1 replies
23h0m

I think we agree on almost everything about this - which is to say, probably both you and I think the Wolfram Ravenwolf tests are just about useful enough to indicate whether or not a model might be worth downloading, but certainly not enough to justify spending money, say, or planning around. So, yes, I'm with you.

I agree that better ways to evaluate models would be super, super useful, and benchmarks like MMLU and whatever's next will continue to be helpful ("real" science). And it seems like there may even be some benefit to models training to 'ace the test' more broadly, which is interesting and ties into some educational theories about teaching humans.

However, one area where open tests can't excel is this "fair" evaluation arena - and I do think that private tests have some value there, to the extent that they can show utility and maintain trust. I don't make any claims about German sex role-play being a good or bad start for these, though.

bugglebeetle
0 replies
22h20m

I think there’s room for private tests, but those should probably be done by some kind of independent standards body. LLMs are an incredible advancement of machine learning research, but we diminish that when we let these kind of wholly unscientific evaluations predominate and especially in open source, where so much great work is otherwise being done to combat the monopoly tactics of big tech.

audessuscest
0 replies
1d8h

I agree with all of it except using German, because a model might be better at German without necessarily being better overall.

For example, I'm pretty sure Mistral models are better at French, so doing a benchmark using only French would be advantageous for them.

If you want to compare all models, it's better to use English. Right now his benchmark mostly shows which models are better at German.

That being said, it's still a very welcome benchmark.

epups
1 replies
1d9h

I find it baffling that anyone would take these benchmarks seriously. The methodology is not transparent, and some of the tests are completely unfair, like those in German or about specific niche German topics. The author readily acknowledges that these tests reflect his personal interests, which is totally fair. But that they would rise to the top of that subreddit and now HN as a general measure of quality is indicative of the lack of reliable benchmarks out there.

bugglebeetle
0 replies
1d

It’s all just people jockeying for influence and hype, like crypto before it.

pseudosavant
2 replies
1d10h

I'm really curious if/when we'll see MoE models based on even smaller models like Phi-2.

Redster
1 replies
1d1h

At this small a scale, what would the benefits be of something like a 8x2B as opposed to moving to a 7B?

pseudosavant
0 replies
21h52m

For its output performance (quality and tokens/second), Mixtral 8x7B seems to require relatively little VRAM. But it is still hard to make it fit, even with a lot of quantization, within the GPU RAM of most discrete consumer GPUs. Perhaps a smaller base model like Phi-2 could bring the VRAM requirements down, while the MoE structure brings the output quality up relative to plain Phi-2.

tracerbulletx
0 replies
1d12h

I'm looking forward to all the hardware announcements. It certainly looks like purpose-built, on-device acceleration of LLMs for consumers is coming.

cuuupid
42 replies
1d15h

I’d like to note that this model’s parameter usage is low enough (13b) to run smoothly at high quality on a 3090 while beating GPT-3.5 on humaneval and sporting 32k context.

3090s are consumer grade and common on gaming rigs. I'm hoping game devs start experimenting with locally deployed Mixtral in their games, e.g. something like Civ, but with each leader powered by an LLM.

LeoPanthera
13 replies
1d15h

Google tells me that the RTX 3090 is priced between US$1,480 and $1,680.

You can buy a whole PC for that. I refuse to believe that a GPU priced that highly is "consumer grade" and "common".

Are there any GPUs that are good for LLMs or other genAI that aren't absurdly priced? Or ones specifically designed for AI rather than gaming graphics?

alchemist1e9
7 replies
1d14h

Gamers and LLM/AI/ML GPU users do not find that absurdly priced. Absurdly priced in our world is $15,000, so your perceptions are off by about an order of magnitude.

__loam
6 replies
1d14h

I can assure you a $1,500 graphics card is a big luxury for most gamers.

https://store.steampowered.com/hwsurvey/Steam-Hardware-Softw...

3090 isn't even in the top 30.

fbdab103
1 replies
1d13h

I would go even further - anytime I look at the hardware survey, I am surprised by the anemic and dated hardware people are running. Most people who game are not building a custom box anymore.

contravariant
0 replies
1d6h

I mean I was, but turns out a well built box lasts a while.

MacsHeadroom
1 replies
1d12h

Actually it should be #23 on that list but is split up into two items with roughly 0.6% each. Seems to be a bug.

Search the page for 3090 and see for yourself, it's on the list twice.

__loam
0 replies
1d11h

To be honest I didn't look that hard

ryanwaggoner
0 replies
1d14h

Aren’t there others in that list above the 3090 that are even more expensive?

helloplanets
0 replies
1d12h

Then again, the Apple II cost around $6.5k in today's dollars. [0] My hunch is that people caring less about tricking out their computers for gaming comes down to not being all that interested in having top-of-the-line graphics settings enabled in AAA games. But I think the history of PCs and gaming very much proves that even normal consumers are willing to spend big on technology when it enables something truly new.

[0]: https://en.wikipedia.org/wiki/Apple_II

rfw300
0 replies
1d14h

I recently purchased a 3090 on Reddit’s hardwareswap community for $550. New GPUs are pricey right now because of shortages, but if you look around a bit it can be affordable.

renewiltord
0 replies
1d14h

Nah. It's about half that. You can pick up a used 4090 for that much.

minimaxir
0 replies
1d15h

tl;dr, no, especially since AMD is lagging behind.

Apple is the one doing the best in terms of making consumer-friendly hardware that can perform AI/ML tasks...but that involves a different problem regarding video games.

Conasg
0 replies
1d14h

To be fair, I got a card second hand to play with, and it was only £700 (~$900) and came with a manufacturer warranty. It was a bit of a gamble, but the 24GB of VRAM has been a godsend for experimenting with LLMs. And playing video games at 4K!

minimaxir
9 replies
1d15h

The average gamer doesn't have a 3090-equivalent, or even an Nvidia GPU.

Running LLMs locally to create custom dialogue for games is still years away.

somnic
3 replies
1d15h

VR isn't pragmatically accessible to the average gamer due to hardware requirements and the necessity of setting up the right physical environment but there are still VR games.

__loam
2 replies
1d14h

There are a few, after almost a decade of VR being a thing.

Cyphase
1 replies
1d8h

How did you arrive at "almost a decade"? There's been VR stuff since at least the late 1980s. On the flip side you could say it wasn't "a thing" until just a few years ago. Or that it isn't "a thing" yet.

__loam
0 replies
23h34m

I think it's pretty obvious that I'm talking about post oculus rift vr and the crop of games that use controllers similar to that device.

Mashimo
1 replies
1d10h

or even an Nvidia GPU.

What do you mean? Most gamers do have an Nvidia GPU.

Edit: unless you mean mobile gamers, and not PC gamers?

fragmede
0 replies
1d1h

To wit, the Steam report says 74% of its users have an Nvidia GPU.

https://store.steampowered.com/hwsurvey/Steam-Hardware-Softw...

simion314
0 replies
1d10h

Running LLMs locally to create custom dialogue for games is still years away

I've thought about this. You need a small LLM for a game; you do not need it to know about movies, music bands, history, coding, and all the text on the internet. I am thinking we need a small model similar to Phi-2, trained only on basic material and then fine-tuned on the game world's lore. The game would also use somewhat "simpler" graphics (we had good-looking game graphics decades ago, so I do not think you would be limited to text adventures or 2D graphics; you would just need simpler, more optimized graphics. It is always interesting when someone shows an Unreal demo that uses more RAM/VRAM for a simple demo level than a giant game like GTA5).

michaelmrose
0 replies
1d12h

Why couldn't this be handled remotely, as part of a subscription to the game?

kytazo
0 replies
1d15h

This is wrong in many ways. First off, you don't necessarily need a high-end consumer GPU; look at llama.cpp.

Or, even better, look at something like Phi-2. It's likely to go even lower. I'm sure there are people here who can detail more specifics.

It's January 2024 and we're already there, more or less.

snickell
7 replies
1d14h

You can also run Mixtral, at a decent token rate, on a post-2020 Apple MacBook Pro (M1/M2/M3) with 32GB+ of RAM. 16GB of RAM also works, sort of OK - I suspect at the same quantization level a 3090 would use - but I do notice a difference from the quantization. On my M2 Pro, the token rate and intelligence feel like GPT-3.5-turbo. This is the first model I've started actually using (vs. playing around with for the love of the tech) instead of GPT-3.5.

An Apple M2 Pro with 32GB of RAM is in the same price range as a gaming PC with a 3090, but it's another example of normal people with moderately high-performance systems "accidentally" being able to run a GPT-3.5-comparable model.

If you have an Apple meeting these specs and want to play around, LLM Studio is open source and has made it really easy to get started: https://lmstudio.ai/

I hope to see a LOT more hobby hacking as a result of Mixtral and successors.

nraford
2 replies
1d9h

How did you get Mixtral to run on a 32GB M1?

I tried using Ollama on my machine (same specs as above) and it told me I needed 49GB of RAM minimum.

eurekin
1 replies
1d6h

I'm using:

  ollama run dolphin-mixtral:8x7b-v2.5-q3_K_S

mark_l_watson
0 replies
1d2h

That runs on 32GB? The original Mixtral q3 wouldn't run for me. Maybe the Dolphin-tuned version is smaller?

EDIT: I just checked, it runs great, thanks.

cjbprime
1 replies
1d13h

I don't think it's true that LM Studio is open source. Maybe I'm missing something?

eyegor
0 replies
1d11h

Lmstudio (that they linked) is definitely not open source, and doesn't even offer a pricing model for business use.

Llmstudio is, but I suspect that was a typo in their comment. https://github.com/TensorOpsAI/LLMStudio

barnabee
0 replies
1d6h

I have so far run it on my M1 MacBook using llamafile [1] and found it to be great.

Is there any speed/performance/quality/context size/etc. advantage to using LLM Studio or any of the other *llama tools that require more setup than downloading and running a single llamafile executable?

[1] https://github.com/Mozilla-Ocho/llamafile/

Me1000
0 replies
1d6h

LM Studio, sadly, is not open source.

RandomBK
4 replies
1d15h

It's worth noting that the 4-bit quants can run on a CPU at roughly reading speed, which should unlock many use cases - especially if we can precompute some of the results asynchronously.
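
As a rough sketch of what that looks like with llama.cpp (the filename is assumed; set the thread count to your physical cores):

  # CPU-only inference: no layers offloaded, 8 threads.
  ./main -m mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf -ngl 0 -t 8 \
         -p "[INST] Give me three weeknight dinner ideas. [/INST]"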

minimaxir
2 replies
1d15h

That assumes full CPU utilization, with no other tasks heavily using the CPU.

In the case of high-end video games, that's unlikely.

somnic
0 replies
1d14h

Resource constraints would be a concern, yeah, so if you were developing a game featuring LLMs (which would, at this point in their development and maturity, be a gimmick) you would keep that in mind and keep other demands on resources low.

RandomBK
0 replies
1d14h

True, yet many games are effectively single-threaded anyways.

The bigger problem is memory capacity and bandwidth, but I suspect folks will eventually figure out some sort of QoS setup to let the system crunch LLMs using otherwise unused/idle resources.

ilaksh
0 replies
1d14h

Although in my testing, the 4-bit reasoning was not nearly as good.

sanjiwatsuki
2 replies
1d14h

The VRAM usage is closer to that of a 47B model - although only 2 experts are used per token at inference time, all 8 experts need to be loaded in memory.
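
Back-of-the-envelope, taking the paper's 47B-total / 13B-active figures (the shared/expert split below is my own estimate, not something stated in the paper):

\[
\begin{aligned}
N_{\text{shared}} + 8\,N_{\text{expert}} &\approx 47\text{B} \\
N_{\text{shared}} + 2\,N_{\text{expert}} &\approx 13\text{B}
\end{aligned}
\quad\Rightarrow\quad
N_{\text{expert}} \approx 5.7\text{B},\;\; N_{\text{shared}} \approx 1.6\text{B}
\]

So you pay the memory cost of all eight ~5.7B expert blocks even though only two of them run per token.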

discordance
1 replies
1d13h

Confirmed. Currently running the Mixtral 8x7B GGUF (Q8_0) on a MacBook Pro M1 Max with 64GB of RAM, and RAM usage is sitting at 48.8 GB.

karolist
0 replies
1d10h

How many t/s?

fswd
0 replies
1d14h

You cannot currently run Mixtral with a 32k context on a 3090 - or am I wrong? I think the largest context I was able to reproduce was around 1500 with 2- or 3-bit quantization; I would have to look at my notes.

LanternLight83
0 replies
1d12h

I've been working with local models as agents, and anyone interested in trying this needs to know about llama.cpp's "grammars" feature. You can force the model's output to conform to a specific structure, which is not only useful for ensuring that you receive e.g. valid JSON output, but also for more specific things like "if you choose to do x, you must also provide y", which can be great for influencing its thinking (e.g. an actor who's planning ahead might be required to respond with three of any of the five W's (its choice which three), but then it gets to be free-form inside the JSON string values, which can be used as context for a following selection from a restricted set of actions; or a model might have the option of asking for more time to think at the end of its response, but if it doesn't, then it needs to specify its next action).

This doesn't impact generation speed AFAICT and can be used in very creative ways, but results can still need to be re-generated if they're truncated, and I had to write a function to stop immediately when the valid JSON object is closed (i.e. at the end) or when more than about five newlines are generated in a row. This will vary by model.
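
For anyone who hasn't tried it: the feature is driven by a GBNF grammar file, and the llama.cpp repo ships a ready-made JSON grammar. A minimal sketch (the model filename is an assumption, and the keys in the prompt are just an example):

  # Constrain generation to valid JSON using the grammar bundled with llama.cpp.
  ./main -m mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf -ngl 99 \
         --grammar-file grammars/json.gbnf \
         -p "[INST] Describe your next action as a JSON object with keys action and reason. [/INST]"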

xrd
13 replies
1d15h

  In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks

I'm interested in seeing how it does with mathematics. That has always seemed like a particular weakness that no one yet has effectively cracked.

justinl33
12 replies
1d15h

It's sort of a weakness inherent to LLMs... next-word prediction isn't really supposed to be good at math.

I doubt it will ever be 'cracked' with better LLMs alone, only with multimodal ones that have access to program execution and calculators.

cjbprime
5 replies
1d13h

it's sort of a weakness inherent to LLM's... next word prediction isn't really supposed to be good at math.

FWIW I don't agree with this in a theoretical sense. The reason LLMs can do as much as they can is because next token prediction attempts to infer a world model for the processes that generated each next-token in the training set. I don't see a reason that this would preclude learning arithmetic in order to better predict next tokens that require arithmetic.

I'd guess that arithmetic will become suddenly reliable with one of the next significant (e.g. 2-5x) jumps in parameter count.

justinl33
4 replies
1d10h

I disagree: 1. The parameters are there to make the model better at learning textual patterns, not arithmetic patterns. 2. Next-token prediction is a terrible way to perform arithmetic. 3. Perhaps most importantly, the loss function does not incentivise being good at arithmetic at all.

Any perceived arithmetic ability is just a textual coincidence.

I agree that a sufficiently intelligent LLM, say 300+ IQ, would have an excellent model of how multiplication works. It may even assist in finding new theorems, but a calculator will always be better at 926*725.

cjbprime
3 replies
1d10h

1. the parameters are there to make the model better at learning textual patterns, not arithmetic patterns.

There's nothing "textual" about the tokens. They are arbitrary identifiers. There's nothing "textual" about transformers! The fact that e.g. GPT-4 can accept images as input, and that its textual performance improved as a result, and that transformers are also being used for text-to-speech models should have already communicated this.

2. Next token prediction is a terrible way to perform arithmetic.

This is just attempting to resolve our disagreement with pure assertion. It's certainly less efficient to use an artificial intelligence to do arithmetic. But whether it's efficient is a different question than how likely it is to be possible.

3. Perhaps most importantly, the loss function does not incentivise being good at arithmetic at all.

This is blatantly untrue. The same argument would suggest that LLMs can't do anything that wasn't exactly in their training set already. But they can.

justinl33
2 replies
1d6h

1. This is interesting. Tokens are arbitrary identifiers... of text. It's called a 'text encoder' for a reason.

2. It's a strong assertion, but it is true. It's kind of inherent in the name: next-word prediction. Why would you want a calculator to be making predictions? I agree that it's worth understanding whether it's possible, but your original point was that 'arithmetic will become suddenly reliable' - this is what I am disagreeing with.

3. This links to the above point - you are right: LLMs can already perform arithmetic, which could easily be shown by loading up GPT-2. But again, your original point was that LLMs will 'figure out' arithmetic - I don't believe they will. The loss function does not incentivize being good at arithmetic; it incentivizes making predictions that are close to the true value. It will not penalize an LLM that predicts 'good' as the next word when it should have been 'great', and while it might penalize '2+2=3', since that sequence is strongly represented in the training set, it's not going to penalize the model much for getting '1234*5678' one digit off - which is the problem.

eurekin
0 replies
1d6h

About 3., isn't that done specifically in order to not overfit?

cjbprime
0 replies
1d1h

LLMs are compression functions. An LLM that internalizes the rules of arithmetic will compress better than one that doesn't. That improvement will be measurable in the loss function, as it correctly predicts more of its training data that depends on arithmetic answers to predict the next word.

Why would you want a calculator to be making predictions?

If the predictions are correct, why wouldn't I? You are objecting to the entire concept of LLMs at this point, there's nothing specific to arithmetic here.

jakderrida
1 replies
1d13h

Annoyingly, Bard with Gemini will apparently use coding to answer every logical thinking question and get them all wrong. If you end with "Do not use code", it will get them all right.

Kerbonut
0 replies
1d12h

ChatGPT-4 is starting to do this now all of a sudden, and it is also annoyingly returning erroneous information.

xrd
0 replies
1d14h

Agreed. But they do call it out here and I'm interested in understanding why.

monkeydust
0 replies
1d11h

You don't need multimodal models to access tools such as a calculator. Check out PandasAI or the LangChain agent workflow. Rather than working out 451 * 995 itself, for example, the LLM constructs the pandas query, runs it, and returns the result to the user. Works pretty well.

foota
0 replies
1d14h

There's been attempts to use different embeddings for numbers, which helps a lot. E.g., https://news.ycombinator.com/item?id=37936005

e12e
0 replies
1d3h

Sure, LLMs don't do discrete math or logic directly. On the other hand, they write surprisingly good code.

I'm guessing we'll see LLMs that do input > program(s) > run > summarize > output.

I'm not really disagreeing - I just think LLMs will do "more of the work" themselves by way of writing and running Prolog programs, symbolic math (Julia etc.) and running theorem provers.

nl
5 replies
1d6h

In their recent interview on the A16Z podcast, the Mistral founder said they have multiple internal models between ChatGPT and GPT-4 in quality.

Given their high-quality releases so far, that means exciting times for open-source LLMs.

ignoramous
4 replies
1d6h

Except there's no indication those more powerful Mistral models will also be FOSS.

nl
3 replies
1d3h

Well he said that's the plan for the whole company so it'd be surprising if they aren't.

coder543
2 replies
1d2h

Mistral Medium is available via their API but not available for download, for example, so I find that confusing if you're saying their CEO claimed the plan is to be open for all models.

nl
1 replies
15h9m

Isn't Mistral Medium the Mixtral model? I'd never heard of Mistral Medium TBH.

coder543
0 replies
15h5m

"Mistral 7B" is "mistral-tiny", "Mixtral" is "mistral-small", and "mistral-medium" is something larger.

https://mistral.ai/news/la-plateforme/

xrd
4 replies
1d14h

Is this a model that can be run using Simon Willison's LLM tool? I cannot find any mention of Mixtral in the issues nor in the discussions. Is there an easy way to play with this model from the command line other than that?

ilaksh
0 replies
1d14h

ollama or llama.cpp

ignoramous
0 replies
1d14h

Unsure what Simon Willison's program does, but you can pull models via many methods.

For CLI, Ollama: https://ollama.ai/library/mixtral

With UI, GPT4All: https://gpt4all.io/index.html (doesn't yet support Mixtral)

In-app, superagent.sh: https://github.com/homanp/superagent

gsharma
0 replies
1d13h

Ollama is probably the easiest way. https://ollama.ai/library/mixtral

LM Studio is another option.

smcleod
4 replies
1d15h

Wasn’t this what was released at the end of last year?

simonw
1 replies
1d15h

Yes, the Mixtral magnet link was tweeted on the 8th of December: https://twitter.com/mistralai/status/1733150512395038967

The paper just came out today: https://twitter.com/dchaplot/status/1744547220983005478

grepfru_it
0 replies
1d15h

Also available on Ollama: https://ollama.ai/library/mixtral

ricopags
0 replies
1d15h

The model's weights were released in a torrent, but this is the much anticipated paper detailing some of the work.

IanCal
0 replies
1d15h

The model itself was, but was the writeup? This submission is from yesterday.

I've not been following their releases too well but it seemed they were very much on the side of releasing models asap.

kromem
4 replies
1d14h

I'm curious when we'll start to see open access multimodal models being released.

The advancement in text only models has been amazing, but a lot of the 'emergent' behavior in GPT-4 may be because of multimodal training and not just MoE or parameter sizes.

I'll be curious to see if multimodal smaller models see similar leaps.

coder543
1 replies
1d14h

CogVLM is very good in my (brief) testing: https://github.com/THUDM/CogVLM

The model weights seem to be under a non-commercial license, not true open source, but it is "open access" as you requested.

It would be nice if someone trained a CogVLM-compatible model from scratch under an open source license.

beoberha
0 replies
1d13h

When I tried last, it couldn’t be run on M-series Macs.

minimaxir
0 replies
1d14h

LLaVA is open, although not the leap you are expecting: https://llava-vl.github.io/

Meta also released a (non-commercial) multimodal model among 6 modalities: https://ai.meta.com/blog/imagebind-six-modalities-binding-ai...

ijustlovemath
0 replies
1d14h

I've heard that Google actually got the jump on OpenAI in this regard (just from people in a FAANG), and they're playing a bit of catch-up. OpenAI still has a distinct advantage on the language side, though. This is all hearsay, of course.

aunty_helen
4 replies
1d13h

On Apple Silicon:

https://ollama.ai/

ollama pull mixtral

For a ChatGPT-esque web UI:

https://github.com/ollama-webui/ollama-webui

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v ollama-webui:/app/backend/data --name ollama-webui --restart always ghcr.io/ollama-webui/ollama-webui:main

Navigate to http://localhost:3000

You can also use Ollama from LangChain.

vunderba
1 replies
1d10h

I wouldn't use Ollama (or any LLM host) through Docker on an M1 Mac, since as far as I'm aware there's still no Metal support there.

Luc
0 replies
1d9h

The UI is in Docker but the Ollama server isn't.

viraptor
1 replies
1d13h

There are also some unlocked fine-tunes available. Dolphin seems to be a very popular one (trained on more coding data). If you want to fit under 32GB, there's https://ollama.ai/library/dolphin-mixtral:8x7b-v2.7-q3_K_M

NwpierratorR
0 replies
1d6h

I haven't played with dolphin-mixtral, but dolphin-mistral gave me a good impression for a generic RAG application.

There are also developments and experimentation in making it more factual via DPO and LASER (AFAIK, so far not very successful).

cgeier
2 replies
1d10h

I haven't read a lot of LLM papers, but I believe this is a rather weak paper, low on details (note: I mean the paper itself, not the results achieved by the LLM). If it had landed on my desk for review, I probably would have sent it back just based on that.

For example, they never really say how they trained the experts or which dataset they used.

Is this the current standard in the field?

ShamelessC
0 replies
1d10h

Is this the current standard in the field?

It's becoming pretty common, yeah. The two things you mentioned - training particulars and dataset mixture - are also basically the only competitive advantage these companies have. Since the code/architecture is trivial to reproduce, anyone with enough money can make a competing model "easily".

OpenAI started this trend and cemented it with GPT4’s “technical report” which didn’t even specify the number of parameters in the model. They’ve been historically vague about their dataset for far longer than that though.

MichaelRazum
0 replies
1d4h

Exactly my thought. Actually I would expect that they trained each expert separately and then trained them together, since you need to train the router network as well. I'm far from an expert in LLMs, but this would be interesting to know, especially how different training setups influence the performance.

Mashimo
2 replies
1d10h

Anyone know of a decent coding-assistant model that can run on a 16GB VRAM RTX 4060 Ti?

stavros
0 replies
1d8h

The chatbot arena has tons of good models that can do that, and Mistral, at the very least, isn't half bad.

gsharma
0 replies
1d3h

It will probably require some trial and error, but this leaderboard is a good starting point. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...

~13B models should work well with plenty of room left for other applications. Lately I've heard good things about Solar 10B, but new models come out by the dozen every day, so that might have already changed.

invert_franklin
1 replies
1d14h

Does anyone know what Figure 8 at the end shows?

It looks like each expert is used interchangeably with no clear pattern. And earlier they say "Surprisingly, we do not observe obvious patterns in the assignment of experts based on the topic."

So then, what is the point of the "expert"?

Could this extra performance come just from the 8-expert architectural design, and not from the underlying training material? For example, if all 8 experts were trained on ArXiv papers, would the performance be different?

jakderrida
0 replies
1d13h

Not sure if it answers your question, but the expertise of each expert arises from the training process and isn't assigned by a human. Thus, it wouldn't necessarily be discernible to a human. The choice of which experts to use is made by something called a "gating network" (or router), which is also trained to pick the most appropriate experts for each token.
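
From memory of the paper's formulation, it's roughly the standard sparse-MoE setup: for each token x, a small learned gating matrix W_g scores all 8 experts, the top two scores are kept and softmaxed into weights, and the layer output is the weighted sum of just those two experts:

\[
y \;=\; \sum_{i \in \operatorname{Top2}(x \cdot W_g)} \operatorname{Softmax}\big(\operatorname{Top2}(x \cdot W_g)\big)_i \cdot E_i(x)
\]

Here each E_i is an ordinary feed-forward block. Since W_g and the experts are trained jointly end to end, whatever "specialization" emerges is simply whatever minimizes the loss, which is why it doesn't line up with human topic labels.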

SubiculumCode
1 replies
1d12h

Questions running through my head: Is there some magic to the number 8? Why not 6, or 11?

Each of these 8 experts was a 7B model. What about using 80 TinyLlama-style 1B models?

pushfoo
0 replies
19h11m

TL;DR: It looks like a convenient and explainable starting point for the proof-of-concept.

Manually combining specialist variants is a known technique. This paper automates it with a router component that mixes 2 sub-models at any given time. Training 8 slight variants of a base seems safe and configurable compared to n > 16 specialists, where it seems like the parts could interact unpredictably.

Also, the memory usage seems predictable: it follows 2^m memory conventions by mixing 2 models at a time, so ~2x the memory is actively used at a time. I'm not up to date on the hardware implications, so it might not mean anything yet. It might one day if this approach works well enough to design around.

noman-land
0 replies
1d14h

If anyone wants to try out this model, I believe it's one of the ones released as a Llamafile by Mozilla/jart[0].

1) Download llamafile[1] (30.03 GB): https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-ll...

2) chmod +x mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile

3) ./mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile

[0] https://hacks.mozilla.org/2023/11/introducing-llamafile/

[1] https://github.com/Mozilla-Ocho/llamafile#quickstart

justinl33
0 replies
1d15h

tldr;

- 'sparse mixture of experts' means each layer contains 8 mini feed-forward networks (experts), each of which learns to handle certain kinds of tokens

- a router network passes each token to the 2 experts that suit it best

- since only 2/8 experts are used per token, the model effectively only uses ~13B of its 47B parameters during inference (text generation)

- this expert mechanism makes it very efficient and effective (fewer active params, able to specialize per token)

- beats Llama 2 70B and GPT-3.5, especially at math, coding, and multilingual tasks

- a fine-tuned (Instruct) version beats Gemini Pro, as well as Llama 2 70B and GPT-3.5

dang
0 replies
1d12h

Recent and related:

Mixtral of experts - https://news.ycombinator.com/item?id=38598559 - Dec 2023 (300 comments)

Mistral-8x7B-Chat - https://news.ycombinator.com/item?id=38594578 - Dec 2023 (69 comments)

Mistral "Mixtral" 8x7B 32k model [magnet] - https://news.ycombinator.com/item?id=38570537 - Dec 2023 (239 comments)

Reubend
0 replies
21h31m

Is there a description of each "expert"? Does one of the 8 models specialize in multi-lingual translation, while another specializes in coding?

I don't see any answer to this in the paper, although I only skimmed it.