Does anyone have a good layman's explanation of the "Mixture-of-Experts" concept? I think I understand the idea of having "sub-experts", but how do you decide what each specialization is during training? Or is that not how it works at all?
First test: I tried running a random taxation question through it
Output: https://gist.github.com/IAmStoxe/7fb224225ff13b1902b6d172467...
Within the first paragraph, it outputs:
GET AN ESSAY WRITTEN FOR YOU FROM AS LOW AS $13/PAGE
Thought that was hilarious.
That's not the model this post is about. You used the base model, not trained for tasks. (The instruct model is probably not on ollama yet.)
I absolutely did not:
ollama run mixtral:8x22b
EDIT: I like how you ninja-edited your comment ;)
Considering "mixtral:8x22b" on ollama was last updated yesterday, and Mixtral-8x22B-Instruct-v0.1 (the topic of this post) was released about 2 hours ago, they are not the same model.
Are we looking at the same page?
And even the direct tag page: https://ollama.com/library/mixtral:8x22b shows 40-something minutes ago: https://imgur.com/a/WNhv70B
I get:
ollama run mixtral:8x22b
Error: exception create_tensor: tensor 'blk.0.ffn_gate.0.weight' not found
You need to update ollama to 0.1.32.
Thanks. That did it.
Let me clarify.
Mixtral-8x22B-v0.1 was released a couple days ago. The "mixtral:8x22b" tag on ollama currently refers to it, so it's what you got when you did "ollama run mixtral:8x22b". It's a base model only capable of text completion, not any other tasks, which is why you got a terrible result when you gave it instructions.
Mixtral-8x22B-Instruct-v0.1 is an instruction-following model based on Mixtral-8x22B-v0.1. It was released two hours ago and it's what this post is about.
(The "last updated 44 minutes ago" refers to the entire "mixtral" collection.)
And where does it say that's the instruct model?
Yeah, this is exactly what happens when you ask a base model a question. It'll just attempt to continue what you already wrote based on its training set. If you, say, have it continue a story you've written, it may wrap up the story and then ask you to subscribe for part 2, followed by a bunch of social media comments with reviews.
Looks like an issue with the quantization that ollama (i.e. llama.cpp) uses and not the model itself. It's common knowledge from Mixtral 8x7B that quantizing the MoE gates is pernicious to model perplexity. And yet they continue to do it. :)
No, it's unrelated to quantization, they just weren't using the instruct model.
Not instruct tuned. You're (actually) "holding it wrong".
The `mixtral:8x22b` tag still points to the text completion model – instruct is on the way, sorry!
Update: mixtral:8x22b now points to the instruct model:
ollama pull mixtral:8x22b
ollama run mixtral:8x22b
Great to see such free-to-use and self-hostable models, but it's sad that "open" now means only that. One cannot replicate this model without access to the training data.
...And a massive pile of cash/compute hardware.
Not that massive; we're talking six figures. There was a blog post about this a while back on the front page of HN.
For fine-tuning, or for training from scratch?
from scratch: https://research.myshell.ai/jetmoe
That's for an 8B model.
This is over trivializing it, but there isn't much more inherent complexity in training an 8B or larger model other than more money, more compute, more data, more time. Overall, the principles are similar.
Assuming cost grows linearly with the number of parameters, that's 7.5 figures instead of 6 for the 8x22B model.
6 figures are a massive pile of cash.
There's a large amount of liability in disclosing your training data.
Calling the model "truly open" without it is not technically correct, though.
It's open enough for all practical purposes IMO.
It's not "open enough" to do an honest evaluation of these systems by constructing adversarial benchmarks.
As open as an executable binary that you are allowed to download and use for free.
It feels absolutely amazing to build an AI startup right now. It's as if your product automatically becomes cheaper, more reliable, and more scalable with each new major model release.
- We first struggled with limited context windows [solved]
- We had issues with consistent JSON output [solved]
- We had rate limiting and performance issues for the large 3rd party models [solved]
- Hosting our own OSS models for small and medium complex tasks was a pain [solved]
Obviously every startup still needs to build up defensibility and focus on differentiating with everything “non-AI”.
We are going to quickly reach the point where most of these AI startups (which do nothing but provide thin wrappers on top of public LLMs) aren't going to be needed at all. The differentiation will need to come from the value of the end product put in front of customers, not the AI backend.
Sure, in the same way SaaS companies are just thin wrappers on top of databases and the open web.
You will find that a disproportionately large amount of work and innovation in an AI product is in the backing model (GPT, Mixtral, etc.). While there's a huge amount of work in databases and the open web, SaaS products typically add a lot more than a thin API layer and a shiny website (well some do but you know what I mean)
I'd argue the comment before you is describing accessibility, features, and services -- yes, the core component has a wrapper, but that wrapper differentiates the use.
The same happened to image recognition. We have great algorithms for many years now. You can't make a company out of having the best image recognition algorithm, but you absolutely can make a company out of a device that spots defects in the paintjob in a car factory, or that spots concrete cracks in the tunnel segments used by a tunnel boring machine, or by building a wildlife camera that counts wildlife and exports that to a central website. All of them just fine-tune existing algorithms, but the value delivered is vastly different.
Or you can continue selling shovels. Still lots of expensive labeling services out there, to stay in the image-recognition parallel
The key thing is AI models are services not products. The real world changes, so you have to change your model. Same goes for new training data (examples, yes/no labels, feedback from production use), updating biases (compliance, changing societal mores). And running models in a highly-available way is also expertise. Not every company wants to be in the ML-ops business.
If you don't mind, I'm trying to experiment w/ local models more. Just now getting into messing w/ these but I'm struggling to come up w/ good use cases.
Would you happen to know of any cool OSS model projects that might be good inspiration for a side project?
Wondering what most people use these local models for
One idea that I've been mulling over: given how controllable Linux is from the command line, I think it would be fairly easy to set up a voice-to-text pipeline feeding a local LLM that could control pretty much everything on command.
It would flat-out embarrass Alexa. Imagine 'Hal, play a movie' or 'Hal, play some music', and it's all running locally, with your content.
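A rough sketch of how that glue could look, assuming the speech has already been transcribed (e.g. by whisper.cpp) and an Ollama server is running on its default port; the model name, prompt, and request are just placeholders:

```
import requests

# Hypothetical glue for the idea above: transcribed speech -> local LLM -> shell command.
request_text = "play some music from my library"
prompt = (
    "Translate the request into a single safe Linux shell command. "
    "Reply with the command only.\n"
    f"Request: {request_text}\nCommand:"
)
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mixtral:8x22b", "prompt": prompt, "stream": False},
    timeout=120,
).json()
command = resp["response"].strip()
print("Would run:", command)  # review before executing anything for real
# subprocess.run(command, shell=True)  # only once you trust the output
```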
No ideas about side projects or anything "productive" but for a concrete example look at SillyTavern. Making fictional characters. Finding narratives, stories, role-play for tabletop games. You can even have group chats of AI characters interacting. No good use cases for profit but plenty right now for exploration and fun.
It's as if your product automatically becomes cheaper, more reliable, and more scalable with each new major model release.
and so do your competitor's products.
We had issues with consistent JSON output [solved]
It says the JSON output is constrained via their platform (on la Plateforme).
Does that mean JSON output is only available in the hosted version? Are there any small models that can be self hosted that output valid JSON.
The progress is insane. A few days ago I started being very impressed with LLM coding skills. I wanted Golang code, instead of Python, which you can see in many demos. The prompt was:
Write a Golang func, which accepts the path to a .gpx file and outputs a JSON string with points (x=total distance in km, y=elevation). Don't use any library.
How are you approaching hosting? vLLM?
"64K tokens context window" I do wish they had managed to extend it to at least 128K to match the capabilities of GPT-4 Turbo
Maybe this limit will become a joke when looking back? Can you imagine reaching a trillion tokens context window in the future, as Sam speculated on Lex's podcast?
maybe we'll look back at token context windows like we look back at how much ram we have in a system.
I agree with this in the sense that once you have enough, you stop caring about the metric.
And how much RAM do you need to run Mixtral 8x22B? Probably not enough on a personal laptop.
I run it fine on my 64GB RAM beast.
At what quantization? 4-bit is 80GB. Less than 4-bit is rarely good enough at this point.
Is that normal ram of GPU ram?
Generally about ~1GB of RAM per billion parameters at 8-bit quantization. I've run a 30B model (Vicuna) on my 32GB laptop (but it was slow).
While there is a lot more HBM (or UMA if you're a Mac system) you need to run these LLM models, my overarching point is that at this point most systems don't have RAM constraints for most of the software you need to run and as a result, RAM becomes less of a selling point except in very specialized instances like graphic design or 3D rendering work.
If we have cheap billion token context windows, 99% of your use cases aren't going to hit anywhere close to that limit and as a result, your models will "just run"
I still don’t have enough RAM though ?
FWIW, the 128k context window for GPT-4 is only for input. I believe the output content is still only 4k.
How does that make any sense on a decoder-only architecture?
Wasn't there a paper yesterday that turned context evaluation linear (instead of quadratic) and made effectively unlimited context windows possible? Between that and 1.58b quantization I feel like we're overdue for an LLM revolution.
Pricing?
Found it: https://mistral.ai/technology/#pricing
It'd be useful to add a link to the blog post. While it's an open model, most will only be able to use it via the API.
It's open source, you can just download and run it for free on your own hardware.
"Who among us doesn't have 8 H100 cards?"
Four V100s will do. They're about $1k each on eBay.
$1500 each, plus the server they go in, plus plus plus plus.
Sure, but it's still a lot less than 8 h100s.
~$8k for an LLM server with 128GB of VRAM vs like $250k+ for 8 H100s.
Well, I don't have hardware to run a 141B parameters model, even if only 39B are active during inference.
It will be quantized in a matter of days and runnable on most laptops.
8-bit is 149GB. 4-bit is 80GB.
I wouldn’t call this runnable on most laptops.
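For a rough sense of where those numbers come from, here's a back-of-the-envelope estimate of weight memory alone (real quantized files come in a bit larger because of metadata and mixed-precision layers, plus you need room for the KV cache at runtime):

```
# Rough weight-only memory estimate for a 141B-parameter model
params = 141e9
for bits in (16, 8, 4):
    gb = params * bits / 8 / 1e9
    print(f"{bits}-bit: ~{gb:.0f} GB")
# 16-bit: ~282 GB, 8-bit: ~141 GB, 4-bit: ~71 GB
```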
That looks expensive compared to what groq was offering: https://wow.groq.com/
Can't wait for 8x22B to make it to Groq! Having an LLM at near GPT-4 performance with Groq speed would be incredible, especially for real-time voice chat.
I also assume groq is 10-15x faster
These LLMs are making RAM great again.
Wish I had invested in the extra 32GB for my mac laptop.
You can't upgrade it?
Edit: I haven't owned a laptop for years, probably could have surmised they'd be more user hostile nowadays.
Everything is soldered in these days.
It's complete garbage. And most of the other vendors just copy Apple so even things like Lenovo have the same problems.
The current state of laptops is such trash
Plenty of laptops still have SO-DIMM, such as EliteBook for example.
People need to vote with their wallet, and not buy stuff that goes against their principles.
There are so many variables though ... most of the time you have to compromise on a few things.
With SO-DIMM you gain expandability at the cost of higher power draw and latency as well as lower throughput.
SO-DIMM memory is inherently slower than soldered memory. Moreover, considering the fact that SO-DIMM has a maximum speed of 6,400MHz means that it won’t be able to handle the DDR6 standard, which is already in the works.
These days with Apple Silicon, RAM is a part of the SoC. It's not even soldered on, it's a part of the chip. Although TBF, they also offer insane memory bandwidths.
most of the other vendors just copy Apple
Weird conspiracy theories aside, the low power variant of RAM (LPDDR) has to be soldered onto the motherboard, so laptops designed for longer battery life have been using it for years now.
The good news is that a newer variant of low power RAM has just been standardized that features low power RAM in memory modules, although they attach with screws and not clips.
I really really like my Macbook Pro. But dammit, you can't upgrade the thing (Mac laptops aren't upgrade-able anymore). I got M1 Max in 2021 with 32GB of RAM. I did not anticipate needing more than 32GB for anything I'd be doing on it. Turns out, a couple of years later, I like to run local LLMs that max out my available memory.
I say 2021, but truth is the supply chain was so trash that year that it took almost a year to actually get delivered. I don't think I actually started using the thing until 2022.
You are getting downvoted because you vaguely suggested something negative about an Apple product, as is my comment below
mac laptop
I just find it hilarious how approximately 100% of models beat all other models on benchmarks.
What do you mean?
Virtually every announcement of a new model release has some sort of table or graph matching it up against a bunch of other models on various benchmarks, and they're always selected in such a way that the newly-released model dominates along several axes.
It turns interpreting the results into an exercise in detecting which models and benchmarks were omitted.
It would make sense, wouldn't it? Just as we've seen rising fuel efficiency, safety, dependability, etc. over the lifecycle of a particular car model.
The different teams are learning from each other and pushing boundaries; there's virtually no reason for any of the teams to release a model or product that is somehow inferior to a prior one (unless it had some secondary attribute such as requiring lower end hardware).
We're simply not seeing the ones that came up short; we don't even see the ones where it fell short of current benchmarks because they're not worth releasing to the public.
That's a valid theory, a priori, but if you actually follow up you'll find that the vast majority of these benchmark results don't end up matching anyone's subjective experience with the models. The churn at the top is not nearly as fast as the press releases make it out to be.
Subjective experience is not a benchmark that you can measure success against. Also, of course new models are better on some set of benchmarks. Why would someone bother releasing a "new" model that is inferior to old ones? (Aside from attributes like more preferable licensing).
This is completely normal, the opposite would be strange.
Sibling comment made a good point about benchmarks not being a great indicator of real-world quality. Every time something scores near GPT-4 on benchmarks, I try it out and it ends up being less reliable than GPT-3 within a few minutes of usage.
That's totally fine, but benchmarks are like standardized tests like the SAT. They measure something and it totally makes sense that each release bests the prior in the context of these benchmarks.
It may even be the case that in measuring against the benchmarks, these product teams sacrifice some real world performance (just as a student that only studies for the SAT might sacrifice some real world skills).
Benchmarks published by the company itself should be treated no differently than advertising. For actual signal check out more independent leaderboards and benchmarks (like HuggingFace, Chatbot Arena, MMLU, AlpacaEval). Of course, even then it is impossible to come up with an objective ranking since there is no consensus on what to even measure.
Benchmarks are often weird because of what a benchmark inherently needs to be.
If you compare LLMs by asking them to tell you how to catch dragonflies - the free text chat answer you get will be impossible to objectively evaluate.
Whereas if you propose four ways to catch dragonflies and ask each model to choose option A, B, C or D (or check the relative probability the model assigns to those four output logits) the result is easy to objectively evaluate - you just check if it chose the one right answer.
Hence a lot of the most famous benchmarks are multiple-choice questions - even though 99.9% of LLM usage doesn't involve answering multiple-choice questions.
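For the curious, here's a rough sketch of what "check the relative probability the model assigns to those output logits" can look like in practice; the model ID and question are placeholders, not taken from any real benchmark:

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

question = (
    "Q: What is the best tool for catching dragonflies?\n"
    "A) a net  B) a spoon  C) a hammer  D) a ladder\n"
    "Answer:"
)
inputs = tok(question, return_tensors="pt").to(model.device)
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

# Compare the logits of the four option letters (note the leading space).
scores = {
    opt: next_token_logits[tok.encode(" " + opt, add_special_tokens=False)[0]].item()
    for opt in ["A", "B", "C", "D"]
}
print(max(scores, key=scores.get))  # the option the model considers most likely
```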
gotta cherry pick your benchmarks as much as possible
Just because of the pace of innovation and scaling, right now, it seems pretty natural that any new model is going to be better than the previous comparable models.
I'm really excited about this model. Just need someone to quantize it to ~3 bits so it'll run on a 64GB MacBook Pro. I've gotten a lot of use from the 8x7b model. Paired with llamafile and it's just so good.
Can you explain your use case? I tried to get into offline LLMs on my machine and even Android, but without discrete graphics it's a slow hog, so I didn't enjoy it. But suppose I buy one, what then?
I run Mistral-7B on an old laptop. It's not very fast and it's not very good, but it's just good enough to be useful.
My use case is that I'm more productive working with a LLM but being online is a constant temptation and distraction.
Most of the time I'll reach for offline docs to verify. So the LLM just points me in the right direction.
I also miss Google offline, so I'm working on a search engine. I thought I could skip crawling by just downloading Common Crawl, but unfortunately it's enormous and mostly junk or unsuitable for my needs. So my next project is figuring out how to data-mine Common Crawl to extract just the interesting (to me) bits...
When I have a search engine and a LLM I'll be able to run my own Phind, which will be really cool.
Presumably you could run things like PageRank, I'm sure people do this sort of thing with CommonCrawl. There are lots of variants of graph connectivity scoring methods and classifiers. What a time to be alive eh?
Can you explain your use case?
Pretty sure you can run it uncensored... that would be my use case.
Yes, I have a side project that uses local whisper.cpp to transcribe a podcast I love and shows a nice UI to search and filter the contents. I use Mixtral 8x7b in chat interface via llamafile primarily to help me write python and sqlite code and as a general Q&A agent. I ask it all sorts of technical questions, learn about common tools, libraries, and idioms in an ecosystem I'm not familiar with, and then I can go to official documentation and dig in.
It has been a huge force multiplier for me and most importantly of all, it removes the dread of not knowing where to start and the dread of sending your inner monologue to someone's stupid cloud.
If you're curious: https://github.com/noman-land/transcript.fish/ though this doesn't include any Mixtral stuff because I don't use it programmatically (yet). I soon hope to use it to answer questions about the episodes like who the special guest is and whatnot, which is something I do manually right now.
Shopping for a new mbp. Do you think going with more ram would be wise?
Is this the best permissively licensed model out there?
So far it is Command R+. Let's see how this will fare on Chatbot Arena after a few weeks of use.
So far it is Command R+
Most people would not consider Command R+ to count as the "best permissively licensed model" since CC-BY-NC is not usually considered "permissively licensed" – the "NC" part means "non-commercial use only"
My bad, I remembered wrongly it was Apache too.
Today. Might change tomorrow at the pace this sector is at.
I'm considering switching my function calling requests from OpenAI's API to Mistral. Are they using similar formats? What's the easiest way to use Mistral? Is it by using Huggingface?
easiest is probably with ollama [0]. I think the ollama API is OpenAI compatible.
Ollama runs locally. What's the best option for calling the new Mixtral model on someone else's server programmatically?
Openrouter lists several options: https://openrouter.ai/models/mistralai/mixtral-8x22b
Most inference servers are OpenAI-compatibile. Even the "official" llama-cpp server should work fine: https://github.com/ggerganov/llama.cpp/blob/master/examples/...
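For example, the stock OpenAI Python client usually works against any of these by just swapping the base URL; the endpoint, model name, and API key below are placeholders for whichever server or provider you point it at:

```
from openai import OpenAI

# Point the standard OpenAI client at any OpenAI-compatible endpoint
# (llama.cpp's server, a hosted provider, etc.); values below are placeholders.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="mixtral-8x22b-instruct",  # whatever name the server exposes
    messages=[{"role": "user", "content": "Summarize what a mixture-of-experts model is."}],
)
print(resp.choices[0].message.content)
```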
What's the best way to run this on my Macbook Pro?
I've tried LMStudio, but I'm not a fan of the interface compared to OpenAI's. The lack of automatic regeneration every time I edit my input, like on ChatGPT, is quite frustrating. I also gave Ollama a shot, but using the CLI is less convenient.
Ideally, I'd like something that allows me to edit my settings quite granularly, similar to what I can do in OpenLM, with the QoL from the hosted online platforms, particularly the ease of editing my prompts that I use extensively.
Ollama with WebUI https://github.com/open-webui/open-webui
Not sure why your comment was downvoted. ^ is absolutely the right answer.
Open WebUI is functionally identical to the ChatGPT interface. You can even use it with the OpenAI APIs to have your own pay per use GPT 4. I did this.
openrouter.ai is a fantastic idea if you don't want to self host
You can try Msty as well. I am the author.
Dumb question: Are "non-instructed" versions of LLMs just raw, no-guardrail versions of the "instructed" versions that most end-users see? And why does Mixtral need one, when OpenAI LLMs do not?
LLMs are first trained to predict the next most likely word (or token, if you want to be accurate) from web crawls. These models are basically great at continuing unfinished text but can't really be used for instructions, e.g. Q&A or chatting - this is the "non-instructed" version. These models are then fine-tuned for instructions using additional data from human interaction - these are the "instructed" versions, which are what end users (e.g. of ChatGPT, Gemini, etc.) see.
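To make the distinction concrete, here's a minimal sketch assuming the Hugging Face tokenizer for the instruct model: the instruct variant expects its fine-tuning chat format, which you can apply with the tokenizer, while a base model only ever sees raw text and simply continues it.

```
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x22B-Instruct-v0.1")

messages = [{"role": "user", "content": "Explain how VAT works in one paragraph."}]

# The instruct model is trained on prompts wrapped in its chat template...
chat_prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(chat_prompt)  # e.g. "<s>[INST] Explain how VAT works in one paragraph. [/INST]"

# ...whereas the base model will just continue whatever raw text you give it,
# which is why asking it a question can produce essay-mill spam instead of an answer.
```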
Very helpful, thank you.
I appreciate the correction, thanks!
I'm confused on the instruction fine-tuning part that is mentioned briefly, in passing. Is there an open weight instruct variant they've released? Or is that only on their platform? Edit: It's on HuggingFace, great, thanks replies!
I just found this on HuggingFace: https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1
How much VRAM is needed to run this?
80GB in 4-bit.
But because it only activates two of its eight experts per token, it can run on a fast CPU in reasonable time. So 96GB of DDR4 will do; 96GB of DDR5 is better.
WizardLM-2 8x22B (which was a fine-tune of the Mixtral 8x22B base model) at 4-bit was only 80GB.
The development never stops. In a few years we will look back and see how limited the previous models were compared to what we have then. We couldn't run LLaMA 70B on a MacBook Air, and now we can.
Yes, it's pretty cool. There was a neat comparison from the development of deep learning that I think resonates quite well here.
Around 5 years ago, it took an average user some pretty significant hardware, software and time (around a full night) to create a short deepfake. Now you don't need any fancy hardware, and you can get decent results within 5 minutes on an average computer.
That part isn’t very good.
https://www.nytimes.com/2024/04/08/technology/deepfake-ai-nu...
Isn't equating active parameters with cost a little unfair since you still need full memory for all the inactive parameters?
Well, since it affects inference speed it means you can handle more in less time, needing less concurrency.
Fewer parameters at inference time makes a massive difference in cost for batch jobs, assuming vram usage is the same
So this one is 3x the size but only 7% better on MMLU? Given Moore's law is mostly dead, this trend is going to make for even more extremely expensive compute for next-gen AI models.
That's 25% fewer errors.
It wasn't clear: how much hardware does it take to run Mixtral 8x22B (mistral.ai) locally?
A MacBook with 64GB of RAM.
Seems that Perplexity Labs already offers a free demo of it.
That's the old/regular model. This post is about the new "instruct" model.
How does this compare to ChatGPT4?
Is this different from their "large" model?
I can't even begin to describe how excited I am for the future of AI.
Did anyone have success getting danswer and ollama to work together?
Good to continue to see a permissive license here.
Is 8x22B gonna make it to Le Chat in the near future?
Curious to see how it performs against GPT-4.
Mixtral 8x22B beats Command R+, which is at GPT-4 level on LMSYS's leaderboard.
We rolled out Mixtral 8x22B to our LLM Litmus Test at s0.dev for Cody AI. Don't have enough data to say it's better or worse than other LLMs yet, but if you want to try it out for coding purposes, let me know your experience.
I have been using Mixtral daily since it was released for all kinds of writing and coding tasks. Love it, and I'm massively invested in Mistral's mission.
Keep on doing this great work.
Edit: been using the previous version, seems like this one is even better?
We need larger context windows, otherwise we’re running the same path with marginally different results.
This is a bit of a misnomer. Each expert is a sub-network that specializes in something we can't readily characterize or track.
During training, a routing network is penalized if it does not distribute training tokens evenly across the experts. This prevents any one or two networks from becoming the primary ones.
The result is that each token has essentially even probability of being routed to one of the sub-models, with the underlying logic of why that model is an "expert" for that token being beyond our understanding or description.
Why do we expect this to perform better? Couldn’t a regular network converge on this structure anyways?
Here's my naive intuition: in general bigger models can store more knowledge but take longer to do inference. MoE provides a way to blend the advantages of having a bigger model (more storage) with the advantages of having smaller models at inference time (faster, less memory required). When you do inference, tokens hit a small layer that is load balancing the experts then activate 1 or 2 experts. So you're storing roughly 8 x 22B "worth" of knowledge without having to run a model that big.
Maybe a real expert can confirm if this is correct :)
Almost :) The model chooses experts in every block. For a typical 7B-scale model with 32 blocks and 8 experts, there are 8^32 = 2^96 possible paths through the whole model.
Sounds like the "you only use 10% of your brain" myth, but actually real this time.
Not quite, you don't save memory, only compute.
It doesn't perform better, and until recently MoE models actually underperformed their dense counterparts. The real gain is sparsity. You have this huge X-parameter model that performs roughly like an X-parameter dense model, but you don't have to use all those parameters at once every time, so you save a lot on compute, both in training and inference.
It is a type of ensemble model. A regular network could do it, but a MoE will select a subset to do the task faster than the whole model would.
I heard MoE reduces inference costs. Is that true? Don't all the sub networks need to be kept in RAM the whole time? Or is the idea that it only needs to run compute on a small part of the total network, so it runs faster? (So you complete more requests per minute on same hardware.)
Edit: Apparently each part of the network is on a separate device. Fascinating! That would also explain why the routing network is trained to choose equally between experts.
I imagine that may reduce quality somewhat though? By forcing it to distribute problems equally across all of them, whereas in reality you'd expect task type to conform to the pareto distribution.
It should increase quality since those layers can specialize on subsets of the training data. This means that getting better in one domain won't make the model worse in all the others anymore.
We can't really tell what the router does. There have been experiments where the router in the early blocks was compromised, and quality only suffered moderately. In later layers, as the embeddings pick up more semantic information, it matters more and might approach our naive understanding of the term "expert".
Computational costs, yes. You still take the same amount of time for processing the prompt, but each token created through inference costs less computationally than if you were running it through _all_ layers.
The latter. Yes, it all needs to stay in memory.
Has anyone tried MoE at smaller scales? e.g. a 7B model that's made of a bunch of smaller ones? I guess that would be 8x1B.
Or would that make each expert too small to be useful? TinyLlama is 1B and it's almost useful! I guess 8x1B would be Mixture of TinyLLaMAs...
Yes there are many fine tunes on huggingface. Search "8x1B huggingface"
The previous mixtral is 8x7B
Would it be analogous to say instead of having a single Von Neumann who is a polymath, we’re posing the question to a pool of people who are good at their own thing, and one of them gets picked to answer?
Not really. The “expert” term is a misnomer; it would be better put as “brain region”.
Human brains seem to do something similar, inasmuch as blood flow (and hence energy use) per region varies depending on the current problem.
Any idea why everyone seems to be using 8 experts? (Or was GPT-4 using 16?) Did we just try different numbers and found 8 was the optimum?
Probably because 8 GPUs is a common setup, and with 8 experts you can put each expert on a different GPU
A decent loose analogy might be database sharding.
Basically you're sharding the neural network by "something" that is itself tuned during the learning process.
Ignore the "experts" part, it misleads a lot of people [0]. There is no explicit specialization in the most popular setups, it is achieved implicitly through training. In short: MoEs add multiple MLP sublayers and a routing mechanism after each attention sublayer and let the training procedure learn the MLP parameters and the routing parameters.
In a longer, but still rough, form...
How these transformers work is roughly:
``` x_{l+1} = mlp_l(attention_l(x_l)) ```
where `x_l` is the hidden representation at layer l, `attention_l` is the attention sublayer at layer l, and `mlp_l` is the multilayer perceptron at sublayer l.
This MLP layer is very expensive because it is fully connected (i.e. every input has a weight to every output). So instead of creating an even bigger, more expensive MLP to get more capability, MoEs create K MLP sublayers (the "experts") and a router that decides which MLP sublayers to use. This router spits out an importance score for each MLP "expert", and then you choose the top T MLPs and take an average weighted by importance, so roughly:
``` x_{l+1} = \sum_e mlp_{l,e}(attention_l(x_l)) * importance_score_{l, e} ```
where `importance_score_{l, e}` is the score computed by the router at layer l for "expert" e; that is, roughly `importance_score_l = softmax(router_l(attention_l(x_l)))`. Note that here we are summing over all experts, but in reality we choose the top T, often 2, and use only those.
[0] some architectures do, in fact, combine domain experts to make a greater whole, but not the currently popular flavor
So it is somewhat like a classic random forest or maybe bagging, where you're trying to stop overfitting, but you're also trying to train that top layer to know who could be the "experts" given the current inputs so that you're minimising the number of multiple MLP sublayers called during inference?
It's really a kind of enforced sparsity, in that it requires that only a limited amount of blocks be active at a time during inference. What blocks will be active for each token is decided by the network itself as part of training.
(Notably, MoE should not be conflated with ensemble techniques, which is where you would train entire separate networks, then use heuristic techniques to run inference across all of them simultaneously and combine the results.)
Not quite a layman's explanation, but if you're familiar with the implementation(s) of vanilla decoder only transformers, mixture-of-experts is just a small extension.
During inference, instead of a single MLP in each transformer layer, MoEs have `n` MLPs and a single layer "gate" in each transformer layer. In the forward pass, softmax of the gate's output is used to pick the top `k` (where k is < n) MLPs to use. The relevant code snippet in the HF transformers implementation is very readable IMO, and only about 40 lines.
https://github.com/huggingface/transformers/blob/main/src/tr...
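For anyone who'd rather see it than read it, here's a toy sketch of that gate-plus-top-k mechanism; the dimensions, activation, and per-token loop are made up for readability and are not Mixtral's actual configuration:

```
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Sparse MoE block: a linear 'gate' routes each token to its top-k expert MLPs."""

    def __init__(self, dim=16, hidden=32, n_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts, bias=False)   # the router
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
             for _ in range(n_experts)]
        )
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, dim)
        weights, idx = self.gate(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                 # naive per-token loop for clarity
            for w, e in zip(weights[t], idx[t]):
                out[t] += w * self.experts[int(e)](x[t])
        return out

print(ToyMoELayer()(torch.randn(4, 16)).shape)     # torch.Size([4, 16])
```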
It’s not “experts” in the typical sense of the word. There is no discrete training to learn a particular skill in one expert. It’s more closely modeled as a bunch of smaller models grafted together.
These models are actually a collection of weights for different parts of the system. It’s not “one” neural network. Transformers are composed of layers of transformations to the input, and each step can have its own set of weights. There was a recent video on the front page that had a good introduction to this. There is the MLP, there are the attention heads, etc.
With that in mind, a MoE model is basically where one of those layers has X different versions of the weights, and then an added layer (another neural network with its own weights) that picks the version of “expert” weights to use.
Nobody decides. The network itself determines which expert(s) to activate based on the context. It uses a small neural network for the task.
It typically won't behave like human experts - you might find one of the networks is an expert in determining where to place capital letters or full stops for example.
MoEs do not really improve accuracy - instead they reduce the amount of compute required. And, assuming you have a fixed compute budget, that in turn might mean you can make the model bigger to get better accuracy.
There is some good documentation around mergekit available that actually explains a lot and might be a good place to start.
As always, code is the best documentation: https://github.com/ggerganov/llama.cpp/blob/8dd1ec8b3ffbfa2d...
maybe there's one that is maitre d'llm?
Correct, the experts are determined by the algorithm, not by anything humans would understand.