
Mixtral 8x22B

jjice
30 replies
3h23m

Does anyone have a good layman's explanation of the "Mixture-of-Experts" concept? I think I understand the idea of having "sub-experts", but how do you decide what each specialization is during training? Or is that not how it works at all?

hlfshell
19 replies
3h17m

This is a bit of a misnomer. Each expert is a sub-network that specializes in a kind of sub-understanding we can't possibly track.

During training, a routing network is penalized if it does not distribute training tokens evenly across the experts. This prevents any one or two experts from becoming the primary networks.

The result of this is that each token has essentially even probability of being routed to one of the sub models, with the underlying logic of why that model is an expert for that token being beyond our understanding or description.

fire_lake
6 replies
3h13m

Why do we expect this to perform better? Couldn’t a regular network converge on this structure anyways?

rgbrgb
3 replies
3h1m

Here's my naive intuition: in general, bigger models can store more knowledge but take longer to do inference. MoE provides a way to blend the advantages of having a bigger model (more storage) with the advantages of having smaller models at inference time (faster, less memory required). When you do inference, tokens hit a small routing layer that load-balances the experts and then activates 1 or 2 of them. So you're storing roughly 8 x 22B "worth" of knowledge without having to run a model that big.

Maybe a real expert can confirm if this is correct :)

samus
0 replies
1h9m

Almost :) the model chooses experts in every block. For a typical 7B-per-expert model with 8 experts and 32 blocks, there are 8^32 = 2^96 possible paths through the whole model.

nialv7
0 replies
1h42m

Sounds like the "you only use 10% of your brain" myth, but actually real this time.

cjbprime
0 replies
1h58m

Not quite, you don't save memory, only compute.

og_kalu
0 replies
2h56m

It doesn't perform better, and until recently MoE models actually underperformed their dense counterparts. The real gain is sparsity. You have this huge x-parameter model that performs like an x-parameter model, but you don't have to use all those parameters at once every time, so you save a lot on compute, both in training and inference.

imjonse
0 replies
3h4m

It is a type of ensemble model. A regular network could do it, but a MoE will select a subset to do the task faster than the whole model would.

andai
3 replies
3h1m

I heard MoE reduces inference costs. Is that true? Don't all the sub networks need to be kept in RAM the whole time? Or is the idea that it only needs to run compute on a small part of the total network, so it runs faster? (So you complete more requests per minute on same hardware.)

Edit: Apparently each part of the network is on a separate device. Fascinating! That would also explain why the routing network is trained to choose equally between experts.

I imagine that may reduce quality somewhat though? By forcing it to distribute problems equally across all of them, whereas in reality you'd expect task type to follow a Pareto distribution.

samus
0 replies
1h15m

It should increase quality since those layers can specialize on subsets of the training data. This means that getting better in one domain won't make the model worse in all the others anymore.

We can't really tell what the router does. There have been experiments where the router in the early blocks was compromised, and quality only suffered moderately. In later layers, as the embeddings pick up more semantic information, it matters more and might approach our naive understanding of the term "expert".

MPSimmons
0 replies
1h15m

> I heard MoE reduces inference costs

Computational costs, yes. You still take the same amount of time for processing the prompt, but each token created through inference costs less computationally than if you were running it through _all_ the experts.

Filligree
0 replies
2h53m

The latter. Yes, it all needs to stay in memory.

andai
2 replies
2h53m

Has anyone tried MoE at smaller scales? e.g. a 7B model that's made of a bunch of smaller ones? I guess that would be 8x1B.

Or would that make each expert too small to be useful? TinyLlama is 1B and it's almost useful! I guess 8x1B would be Mixture of TinyLLaMAs...

jasonjmcghee
0 replies
2h30m

Yes there are many fine tunes on huggingface. Search "8x1B huggingface"

auspiv
0 replies
2h52m

The previous mixtral is 8x7B

wenc
1 replies
3h0m

Would it be analogous to say instead of having a single Von Neumann who is a polymath, we’re posing the question to a pool of people who are good at their own thing, and one of them gets picked to answer?

Filligree
0 replies
2h51m

Not really. The “expert” term is a misnomer; it would be better put as “brain region”.

Human brains seem to do something similar, inasmuch as blood flow (and hence energy use) per region varies depending on the current problem.

andai
1 replies
2h59m

Any idea why everyone seems to be using 8 experts? (Or was GPT-4 using 16?) Did we just try different numbers and found 8 was the optimum?

wongarsu
0 replies
2h55m

Probably because 8 GPUs is a common setup, and with 8 experts you can put each expert on a different GPU

api
0 replies
3h14m

A decent loose analogy might be database sharding.

Basically you're sharding the neural network by "something" that is itself tuned during the learning process.

huevosabio
1 replies
2h59m

Ignore the "experts" part, it misleads a lot of people [0]. There is no explicit specialization in the most popular setups, it is achieved implicitly through training. In short: MoEs add multiple MLP sublayers and a routing mechanism after each attention sublayer and let the training procedure learn the MLP parameters and the routing parameters.

In a longer, but still rough, form...

How these transformers work is roughly:

``` x_{l+1} = mlp_l(attention_l(x_l)) ```

where `x_l` is the hidden representation at layer l, `attention_l` is the attention sublayer at layer l, and `mlp_l` is the multilayer perceptron at layer l.

This MLP layer is very expensive because it is fully connected (i.e. every input has a weight to every output). So, instead of creating an even bigger, more expensive MLP to get more capability, MoEs create K MLP sublayers (the "experts") and a router that decides which MLP sublayers to use. This router spits out an importance score for each MLP "expert"; you then choose the top T MLPs and take an average weighted by importance, so roughly:

``` x_{l+1} = \sum_e mlp_{l,e}(attention_l(x_l)) * importance_score_{l, e} ```

where `importance_score_{l, e}` is the score computed by the router at layer l for "expert" e, i.e. `importance_score_{l} = router_l(attention_l(x_l))` (typically followed by a softmax). Note that here we are adding all experts, but in reality we choose the top T, often 2, and use only those.

[0] some architectures do, in fact, combine domain experts to make a greater whole, but not the currently popular flavor
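
To make that concrete, here is a minimal numpy sketch of one MoE sublayer with top-2 routing. The names (`router_w`, `experts`) and the plain ReLU MLPs are invented for illustration; real implementations add gated activations, batching, and the load-balancing loss mentioned earlier in the thread:

```
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_block(h, router_w, experts, top_t=2):
    # h: hidden state for one token (the output of the attention sublayer)
    scores = softmax(router_w @ h)                    # one importance score per expert
    chosen = np.argsort(scores)[-top_t:]              # keep only the top T experts
    weights = scores[chosen] / scores[chosen].sum()   # renormalize over the chosen ones
    out = np.zeros_like(h)
    for w, e in zip(weights, chosen):
        w1, w2 = experts[e]                           # each "expert" is just an MLP
        out += w * (w2 @ np.maximum(w1 @ h, 0.0))     # ReLU MLP, weighted by the router score
    return out

# Toy sizes: 8 experts, hidden dim 16, MLP dim 64
rng = np.random.default_rng(0)
d, d_ff, n_experts = 16, 64, 8
experts = [(rng.normal(size=(d_ff, d)), rng.normal(size=(d, d_ff))) for _ in range(n_experts)]
router_w = rng.normal(size=(n_experts, d))
h = rng.normal(size=d)
print(moe_block(h, router_w, experts).shape)          # (16,)
```

The same routing is repeated independently in every block, which is where the 8^32 path count above comes from.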

Quarrel
0 replies
57m

So it is somewhat like a classic random forest or maybe bagging, where you're trying to stop overfitting, but you're also trying to train that top layer to know which could be the "experts" given the current inputs, so that you're minimising the number of MLP sublayers called during inference?

zozbot234
0 replies
2h59m

It's really a kind of enforced sparsity, in that it requires that only a limited amount of blocks be active at a time during inference. What blocks will be active for each token is decided by the network itself as part of training.

(Notably, MoE should not be conflated with ensemble techniques, which is where you would train entire separate networks, then use heuristic techniques to run inference across all of them simultaneously and combine the results.)

woadwarrior01
0 replies
2h48m

Not quite a layman's explanation, but if you're familiar with the implementation(s) of vanilla decoder only transformers, mixture-of-experts is just a small extension.

During inference, instead of a single MLP in each transformer layer, MoEs have `n` MLPs and a single layer "gate" in each transformer layer. In the forward pass, softmax of the gate's output is used to pick the top `k` (where k is < n) MLPs to use. The relevant code snippet in the HF transformers implementation is very readable IMO, and only about 40 lines.

https://github.com/huggingface/transformers/blob/main/src/tr...
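
A rough per-token sketch of that gate + top-k step (this is not the HF code itself; the tensor names are invented):

```
import torch

def route(hidden, gate_weight, k=2):
    # hidden: (tokens, dim), gate_weight: (n_experts, dim)
    logits = hidden @ gate_weight.T                   # (tokens, n_experts)
    probs = torch.softmax(logits, dim=-1)             # softmax of the gate's output
    topk_probs, topk_idx = torch.topk(probs, k, dim=-1)
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)  # renormalize over the chosen k
    return topk_idx, topk_probs                       # which MLPs to run, and their mixing weights

idx, w = route(torch.randn(4, 16), torch.randn(8, 16))  # 4 tokens, 8 experts, toy dim 16
print(idx.shape, w.shape)  # torch.Size([4, 2]) torch.Size([4, 2])
```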

vineyardmike
0 replies
2h28m

It’s not “experts” in the typical sense of the word. There is no discrete training to learn a particular skill in one expert. It’s more closely modeled as a bunch of smaller models grafted together.

These models are actually a collection of weights for different parts of the system. It’s not “one” neural network. Transformers are composed of layers of transformations to the input, and each step can have its own set of weights. There was a recent video on the front page that had a good introduction to this. There is the MLP, there are the attention heads, etc.

With that in mind, a MoE model is basically where one of those layers has X different versions of the weights, and then an added layer (another neural network with its own weights) that picks the version of “expert” weights to use.

londons_explore
0 replies
3h19m

Nobody decides. The network itself determines which expert(s) to activate based on the context. It uses a small neural network for the task.

It typically won't behave like human experts - you might find one of the networks is an expert in determining where to place capital letters or full stops for example.

MoEs do not really improve accuracy - instead they reduce the amount of compute required. And, assuming you have a fixed compute budget, that in turn might mean you can make the model bigger to get better accuracy.

jsemrau
0 replies
3h4m

There is some good documentation around mergekit available that actually explains a lot and might be a good place to start.

Keyframe
0 replies
3h20m

maybe there's one that is maitre d'llm?

HeatrayEnjoyer
0 replies
3h18m

Correct, the experts are determined by the algorithm, not by anything humans would understand.

orost
9 replies
2h38m

That's not the model this post is about. You used the base model, not trained for tasks. (The instruct model is probably not on ollama yet.)

byteknight
7 replies
2h33m

I absolutely did not:

  ollama run mixtral:8x22b

EDIT: I like how you ninja-edited your comment ;)

orost
6 replies
2h31m

Considering "mixtral:8x22b" on ollama was last updated yesterday, and Mixtral-8x22B-Instruct-v0.1 (the topic of this post) was released about 2 hours ago, they are not the same model.

belter
2 replies
2h24m

I get:

  ollama run mixtral:8x22b
  Error: exception create_tensor: tensor 'blk.0.ffn_gate.0.weight' not found

Me1000
1 replies
2h20m

You need to update ollama to 0.1.32.

belter
0 replies
2h9m

Thanks. That did it.

orost
0 replies
2h25m

Let me clarify.

Mixtral-8x22B-v0.1 was released a couple days ago. The "mixtral:8x22b" tag on ollama currently refers to it, so it's what you got when you did "ollama run mixtral:8x22b". It's a base model only capable of text completion, not any other tasks, which is why you got a terrible result when you gave it instructions.

Mixtral-8x22B-Instruct-v0.1 is an instruction-following model based on Mixtral-8x22B-v0.1. It was released two hours ago and it's what this post is about.

(The "last updated 44 minutes ago" refers to the entire "mixtral" collection.)

gliptic
0 replies
2h24m

And where does it say that's the instruct model?

mysteria
0 replies
2h24m

Yeah, this is exactly what happens when you ask a base model a question. It'll just attempt to continue what you already wrote based on its training set, so if you, say, have it continue a story you've written, it may wrap up the story and then ask you to subscribe for part 2, followed by a bunch of social media comments with reviews.

woadwarrior01
1 replies
2h35m

Looks like an issue with the quantization that ollama (i.e llama.cpp) uses and not the model itself. It's common knowledge from Mixtral 8x7B that quantizing the MoE gates is pernicious to model perplexity. And yet they continue to do it. :)

cjbprime
0 replies
2h0m

No, it's unrelated to quantization, they just weren't using the instruct model.

renewiltord
0 replies
2h22m

Not instruct tuned. You're (actually) "holding it wrong".

jmorgan
0 replies
2h29m

The `mixtral:8x22b` tag still points to the text completion model – instruct is on the way, sorry!

Update: mixtral:8x22b now points to the instruct model:

  ollama pull mixtral:8x22b
  ollama run mixtral:8x22b

imjonse
13 replies
3h34m

Great to see such free-to-use and self-hostable models, but it's sad that "open" now means only that. One cannot replicate this model without access to the training data.

generalizations
7 replies
3h30m

...And a massive pile of cash/compute hardware.

kiney
6 replies
3h26m

Not that massive, we're talking six figures. There was a blog post about this a while back on the front page of HN.

htrp
4 replies
3h24m

for finetuning or parameter training from scratch?

kaibee
2 replies
2h49m

That's for an 8B model.

cptcobalt
1 replies
2h14m

This is oversimplifying it, but there isn't much more inherent complexity in training an 8B or larger model other than more money, more compute, more data, more time. Overall, the principles are similar.

lostmsu
0 replies
1h16m

Assuming cost grows linearly with the number of parameters, that's 7.5 figures instead of 6 for the 8x22B model.

moffkalast
0 replies
1h13m

Six figures is a massive pile of cash.

ru552
4 replies
3h28m

There's a large amount of liability in disclosing your training data.

imjonse
3 replies
3h18m

Calling the model 'truly open' without it is not technically correct though.

Lacerda69
2 replies
3h11m

It's open enough for all practical purposes IMO.

nicklecompte
0 replies
2h16m

It's not "open enough" to do an honest evaluation of these systems by constructing adversarial benchmarks.

imjonse
0 replies
2h14m

As open as an executable binary that you are allowed to download and use for free.

hubraumhugo
13 replies
3h18m

It feels absolutely amazing to build an AI startup right now. It's as if your product automatically becomes cheaper, more reliable, and more scalable with each new major model release.

- We first struggled with limited context windows [solved]

- We had issues with consistent JSON output [solved]

- We had rate limiting and performance issues for the large 3rd party models [solved]

- Hosting our own OSS models for small and medium complex tasks was a pain [solved]

Obviously every startup still needs to build up defensibility and focus on differentiating with everything “non-AI”.

paxys
5 replies
3h14m

We are going to quickly reach the point where most of these AI startups (which do nothing but provide thin wrappers on top of public LLMs) aren't going to be needed at all. The differentiation will need to come from the value of the end product put in front of customers, not the AI backend.

layble
2 replies
3h5m

Sure, in the same way SaaS companies are just thin wrappers on top of databases and the open web.

imjonse
1 replies
3h0m

You will find that a disproportionately large amount of work and innovation in an AI product is in the backing model (GPT, Mixtral, etc.). While there's a huge amount of work in databases and the open web, SaaS products typically add a lot more than a thin API layer and a shiny website (well some do but you know what I mean)

tomrod
0 replies
2h48m

I'd argue the comment before you is describing accessibility, features, and services -- yes, the core component has a wrapper, but that wrapper differentiates the use.

wongarsu
1 replies
2h46m

The same happened to image recognition. We have great algorithms for many years now. You can't make a company out of having the best image recognition algorithm, but you absolutely can make a company out of a device that spots defects in the paintjob in a car factory, or that spots concrete cracks in the tunnel segments used by a tunnel boring machine, or by building a wildlife camera that counts wildlife and exports that to a central website. All of them just fine-tune existing algorithms, but the value delivered is vastly different.

Or you can continue selling shovels. Still lots of expensive labeling services out there, to stay in the image-recognition parallel

pradn
0 replies
1h7m

The key thing is AI models are services not products. The real world changes, so you have to change your model. Same goes for new training data (examples, yes/no labels, feedback from production use), updating biases (compliance, changing societal mores). And running models in a highly-available way is also expertise. Not every company wants to be in the ML-ops business.

sleepingreset
2 replies
3h13m

If you don't mind, I'm trying to experiment w/ local models more. Just now getting into messing w/ these but I'm struggling to come up w/ good use cases.

Would you happen to know of any cool OSS model projects that might be good inspiration for a side project?

Wondering what most people use these local models for

wing-_-nuts
0 replies
2h18m

One idea that I've been mulling over: given how controllable Linux is from the command line, I think it would be somewhat easy to set up a voice-to-text pipeline into a local LLM that could control pretty much everything on command.

It would flat out embarrass Alexa. Imagine 'Hal, play a movie', or 'Hal, play some music' and it's all running locally, with your content.

sosuke
0 replies
2h35m

No ideas about side projects or anything "productive" but for a concrete example look at SillyTavern. Making fictional characters. Finding narratives, stories, role-play for tabletop games. You can even have group chats of AI characters interacting. No good use cases for profit but plenty right now for exploration and fun.

yodsanklai
0 replies
40m

> It's as if your product automatically becomes cheaper, more reliable, and more scalable with each new major model release.

and so do your competitors' products.

neillyons
0 replies
2h23m

> We had issues with consistent JSON output [solved]

It says the JSON output is constrained via their platform (on la Plateforme).

Does that mean JSON output is only available in the hosted version? Are there any small models that can be self-hosted that output valid JSON?

milansuk
0 replies
2h32m

The progress is insane. A few days ago I started being very impressed with LLM coding skills. I wanted Golang code, instead of Python, which you can see in many demos. The prompt was:

Write a Golang func, which accepts the path to a .gpx file and outputs a JSON string with points (x=total distance in km, y=elevation). Don't use any library.

jasonjmcghee
0 replies
2h28m

How are you approaching hosting? vLLM?

mdrzn
12 replies
3h27m

"64K tokens context window" I do wish they had managed to extend it to at least 128K to match the capabilities of GPT-4 Turbo

Maybe this limit will become a joke when looking back? Can you imagine reaching a trillion tokens context window in the future, as Sam speculated on Lex's podcast?

htrp
8 replies
3h24m

maybe we'll look back at token context windows like we look back at how much ram we have in a system.

frabjoused
5 replies
3h22m

I agree with this in the sense that once you have enough, you stop caring about the metric.

paradite
4 replies
3h19m

And how much RAM do you need to run Mixtral 8x22B? Probably more than a personal laptop has.

Lacerda69
2 replies
3h14m

I run it fine on my 64gb RAM beast.

coder543
0 replies
3h10m

At what quantization? 4-bit is 80GB. Less than 4-bit is rarely good enough at this point.

apexalpha
0 replies
3h8m

Is that normal RAM or GPU RAM?

user_7832
0 replies
3h13m

Generally about ~1GB of RAM per billion parameters. I've run a 30B model (Vicuna) on my 32GB laptop (but it was slow).

htrp
0 replies
2h38m

While there is a lot more HBM (or UMA if you're on a Mac system) needed to run these LLM models, my overarching point is that at this point most systems don't have RAM constraints for most of the software you need to run, and as a result RAM becomes less of a selling point except in very specialized instances like graphic design or 3D rendering work.

If we have cheap billion-token context windows, 99% of your use cases aren't going to hit anywhere close to that limit, and as a result your models will "just run".

bamboozled
0 replies
3h6m

I still don’t have enough RAM though?

pseudosavant
1 replies
2h2m

FWIW, the 128k context window for GPT-4 is only for input. I believe the output content is still only 4k.

moffkalast
0 replies
1h12m

How does that make any sense on a decoder-only architecture?

creshal
0 replies
2h56m

Wasn't there a paper yesterday that turned context evaluation linear (instead of quadratic) and made effectively unlimited context windows possible? Between that and 1.58-bit quantization I feel like we're overdue for an LLM revolution.

tinyhouse
11 replies
3h28m

Pricing?

Found it: https://mistral.ai/technology/#pricing

It'd be useful to add a link to the blog post. While it's an open model, most will only be able to use it via the API.

MacsHeadroom
7 replies
3h21m

It's open source, you can just download and run it for free on your own hardware.

astrodust
3 replies
3h18m

"Who among us doesn't have 8 H100 cards?"

MacsHeadroom
2 replies
2h58m

Four V100s will do. They're about $1k each on ebay.

astrodust
1 replies
2h56m

$1500 each, plus the server they go in, plus plus plus plus.

MacsHeadroom
0 replies
1h34m

Sure, but it's still a lot less than 8 H100s.

~$8k for an LLM server with 128GB of VRAM vs like $250k+ for 8 H100s.

tinyhouse
2 replies
3h20m

Well, I don't have the hardware to run a 141B-parameter model, even if only 39B are active during inference.

navbaker
1 replies
2h58m

It will be quantized in a matter of days and runnable on most laptops.

azinman2
0 replies
2h11m

8-bit is 149GB. 4-bit is 80GB.

I wouldn’t call this runnable on most laptops.

theolivenbaum
2 replies
3h21m

That looks expensive compared to what groq was offering: https://wow.groq.com/

pants2
0 replies
2h37m

Can't wait for 8x22B to make it to Groq! Having an LLM at near GPT-4 performance with Groq speed would be incredible, especially for real-time voice chat.

naiv
0 replies
3h15m

I also assume groq is 10-15x faster

jonnycomputer
11 replies
3h23m

These LLMs are making RAM great again.

Wish I had invested in the extra 32GB for my mac laptop.

Workaccount2
10 replies
3h15m

You can't upgrade it?

Edit: I haven't owned a laptop for years, probably could have surmised they'd be more user hostile nowadays.

kristopolous
5 replies
3h9m

Everything is soldered in these days.

It's complete garbage. And most of the other vendors just copy Apple so even things like Lenovo have the same problems.

The current state of laptops is such trash

sva_
2 replies
2h54m

Plenty of laptops still have SO-DIMM, such as EliteBook for example.

People need to vote with their wallet, and not buy stuff that goes against their principles.

popf1
0 replies
2h11m

There are so many variables though ... most of the time you have to compromise on a few things.

GeekyBear
0 replies
2h1m

With SO-DIMM you gain expandability at the cost of higher power draw and latency as well as lower throughput.

SO-DIMM memory is inherently slower than soldered memory. Moreover, considering the fact that SO-DIMM has a maximum speed of 6,400MHz means that it won’t be able to handle the DDR6 standard, which is already in the works.

https://fossbytes.com/camm2-ram-standard/

woadwarrior01
0 replies
2h39m

These days with Apple Silicon, RAM is a part of the SoC. It's not even soldered on, it's a part of the chip. Although TBF, they also offer insane memory bandwidths.

GeekyBear
0 replies
2h35m

> most of the other vendors just copy Apple

Weird conspiracy theories aside, the low power variant of RAM (LPDDR) has to be soldered onto the motherboard, so laptops designed for longer battery life have been using it for years now.

The good news is that a newer variant of low-power RAM (CAMM2) has just been standardized that puts low-power RAM on removable memory modules, although they attach with screws and not clips.

https://fossbytes.com/camm2-ram-standard/

jonnycomputer
1 replies
3h7m

I really really like my Macbook Pro. But dammit, you can't upgrade the thing (Mac laptops aren't upgrade-able anymore). I got M1 Max in 2021 with 32GB of RAM. I did not anticipate needing more than 32GB for anything I'd be doing on it. Turns out, a couple of years later, I like to run local LLMs that max out my available memory.

jonnycomputer
0 replies
3h6m

I say 2021, but truth is the supply chain was so trash that year that it took almost a year to actually get delivered. I don't think I actually started using the thing until 2022.

paxys
0 replies
2h43m

You are getting downvoted because you vaguely suggested something negative about an Apple product, as is my comment below

paxys
0 replies
3h13m

> mac laptop
apetresc
11 replies
3h28m

I just find it hilarious how approximately 100% of models beat all other models on benchmarks.

squirrel23
6 replies
3h28m

What do you mean?

apetresc
5 replies
3h26m

Virtually every announcement of a new model release has some sort of table or graph matching it up against a bunch of other models on various benchmarks, and they're always selected in such a way that the newly-released model dominates along several axes.

It turns interpreting the results into an exercise in detecting which models and benchmarks were omitted.

CharlieDigital
4 replies
3h23m

It would make sense, wouldn't it? Just as we've seen rising fuel efficiency, safety, dependability, etc. over the lifecycle of a particular car model.

The different teams are learning from each other and pushing boundaries; there's virtually no reason for any of the teams to release a model or product that is somehow inferior to a prior one (unless it had some secondary attribute such as requiring lower end hardware).

We're simply not seeing the ones that came up short; we don't even see the ones where it fell short of current benchmarks because they're not worth releasing to the public.

apetresc
1 replies
3h10m

That's a valid theory, a priori, but if you actually follow up you'll find that the vast majority of these benchmark results don't end up matching anyone's subjective experience with the models. The churn at the top is not nearly as fast as the press releases make it out to be.

tensor
0 replies
30m

Subjective experience is not a benchmark that you can measure success against. Also, of course new models are better on some set of benchmarks. Why would someone bother releasing a "new" model that is inferior to old ones? (Aside from attributes like more preferable licensing).

This is completely normal, the opposite would be strange.

andai
1 replies
3h4m

Sibling comment made a good point about benchmarks not being a great indicator of real-world quality. Every time something scores near GPT-4 on benchmarks, I try it out and it ends up being less reliable than GPT-3 within a few minutes of usage.

CharlieDigital
0 replies
3h0m

That's totally fine, but benchmarks are like standardized tests like the SAT. They measure something and it totally makes sense that each release bests the prior in the context of these benchmarks.

It may even be the case that in measuring against the benchmarks, these product teams sacrifice some real world performance (just as a student that only studies for the SAT might sacrifice some real world skills).

paxys
0 replies
3h18m

Benchmarks published by the company itself should be treated no differently than advertising. For actual signal check out more independent leaderboards and benchmarks (like HuggingFace, Chatbot Arena, MMLU, AlpacaEval). Of course, even then it is impossible to come up with an objective ranking since there is no consensus on what to even measure.

michaelt
0 replies
33m

Benchmarks are often weird because of what a benchmark inherently needs to be.

If you compare LLMs by asking them to tell you how to catch dragonflies - the free text chat answer you get will be impossible to objectively evaluate.

Whereas if you propose four ways to catch dragonflies and ask each model to choose option A, B, C or D (or check the relative probability the model assigns to those four output logits) the result is easy to objectively evaluate - you just check if it chose the one right answer.

Hence a lot of the most famous benchmarks are multiple-choice questions - even though 99.9% of LLM usage doesn't involve answering multiple-choice questions.

htrp
0 replies
3h26m

gotta cherry pick your benchmarks as much as possible

empath-nirvana
0 replies
2h55m

Just because of the pace of innovation and scaling, right now, it seems pretty natural that any new model is going to be better than the previous comparable models.

noman-land
6 replies
3h26m

I'm really excited about this model. Just need someone to quantize it to ~3 bits so it'll run on a 64GB MacBook Pro. I've gotten a lot of use from the 8x7b model. Paired with llamafile and it's just so good.

2Gkashmiri
4 replies
3h22m

Can you explain your use case? I tried to get into offline LLMs, on my machine and even Android, but without discrete graphics it's a slow hog, so I didn't enjoy it. But suppose I buy one, what then?

andai
1 replies
3h8m

I run Mistral-7B on an old laptop. It's not very fast and it's not very good, but it's just good enough to be useful.

My use case is that I'm more productive working with a LLM but being online is a constant temptation and distraction.

Most of the time I'll reach for offline docs to verify. So the LLM just points me in the right direction.

I also miss Google offline, so I'm working on a search engine. I thought I could skip crawling by just downloading Common Crawl, but unfortunately it's enormous and mostly junk or unsuitable for my needs. So my next project is how to data-mine Common Crawl to extract just the interesting (to me) bits...

When I have a search engine and a LLM I'll be able to run my own Phind, which will be really cool.

luke-stanley
0 replies
2h49m

Presumably you could run things like PageRank, I'm sure people do this sort of thing with CommonCrawl. There are lots of variants of graph connectivity scoring methods and classifiers. What a time to be alive eh?

popf1
0 replies
2h13m

> Can you explain your use case?

Pretty sure you can run it uncensored... that would be my use case.

noman-land
0 replies
2h54m

Yes, I have a side project that uses local whisper.cpp to transcribe a podcast I love and shows a nice UI to search and filter the contents. I use Mixtral 8x7b in chat interface via llamafile primarily to help me write python and sqlite code and as a general Q&A agent. I ask it all sorts of technical questions, learn about common tools, libraries, and idioms in an ecosystem I'm not familiar with, and then I can go to official documentation and dig in.

It has been a huge force multiplier for me and most importantly of all, it removes the dread of not knowing where to start and the dread of sending your inner monologue to someone's stupid cloud.

If you're curious: https://github.com/noman-land/transcript.fish/ though this doesn't include any Mixtral stuff because I don't use it programmatically (yet). I soon hope to use it to answer questions about the episodes like who the special guest is and whatnot, which is something I do manually right now.

mathverse
0 replies
18m

Shopping for a new MBP. Do you think going with more RAM would be wise?

sa-code
4 replies
3h29m

Is this the best permissively licensed model out there?

imjonse
2 replies
3h27m

So far it is Command R+. Let's see how this will fare on Chatbot Arena after a few weeks of use.

skissane
1 replies
2h53m

> So far it is Command R+

Most people would not consider Command R+ to count as the "best permissively licensed model" since CC-BY-NC is not usually considered "permissively licensed" – the "NC" part means "non-commercial use only"

imjonse
0 replies
2h9m

My bad, I wrongly remembered it was Apache too.

ru552
0 replies
3h27m

Today. Might change tomorrow at the pace this sector is at.

clementmas
4 replies
3h24m

I'm considering switching my function calling requests from OpenAI's API to Mistral. Are they using similar formats? What's the easiest way to use Mistral? Is it by using Huggingface?

ru552
3 replies
3h21m

Easiest is probably with ollama [0]. I think the ollama API is OpenAI-compatible.

[0] https://ollama.com/
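
For example, a minimal sketch against ollama's OpenAI-compatible endpoint (assumes a recent ollama is running locally and the model has already been pulled; the api_key value is a placeholder the client requires but ollama ignores):

```
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # required by the client, ignored by ollama
)

resp = client.chat.completions.create(
    model="mixtral:8x22b",
    messages=[{"role": "user", "content": "Summarize what a mixture-of-experts model is."}],
)
print(resp.choices[0].message.content)
```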

pants2
1 replies
2h39m

Ollama runs locally. What's the best option for calling the new Mixtral model on someone else's server programmatically?

ayolisup
4 replies
3h1m

What's the best way to run this on my Macbook Pro?

I've tried LMStudio, but I'm not a fan of the interface compared to OpenAI's. The lack of automatic regeneration every time I edit my input, like on ChatGPT, is quite frustrating. I also gave Ollama a shot, but using the CLI is less convenient.

Ideally, I'd like something that allows me to edit my settings quite granularly, similar to what I can do in OpenLM, with the QoL from the hosted online platforms, particularly the ease of editing my prompts that I use extensively.

shaunkoh
0 replies
2h41m

Not sure why your comment was downvoted. ^ is absolutely the right answer.

Open WebUI is functionally identical to the ChatGPT interface. You can even use it with the OpenAI APIs to have your own pay per use GPT 4. I did this.

mcbuilder
0 replies
48m

openrouter.ai is a fantastic idea if you don't want to self host

chown
0 replies
59m

You can try Msty as well. I am the author.

https://msty.app

CharlesW
4 replies
1h35m

Dumb question: Are "non-instructed" versions of LLMs just raw, no-guardrail versions of the "instructed" versions that most end-users see? And why does Mixtral need one, when OpenAI LLMs do not?

kingsleyopara
1 replies
1h24m

LLMs are first trained to predict the next most likely word (or token, if you want to be accurate) from web crawls. These models are basically great at continuing unfinished text but can't really be used for instructions, e.g. Q&A or chatting - this is the "non-instructed" version. These models are then fine-tuned for instructions using additional data from human interaction - these are the "instructed" versions, which are what end users (e.g. of ChatGPT, Gemini, etc.) see.
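
As a rough illustration of the difference: a base model just continues raw text, while the instruct fine-tune expects its chat template. A minimal sketch, assuming the Hugging Face repo name below and that its tokenizer ships a chat template:

```
from transformers import AutoTokenizer

# Base model usage: hand it raw text and it simply continues it.
base_prompt = "The capital of France is"

# Instruct model usage: wrap the request in the chat template the
# fine-tune was trained on (Mistral's [INST] ... [/INST] format).
# The repo name is an assumption; adjust to whatever you actually run.
tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x22B-Instruct-v0.1")
chat_prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is the capital of France?"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(chat_prompt)  # roughly: "<s>[INST] What is the capital of France? [/INST]"
```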

CharlesW
0 replies
1h19m

Very helpful, thank you.

CharlesW
0 replies
1h20m

I appreciate the correction, thanks!

luke-stanley
2 replies
3h2m

I'm confused on the instruction fine-tuning part that is mentioned briefly, in passing. Is there an open weight instruct variant they've released? Or is that only on their platform? Edit: It's on HuggingFace, great, thanks replies!

doublextremevil
2 replies
3h26m

How much VRAM is needed to run this?

MacsHeadroom
1 replies
3h1m

80GB in 4-bit.

But because it only activates two of the eight experts per token, it can run on a fast CPU in reasonable time. So 96GB of DDR4 will do. 96GB of DDR5 is better.
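
Rough arithmetic behind those numbers (the ~141B total parameter count is from Mistral's announcement; the overhead estimate is an assumption):

```
total_params = 141e9     # Mixtral 8x22B total parameters (~141B)
bytes_per_param = 0.5    # 4-bit quantization = half a byte per weight
print(total_params * bytes_per_param / 1e9)  # ~70 GB for the weights alone
# Add roughly 10-15% for quantization metadata, activations and the KV cache
# (an assumption) and you land around the 80GB figure above.
```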

Me1000
0 replies
2h10m

WizardLM-2 8x22B (which was a fine-tune of the Mixtral 8x22B base model) at 4-bit was only 80GB.

dd-dreams
2 replies
3h35m

The development never stops. In a few years we will look back and see how far models have come - how we couldn't run LLaMA 70B on a MacBook Air, and now we can.

squirrel23
1 replies
3h25m

Yes it's pretty cool. There was a neat comparison of deep learning development that I think resonates quite well here.

Around 5 years ago, it took an average user some pretty significant hardware, software and time (around a full night) to try to create a short deepfake. Now, you don't need any fancy hardware and you can have some decent results within 5 min on your average computer.

brokensegue
2 replies
3h27m

Isn't equating active parameters with cost a little unfair since you still need full memory for all the inactive parameters?

tartrate
0 replies
3h23m

Well, since it affects inference speed it means you can handle more in less time, needing less concurrency.

sa-code
0 replies
41m

Fewer parameters at inference time makes a massive difference in cost for batch jobs, assuming vram usage is the same

kristianp
1 replies
2h41m

So this one is 3x the size but only 7% better on MMLU? Given Moore's law is mostly dead, this trend is going to make for even more extremely expensive compute for next-gen AI models.

GaggiX
0 replies
2h23m

That's 25% fewer errors.

iFire
1 replies
3h9m

It wasn't clear, but how much hardware does it take to run Mixtral 8x22B (mistral.ai) locally?

ru552
0 replies
2h59m

A MacBook with 64GB of RAM.

elorant
1 replies
1h4m

Seems that Perplexity Labs already offers a free demo of it.

https://labs.perplexity.ai/

batperson
0 replies
21m

That's the old/regular model. This post is about the new "instruct" model.

yodsanklai
0 replies
35m

How does this compare to ChatGPT4?

stainablesteel
0 replies
2h37m

Is this different from their "large" model?

spenceryonce
0 replies
3h14m

I can't even begin to describe how excited I am for the future of AI.

jhoechtl
0 replies
2h21m

Did anyone have success getting danswer and ollama to work together?

endisneigh
0 replies
3h30m

Good to continue to see a permissive license here.

austinsuhr
0 replies
3h1m

Is 8x22B gonna make it to Le Chat in the near future?

arnaudsm
0 replies
3h28m

Curious to see how it performs against GPT-4.

Mixtral 8x22B beats Command R+, which is at GPT-4 level on LMSYS's leaderboard.

ado__dev
0 replies
2h15m

We rolled out Mixtral 8x22b to our LLM Litmus Test at s0.dev for Cody AI. Don't have enough data to say it's better or worse than other LLMs yet, but if you want to try it out for coding purposes, let me know your experience.

Lacerda69
0 replies
3h15m

I have been using Mixtral daily since it was released for all kinds of writing and coding tasks. Love it, and I'm massively invested in Mistral's mission.

Keep on doing this great work.

Edit: been using the previous version, seems like this one is even better?

ChicagoDave
0 replies
3h2m

We need larger context windows, otherwise we’re running the same path with marginally different results.