Mistral "Mixtral" 8x7B 32k model [magnet]

brucethemoose2
46 replies
1d1h

In other llm news, Mistral/Yi finetunes trained with a new (still undocumented) technique called "neural alignment" are blasting other models in the HF leaderboard. The 7B is "beating" most 70Bs. The 34B in testing seems... Very good:

https://huggingface.co/fblgit/una-xaberius-34b-v1beta

https://huggingface.co/fblgit/una-cybertron-7b-v2-bf16

I mention this because it could theoretically be applied to Mistral Moe. If the uplift is the same as regular Mistral 7B, and Mistral Moe is good, the end result is a scary good model.

This might be an inflection point where desktop-runnable OSS is really breathing down GPT-4's neck.

stavros
23 replies
1d

Aren't LLM benchmarks at best irrelevant, at worst lying, at this point?

nabakin
13 replies
1d

More or less. The automated benchmarks themselves can be useful when you weed out the models which are overfitting to them.

Although, anyone claiming a 7b LLM is better than a well trained 70b LLM like Llama 2 70b chat for the general case, doesn't know what they are talking about.

In the future will it be possible? Absolutely, but today we have no architecture or training methodology which would allow it to be possible.

You can rank models yourself with a private automated benchmark which models don't have a chance to overfit to or with a good human evaluation study.

Edit: also, I guess OP is talking about Mistral finetunes (ones overfitting to the benchmarks) beating out 70b models on the leaderboard because Mistral 7b is lower than Llama 2 70b chat.

brucethemoose2
6 replies
1d

I'm not saying it's better than 70B, just that it's very strong from what others are saying.

Actually I am testing the 34B myself (not the 7B), and it seems good.

fblgit
4 replies
21h58m

UNA: Uniform Neural Alignment. Haven't you noticed yet? Each model that I uniform behaves like a pre-trained model.. and you can likely fine-tune it again without damaging it.

If you chatted with them, you know.. that strange sensation, you know what it is.. Intelligence. Xaberius-34B is the highest performer on the board, and is NOT contaminated.

valine
3 replies
20h53m

How much data do you need for UNA? Is a typical fine tuning dataset needed or can you get away with less than that?

fblgit
1 replies
20h23m

It doesn't require much data; on a 7B it can take a couple of hours ~

valine
0 replies
19h59m

That’s cool. A couple hours on a single GPU or like 8x a100s?

brucethemoose2
0 replies
20h14m

In addition to what was said, if it's anything like DPO you don't need a lot of data, just a good set. For instance, DPO requires "good" and "bad" responses for each given prompt.
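For illustration only (not something from this thread), a DPO-style preference record is just a prompt paired with a preferred and a rejected completion; the field names below are hypothetical, but this is roughly the shape such datasets take:

    # Hypothetical DPO preference pair: one prompt, a "good" (chosen) and a
    # "bad" (rejected) response. Quality matters more than quantity here.
    preference_pair = {
        "prompt": "Explain what a mixture-of-experts model is.",
        "chosen": "A mixture-of-experts model routes each token through a small "
                  "subset of specialized feed-forward blocks picked by a router...",
        "rejected": "It's when several chatbots vote on the answer.",
    }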

nabakin
0 replies
11h7m

> I'm not saying it's better than 70B, just that it's very strong from what others are saying.

Gotcha

> Actually I am testing the 34B myself (not the 7B), and it seems good.

I've heard good things about it

mistrial9
1 replies
20h25m

quick to assert an authoritative opinion - yet the one word "better" belies the message? Certainly there are more dimensions worth including in the rating?

nabakin
0 replies
11h17m

Certainly. There may be aspects of a particular 7b model which beat another particular 70b model, and the finer pros and cons of different models are worth considering. But people are trying to rank models, and if we're ranking (calling one "better" than another), we might as well do it as accurately as we can, since it can be so subjective.

I see too many misleading "NEW 7B MODEL BEATS GPT-4" posts. People test those models a couple of times, come back to the comments section, declare it true, and onlookers know no better than to believe it. In my opinion this has led to many people claiming 7b models have gotten as good as Llama 2 70b or GPT-4, when that's not the case once you account for the overfitting these models exhibit and actually put them to the test via human evaluation.

elcomet
1 replies
23h53m

We can only compare specific training procedures though.

With a 7b and a 70b trained the same way, the 70b should always be better

nabakin
0 replies
11h9m

Makes sense

airgapstopgap
1 replies
20h12m

> today we have no architecture or training methodology which would allow it to be possible.

We clearly see that Mistral-7B is in some important, representative respects (eg coding) superior to Falcon-180B, and superior across the board to stuff like OPT-175B or Bloom-175B.

"Well trained" is relative. Models are, overwhelmingly, functions of their data, not just scale and architecture. Better data allows for yet-unknown performance jumps, and data curation techniques are a closely-guarded secret. I have no doubt that a 7B beating our best 60-70Bs is possible already, eg using something like Phi methods for data and more powerful architectures like some variation of universal transformer.

nabakin
0 replies
11h43m

I mean, I 100% agree size is not everything. You can have a model which is massive but not trained well so it actually performs worse than a smaller, better/more efficiently trained model. That's why we use Llama 2 70b over Falcon-180b, OPT-175b, and Bloom-175b.

I don't know how Mistral performs on codegen specifically, but models which are finetuned for a specific use case can definitely punch above their weight class. As I stated, I'm just talking about the general case.

But so far we don't know of a 7b model (there could be a private one we don't know about) which is able to beat a modern 70b model such as Llama 2 70b. Could one have been created which is able to do that but we simply don't know about it? Yes. Could we apply Phi's technique to 7b models and be able to reach Llama 2 70b levels of performance? Maybe, but I'll believe it when we have a 7b model based on it and a human evaluation study to confirm. It's been months now since the Phi study came out and I haven't heard about any new 7b model being built on it. If it really was such a breakthrough to allow 10x parameter reduction and 100x dataset reduction, it would be dumb for these companies to not pursue it.

sbierwagen
5 replies
13h30m

If you don't like machine evaluations, you can take a look at the lmsys chatbot arena. You give a prompt, two chatbots answer anonymously, and you pick which answer is better: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...

On the human ratings, three different 7B LLMs (Two different Openchat models and a Mistral fine tune) beat a version of GPT-3.5.

(The top 9 chatbots are GPT and Claude versions. Tenth place is a 70B model. While it's great that there's so much interest in 7B models, and it's incredible that people are pushing them so far, I selfishly wish more effort would go into 13B models... since those are the biggest that my macbook can run.)

behnamoh
4 replies
13h2m

I think the current approach — train 7b models and then do MoE on them — is the future. It'll still only be runnable on high end consumer devices. As for 13b + MoE, I don't think any consumer device could handle that in the next couple of years.

stolsvik
1 replies
9h48m

I have no formal credentials to say this, but intuitively I feel this is obviously wrong. You couldn't have taken 50 rats' brains, "mixed" them, and expected the result to produce new science.

For some uninteresting regurgitation, sure. But size - width and depth - seems like an important piece for the ability to extract deep understanding of the universe.

Also, MoE, as I understand it, will inherently not be able to glean insight into, nor reason about, and certainly not be able to come up with novel understanding, for cross-expert areas.

I believe size matters, a lot.

brucethemoose2
0 replies
2h11m

The MoE models are essentially trained as a single model. It's not 8 independent models; individually (AFAIK) they are all totally useless without each other.

It's just that each bit picks up different "parts" of the training more strongly, which can be selectively picked at runtime. This is actually kinda analogous to animals, which don't fire every single neuron so frequently the way monolithic models do.

The tradeoff, at equivalent quality, is essentially increased VRAM usage for faster, more splittable inference and training, though the exact balance of this tradeoff is an excellent question.

sbierwagen
1 replies
12h54m

My years-old M1 macbook with 16GB of ram runs them just fine. Several Geforce 40-series cards have at least 16GB of vram. Macbook pros go up to 128GB of ram and the mac studio goes up to 192GB. Running regular CPU inference on lots of system ram is cheap-ish and not intolerably slow.

These aren't totally common configurations, but they're not totally out of reach like buying an H100 for personal use.

behnamoh
0 replies
12h34m

1. I wouldn't consider the Mac Studio ($7,000) a consumer product.

2. Yes, and my MBP M1 Pro can run quantized 34b models. My point was that when you do MoE, memory requirements suddenly become too challenging. A 7b Q8 is roughly 7GB (7b parameters × 8 bits each). But 8x of that would be 56GB, and all of that must be in memory to run.
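Roughly the arithmetic being used here, as a sketch (real GGUF/GPTQ files add overhead for the router, attention weights and KV cache, so treat these as lower bounds):

    # Approximate weight size in GB: params (billions) * bits per weight / 8.
    def approx_gb(params_billions: float, bits: int) -> float:
        return params_billions * bits / 8

    print(approx_gb(7, 8))        # one 7B expert at Q8  -> ~7 GB
    print(approx_gb(7, 8) * 8)    # naive 8x7B at Q8     -> ~56 GB
    print(approx_gb(7, 4) * 8)    # naive 8x7B at Q4     -> ~28 GB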

typon
0 replies
23h1m

Yes. The only thing that is relevant is a hidden benchmark that's never released and run by a trusted third party.

puttycat
0 replies
20h7m

I wonder how it will rank on benchmarks which are password-protected to prevent test contamination, for example: https://github.com/taucompling/bliss

brucethemoose2
0 replies
1d

Yes, absolutely. I was just preaching this.

But it's not totally irrelevant. They are still a datapoint to consider with some performance correlation. YMMV, but these models actually seem to be quite good for the size in my initial testing.

eurekin
5 replies
19h49m

I just played with 7b version. It really feels different than anything I tried before. It could explain a docker compose file. It generated a simple vue application component.

I asked around a bit about the example and it was strangely coherent and focused across the whole conversation. It was really good at detecting where I was starting a new thread (without clearing the context) or referring to things from before.

It caught me off guard as well with this:

me: What does following mean [content of the docker compose]

cybertron-7b: In the provided YAML configuration, "following" refers to specifying dependencies

I've never seen any model using my exact wording in quotes in conversation like that.

mark_l_watson
3 replies
16h27m

How did you run it? Are there model files in Ollama format? Are you running on NVidia or Apple Silicon?

EDIT: just saw this “ Megatron (1, 2, and 3) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA.”

brucethemoose2
1 replies
15h13m

My recommendation is:

- Exui with exl2 files on good GPUs.

- Koboldcpp with gguf files for small GPUs and Apple silicon.

There are many reasons, but in a nutshell they are the fastest and most VRAM efficient.

I can fit 34Bs with about 75K context on a single 24GB 3090 before the quality drop from quantization really starts to get dramatic.

mark_l_watson
0 replies
4h35m

Thanks! I will check out Koboldcpp.

eurekin
0 replies
10h24m

In text-generation-webui on an Nvidia GPU

brucethemoose2
0 replies
15h15m

Yeah, the Yi version is quite something too.

3abiton
4 replies
19h49m

HF leaderboards are rarely reflective of real-world performance, especially for small variations, but nonetheless this is promising. What are the HW requirements for this latest Mistral7B?

eyegor
2 replies
17h6m

Any 7b can run well (~50 tok/s) on an 8gb gpu if you tune the context size. 13b can sometimes run well but typically you'll end up with a tiny context window or slow inference. For cpu, I wouldn't recommend going above 1.3b unless you don't mind waiting around.

aarushsah
1 replies
16h34m

How so? I'm only getting 12 t/s using Mistral in LM Studio.

eyegor
0 replies
16h30m

The lazy way is to use text-generation-webui, use an exllamav2 conversion of your model, and turn down context length until it fits (and tick the 8 bit cache option). If you go over your vram it will cut your speed substantially. Like 60/s down to 15/s for an extra 500 context length over what fits. Similar idea applies to any other backends, but you need to shove all the layers into vram if you want decent tok/s. To give you a starting point, typically for 7b models I have to use 4k-6k context length and I use 4-6 bit quantizations for an 8gb gpu. So start at 4 bit, 4k context and adjust up as you can.

You can find most popular models converted for you on huggingface.co if you add exl2 to your search and start with the 4 bit quantized version. Don't bother going above 6 bits even if you have spare vram, practically it doesn't offer much benefit.

For reference I max out around 60 tok/s at 4bit, 50 tok/s at 5bit, 40 at 6bit for some random 7b parameter model on a rtx 2070.

brucethemoose2
0 replies
19h30m

> What are the HW requirements for this latest Mistral7B?

Pretty much anything with ~6-8GB of memory that's not super old.

It will run on my 6GB laptop RTX 2060 extremely quickly. It will run on my IGP or Phone with MLC-LLM. It will run fast on a laptop with a small GPU, with the rest offloaded to CPU.

Small, CPU only servers are kinda the only questionable thing. It runs, just not very fast, especially with long prompts (which are particularly hard for CPUs). There's also not a lot of support for AI ASICs.

_boffin_
3 replies
1d

Interesting. One thing i noticed is that Mistral has a `max_position_embeddings` of ~32k while these have it at 4096.

Any thoughts on that?

brucethemoose2
2 replies
1d

It's complicated.

The 7B model (cybertron) is trained on Mistral. Mistral is technically a 32K model, but it uses a sliding window beyond 32K, and for all practical purposes in current implementations it behaves like an 8K model.

The 34B model is based on Yi 34B, which is inexplicably marked as a 4K model in the config but actually works out to 32K if you literally just edit that line. Yi also has a 200K base model... and I have no idea why they didn't just train on that. You don't need to finetune at long context to preserve its long context ability.

ComputerGuru
1 replies
19h38m

Did you mean "but it uses a sliding window beyond" *8K*? Because I don't understand how the sentence would work otherwise.

brucethemoose2
0 replies
19h27m

Yeah exactly, sorry.

swyx
1 replies
21h19m

what is neural alignment? who came up with it?

brucethemoose2
0 replies
17h19m

@fblgit apparently, from earlier in this thread.

screye
1 replies
17h32m

Yeah, and Mistral doesn't particularly care about lobotomizing the model with 'safety-training'. So it can achieve much better performance per-parameter than anthropic/google/OpenAI while being more steerable as well.

behnamoh
0 replies
13h5m

until Mistral gets too big for lawyers to ignore.

fblgit
1 replies
21h53m

Correct. UNA can align the MoE at multiple layers, experts, nearly any part of the neural network I would say. Xaberius 34B v1 "BETA".. is the king, and it's just that.. the beta. I'll be focusing on the Mixtral, it's a Christmas gift.. modular in that way, thanks for the lab @mistral!

brucethemoose2
0 replies
21h20m

Do a Yi 200K version as well! That would make my Christmas, as Mistral Moe is only maybe 32K.

nikvdp
0 replies
10h10m

This piqued my interest so I made an ollama modelfile of it for the smallest variant (from TheBloke's GGUF [1] version). It does indeed seem impressively gpt4-ish for such a small model! Feels more coherent than openhermes2.5-mistral, which was my previous go-to local llm.

If you have ollama installed you can try it out with `ollama run nollama/una-cybertron-7b-v2`.

[1]: https://huggingface.co/TheBloke/una-cybertron-7B-v2-GGUF

kcorbitt
43 replies
1d2h

No public statement from Mistral yet. What we know:

- Mixture of Experts architecture.

- 8x 7B-parameter experts (potentially trained starting with their base 7B model?).

- 96GB of weights. You won't be able to run this on your home GPU.

jlokier
15 replies
1d1h

> - 96GB of weights. You won't be able to run this on your home GPU.

You can these days, even in a portable device running on battery.

96GB fits comfortably in some laptop GPUs released this year.

refulgentis
11 replies
1d1h

This is extremely misleading. Source: I've been working in local LLMs for the past 10 months. Got my Mac laptop too. I'm bullish too. But we shouldn't breezily dismiss those concerns out of hand. In practice, it's single digit tokens a second on a $4500 laptop for a model with weights half this size (Llama 2 70B Q2 GGUF => 29 GB, Q8 => 36 GB).

coolspot
7 replies
1d

> $4500

Which is more than the price of an RTX A6000 48GB ($4k used on eBay)

brucethemoose2
4 replies
23h11m

Which is outrageously priced, in case that's not clear. It's a 2020 RTX 3090 with doubled-up memory ICs, which is not much extra BoM.

baq
3 replies
21h56m

Clearly it’s worth what people are willing to pay for it. At least it isn’t being used to compute hashes of virtual gold.

brucethemoose2
1 replies
21h2m

It's an artificial supply constraint due to artificial market segmentation enabled by Nvidia/AMD.

Honestly it's crazy that AMD indulges in this, especially now. Their workstation market share is comparatively tiny, and instead they could have a swarm of devs (like me) pecking away at AMD compatibility on AI repos if they sold cheap 32GB/48GB cards.

baq
0 replies
2h38m

Never said it was ok! Just saying that there are people willing to pay this much, so it costs this much. I'd very much like to buy a 40GB GPU for this too, but at these prices this is not happening - I'd have to turn it into a business to justify the expense, but I just don't feel like it.

tucnak
0 replies
21h23m

People are also willing to die for all kinds of stupid reasons, and it's not indicative of _anything_ let alone a clever comment on the online forum. Show some decorum, please!

CamperBob2
1 replies
23h25m

How fast does it run on that?

refulgentis
0 replies
19h48m

quantization makes it hard to have exactly one answer -- I'd make a q0 joke, except that's real now -- i.e. reduce the 3.4 * 10^38 range of float 32 to 2, a boolean.

it's not very good, at all, but now we can claim some pretty massive speedups.

I can't find anything for llama 2 70B on 4090 after 10 minutes of poking around, 13B is about 30 tkn/s. it looks like people generally don't run 70B unless they have multiple 4090s.

MacsHeadroom
2 replies
23h8m

Mixtral 8x7b only needs 12B of weights in RAM per generation.

2B for the attention head and 5B from each of 2 experts.

It should be able to run slightly faster than a 13B dense model, in as little as 16GB of RAM with room to spare.

filterfiber
0 replies
22h19m

> in as little as 16GB of RAM with room to spare.

I don't think that's the case; for full speed you still need (5B*8)/2 + 2~few B of overhead.

I think the experts are chosen per-token? That means that yes, you technically only need two in VRAM (plus router/overhead) per token, but you'll have to constantly be loading in different experts unless you can fit them all, which would still be terrible for performance.

So you'll still be PCIE/RAM speed limited unless you can fit all of the experts into memory (or get really lucky and only need two experts).

dkarras
0 replies
16h12m

No, it doesn't work that way. Experts can change per token, so for interactive speeds you need them all in memory unless you want to wait for model swaps between tokens.

michaelt
2 replies
1d1h

Be a lot cooler if you said what laptop, and how much quantisation you're assuming :)

tvararu
0 replies
1d1h

They're probably referring to the new MacBook Pros with up to 128GB of unified memory.

jlokier
0 replies
15h6m

Sibling commenter tvararu is correct. 2023 Apple Macbook with 128GiB RAM, all available to the GPU. No quantisation required :)

Other sibling commenter refulgentis is correct too. The Apple M{1-3} Max chips have up to 400GB/s memory bandwidth. I think that's noticeably faster than every other consumer CPU out there. But it's slower than a top Nvidia GPU. If the entire 96GB model has to be read by the GPU for each token, that will limit unquantised performance to 4 tokens/s at best. However, as the "Mixtral" model under discussion is a mixture-of-experts, it doesn't have to read the whole model for each token, so it might go faster. Perhaps still single-digit tokens/s though, for unquantised.

coder543
13 replies
1d2h

> 96GB of weights. You won't be able to run this on your home GPU.

This seems like a non-sequitur. Doesn't MoE select an expert for each token? Presumably, the same expert would frequently be selected for a number of tokens in a row. At that point, you're only running a 7B model, which will easily fit on a GPU. It will be slower when "swapping" experts if you can't fit them all into VRAM at the same time, but it shouldn't be catastrophic for performance in the way that being unable to fit all layers of an LLM is. It's also easy to imagine caching the N most recent experts in VRAM, where N is the largest number that still fits into your VRAM.
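A minimal sketch of that "cache the N most recent experts in VRAM" idea, assuming hypothetical load_to_gpu/free_gpu helpers; it's just an LRU over expert weights, not how any existing runtime actually implements MoE offloading:

    from collections import OrderedDict

    class ExpertCache:
        """Keep the N most recently used experts resident in VRAM; evict the oldest."""
        def __init__(self, load_to_gpu, free_gpu, capacity=4):
            self.load_to_gpu = load_to_gpu    # hypothetical: copy expert weights CPU -> GPU
            self.free_gpu = free_gpu          # hypothetical: release the GPU copy
            self.capacity = capacity
            self.resident = OrderedDict()     # expert_id -> GPU-resident weights

        def get(self, expert_id):
            if expert_id in self.resident:
                self.resident.move_to_end(expert_id)          # recently used
                return self.resident[expert_id]
            if len(self.resident) >= self.capacity:
                _old_id, old_weights = self.resident.popitem(last=False)
                self.free_gpu(old_weights)                    # make room
            self.resident[expert_id] = self.load_to_gpu(expert_id)
            return self.resident[expert_id]

How well this works depends entirely on how often consecutive tokens reuse the same experts, which is exactly the open question raised below.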

ttul
4 replies
1d2h

Someone smarter will probably correct me, but I don’t think that is how MoE works. With MoE, a feed-forward network assesses the tokens and selects the best two of eight experts to generate the next token. The choice of experts can change with each new token. For example, let’s say you have two experts that are really good at answering physics questions. For some of the generation, those two will be selected. But later on, maybe the context suggests you need two models better suited to generate French language. This is a silly simplification of what I understand to be going on.

ttul
1 replies
1d2h

This being said, presumably if you’re running a huge farm of GPUs, you could put each expert onto its own slice of GPUs and orchestrate data to flow between GPUs as needed. I have no idea how you’d do this…

alchemist1e9
0 replies
1d

Ideally those many GPUs could be on different hosts connected with a commodity interconnect like 10gbe.

If MOE models do well it could be great for commodity hw based distributed inference approaches.

wongarsu
0 replies
1d2h

One viable strategy might be to offload as many experts as possible to the GPU, and evaluate the other ones on the CPU. If you collect some statistics which experts are used most in your use cases and select those for GPU acceleration you might get some cheap but notable speedups over other approaches.
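A sketch of that selection step, assuming you have already logged which experts the router picked over some representative prompts; the routing_log format and gpu_budget number are made up for illustration:

    from collections import Counter

    def pick_gpu_experts(routing_log, gpu_budget=4):
        """routing_log: iterable of (layer, expert_id) choices observed during a test run."""
        usage = Counter(expert_id for _layer, expert_id in routing_log)
        ranked = [expert_id for expert_id, _count in usage.most_common()]
        return set(ranked[:gpu_budget])   # pin these on the GPU, run the rest on CPU

    # Example: experts 3 and 5 dominate this log, so they get the GPU slots first.
    log = [(0, 3), (0, 5), (1, 3), (1, 2), (2, 3), (2, 5)]
    print(pick_gpu_experts(log, gpu_budget=2))   # {3, 5}

Whether this beats a dynamic cache depends on how skewed the routing statistics actually are for your workload.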

Philpax
0 replies
1d2h

Yes, that's more or less it - there's no guarantee that the chosen expert will still be used for the next token, so you'll need to have all of them on hand at any given moment.

read_if_gay_
4 replies
1d2h

however, if you need to swap experts on each token, you might as well run on cpu.

tarruda
3 replies
1d2h

> Presumably, the same expert would frequently be selected for a number of tokens in a row

In other words, assuming you ask a coding question and there's a coding expert in the mix, it would answer it completely.

read_if_gay_
1 replies
1d2h

yes I read that. do you think it's reasonable to assume that the same expert will be selected so consistently that model swapping times won't dominate total runtime?

tarruda
0 replies
1d1h

No idea TBH, we'll have to wait and see. Some say it might be possible to efficiently swap the expert weights if you can fit everything in RAM: https://x.com/brandnarb/status/1733163321036075368?s=20

ttul
0 replies
1d2h

See my poorly educated answer above. I don’t think that’s how MoE actually works. A new mixture of experts is chosen for every new context.

tarruda
1 replies
1d2h

I will be super happy if this is true.

Even if you can't fit all of them in the VRAM, you could load everything in tmpfs, which at least removes disk I/O penalty.

cjbprime
0 replies
23h26m

Just mentioning in case it helps anyone out: Linux already has a disk buffer cache. If you have available RAM, it will hold on to pages that have been read from disk until there is enough memory pressure to remove them (and then it will only remove some of them, not all of them). If you don't have available RAM, then the tmpfs wouldn't work. The tmpfs is helpful if you know better than the paging subsystem about how much you really want this data to always stay in RAM no matter what, but that is also much less flexible, because sometimes you need to burst in RAM usage.

numeri
0 replies
1d2h

You're not necessarily wrong, but I'd imagine this is almost prohibitively slow. Also, this model seems to use two experts per token.

shubb
4 replies
1d2h

> You won't be able to run this on your home GPU.

Would this allow you to run each expert on a cheap commodity GPU card so that instead of using expensive 200GB cards we can use a computer with 8 cheap gaming cards in it?

terafo
1 replies
1d1h

Yes, but you wouldn't want to do that. You will be able to run that on a single 24gb GPU by the end of this weekend.

brucethemoose2
0 replies
23h12m

Maybe two weekends.

dragonwriter
1 replies
1d1h

> Would this allow you to run each expert on a cheap commodity GPU card so that instead of using expensive 200GB cards we can use a computer with 8 cheap gaming cards in it?

I would think no differently than you can run a large regular model on a multi-GPU setup (which people do!). It's still all one network even if not all of it is activated for each token, and since it's much smaller than a 56B model, it seems like there are significant components of the network that are shared.

terafo
0 replies
1d1h

Attention is shared. It's ~30% of params here. So ~2B params are shared between experts and ~5B params are unique to each expert.

faldore
2 replies
1d2h

at 4 bits you could run it on a 3090 right?

brucethemoose2
1 replies
19h24m

It's crazy how the 3090 is such a ubiquitous local llm card these days. I despise Nvidia on Linux... And yet I ended up with a 3090.

How are AMD/Intel totally missing this boat?

nicolas03
0 replies
18h53m

LMAO SAME. I hate Nvidia yet got a used 3090 for $600. I've been biting my nails hoping China doesn't resort to 3090's, because I really want to buy another and I'm not paying more than 600.

miven
1 replies
1d1h

> You won't be able to run this on your home GPU.

As far as I understand in a MOE model only one/few experts are actually used at the same time, shouldn't the inference speed for this new MOE model be roughly the same as for a normal Mistral 7B then?

7B models have reasonable throughput when run on a beefy CPU, especially when quantized down to 4bit precision, so couldn't Mixtral be comfortably run on a CPU too then, just with 8 times the memory footprint?

filterfiber
0 replies
21h6m

So this specific model ships with a default config of 2 experts per token.

So you need roughly two loaded in memory per token. Roughly the speed and memory of a 13B per token.

Only issue is that's per-token. 2 experts are chosen per token, which means if they aren't the same ones as the last token, you need to load them into memory.

So yeah to not be disk limited you'd need roughly 8 times the memory and it would run at the speed of a 13B model.

~~~Note on quantization, iirc smaller models lose more performance when quantized vs larger models. So this would be the speed of a 4bit 13B model but with the penalty from a 4bit 7B model.~~~ Actually I have zero idea how quantization scales for MoE, I imagine it has the penalty I mentioned but that's pure speculation.

MacsHeadroom
1 replies
1d2h

That is only 24GB in 4bit.

People are running models 2-4 times that size on local GPUs.

What's more, this will run on a MacBook CPU just fine-- and at an extremely high speed.

brucethemoose2
0 replies
1d1h

Yeah, 70B is much larger and fits on a 24GB card, admittedly with very lossy quantization.

This is just about right for 24GB. I bet that is intentional on their part.

tarruda
0 replies
1d2h

Theoretically it could fit into a single 24GB GPU if 4-bit quantized. Exllama v2 has even more efficient quantization algorithm, and was able to fit 70B models in 24GB gpu, but only with 2048 tokens of context.

mareksotak
20 replies
1d2h

Some companies spend weeks on landing pages, demos and cute, thought-through promo videos, and then there is Mistral, casually dropping a magnet link on a Friday.

tananaev
13 replies
1d2h

I'm sure it's also a marketing move to build a certain reputation. Looks like it's working.

OscarTheGrinch
11 replies
1d2h

Not geoblocking the entirety of Europe also makes them stand out like a ringmaster amongst clowns.

peanuty1
8 replies
23h3m

Google Bard is still not available in Canada.

oh_sigh
6 replies
21h57m

Are there some regulatory reasons why it would not be available? It seems weird if Google would intentionally block users merely to block them.

mrandish
2 replies
21h4m

I think there are still some pretty onerous laws about French localization of products and services made available in the French-speaking part of Canada. Could be that...

simonerlic
1 replies
20h44m

I originally thought so too, but as far as I know Bard is available in France, so I have a feeling that language isn't the roadblock here.

dpwm
0 replies
12h39m

Can confirm Bard is available in France and the UI has been translated to French.

ComputerGuru
1 replies
19h35m

Google and Facebook were, up until just a couple of days ago, in a cold war with the Canadian government.

peanuty1
0 replies
5h23m

Google made a deal to pay 100M/year to news organizations in Canada but Meta is continuing to block news links.

wadefletch
0 replies
19h41m

There's a proposed framework[1] in the EU that's rather restrictive. Seems like they're just not even bothering, perhaps to make a point.

[1] https://digital-strategy.ec.europa.eu/en/policies/regulatory...

noonething
0 replies
8h11m

Neither is Facebook's image one.

moffkalast
1 replies
1d1h

Well they are French after all. They should be geoblocking the USA in response for a bit to make a point lol.

fredoliveira
0 replies
1d1h

Not with their cap table, they won't ;-)

throwaway4aday
0 replies
4h46m

technically, it is marketing but at this level marketing is indistinguishable from shipping

tarruda
5 replies
1d2h

I'm curious about their business model.

nuz
2 replies
1d2h

They can make plenty by charging consulting fees for finetuning and general support around their models.

realce
0 replies
1d2h

"plenty" is not a word some of these people understand however

behnamoh
0 replies
22h39m

You mean they put on a Redhat?

jorge-d
1 replies
1d2h

Well so far their business model seems to be mostly centered on raising money[1]. I do hope they succeed in becoming a successful contender against OpenAI.

[1] https://www.bloomberg.com/news/articles/2023-12-04/openai-ri...

nulld3v
17 replies
1d2h

Looks to be Mixture of Experts, here is the params.json:

    {
        "dim": 4096,
        "n_layers": 32,
        "head_dim": 128,
        "hidden_dim": 14336,
        "n_heads": 32,
        "n_kv_heads": 8,
        "norm_eps": 1e-05,
        "vocab_size": 32000,
        "moe": {
            "num_experts_per_tok": 2,
            "num_experts": 8
        }
    }

sockaddr
11 replies
1d1h

What does expert mean in this context?

moffkalast
10 replies
1d1h

It means it's 8 7B models in a trench coat, in a sense: it runs as fast as a 14B (2 experts at a time apparently) but takes up as much memory as a 40B model (70% * 8 * 7B). There is some process trained into it that chooses which experts to use based on the question posed. GPT 4 is allegedly based on the same architecture, but at 8*222B.

rishabhjain1198
2 replies
19h56m

In a MoE model with experts_per_token = 2 and each expert having 7B params, after picking the experts it should run as fast as the slowest 7B expert, not a comparable 14B model.

nullc
1 replies
18h12m

Only assuming it's able to hide the faster one in free parallelism.

moffkalast
0 replies
8h13m

My CPU trying its best to run inference: parallelwhat?

tavavex
1 replies
22h2m

Does anyone here know roughly how an expert gets chosen? It seems like a very open-ended problem, and I'm not sure on how it can be implemented easily.

rishabhjain1198
0 replies
19h53m

[Relevant paper](https://arxiv.org/abs/1701.06538).

TL;DR you can think of it as the initial part of the model is essentially dedicated to learning which experts to choose.
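A stripped-down PyTorch sketch of top-2 gating with the dimensions from the leaked params.json; the real Mixtral block uses gated (SwiGLU-style) MLPs and per-layer routers, so this only illustrates the routing idea, not the actual implementation:

    import torch
    import torch.nn as nn

    class Top2MoE(nn.Module):
        def __init__(self, dim=4096, hidden_dim=14336, num_experts=8, num_experts_per_tok=2):
            super().__init__()
            self.router = nn.Linear(dim, num_experts, bias=False)   # learned gate
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, dim))
                for _ in range(num_experts)
            )
            self.k = num_experts_per_tok

        def forward(self, x):                                 # x: (n_tokens, dim)
            weights, idx = torch.topk(self.router(x), self.k, dim=-1)
            weights = torch.softmax(weights, dim=-1)          # mixing weights for the chosen experts
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                for slot in range(self.k):
                    mask = idx[:, slot] == e                  # tokens whose slot-th choice is expert e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * expert(x[mask])
            return out

Only the selected experts' weights are touched for a given token, which is where the "13B-class compute, 40B-class memory" intuition elsewhere in the thread comes from.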

dragonwriter
1 replies
1d1h

> GPT 4 is based on the same architecture, but at 8*222B.

Do we actually know either that it is MoE or that size? IIRC both of those started as outsider guesses that somehow just became accepted knowledge without any actual confirmation.

moffkalast
0 replies
1d

Iirc some of the other things the same source stated were later confirmed, so this is likely to be true as well, but I might be misremembering.

WeMoveOn
1 replies
20h27m

How did you come up with 40b for the memory? specifically, why 0.7 * total params?

moffkalast
0 replies
8h7m

It's just a rough estimate, given that these things are fairly linear: the original 7B Mistral was 15 GB and the new one is 86 GB, whereas a fully duplicated 8 * 15 GB would suggest a 120 GB size, so 86/120 = 0.71 for actual size, suggesting 29% memory savings. This of course doesn't really account for any multiple vs single file saving overhead and such, so it's likely to be a bit off.

sockaddr
0 replies
1d

Fascinating. Thanks

sp332
4 replies
1d2h

I don't see any code in there. What runtime could load these weights?

brucethemoose2
3 replies
1d2h

It's presumably llama, just like Mistral.

Everything open source is llama now. Facebook all but standardized the architecture.

I dunno about the moe. Is there existing transformers code for that part? It kinda looks like there is based on the config.

jasonjmcghee
1 replies
1d1h

Mistral is not llama architecture.

https://github.com/mistralai/mistral-src

brucethemoose2
0 replies
1d1h

It's basically llama architecture, all but drop-in compatible with llama runtimes.

refulgentis
0 replies
1d1h

Because it's JSON? :)

MyFirstSass
13 replies
1d1h

Hot take but Mistral 7B is the actual state of the art of LLM's.

ChatGPT 4 is amazing yes and i've been a day 1 subscriber, but it's huge, runs on server farms far away and is more or less a black box.

Mistral is tiny, and amazingly coherent and useful for its size for both general questions and code, uncensored, and a leap I wouldn't have believed possible in just a year.

I can run it on my Macbook Air at 12tkps, can't wait to try this on my desktop.

tarruda
3 replies
1d

> I can run it on my Macbook Air at 12tkps, can't wait to try this on my desktop.

That seems kinda low, are you using Metal GPU acceleration with llama.cpp? I don't have a macbook, but saw some of the llama.cpp benchmarks that suggest it can reach close to 30tk/s with GPU acceleration.

MyFirstSass
2 replies
1d

Thanks for the tip. I'm on the M2 Air with 16 GB's of ram.

If anyone has faster than 12tkps on Air's let me know.

I'm using the LM Studio GUI over llama.cpp with the "Apple Metal GPU" option. Increasing CPU threads seemingly does nothing either without metal.

Ram usage hovers at 5.5GB with a q5_k_m of Mistral.

ukuina
0 replies
11h38m

LlamaFile typically outperforms LM Studio and even Ollama.

M4v3R
0 replies
22h39m

Try different quantization variations. I got vastly different speeds depending on which quantization I chose. I believe q4_0 worked very well for me. Although for a 7B model q8_0 runs just fine too with better quality.

emporas
2 replies
21h21m

Given that 50% of all information consumed on the internet was produced in the last 24 hours, smaller models could hold a serious advantage over bigger models.

If an LLM or a SmallLM can be retrained or fine-tuned constantly, every week or every day to incorporate recent information then outdated models trained a year or two years back hold no chance to keep up. Dunno about the licensing but OpenAI could incorporate a smaller model like Mistral7B into their GPT stack, re-train it from scratch every week, and charge the same as GPT-4. There are users who might certainly prefer the weaker, albeit updated models.

refulgentis
1 replies
19h32m

It's much easier to do RAG than try to shoehorn the entirety of the universe into 7B parameters every 24 hours. Mistral's great at being coherent and processing info at 7B, but you wouldn't want it as an oracle.

emporas
0 replies
12h57m

I didn't know about RAG, thanks for sharing. I am not sure if outdated information can be tackled with RAG though, especially in coding.

Just today, I asked GPT and Bard (Gemini) to write code using Slint; neither of them had any idea of Slint. Slint being a relatively new library, from about two and a half (0.1 version) to one and a half (0.2 version) years back [1], is not something they trained on.

Natural language doesn't change that much over the course of a handful of years, but in coding 2 years back may as well be a century. My argument is that SmallLMs are not only relevant, they are actually desirable, if the best solution is to be retrained from scratch.

If on the other hand a billion token context window proves to be practical, or the RAG technique solves most use cases, then LLMs might suffice. Could this RAG technique be aware of millions of git commits daily, on several projects, and keep its knowledge base up to date? I don't know about that.

https://github.com/slint-ui/slint/releases?page=3

ipsum2
1 replies
22h20m

State of the art for something you can run on a Macbook air, but not state of the art for LLMs, or even open source. Yi 34B and Llama2 70B still beat it.

MyFirstSass
0 replies
22h4m

True but it's ahead of the competition when size is considered, which is why i really look forward to their 13B, 33B models etc. because if they are as potent who knows what leaps we'll take soon.

I remember running llama1 33B 8 months ago, which as I recall was on Mistral 7B's level, while other 7B models were a rambling mess.

The jump in "potency" is what is so extreme.

andy_xor_andrew
1 replies
23h11m

I am with you on this. Mistral 7B is amazingly good. There are finetunes of it (the Intel one, and Berkeley Starling) that feel like they are within throwing distance of gpt3.5T... at only 7B!

I was really hoping for a 13B Mistral. I'm not sure if this MOE will run on my 3090 with 24GB. Fingers crossed that quantization + offloading + future tricks will make it runnable.

MyFirstSass
0 replies
22h10m

True, I've been using the OpenOrca finetune and just downloaded the new UNA Cybertron model, both tuned on the Mistral base.

They are not far from GPT-3 logic-wise, I'd say, if you consider the breadth of data, i.e. very little in 7GB; so missing other languages, niche topics, prose styles etc.

I honestly wouldn't be surprised if 13B would be indistinguishable from GPT-3.5 on some levels. And if that is the case - then coupled with the latest developments in decoding - like Ultrafastbert, Speculative, Jacobi, Lookahead etc. i honestly wouldn't be surprised to see local LLM's on current GPT-4 level within a few years.

nabakin
0 replies
21h11m

Not a hot take, I think you're right. If it was scaled up to 70b, I think it would be better than Llama 2 70b. Maybe if it was then scaled up to 180b and turned into a MoE it would be better than GPT-4.

123yawaworht456
0 replies
22h30m

it really is. it feels at the very least equal to llama2 13b. if mistral 70b had existed and was as much an improvement over llama2 70b as it is at 7b size, it would definitely be on par with gpt3.5

tarruda
9 replies
1d2h

Still 7B, but now with 32k context. Looking forward to see how it compares with the previous one, and what the community does with it.

seydor
3 replies
1d1h

unfortunately too big for the broader community to test. Will be very interesting to see how well it performs compared to the large models

brucethemoose2
2 replies
1d1h

Not really, looks like a ~40B class model which is very runnable.

MacsHeadroom
1 replies
22h59m

It's actually ~13B class at runtime. 2B for attention is shared across each expert and then it runs 2 experts at a time.

So 2B for attention + 5Bx2 for inference = 12B in RAM at runtime.

brucethemoose2
0 replies
20h28m

Yeah. I just mean in terms of VRAM usage.

MacsHeadroom
3 replies
1d2h

Not 7B, 8x7B.

It will run with the speed of a 7B model while being much smarter but requiring ~24GB of RAM instead of ~4GB (in 4bit).

dragonwriter
2 replies
1d2h

Given the config parameters posted, it's 2 experts per token, so the computation cost per token should be the cost of the component that selects experts + 2× the cost of a 7B model.

stavros
0 replies
1d

Yes, but I also care about "can I load this onto my home GPU?" where, if I need all experts for this to run, the answer is "no".

MacsHeadroom
0 replies
23h5m

Ah good catch. Upon even closer examination, the attention layer (~2B params) is shared across experts. So in theory you would need 2B for the attention head + 5B for each of two experts in RAM.

That's a total of 12B, meaning this should be able to be run on the same hardware as 13B models with some loading time between generations.
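The arithmetic behind that estimate, as a sketch (the 2B/5B split is the thread's guess, not a confirmed breakdown):

    shared_attention_b = 2            # ~2B params shared across experts (estimate from above)
    per_expert_ffn_b   = 5            # ~5B params unique to each expert (estimate from above)
    num_experts, active_per_token = 8, 2

    total_b  = shared_attention_b + num_experts * per_expert_ffn_b        # ~42B stored on disk
    active_b = shared_attention_b + active_per_token * per_expert_ffn_b   # ~12B touched per token

    print(total_b, active_b)   # 42 12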

brucethemoose2
0 replies
19h22m

We can't infer the actual context size from the config.

Mistral 7B is basically an 8K model, but was marked as a 32K one.

maremmano
6 replies
1d2h

Do you need some fancy announcement? let's do it the 90s way: https://twitter.com/erhartford/status/1733159666417545641/ph...

eurekin
4 replies
1d1h

I find that way more bold and confident than dropping an obviously manipulated and unrealistic marketing page or video.

maremmano
3 replies
23h21m

Frankly I don't know why Google continues to act this way. Let's remember the "Google Duplex: A.I. Assistant Calls Local Businesses To Make Appointments" story. https://www.youtube.com/watch?v=D5VN56jQMWM

Not that this affects Google's user base in any way, at the moment.

polygamous_bat
1 replies
21h28m

> Frankly I don't know why Google continues to act this way.

Unfortunately, that's because they have Wall St. analysts looking at their videos who will (indirectly) determine how big of a bonus Sundar and co takes home at the end of the year. Mistral doesn't have to worry about that.

eurekin
0 replies
21h16m

This makes so much sense! Thanks

eurekin
0 replies
21h51m

They obviously have both money and great talent. Maybe they put out minimal effort only for investors that expect their presence in consumer space?

seydor
0 replies
22h5m

FILE_ID.DIZ

YetAnotherNick
6 replies
1d2h

86 GB. So it's likely a Mixture of experts model with 8 experts. Exciting.

tarruda
5 replies
1d2h

Damn, I was hoping it was still a single 7B model that I would be able to run on my GPU

renonce
4 replies
1d2h

You can, wait for a 4-bit quantized version

tarruda
3 replies
1d2h

I only have a RTX 3070 with 8GB VRam. It can run quantized 7B models well, but this is 8 x 7B. Maybe an RTX 3090 with 24GB VRAM can do it.

espadrine
0 replies
1d1h

Once on llama.cpp, it will likely run on CPU with enough RAM, especially given that the GGUF mmap code only seems to use RAM for the parts of the weights that get used.

burke
0 replies
1d1h

Napkin math: 7x(4/8)x8 is 28GB, and q4 uses a little more than just 4 bits per param, and there’s extra overhead for context, and the FFN to select experts is probably more on top of that.

It would probably fit in 32GB at 4-bit but probably won’t run with sensible quantization/perf on a 3090/4090 without other tricks like offloading. Depending on how likely the same experts are to be chosen for multiple sequential tokens, offloading experts may be viable.

brucethemoose2
0 replies
1d2h

It would be very tight. 8x7B in 24GB (currently) has more overhead than 70B.

It's theoretically doable, with quantization from the recent 2 bit quant paper and a custom implementation (in exllamav2?)

EDIT: Actually the download is much smaller than 8x7B. Not sure how, but it's sized more like a 30B, perfect for a 3090. Very interesting.

BryanLegend
5 replies
1d

Andrej Karpathy's take:

New open weights LLM from @MistralAI

params.json:

- hidden_dim / dim = 14336/4096 => 3.5X MLP expand

- n_heads / n_kv_heads = 32/8 => 4X multiquery

- "moe" => mixture of experts 8X top 2

Likely related code: https://github.com/mistralai/megablocks-public

Oddly absent: an over-rehearsed professional release video talking about a revolution in AI.

If people are wondering why there is so much AI activity right around now, it's because the biggest deep learning conference (NeurIPS) is next week.

https://twitter.com/karpathy/status/1733181701361451130

crakenzak
1 replies
19h33m

> it's because the biggest deep learning conference (NeurIPS) is next week.

Can we expect some big announcements (new architectures, models, etc) at the conference from different companies? Sorry, not too familiar what the culture for research conferences is.

jbarrow
0 replies
18h45m

Typically not. Google as an example: the transformer paper (Vaswani et al., 2017) was arxiv'd in June of 2017, and NeurIPS (the conference in which it was published) was in December of that year; BERT (Devlin et al., 2019) was similarly arxiv'd before publication.

Recent announcements from companies tend to be even more divorced from conference dates, as they release anemic "Technical Reports" that largely wouldn't pass muster in a peer review.

henrysg
0 replies
21h25m

> Oddly absent: an over-rehearsed professional release video talking about a revolution in AI.

GaggiX
0 replies
19h6m

> - hidden_dim / dim = 14336/4096 => 3.5X MLP expand

> - n_heads / n_kv_heads = 32/8 => 4X

These two are exactly the same as the old Mistral-7B

Der_Einzige
0 replies
16h34m

Also, because EMNLP 2023 is happening right now.

politician
4 replies
1d2h

Honest question: Why isn't this on Huggingface? Is this one a leaked model with a questionable training or alignment methodology?

EDIT: I mean, I guess they didn't hack their own twitter account, but still.

kcorbitt
3 replies
1d2h

It'll be on Huggingface soon. This is how they dropped their original 7B model as well. It's a marketing thing, but it works!

politician
1 replies
1d2h

Ah, well, ok. I appreciate the torrent link -- much faster distribution.

ComputerGuru
0 replies
19h30m

Also more reliable. I had to write my own script to clone hf repos on Windows because git+lfs to an smb share would only partially download.

politician
0 replies
1d2h

@kcorbitt Low priority, probably not worth an email: Does using OpenPipe.ai to fine-tune a model result in a downloadable LoRA adapter? It's not clear from the website if the fine-tune is exportable.

maremmano
4 replies
23h9m

Who knows if I can run this on an MBP M3 Max with 128GB? At what TPS?

treprinum
0 replies
22h17m

I would say so based on LLaMA 2 70B; if it's 8x inference in MoE then I guess you'd see <20 tokens/sec?

marci
0 replies
22h38m

If I understand correctly:

RAM Wise, you can easily run a 70b with 128GB, 8x7B is obviously less than that.

Compute wise, I suppose it would be a bit slower than running a 13b.

edit: "actually", I think it might be faster than a 13b. 8 random 7b ~= 115GB, Mixtral is under 90. I will have to wait for more info/understanding.

deoxykev
0 replies
22h58m

I would like to know this as well.

M4v3R
0 replies
22h47m

Big chance that you’ll be able to run it using Ollama app soon enough.

skghori
3 replies
1d2h

multimodal? 32k context is pretty impressive, curious to test instructability

brucethemoose2
2 replies
1d1h

MistralLite is already 32K, and Yi 200K actually works pretty well out to at least 75K (the most I tested)

civilitty
1 replies
22h24m

What kind of tests did you run out to that length? (Needle in haystack, summarization, structured data extraction, etc)

What is the max number of tokens in the output?

brucethemoose2
0 replies
20h24m

Long stories mostly, either novel or chat format. Sometimes summarization or insights, notably tests that you couldn't possibly do with RAG chunking. Mostly short responses, not rewriting documents or huge code blocks or anything like that.

MistralLite is basically overfit to summarize and retrieve in its 32K context, but it's extremely good at that for a 7B. It's kinda useless for anything else.

Yi 200K is... smart with the long context. An example I often cite is a Captain character in a story I 'wrote' with the llm. A Yi 200K finetune generated a debriefing for like 40K of context in a story, correctly assessing what plot points should be kept secret and making some very interesting deductions. You could never possibly do that with RAG on a 4K model, or even models that "cheat" with their huge attention like Anthropic.

I test at 75K just because that's the most my 3090 will hold.

aubanel
3 replies
1d2h

Mistral sure does not bother too much with explanations, but this style gives me much more confidence in the product than Google's polished, corporate, soulless announcement of Gemini!

brucethemoose2
2 replies
1d2h

I will take weights over docs.

It does remind me how some Google employee was bragging that they disclosed the weights for Gemini, and only the small mobile Gemini, as if that's a generous step over other companies.

refulgentis
1 replies
22h37m

I don't think that's true, because quite simply, they have not.

I am 100% in agreement with your viewpoint, but feel squeamish seeing an un-needed lie coupled to it to justify it. Just so much Othering these days.

brucethemoose2
0 replies
20h9m

I was referencing this tweet:

https://twitter.com/zacharynado/status/1732425598465900708

(Alt: https://nitter.net/zacharynado/status/1732425598465900708)

That is fair though, this was an impulsive addition on my part.

seydor
2 replies
1d1h

looks like they're too busy being awesome. i need a fake video to understand this!

What memory will this need? I guess it won't run on my 12GB of vram

"moe": {"num_experts_per_tok": 2, "num_experts": 8}

I bet many people will re-discover bittorrent tonight

syntaxing
0 replies
23h0m

BitTorrent was the craze when llama was leaked on torrent. Then Facebook started taking down all the huggingface repos and a bunch of people transitioned to torrent releases temporarily. Llama 2 changed all this, but it was a fun time.

brucethemoose2
0 replies
1d1h

Looks like it will squeeze into 24GB once the llama runtimes work it out.

Its also a good candidate for splitting across small GPUs, maybe.

One architecture I can envision is hosting prompt ingestion and the "host" model on the GPU and the downstream expert model weights on the CPU /IGP. This is actually pretty efficient, as the CPU/IGP is really bad at the prompt ingestion but reasonably fast at ~14B token generation.

Llama.cpp all but already does this, I'm sure MLC will implement it as well.

leobg
2 replies
22h2m

I love Mistral.

It’s crazy what can be done with this small model and 2 hours of fine tuning.

Chatbot with function calling? Check.

90 +% accuracy multi label classifier, even when you only have 15 examples for each label? Check.

Craaaazy powerful.

leodriesch
0 replies
19h3m

Could you link me to a finetune optimized for function calling? I was looking for one a few weeks ago but did not find any.

jeanloolz
0 replies
46m

Can you point me to a function calling fine-tuned Mistral model? This is the only feature that keeps me from migrating away from OpenAI. I searched a few times but could not find anything on HF.

yodsanklai
1 replies
17h46m

Can anyone explain what this means?

ukuina
0 replies
11h15m

Possibly a huge leap forward in open-source model capability. GPT4's prowess supposedly comes from strong dataset + RLHF + MoE (Mixture of Experts).

Mixtral brings MoE to an already-powerful model.

cloudhan
1 replies
1d2h

Might be the training code related with the model https://github.com/mistralai/megablocks-public/tree/pstock/m...

cloudhan
0 replies
1d2h

Mixtral-8x7B support --> Support new model

https://github.com/stanford-futuredata/megablocks/pull/45

asolidtime1
1 replies
22h42m

https://huggingface.co/someone13574/mixtral-8x7b-32kseqlen/b...

Holy shit, this is some clever marketing.

Kinda wonder if any of their employees were part of the warez scene at some point.

userbinator
0 replies
16h8m

They certainly got that aesthetic right; the only thing that stands out (but might be a necessity) is using real names instead of handles.

_fizz_buzz_
1 replies
21h26m

Does anybody have a tutorial or documentation on how I can run this and play around with it locally? A "getting started" guide of sorts?

0cf8612b2e1e
0 replies
20h59m

Even better if a llamafile gets released.

udev4096
0 replies
1d2h

based mistral casually dropping a magnet link

stevebmark
0 replies
18h43m

Mistral Mixtral Model Magnet

Mistral Mixtral Model Magnet

Mistral Mixtral Model Magnet

sigmar
0 replies
1d2h

Not exactly similar companies in terms of their goals, but pretty hilarious to contrast this model announcement with Google's Gemini announcement two days ago.

sergiotapia
0 replies
1d2h

Stuck on "Retrieving data" from the Magnet link and "Downloading metadata" when adding the magnet to the download list.

I had to manually add these trackers and now it works: https://gist.github.com/mcandre/eab4166938ed4205bef4

poulpy123
0 replies
22h41m

is it eight 7b models in a trench coat ?

manojlds
0 replies
1d2h

Google - Fake demo

Mistral - magnet link and that's it

lxe
0 replies
19h32m

If anyone can help running this, would be appreciated. Resources so far:

- https://github.com/dzhulgakov/llama-mistral

lagniappe
0 replies
19h32m

Magnet link says invalid for me

jpdus
0 replies
18h20m

We now have a (experimental) working HF version here: https://huggingface.co/DiscoResearch/mixtral-7b-8expert

fortunefox
0 replies
22h28m

Releasing a model with a magnet link and some ascii art gives me way more confidence in the product than any OpenAI blog post ever could.

Excited to play with this once it's somewhat documented on how to get it running on a dual 4090 Setup.

dzhulgakov
0 replies
13h27m

You can try Mixtral live at https://app.fireworks.ai/ (soon to be faster too)

Warning: the implementation might be off as there's no official one. We at Fireworks tried to reverse-engineer the model architecture today with the help of awesome folks from the community. The generations look reasonably good, but there might be some details missing.

If you want to follow the reverse-engineering story: https://twitter.com/dzhulgakov/status/1733330954348085439

cuuupid
0 replies
1d

Stark contrast with Google's "all demo no model" approach from earlier this week! Seems to be trained off Stanford's Megablocks: https://github.com/mistralai/megablocks-public

balnazzar
0 replies
18h17m

Might be relevant: https://twitter.com/dzhulgakov/status/1733217065811742863.

Anyway, if the vanilla version requires 2x80gb cards, I wonder how would it run on a M2 Ultra 192gb Mac Studio.

Anyone having the machine could try?

ahmetkca
0 replies
1d

Let’s go multimodal
