
Mixtral 8x22B

jjice
30 replies
3h23m

Does anyone have a good layman's explanation of the "Mixture-of-Experts" concept? I think I understand the idea of having "sub-experts", but how do you decide what each specialization is during training? Or is that not how it works at all?

hlfshell
19 replies
3h17m

This is a bit of a misnomer. Each expert is a sub-network that specializes in a kind of sub-understanding we can't possibly track.

During training, a routing network is penalized if it does not distribute training tokens evenly across the experts. This prevents any one or two experts from becoming the primary networks.

The result of this is that each token has essentially even probability of being routed to one of the sub models, with the underlying logic of why that model is an expert for that token being beyond our understanding or description.

fire_lake
6 replies
3h13m

Why do we expect this to perform better? Couldn’t a regular network converge on this structure anyways?

rgbrgb
3 replies
3h1m

Here's my naive intuition: in general, bigger models can store more knowledge but take longer to do inference. MoE provides a way to blend the advantages of having a bigger model (more storage) with the advantages of having smaller models at inference time (faster, less memory required). When you do inference, tokens hit a small routing layer that load-balances the experts and then activates 1 or 2 of them. So you're storing roughly 8 x 22B "worth" of knowledge without having to run a model that big.

Maybe a real expert can confirm if this is correct :)

samus
0 replies
1h9m

Almost :) the model chooses experts in every block. For a typical 7B-per-expert model with 8 experts and 32 blocks, there are 8^32 = 2^96 possible paths through the whole model.

nialv7
0 replies
1h42m

Sounds like the "you only use 10% of your brain" myth, but actually real this time.

cjbprime
0 replies
1h58m

Not quite, you don't save memory, only compute.

og_kalu
0 replies
2h56m

It doesn't perform better, and until recently MoE models actually underperformed their dense counterparts. The real gain is sparsity. You have this huge x-parameter model that performs like an x-parameter model, but you don't have to use all those parameters at once every time, so you save a lot on compute, both in training and inference.

imjonse
0 replies
3h4m

It is a type of ensemble model. A regular network could do it, but a MoE will select a subset to do the task faster than the whole model would.

andai
3 replies
3h1m

I heard MoE reduces inference costs. Is that true? Don't all the sub networks need to be kept in RAM the whole time? Or is the idea that it only needs to run compute on a small part of the total network, so it runs faster? (So you complete more requests per minute on same hardware.)

Edit: Apparently each part of the network is on a separate device. Fascinating! That would also explain why the routing network is trained to choose equally between experts.

I imagine that may reduce quality somewhat though? By forcing it to distribute problems equally across all of them, whereas in reality you'd expect task type to follow a Pareto distribution.

samus
0 replies
1h15m

It should increase quality since those layers can specialize on subsets of the training data. This means that getting better in one domain won't make the model worse in all the others anymore.

We can't really tell what the router does. There have been experiments where the router in the early blocks was compromised, and quality only suffered moderately. In later layers, as the embeddings pick up more semantic information, it matters more and might approach our naive understanding of the term "expert".

MPSimmons
0 replies
1h15m

> I heard MoE reduces inference costs

Computational costs, yes. You still take the same amount of time for processing the prompt, but each token created through inference costs less computationally than if you were running it through _all_ the experts.

Filligree
0 replies
2h53m

The latter. Yes, it all needs to stay in memory.

andai
2 replies
2h53m

Has anyone tried MoE at smaller scales? e.g. a 7B model that's made of a bunch of smaller ones? I guess that would be 8x1B.

Or would that make each expert too small to be useful? TinyLlama is 1B and it's almost useful! I guess 8x1B would be Mixture of TinyLLaMAs...

jasonjmcghee
0 replies
2h30m

Yes there are many fine tunes on huggingface. Search "8x1B huggingface"

auspiv
0 replies
2h52m

The previous mixtral is 8x7B

wenc
1 replies
3h0m

Would it be analogous to say instead of having a single Von Neumann who is a polymath, we’re posing the question to a pool of people who are good at their own thing, and one of them gets picked to answer?

Filligree
0 replies
2h51m

Not really. The “expert” term is a misnomer; it would be better put as “brain region”.

Human brains seem to do something similar, inasmuch as blood flow (and hence energy use) per region varies depending on the current problem.

andai
1 replies
2h59m

Any idea why everyone seems to be using 8 experts? (Or was GPT-4 using 16?) Did we just try different numbers and found 8 was the optimum?

wongarsu
0 replies
2h55m

Probably because 8 GPUs is a common setup, and with 8 experts you can put each expert on a different GPU

api
0 replies
3h14m

A decent loose analogy might be database sharding.

Basically you're sharding the neural network by "something" that is itself tuned during the learning process.

huevosabio
1 replies
2h59m

Ignore the "experts" part, it misleads a lot of people [0]. There is no explicit specialization in the most popular setups, it is achieved implicitly through training. In short: MoEs add multiple MLP sublayers and a routing mechanism after each attention sublayer and let the training procedure learn the MLP parameters and the routing parameters.

In a longer, but still rough, form...

How these transformers work is roughly:

``` x_{l+1} = mlp_l(attention_l(x_l)) ```

where `x_l` is the hidden representation at layer l, `attention_l` is the attention sublayer at layer l, and `mlp_l` is the multilayer perceptron at layer l.

This MLP layer is very expensive because it is fully connected (i.e. every input has a weight to every output). So, instead of creating an even bigger, more expensive MLP to get more capability, MoEs create K MLP sublayers (the "experts") and a router that decides which MLP sublayers to use. This router spits out an importance score for each MLP "expert"; you then choose the top T MLPs and take an average weighted by importance, so roughly:

``` x_{l+1} = \sum_e mlp_{l,e}(attention_l(x_l)) * importance_score_{l, e} ```

where `importance_score_{l, e}` is the score computed by the router at layer l for "expert" e, i.e. `importance_score_{l} = router_l(attention_l(x_l))` (typically followed by a softmax). Note that here we are adding all experts, but in reality we choose the top T, often 2, and use only those.

[0] some architectures do, in fact, combine domain experts to make a greater whole, but not the currently popular flavor
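
To make that concrete, here is a minimal numpy sketch of one MoE sublayer with top-2 routing. The names (`router_w`, `experts`) and the plain ReLU MLPs are invented for illustration; real implementations add gated activations, batching, and the load-balancing loss mentioned earlier in the thread:

```
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_block(h, router_w, experts, top_t=2):
    # h: hidden state for one token (the output of the attention sublayer)
    scores = softmax(router_w @ h)                    # one importance score per expert
    chosen = np.argsort(scores)[-top_t:]              # keep only the top T experts
    weights = scores[chosen] / scores[chosen].sum()   # renormalize over the chosen ones
    out = np.zeros_like(h)
    for w, e in zip(weights, chosen):
        w1, w2 = experts[e]                           # each "expert" is just an MLP
        out += w * (w2 @ np.maximum(w1 @ h, 0.0))     # ReLU MLP, weighted by the router score
    return out

# Toy sizes: 8 experts, hidden dim 16, MLP dim 64
rng = np.random.default_rng(0)
d, d_ff, n_experts = 16, 64, 8
experts = [(rng.normal(size=(d_ff, d)), rng.normal(size=(d, d_ff))) for _ in range(n_experts)]
router_w = rng.normal(size=(n_experts, d))
h = rng.normal(size=d)
print(moe_block(h, router_w, experts).shape)          # (16,)
```

The same routing is repeated independently in every block, which is where the 8^32 path count above comes from.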

Quarrel
0 replies
57m

So it is somewhat like a classic random forest or maybe bagging, where you're trying to stop overfitting, but you're also trying to train that top layer to know which could be the "experts" given the current inputs, so that you're minimising the number of MLP sublayers called during inference?

zozbot234
0 replies
2h59m

It's really a kind of enforced sparsity, in that it requires that only a limited amount of blocks be active at a time during inference. What blocks will be active for each token is decided by the network itself as part of training.

(Notably, MoE should not be conflated with ensemble techniques, which is where you would train entire separate networks, then use heuristic techniques to run inference across all of them simultaneously and combine the results.)

woadwarrior01
0 replies
2h48m

Not quite a layman's explanation, but if you're familiar with the implementation(s) of vanilla decoder only transformers, mixture-of-experts is just a small extension.

During inference, instead of a single MLP in each transformer layer, MoEs have `n` MLPs and a single layer "gate" in each transformer layer. In the forward pass, softmax of the gate's output is used to pick the top `k` (where k is < n) MLPs to use. The relevant code snippet in the HF transformers implementation is very readable IMO, and only about 40 lines.

https://github.com/huggingface/transformers/blob/main/src/tr...
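
A rough per-token sketch of that gate + top-k step (this is not the HF code itself; the tensor names are invented):

```
import torch

def route(hidden, gate_weight, k=2):
    # hidden: (tokens, dim), gate_weight: (n_experts, dim)
    logits = hidden @ gate_weight.T                   # (tokens, n_experts)
    probs = torch.softmax(logits, dim=-1)             # softmax of the gate's output
    topk_probs, topk_idx = torch.topk(probs, k, dim=-1)
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)  # renormalize over the chosen k
    return topk_idx, topk_probs                       # which MLPs to run, and their mixing weights

idx, w = route(torch.randn(4, 16), torch.randn(8, 16))  # 4 tokens, 8 experts, toy dim 16
print(idx.shape, w.shape)  # torch.Size([4, 2]) torch.Size([4, 2])
```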

vineyardmike
0 replies
2h28m

It’s not “experts” in the typical sense of the word. There is no discrete training to learn a particular skill in one expert. It’s more closely modeled as a bunch of smaller models grafted together.

These models are actually a collection of weights for different parts of the system. It’s not “one” neural network. Transformers are composed of layers of transformations to the input, and each step can have its own set of weights. There was a recent video on the front page that had a good introduction to this. There is the MLP, there are the attention heads, etc.

With that in mind, a MoE model is basically where one of those layers has X different versions of the weights, and then an added layer (another neural network with its own weights) that picks the version of “expert” weights to use.

londons_explore
0 replies
3h19m

Nobody decides. The network itself determines which expert(s) to activate based on the context. It uses a small neural network for the task.

It typically won't behave like human experts - you might find one of the networks is an expert in determining where to place capital letters or full stops for example.

MoEs do not really improve accuracy - instead they reduce the amount of compute required. And, assuming you have a fixed compute budget, that in turn might mean you can make the model bigger to get better accuracy.

jsemrau
0 replies
3h4m

There is some good documentation around mergekit available that actually explains a lot and might be a good place to start.

Keyframe
0 replies
3h20m

maybe there's one that is maitre d'llm?

HeatrayEnjoyer
0 replies
3h18m

Correct, the experts are determined by the algorithm, not by anything humans would understand.

orost
9 replies
2h38m

That's not the model this post is about. You used the base model, not trained for tasks. (The instruct model is probably not on ollama yet.)

byteknight
7 replies
2h33m

I absolutely did not:

  ollama run mixtral:8x22b

EDIT: I like how you ninja-edited your comment ;)

orost
6 replies
2h31m

Considering "mixtral:8x22b" on ollama was last updated yesterday, and Mixtral-8x22B-Instruct-v0.1 (the topic of this post) was released about 2 hours ago, they are not the same model.

belter
2 replies
2h24m

I get:

  ollama run mixtral:8x22b
  Error: exception create_tensor: tensor 'blk.0.ffn_gate.0.weight' not found

Me1000
1 replies
2h20m

You need to update ollama to 0.1.32.

belter
0 replies
2h9m

Thanks. That did it.

orost
0 replies
2h25m

Let me clarify.

Mixtral-8x22B-v0.1 was released a couple days ago. The "mixtral:8x22b" tag on ollama currently refers to it, so it's what you got when you did "ollama run mixtral:8x22b". It's a base model only capable of text completion, not any other tasks, which is why you got a terrible result when you gave it instructions.

Mixtral-8x22B-Instruct-v0.1 is an instruction-following model based on Mixtral-8x22B-v0.1. It was released two hours ago and it's what this post is about.

(The "last updated 44 minutes ago" refers to the entire "mixtral" collection.)

gliptic
0 replies
2h24m

And where does it say that's the instruct model?

mysteria
0 replies
2h24m

Yeah, this is exactly what happens when you ask a base model a question. It'll just attempt to continue what you already wrote based on its training set, so if you, say, have it continue a story you've written, it may wrap up the story and then ask you to subscribe for part 2, followed by a bunch of social media comments with reviews.

woadwarrior01
1 replies
2h35m

Looks like an issue with the quantization that ollama (i.e llama.cpp) uses and not the model itself. It's common knowledge from Mixtral 8x7B that quantizing the MoE gates is pernicious to model perplexity. And yet they continue to do it. :)

cjbprime
0 replies
2h0m

No, it's unrelated to quantization, they just weren't using the instruct model.

renewiltord
0 replies
2h22m

Not instruct tuned. You're (actually) "holding it wrong".

jmorgan
0 replies
2h29m

The `mixtral:8x22b` tag still points to the text completion model – instruct is on the way, sorry!

Update: mixtral:8x22b now points to the instruct model:

  ollama pull mixtral:8x22b
  ollama run mixtral:8x22b

imjonse
13 replies
3h34m

Great to see such free-to-use and self-hostable models, but it's sad that "open" now means only that. One cannot replicate this model without access to the training data.

generalizations
7 replies
3h30m

...And a massive pile of cash/compute hardware.

kiney
6 replies
3h26m

Not that massive, we're talking six figures. There was a blog post about this a while back on the front page of HN.

htrp
4 replies
3h24m

for finetuning or parameter training from scratch?

kaibee
2 replies
2h49m

That's for an 8B model.

cptcobalt
1 replies
2h14m

This is oversimplifying it, but there isn't much more inherent complexity in training an 8B or larger model other than more money, more compute, more data, more time. Overall, the principles are similar.

lostmsu
0 replies
1h16m

Assuming cost grows linearly with the number of parameters, that's 7.5 figures instead of 6 for the 8x22B model.

moffkalast
0 replies
1h13m

Six figures is a massive pile of cash.

ru552
4 replies
3h28m

There's a large amount of liability in disclosing your training data.

imjonse
3 replies
3h18m

Calling the model 'truly open' without it is not technically correct though.

Lacerda69
2 replies
3h11m

It's open enough for all practical purposes IMO.

nicklecompte
0 replies
2h16m

It's not "open enough" to do an honest evaluation of these systems by constructing adversarial benchmarks.

imjonse
0 replies
2h14m

As open as an executable binary that you are allowed to download and use for free.

hubraumhugo
13 replies
3h18m

It feels absolutely amazing to build an AI startup right now. It's as if your product automatically becomes cheaper, more reliable, and more scalable with each new major model release.

- We first struggled with limited context windows [solved]

- We had issues with consistent JSON output [solved]

- We had rate limiting and performance issues for the large 3rd party models [solved]

- Hosting our own OSS models for small and medium complex tasks was a pain [solved]

Obviously every startup still needs to build up defensibility and focus on differentiating with everything “non-AI”.

paxys
5 replies
3h14m

We are going to quickly reach the point where most of these AI startups (which do nothing but provide thin wrappers on top of public LLMs) aren't going to be needed at all. The differentiation will need to come from the value of the end product put in front of customers, not the AI backend.

layble
2 replies
3h5m

Sure, in the same way SaaS companies are just thin wrappers on top of databases and the open web.

imjonse
1 replies
3h0m

You will find that a disproportionately large amount of work and innovation in an AI product is in the backing model (GPT, Mixtral, etc.). While there's a huge amount of work in databases and the open web, SaaS products typically add a lot more than a thin API layer and a shiny website (well some do but you know what I mean)

tomrod
0 replies
2h48m

I'd argue the comment before you is describing accessibility, features, and services -- yes, the core component has a wrapper, but that wrapper differentiates the use.

wongarsu
1 replies
2h46m

The same happened to image recognition. We have great algorithms for many years now. You can't make a company out of having the best image recognition algorithm, but you absolutely can make a company out of a device that spots defects in the paintjob in a car factory, or that spots concrete cracks in the tunnel segments used by a tunnel boring machine, or by building a wildlife camera that counts wildlife and exports that to a central website. All of them just fine-tune existing algorithms, but the value delivered is vastly different.

Or you can continue selling shovels. Still lots of expensive labeling services out there, to stay in the image-recognition parallel

pradn
0 replies
1h7m

The key thing is AI models are services not products. The real world changes, so you have to change your model. Same goes for new training data (examples, yes/no labels, feedback from production use), updating biases (compliance, changing societal mores). And running models in a highly-available way is also expertise. Not every company wants to be in the ML-ops business.

sleepingreset
2 replies
3h13m

If you don't mind, I'm trying to experiment w/ local models more. Just now getting into messing w/ these but I'm struggling to come up w/ good use cases.

Would you happen to know of any cool OSS model projects that might be good inspiration for a side project?

Wondering what most people use these local models for

wing-_-nuts
0 replies
2h18m

One idea that I've been mulling over: given how controllable Linux is from the command line, I think it would be somewhat easy to set up a voice-to-text pipeline into a local LLM that could control pretty much everything on command.

It would flat out embarrass Alexa. Imagine 'Hal, play a movie', or 'Hal, play some music' and it's all running locally, with your content.

sosuke
0 replies
2h35m

No ideas about side projects or anything "productive" but for a concrete example look at SillyTavern. Making fictional characters. Finding narratives, stories, role-play for tabletop games. You can even have group chats of AI characters interacting. No good use cases for profit but plenty right now for exploration and fun.

yodsanklai
0 replies
40m

> It's as if your product automatically becomes cheaper, more reliable, and more scalable with each new major model release.

and so do your competitors' products.

neillyons
0 replies
2h23m

> We had issues with consistent JSON output [solved]

It says the JSON output is constrained via their platform (on la Plateforme).

Does that mean JSON output is only available in the hosted version? Are there any small models that can be self-hosted that output valid JSON?

milansuk
0 replies
2h32m

The progress is insane. A few days ago I started being very impressed with LLM coding skills. I wanted Golang code, instead of Python, which you can see in many demos. The prompt was:

Write a Golang func, which accepts the path to a .gpx file and outputs a JSON string with points (x=total distance in km, y=elevation). Don't use any library.

jasonjmcghee
0 replies
2h28m

How are you approaching hosting? vLLM?

mdrzn
12 replies
3h27m

"64K tokens context window" I do wish they had managed to extend it to at least 128K to match the capabilities of GPT-4 Turbo

Maybe this limit will become a joke when looking back? Can you imagine reaching a trillion tokens context window in the future, as Sam speculated on Lex's podcast?

htrp
8 replies
3h24m

maybe we'll look back at token context windows like we look back at how much ram we have in a system.

frabjoused
5 replies
3h22m

I agree with this in the sense that once you have enough, you stop caring about the metric.

paradite
4 replies
3h19m

And how much RAM do you need to run Mixtral 8x22B? Probably more than a personal laptop has.

Lacerda69
2 replies
3h14m

I run it fine on my 64gb RAM beast.

coder543
0 replies
3h10m

At what quantization? 4-bit is 80GB. Less than 4-bit is rarely good enough at this point.

apexalpha
0 replies
3h8m

Is that normal RAM or GPU RAM?

user_7832
0 replies
3h13m

Generally about ~1GB of RAM per billion parameters. I've run a 30B model (Vicuna) on my 32GB laptop (but it was slow).

htrp
0 replies
2h38m

While there is a lot more HBM (or UMA if you're on a Mac system) needed to run these LLM models, my overarching point is that at this point most systems don't have RAM constraints for most of the software you need to run, and as a result RAM becomes less of a selling point except in very specialized instances like graphic design or 3D rendering work.

If we have cheap billion-token context windows, 99% of your use cases aren't going to hit anywhere close to that limit, and as a result your models will "just run".

bamboozled
0 replies
3h6m

I still don’t have enough RAM though?

pseudosavant
1 replies
2h2m

FWIW, the 128k context window for GPT-4 is only for input. I believe the output content is still only 4k.

moffkalast
0 replies
1h12m

How does that make any sense on a decoder-only architecture?

creshal
0 replies
2h56m

Wasn't there a paper yesterday that turned context evaluation linear (instead of quadratic) and made effectively unlimited context windows possible? Between that and 1.58-bit quantization I feel like we're overdue for an LLM revolution.

tinyhouse
11 replies
3h28m

Pricing?

Found it: https://mistral.ai/technology/#pricing

It'd be useful to add a link to the blog post. While it's an open model, most will only be able to use it via the API.

MacsHeadroom
7 replies
3h21m

It's open source, you can just download and run it for free on your own hardware.

astrodust
3 replies
3h18m

"Who among us doesn't have 8 H100 cards?"

MacsHeadroom
2 replies
2h58m

Four V100s will do. They're about $1k each on ebay.

astrodust
1 replies
2h56m

$1500 each, plus the server they go in, plus plus plus plus.

MacsHeadroom
0 replies
1h34m

Sure, but it's still a lot less than 8 H100s.

~$8k for an LLM server with 128GB of VRAM vs like $250k+ for 8 H100s.

tinyhouse
2 replies
3h20m

Well, I don't have the hardware to run a 141B-parameter model, even if only 39B are active during inference.

navbaker
1 replies
2h58m

It will be quantized in a matter of days and runnable on most laptops.

azinman2
0 replies
2h11m

8-bit is 149GB. 4-bit is 80GB.

I wouldn’t call this runnable on most laptops.

theolivenbaum
2 replies
3h21m

That looks expensive compared to what groq was offering: https://wow.groq.com/

pants2
0 replies
2h37m

Can't wait for 8x22B to make it to Groq! Having an LLM at near GPT-4 performance with Groq speed would be incredible, especially for real-time voice chat.

naiv
0 replies
3h15m

I also assume groq is 10-15x faster

jonnycomputer
11 replies
3h23m

These LLMs are making RAM great again.

Wish I had invested in the extra 32GB for my mac laptop.

Workaccount2
10 replies
3h15m

You can't upgrade it?

Edit: I haven't owned a laptop for years, probably could have surmised they'd be more user hostile nowadays.

kristopolous
5 replies
3h9m

Everything is soldered in these days.

It's complete garbage. And most of the other vendors just copy Apple so even things like Lenovo have the same problems.

The current state of laptops is such trash

sva_
2 replies
2h54m

Plenty of laptops still have SO-DIMM, such as EliteBook for example.

People need to vote with their wallet, and not buy stuff that goes against their principles.

popf1
0 replies
2h11m

There are so many variables though ... most of the time you have to compromise on a few things.

GeekyBear
0 replies
2h1m

With SO-DIMM you gain expandability at the cost of higher power draw and latency as well as lower throughput.

SO-DIMM memory is inherently slower than soldered memory. Moreover, considering the fact that SO-DIMM has a maximum speed of 6,400MHz means that it won’t be able to handle the DDR6 standard, which is already in the works.

https://fossbytes.com/camm2-ram-standard/

woadwarrior01
0 replies
2h39m

These days with Apple Silicon, RAM is a part of the SoC. It's not even soldered on, it's a part of the chip. Although TBF, they also offer insane memory bandwidths.

GeekyBear
0 replies
2h35m

> most of the other vendors just copy Apple

Weird conspiracy theories aside, the low power variant of RAM (LPDDR) has to be soldered onto the motherboard, so laptops designed for longer battery life have been using it for years now.

The good news is that a newer variant of low-power RAM (CAMM2) has just been standardized that puts low-power RAM on removable memory modules, although they attach with screws and not clips.

https://fossbytes.com/camm2-ram-standard/

jonnycomputer
1 replies
3h7m

I really really like my Macbook Pro. But dammit, you can't upgrade the thing (Mac laptops aren't upgrade-able anymore). I got M1 Max in 2021 with 32GB of RAM. I did not anticipate needing more than 32GB for anything I'd be doing on it. Turns out, a couple of years later, I like to run local LLMs that max out my available memory.

jonnycomputer
0 replies
3h6m

I say 2021, but truth is the supply chain was so trash that year that it took almost a year to actually get delivered. I don't think I actually started using the thing until 2022.

paxys
0 replies
2h43m

You are getting downvoted because you vaguely suggested something negative about an Apple product, as is my comment below

paxys
0 replies
3h13m

> mac laptop
apetresc
11 replies
3h28m

I just find it hilarious how approximately 100% of models beat all other models on benchmarks.

squirrel23
6 replies
3h28m

What do you mean?

apetresc
5 replies
3h26m

Virtually every announcement of a new model release has some sort of table or graph matching it up against a bunch of other models on various benchmarks, and they're always selected in such a way that the newly-released model dominates along several axes.

It turns interpreting the results into an exercise in detecting which models and benchmarks were omitted.

CharlieDigital
4 replies
3h23m

It would make sense, wouldn't it? Just as we've seen rising fuel efficiency, safety, dependability, etc. over the lifecycle of a particular car model.

The different teams are learning from each other and pushing boundaries; there's virtually no reason for any of the teams to release a model or product that is somehow inferior to a prior one (unless it had some secondary attribute such as requiring lower end hardware).

We're simply not seeing the ones that came up short; we don't even see the ones where it fell short of current benchmarks because they're not worth releasing to the public.

apetresc
1 replies
3h10m

That's a valid theory, a priori, but if you actually follow up you'll find that the vast majority of these benchmark results don't end up matching anyone's subjective experience with the models. The churn at the top is not nearly as fast as the press releases make it out to be.

tensor
0 replies
30m

Subjective experience is not a benchmark that you can measure success against. Also, of course new models are better on some set of benchmarks. Why would someone bother releasing a "new" model that is inferior to old ones? (Aside from attributes like more preferable licensing).

This is completely normal, the opposite would be strange.

andai
1 replies
3h4m

Sibling comment made a good point about benchmarks not being a great indicator of real-world quality. Every time something scores near GPT-4 on benchmarks, I try it out and it ends up being less reliable than GPT-3 within a few minutes of usage.

CharlieDigital
0 replies
3h0m

That's totally fine, but benchmarks are like standardized tests like the SAT. They measure something and it totally makes sense that each release bests the prior in the context of these benchmarks.

It may even be the case that in measuring against the benchmarks, these product teams sacrifice some real world performance (just as a student that only studies for the SAT might sacrifice some real world skills).

paxys
0 replies
3h18m

Benchmarks published by the company itself should be treated no differently than advertising. For actual signal check out more independent leaderboards and benchmarks (like HuggingFace, Chatbot Arena, MMLU, AlpacaEval). Of course, even then it is impossible to come up with an objective ranking since there is no consensus on what to even measure.

michaelt
0 replies
33m

Benchmarks are often weird because of what a benchmark inherently needs to be.

If you compare LLMs by asking them to tell you how to catch dragonflies - the free text chat answer you get will be impossible to objectively evaluate.

Whereas if you propose four ways to catch dragonflies and ask each model to choose option A, B, C or D (or check the relative probability the model assigns to those four output logits) the result is easy to objectively evaluate - you just check if it chose the one right answer.

Hence a lot of the most famous benchmarks are multiple-choice questions - even though 99.9% of LLM usage doesn't involve answering multiple-choice questions.

htrp
0 replies
3h26m

gotta cherry pick your benchmarks as much as possible

empath-nirvana
0 replies
2h55m

Just because of the pace of innovation and scaling, right now, it seems pretty natural that any new model is going to be better than the previous comparable models.

noman-land
6 replies
3h26m

I'm really excited about this model. Just need someone to quantize it to ~3 bits so it'll run on a 64GB MacBook Pro. I've gotten a lot of use from the 8x7b model. Paired with llamafile and it's just so good.

2Gkashmiri
4 replies
3h22m

Can you explain your use case? I tried to get into offline LLMs, on my machine and even Android, but without discrete graphics it's a slow hog, so I didn't enjoy it. But suppose I buy one, what then?

andai
1 replies
3h8m

I run Mistral-7B on an old laptop. It's not very fast and it's not very good, but it's just good enough to be useful.

My use case is that I'm more productive working with a LLM but being online is a constant temptation and distraction.

Most of the time I'll reach for offline docs to verify. So the LLM just points me in the right direction.

I also miss Google offline, so I'm working on a search engine. I thought I could skip crawling by just downloading Common Crawl, but unfortunately it's enormous and mostly junk or unsuitable for my needs. So my next project is how to data-mine Common Crawl to extract just the interesting (to me) bits...

When I have a search engine and a LLM I'll be able to run my own Phind, which will be really cool.

luke-stanley
0 replies
2h49m

Presumably you could run things like PageRank, I'm sure people do this sort of thing with CommonCrawl. There are lots of variants of graph connectivity scoring methods and classifiers. What a time to be alive eh?

popf1
0 replies
2h13m

> Can you explain your use case?

Pretty sure you can run it uncensored... that would be my use case.

noman-land
0 replies
2h54m

Yes, I have a side project that uses local whisper.cpp to transcribe a podcast I love and shows a nice UI to search and filter the contents. I use Mixtral 8x7b in chat interface via llamafile primarily to help me write python and sqlite code and as a general Q&A agent. I ask it all sorts of technical questions, learn about common tools, libraries, and idioms in an ecosystem I'm not familiar with, and then I can go to official documentation and dig in.

It has been a huge force multiplier for me and most importantly of all, it removes the dread of not knowing where to start and the dread of sending your inner monologue to someone's stupid cloud.

If you're curious: https://github.com/noman-land/transcript.fish/ though this doesn't include any Mixtral stuff because I don't use it programmatically (yet). I soon hope to use it to answer questions about the episodes like who the special guest is and whatnot, which is something I do manually right now.

mathverse
0 replies
18m

Shopping for a new MBP. Do you think going with more RAM would be wise?

sa-code
4 replies
3h29m

Is this the best permissively licensed model out there?

imjonse
2 replies
3h27m

So far it is Command R+. Let's see how this will fare on Chatbot Arena after a few weeks of use.

skissane
1 replies
2h53m

> So far it is Command R+

Most people would not consider Command R+ to count as the "best permissively licensed model" since CC-BY-NC is not usually considered "permissively licensed" – the "NC" part means "non-commercial use only"

imjonse
0 replies
2h9m

My bad, I wrongly remembered it was Apache too.

ru552
0 replies
3h27m

Today. Might change tomorrow at the pace this sector is at.

clementmas
4 replies
3h24m

I'm considering switching my function calling requests from OpenAI's API to Mistral. Are they using similar formats? What's the easiest way to use Mistral? Is it by using Huggingface?

ru552
3 replies
3h21m

Easiest is probably with ollama [0]. I think the ollama API is OpenAI-compatible.

[0] https://ollama.com/
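
For example, a minimal sketch against ollama's OpenAI-compatible endpoint (assumes a recent ollama is running locally and the model has already been pulled; the api_key value is a placeholder the client requires but ollama ignores):

```
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # required by the client, ignored by ollama
)

resp = client.chat.completions.create(
    model="mixtral:8x22b",
    messages=[{"role": "user", "content": "Summarize what a mixture-of-experts model is."}],
)
print(resp.choices[0].message.content)
```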

pants2
1 replies
2h39m

Ollama runs locally. What's the best option for calling the new Mixtral model on someone else's server programmatically?

ayolisup
4 replies
3h1m

What's the best way to run this on my Macbook Pro?

I've tried LMStudio, but I'm not a fan of the interface compared to OpenAI's. The lack of automatic regeneration every time I edit my input, like on ChatGPT, is quite frustrating. I also gave Ollama a shot, but using the CLI is less convenient.

Ideally, I'd like something that allows me to edit my settings quite granularly, similar to what I can do in OpenLM, with the QoL from the hosted online platforms, particularly the ease of editing my prompts that I use extensively.

shaunkoh
0 replies
2h41m

Not sure why your comment was downvoted. ^ is absolutely the right answer.

Open WebUI is functionally identical to the ChatGPT interface. You can even use it with the OpenAI APIs to have your own pay per use GPT 4. I did this.

mcbuilder
0 replies
48m

openrouter.ai is a fantastic idea if you don't want to self host

chown
0 replies
59m

You can try Msty as well. I am the author.

https://msty.app

CharlesW
4 replies
1h35m

Dumb question: Are "non-instructed" versions of LLMs just raw, no-guardrail versions of the "instructed" versions that most end-users see? And why does Mixtral need one, when OpenAI LLMs do not?

kingsleyopara
1 replies
1h24m

LLMs are first trained to predict the next most likely word (or token, if you want to be accurate) from web crawls. These models are basically great at continuing unfinished text but can't really be used for instructions, e.g. Q&A or chatting - this is the "non-instructed" version. These models are then fine-tuned for instructions using additional data from human interaction - these are the "instructed" versions, which are what end users (e.g. of ChatGPT, Gemini, etc.) see.
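
As a rough illustration of the difference: a base model just continues raw text, while the instruct fine-tune expects its chat template. A minimal sketch, assuming the Hugging Face repo name below and that its tokenizer ships a chat template:

```
from transformers import AutoTokenizer

# Base model usage: hand it raw text and it simply continues it.
base_prompt = "The capital of France is"

# Instruct model usage: wrap the request in the chat template the
# fine-tune was trained on (Mistral's [INST] ... [/INST] format).
# The repo name is an assumption; adjust to whatever you actually run.
tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x22B-Instruct-v0.1")
chat_prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is the capital of France?"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(chat_prompt)  # roughly: "<s>[INST] What is the capital of France? [/INST]"
```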

CharlesW
0 replies
1h19m

Very helpful, thank you.

CharlesW
0 replies
1h20m

I appreciate the correction, thanks!

luke-stanley
2 replies
3h2m

I'm confused on the instruction fine-tuning part that is mentioned briefly, in passing. Is there an open weight instruct variant they've released? Or is that only on their platform? Edit: It's on HuggingFace, great, thanks replies!

doublextremevil
2 replies
3h26m

How much VRAM is needed to run this?

MacsHeadroom
1 replies
3h1m

80GB in 4-bit.

But because it only activates two of the eight experts per token, it can run on a fast CPU in reasonable time. So 96GB of DDR4 will do. 96GB of DDR5 is better.
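
Rough arithmetic behind those numbers (the ~141B total parameter count is from Mistral's announcement; the overhead estimate is an assumption):

```
total_params = 141e9     # Mixtral 8x22B total parameters (~141B)
bytes_per_param = 0.5    # 4-bit quantization = half a byte per weight
print(total_params * bytes_per_param / 1e9)  # ~70 GB for the weights alone
# Add roughly 10-15% for quantization metadata, activations and the KV cache
# (an assumption) and you land around the 80GB figure above.
```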

Me1000
0 replies
2h10m

WizardLM-2 8x22B (which was a fine-tune of the Mixtral 8x22B base model) at 4-bit was only 80GB.

dd-dreams
2 replies
3h35m

The development never stops. In a few years we will look back and see how far models have come - how we couldn't run LLaMA 70B on a MacBook Air, and now we can.

squirrel23
1 replies
3h25m

Yes it's pretty cool. There was a neat comparison of deep learning development that I think resonates quite well here.

Around 5 years ago, it took an average user some pretty significant hardware, software and time (around a full night) to try to create a short deepfake. Now, you don't need any fancy hardware and you can have some decent results within 5 min on your average computer.

brokensegue
2 replies
3h27m

Isn't equating active parameters with cost a little unfair since you still need full memory for all the inactive parameters?

tartrate
0 replies
3h23m

Well, since it affects inference speed it means you can handle more in less time, needing less concurrency.

sa-code
0 replies
41m

Fewer parameters at inference time makes a massive difference in cost for batch jobs, assuming vram usage is the same

kristianp
1 replies
2h41m

So this one is 3x the size but only 7% better on MMLU? Given Moore's law is mostly dead, this trend is going to make for even more extremely expensive compute for next-gen AI models.

GaggiX
0 replies
2h23m

That's 25% fewer errors.

iFire
1 replies
3h9m

It wasn't clear, but how much hardware does it take to run Mixtral 8x22B (mistral.ai) locally?

ru552
0 replies
2h59m

A MacBook with 64GB of RAM.

elorant
1 replies
1h4m

Seems that Perplexity Labs already offers a free demo of it.

https://labs.perplexity.ai/

batperson
0 replies
21m

That's the old/regular model. This post is about the new "instruct" model.

yodsanklai
0 replies
35m

How does this compare to ChatGPT4?

stainablesteel
0 replies
2h37m

Is this different from their "large" model?

spenceryonce
0 replies
3h14m

I can't even begin to describe how excited I am for the future of AI.

jhoechtl
0 replies
2h21m

Did anyone have success getting danswer and ollama to work together?

endisneigh
0 replies
3h30m

Good to continue to see a permissive license here.

austinsuhr
0 replies
3h1m

Is 8x22B gonna make it to Le Chat in the near future?

arnaudsm
0 replies
3h28m

Curious to see how it performs against GPT-4.

Mixtral 8x22B beats Command R+, which is at GPT-4 level on LMSYS's leaderboard.

ado__dev
0 replies
2h15m

We rolled out Mixtral 8x22b to our LLM Litmus Test at s0.dev for Cody AI. Don't have enough data to say it's better or worse than other LLMs yet, but if you want to try it out for coding purposes, let me know your experience.

Lacerda69
0 replies
3h15m

I have been using Mixtral daily since it was released for all kinds of writing and coding tasks. Love it, and I'm massively invested in Mistral's mission.

Keep on doing this great work.

Edit: been using the previous version, seems like this one is even better?

ChicagoDave
0 replies
3h2m

We need larger context windows, otherwise we’re running the same path with marginally different results.