Mistral NeMo

Pardon me if this is a dumb question, but is it possible for me to download these models onto my computer (I have a 1080ti and a [2|3]070ti) and generate some sort of API interface? That way I can write programs that call this API, which I find appealing.

EDIT: This is a 1W light bulb moment for me, thank you!

Justine Tunney (of redbean fame) is actively working on getting LLMs to run well on CPUs, where RAM is cheap. If successful this would eliminate an enormous bottleneck to running local models. If anyone can do this, she can. (And thank you to Mozilla for financially supporting her work). See and

I think it's mostly the memory bandwidth though that makes the GPUs so fast with LLMs. My card does about 1TB/s. CPU RAM won't come near that. I'm sure a lot of optimisations can be had but I think GPUs will still be significantly ahead.

Macs are so good at it because Apple solder the memory on top of the SoC for a really wide and low latency connection.
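
To put rough numbers on the bandwidth argument: during token-by-token generation, each new token needs roughly one full read of the model weights, so memory bandwidth divided by model size gives an upper bound on tokens per second. The sketch below assumes that simplification (no batching, no compute limit, no cache effects). The function name and the 12 GB model size are illustrative; 1 TB/s is the figure from the comment above, and 76.8 GB/s is the theoretical peak of dual-channel DDR5-4800.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Assumed example: a 12B-parameter model at 1 byte per parameter (about 12 GB).
MODEL_GB = 12.0

for name, bandwidth in [
    ("GPU at ~1 TB/s", 1000.0),
    ("dual-channel DDR5-4800", 76.8),
]:
    bound = max_tokens_per_second(bandwidth, MODEL_GB)
    print(f"{name}: at most ~{bound:.1f} tokens/s")
```

Real throughput will be lower than this bound on both kinds of hardware, but the ratio between the two is what the bandwidth argument is about.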

This is a good and valid comment. It is difficult to predict the future, but I would be curious what the best-case theoretical performance of an LLM would be on a typical x86 or ARM system with DDR4 or DDR5 RAM. My uneducated guess is that it can be very good, perhaps 50% of the speed of a specialized GPU/RAM device. In practical terms, the CPU approach is required for very large contexts, up to as large as the lifetime of all interactions you have with your LLM.

I love that domain name.

I’d probably check in a couple of days. My guess is that by then ollama will have support for it. And you can then run the model locally on your machine with ollama.

Adding to this: If the default is too slow, look at the more heavily quantized versions of the model; they are smaller, at a moderate cost in output quality. Ollama can split models between GPU and host memory, but the throughput dropoff tends to be pretty severe.

Why would it take a couple days? Is it not a matter of uploading the model to their registry, or are there more steps involved than that?

We're working on it, except that there is a change to the tokenizer which we're still working through in our conversion scripts. Unfortunately we don't get a heads up from Mistral when they drop a model, so sometimes it takes a little bit of time to sort out the differences.

Also, I'm not sure if we'll call it mistral-nemo or nemo yet. :-D

We're working on it!

If you're on a Mac, check out LM Studio.

It's a UI that lets you load and interact with models locally. You can also wrap your model in an OpenAI-compatible API and interact with it programmatically.

llama.cpp and ollama both have APIs for most models.

Try LM Studio or Ollama. Load up the model, and there you go.
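
For the "write programs that call this API" part, here is a minimal sketch of what that could look like against Ollama's local HTTP endpoint, using only the Python standard library. It assumes Ollama is running on its default port and that the model has already been pulled; the model tag `mistral-nemo` is a placeholder (the final name wasn't settled in this thread), and the helper names `build_request` and `generate` are made up for illustration.

```python
import json
import urllib.request

# Assumptions: Ollama is running locally on its default port, and the model
# has been pulled under the tag below (the tag is hypothetical).
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_TAG = "mistral-nemo"


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]
```

With the server up, `generate("Say hello in one short sentence.")` should return the model's reply as a string. LM Studio exposes an OpenAI-style endpoint instead, so the URL and payload shape would differ there.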

You will need enough VRAM; a 1080ti is not going to work very well. Maybe get a 3090 with 24GB VRAM.

I think it should also run well on a 36GB MacBook Pro or probably a 24GB MacBook Air.

I find it interesting how coding/software development still appears to be the one category that these most popular models release specialised models for. Where's the finance or legal models from Mistral or Meta or OpenAI?

Perhaps it's just confirmation bias, but programming really does seem to be the ideal use case for LLMs in a way that other professions just haven't been able to crack. Compared to other types of work, it's relatively more straightforward to tell if code is "correct" or not.

The explanation is easier, I think. Consider what data these models are trained on, and who are the immediate developers of these models.

The models are trained on a vast set of whatever is available on the internet. They are developed by tech people/programmers who are surprisingly blind to their own biases and interests. There's no surprise that one of the main things they want to try and do is programming, using vast open quantities of Stack Overflow, GitHub and various programming forums.

For finance and legal you need to:

- think a bit outside the box

- be interested in finance and legal

- be prepared to carry actual legal liability for the output of your models

Then again, we just had this on the front page:

We first document a significant decline in stock trading volume during ChatGPT outages and find that the effect is stronger for firms with corporate news released immediately before or during the outages. We further document similar declines in the short-run price impact, return variance, and bid-ask spreads, consistent with a reduction in informed trading during the outage periods. Lastly, we use trading volume changes during outages to construct a firm-level measure of the intensity of GAI-assisted trading and provide early evidence of a positive effect of GAI-assisted trading on long-run stock price informativeness.

They're being used, but nobody is really saying anything because the stock market is a zero sum game these days and letting anyone else know that this holds water is a recipe for competition. Programming is about the opposite, the more you give, the more you get, so it makes sense to popularize it as a feature.

> - be prepared to carry actual legal liability for the output of your models

Section 230.

It's been argued that a response by an LLM, to user input, is "user-generated content" and hence the platform has generally no liability (except CSAM).

Nobody has successfully sued.

Finance already has its own models and has had them for decades. Market predictions and high frequency trading are literally what all the hedge funds and the like have been doing for a few decades now, including advanced sources of information like (take with a grain of salt, I've heard it on the internet) using satellite images to measure factory activity and thus predict results.

Understandably they're all quite secretive about their tooling because they don't want the competition to have access to the same competitive advantages, and an open source model / third party developing a model doesn't really make sense.

I guess finance is not in need of a large language model?

It does but everything is a joke...

> Where's the finance or legal models from Mistral or Meta or OpenAI?

Programming is "weird" in that it requires both specialized knowledge and specialized languages, and the languages are very different from any language that humans speak.

Legal requires specialized knowledge, but legal writing is still just English and it follows English grammar rules, although it's sometimes a very strange "dialect" of English.

Finance is weird in its own way, as that requires a lot more boring, highly-precise calculations, and LLMs are notoriously bad at those. I suspect that finance is always going to be some hybrid of an LLM driving an "old school" computer to do the hard math, via a programming language or some other, yet-unenvisioned protocol.

> programming really does seem to be the ideal use case for LLMs in a way that other professions just haven't been able to crack.

This is true, mostly because of programmers' love of textual languages, textual protocols, CLI interfaces and generally all things text. If we were all coding in Scratch, this would be a lot harder.

Yes, it appears to be the clear successful use case for the technology, in a way that hasn't been replicated for other professions.

I remain very sceptical that a chat-like interface is the ideal form for LLMs, yet it seems very optimal for programming specifically, along with Copilot-like interfaces of just outputting text.

Those are regulated industries, whereas software development is not.

An AI spitting back bad code won't compile. An AI spitting back bad financial/legal advice bankrupts people.

Generally I agree! I saw a guy shamefully admit he didn't read the output carefully enough when using generated code (that ran), but there was a min() instead of a max(), and it messed up a month of his metrics!

Generating code has significant economic benefit. Once generated, the code can be executed many times without requiring high computing resources, unlike an AI model.

It's just easier to iterate and improve on a coding specialist AI when that is also the skill required to iterate on said AI.

Products that build on general LLM tech are already being used in other fields. For example, my lawyer friend has started using one by LexisNexis[0] and is duly impressed by how it works. It's only a matter of time before models like that get increasingly specialized for that kind of work, it's just harder for lawyers to drive that kind of change alone. Plus, there's a lot more resistance in 'legacy' professions to any kind of change, much less one that is perceived to threaten the livelihoods of established professionals.

Current LLMs are already not bad at a lot of things, but lawyer bots, accountant bots and more are likely coming.


Coding models solve a clear problem and have a clear integration into a developer's workflow: it's like your own personal StackOverflow and it can autocomplete code for you. It's not as clear when it comes to finance or legal; you wouldn't want to rely on an AI that may hallucinate financial numbers or laws. These other professions are also a lot slower to react to change, compared to software development, where people are already used to learning new frameworks every year.

I work in the field. The reason has not been mentioned yet.

It's because (for an unknown reason), having coding and software development in the training mix is really helpful at most other tasks. It improves everything to do with logical thinking by a large margin, and that seems to help with many other downstream tasks.

Even if you don't need the programming, you want it in the training mix to get that logical thinking, which is hard to get from other resources.

I don't know how much that is true for legal or financial resources.

Today, we are excited to release Mistral NeMo, a 12B model built in collaboration with NVIDIA. Mistral NeMo offers a large context window of up to 128k tokens. Its reasoning, world knowledge, and coding accuracy are state-of-the-art in its size category. As it relies on standard architecture, Mistral NeMo is easy to use and a drop-in replacement in any system using Mistral 7B.

We have released pre-trained base and instruction-tuned checkpoints under the Apache 2.0 license to promote adoption for researchers and enterprises. Mistral NeMo was trained with quantisation awareness, enabling FP8 inference without any performance loss.

So that's... uniformly an improvement at just about everything, right? Large context, permissive license, should have good perf. The one thing I can't tell is how big 12B is going to be (read: how much VRAM/RAM is this thing going to need). Annoyingly and rather confusingly for a model under Apache 2.0, refuses to show me files unless I login and "You need to agree to share your contact information to access this model"... though if it's actually as good as it looks, I give it hours before it's reposted without that restriction, which Apache 2.0 allows.

You could consider the improvement in model performance a bit of a cheat - they beat other models "in the same size category" that have 30% fewer parameters.

I still welcome this approach. 7B seems like a dead end in terms of reasoning and generalization. They are annoyingly close to statistical parrots, a world away from the moderate reasoning you get in 70B models. Any use case where that's useful can increasingly be filled by even smaller models, so chasing slightly larger models to get a bit more "intelligence" might be the right move

I actually meant execution speed from quantisation awareness - agreed that comparing against smaller models is a bit cheating.

I strongly disagree, have you used fp16 or q8 llama3 8b?

> reasoning and generalization

any example use-cases or prompts? how do you define those?

if you want to be lazy, 7b = 7gb of vRAM, 12b = 12gb of vRAM, but with quantizing you might be able to do it with ~6-8. So any 16gb Macbook could run it (but not much else).

isn't it 2 bytes (fp16) per param? so 7b = 14 GB + some for inference?

it's very common to run local models in 8 bit int.

This was trained to be run at FP8 with no quality loss.

It fits a 4090. Nvidia lists the models, and therefore I assume 24GB is the minimum.

A 4090 will just narrowly fit a 34B parameter model at 4-bit quantisation.

A 12B model will run on a 4090 with plenty of room to spare, even with 8-bit quantisation.

Easy head math: parameter count times parameter size plus 20-40% for inference slop space. Anywhere from 8-40GB of vram required depending on quantization levels being used.
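
That head math can be written down as a small sketch. The function name and the default 30% overhead are illustrative choices; the 20-40% range comes from the comment above, and treating one billion parameters at one byte each as 1 GB is a rounding assumption.

```python
def estimate_vram_gb(params_billions: float, bits_per_param: int,
                     overhead: float = 0.3) -> float:
    """Weights size plus a fractional allowance for inference working memory."""
    weights_gb = params_billions * bits_per_param / 8  # 1e9 params * bytes ~= GB
    return weights_gb * (1 + overhead)


# A 12B model at common quantization widths, with 20% and 40% overhead.
for bits in (4, 8, 16):
    low = estimate_vram_gb(12, bits, overhead=0.2)
    high = estimate_vram_gb(12, bits, overhead=0.4)
    print(f"12B at {bits}-bit: roughly {low:.1f}-{high:.1f} GB")
```

This ignores the extra memory a long context needs, so treat the result as a floor rather than a guarantee.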

I’m AI stupid. Does anyone know if training on multiple languages provides “cross-over” — so training done in German can be utilized when answering a prompt in English? I once went through various Wikipedia articles in a couple languages and the differences were interesting. For some reason I thought they’d be almost verbatim (forgetting that’s not how Wikipedia works!) and while I can’t remember exactly I felt they were sometimes starkly different in tone and content.

Generally yes, with caveats.

There was some research showing that training a model on facts like "the mother of John Smith is Alice" but in German allowed it to answer questions like "who's the mother of John Smith", but not questions like "what's the name of Alice's child", regardless of language. Not sure if this holds at larger model sizes though, it's the sort of problem that's usually fixable by throwing more parameters at it.

Language models definitely do generalize to some extent and they're not "stochastic parrots" as previously thought, but there are some weird ways in which we expect them to generalize but they don't.

> Language models definitely do generalize to some extent and they're not "stochastic parrots" as previously thought, but there are some weird ways in which we expect them to generalize but they don't.

Do you have any good sources that explain this? I was always thinking LLMs are indeed stochastic parrots, but language (that is the unified corpus of all languages in the training data) already inherently contains the „generalization“. So the intelligence is encoded in the language humans speak.

> language already inherently contains the „generalization“

The mental gymnastics required to handwave language model capabilities are getting funnier and funnier every day.

I don't have explanations but I can point you to one of the papers, which calls it "the reversal curse" and does a bunch of experiments showing models that are successful at questions like "Who is Tom Cruise’s mother?" (Mary Lee Pfeiffer) will not be equally successful at answering "Who is Mary Lee Pfeiffer’s son?"

Anecdata, but I did some continued pretraining on a toy LLM using a machine-translated version of the original dataset.

Performance improved across all benchmarks in English (the original language).

Am I understanding correctly? You took an English dataset, trained an LLM, machine translated the English dataset to e.g. Spanish, continued training the model, and performance for queries in English improved? That’s really interesting.

no, it is basically an 'auto-correct' spell checker from the phone. It only knows what it was trained on. But it has been shown that a coding LLM that has never seen a programming language or a library can "learn" a new one faster than, say, a generic LLM.

That's not true, LLMs can answer questions in one language even if they were only trained on that data in another language.

I.e., if you train an LLM on both English and French in general, but only teach it a specific fact in French, it can give you that fact in English.

There is evidence that code training helps with reasoning, so if you count code as another language, then this makes sense.

I still don’t understand the business model of releasing open source gen AI models. If this took 3072 H100s to train, why are they releasing it for free? I understand they charge people when renting from their platform, but why permit people to run it themselves?

> but why permit people to run it themselves?

I wouldn't worry about that if I were them: it's been shown again and again that people will pay for convenience.

What I'd worry about is Amazon/Cloudflare repackaging my model and outcompeting my platform.

> What I'd worry about is Amazon/Cloudflare repackaging my model and outcompeting my platform.

Why let Amazon/Cloudflare repackage it?

How would you stop them?

The license is Apache 2.

That's my question -- why license as Apache 2

What license would allow complete freedom for everyone else, but constrain Amazon and Cloudflare?

The LLaMa license is a good start.

Is "Parameter Creep" going to become a thing? They hold up Llama-8b as a competitor despite NeMo having 50% more parameters.

The same thing happened with gemma-27b, where they compared it to all the 7-9b models.

It seems like an easy way to boost benchmarks while coming off as "small" at first glance.

Oddly, they are only charging slightly more for their hosted version:

open-mistral-7b is 25c/m tokens

open-mistral-nemo-2407 is 30c/m tokens

Possibly an NVIDIA subsidy. You run NeMo models, you get cheaper GPUs.

As written here:

Yeah it will be interesting to see if we ever settle on standard sizes here. My preference would be:

- 3B for CPU inference or running on edge devices.

- 20-30B for maximizing single consumer GPU potential.

- 70B+ for those who can afford it.

7-9B never felt like an ideal size.

I wonder why Mistral et al don't prepare GGUF versions of these for launch day?

If I were them I'd want to be the default source of the versions of my models that people use, rather than farming that out to whichever third party races to publish the GGUF (and other formats) first.

Same could be said for onnx.

Depends on which community you are in as to what you want.

Right - imagine how much of an impact a model release could have if it included GGUF and ONNX and MLX along with PyTorch.

llama.cpp is still under development and they sometimes come out with breaking changes or new quantization methods, and it can be a lot of work to keep up with these changes as you publish more models over time. It's easier to just publish a standard float32 safetensors that works with PyTorch, and let the community deal with other runtimes and file formats.

If it's a new architecture, then there's also additional work needed to add support in llama.cpp, which means more dev time, more testing, and potentially loss of surprise model release if the development work has to be done out in the open

Some of the major vendors _do_ create the GGUFs for their models, but often they have the wrong parameter settings, need changes in the inference code, or don't include the correct prompt template. We (i.e. Ollama) have our own conversion scripts and we try to work with the model vendors to get everything working ahead of time, but unfortunately Mistral doesn't usually give us a heads up before they release.

Exciting, I think 12B is the sweet spot for running locally - large enough to be useful, fast enough to run on a decent laptop.

If I "only" have 16GB of RAM on a MacBook Pro, would that still work?

If it's an M-series one with "unified memory" (shared RAM between the CPU, GPU and NPU on the same chip), yes.

Keep in mind that Gemma is a larger model but it only has 8k context. The Mistral 12B will need less VRAM to store the weights but you'll need a much larger KV cache if you intend to use the full 128k context, especially if the KV is unquantized. Not sure if this new model has GQA but those without it absolutely eat memory when you increase the context size (looking at you Command R).

How much memory does employing the complete 128k window take, though? I've sadly noticed that it can take a significant amount of VRAM to use a larger context window.

edit: e.g. I wouldn't know the correct parameters for this calculator, but going from 8k window to 128k window goes from 1.5 GB to 23 GB:
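
For a sense of where numbers like these come from, here is a back-of-the-envelope sketch of KV cache size: keys and values are stored for every layer, every KV head and every token. The layer count, KV head count and head dimension below are hypothetical values picked for illustration, not confirmed parameters of any model in this thread, and the function name is made up.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Keys and values (factor 2) stored for every layer and every token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Hypothetical configuration for illustration only, not confirmed for any model:
# 40 layers, 8 KV heads (GQA), head dimension 128, 16-bit cache values.
CONFIG = dict(n_layers=40, n_kv_heads=8, head_dim=128)

for tokens in (8 * 1024, 128 * 1024):
    gib = kv_cache_bytes(context_tokens=tokens, **CONFIG) / 2**30
    print(f"{tokens} tokens: {gib:.2f} GiB")
```

The cache grows linearly with context length, so 16x the window means 16x the memory. It also scales with the number of KV heads, which is why models without GQA get so expensive at long contexts, and quantizing the cache (a smaller `bytes_per_value`) shrinks it proportionally.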

Does it have any relation to Nvidia's Nemo? Otherwise, it's unfortunate naming

Click the link, read the first sentence.

Yeah, not my brightest HN moment, to be honest.

At least you didn't ask about finding a particular fish.

I believe that if Mistral is serious about advancing in open source, they should consider sharing the corpus used for training their models, at least the base models' pretraining data.

I doubt they could. Their corpus almost certainly is mostly composed of copyrighted material they don't have a license for. It's an open question whether that's an issue for using it for model training, but it's obvious they wouldn't be allowed to distribute it as a corpus. That'd just be regular copyright infringement.

Maybe they could share a list of the content of their corpus. But that wouldn't be too helpful and makes it much easier for all affected parties to sue them for using their content in model training.

no, not the actual content, just the titles of the content. like "book title" by "author". the tool just simply can't be taken seriously by anyone until they release that information. this is the case for all these models. it's ridiculous, almost insulting.

They can't release it without admitting to copyright infringement.

That would almost be worse. All copyright holders would need to do is search a list of titles if I'm understanding your proposal correctly.

The idea is not to get sued.

What's the reason for measuring the model size in context window length and not GB?

Also, are these small models OSS? Easier self hosting seems to be the main benefit for small models.

I suspect you might be confusing the numbers: 12B (which is the very first number they give) is not context length, it's parameter count.

The reason to use parameter count is that the final size in GB depends on quantization. A 12B model at 8-bit parameter width would be 12 GB (plus some % overhead), while at 16-bit it would be 24 GB.

Context length here is 128k, which is orthogonal to model size. You can see that they specify both parameter count and context size, because you need both to characterize an LLM.

It's also interesting to know what parameter width it was trained on because you cannot get more information by "quantizing wider" -- it only makes sense to quantize into a narrower parameter width to save space.
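
A toy round trip shows why "quantizing wider" can't add information. The sketch below uses simple symmetric 8-bit quantization, which is an assumed scheme chosen for brevity rather than the method any particular runtime uses; the function names and sample weights are made up.

```python
def quantize_int8(values):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    return [round(v / scale) for v in values], scale


def dequantize(codes, scale):
    """Widen the integer codes back to floats."""
    return [c * scale for c in codes]


weights = [0.1234567, -0.9876543, 0.5, 0.0001]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

for original, back in zip(weights, restored):
    print(f"{original:+.7f} -> {back:+.7f}  (error {abs(original - back):.7f})")
```

The restored values are stored in a wider type again, but they only ever take one of 255 levels, so the detail rounded away in the narrow format stays lost (the smallest weight here collapses to zero).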

Ah, yes.

Thanks, I confused those numbers!

> Also, are these small models OSS?

From the very first paragraph on the page:

released under the Apache 2.0 license.

> What's the reason for measuring the model size in context window length and not GB?

These are 2 different things.

The context window is how many tokens its context can contain. On a big-context model you could put a few books and articles in the context and then start asking your questions; on a small-context model you can start a conversation, and after a short time it will start forgetting the first prompts. A big context will use more memory and will cost performance, but imagine you could give it your entire code project and then ask it questions. So often I know there is some function already there that does something, but I can't remember the name.

Two questions:

1) Anyone have any idea of VRAM requirements?

2) When will this be available on ollama?

1) Rule of thumb is # of params = GB at Q8. So a 12B model generally takes up 12GB of VRAM at 8 bit precision.

But 4bit precision is still pretty good, so 6GB VRAM is viable, not counting additional space for context. Usually about an extra 20% is needed, but 128K is a pretty huge context so more will be needed if you need the whole space.

The model has 12 billion parameters and uses FP8, so 1 byte each. With some working memory I'd bet you can run it on 24GB.

Designed to fit on the memory of a single NVIDIA L40S, NVIDIA GeForce RTX 4090 or NVIDIA RTX 4500 GPU

Mistral NeMo uses a new tokenizer, Tekken, based on Tiktoken, that was trained on more than 100 languages, and compresses natural language text and source code more efficiently than the SentencePiece tokenizer used in previous Mistral models.

Does anyone have a good answer why everyone went back to SentencePiece in the first place? Byte-pair encoding (which is what tiktoken uses) was shown to be a more efficient encoding as far back as GPT-2 in 2019.

SentencePiece is not a different algorithm to WordPiece or BPE, despite its naming.

One of the main pulls of the SentencePiece library was the pre-tokenization being less reliant on white space and therefore more adaptable to non Western languages.

The SentencePiece library also implements Byte-pair-encoding. That's what the LLaMA models use and the original Mistral models were essentially a copy of LLaMA2.
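
For anyone who hasn't seen byte-pair encoding spelled out, here is a toy sketch of the training loop: start from single characters and repeatedly merge the most frequent adjacent pair. This is a simplified illustration, not how tiktoken or SentencePiece are implemented (real tokenizers typically work on bytes or pre-tokenized pieces and are far more optimized); the function names are made up.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent pair, or None if there are no pairs."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from single characters."""
    tokens, merges = list(text), []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges


tokens, merges = train_bpe("low lower lowest", num_merges=3)
print(merges)
print(tokens)
```

Each merge shortens the token sequence, which is the "compression" that tokenizer comparisons talk about: a vocabulary whose merges match the text well needs fewer tokens for the same input.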

Did anyone try to check how its multilingual skills compare vs. Gemma 2? On the page, it's compared with Llama 3 only.

Well it's not on Le Chat, it's not on LMSys, it has a new tokenizer that breaks llama.cpp compatibility, and I'm sure as hell not gonna run it with Crapformers at 0.1x speed which as of right now seems to be the only way to actually test it out.

But pumping out models and putting artifacts on HuggingFace, is that a business? What are these models being used for? There is a new one at a decent clip.

Mistral NeMo comes packaged as an NVIDIA NIM inference microservice, offering performance-optimized inference with NVIDIA TensorRT-LLM engines.

*Designed to fit on the memory of a single NVIDIA L40S, NVIDIA GeForce RTX 4090 or NVIDIA RTX 4500 GPU*, the Mistral NeMo NIM offers high efficiency, low compute cost, and enhanced security and privacy.

The model was trained using Megatron-LM, part of NVIDIA NeMo, with 3,072 H100 80GB Tensor Core GPUs on DGX Cloud, composed of NVIDIA AI architecture, including accelerated computing, network fabric and software to increase training efficiency.

Is the base model unaligned? Disappointing to see alignment from allegedly "open" models.

The last time I tried a Mistral model, it didn't answer most of my questions, because of "policy" reasons. I hope they fixed that. OpenAI at least only tells me that it's a policy issue but still answers most of the time.