Pardon me if this is a dumb question, but is it possible for me to download these models onto my computer (I have a 1080ti and a [2|3]070ti) and expose some sort of API interface? That way I can write programs that call this API, which I find appealing.
EDIT: This is a 1W light bulb moment for me, thank you!
Justine Tunney (of redbean fame) is actively working on getting LLMs to run well on CPUs, where RAM is cheap. If successful this would eliminate an enormous bottleneck to running local models. If anyone can do this, she can. (And thank you to Mozilla for financially supporting her work). See https://justine.lol/matmul/ and https://github.com/mozilla-Ocho/llamafile
I think it's mostly the memory bandwidth, though, that makes GPUs so fast with LLMs. My card does about 1TB/s; CPU RAM won't come near that. I'm sure a lot of optimisations can be had, but I think GPUs will still be significantly ahead.
Macs are so good at it because Apple solders the memory onto the SoC package for a really wide, low-latency connection.
This is a good and valid comment. It is difficult to predict the future, but I would be curious what the best-case theoretical performance of an LLM would be on a typical x86 or ARM system with DDR4 or DDR5 RAM. My uneducated guess is that it can be very good, perhaps 50% of the speed of a specialized GPU/RAM device. In practical terms, the CPU approach is required for very large contexts, potentially as large as the lifetime of all interactions you have with your LLM.
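For a rough sense of scale: if token generation is memory-bandwidth-bound (each new token requires streaming all the weights once), then tokens/sec is roughly bandwidth divided by model size. A back-of-envelope sketch in Python, where the bandwidth and model-size figures are illustrative assumptions, not benchmarks:

    # Rough estimate: if decoding is memory-bandwidth-bound, each generated token
    # requires reading the full set of weights once, so
    #   tokens/sec ~= memory_bandwidth / model_size_in_bytes
    # All figures below are illustrative assumptions, not measurements.

    def tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
        """Upper-bound estimate of tokens/sec from bandwidth and model size."""
        return bandwidth_gb_s / model_size_gb

    # A ~12B-parameter model quantized to ~4 bits is roughly 7 GB of weights (assumed).
    model_gb = 7.0

    for name, bw in [
        ("dual-channel DDR5 (~80 GB/s, assumed)", 80.0),
        ("Apple M-series unified memory (~400 GB/s, assumed)", 400.0),
        ("high-end GPU VRAM (~1000 GB/s)", 1000.0),
    ]:
        print(f"{name}: ~{tokens_per_second(bw, model_gb):.0f} tokens/sec upper bound")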
I love that domain name.
I’d probably check https://ollama.com/library?q=Nemo in a couple of days. My guess is that by then ollama will have support for it. And you can then run the model locally on your machine with ollama.
Adding to this: if the default is too slow, look at the more heavily quantized versions of the model; they are smaller, at a moderate cost in output quality. Ollama can split models between GPU and host memory, but the throughput drop-off tends to be pretty severe.
Why would it take a couple days? Is it not a matter of uploading the model to their registry, or are there more steps involved than that?
We're working on it, except that there is a change to the tokenizer which we're still working through in our conversion scripts. Unfortunately we don't get a heads up from Mistral when they drop a model, so sometimes it takes a little bit of time to sort out the differences.
Also, I'm not sure if we'll call it mistral-nemo or nemo yet. :-D
First thing I did when I saw the headline was look for it on ollama, but it hasn't landed there yet: https://ollama.com/library?sort=newest&q=NeMo
We're working on it!
Yes.
If you're on a Mac, check out LM Studio.
It's a UI that lets you load and interact with models locally. You can also wrap your model in an OpenAI-compatible API and interact with it programmatically.
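To answer the "call it from my own programs" part: LM Studio's local server speaks the OpenAI chat-completions format, so the standard openai Python client works against it. A minimal sketch, assuming the server is running on its default port (1234) and a model is already loaded in the UI; the model name below is a placeholder:

    # Minimal sketch: talk to a local LM Studio server through the OpenAI client.
    # Assumes LM Studio's local server is running (default http://localhost:1234/v1)
    # and a model is already loaded in the UI.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:1234/v1",  # assumed default local endpoint
        api_key="not-needed",                 # the local server ignores the key
    )

    response = client.chat.completions.create(
        model="local-model",  # placeholder; use the identifier LM Studio shows
        messages=[{"role": "user", "content": "Summarize what an LLM is in one sentence."}],
    )
    print(response.choices[0].message.content)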
llama.cpp and ollama both have APIs for most models
llama.cpp supports multi-GPU across a local network: https://www.reddit.com/r/LocalLLaMA/comments/1cyzi9e/llamacp...
and they expose an OpenAI-compatible server, or you can use their Python bindings
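A minimal sketch of the Python-bindings route (llama-cpp-python), assuming a GGUF model file has already been downloaded; the model path is a placeholder:

    # Minimal sketch using the llama-cpp-python bindings directly (no server needed).
    # Assumes a GGUF model file has already been downloaded; the path is a placeholder.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/your-model.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=-1,  # offload as many layers as fit in VRAM; use 0 for CPU-only
        n_ctx=4096,       # context window
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Say hello in five words."}],
        max_tokens=64,
    )
    print(out["choices"][0]["message"]["content"])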
Try LM Studio or Ollama. Load up the model, and there you go.
AFAIK, Ollama supports most of these models locally and will expose a REST API[0]
[0]: https://github.com/ollama/ollama/blob/main/docs/api.md
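A minimal sketch of calling that REST API with plain requests, assuming the Ollama server is running on its default port (11434) and the model has already been pulled; the model name below is a placeholder:

    # Minimal sketch: call the local Ollama REST API directly.
    # Assumes the Ollama server is running (default port 11434) and the model
    # has already been pulled; the model name is a placeholder.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "mistral",  # placeholder; use whatever `ollama pull` fetched
            "prompt": "Explain RAM bandwidth in one sentence.",
            "stream": False,     # return one JSON object instead of a stream
        },
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json()["response"])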
You will need enough VRAM; a 1080ti is not going to work very well. Maybe get a 3090 with 24GB of VRAM.
I think it should also run well on a 36GB MacBook Pro, or probably a 24GB MacBook Air.