
Mistral NeMo

pixelatedindex
16 replies
3h5m

Pardon me if this is a dumb question, but is it possible for me to download these models onto my computer (I have a 1080ti and a [2|3]070ti) and expose some sort of API interface? That way I can write programs that call this API, which I find appealing.

EDIT: This is a 1W light bulb moment for me, thank you!

simpaticoder
3 replies
2h58m

Justine Tunney (of redbean fame) is actively working on getting LLMs to run well on CPUs, where RAM is cheap. If successful this would eliminate an enormous bottleneck to running local models. If anyone can do this, she can. (And thank you to Mozilla for financially supporting her work). See https://justine.lol/matmul/ and https://github.com/mozilla-Ocho/llamafile

wkat4242
1 replies
1h45m

I think it's mostly the memory bandwidth though that makes the GPUs so fast with LLMs. My card does about 1TB/s. CPU RAM won't come near that. I'm sure a lot of optimisations can be had but I think GPUs will still be significantly ahead.

Macs are so good at it because Apple solders the memory on top of the SoC for a really wide and low-latency connection.

simpaticoder
0 replies
1h19m

This is a good and valid comment. It is difficult to predict the future, but I would be curious what the best-case theoretical performance of an LLM would be on a typical x86 or ARM system with DDR4 or DDR5 RAM. My uneducated guess is that it can be very good, perhaps 50% the speed of a specialized GPU/RAM device. In practical terms, the CPU approach is required for very large contexts, up to as large as the lifetime of all interactions you have with your LLM.

illusive4080
0 replies
2h41m

I love that domain name.

codetrotter
3 replies
3h1m

I’d probably check https://ollama.com/library?q=Nemo in a couple of days. My guess is that by then ollama will have support for it. And you can then run the model locally on your machine with ollama.

hedgehog
0 replies
2h35m

Adding to this: If the default is too slow look at the more heavily quantized versions of the model, they are smaller at moderate cost in output quality. Ollama can split models between GPU and host memory but the throughput dropoff tends to be pretty severe.

andrethegiant
0 replies
2h2m

Why would it take a couple days? Is it not a matter of uploading the model to their registry, or are there more steps involved than that?

Patrick_Devine
0 replies
46m

We're working on it, except that there is a change to the tokenizer which we're still working through in our conversion scripts. Unfortunately we don't get a heads up from Mistral when they drop a model, so sometimes it takes a little bit of time to sort out the differences.

Also, I'm not sure if we'll call it mistral-nemo or nemo yet. :-D

Patrick_Devine
0 replies
45m

We're working on it!

nostromo
0 replies
2h37m

Yes.

If you're on a Mac, check out LM Studio.

It's a UI that lets you load and interact with models locally. You can also wrap your model in an OpenAI-compatible API and interact with it programmatically.
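
For a concrete sketch, calling such a local endpoint from Python might look like this (the base URL and model name are assumptions; LM Studio and Ollama each report their own):

    # Minimal sketch: talking to a locally hosted model through an
    # OpenAI-compatible endpoint (LM Studio defaults to http://localhost:1234/v1,
    # Ollama exposes http://localhost:11434/v1).
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
    response = client.chat.completions.create(
        model="mistral-nemo",  # hypothetical local model name
        messages=[{"role": "user", "content": "Explain what a context window is."}],
    )
    print(response.choices[0].message.content)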

kanwisher
0 replies
3h2m

llama.cpp and ollama both have APIs for most models.

d13
0 replies
1h42m

Try Lm Studio or Ollama. Load up the model, and there you go.

RockyMcNuts
0 replies
2h42m

You will need enough VRAM; a 1080ti is not going to work very well. Maybe get a 3090 with 24GB of VRAM.

I think it should also run well on a 36GB MacBook Pro or probably a 24GB MacBook Air.

madeofpalk
14 replies
3h2m

I find it interesting how coding/software development still appears to be the one category that the most popular model makers release specialised models for. Where are the finance or legal models from Mistral or Meta or OpenAI?

Perhaps it's just confirmation bias, but programming really does seem to be the ideal use case for LLMs in a way that other professions just haven't been able to crack. Compared to other types of work, it's relatively straightforward to tell if code is "correct" or not.

troupo
2 replies
2h54m

The explanation is easier, I think. Consider what data these models are trained on, and who the immediate developers of these models are.

The models are trained on a vast set of whatever is available on the internet. They are developed by tech people/programmers who are surprisingly blind to their own biases and interests. There's no surprise that one of the main things they want to try and do is programming, using vast open quantities of Stack Overflow, GitHub and various programming forums.

For finance and legal you need to:

- think a bit outside the box

- be interested in finance and legal

- be prepared to carry actual legal liability for the output of your models

moffkalast
0 replies
8m

Then again, we just had this on the front page: https://news.ycombinator.com/item?id=40957990

We first document a significant decline in stock trading volume during ChatGPT outages and find that the effect is stronger for firms with corporate news released immediately before or during the outages. We further document similar declines in the short-run price impact, return variance, and bid-ask spreads, consistent with a reduction in informed trading during the outage periods. Lastly, we use trading volume changes during outages to construct a firm-level measure of the intensity of GAI-assisted trading and provide early evidence of a positive effect of GAI-assisted trading on long-run stock price informativeness.

They're being used, but nobody is really saying anything because the stock market is a zero sum game these days and letting anyone else know that this holds water is a recipe for competition. Programming is about the opposite, the more you give, the more you get, so it makes sense to popularize it as a feature.

dannyw
0 replies
2h48m

- be prepared to carry actual legal liability for the output of your models

Section 230.

It's been argued that a response by an LLM, to user input, is "user-generated content" and hence the platform generally has no liability (except for CSAM).

Nobody has successfully sued.

sofixa
2 replies
2h32m

Finance already has its own models and has had them for decades. Market predictions and high-frequency trading are literally what all the hedge funds and the like have been doing for a few decades now. Including advanced sources of information like (take with a grain of salt, I've heard it on the internet) using satellite images to measure factory activity and thus predict results.

Understandably they're all quite secretive about their tooling because they don't want the competition to have access to the same competitive advantages, and an open source model / third party developing a model doesn't really make sense.

madeofpalk
1 replies
2h27m

I guess finance is not in need of a large language model?

Foobar8568
0 replies
2h16m

It does but everything is a joke...

miki123211
1 replies
2h41m

Where's the finance or legal models from Mistral or Meta or OpenAI?

Programming is "weird" in that it requires both specialized knowledge and specialized languages, and the languages are very different from any language that humans speak.

Legal requires specialized knowledge, but legal writing is still just English and it follows English grammar rules, although it's sometimes a very strange "dialect" of English.

Finance is weird in its own way, as that requires a lot more boring, highly-precise calculations, and LLMs are notoriously bad at those. I suspect that finance is always going to be some hybrid of an LLM driving an "old school" computer to do the hard math, via a programming language or some other, yet-unenvisioned protocol.

programming really does seem to be the ideal usecase for LLMs in a way that other professions just haven't been able to crack.

This is true, mostly because of programmers' love of textual languages, textual protocols, CLI interfaces and generally all things text. If we were all coding in Scratch, this would be a lot harder.

madeofpalk
0 replies
2h30m

Yes, it appears to be the clear successful usecase for the technology, in a way that hasn't been replicated for other professions.

I remain very sceptical that a chat-like interface is the ideal form for LLMs, yet it seems very optimal for programming specifically, along with Copilot-like interfaces of just outputting text.

MikeKusold
1 replies
2h55m

Those are regulated industries, whereas software development is not.

An AI spitting back bad code won't compile. An AI spitting back bad financial/legal advice bankrupts people.

knicholes
0 replies
2h52m

Generally I agree! I saw a guy shamefully admit he didn't read the output carefully enough when using generated code (that ran), but there was a min() instead of a max(), and it messed up a month of his metrics!

sakesun
0 replies
2h50m

Generating code has a significant economic benefit. Once generated, the code can be executed many times without requiring high computing resources, unlike an AI model.

drewmate
0 replies
2h17m

It's just easier to iterate and improve on a coding specialist AI when that is also the skill required to iterate on said AI.

Products that build on general LLM tech are already being used in other fields. For example, my lawyer friend has started using one by LexisNexis[0] and is duly impressed by how it works. It's only a matter of time before models like that get increasingly specialized for that kind of work, it's just harder for lawyers to drive that kind of change alone. Plus, there's a lot more resistance in 'legacy' professions to any kind of change, much less one that is perceived to threaten the livelihoods of established professionals.

Current LLMs are already not bad at a lot of things, but lawyer bots, accountant bots and more are likely coming.

[0] https://www.lexisnexis.com/en-us/products/lexis-plus-ai.page

a2128
0 replies
2h56m

Coding models solve a clear problem and have a clear integration into a developer's workflow - it's like your own personal StackOverflow and it can autocomplete code for you. It's not as clear when it comes to finance or legal, you wouldn't want to rely on an AI that may hallucinate financial numbers or laws. These other professions are also a lot slower to react to change, compared to software development where people are already used to learning new frameworks every year

317070
0 replies
2h25m

I work in the field. The reason has not been mentioned yet.

It's because (for an unknown reason), having coding and software development in the training mix is really helpful at most other tasks. It improves everything to do with logical thinking by a large margin, and that seems to help with many other downstream tasks.

Even if you don't need the programming, you want it in the training mix to get that logical thinking, which is hard to get from other resources.

I don't know how much that is true for legal or financial resources.

yjftsjthsd-h
13 replies
3h29m

Today, we are excited to release Mistral NeMo, a 12B model built in collaboration with NVIDIA. Mistral NeMo offers a large context window of up to 128k tokens. Its reasoning, world knowledge, and coding accuracy are state-of-the-art in its size category. As it relies on standard architecture, Mistral NeMo is easy to use and a drop-in replacement in any system using Mistral 7B.

We have released pre-trained base and instruction-tuned checkpoints under the Apache 2.0 license to promote adoption for researchers and enterprises. Mistral NeMo was trained with quantisation awareness, enabling FP8 inference without any performance loss.

So that's... uniformly an improvement at just about everything, right? Large context, permissive license, should have good perf. The one thing I can't tell is how big 12B is going to be (read: how much VRAM/RAM is this thing going to need). Annoyingly and rather confusingly for a model under Apache 2.0, https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407 refuses to show me files unless I login and "You need to agree to share your contact information to access this model"... though if it's actually as good as it looks, I give it hours before it's reposted without that restriction, which Apache 2.0 allows.

wongarsu
3 replies
2h27m

You could consider the improvement in model performance a bit of a cheat - they beat other models "in the same size category" that have 30% fewer parameters.

I still welcome this approach. 7B seems like a dead end in terms of reasoning and generalization. They are annoyingly close to statistical parrots, a world away from the moderate reasoning you get in 70B models. Any use case where that's useful can increasingly be filled by even smaller models, so chasing slightly larger models to get a bit more "intelligence" might be the right move

yjftsjthsd-h
0 replies
1h26m

I actually meant execution speed from quantisation awareness - agreed that comparing against smaller models is a bit cheating.

mistercheph
0 replies
2h10m

I strongly disagree, have you used fp16 or q8 llama3 8b?

amrrs
0 replies
1h35m

reasoning and generalization

Any example use cases or prompts? How do you define those?

bernaferrari
3 replies
2h31m

If you want to be lazy: 7B = 7GB of VRAM, 12B = 12GB of VRAM, but with quantization you might be able to do with ~6-8GB. So any 16GB MacBook could run it (but not much else).

hislaziness
2 replies
2h28m

Isn't it 2 bytes (fp16) per param? So 7B = 14GB plus some for inference?

fzzzy
0 replies
55m

it's very common to run local models in 8 bit int.

ancientworldnow
0 replies
2h25m

This was trained to be run at FP8 with no quality loss.

Bumblonono
1 replies
2h29m

It fits a 4090. Nvidia lists the models and therefore I assume 24GB is the minimum.

michaelt
0 replies
1h56m

A 4090 will just narrowly fit a 34B-parameter model at 4-bit quantisation.

A 12B model will run on a 4090 with plenty of room to spare, even at 8-bit quantisation.

xena
0 replies
3h11m

Easy head math: parameter count times parameter size plus 20-40% for inference slop space. Anywhere from 8-40GB of vram required depending on quantization levels being used.
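
As a rough sketch of that head math (the 30% overhead factor below is just an assumed midpoint of the 20-40% range):

    # Rough VRAM estimate: parameter count x bytes per parameter, plus slop for inference.
    def vram_estimate_gb(params_billion, bits_per_param, overhead=0.3):
        weights_gb = params_billion * bits_per_param / 8
        return weights_gb * (1 + overhead)

    for bits in (16, 8, 4):
        print(f"12B at {bits}-bit: ~{vram_estimate_gb(12, bits):.0f} GB")
    # -> roughly 31 GB, 16 GB and 8 GB, matching the 8-40GB ballpark above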

exe34
0 replies
2h51m

The tensors look to be about 20GB. Not sure what that's like in VRAM.

jorgesborges
9 replies
3h2m

I’m AI stupid. Does anyone know if training on multiple languages provides “cross-over” — so training done in German can be utilized when answering a prompt in English? I once went through various Wikipedia articles in a couple languages and the differences were interesting. For some reason I thought they’d be almost verbatim (forgetting that’s not how Wikipedia works!) and while I can’t remember exactly I felt they were sometimes starkly different in tone and content.

miki123211
3 replies
2h53m

Generally yes, with caveats.

There was some research showing that training a model on facts like "the mother of John Smith is Alice" but in German allowed it to answer questions like "who's the mother of John Smith", but not questions like "what's the name of Alice's child", regardless of language. Not sure if this holds at larger model sizes though, it's the sort of problem that's usually fixable by throwing more parameters at it.

Language models definitely do generalize to some extent and they're not "stochastic parrots" as previously thought, but there are some weird ways in which we expect them to generalize and they don't.

planb
2 replies
2h25m

Language models definitely do generalize to some extent and they're not "stochastic parrots" as previously thought, but there are some weird ways in which we expect them to generalize and they don't.

Do you have any good sources that explain this? I was always thinking LLMs are indeed stochastic parrots, but language (that is the unified corpus of all languages in the training data) already inherently contains the „generalization“. So the intelligence is encoded in the language humans speak.

moffkalast
0 replies
41m

language already inherently contains the „generalization“

The mental gymnastics required to handwave language model capabilities are getting funnier and funnier every day.

michaelt
0 replies
1h41m

I don't have explanations but I can point you to one of the papers: https://arxiv.org/pdf/2309.12288 which calls it "the reversal curse" and does a bunch of experiments showing models that are successful at questions like "Who is Tom Cruise’s mother?" (Mary Lee Pfeiffer) will not be equally successful at answering "Who is Mary Lee Pfeiffer’s son?"

dannyw
1 replies
2h45m

Anecdata, but I did some continued pretraining on a toy LLM using a machine-translated version of the original dataset.

Performance improved across all benchmarks in English (the original language).

benmanns
0 replies
2h30m

Am I understanding correctly? You took an English dataset, trained an LLM, machine-translated the English dataset to e.g. Spanish, continued training the model, and performance for queries in English improved? That's really interesting.

bernaferrari
1 replies
2h59m

No, it is basically like the 'auto-correct' spell checker on your phone. It only knows what it was trained on. But it has been shown that a coding LLM that has never seen a given programming language or library can "learn" a new one faster than, say, a generic LLM.

StevenWaterman
0 replies
2h48m

That's not true; LLMs can answer questions in one language even if they were only trained on that data in another language.

I.e. if you train an LLM on both English and French in general, but only teach it a specific fact in French, it can give you that fact in English.

bionhoward
0 replies
2h45m

There is evidence that code training helps with reasoning, so if you count code as another language, this makes sense.

https://openreview.net/forum?id=KIPJKST4gw

Is symbolic language a fuzzy sort of code? Absolutely, because it conveys logic and information. TLDR: yes!

andrethegiant
6 replies
1h41m

I still don’t understand the business model of releasing open source gen AI models. If this took 3072 H100s to train, why are they releasing it for free? I understand they charge people when renting from their platform, but why permit people to run it themselves?

kaoD
5 replies
1h34m

but why permit people to run it themselves?

I wouldn't worry about that if I were them: it's been shown again and again that people will pay for convenience.

What I'd worry about is Amazon/Cloudflare repackaging my model and outcompeting my platform.

andrethegiant
4 replies
1h27m

What I'd worry about is Amazon/Cloudflare repackaging my model and outcompeting my platform.

Why let Amazon/Cloudflare repackage it?

bilbo0s
3 replies
1h14m

How would you stop them?

The license is Apache 2.

andrethegiant
2 replies
1h12m

That's my question -- why license as Apache 2

bilbo0s
1 replies
1h0m

What license would allow complete freedom for everyone else, but constrain Amazon and Cloudflare?

supriyo-biswas
0 replies
38m

The LLaMa license is a good start.

Workaccount2
6 replies
3h26m

Is "Parameter Creep" going to becomes a thing? They hold up Llama-8b as a competitor despite NeMo having 50% more parameters.

The same thing happened with gemma-27b, where they compared it to all the 7-9b models.

It seems like an easy way to boost benchmarks while coming off as "small" at first glance.

voiper1
2 replies
3h12m

Oddly, they are only charging slightly more for their hosted version:

open-mistral-7b: 25c per 1M tokens

open-mistral-nemo-2407: 30c per 1M tokens

https://mistral.ai/technology/#pricing

dannyw
0 replies
2h50m

Possibly a NVIDIA subsidy. You run NEMO models, you get cheaper GPUs.

Palmik
0 replies
2h20m

They specifically call out FP8-aware training, and TensorRT-LLM is really good (efficient) with FP8 inference on H100 and other Hopper cards. It's possible that they run the 7B natively in fp16, as smaller models suffer more from even "modest" quantization like this.

marci
0 replies
2h44m

For the benchmarks, it depends on how you interpret them. The other models are quite popular, so many people have a starting point. Now, if you regularly use them you can assess: "just 3% better on some benchmark, 80% to 83, at the cost of almost twice the inference time and base RAM requirement, but with a 16x context window and commercial usage allowed..." and at the end, "for my use case, is it worth it?"

eyeswideopen
0 replies
3h17m

As written here: https://huggingface.co/nvidia/Mistral-NeMo-12B-Instruct

"It significantly outperforms existing models smaller or similar in size." is a statement that goes in that direction and would allow the comparison of a 1.7T param model with a 7b one

causal
0 replies
2h46m

Yeah it will be interesting to see if we ever settle on standard sizes here. My preference would be:

- 3B for CPU inference or running on edge devices.

- 20-30B for maximizing single consumer GPU potential.

- 70B+ for those who can afford it.

7-9B never felt like an ideal size.

simonw
5 replies
2h55m

I wonder why Mistral et al don't prepare GGUF versions of these for launch day?

If I were them I'd want to be the default source of the versions of my models that people use, rather than farming that out to whichever third party races to publish the GGUF (and other formats) first.

sroussey
1 replies
2h48m

Same could be said for onnx.

Depends on which community you are in as to what you want.

simonw
0 replies
1h33m

Right - imagine how much of an impact a model release could have if it included GGUF and ONNX and MLX along with PyTorch.

dannyw
0 replies
2h54m

I think it's actually reasonable to leave some opportunities to the community. It's an Apache 2.0 model. It's meant for everyone to build upon freely.

a2128
0 replies
2h52m

llama.cpp is still under development and they sometimes come out with breaking changes or new quantization methods, and it can be a lot of work to keep up with these changes as you publish more models over time. It's easier to just publish a standard float32 safetensors that works with PyTorch, and let the community deal with other runtimes and file formats.

If it's a new architecture, then there's also additional work needed to add support in llama.cpp, which means more dev time, more testing, and potentially loss of surprise model release if the development work has to be done out in the open

Patrick_Devine
0 replies
40m

Some of the major vendors _do_ create the GGUFs for their models, but often they have the wrong parameter settings, need changes in the inference code, or don't include the correct prompt template. We (i.e. Ollama) have our own conversion scripts and we try to work with the model vendors to get everything working ahead of time, but unfortunately Mistral doesn't usually give us a heads up before they release.

pants2
5 replies
3h25m

Exciting, I think 12B is the sweet spot for running locally - large enough to be useful, fast enough to run on a decent laptop.

Raed667
1 replies
2h51m

If I "only" have 16GB of ram on a macbook pro, would that still work ?

sofixa
0 replies
2h29m

If it's an M-series one with "unified memory" (shared RAM between the CPU, GPU and NPU on the same chip), yes.

mysteria
0 replies
14m

Keep in mind that Gemma is a larger model but it only has 8k context. The Mistral 12B will need less VRAM to store the weights, but you'll need a much larger KV cache if you intend to use the full 128k context, especially if the KV is unquantized. Not sure if this new model has GQA, but those without it absolutely eat memory when you increase the context size (looking at you, Command R).
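
A rough sketch of why GQA matters at long context; the layer and head counts below are illustrative assumptions for a 12B-class model, not NeMo's confirmed config:

    # KV cache bytes = 2 (K and V) x layers x KV heads x head_dim x tokens x bytes per element
    def kv_cache_gb(layers, kv_heads, head_dim, tokens, bytes_per_elem=2):
        return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / 1e9

    print(kv_cache_gb(40, 8, 128, 128_000))   # with GQA (8 KV heads): ~21 GB at fp16
    print(kv_cache_gb(40, 32, 128, 128_000))  # no GQA (32 KV heads): ~84 GB at fp16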

_flux
0 replies
3h18m

How much memory does employing the complete 128k window take, though? I've sadly noticed that it can take a significant amount of VRAM to use a larger context window.

edit: e.g. I wouldn't know the correct parameters for this calculator, but going from 8k window to 128k window goes from 1.5 GB to 23 GB: https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calcul...

pantulis
4 replies
3h29m

Does it have any relation to Nvidia's Nemo? Otherwise, it's unfortunate naming

refulgentis
2 replies
3h5m

Click the link, read the first sentence.

pantulis
1 replies
2h1m

Yeah, not my brightest HN moment, to be honest.

SubiculumCode
0 replies
1h22m

At least you didn't ask about finding a particular fish.

mcemilg
4 replies
2h52m

I believe that if Mistral is serious about advancing open source, they should consider sharing the corpus used for training their models, at least the base models' pretraining data.

wongarsu
3 replies
2h39m

I doubt they could. Their corpus almost certainly is mostly composed of copyrighted material they don't have a license for. It's an open question whether that's an issue for using it for model training, but it's obvious they wouldn't be allowed to distribute it as a corpus. That'd just be regular copyright infringement.

Maybe they could share a list of the content of their corpus. But that wouldn't be too helpful and makes it much easier for all affected parties to sue them for using their content in model training.

gooob
2 replies
2h8m

No, not the actual content, just the titles of the content, like "book title" by "author". The tool simply can't be taken seriously by anyone until they release that information. This is the case for all these models. It's ridiculous, almost insulting.

candiddevmike
0 replies
2h0m

They can't release it without admitting to copyright infringement.

bilbo0s
0 replies
1h28m

Uh..

That would almost be worse. All copyright holders would need to do is search a list of titles if I'm understanding your proposal correctly.

The idea is not to get sued.

k__
4 replies
2h46m

What's the reason for measuring the model size in context window length and not GB?

Also, are these small models OSS? Easier self-hosting seems to be the main benefit of small models.

kaoD
1 replies
2h22m

I suspect you might be confusing the numbers: 12B (which is the very first number they give) is not context length, it's parameter count.

The reason to use parameter count is that the final size in GB depends on quantization. A 12B model at 8-bit parameter width would be 12GB (plus some % overhead), while at 16-bit it would be 24GB.

Context length here is 128k, which is orthogonal to model size. You can notice they specify both parameter count and context size because you need both to characterize an LLM.

It's also interesting to know what parameter width it was trained at, because you cannot get more information by "quantizing wider"; it only makes sense to quantize into a narrower parameter width to save space.

k__
0 replies
1h19m

Ah, yes.

Thanks, I confused those numbers!

yjftsjthsd-h
0 replies
1h3m

Also, are these small models OSS?

From the very first paragraph on the page:

released under the Apache 2.0 license.

simion314
0 replies
2h31m

What's the reason for measuring the model size in context window length and not GB?

Those are 2 different things.

The context window is how many tokens its context can contain. On a big-context model you could put a few books and articles into the context and then start your questions; on a small-context model you can start a conversation and after a short time it will start forgetting the first prompts. A big context will use more memory and will cost performance, but imagine you could give it your entire code project and then ask it questions; so often I know there is already some function that does something, but I can't remember the name.

saberience
2 replies
3h2m

Two questions:

1) Anyone have any idea of VRAM requirements?

2) When will this be available on ollama?

causal
0 replies
2h54m

1) Rule of thumb is # of params = GB at Q8. So a 12B model generally takes up 12GB of VRAM at 8 bit precision.

But 4bit precision is still pretty good, so 6GB VRAM is viable, not counting additional space for context. Usually about an extra 20% is needed, but 128K is a pretty huge context so more will be needed if you need the whole space.

alecco
0 replies
2h48m

The model has 12 billion parameters and uses FP8, so 1 byte each. With some working memory I'd bet you can run it on 24GB.

Designed to fit on the memory of a single NVIDIA L40S, NVIDIA GeForce RTX 4090 or NVIDIA RTX 4500 GPU

minimaxir
2 replies
3h17m

Mistral NeMo uses a new tokenizer, Tekken, based on Tiktoken, that was trained on over more than 100 languages, and compresses natural language text and source code more efficiently than the SentencePiece tokenizer used in previous Mistral models.

Does anyone have a good answer why everyone went back to SentencePiece in the first place? Byte-pair encoding (which is what tiktoken uses: https://github.com/openai/tiktoken) was shown to be a more efficient encoding as far back as GPT-2 in 2019.
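
As an aside, a quick way to see this kind of compression difference (Tekken and SentencePiece aren't in tiktoken, so this only compares two of tiktoken's own BPE vocabularies as an illustration):

    # Different BPE vocabularies turn the same text into different token counts.
    import tiktoken

    text = "def fibonacci(n): return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)"
    for name in ("gpt2", "cl100k_base"):
        enc = tiktoken.get_encoding(name)
        print(name, len(enc.encode(text)))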

zwaps
0 replies
7m

SentencePiece is not a different algorithm to WordPiece or BPE, despite its naming.

One of the main pulls of the SentencePiece library was the pre-tokenization being less reliant on white space and therefore more adaptable to non Western languages.

rockinghigh
0 replies
19m

The SentencePiece library also implements Byte-pair-encoding. That's what the LLaMA models use and the original Mistral models were essentially a copy of LLaMA2.

davidzweig
1 replies
1h6m

Did anyone try to check how its multilingual skills compare with Gemma 2's? On the page, it's compared with Llama 3 only.

moffkalast
0 replies
46m

Well it's not on Le Chat, it's not on LMSys, it has a new tokenizer that breaks llama.cpp compatibility, and I'm sure as hell not gonna run it with Crapformers at 0.1x speed which as of right now seems to be the only way to actually test it out.

p1esk
0 replies
3h8m

It will be interesting to see how it competes with 4o mini.

ofermend
0 replies
2h12m

Congrats. Very exciting to see continued innovation around smaller models that can perform much better than larger ones. This enables faster inference and makes them more ubiquitous.

obblekk
0 replies
2h38m

Worth noting this model has 50% more parameters than Llama 3 8B. There are performance gains, but some of them might come from using more compute rather than better performance per unit of compute.

dpflan
0 replies
2h25m

These big models are getting pumped out like crazy; that is the business of these companies. But basically, it feels like private industry just figured out how to scale up a scalable process (deep learning), and it required not $M research grants but $BB "research grants"/funding. The scaling laws seem to be fun to play with, tweaking more interesting things out of these models and finding cool "emergent" behavior as billions of data points get correlated.

But pumping out models and putting artifacts on HuggingFace, is that a business? What are these models being used for? There is a new one at a decent clip.

bugglebeetle
0 replies
2h40m

Interested in the new base model for fine-tuning. Despite Llama 3 being a better instruct model overall, it's been highly resistant to fine-tuning, either owing to some bugs or to being trained on so much data (there's an ongoing debate about this in the community). Mistral's base models are still best in class for a small model you can specialize.

alecco
0 replies
2h48m

Nvidia has a blogpost about Mistral Nemo, too. https://blogs.nvidia.com/blog/mistral-nvidia-ai-model/

Mistral NeMo comes packaged as an NVIDIA NIM inference microservice, offering performance-optimized inference with NVIDIA TensorRT-LLM engines.

*Designed to fit on the memory of a single NVIDIA L40S, NVIDIA GeForce RTX 4090 or NVIDIA RTX 4500 GPU*, the Mistral NeMo NIM offers high efficiency, low compute cost, and enhanced security and privacy.

The model was trained using Megatron-LM, part of NVIDIA NeMo, with 3,072 H100 80GB Tensor Core GPUs on DGX Cloud, composed of NVIDIA AI architecture, including accelerated computing, network fabric and software to increase training efficiency.

LoganDark
0 replies
2h35m

Is the base model unaligned? Disappointing to see alignment from allegedly "open" models.

I_am_tiberius
0 replies
41m

The last time I tried a Mistral model, it didn't answer most of my questions, because of "policy" reasons. I hope they fixed that. OpenAI at least only tells me that it's a policy issue but still answers most of the time.