Pardon me if this is a dumb question, but is it possible for me to download these models onto my computer (I have a 1080ti and a [2|3]070ti) and expose some sort of API interface? That way I can write programs that call this API, which I find appealing.
EDIT: This is a 1W light bulb moment for me, thank you!
Justine Tunney (of redbean fame) is actively working on getting LLMs to run well on CPUs, where RAM is cheap. If successful this would eliminate an enormous bottleneck to running local models. If anyone can do this, she can. (And thank you to Mozilla for financially supporting her work). See https://justine.lol/matmul/ and https://github.com/mozilla-Ocho/llamafile
I think it's mostly the memory bandwidth, though, that makes GPUs so fast with LLMs. My card does about 1TB/s; CPU RAM won't come near that. I'm sure a lot of optimisations can be had, but I think GPUs will still be significantly ahead.
Macs are so good at it because Apple solders the memory onto the SoC package for a really wide, low-latency connection.
This is a good and valid comment. It is difficult to predict the future, but I would be curious what the best-case theoretical performance of an LLM would be on a typical x86 or ARM system with DDR4 or DDR5 RAM. My uneducated guess is that it can be very good, perhaps 50% of the speed of a specialized GPU/RAM device. In practical terms, the CPU approach is required for very large contexts, potentially as large as the lifetime of all interactions you have with your LLM.
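For a rough sense of scale: if token generation is memory-bandwidth-bound (each new token requires streaming all the weights once), then tokens/sec is roughly bandwidth divided by model size. A back-of-envelope sketch in Python, where the bandwidth and model-size figures are illustrative assumptions, not benchmarks:

    # Rough estimate: if decoding is memory-bandwidth-bound, each generated token
    # requires reading the full set of weights once, so
    #   tokens/sec ~= memory_bandwidth / model_size_in_bytes
    # All figures below are illustrative assumptions, not measurements.

    def tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
        """Upper-bound estimate of tokens/sec from bandwidth and model size."""
        return bandwidth_gb_s / model_size_gb

    # A ~12B-parameter model quantized to ~4 bits is roughly 7 GB of weights (assumed).
    model_gb = 7.0

    for name, bw in [
        ("dual-channel DDR5 (~80 GB/s, assumed)", 80.0),
        ("Apple M-series unified memory (~400 GB/s, assumed)", 400.0),
        ("high-end GPU VRAM (~1000 GB/s)", 1000.0),
    ]:
        print(f"{name}: ~{tokens_per_second(bw, model_gb):.0f} tokens/sec upper bound")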
I love that domain name.
I’d probably check https://ollama.com/library?q=Nemo in a couple of days. My guess is that by then ollama will have support for it. And you can then run the model locally on your machine with ollama.
Adding to this: if the default is too slow, look at the more heavily quantized versions of the model; they are smaller, at a moderate cost in output quality. Ollama can split models between GPU and host memory, but the throughput drop-off tends to be pretty severe.
Why would it take a couple days? Is it not a matter of uploading the model to their registry, or are there more steps involved than that?
We're working on it, except that there is a change to the tokenizer which we're still working through in our conversion scripts. Unfortunately we don't get a heads up from Mistral when they drop a model, so sometimes it takes a little bit of time to sort out the differences.
Also, I'm not sure if we'll call it mistral-nemo or nemo yet. :-D
First thing I did when I saw the headline was look for it on ollama, but it hasn't landed there yet: https://ollama.com/library?sort=newest&q=NeMo
We're working on it!
Yes.
If you're on a Mac, check out LM Studio.
It's a UI that lets you load and interact with models locally. You can also wrap your model in an OpenAI-compatible API and interact with it programmatically.
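To answer the "call it from my own programs" part: LM Studio's local server speaks the OpenAI chat-completions format, so the standard openai Python client works against it. A minimal sketch, assuming the server is running on its default port (1234) and a model is already loaded in the UI; the model name below is a placeholder:

    # Minimal sketch: talk to a local LM Studio server through the OpenAI client.
    # Assumes LM Studio's local server is running (default http://localhost:1234/v1)
    # and a model is already loaded in the UI.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:1234/v1",  # assumed default local endpoint
        api_key="not-needed",                 # the local server ignores the key
    )

    response = client.chat.completions.create(
        model="local-model",  # placeholder; use the identifier LM Studio shows
        messages=[{"role": "user", "content": "Summarize what an LLM is in one sentence."}],
    )
    print(response.choices[0].message.content)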
llama.cpp and ollama both have APIs for most models
llama.cpp supports multi-GPU across a local network: https://www.reddit.com/r/LocalLLaMA/comments/1cyzi9e/llamacp...
and they expose an OpenAI-compatible server, or you can use their Python bindings
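A minimal sketch of the Python-bindings route (llama-cpp-python), assuming a GGUF model file has already been downloaded; the model path is a placeholder:

    # Minimal sketch using the llama-cpp-python bindings directly (no server needed).
    # Assumes a GGUF model file has already been downloaded; the path is a placeholder.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/your-model.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=-1,  # offload as many layers as fit in VRAM; use 0 for CPU-only
        n_ctx=4096,       # context window
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Say hello in five words."}],
        max_tokens=64,
    )
    print(out["choices"][0]["message"]["content"])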
Try LM Studio or Ollama. Load up the model, and there you go.
AFAIK, Ollama supports most of these models locally and will expose a REST API[0]
[0]: https://github.com/ollama/ollama/blob/main/docs/api.md
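A minimal sketch of calling that REST API with plain requests, assuming the Ollama server is running on its default port (11434) and the model has already been pulled; the model name below is a placeholder:

    # Minimal sketch: call the local Ollama REST API directly.
    # Assumes the Ollama server is running (default port 11434) and the model
    # has already been pulled; the model name is a placeholder.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "mistral",  # placeholder; use whatever `ollama pull` fetched
            "prompt": "Explain RAM bandwidth in one sentence.",
            "stream": False,     # return one JSON object instead of a stream
        },
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json()["response"])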
You will need enough VRAM; a 1080ti is not going to work very well. Maybe get a 3090 with 24GB of VRAM.
I think it should also run well on a 36GB MacBook Pro, or probably a 24GB MacBook Air.