The improvements in ease of use for locally hosting LLMs over the last few months have been amazing. I was ranting about how easy https://github.com/Mozilla-Ocho/llamafile is just a few hours ago [1]. Now I'm torn as to which one to use :)
1: Quite literally hours ago: https://euri.ca/blog/2024-llm-self-hosting-is-easy-now/
I've always used `llamacpp -m <model> -p <prompt>`. Works great as my daily driver of Mixtral 8x7b + CodeLlama 70b on my MacBook. Do alternatives have any killer features over Llama.cpp? I don't want to miss any cool developments.
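For reference, my invocation is basically the stock llama.cpp binary with a handful of flags; something like this (model path and numbers are just examples, tune for your hardware):

    # one-shot prompt against Mixtral 8x7B (GGUF, Q4_K_M quant)
    # -ngl 999 offloads as many layers as possible to the GPU/Metal, -c sets context size
    ./main \
      -m ./models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf \
      -c 4096 -n 512 --temp 0.7 -ngl 999 \
      -p "Write a Python function that parses an ISO 8601 timestamp."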
I have found DeepSeek Coder 33B to be better than CodeLlama 70B (personal opinion, though). I think the best part of DeepSeek is that it understands multi-file context better than anything else I've tried.
Same here, I run DeepSeek Coder 33B on my 64GB M1 Max at about 7-8 t/s and it blows away every other model I've tried for coding. It feels like magic and cheating at the same time, getting these lengthy and in-depth answers with Activity Monitor showing 0 network IO.
I tried running DeepSeek 33B using llama.cpp with 16k context and it kept injecting unrelated text. What's your setup that makes it work? Do you use any special CLI flags or prompt format?
I actually use LM Studio with the DeepSeek settings preset that comes with it, except with mlock enabled to keep the model entirely in memory. Works really well.
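If you're driving llama.cpp directly instead, I believe the instruct version wants an Alpaca-style template; roughly this (reproduced from memory, so double-check against the model card on HF):

    # deepseek-coder-33b-instruct roughly expects "### Instruction:" / "### Response:" blocks
    ./main -m deepseek-coder-33b-instruct.Q4_K_M.gguf -c 16384 -n 1024 \
      -p $'You are an AI programming assistant.\n### Instruction:\nRewrite this function without recursion: ...\n### Response:\n'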
How exactly do you use the LLM with multiple files? Do you copy them entirely into the prompt?
With all the models I tried there was quite a bit of fiddling for each one to get the correct command-line flags and a good prompt, or at least to copy-paste a command line from HF. Seems like every model needs its own unique prompt to give good results? I guess that is what the wrappers take care of? Other than that llama.cpp is very easy to use. I even run it on my phone in Termux, but only with a tiny model that is more entertaining than useful for anything.
For the chat models, they're all fine-tuned slightly differently in their prompt format - see Llama's, for example. So having a conversion between the OpenAI API that everyone's used to now and the slightly inscrutable formats of models like Llama is very helpful - though, much like LangChain and its hardcoded prompts everywhere, there's probably some subjectivity here and you may be rewarded by formatting prompts directly.
The slight incompatibilities of prompt formats and style are a nuisance. I have just been looking at Mistral's prompt design documentation and I now feel like I have underutilized Mistral 7B and Mixtral 8x7B: https://docs.mistral.ai/guides/prompting-capabilities/
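Concretely, the instruct versions want each user turn wrapped in [INST] tags; if you feed llama.cpp raw prompts it looks roughly like this (hand-rolled sketch, see their docs for the full multi-turn form):

    # Mistral / Mixtral instruct format: the user message goes inside [INST] ... [/INST]
    ./main -m mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf -c 4096 -n 512 \
      -p '[INST] Summarize the trade-offs between Mistral 7B and Mixtral 8x7B. [/INST]'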
70b is probably going to be a bit slow for most on M-series MBPs (even with enough RAM), but Mixtral 8x7b does really well. Very usable @ 25-30T/s (64GB M1 Max), whereas 70b tends to run more like 3.5-5T/s.
'llama.cpp-based' generally seems like the norm.
Ollama is just really easy to set up & get going on MacOS. Integral support like this means one less thing to wire up or worry about when using a local LLM as a drop-in replacement for OpenAI's remote API. Ollama also has a model library[1] you can browse & easily retrieve models from.
Another project, Ollama-webui[2] is a nice webui/frontend for local LLM models in Ollama - it supports the latest LLaVA for multimodal image/prompt input, too.
[1] https://ollama.ai/library/mixtral
[2] https://github.com/ollama-webui/ollama-webui
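For anyone who hasn't tried it, getting going is basically just this (model name is from their library; pick whatever fits your RAM):

    ollama pull mixtral     # fetch the model from the library
    ollama run mixtral      # interactive chat in the terminal
    ollama serve            # only needed if the background server isn't already running; API on :11434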
Yeah, ollama-webui is an excellent front end, and the team was responsive, fixing a bug I reported within a couple of days.
It's also possible to connect it to the OpenAI API and use GPT-4 on a per-token plan. I've since cancelled my ChatGPT subscription. But 90% of my usage is Mistral 7B fine-tunes; I rarely use OpenAI.
Thanks for that idea, I use Ollama as my main LLM driver, but I still use OpenAI, Anthropic, and Mistral commercial API plans. I access Ollama via a REST API and my own client code, but I will try their UI.
re: cancelling ChatGPT subscription: I am tempted to do this also except I suspect that when they release GPT-5 there may be a waiting list, and I don’t want any delays in trying it out.
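For reference, the Ollama REST call I mentioned is just plain HTTP against the local server; something like this (prompt and model name are just examples):

    # one-shot generation against a local Ollama instance
    curl http://localhost:11434/api/generate -d '{
      "model": "mistral",
      "prompt": "Give me three edge cases to test in a URL parser.",
      "stream": false
    }'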
Based on a day's worth of kicking tires, I'd say no -- once you have a mix that supports your workflow the cool developments will probably be in new models.
I just played around with this tool and it works as advertised, which is cool, but I'm up and running already. (For anyone reading this who, like me, doesn't want to do all the optimization work: I'd see which one is faster on your machine.)
Ollama is an extremely convenient wrapper around llama.cpp.
It separates serving the heavy weights from model definition and usage.
What that means is that the weights of some model, let's say Mixtral, are loaded by the server process (and kept in memory for 5 minutes by default), and you interact with it via a Modelfile (inspired by Dockerfiles). All your Modelfiles that inherit FROM mixtral will reuse the weights already loaded in memory, so you can instantly swap between different system prompts etc. - those appear as normal models to use through the CLI or UI (see the example Modelfile below).
The effect is very low latency and a very good interface, both for the programming API and the UI.
PS: it's not only for Macs.
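To make that concrete, a Modelfile that reuses the already-loaded Mixtral weights with a different system prompt is only a few lines (the persona and names here are made up for illustration):

    # a "code reviewer" persona layered on top of the shared mixtral weights
    cat > Modelfile <<'EOF'
    FROM mixtral
    SYSTEM "You are a strict code reviewer. Point out bugs, edge cases and style issues."
    PARAMETER temperature 0.2
    EOF
    ollama create code-reviewer -f Modelfile
    ollama run code-reviewer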
Open-weight models + llama.cpp (as Ollama) + ollama-webui = the real OpenAI.
From the blog article:
I never tried to run these LLMs on my own machine -- is it this bad?
I guess if I only have a moderate GPU, say a 4060TI, there is no chance I can play with it, then?
I would expect that 4060 Ti to get about 20-25 tokens per second on Mixtral. I can read at roughly 10-15 tokens per second, so anything above that is where I see diminishing returns for a chatbot. Generating a whole blog article might have you sitting and waiting for a minute or so, though (a ~1,500-token post at 25 t/s is about a minute).
It depends on the context window, but my 3090 gets ~60 t/s on smaller windows.
I get 50-60t/s on Mistral 7B on 2080 Ti
Thanks, that sounds much more tolerable than "more than an hour"!
I also have the 16GB version, which I assume would be a little bit better.
You can load a 7B-parameter model quantized at Q4_K_M as GGUF. I don't know Ollama, but you can load it in koboldcpp -- use cuBLAS with 100 GPU layers and a 2048 context and it should all fit into 8GB of VRAM. For quantized models look at TheBloke on Hugging Face -- Mistral 7B is a good one to try.
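If it helps, the koboldcpp invocation for that is roughly the following (flags from memory, check --help; the model file name is just an example):

    # cuBLAS offload, all layers on the GPU, 2048 context to stay within 8GB of VRAM
    python koboldcpp.py mistral-7b-instruct-v0.2.Q4_K_M.gguf \
      --usecublas --gpulayers 100 --contextsize 2048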
If I am not mistaken, layer offloading is a llama.cpp feature so a lot of frontends/loaders that use it also have it. I use it with koboldcpp and text-generation-webui.
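In plain llama.cpp it's the -ngl / --n-gpu-layers flag, and the frontends mostly just pass it through; e.g.:

    # offload 20 of the model's layers to the GPU and keep the rest on the CPU
    ./main -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -ngl 20 -c 2048 -p "Hello"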
On an M3 MacBook Pro with 32GB of RAM, I can comfortably run 34B models like phind-codellama:34b-v2-q8_0.
Unfortunately, having tried this and a bunch of other models, they are all toys compared to GPT-4.
The Apple M1 is very usable with Ollama using 7B-parameter models and is virtually as "fast" as ChatGPT in responding. Obviously not the same quality, but still useful.
I’ve been using Ollama with Mixtral-7B on my MBP for local development and it has been amazing.
I have used it too and am wondering why it starts responding so much faster than other similar-sized models I've tried. It doesn't seem quite as good as some of the others, but it is nice that the responses start almost immediately (on my 2022 MBA with 16 GB RAM).
Does anyone know why this would be?
I've had the opposite experience with Mixtral on Ollama, on an intel linux box with a 4090. It's weirdly slow. But I suspect there's something up with ollama on this machine anyway, any model I run with it seems to have higher latency than vLLM on the same box.
You have to specify the number of layers to put on the GPU with Ollama. Ollama defaults to far fewer layers than what is actually possible.
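If I'm reading the docs right, that's the num_gpu option, which can go in a Modelfile or in the request options (the value below is just an example, not a recommendation):

    # raise the number of GPU-offloaded layers for a Mixtral-based model
    cat > Modelfile <<'EOF'
    FROM mixtral
    PARAMETER num_gpu 33
    EOF
    ollama create mixtral-moregpu -f Modelfile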
To clarify - did you mean Mixtral (8x)7b, or Mistral 7b?
MIXtral (8x)-7B
The pace of progress here is pretty amazing. I loved how easy it is to get llamafile up and running, but I missed a feature-complete chat interface, so I built one based on it: https://recurse.chat/.
I still need GPT-4 for some tasks, but in daily usage it has replaced much of my ChatGPT usage, especially since I can import all of my ChatGPT chat history. Also curious to learn what people want to do with local AI.
My primary use case would be to feed large internal codebases into an LLM with a much larger context window than what GPT-4 offers. Curious what the best options here are, in terms of model choice, speed, and ideas for prompt engineering
Yi-34B-200K might be something to look at.
* https://huggingface.co/01-ai/Yi-34B-200K
Personally I'd recommend Ollama, because they have a good packaging model (Docker-esque Modelfiles) and their API is quite widely supported.
You can also mix models in a single Modelfile; it's a feature I've been experimenting with lately.
Note: you don't have to rely on their model library; you can use your own. Secondly, support for new models comes through their bindings to llama.cpp.
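For example, pointing Ollama at your own GGUF instead of the library is just a local FROM path (path and name here are placeholders):

    # import a locally downloaded GGUF into Ollama
    cat > Modelfile <<'EOF'
    FROM ./models/my-finetune.Q5_K_M.gguf
    EOF
    ollama create my-finetune -f Modelfile
    ollama run my-finetune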
Curious if anyone has any recommendation for what LLM model to use today if you want a code assistant locally. Mistral?
I think it is even easier right now for companies to self-host an inference server with basic RAG support:
- get a Mac Mini or Mac Studio
- run `ollama serve`
- run ollama-webui in Docker
- add a coding assistant model from OllamaHub via the web UI
- upload your documents in the web UI
No code needed, and you have your self-hosted LLM with basic RAG giving you answers with your documents in context. For us the DeepSeek Coder 33B model is fast enough on a Mac Studio with 64GB of RAM and can give pretty good suggestions based on our internal coding documentation.
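For reference, the whole stack boils down to a couple of commands (the web UI image/tag below is from their README at the time, so verify before copying):

    # 1. serve models locally
    ollama serve
    ollama pull deepseek-coder:33b

    # 2. run the web UI in Docker, pointed at the host's Ollama
    docker run -d -p 3000:8080 \
      --add-host=host.docker.internal:host-gateway \
      -v ollama-webui:/app/backend/data \
      --name ollama-webui \
      ghcr.io/ollama-webui/ollama-webui:main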