I'm not sure why Ollama garners so much attention. Its value is limited: it's useful only for experimenting with models and can't serve more than one model at a time. It's not meant for production deployments. Granted, it makes experimentation super easy, but for something that relies entirely on llama.cpp and whose main value proposition is easy model management, I'm not sure it deserves the brouhaha people are giving it.
Edit: what do you do after the initial experimentation? You need to deploy these models to production eventually. I'm not even talking about giving credit to llama.cpp; I'm just saying this product is getting disproportionate attention and kudos compared to the value it delivers. Not denying that it's a great product.
The answer to your question is:
`ollama run mixtral`
That's it. You're running a local LLM. I have no clue how to run llama.cpp.
I got Stable Diffusion running, and it was painful. I wish there were something like ollama for it.
The README is pretty clear, though it covers a lot of optional steps you don't need. It's essentially gonna be something like this (the exact repo URL and model file below are just examples):
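```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# grab a quantized .gguf model, e.g. from Hugging Face (file name varies by model and quant)
wget https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf
./main -m mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf -p "Hello"
```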
This shows the value ollama provides: I only need to know the model name and then run a single command.
It should be fairly obvious that one can find alternative models and use them in the above command too.
Look, I’m not arguing that a prebuilt binary that handles model downloading has no value over a source build and manually pulling down gguf files. I just want to dispel some of the mystery.
Local LLM execution doesn’t require some mysterious voodoo that can only be done by installing and running a server runtime. It’s just something you can do by running code that loads a model file into memory and feeds tokens to it.
More programmers should be looking at llama.cpp language bindings than at Ollama's implementation of the OpenAI API.
I'd rather focus on building on top of LLMs than going lower level.
Ollama makes that super easy. I tried llama.cpp first and hit build issues; Ollama worked out of the box.
Sure.
Just be aware that there’s a lot of expressive difference between building on top of an HTTP API vs on top of a direct interface to the token sampler and model state.
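To make that concrete, the HTTP side is basically request/response against a local server. Here's a rough sketch against Ollama's default local endpoint (the model name and prompt are just examples; check the API docs for the exact shape):

```
import json
import urllib.request

# Minimal call to a locally running Ollama server (default port 11434).
payload = {"model": "mixtral", "prompt": "Why is the sky blue?", "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])
```

Text in, text out. What you don't get is any handle on the sampler, the KV cache, or the logits between tokens.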
I'm aware; I don't need that amount of sophistication yet.
Python seems to be the way to go deeper, though. Is there a good reason I should be aware of to pick llama.cpp over Python?
Python’s as good a choice as any for the application layer. You’re either going to be using PyTorch or llama-cpp-python to get the CUDA stuff working - both rely on native compiled C/C++ code to access GPUs and manage memory at the scale needed for LLMs. I’m not actually up to speed on the current state of the game there but my understanding is that llama.cpp’s less generic approach has allowed it to focus on specifically optimizing performance of llama-style LLMs.
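If you do want the lower-level route from Python, the llama-cpp-python bindings look roughly like this (a sketch only; the model path, context size, and sampling parameters are placeholders):

```
from llama_cpp import Llama  # pip install llama-cpp-python

# Point model_path at whatever .gguf file you downloaded.
llm = Llama(
    model_path="./mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload as many layers as the backend (CUDA/Metal) allows
)

# High-level completion call: sampling parameters are under your control per call.
out = llm(
    "Q: Name the planets in the solar system. A:",
    max_tokens=128,
    temperature=0.2,
    stop=["Q:"],
)
print(out["choices"][0]["text"])

# You also get the model's own tokenizer, which an HTTP API typically hides.
print(llm.tokenize(b"hello world"))
```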
I've seen more of the model fiddling, like logits restrictions and layer dropping, implemented in Python, which is why I ask.
Most of AI has centralized around Python, and I see more of my code moving that way; for instance, I'm using LlamaIndex as my primary interface now, which supports ollama and many other model loaders / APIs.
There are 5 commands in that README two comments up, and 4 of them can reasonably fail (I'll give cd high marks for reliability). `make` especially is a minefield and usually involves a half-hour of searching the internet and figuring out which dependencies are a problem today. And that is all assuming someone is comfortable with compiled languages; I'd hazard most devs these days are from JS land and don't know how to debug make.
Finding the correct model weights is also a challenge in my experience: there are a lot of alternatives, and it is often difficult to figure out what the differences are and whether they matter.
The README is clear that I'm probably about to lose an hour debugging if I follow it. It might be one of those rare cases where it works the first time, but that is the exception, not the rule.
Your mileage may vary. It runs first time for me on an Apple Silicon Mac.
And what will you do after trying it? Sure, you saved a few mins in trying out a model or models. What next?
I focus on building the application rather than figuring out someone else's preferred method for how I should work?
I use Docker Compose locally, Kubernetes in the cloud
I run in hot-reload locally, I build for production
I often nuke my database locally, but I run it HA in production
It is very rare to use the same technology locally (or the same way) as in production
Relax. Not everything in this world was built exactly for you. You almost seem to have a problem with this.
There is no "next", there is a whole world of people running LLMs locally on their computer and they are far more likely to switch between models on a whim every few days.
The first 3 steps GP provided are literally just the steps for installation. The "value" you mentioned is just a packaged installer (or, in the case of Linux, apparently a `curl | sh` -- and I'd much prefer the git clone version).
On multiple occasions I've been modifying llama.cpp code directly and recompiling for my own purposes. If you're using ollama on the command line, I'd say having the option to easily do that is much more useful than saving a couple commands upon installation.
When I get to the point of modification, I will go with Python. This is where the AI ecosystem is largely at
I stopped using C++ when Go came out, no interest in ever having to write it again.
This will likely build a version without GPU acceleration, I think?
It builds with Metal support on my M2 Mac.
Last time I tried llama.cpp I got errors when running make that were way too time-consuming to bother tracking down.
It's probably a simple build if everything is how it wants it, but it wasn't on my machine, whereas running ollama just worked.
Compared to “ollama pull mixtral”? And then actually using the thing is easier as well.
The average user isn't going to compile llama.cpp. They will either download a fully integrated application that contains llama.cpp and can read gguf files directly, like kobold.cpp, or they will use an arbitrary front end like Silly Tavern, which needs to connect to an inference server via an API; ollama is one of the easier inference servers to install and use.
For us this may seem like a walk in the park.
For non-technical people, there is a good chance their OS doesn't have git, wget, and a C++ compiler (especially on Windows).
This is just like the Dropbox case years ago.
On a mac, https://drawthings.ai is the ollama of Stable Diffusion.
Check out EasyDiffusion.
For me, ComfyUI made the process of installing and playing with SD about as simple as a Windows installer.
I mean, it takes something difficult like an LLM and makes it easy to run. It's bound to get attention. If you've tried to get other models, like BERT-based models, to run, you'll realize just how big the usability gains from ollama are compared to anything else in the space.
If the question you're asking is why so many folks are focused on experimentation instead of productionizing these models, then I see where you're coming from. There's the question of how much LLMs are actually being used in prod scenarios right now as opposed to just excited people chucking things at them; that maybe LLMs are more just fun playthings than tools for production. But in my experience as HN has gotten bigger, the number of posters talking about productionizing anything has really gone down. I suspect the userbase has become more broadly "interested in software" rather than "ships production facing code" and the enthusiasm in these comments reflects those interests.
FWIW we use some LLMs in production and we do not use ollama at all. Our prod story is very different than what folks are talking about here and I'd love to have a thread that focuses more on language model prod deployments.
Well, you would be one of the few hundred people on the planet doing that. With local LLMs we're just trying to create a way for everyone else to use AI that doesn't require sharing all their data with the model providers. The first thing everyone asks for, of course, is how to turn the open source local LLMs into their own online service.
Few hundred on the planet? Are you kidding me? We're asking enterprises to run LLMs on-premise (I'm intentionally discounting the cloud scenario, where the traffic rates are much higher). That's way more than a few hundred, and sorry to break it to you, but Ollama is just not going to cut it.
No need to be angry about this. Tech folks should be discussing this collectively and collaboratively. There's space for everything from local models running on smartphones all the way up to OpenAI style industrialized models. Back when social networks were first coming out, I used to read lots of comments about deploying and running distributed systems. I remember reading early incident reports about hotspotting and consistent hashing and TTL problems and such. We need to foster more of that kind of conversation for LMs. Sadly right now Xitter seems to be the best place for that.
Not angry. Having a discussion :-). It just amazes me how the HN crowd is more than happy to try out a model on their machine and call it a day, without seeing the real picture ahead. Let's ignore perf concerns for a moment. Let's say I want to run it on a shared server in the enterprise network so that any application can make use of it. Each application might want to use a model of its choosing. Ollama will unload and load models as each new request arrives. Not sure if folks here are realizing this :-)
I’m not sure you’re capable of understanding that your needs and requirements are just that, yours.
Ollama's purpose and usefulness is clear. I don't think anyone is disputing that nor the large usability gains ollama has driven. At least I'm not.
As far as being one of the few hundred on the planet, well yeah that's why I'm on HN. There's tons of publications and subreddits and fora for generic tech conversation. I come here because I want to talk about the unknowns.
Your knowns are unknowns to some people and vice versa. This is a great strength of HN; on a whole lot of subjects you'll find people ranging from enthusiastic to expert. There are probably subreddits or discord servers tailored to narrow niches and that's cool, but HN is not that. They are complementary, if anything. In contrast, HN is much more interesting and has a much better S/N ratio than generic tech subreddits; it's not even comparable.
I've been using this site since 2008; this is my second account, from 2009. HN very much used to be a niche tech site. I realize that for newer users, the appeal is like r/programming or r/technology but with a more text-oriented interface or higher SNR or whatever, but this is a shift in audience, and there are folks like me on this site who still want to use it for niche content.
There are still threads where people do discuss gory details, even if the topics aren't technical. A lot of the mapping articles on the site bring out folks with deep knowledge about mapping stacks. Alternate energy threads do it too. It can be like that for LLMs also, but the user base has to want this site to be more than just Yet Another Tech News Aggregator thread.
Lately I've come to realize that the current audience wants YATNA more than deep discussion, so I modulate my time here accordingly. The LLM threads bring me here because experts like jart chime in.
I did not really like HN back in the day because it felt too startup-y, but maybe I got a wrong impression. I much preferred Ars Technica and their forum (now, Ars is much less compelling).
I think it depends on the stories. Different subjects have different demographics, and I have almost completely stopped reading physics stuff because it is way too Reddit-like (and full of confident people asserting embarrassingly wrong facts). I can see how you could feel about fields closer to your interests being dumbed down by over-confident non-specialists.
There are still good, highly technical discussions, but it is true that the home page is a bit limited and inefficient to find them.
Running more than one is easy: put them behind a load balancer, with one ollama instance per container or per port.
That is still one model per instance of Ollama, right?
Yes, not sure you can do better than that. You still cannot have one instance of an LLM in (GPU) memory answering two queries at the same time.
Of course you can support concurrent requests. But Ollama doesn't support it, and it's not meant for this purpose, and that's perfectly OK. That's not the point, though. For fast/perf scenarios, you're better off with vllm.
Thanks! This is great to know.
FWIW, Ollama has no concurrency support even though llama.cpp's server component (the thing that Ollama actually uses) supports it. Besides, you can't have more than one model running, and unloading and loading models is not free. Again, there's a lot more to it, and much of the real optimization work is not in Ollama; it's in llama.cpp, which is completely ignored in this equation.
Thanks! Great to know. I did not know llama.cpp could do this. It should be pretty straightforward to support; not sure why they would not do it.
I'm pretty sure their primary focus right now is to gain as much mindshare as possible and they seem to be doing a great job of it. If you look at the following GitHub metrics:
https://devboard.gitsense.com/ggerganov?r=ggerganov%2Fllama....
https://devboard.gitsense.com/ollama?r=ollama%2Follama&nb=tr...
The number of people engaging with ollama is twice that of llama.cpp. And there hasn't been a dip in people engaging with Ollama in the past 6 months. However, what I do find interesting with regards to these two projects is the number of merged pull requests. If you click on the "Groups" tab and look at "Hooray", you can see llama.cpp had 72 contributors with one or more merged pull requests vs 25 for Ollama.
For Ollama, people are certainly more interested in commenting and raising issues. Compare this to llama.cpp, where the number of people contributing code changes is double that of Ollama.
I know llama.cpp is VC funded, and if they don't focus on making llama.cpp as easy to use as Ollama, they may find themselves doing all the hard stuff while Ollama reaps all the benefits.
Full Disclosure: The tool that I used is mine.
I am probably not the demographic you expect. I don't do "production" in that sense, but I have ollama running quite often when I am working, as I use it for RAG and as a fancy knowledge-extraction engine. It is incredibly useful:
- I can test a lot of models by just pulling them (very useful as progress is very fast),
- using their command line is trivial,
- the fact that it keeps running in the background means that it starts once every few days and stays out of the way,
- it integrates nicely with langchain (and a host of other libraries), which means that it is easy to set up some sophisticated process and abstract away the LLM itself (see the sketch below).
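To give a flavour of that last point, the langchain wiring is roughly this (a sketch; the model name and prompt are placeholders):

```
from langchain_community.llms import Ollama

# Talks to the local Ollama server; any model pulled with `ollama pull` works here.
llm = Ollama(model="mixtral")
print(llm.invoke("Summarise the following abstract in two sentences: ..."))
```

Swapping the model is a one-word change, which is exactly why testing lots of models is so cheap.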
I just keep using it. And for now, I keep tweaking my scripts but I expect them to stabilise at some point, because I use these models to do some real work, and this work is not monkeying about with LLMs.
For me, there is nothing that comes close in terms of integration and convenience. The value it delivers is great, because it enables me to do some useful work without wasting time worrying about lower-level architecture details. Again, I am probably not in the demographics you have in mind (I am not a CS person and my programming is usually limited to HPC), but ollama is very useful to me. Its reputation is completely deserved, as far as I am concerned.
Curious, can you share more details about your use case?
The use case is exploratory literature review in a specific scientific field.
I have a setup that takes PDFs and does some OCR and layout detection with Amazon, and then bunches them together with some internal reports. Then I have a pipeline to write summaries of each document, and another one to slice them into chunks, get embeddings, and set up a vector store for a RAG chatbot. At the moment it's using Mixtral and the command line, but I like being able to swap LLMs to experiment with different models and quantisation without hassle, and I more or less plan to set this up on a remote server to free some resources on my workstation, so the web UI could come in handy. Running this locally is a must for confidentiality reasons. I'd like to get rid of Textract as well, but unfortunately I haven't found a solution that's even close. Tesseract in particular was very disappointing.
Try ollama webui (now open-webui). Sorry, I'm on my phone now, so no links.
Even for just running a model locally, Ollama provided a much simpler "one click install" earlier than most tools. That in itself is worth the support.
Koboldcpp is also very, very good: plug and play, a very complete web UI, a nice little API with SSE text streaming, Vulkan accelerated, and it has an AMD fork...
I am 100% uninterested in your production deployment of rent seeking behavior for tools and models I can run myself. Ollama empowers me to do more of that easier. That’s why it’s popular.
OP’s point is more that Ollama isn’t what’s doing the empowering. Llama.cpp is.
That's the answer to your question. It may have less space than a Zune, but the average person doesn't care about technically superior alternatives that are much harder to use.
*Nomad
And lame.
It's nice for personal use, which is what I think it was built for, and it has some nice frontend options too. The tooling around it is nice, and there are projects building in RAG etc. I don't think people are intending to deploy services through these tools.
You are not making any sense. I am running ollama and Open WebUI (which takes care of auth) in production.
Ollama is the Docker of LLMs. Ollama made it _very_ easy to run LLMs locally. This is surprisingly not as easy as it seems, and incredibly useful.
As it turns out, making it faster and better to manage things tends to get people’s attention. I think it’s well deserved.
In my opinion, pre-built binaries and an easy-to-use front-end are things that should exist and are valid as a separate project unto themselves (see, e.g., HandBrake vs ffmpeg).
Using the name of the authors or the project you're building on can also read like an endorsement, which is not _necessarily_ desirable for the original authors (it can lead to ollama bugs being reported against llama.cpp instead of to the ollama devs and other forms of support request toil). Consider the third clause of BSD 3-Clause for an example used in other projects (although llama.cpp is licensed under MIT).
Interestingly, Ollama is not popular at all in the "localllama" community (which also extends to related discords and repos).
And I think that's because of capabilities... Ollama is somewhat restrictive compared to other frontends. I have a litany of reasons I personally wouldn't run it over exui or koboldcpp, both for performance and output quality.
That's a necessary consequence of being stable and one-click, though.
Sounds like you are dismissing Ollama as a "toy".
Refer:
https://paulgraham.com/startupideas.html
For me, all the projects that enable running and fine-tuning LLMs locally, like llama.cpp, ollama, open-webui, unsloth, etc., play a very important part in democratizing AI.
I built GaitAnalyzer[1] to analyze my gait on my laptop; I had deployed it briefly in production when I had enough credits to foot the AWS GPU bills. Ollama made it very simple to deploy the application. Anyone who has used docker before can now run GaitAnalyzer on their computer.
[1] https://github.com/abishekmuthian/gaitanalyzer