The improvements in ease of use for locally hosting LLMs over the last few months have been amazing. I was ranting about how easy https://github.com/Mozilla-Ocho/llamafile is just a few hours ago [1]. Now I'm torn as to which one to use :)
1: Quite literally hours ago: https://euri.ca/blog/2024-llm-self-hosting-is-easy-now/
I've always used `llamacpp -m <model> -p <prompt>`. Works great as my daily driver of Mixtral 8x7b + CodeLlama 70b on my MacBook. Do alternatives have any killer features over Llama.cpp? I don't want to miss any cool developments.
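For reference, my invocation is basically the stock llama.cpp binary with a handful of flags; something like this (model path and numbers are just examples, tune for your hardware):

    # one-shot prompt against Mixtral 8x7B (GGUF, Q4_K_M quant)
    # -ngl 999 offloads as many layers as possible to the GPU/Metal, -c sets context size
    ./main \
      -m ./models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf \
      -c 4096 -n 512 --temp 0.7 -ngl 999 \
      -p "Write a Python function that parses an ISO 8601 timestamp."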
I have found DeepSeek Coder 33B to be better than CodeLlama 70B (personal opinion, though). I think the best part of DeepSeek is that it understands multi-file context better than anything else I've tried.
Same here, I run DeepSeek Coder 33B on my 64GB M1 Max at about 7-8 t/s and it blows away every other model I've tried for coding. It feels like magic and cheating at the same time, getting these lengthy and in-depth answers with Activity Monitor showing 0 network IO.
I tried running DeepSeek 33B using llama.cpp with 16k context and it kept injecting unrelated text. What's your setup that makes it work? Do you use any special CLI flags or prompt format?
I actually use LM Studio with the DeepSeek settings preset that comes with it, except with mlock enabled to keep the model entirely in memory. Works really well.
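If you're driving llama.cpp directly instead, I believe the instruct version wants an Alpaca-style template; roughly this (reproduced from memory, so double-check against the model card on HF):

    # deepseek-coder-33b-instruct roughly expects "### Instruction:" / "### Response:" blocks
    ./main -m deepseek-coder-33b-instruct.Q4_K_M.gguf -c 16384 -n 1024 \
      -p $'You are an AI programming assistant.\n### Instruction:\nRewrite this function without recursion: ...\n### Response:\n'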
How exactly do you use the LLM with multiple files? Do you copy them entirely into the prompt?
With all the models I tried there was quite a bit of fiddling for each one to get the correct command-line flags and a good prompt, or at least to copy-paste a command line from HF. Seems like every model needs its own unique prompt to give good results? I guess that is what the wrappers take care of? Other than that llama.cpp is very easy to use. I even run it on my phone in Termux, but only with a tiny model that is more entertaining than useful for anything.
For the chat models, they're all fine-tuned slightly differently in their prompt format - see Llama's, for example. So having a conversion between the OpenAI API that everyone's used to now and the slightly inscrutable formats of models like Llama is very helpful - though, much like LangChain and its hardcoded prompts everywhere, there's probably some subjectivity here and you may be rewarded by formatting prompts directly.
The slight incompatibilities of prompt formats and style are a nuisance. I have just been looking at Mistral's prompt design documentation and I now feel like I have underutilized Mistral 7B and Mixtral 8x7B: https://docs.mistral.ai/guides/prompting-capabilities/
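Concretely, the instruct versions want each user turn wrapped in [INST] tags; if you feed llama.cpp raw prompts it looks roughly like this (hand-rolled sketch, see their docs for the full multi-turn form):

    # Mistral / Mixtral instruct format: the user message goes inside [INST] ... [/INST]
    ./main -m mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf -c 4096 -n 512 \
      -p '[INST] Summarize the trade-offs between Mistral 7B and Mixtral 8x7B. [/INST]'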
70b is probably going to be a bit slow for most on M-series MBPs (even with enough RAM), but Mixtral 8x7b does really well. Very usable @ 25-30T/s (64GB M1 Max), whereas 70b tends to run more like 3.5-5T/s.
'llama.cpp-based' generally seems like the norm.
Ollama is just really easy to set up & get going on MacOS. Integral support like this means one less thing to wire up or worry about when using a local LLM as a drop-in replacement for OpenAI's remote API. Ollama also has a model library[1] you can browse & easily retrieve models from.
Another project, Ollama-webui[2] is a nice webui/frontend for local LLM models in Ollama - it supports the latest LLaVA for multimodal image/prompt input, too.
[1] https://ollama.ai/library/mixtral
[2] https://github.com/ollama-webui/ollama-webui
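For anyone who hasn't tried it, getting going is basically just this (model name is from their library; pick whatever fits your RAM):

    ollama pull mixtral     # fetch the model from the library
    ollama run mixtral      # interactive chat in the terminal
    ollama serve            # only needed if the background server isn't already running; API on :11434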
Yeah, ollama-webui is an excellent front end, and the team was responsive, fixing a bug I reported within a couple of days.
It's also possible to connect it to the OpenAI API and use GPT-4 on a per-token plan. I've since cancelled my ChatGPT subscription. But 90% of my usage is Mistral 7B fine-tunes; I rarely use OpenAI.
Thanks for that idea, I use Ollama as my main LLM driver, but I still use OpenAI, Anthropic, and Mistral commercial API plans. I access Ollama via a REST API and my own client code, but I will try their UI.
re: cancelling ChatGPT subscription: I am tempted to do this also except I suspect that when they release GPT-5 there may be a waiting list, and I don’t want any delays in trying it out.
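For reference, the Ollama REST call I mentioned is just plain HTTP against the local server; something like this (prompt and model name are just examples):

    # one-shot generation against a local Ollama instance
    curl http://localhost:11434/api/generate -d '{
      "model": "mistral",
      "prompt": "Give me three edge cases to test in a URL parser.",
      "stream": false
    }'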
Based on a day's worth of kicking tires, I'd say no -- once you have a mix that supports your workflow the cool developments will probably be in new models.
I just played around with this tool and it works as advertised, which is cool, but I'm up and running already. (For anyone reading this who, like me, doesn't want to do all the optimization work: I'd see which one is faster on your machine.)
Ollama is an extremely convenient wrapper around llama.cpp.
It separates serving the heavy weights from model definition and usage.
What that means is that the weights of some model, let's say Mixtral, are loaded by the server process (and kept in memory for 5 minutes by default), and you interact with it via a Modelfile (inspired by Dockerfiles). All your Modelfiles that inherit FROM mixtral will reuse the weights already loaded in memory, so you can instantly swap between different system prompts etc. - those appear as normal models to use through the CLI or UI (see the example Modelfile below).
The effect is very low latency and a very good interface, both for the programming API and the UI.
PS: it's not only for Macs.
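To make that concrete, a Modelfile that reuses the already-loaded Mixtral weights with a different system prompt is only a few lines (the persona and names here are made up for illustration):

    # a "code reviewer" persona layered on top of the shared mixtral weights
    cat > Modelfile <<'EOF'
    FROM mixtral
    SYSTEM "You are a strict code reviewer. Point out bugs, edge cases and style issues."
    PARAMETER temperature 0.2
    EOF
    ollama create code-reviewer -f Modelfile
    ollama run code-reviewer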
Open-weight models + llama.cpp (as Ollama) + ollama-webui = the real OpenAI.
From the blog article:
I never tried to run these LLMs on my own machine -- is it this bad?
I guess if I only have a moderate GPU, say a 4060TI, there is no chance I can play with it, then?
I would expect that 4060 Ti to get about 20-25 tokens per second on Mixtral. I can read at roughly 10-15 tokens per second, so anything above that is where I see diminishing returns for a chatbot. Generating a whole blog article might have you sitting and waiting for a minute or so, though (a ~1,500-token post at 25 t/s is about a minute).
It depends on the context window, but my 3090 gets ~60 t/s on smaller windows.
I get 50-60t/s on Mistral 7B on 2080 Ti
Thanks, that sounds much more tolerable than "more than an hour"!
I also have the 16GB version, which I assume would be a little bit better.
You can load a 7B-parameter model quantized at Q4_K_M as GGUF. I don't know Ollama, but you can load it in koboldcpp -- use cuBLAS with 100 GPU layers and a 2048 context and it should all fit into 8GB of VRAM. For quantized models look at TheBloke on Hugging Face -- Mistral 7B is a good one to try.
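If it helps, the koboldcpp invocation for that is roughly the following (flags from memory, check --help; the model file name is just an example):

    # cuBLAS offload, all layers on the GPU, 2048 context to stay within 8GB of VRAM
    python koboldcpp.py mistral-7b-instruct-v0.2.Q4_K_M.gguf \
      --usecublas --gpulayers 100 --contextsize 2048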
If I am not mistaken, layer offloading is a llama.cpp feature so a lot of frontends/loaders that use it also have it. I use it with koboldcpp and text-generation-webui.
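In plain llama.cpp it's the -ngl / --n-gpu-layers flag, and the frontends mostly just pass it through; e.g.:

    # offload 20 of the model's layers to the GPU and keep the rest on the CPU
    ./main -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -ngl 20 -c 2048 -p "Hello"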
On an M3 MacBook Pro with 32GB of RAM, I can comfortably run 34B models like phind-codellama:34b-v2-q8_0.
Unfortunately, having tried this and a bunch of other models, they are all toys compared to GPT-4.
The Apple M1 is very usable with Ollama using 7B-parameter models and is virtually as "fast" as ChatGPT in responding. Obviously not the same quality, but still useful.
I’ve been using Ollama with Mixtral-7B on my MBP for local development and it has been amazing.
I have used it too and am wondering why it starts responding so much faster than other similar-sized models I've tried. It doesn't seem quite as good as some of the others, but it is nice that the responses start almost immediately (on my 2022 MBA with 16 GB RAM).
Does anyone know why this would be?
I've had the opposite experience with Mixtral on Ollama, on an intel linux box with a 4090. It's weirdly slow. But I suspect there's something up with ollama on this machine anyway, any model I run with it seems to have higher latency than vLLM on the same box.
You have to specify the number of layers to put on the GPU with Ollama. Ollama defaults to far fewer layers than what is actually possible.
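If I'm reading the docs right, that's the num_gpu option, which can go in a Modelfile or in the request options (the value below is just an example, not a recommendation):

    # raise the number of GPU-offloaded layers for a Mixtral-based model
    cat > Modelfile <<'EOF'
    FROM mixtral
    PARAMETER num_gpu 33
    EOF
    ollama create mixtral-moregpu -f Modelfile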
To clarify - did you mean Mixtral (8x)7b, or Mistral 7b?
MIXtral (8x)-7B
The pace of progress here is pretty amazing. I loved how easy it is to get llamafile up and running, but I missed a feature-complete chat interface, so I built one based on it: https://recurse.chat/.
I still need GPT-4 for some tasks, but in daily usage it has replaced much of my ChatGPT usage, especially since I can import all of my ChatGPT chat history. Also curious to learn what people want to do with local AI.
My primary use case would be to feed large internal codebases into an LLM with a much larger context window than what GPT-4 offers. Curious what the best options here are, in terms of model choice, speed, and ideas for prompt engineering
Yi-34B-200K might be something to look at.
* https://huggingface.co/01-ai/Yi-34B-200K
Personally I'd recommend Ollama, because they have a good packaging model (Docker-esque Modelfiles) and their API is quite widely supported.
You can also mix models in a single Modelfile; it's a feature I've been experimenting with lately.
Note: you don't have to rely on their model library; you can use your own. Secondly, support for new models comes through their bindings to llama.cpp.
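For example, pointing Ollama at your own GGUF instead of the library is just a local FROM path (path and name here are placeholders):

    # import a locally downloaded GGUF into Ollama
    cat > Modelfile <<'EOF'
    FROM ./models/my-finetune.Q5_K_M.gguf
    EOF
    ollama create my-finetune -f Modelfile
    ollama run my-finetune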
Curious if anyone has any recommendation for what LLM model to use today if you want a code assistant locally. Mistral?
I think it is even easier right now for companies to self-host an inference server with basic RAG support:
- get a Mac Mini or Mac Studio
- run `ollama serve`
- run ollama-webui in Docker
- add a coding assistant model from OllamaHub via the web UI
- upload your documents in the web UI
No code needed, and you have your self-hosted LLM with basic RAG giving you answers with your documents in context. For us the DeepSeek Coder 33B model is fast enough on a Mac Studio with 64GB of RAM and can give pretty good suggestions based on our internal coding documentation.
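For reference, the whole stack boils down to a couple of commands (the web UI image/tag below is from their README at the time, so verify before copying):

    # 1. serve models locally
    ollama serve
    ollama pull deepseek-coder:33b

    # 2. run the web UI in Docker, pointed at the host's Ollama
    docker run -d -p 3000:8080 \
      --add-host=host.docker.internal:host-gateway \
      -v ollama-webui:/app/backend/data \
      --name ollama-webui \
      ghcr.io/ollama-webui/ollama-webui:main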