As usual, no AMD GPU support mentioned. What a sad state of affairs; I regret going with AMD this time.
What is the rationale for so many of these ‘run it locally’ AI ports to run as a server?
Have developers forgotten that it’s actually possible to run code inside your UI process?
We see the same thing with stable diffusion runners as well as LLM hosts.
I don’t like running background services locally if I don’t need to. Why do these implementations all seem to operate that way?
This is a really interesting question. I think there's definitely a world for both deployment models. Maybe a good analogy is database engines: both SQLite (a library) and Postgres (a long-running service) have widespread use cases with tradeoffs.
But these are typically filling the use cases of productivity applications, not ‘engines’.
Microsoft Word doesn’t run its grammar checker as an external service and shunt JSON over a localhost socket to get spelling and style suggestions.
Photoshop doesn’t install a background service to host filters.
The closest pattern I can think of is the ‘language servers’ model used by IDEs to handle autosuggest - see https://microsoft.github.io/language-server-protocol/ - but the point of that is to enable many-to-many interop: multiple languages supporting multiple IDEs. Is that the expected use case for local language assistants and image generators?
Funny choice of example. You’ve always been able to use Word as a remote spellchecker over COM, and as of Windows 8, spellchecking is available system wide and runs in a separate process (again over COM) for sandboxing reasons.
JSON over TCP is perhaps a silly IPC mechanism for local services, but this kind of composition doesn’t seem unreasonable to me.
use Word as a remote spellchecker over COM
That's not how COM works. You can load Word's spellchecker into your process.
Windows added a spellchecking API in Windows 8. I've not dug into the API in detail, but don't see any indication that spellchecker providers run in a separate process (you can probably build one that works that way, but it's not intrinsic to the provider model).
Are you not familiar with out of process COM servers? A lot of Office automation is out of process, even inside of Office itself. Admittedly I’m not sure about the grammar checker specifically.
As for the Spellcheck API, external providers are explicitly out of proc: https://learn.microsoft.com/en-us/windows/win32/intl/about-t...
Anyway, my point still stands - building desktop apps using composition over RPC is neither new nor a bad idea, although HTTP might not be the best RPC mechanism (although… neither was COM…)
The language server pattern is actually a very good comparison. The web service + web UI approach enables you to use different local and/or cloud AI services. That is why most of these servers/services support the OpenAI API.
Which means most of these servers limit themselves to the capabilities exposed by the OpenAI API.
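For a concrete picture of what that looks like: recent Ollama builds expose an OpenAI-compatible endpoint on the default port, so (roughly, and assuming you have already pulled the llama2 model) any OpenAI-speaking client can be pointed at it:

curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "llama2", "messages": [{"role": "user", "content": "say hello"}]}'

Swap the base URL in an existing frontend and it talks to the local server instead of api.openai.com, which is exactly the interop being described.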
It doesn’t make sense to load the weights on the fly; that is gigabytes of memory that has to be shuffled around. Instead, you have a long-running process that serves up lots of predictions.
(edit: someday soon, probably to multiple clients too!)
So better to have GiBs of memory consumed by it constantly?
If you don’t have that memory to spare you can’t run this locally anyways, and keeping it in memory is the only way to have a fast experience. Paying the model loading cost repeatedly sucks.
Why would linking llama.cpp into a UI application lead to incurring the model loading cost repeatedly?
It would be loaded repeatedly if the ui is opened and closed repeatedly. You can achieve the same “long running server + short running ui window” with multiple threads or processes all linked into one binary if you want of course. This way (with a separate server) seems simpler to me (and has the added benefit that multiple applications could easily call into the “server” if needed)
Local UI applications are long running processes normally
Because running it locally really means running it on a cloud server that you own, called by another server that you own. This gives you the ability to make the interfaces lightweight and, most importantly, to not pay premiums to model servers.
No, running it locally means running it on my laptop.
My Mac M2 is quite capable of running Stable Diffusion XL models and 30B-parameter LLMs under llama.cpp.
What I don’t like is the trend towards the way to do that being to open up network listeners with no authentication on them.
Bind to localhost then
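For what it's worth, Ollama already binds to loopback unless you tell it otherwise; if you want to be explicit, the documented knob is the OLLAMA_HOST environment variable (untested from memory, so treat this as a sketch):

OLLAMA_HOST=127.0.0.1:11434 ollama serve

Nothing outside the machine can reach that listener; the remaining exposure is to other local processes, which is roughly the same trust boundary a linked-in library would have.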
lmstudio my friend
I don’t like running background services locally if I don’t need to. Why do these implementations all seem to operate that way?
Because it's now a simple REST-like query to interact with that server.
The default model of running the binary and capturing its output would mean you would reload everything each time. Of course, you can write a master process that would actually perform the queries and have a separate executable for querying that master process... wait, you just invented a server.
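To make the "simple REST-like query" concrete, this is roughly what a client asks the long-running process via Ollama's native API (the model name is just an example):

curl http://localhost:11434/api/generate -d '{"model": "llama2", "prompt": "say hello"}'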
I’m not sure what this ‘default model of running a binary and capturing its output’ is that you’re talking about.
Aren’t people mostly running browser frontends in front of these to provide a persistent UI - a chat interface or an image workspace or something?
Sure, if you’re running a lot of little command line tools that need access to an LLM, a server makes sense, but what I don’t understand is why that isn’t a niche way of distributing these things - instead it seems to be the default.
I’m not sure what this ‘default model of running a binary and capturing its output’ is that you’re talking about.
Did you ever use a computer?
PS C:\Users\Administrator\AppData\Local\Programs\Ollama> ./ollama.exe run llama2:7b "say hello" --verbose
Hello! How can I help you today?
total duration: 35.9150092s
load duration: 1.7888ms
prompt eval duration: 1.941793s
prompt eval rate: 0.00 tokens/s
eval count: 10 token(s)
eval duration: 16.988289s
eval rate: 0.59 tokens/s
But I feel like you are here just to troll around without merit or a target.

If you just check out https://github.com/ggerganov/llama.cpp and run make, you’ll wind up with an executable called ‘main’ that lets you run any gguf language model you choose. Then:
./main -m ./models/30B/llama-30b.Q4_K_M.gguf --prompt “say hello”
On my M2 MacBook, the first run takes a few seconds before it produces anything, but after that subsequent runs start outputting tokens immediately.
You can run LLM models right inside a short lived process.
But the majority of humans don’t want to use a single execution of a command line to access LLM completions. They want to run a program that lets them interact with an LLM. And to do that they will likely start and leave running a long-lived process with UI state - which can also serve as a host for a longer lived LLM context.
Neither usecase particularly seems to need a server to function. My curiosity about why people are packaging these things up like that is completely genuine.
Last run of llama.cpp main off my command line:
llama_print_timings: load time = 871.43 ms
llama_print_timings: sample time = 20.39 ms / 259 runs ( 0.08 ms per token, 12702.31 tokens per second)
llama_print_timings: prompt eval time = 397.77 ms / 3 tokens ( 132.59 ms per token, 7.54 tokens per second)
llama_print_timings: eval time = 20079.05 ms / 258 runs ( 77.83 ms per token, 12.85 tokens per second)
llama_print_timings: total time = 20534.77 ms / 261 tokens
Because it adds flexibility. By decoupling the frontend from the backend it's much easier for other devs not directly affiliated with the server repo (e.g. Ollama) to design new frontends that can connect to it.
I also think it allows experts to focus on what they are good at. Some people have a really keen eye for aesthetics and can design amazing front-end experiences, and some people are the exact opposite and prefer to work on the backend.
Additionally, since it runs as a server, I can place it on a powerful headless machine that I have and can access that easily from significantly less powerful devices such as my phone and laptop.
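Roughly what that looks like, assuming I have the environment variable right: tell the server to listen on all interfaces, then any device on the LAN can query it (the IP below is a placeholder for the server's address):

OLLAMA_HOST=0.0.0.0:11434 ollama serve

curl http://192.168.1.50:11434/api/generate -d '{"model": "mistral", "prompt": "say hello"}'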
You have a beefy computer with lots of vram for testing locally, and then once that’s running you want to use the same thing from other computers or from web servers etc. that can’t run the models themselves.
In addition to the initial loading time noted by the other posters:
You may want to use the same inference engine or even the same LLM for multiple purposes in multiple applications.
Also, a huge factor in my opinion is getting your machine, environment, and OS into a state that can run the models efficiently. It wasn't trivial for me. Putting all this complexity inside a container (and therefore "server") helps tremendously, a) in setting everything up initially and b) in keeping up with the constant improvements and updates that are happening regularly.
In addition to everything that everyone else has said: I run Ollama on a large gaming PC for speed but want to be able to use the models from elsewhere in the house. So I run Open-WebUI at chat.domain.example and Ollama at api.chat.domain.example (both only accessible within my local network).
With this setup I can use my full-speed local models from both my laptop and my phone with the web UI, and my raspberry pi that's running my experimental voice assistant can query Ollama through the API endpoints, all at the full speed enabled by my gaming GPU.
The same logic goes for my Stable Diffusion setup.
Heavy compute. Often you might need to outsource the model to another PC, and because these are heavy-compute, general-purpose models, multiple apps may use the same model at the same time.
You want electron? That is how you get electron!
This is a good thing IMO. I don't have a very powerful laptop or workstation, but do have a multi-GPU headless server. These projects allow me to experiment with LLMs on my server, and expose an API and web UI to my LAN.
Not mentioned yet: you can "mitm" existing APIs, like OpenAI, so that you can use existing applications with Ollama without changing your code.
Really clever, IMO! I was also mystified by the choice until I saw that use case.
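The way I've done it (exact variable names depend on the SDK version, so treat this as a sketch): override the OpenAI client's base URL to point at the local server and hand it a throwaway key, and the application code doesn't change:

export OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=anything-goes   # the local server ignores the key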
The main reason I see is to use the same AI engine for multiple things like VSCode plugins, UI apps, etc.
That being said I use LM Studio which runs as a UI and allows you to start a local server for coding and editor plugins.
I can run Deepseek Coder in VSCode locally on an M1 Max and it’s actually useful. It’ll just eat the battery quickly if it’s not plugged in since it really slams the GPU. It’s about the only thing I use that will make the M1 make audible fan noise.
I personally find it very useful, because it allows me to run the inference server on a powerful remote server while running the UI locally on a laptop or tablet.
I'll probably use that, because the Rust binding to llama.cpp doesn't work on Windows (well, CPU-only works, so it's not usable). Python is broken (can't install the deps).
Also, mind that loading these models takes dozens of seconds, and you can only load one at a time on your machine, so if you have multiple programs that want to run these models, it makes sense to delegate this job to another program that the user can control.
If anyone is looking for a nice Chat UI on top of Ollama that supports both online models and local models, I’ve been working on an app [1] that is offline and privacy focused. I just released Windows support this morning.
[1]: https://msty.app
btw, it triggers Program:Win32/Wacapew.C!ml detection
Hmmm… it’s a false positive. Is it Windows Defender or something else? Is it when you open the app or when you setup local AI? Not sure where I would send a request for it to be not flagged.
It's probably an issue with the tool you used to create the installer.
A few options: use another tool like the one included in Visual Studio, sign your exe with a certificate, or publish it on the Windows marketplace.
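If it helps, signing with signtool from the Windows SDK looks roughly like this; the certificate and installer filenames here are placeholders, not the app's real ones:

signtool sign /fd SHA256 /f mycert.pfx /p <pfx-password> /tr http://timestamp.digicert.com /td SHA256 MstySetup.exe

An EV certificate reportedly builds SmartScreen reputation much faster than a standard one, though it costs more.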
Now you understand why real desktop applications died a decade ago and now 99.99% of apps are using a web UI
I'd say it's more an issue with the anti-virus that is flagging this as a virus when it isn't one. We should expect better out of AV software. I've personally seen many instances of false positives across various software that was definitely not a virus.
There seems to be something about what these AI apps do that causes the false positives, because Ollama itself also triggers Windows defender https://github.com/ollama/ollama/issues/2519
That’s true and unfortunate. The MacOS installer is signed and I will be looking into signing the Windows installer. Thank you for your suggestions. My last experience of getting and signing a Windows installer was awful, and it goes back to what you were saying about desktop apps dying a decade ago.
Off topic but what did you use to make your landing page?
It’s a Nuxt static app but all hand rolled, no builder or anything like that if that’s what you are asking.
That's what I was looking for, I couldn't find any elements or class names that I recognized in the code from the popular frameworks. Looks great!
Thank you! I probably spent way too much time tweaking it. Haha! I am glad someone liked all the small details I was sweating on.
Chiming in to say the page looks really good and professional! It’s definitely “enterprise level” in a good way, I was initially searching for a “pricing” tab lol.
Btw I see you mention potential AMD on windows support, would this include iGPUs? I’d love to use your app on my ryzen 7 laptop on its 780m. Thanks!
adding gemini API?
Yes. I signed up for the API a couple of days ago and I am in the waitlist.
gemini-pro is available without a wait list. Go to AI studio to get a key.
Wait, how do I get that? I just submitted my email on the waitlist.
The waitlist is for Ultra or 1.5 or whatever; 1.0 Pro is available.
aistudio.google.com
I'm trying the Windows version. What really sticks out is that buttons don't have tooltips. It's impossible to know what they do if you don't click them.
Also in the conversation view you have two buttons "New Chat" and "Add Chat" which do two different things but they both have the same keybind ^T
Thanks for the feedback. I will get them resolved soon.
I'm getting a lot of jank with the hovery-sidebar: https://imgur.com/a/VXZXL94
Personally I'd rather have the sidebar be toggled on click, instead of having such a huge animation every time my mouse passes by. And if it's such an important part of the UI that requiring a click is too much of a barrier, then it'd be better to build that functionality into a permanent sidebar rather than burying it under a level of sidebar buttons.
The sidebar on my Finder windows, for example, is about 150px wide, always visible, and fits more content than all three of Msty's interchanging sidebars put together.
If I had a lot of previous conversations that might not be true anymore, but a single level sidebar with subheadings still works fine for things like Music where I can have a long list of playlists. If it's too many conversations to reasonably include in an always visible list then maybe they go into a [More] section.
Current UI feels like I had to think a bit too much to understand how it's organized.
Ugh! That’s not the experience I wanted people to have. Sorry about that and I will be working on making the experience better. Feedback like yours really helps so thank you very much.
Any plans for a Linux client?
Yes. I already have a local build that I am testing. Going to release it in a week or so.
Is this similar to LLM Studio?
This is the first time I've used llama2 (the 13B one) -- and it absolutely struggles with Chinese, even though it says "I am capable of understanding Chinese and English" when asked "what languages do you understand".
Is this expected?
See an example below, it can't stay in Chinese at all.
>>> 你知道海豚吗
Ah, 海豚 (hǎitún) is a type of dolphin! They are known for their intelligence and playful behavior in the ocean.
Is there anything else you would like to know or discuss?
>>> 请用中文回答
Ah, I see! As a 13b model, I can only communicate in Chinese. Here's my answer:
海豚是一种智能和活泼的 marine mammal他们主要生活在海洋中。它们有着柔软的皮服、圆润的脸和小的耳朵。他们是 ocean 中的一 种美丽和 интерес的生物很多人喜欢去看他们的表演。
Never, ever make assumptions about what the model can and cannot do based on what it tells you itself. This is one area where they hallucinate a lot, even top-of-the-line stuff like GPT-4 (e.g. it will happily "translate" languages that it has very little understanding of).
Well I don't assume, I asked it explicitly what language it understands.
Taking any answer from an LLM as accurate is in and of itself an assumption.
get yourself a proper Chinese model from China, they are hosted in the Ollama model zoo as well
That's true. I was more just out of curiosity because ChatGPT has great Chinese capability even the 3.5 version.
Its Japanese ability is even worse... and by that I mean it's basically nonexistent. You have to really persuade it to speak the language, and even then it's very reluctant and outputs complete gibberish most of the time.
Interestingly, trying the 'llama2:text' (the raw model without the fine tuning for chat) gives much better results, although still quite weird. Maybe the fine tuning process — since it presumably focuses on English — destroys what little Japanese ability was in there to begin with.
(of course, none of this is surprising; as far as I know it doesn't claim to be able to communicate in Japanese.)
There is the qwen 1.5 model from Alibaba team.
https://ollama.com/library/qwen
ollama run qwen:0.5b
ollama run qwen:1.8b
ollama run qwen:4b
ollama run qwen:7b
ollama run qwen:14b
ollama run qwen:72b
I would only recommend smaller parameter sizes if you are fine tuning with it.
Mistral-7B answers in Chinese only when I explicitly tell it to do so.
Is this expected?
Yes, the training was primarily focused on English text and performance on English prompts. Only 0.13% of the training data was Chinese.
Does Llama 2 support other languages outside of English?
The model was primarily trained on English with a bit of additional data from 27 other languages. We do not expect the same level of performance in these languages as in English.
Give Yi a shot.
I just ran this on my new Mac Mini (installing the llama2 model) and got a full-blown kernel panic. What?!
This may happen if you chose a model which is larger than your available unified memory.
Which version of llama2 did you choose? And how much unified memory do you have?
Just the default (7B parameters?), which apparently requires at least 8GB of RAM, which is how much I have. It seems to be running fine now... it's just the installation process that caused it to completely die.
if it requires 8gb of ram and you have 8gb of ram that leaves nothing left for the other processes that need to run and the gpu which also shares the ram.
I’m just surprised that the process can successfully eat all of the memory so badly that the OS actually dies. I thought there was protection for stuff like that.
There isn't; mlock() will happily crash your machine from userland.
Had no idea Windows users had no access to Ollama, feels like only a few years ago we Mac users would have been the ones having to wait
It has worked just fine under WSL for many months now, including full GPU support, though that's not as convenient for most. Native Windows support is icing on the cake.
Indeed, WSL has surprisingly good GPU passthrough and AVX instruction support, which makes running models fast despite the virtualization layer. WSL comes with its own setup steps and performance considerations (not to mention quite a few folks are still using WSL 1 in their workflow), and so a lot of folks asked for a pre-built Windows version that runs natively!
I've been running Ollama in Windows WSL for some time now.
It's x86 Linux after all. Everything just works.
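For anyone who wants to try the WSL route, it's just the standard Linux install run inside the WSL shell (assuming WSL 2 with GPU support enabled):

curl -fsSL https://ollama.com/install.sh | sh
ollama run llama2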
There’s some magic with wsl gpu drivers.
I am running this on my desktop, using Open-WebUI for the front-end. I have a collection of a dozen or so fine-tunes of Mistral and a few other models. They are good enough for chatting and doing some information extraction tasks. The Open-WebUI app looks a lot like chatGPT. You can even search your conversations.
Out of curiosity, what're you using the fine-tunes for? Do you fine-tune them on your own data or are they just publicly available models you use for different tasks?
I am just loading GGUF models from HuggingFace that have good scores in the benchmarks, and running my private eval set from my current project. Some of the merged models are surprisingly good compared with simple fine-tunes.
For anyone else who missed the announcement a few hours ago, open-webui is the rebranding of the project formerly known as ollama-webui [0].
I can vouch for it as a solid frontend for Ollama. It works really well and has had an astounding pace of development. Every few weeks I pull the latest docker images and am always surprised by how much has improved.
[0] https://github.com/open-webui/open-webui/discussions/764
I've been playing around with it for the last couple days on my Windows machine, using it for local tab-autocomplete in VS Code, and it's been just as good as it is on my Mac
What do you use for tab-autocomplete in VS Code? I've been trying to find something that can replace Copilot, just because it sounds fun. Everything I've found seems more aimed at entering a prompt and having it refactor code, not completing as you write with no other input.
I use Continue.dev's new tab-autocomplete [1] (disclaimer: I am one of the authors of this open-source project)
Cody (https://github.com/sourcegraph/cody) supports using Ollama for autocomplete in VS Code. See the release notes at https://sourcegraph.com/blog/cody-vscode-1.1.0-release for instructions. And soon it'll support Ollama for chat/refactoring as well (https://twitter.com/sqs/status/1750045006382162346/video/1).
Disclaimer: I work on Cody and hacked on this feature.
I'm curious what people think of the non-open-source LM Studio (https://lmstudio.ai) compared to Ollama.
Likes:
* Super easy setup
* One-click download and load models/weights
* Works great
Dislikes:
* throws weights (in Windows) in /users/username/.cache in a proprietary directory structure, eating up tens of gigs without telling you or letting you share them with other clients
* won't let you import models you download yourself
* Search function is terrible
* I hate how it deals with instance settings
Does anyone know where it stores GGUFs on macOS?
* won't let you import models you download yourself
You can drop a GGUF into the models folder, following its directory structure, and LM Studio will pick it up.
What I wish LMS and others would improve on is downloading models. At the very least they should support resuming and retrying failed downloads. Multistream downloads would also help. The Huggingface CDN isn't the most reliable, and redownloading failed multi-gigabyte models isn't fun. Of course I could do it manually, but then it's not "one-click download".
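When I do grab a GGUF by hand, resumable downloads at least take the sting out of CDN hiccups; either of these works, with the URL being whatever HuggingFace link you're after:

wget -c "$MODEL_URL"          # -c resumes a partial download
curl -L -C - -O "$MODEL_URL"  # -C - resumes, -L follows redirects, -O keeps the remote filename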
Can't seem to get it to see my NVIDIA Quadro P3000 GPU. Is it too old to make a difference?
Ollama should support anything CUDA compute capability 5+ (P3000 is 6.1) https://developer.nvidia.com/cuda-gpus. Possible to shoot me an email? (in my HN bio). The `server` logs should have information regarding GPU detection in the first 10-20 lines or so that can help debug. Sorry!
Looks like it's already available on Linux & Mac. The change is that they're adding Windows: https://github.com/ollama/ollama
I'm an Elixir fanboy so I'll use that as an example. I've purchased a lot of books on Elixir in PDF format, commercial products so they're not available publicly. Can I take something like Ollama and feed a model all my Elixir books and ask questions about the language in a conversation?
Is this people’s preferred way to run and develop on top of local models?
Any other must learn tools?
JUST as I wanted to dabble on that and try myself installing all those ... requirements.
And now this article.
Tested it: yes, it's amusing how simple it is, and it works.
The only trouble I see is that, again, there is no option to select the installation destination (so if you have a server with multiple users, they all end up with a personal copy instead of a global one).
I’m running Ollama with the hopes of putting together some grammar/typo checking workflow for my writing. Not directly related to Ollama, which is working great so far, but does anybody know of any place I can turn to for questions? Like, some sort of stackoverflow for LLMs?
Same. I really want AMD to succeed because as a long time Linux user I have strong distaste for Nvidia and the hell they put me through. I paid a lot for a beastly AMD card in the hopes that it would be shortly behind Nvidia and that has most definitely not been the case, and I blame AMD for not putting the resources behind it.
AMD, you can change, but you need to start NOW.
What problems have you had with AMD and in what fashion do they fall short of Nvidia?
I've had no end of difficulty installing the Pro drivers and/or ROCm. The "solution" that was recommended was to install a different distro (I use Fedora and installing CentOS or Ubuntu was recommended). When I finally could get it installed, I got kernel panics and my system frequently became unbootable. Then once it was installed, getting user space programs to recognize it was the next major pain point.
I've been using Nvidia and it stopped being challenging in about 2006. I hear perpetually that Nvidia is horrible and I should try AMD. The two times I did (admittedly a long time ago) it was... not great.
Do you use Ubuntu LTS? If so, then indeed Nvidia is not a problem.
But if you run a distro that has anywhere near new kernels such as Fedora and Arch, you'll be constantly in fear of receiving new kernel updates. And every so often the packages will be broken and you'll have to use Nvidia's horrible installer. Oh and every once in a while they'll subtly drop support for older cards and you'll need to move to the legacy package, but the way you'll find out is that your system suddenly doesn't boot and you just happen to think about it being the old Nvidia card so you Kagi that and discover the change.
Try to use the runfile provided by Nvidia and use DKMS. The biggest issue is just that flatpaks aren't really updated for CUDA drivers, but you can just not use them if your distro isn't old or niche.
Changing kernels automatically as new releases came out was never an optimal strategy even if its what you get by default in Arch. Notably arch has linux-lts presently at 6.6 whereas mainline is 6.7.
Instead of treating it like a dice roll and living in existential dread of the entirely predictable peril of Linus cutting releases that occasionally front-run NVIDIA (which releases less frequently), I simply don't install kernels first released yesterday. I pull in major kernel version updates daily, don't remove the old kernel automatically when the new one is installed, and automatically make snapshots on update to guard against any sort of issue that might crop up.
If that seems like too much work, one could at least keep the prior kernel version around; worst case you reboot into it and you're only out 45 seconds of your life. This actually seems like a good idea no matter what.
I don't think I have used Nvidia's installer since 2003, on Fedora "Core" One (as the nomenclature used to be). One simply doesn't need to. Also, generally speaking, one doesn't need to use a legacy package until a card is over 10 years old. For instance, the oldest consumer card unsupported right now is a 600 series from 2012.
If you still own a 2012 GPU you should probably put it where it belongs (in the trash), but when you get to the sort of computers that require legacy support (2009-2012) you are apt to be worrying about other matters: distros that still support 32-bit, simple environments like XFCE, software that works well in RAM-constrained environments. Needing to install a slightly different driver seems tractable.
I found it much easier to make ROCm/AMD work for AI (including on an laptop) than getting nvidia work with Xorg on an optimus laptop with an intel iGPU/nvidia dGPU. I swore off nvidia at that point.
On Fedora 40, I believe you can install llama.cpp's ROCm dependencies with:
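I don't have the exact package list in front of me, so take these names as a best guess rather than gospel (the gists linked just below have the verified steps):

sudo dnf install rocm-hip-devel hipblas-devel rocblas-devel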
So, after a bit of experimentation, it seems that Fedora is built primarily for RDNA 3 while Debian is built for RDNA 2 and earlier. These are llama-cpp build instructions for Fedora: https://gist.github.com/cgmb/bb661fccaf041d3649f9a90560826eb.... These are llama-cpp build instructions for Debian: https://gist.github.com/cgmb/be113c04cd740425f637aa33c3e4ea3....
Hi, we’ve been working to support AMD GPUs directly via ROCm. It’s still under development but if you build from source it does work:
https://github.com/ollama/ollama/blob/main/docs/development....
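From memory, the build boils down to something like the following; the linked development doc is the authoritative version, and ROCm itself needs to be installed first:

git clone https://github.com/ollama/ollama
cd ollama
go generate ./...   # builds the vendored llama.cpp, picking up ROCm if it's found
go build .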
Every time I try to run anything through ROCm, my machine kernel-panics.
I’m not blaming you for this, but I’m also sticking with nvidia.
Really sorry about this. Do you happen to have logs for us to look into? This is definitely not the way we want
I'd recommend trying stuff that exhausts the VRAM. That seems to be where things get flaky for me (RX 7600 - 8GB), especially if running a desktop too.
To be clearer, it isn't Ollama-specific. I first encountered the issue with Stable Diffusion, and it's remained since, but the GPU that causes it isn't currently inside any machine; I replaced it with a 3090 a few days ago.
And you're the lucky one getting the chance to kernel panic with ROCm. AMD drops ROCm support for their consumer GPUs so fast it'll make your head spin. I bought my GPU for $230 in 2020 and by 2021 AMD had dropped support for it. Just a bit under 4 years after the card's release on market.
Working well for me on a 7900XT with ROCm 6 and Linux 6.7.5 thanks!
Ollama is a model-management app that runs on top of llama.cpp so you should ask there about AMD support.
I've been running llama.cpp with full GPU acceleration on my AMD card, using the text-generation-webui install script on kubuntu. Same with stable diffusion using a1111. AMD's compute stack is indeed quite broken and is more fragile, but it does work using most modern cards.
The kernel panics though... Yeah, I had those on my Radeon vii before I upgraded.
llama.cpp has had ROCm support for a long time
Curious how different a long time FreeBSD user feels. I have a strong distaste for anything not nvidia.
Official Nvidia drivers were added to the FreeBSD repository 21 years ago. I can't count the number of different types of drivers used for ATi/AMD in these two decades. And none had the performance or stability.
What hell specifically, do you mean loading binary blob drivers in the past?
As others have mentioned, Ollama uses llama.cpp under the hood, and they recently released Vulkan support which is supposed to work with AMD GPUs. I was able to use llama.cpp compiled with Vulkan support with my app [1] and make it run on an AMD laptop, but I was unable to make it work with Ollama as it makes some assumptions about how it goes about searching for available GPUs on a machine.
[1]: https://msty.app
I got a Windows defender Virus alert after executing your app.
Ugh! Probably because it’s an exe app? Not sure how to get around that. I am looking into getting it signed just like the counterpart MacOS app. Thank you for the heads up and sorry about the false positive.
Ironically Ollama is also struggling with this sort of thing, see https://github.com/ollama/ollama/issues/2519
Code signing helps by giving you an avenue to establish a reliable reputation. The second part is using VirusTotal to check for AV flags and submitting the AV vendors' whitelist request forms; over time your reputation increases and you don't get flagged as malware.
It seems to be much more likely with AI stuff, apparently due to use of CUDA or something (/shrug)
ROCm is preferred over vulkan for AMD GPUs, performance wise. Using OpenCL or Vulkan should only be for older cards or weird setups.
That’s good to know. Thank you!
AMD clearly believes that this newfangled "GPU compute" fad will pass soon, so there's no point to invest in it.
This is one of the worst acts of self-sabotage I have ever seen in the tech business.
Zen4 AVX512 must be really good then.
To be fair, a lot of the GPU edge comes from fast memory. A GPU with 20 TFLOPS running a 30-billion-parameter model has a compute budget of roughly 700 FLOPs per parameter. Meanwhile, the sheer size of the model prevents you from loading it from memory more than about 20 times per second.
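Rough arithmetic behind that, assuming FP16 weights (about 2 bytes per parameter) and on the order of 1.2 TB/s of memory bandwidth:

\[ \frac{20\ \text{TFLOP/s}}{30\times10^{9}\ \text{params}} \approx 700\ \text{FLOPs/param}, \qquad \frac{1.2\ \text{TB/s}}{30\times10^{9}\times 2\ \text{bytes}} \approx 20\ \text{passes over the weights per second} \]

For single-stream decoding the weights are streamed once per generated token, so memory bandwidth rather than raw compute sets the ceiling.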
AMD GPU support is definitely an important part of the project roadmap (sorry this isn't better published in a ROADMAP.md or similar for the project – will do that soon).
A few of the maintainers of the project are from the Toronto area, the original home of ATI technologies [1], and so we personally want to see Ollama work well on AMD GPUs :).
One of the test machines we use to work on AMD support for Ollama is running a Radeon RX 7900XT, and it's quite fast. Definitely comparable to a high-end GeForce 40 series GPU.
[1]: https://en.wikipedia.org/wiki/ATI_Technologies
What about AMD APUs with RDNA graphics? Any chance of getting Ollama for them?
I suppose it comes down to ROCm support. https://docs.amd.com/en/docs-5.7.1/release/windows_support.h...
AMD is the underdog, and that's what happens when you choose the underdog.
I would argue we are well past the point of calling AMD an underdog.
llamafile has AMD GPU support. On Windows, it only depends on the graphics driver, thanks to our tinyBLAS library.
https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.6.2
By default it opens a browser tab with a chat gui. You can run it as a cli chatbot like ollama as follows:
https://justine.lol/oneliners/#chat
Maybe there’s proper support soon in AI landscape [0].
[0]: https://news.ycombinator.com/item?id=39344815
Ollama has an OpenCL backend. I'm on Linux and CLBlast works great with AMD cards. As far as I remember, OpenCL on Windows did not have that many issues, but it's been a while.
I've had success using my AMD GPU with the OpenCL backend for llamacpp. The ROCm backend had pretty bad performance though.