I'm not sure why Ollama garners so much attention. Its value is limited: it's useful only for experimenting with models and can't serve more than one model at a time. It's not meant for production deployments. Granted, it makes experimentation super easy, but for something that relies entirely on llama.cpp and whose main value proposition is easy model management, I'm not sure it deserves the brouhaha people are giving it.
Edit: what do you do after the initial experimentation? You need to deploy these models to production eventually. I'm not even talking about giving credit to llama.cpp; I'm just saying this product is getting disproportionate attention and kudos compared to the value it delivers. Not denying that it's a great product.
The answer to your question is:
`ollama run mixtral`
That's it. You're running a local LLM. I have no clue how to run llama.cpp.
I got Stable Diffusion running, and it was painful. I wish there were something like ollama for it.
The README is pretty clear, though it covers a lot of optional steps you don't need. It's essentially gonna be something like this (the exact repo URL and model file below are just examples):
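```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# grab a quantized .gguf model, e.g. from Hugging Face (file name varies by model and quant)
wget https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf
./main -m mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf -p "Hello"
```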
This shows the value ollama provides: I only need to know the model name and then run a single command.
It should be fairly obvious that one can find alternative models and use them in the above command too.
Look, I’m not arguing that a prebuilt binary that handles model downloading has no value over a source build and manually pulling down gguf files. I just want to dispel some of the mystery.
Local LLM execution doesn’t require some mysterious voodoo that can only be done by installing and running a server runtime. It’s just something you can do by running code that loads a model file into memory and feeds tokens to it.
More programmers should be looking at llama.cpp language bindings than at Ollama's implementation of the OpenAI API.
I'd rather focus on building on top of LLMs than going lower level.
Ollama makes that super easy. I tried llama.cpp first and hit build issues; Ollama worked out of the box.
Sure.
Just be aware that there’s a lot of expressive difference between building on top of an HTTP API vs on top of a direct interface to the token sampler and model state.
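To make that concrete, the HTTP side is basically request/response against a local server. Here's a rough sketch against Ollama's default local endpoint (the model name and prompt are just examples; check the API docs for the exact shape):

```
import json
import urllib.request

# Minimal call to a locally running Ollama server (default port 11434).
payload = {"model": "mixtral", "prompt": "Why is the sky blue?", "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])
```

Text in, text out. What you don't get is any handle on the sampler, the KV cache, or the logits between tokens.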
I'm aware; I don't need that amount of sophistication yet.
Python seems to be the way to go deeper, though. Is there a good reason I should be aware of to pick llama.cpp over Python?
Python’s as good a choice as any for the application layer. You’re either going to be using PyTorch or llama-cpp-python to get the CUDA stuff working - both rely on native compiled C/C++ code to access GPUs and manage memory at the scale needed for LLMs. I’m not actually up to speed on the current state of the game there but my understanding is that llama.cpp’s less generic approach has allowed it to focus on specifically optimizing performance of llama-style LLMs.
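If you do want the lower-level route from Python, the llama-cpp-python bindings look roughly like this (a sketch only; the model path, context size, and sampling parameters are placeholders):

```
from llama_cpp import Llama  # pip install llama-cpp-python

# Point model_path at whatever .gguf file you downloaded.
llm = Llama(
    model_path="./mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload as many layers as the backend (CUDA/Metal) allows
)

# High-level completion call: sampling parameters are under your control per call.
out = llm(
    "Q: Name the planets in the solar system. A:",
    max_tokens=128,
    temperature=0.2,
    stop=["Q:"],
)
print(out["choices"][0]["text"])

# You also get the model's own tokenizer, which an HTTP API typically hides.
print(llm.tokenize(b"hello world"))
```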
I've seen more of the model fiddling, like logits restrictions and layer dropping, implemented in Python, which is why I ask.
Most of AI has centralized around Python, and I see more of my code moving that way; for instance, I'm using LlamaIndex as my primary interface now, which supports ollama and many other model loaders / APIs.
There are 5 commands in that README two comments up, and 4 of them can reasonably fail (I'll give cd high marks for reliability). `make` especially is a minefield and usually involves a half-hour of searching the internet and figuring out which dependencies are a problem today. And that is all assuming someone is comfortable with compiled languages; I'd hazard most devs these days are from JS land and don't know how to debug make.
Finding the correct model weights is also a challenge in my experience: there are a lot of alternatives, and it is often difficult to figure out what the differences are and whether they matter.
The README is clear that I'm probably about to lose an hour debugging if I follow it. It might be one of those rare cases where it works the first time, but that is the exception, not the rule.
Your mileage may vary. It runs first time for me on an Apple Silicon Mac.
And what will you do after trying it? Sure, you saved a few mins in trying out a model or models. What next?
I focus on building the application rather than figuring out someone else's preferred method for how I should work?
I use Docker Compose locally, Kubernetes in the cloud
I run in hot-reload locally, I build for production
I often nuke my database locally, but I run it HA in production
It is very rare to use the same technology locally (or the same way) as in production
Relax. Not everything in this world was built exactly for you. You almost seem to have a problem with this.
There is no "next", there is a whole world of people running LLMs locally on their computer and they are far more likely to switch between models on a whim every few days.
The first 3 steps GP provided are literally just the steps for installation. The "value" you mentioned is just a packaged installer (or, in the case of Linux, apparently a `curl | sh` -- and I'd much prefer the git clone version).
On multiple occasions I've been modifying llama.cpp code directly and recompiling for my own purposes. If you're using ollama on the command line, I'd say having the option to easily do that is much more useful than saving a couple commands upon installation.
When I get to the point of modification, I will go with Python. This is where the AI ecosystem is largely at
I stopped using C++ when Go came out, no interest in ever having to write it again.
This will likely build a version without GPU acceleration, I think?
It builds with Metal support on my M2 Mac.
Last time I tried llama.cpp I got errors when running make that were way too time-consuming to bother tracking down.
It's probably a simple build if everything is how it wants it, but it wasn't on my machine, whereas running ollama just worked.
Compared to “ollama pull mixtral”? And then actually using the thing is easier as well.
The average user isn't going to compile llama.cpp. They will either download a fully integrated application that contains llama.cpp and can read gguf files directly, like kobold.cpp, or they will use an arbitrary front end like Silly Tavern, which needs to connect to an inference server via an API; ollama is one of the easier inference servers to install and use.
For us this may seem like a walk in the park.
For non-technical people, there is a good chance their OS doesn't have git, wget, and a C++ compiler (especially on Windows).
This is just like the Dropbox case years ago.
On a mac, https://drawthings.ai is the ollama of Stable Diffusion.
Check out EasyDiffusion.
For me, ComfyUI made the process of installing and playing with SD about as simple as a Windows installer.
I mean, it takes something difficult like an LLM and makes it easy to run. It's bound to get attention. If you've tried to get other models, like BERT-based models, to run, you'll realize just how big the usability gains from ollama are compared to anything else in the space.
If the question you're asking is why so many folks are focused on experimentation instead of productionizing these models, then I see where you're coming from. There's the question of how much LLMs are actually being used in prod scenarios right now as opposed to just excited people chucking things at them; that maybe LLMs are more just fun playthings than tools for production. But in my experience as HN has gotten bigger, the number of posters talking about productionizing anything has really gone down. I suspect the userbase has become more broadly "interested in software" rather than "ships production facing code" and the enthusiasm in these comments reflects those interests.
FWIW we use some LLMs in production and we do not use ollama at all. Our prod story is very different than what folks are talking about here and I'd love to have a thread that focuses more on language model prod deployments.
Well, you would be one of the few hundred people on the planet doing that. With local LLMs we're just trying to create a way for everyone else to use AI that doesn't require sharing all their data with the model providers. The first thing everyone asks for, of course, is how to turn the open source local LLMs into their own online service.
Few hundred on the planet? Are you kidding me? We're asking enterprises to run LLMs on-premise (I'm intentionally discounting the cloud scenario, where the traffic rates are much higher). That's way more than a few hundred, and sorry to break it to you, but Ollama is just not going to cut it.
No need to be angry about this. Tech folks should be discussing this collectively and collaboratively. There's space for everything from local models running on smartphones all the way up to OpenAI style industrialized models. Back when social networks were first coming out, I used to read lots of comments about deploying and running distributed systems. I remember reading early incident reports about hotspotting and consistent hashing and TTL problems and such. We need to foster more of that kind of conversation for LMs. Sadly right now Xitter seems to be the best place for that.
Not angry. Having a discussion :-). It just amazes me how the HN crowd is more than happy to try out a model on their machine and call it a day, without seeing the real picture ahead. Let's ignore perf concerns for a moment. Let's say I want to run it on a shared server in the enterprise network so that any application can make use of it. Each application might want to use a model of its choosing. Ollama will unload and load models as each new request arrives. Not sure if folks here are realizing this :-)
I’m not sure you’re capable of understanding that your needs and requirements are just that, yours.
Ollama's purpose and usefulness is clear. I don't think anyone is disputing that nor the large usability gains ollama has driven. At least I'm not.
As far as being one of the few hundred on the planet, well yeah that's why I'm on HN. There's tons of publications and subreddits and fora for generic tech conversation. I come here because I want to talk about the unknowns.
Your knowns are unknowns to some people and vice versa. This is a great strength of HN; on a whole lot of subjects you'll find people ranging from enthusiastic to expert. There are probably subreddits or discord servers tailored to narrow niches and that's cool, but HN is not that. They are complementary, if anything. In contrast, HN is much more interesting and has a much better S/N ratio than generic tech subreddits; it's not even comparable.
I've been using this site since 2008; this is my second account, from 2009. HN very much used to be a niche tech site. I realize that for newer users, the appeal is like r/programming or r/technology but with a more text-oriented interface or higher SNR or whatever, but this is a shift in audience, and there are folks like me on this site who still want to use it for niche content.
There are still threads where people do discuss gory details, even if the topics aren't technical. A lot of the mapping articles on the site bring out folks with deep knowledge about mapping stacks. Alternate energy threads do it too. It can be like that for LLMs also, but the user base has to want this site to be more than just Yet Another Tech News Aggregator thread.
Lately I've come to realize that the current audience wants YATNA more than deep discussion, so I modulate my time here accordingly. The LLM threads bring me here because experts like jart chime in.
I did not really like HN back in the day because it felt too startup-y, but maybe I got a wrong impression. I much preferred Ars Technica and their forum (now, Ars is much less compelling).
I think it depends on the stories. Different subjects have different demographics, and I have almost completely stopped reading physics stuff because it is way too Reddit-like (and full of confident people asserting embarrassingly wrong facts). I can see how you could feel about fields closer to your interests being dumbed down by over-confident non-specialists.
There are still good, highly technical discussions, but it is true that the home page is a bit limited and inefficient to find them.
Running more than one is easy: put them behind a load balancer, with one ollama instance per container or per port.
That is still one model per instance of Ollama, right?
Yes, not sure you can do better than that. You still cannot have one instance of an LLM in (GPU) memory answering two queries at the same time.
Of course you can support concurrent requests. But Ollama doesn't support it, and it's not meant for this purpose, and that's perfectly OK. That's not the point, though. For fast/perf scenarios, you're better off with vllm.
Thanks! This is great to know.
FWIW, Ollama has no concurrency support even though llama.cpp's server component (the thing that Ollama actually uses) supports it. Besides, you can't have more than one model running, and unloading and loading models is not free. Again, there's a lot more to it, and much of the real optimization work is not in Ollama; it's in llama.cpp, which is completely ignored in this equation.
Thanks! Great to know. I did not know llama.cpp could do this. It should be pretty straightforward to support; not sure why they would not do it.
I'm pretty sure their primary focus right now is to gain as much mindshare as possible and they seem to be doing a great job of it. If you look at the following GitHub metrics:
https://devboard.gitsense.com/ggerganov?r=ggerganov%2Fllama....
https://devboard.gitsense.com/ollama?r=ollama%2Follama&nb=tr...
The number of people engaging with ollama is twice that of llama.cpp. And there hasn't been a dip in people engaging with Ollama in the past 6 months. However, what I do find interesting with regards to these two projects is the number of merged pull requests. If you click on the "Groups" tab and look at "Hooray", you can see llama.cpp had 72 contributors with one or more merged pull requests vs 25 for Ollama.
For Ollama, people are certainly more interested in commenting and raising issues. Compare this to llama.cpp, where the number of people contributing code changes is double that of Ollama.
I know llama.cpp is VC funded, and if they don't focus on making llama.cpp as easy to use as Ollama, they may find themselves doing all the hard stuff while Ollama reaps all the benefits.
Full Disclosure: The tool that I used is mine.
I am probably not the demographic you expect. I don't do "production" in that sense, but I have ollama running quite often when I am working, as I use it for RAG and as a fancy knowledge-extraction engine. It is incredibly useful:
- I can test a lot of models by just pulling them (very useful as progress is very fast),
- using their command line is trivial,
- the fact that it keeps running in the background means that it starts once every few days and stays out of the way,
- it integrates nicely with langchain (and a host of other libraries), which means that it is easy to set up some sophisticated process and abstract away the LLM itself (see the sketch below).
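To give a flavour of that last point, the langchain wiring is roughly this (a sketch; the model name and prompt are placeholders):

```
from langchain_community.llms import Ollama

# Talks to the local Ollama server; any model pulled with `ollama pull` works here.
llm = Ollama(model="mixtral")
print(llm.invoke("Summarise the following abstract in two sentences: ..."))
```

Swapping the model is a one-word change, which is exactly why testing lots of models is so cheap.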
I just keep using it. And for now, I keep tweaking my scripts but I expect them to stabilise at some point, because I use these models to do some real work, and this work is not monkeying about with LLMs.
For me, there is nothing that comes close in terms of integration and convenience. The value it delivers is great, because it enables me to do some useful work without wasting time worrying about lower-level architecture details. Again, I am probably not in the demographics you have in mind (I am not a CS person and my programming is usually limited to HPC), but ollama is very useful to me. Its reputation is completely deserved, as far as I am concerned.
Curious, can you share more details about your use case?
The use case is exploratory literature review in a specific scientific field.
I have a setup that takes PDFs and does some OCR and layout detection with Amazon, and then bunches them together with some internal reports. Then I have a pipeline to write summaries of each document, and another one to slice them into chunks, get embeddings, and set up a vector store for a RAG chatbot. At the moment it's using Mixtral and the command line, but I like being able to swap LLMs to experiment with different models and quantisation without hassle, and I more or less plan to set this up on a remote server to free some resources on my workstation, so the web UI could come in handy. Running this locally is a must for confidentiality reasons. I'd like to get rid of Textract as well, but unfortunately I haven't found a solution that's even close. Tesseract in particular was very disappointing.
Try ollama webui (now open-webui). Sorry, I'm on my phone now, so no links.
Even for just running a model locally, Ollama provided a much simpler "one click install" earlier than most tools. That in itself is worth the support.
Koboldcpp is also very, very good: plug and play, a very complete web UI, a nice little API with SSE text streaming, Vulkan accelerated, and it has an AMD fork...
I am 100% uninterested in your production deployment of rent seeking behavior for tools and models I can run myself. Ollama empowers me to do more of that easier. That’s why it’s popular.
OP’s point is more that Ollama isn’t what’s doing the empowering. Llama.cpp is.
That's the answer to your question. It may have less space than a Zune, but the average person doesn't care about technically superior alternatives that are much harder to use.
*Nomad
And lame.
It's nice for personal use, which is what I think it was built for, and it has some nice frontend options too. The tooling around it is nice, and there are projects building in RAG etc. I don't think people are intending to deploy services through these tools.
You are not making any sense. I am running ollama and Open WebUI (which takes care of auth) in production.
Ollama is the Docker of LLMs. Ollama made it _very_ easy to run LLMs locally. This is surprisingly not as easy as it seems, and incredibly useful.
As it turns out, making it faster and better to manage things tends to get people’s attention. I think it’s well deserved.
In my opinion, pre-built binaries and an easy-to-use front-end are things that should exist and are valid as a separate project unto themselves (see, e.g., HandBrake vs ffmpeg).
Using the name of the authors or the project you're building on can also read like an endorsement, which is not _necessarily_ desirable for the original authors (it can lead to ollama bugs being reported against llama.cpp instead of to the ollama devs and other forms of support request toil). Consider the third clause of BSD 3-Clause for an example used in other projects (although llama.cpp is licensed under MIT).
Interestingly, Ollama is not popular at all in the "localllama" community (which also extends to related discords and repos).
And I think that's because of capabilities... Ollama is somewhat restrictive compared to other frontends. I have a litany of reasons I personally wouldn't run it over exui or koboldcpp, both for performance and output quality.
That's a necessary consequence of being stable and one-click, though.
Sounds like you are dismissing Ollama as a "toy".
Refer:
https://paulgraham.com/startupideas.html
For me, all the projects that enable running and fine-tuning LLMs locally, like llama.cpp, ollama, open-webui, unsloth, etc., play a very important part in democratizing AI.
I built GaitAnalyzer[1] to analyze my gait on my laptop; I had deployed it briefly in production when I had enough credits to foot the AWS GPU bills. Ollama made it very simple to deploy the application. Anyone who has used docker before can now run GaitAnalyzer on their computer.
[1] https://github.com/abishekmuthian/gaitanalyzer