
Ollama now supports AMD graphics cards

eclectic29
63 replies
22h10m

I'm not sure why Ollama garners so much attention. It has limited value: it's only used for experimenting with models, and it can't serve more than one model at a time. It's not meant for production deployments. Granted, it makes the experimentation process super easy, but for something that relies completely on llama.cpp and whose main value proposition is easy model management, I'm not sure it deserves the brouhaha people are giving it.

Edit: what do you do after the initial experimentation? You eventually need to deploy these models to production. I'm not even talking about giving credit to llama.cpp, just noting that this product is gaining disproportionate attention and kudos compared to the value it delivers. Not denying that it's a great product.

nerdix
26 replies
21h32m

The answer to your question is:

ollama run mixtral

That's it. You're running a local LLM. I have no clue how to run llama.cpp

I got Stable Diffusion running and I wish there was something like ollama for it. It was painful.

jameshart
22 replies
18h54m

The README is pretty clear, although it talks about a lot of optional steps you don’t need; it’s essentially gonna be something like:

   git clone https://github.com/ggerganov/llama.cpp.git
   cd llama.cpp
   make
   wget -O mixtral-8x7b-v0.1.Q4_K_M.gguf 'https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF/resolve/main/mixtral-8x7b-v0.1.Q4_K_M.gguf?download=true'
   ./main -m ./mixtral-8x7b-v0.1.Q4_K_M.gguf -n 128

verdverm
15 replies
18h40m

This shows the value ollama provides

I only need to know the model name and then run a single command

jameshart
7 replies
18h28m

It should be fairly obvious that one can find alternative models and use them in the above command too.

Look, I’m not arguing that a prebuilt binary that handles model downloading has no value over a source build and manually pulling down gguf files. I just want to dispel some of the mystery.

Local LLM execution doesn’t require some mysterious voodoo that can only be done by installing and running a server runtime. It’s just something you can do by running code that loads a model file into memory and feeds tokens to it.

More programmers should be looking at llama.cpp’s language bindings than at Ollama’s implementation of the OpenAI API.
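
To make that concrete, here is a minimal sketch of the high-level llama-cpp-python binding (the model path and parameters are placeholders, and it assumes `pip install llama-cpp-python`):

    # Rough sketch using the llama-cpp-python bindings; the model path is a placeholder.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./mixtral-8x7b-v0.1.Q4_K_M.gguf",
        n_gpu_layers=-1,  # offload everything to the GPU if a GPU backend was built
        n_ctx=4096,       # context window
    )

    out = llm("Q: Why is the sky blue? A:", max_tokens=128, stop=["Q:"])
    print(out["choices"][0]["text"])

No server process, no HTTP round trip; it's just a library call.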

verdverm
4 replies
18h24m

I'd rather focus on building on top of LLMs than going lower level

Ollama makes that super easy. I tried llama.cpp first and hit build issues. Ollama worked out of the box

jameshart
3 replies
18h22m

Sure.

Just be aware that there’s a lot of expressive difference between building on top of an HTTP API vs on top of a direct interface to the token sampler and model state.
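
For a rough, hedged sketch of what that direct interface looks like through the Python bindings (exact method signatures vary across llama-cpp-python versions):

    # Token-level loop: you see and control every sampled token.
    from llama_cpp import Llama

    llm = Llama(model_path="./mixtral-8x7b-v0.1.Q4_K_M.gguf", n_ctx=2048)

    prompt_tokens = llm.tokenize(b"The three laws of robotics are")
    generated = []
    for tok in llm.generate(prompt_tokens, temp=0.7):
        if tok == llm.token_eos() or len(generated) >= 64:
            break
        generated.append(tok)

    print(llm.detokenize(generated).decode("utf-8", errors="ignore"))

That kind of control (inspecting logits, constraining sampling, reusing model state) is awkward to get through a plain HTTP completion endpoint.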

verdverm
2 replies
18h13m

I'm aware, I don't need that amount of sophistication yet.

Python seems to be the way to go deeper though. Is there a good reason I should be aware of to pick llama.cpp over python?

jameshart
1 replies
17h49m

Python’s as good a choice as any for the application layer. You’re either going to be using PyTorch or llama-cpp-python to get the CUDA stuff working - both rely on native compiled C/C++ code to access GPUs and manage memory at the scale needed for LLMs. I’m not actually up to speed on the current state of the game there but my understanding is that llama.cpp’s less generic approach has allowed it to focus on specifically optimizing performance of llama-style LLMs.

verdverm
0 replies
17h34m

I've seen more of the model fiddling, like logit restrictions and layer dropping, implemented in Python, which is why I ask.

Most of AI has centralized around Python, and I see more of my code moving that way. For example, I'm using LlamaIndex as my primary interface now, which supports Ollama and many other model loaders/APIs.
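
For illustration, the LlamaIndex-to-Ollama glue is tiny; roughly something like this (import paths move around between llama-index versions, so treat it as a sketch, and it assumes a local `ollama serve` with the model already pulled):

    # Minimal LlamaIndex -> Ollama call; the model name is just an example.
    from llama_index.llms.ollama import Ollama

    llm = Ollama(model="mistral", request_timeout=120.0)
    print(llm.complete("Summarize what a vector store is in one sentence."))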

roenxi
1 replies
16h8m

There are 5 commands in that README two comments up, and 4 of them can reasonably fail (I'll give cd high marks for reliability). `make` especially is a minefield and usually involves a half-hour of searching the internet and figuring out which dependencies are a problem today. And that is all assuming someone is comfortable with compiled languages. I'd hazard most devs these days are from JS land and don't know how to debug make.

Finding the correct model weights is also a challenge in my experience, there are a lot of alternatives and it is often difficult to figure out what the differences are and whether they matter.

The README is clear that I'm probably about to lose an hour debugging if I follow it. It might be one of those rare cases where it works first time but that is the exception not the rule.

jameshart
0 replies
5h14m

Your mileage may vary. It runs first time for me on an Apple Silicon Mac.

eclectic29
3 replies
18h34m

And what will you do after trying it? Sure, you saved a few mins in trying out a model or models. What next?

verdverm
0 replies
18h29m

I focus on building the application rather than figuring out someone else's preferred method for how I should work?

I use Docker Compose locally, Kubernetes in the cloud

I run in hot-reload locally, I build for production

I often nuke my database locally, but I run it HA in production

It is very rare to use the same technology locally (or the same way) as in production

ramblerman
0 replies
12h19m

Relax. Not everything in this world was built exactly for you. You almost seem to have a problem with this.

imtringued
0 replies
6h27m

There is no "next", there is a whole world of people running LLMs locally on their computer and they are far more likely to switch between models on a whim every few days.

hnfong
1 replies
15h6m

The first 3 steps GP provided are literally just the steps for installation. The "value" you mentioned is just a packaged installer (or, in the case of Linux, apparently a `curl | sh` -- and I'd much prefer the git clone version).

On multiple occasions I've been modifying llama.cpp code directly and recompiling for my own purposes. If you're using ollama on the command line, I'd say having the option to easily do that is much more useful than saving a couple commands upon installation.

verdverm
0 replies
6h41m

When I get to the point of modification, I will go with Python. This is where the AI ecosystem is largely at

I stopped using C++ when Go came out, no interest in ever having to write it again.

read_if_gay_
0 replies
10h25m

Hacker News

cjbprime
1 replies
14h1m

This will likely build a version without GPU acceleration, I think?

jameshart
0 replies
3h21m

Builds with Metal support on my Mac M2

vidarh
0 replies
18h48m

Last time I tried llama.cpp I got errors when running make that were way too time consuming to bother tracking down.

It's probably a simple build if everything is set up the way it wants, but it wasn't on my machine, while running ollama was.

kergonath
0 replies
18h11m

Compared to “ollama pull mixtral”? And then actually using the thing is easier as well.
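
For illustration, once the Ollama server is running, "using the thing" is a single JSON POST to the local API; a sketch (the model name is just an example):

    # Query a locally running Ollama server (default port 11434); no extra packages needed.
    import json
    import urllib.request

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": "mixtral", "prompt": "Why is the sky blue?",
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])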

imtringued
0 replies
6h37m

The average user isn't going to compile llama.cpp. They will either download a fully integrated application that contains llama.cpp and can read gguf files directly, like kobold.cpp, or they will use an arbitrary front end like Silly Tavern, which needs to connect to an inference server via an API; ollama is one of the easier inference servers to install and use.

ies7
0 replies
18h5m

For us this may look like a walk in the park.

For non-technical people there is a possibility their OS doesn't have git, wget, or a C++ compiler (especially on Windows).

This is just like the Dropbox case years ago.

viraptor
0 replies
19h20m

On a mac, https://drawthings.ai is the ollama of Stable Diffusion.

icelain
0 replies
11h37m

Check out EasyDiffusion.

ghurtado
0 replies
19h13m

For me, ComfyUI made the process of installing and playing with SD about as simple as a Windows installer.

Karrot_Kream
9 replies
20h30m

I mean, it takes something difficult like an LLM and makes it easy to run. It's bound to get attention. If you've tried to get other models, like BERT-based models, to run, you'll realize just how big the usability gains from ollama are compared to anything else in the space.

If the question you're asking is why so many folks are focused on experimentation instead of productionizing these models, then I see where you're coming from. There's the question of how much LLMs are actually being used in prod scenarios right now as opposed to just excited people chucking things at them; that maybe LLMs are more just fun playthings than tools for production. But in my experience as HN has gotten bigger, the number of posters talking about productionizing anything has really gone down. I suspect the userbase has become more broadly "interested in software" rather than "ships production facing code" and the enthusiasm in these comments reflects those interests.

FWIW we use some LLMs in production and we do not use ollama at all. Our prod story is very different than what folks are talking about here and I'd love to have a thread that focuses more on language model prod deployments.

jart
8 replies
19h59m

Well, you would be one of the few hundred people on the planet doing that. With local LLMs we're just trying to create a way for everyone else to use AI that doesn't require sharing all their data with a third party. The first thing everyone asks for, of course, is how to turn the open source local LLMs into their own online service.

eclectic29
3 replies
18h39m

Few hundred on the planet? Are you kidding me? We're asking enterprises to run LLMs on-premise (I'm intentionally discounting the cloud scenario, where the traffic rates are much higher). That's way more than a few hundred, and sorry to break it to you, but Ollama is just not going to cut it.

Karrot_Kream
2 replies
18h34m

No need to be angry about this. Tech folks should be discussing this collectively and collaboratively. There's space for everything from local models running on smartphones all the way up to OpenAI style industrialized models. Back when social networks were first coming out, I used to read lots of comments about deploying and running distributed systems. I remember reading early incident reports about hotspotting and consistent hashing and TTL problems and such. We need to foster more of that kind of conversation for LMs. Sadly right now Xitter seems to be the best place for that.

eclectic29
1 replies
18h20m

Not angry. Having a discussion :-). It just amazes me how the HN crowd is more than happy with just trying out a model on their machine and calling it a day, not seeing the real picture ahead. Let's ignore perf concerns for a moment. Let's say I want to run it on a shared server in the enterprise network so that any application can make use of it. Each application might want to use a model of its choosing. Ollama will unload/load/unload models as each new request arrives. Not sure if folks here are realizing this :-)

theshackleford
0 replies
16h23m

Not sure if folks here are realizing this :-)

I’m not sure you’re capable of understanding that your needs and requirements are just that, yours.

Karrot_Kream
3 replies
19h25m

Ollama's purpose and usefulness is clear. I don't think anyone is disputing that nor the large usability gains ollama has driven. At least I'm not.

As far as being one of the few hundred on the planet, well yeah that's why I'm on HN. There's tons of publications and subreddits and fora for generic tech conversation. I come here because I want to talk about the unknowns.

kergonath
2 replies
18h4m

I come here because I want to talk about the unknowns.

Your knowns are unknowns to some people and vice versa. This is a great strength of HN; on a whole lot of subjects you’ll find people ranging from enthusiastic to expert. There are probably subreddits or Discord servers tailored to narrow niches and that’s cool, but HN is not that. They are complementary, if anything. In contrast, HN is much more interesting and has a much better S/N ratio than generic tech subreddits; it’s not even comparable.

Karrot_Kream
1 replies
17h49m

I've been using this site since 2008. This is my second account, from 2009. HN very much used to be a tech niche site. I realize that for newer users of HN the appeal is like r/programming or r/technology but with a more text-oriented interface or higher SNR or whatever, but this is a shift in audience, and there are folks like me on this site who still want to use it for niche content.

There are still threads where people do discuss gory details, even if the topics aren't technical. A lot of the mapping articles on the site bring out folks with deep knowledge about mapping stacks. Alternate energy threads do it too. It can be like that for LLMs also, but the user base has to want this site to be more than just Yet Another Tech News Aggregator thread.

For me as of late I've come to realize that the current audience wants YATNE more than they want deep discussion here and so I modulate my time here accordingly. The LLM threads bring me here because experts like jart chime in.

kergonath
0 replies
17h14m

I've been using this site since 2008. This is my second account from 2009.

I did not really like HN back in the day because it felt too startup-y, but maybe I got a wrong impression. I much preferred Ars Technica and their forum (now, Ars is much less compelling).

For me as of late I've come to realize that the current audience wants YATNE more than they want deep discussion here and so I modulate my time here accordingly.

I think it depends on the stories. Different subjects have different demographics, and I have almost completely stopped reading physics stuff because it is way too Reddit-like (and full of confident people asserting embarrassingly wrong facts). I can see how you could feel about fields closer to your interests being dumbed down by over-confident non-specialists.

There are still good, highly technical discussions, but it is true that the home page is a bit limited and inefficient to find them.

airocker
7 replies
20h50m

More than one is easy: put it behind a load balancer. Put each ollama instance in its own container or on its own port.

Zambyte
3 replies
19h11m

That is still one model per instance of Ollama, right?

airocker
2 replies
18h58m

Yes, not sure you can do better than that. You still cannot have one instance of an LLM in (GPU) memory answer two queries at the same time.

eclectic29
1 replies
18h36m

Of course, you can support concurrent requests. But Ollama doesn't support it and it's not meant for this purpose and that's perfectly ok. That's not the point though. For fast/perf scenarios, you're better off with vllm.
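
For context, a minimal vLLM sketch looks roughly like this: one resident model, many prompts batched on the GPU (the model name is only an example):

    # vLLM keeps one model resident and batches requests on the GPU.
    from vllm import LLM, SamplingParams

    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
    params = SamplingParams(temperature=0.7, max_tokens=128)

    prompts = ["Explain KV caching briefly.", "What is continuous batching?"]
    for output in llm.generate(prompts, params):
        print(output.outputs[0].text)

The point being that the batching happens inside one loaded model, rather than loading and unloading models per request.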

airocker
0 replies
18h32m

Thanks! This is great to know.

eclectic29
2 replies
18h36m

FWIW Ollama has no concurrency support even though llama.cpp's server component (the thing that Ollama actually uses) supports it. Besides, you can't have more than 1 model running. Unloading and loading models is not free. Again, there's a lot more to it, and much of the real optimization work is not in Ollama; it's in llama.cpp, which is completely ignored in this equation.

airocker
1 replies
18h27m

Thanks! Great to know. I did not know llama.cpp could do this. It should be pretty straightforward to support; not sure why they would not do it.

sdesol
0 replies
14h51m

I'm pretty sure their primary focus right now is to gain as much mindshare as possible and they seem to be doing a great job of it. If you look at the following GitHub metrics:

https://devboard.gitsense.com/ggerganov?r=ggerganov%2Fllama....

https://devboard.gitsense.com/ollama?r=ollama%2Follama&nb=tr...

The number of people engaging with ollama is twice that of llama.cpp. And there hasn't been a dip in people engaging with Ollama in the past 6 months. However, what I do find interesting with regards to these two projects is the number of merged pull requests. If you click on the "Groups" tab and look at "Hooray", you can see llama.cpp had 72 contributors with one or more merged pull requests vs 25 for Ollama.

For Ollama, people are certainly more interested in commenting and raising issues. Compare this to llama.cpp, where the number of people contributing code changes is double that of Ollama.

I know llama.cpp is VC funded, and if they don't focus on making llama.cpp as easy to use as Ollama, they may find themselves doing all the hard stuff with Ollama reaping all the benefits.

Full Disclosure: The tool that I used is mine.

kergonath
3 replies
18h13m

It's not meant for production deployments.

I am probably not the demographics you expect. I don’t do “production” in that sense, but I have ollama running quite often when I am working, as I use it for RAG and as a fancy knowledge extraction engine. It is incredibly useful:

- I can test a lot of models by just pulling them (very useful as progress is very fast),

- using their command line is trivial,

- the fact that it keeps running in the background means that it starts once every few days and stays out of the way,

- it integrates nicely with langchain (and a host of other libraries), which means that it is easy to set up some sophisticated process and abstract away the LLM itself.
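
As a rough sketch of that langchain integration (import paths differ between langchain releases, and the model name is just an example):

    # Swap the model name and the rest of the chain stays the same.
    from langchain_community.llms import Ollama
    from langchain_core.prompts import PromptTemplate

    llm = Ollama(model="mixtral")
    prompt = PromptTemplate.from_template(
        "Extract the key claims from the following text:\n\n{text}"
    )
    chain = prompt | llm

    print(chain.invoke({"text": "Ollama now supports AMD graphics cards ..."}))

Swapping in a different model is a one-word change, which is the "abstract away the LLM" part.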

what do you do after the initial experimentation?

I just keep using it. And for now, I keep tweaking my scripts but I expect them to stabilise at some point, because I use these models to do some real work, and this work is not monkeying about with LLMs.

I'm not even talking about giving credit to llama.cpp, just mentioning that this product is gaining disproportionate attention and kudos compared to the value it delivers.

For me, there is nothing that comes close in terms of integration and convenience. The value it delivers is great, because it enables me to do some useful work without wasting time worrying about lower-level architecture details. Again, I am probably not in the demographics you have in mind (I am not a CS person and my programming is usually limited to HPC), but ollama is very useful to me. Its reputation is completely deserved, as far as I am concerned.

rkwz
2 replies
12h18m

I use it for RAG and as a fancy knowledge extraction engine

Curious, can you share more details about your usecase?

kergonath
0 replies
5h27m

The use case is exploratory literature review in a specific scientific field.

I have a setup that takes PDFs and does some OCR and layout detection with Amazon, and then bunches them with some internal reports. Then, I have a pipeline to write summaries of each document and another one to slice them into chunks, get embeddings, and set up a vector store for a RAG chat bot. At the moment it’s using Mixtral and the command line. But I like being able to swap LLMs to experiment with different models and quantisation without hassle, and I more or less plan to set this up on a remote server to free some resources on my workstation, so the web UI could come in handy. Running this locally is a must for confidentiality reasons. I’d like to get rid of Textract as well, but unfortunately I haven’t found a solution that’s even close. Tesseract in particular was very disappointing.
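
The plumbing for that kind of local RAG loop is roughly as follows; a sketch with llama-index, where the package layout varies by version and the directory, embedding model, and LLM names are placeholders:

    # Build a local vector index over pre-processed documents and query it via Ollama.
    # Assumes the llama-index Ollama and HuggingFace embedding integrations are installed.
    from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
    from llama_index.llms.ollama import Ollama

    Settings.llm = Ollama(model="mixtral", request_timeout=300.0)
    Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

    docs = SimpleDirectoryReader("./papers").load_data()  # already-OCR'd text
    index = VectorStoreIndex.from_documents(docs)

    query_engine = index.as_query_engine()
    print(query_engine.query("Summarize the main findings across these documents."))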

idncsk
0 replies
6h43m

Try ollama-webui (now open-webui). Sorry, on my phone now => no links

reustle
1 replies
22h5m

Even for just running a model locally, Ollama provided a much simpler "one click install" earlier than most tools. That in itself is worth the support.

Aka456
0 replies
19h40m

Koboldcpp is also very, very good: plug and play, very complete web UI, nice little API with SSE text streaming, Vulkan accelerated, and it has an AMD fork...

evilduck
1 replies
20h26m

I am 100% uninterested in your production deployment of rent seeking behavior for tools and models I can run myself. Ollama empowers me to do more of that easier. That’s why it’s popular.

jameshart
0 replies
18h49m

OP’s point is more that Ollama isn’t what’s doing the empowering. Llama.cpp is.

crooked-v
1 replies
22h4m

Granted that it makes the experimentation process super easy

That's the answer to your question. It may have less space than a Zune, but the average person doesn't care about technically superior alternatives that are much harder to use.

thejohnconway
0 replies
17h56m

*Nomad

And lame.

vikramkr
0 replies
21h59m

It's nice for personal use, which is what I think it was built for, and it has some nice frontend options too. The tooling around it is nice, and there are projects building in RAG etc. I don't think people are intending to deploy production services through these tools.

elwebmaster
0 replies
19h31m

You are not making any sense. I am running ollama and Open WebUI (which takes care of auth) in production.

ecnahc515
0 replies
18h33m

Ollama is the Docker of LLMs. Ollama made it _very_ easy to run LLMs locally. This is surprisingly not as easy as it seems, and incredibly useful.

davidhariri
0 replies
22h4m

As it turns out, making it faster and better to manage things tends to get people’s attention. I think it’s well deserved.

cbhl
0 replies
21h25m

In my opinion, pre-built binaries and an easy-to-use front-end are things that should exist and are valid as a separate project unto themselves (see, e.g., HandBrake vs ffmpeg).

Using the name of the authors or the project you're building on can also read like an endorsement, which is not _necessarily_ desirable for the original authors (it can lead to ollama bugs being reported against llama.cpp instead of to the ollama devs and other forms of support request toil). Consider the third clause of BSD 3-Clause for an example used in other projects (although llama.cpp is licensed under MIT).

brucethemoose2
0 replies
19h34m

Interestingly, Ollama is not popular at all in the "localllama" community (which also extends to related discords and repos).

And I think that's because of capabilities... Ollama is somewhat restrictive compared to other frontends. I have a litany of reasons I personally wouldn't run it over exui or koboldcpp, both for performance and output quality.

This is a necessity of being stable and one-click though.

Abishek_Muthian
0 replies
9h15m

For me, all the projects that enable running & fine-tuning LLMs locally, like llama.cpp, ollama, open-webui, unsloth etc., play a very important part in democratizing AI.

what do you do after the initial experimentation? you need to deploy these models eventually to production

I built GaitAnalyzer[1] to analyze my gait on my laptop; I had deployed it briefly in production when I had enough credits to foot the AWS GPU bills. Ollama made it very simple to deploy the application: anyone who has used Docker before can now run GaitAnalyzer on their computer.

[1] https://github.com/abishekmuthian/gaitanalyzer

reilly3000
32 replies
23h56m

I’m curious as to how they pulled this off. OpenCL isn’t that common in the wild relative to Cuda. Hopefully it can become robust and widespread soon enough. I personally succumbed to the pressure and spent a relative fortune on a 4090 but wish I had some choice in the matter.

Apofis
11 replies
23h54m

I'm surprised they didn't speak about the implementation at all. Anyone got more intel?

refulgentis
6 replies
23h51m

They're open source and based on llama.cpp, so nothing's secret.

My money, looking at nothing, would be on one of the two Vulkan backends added in Jan/Feb.

I continue to be flummoxed by a mostly-programmer-forum treating ollama like a magical new commercial entity breaking new ground.

It's a CLI wrapper around llama.cpp so you don't have to figure out how to compile it

washadjeffmad
5 replies
22h31m

I tried it recently and couldn't figure out why it existed. It's just a very feature limited app that doesn't require you to know anything or be able to read a model card to "do AI".

And that more or less answered it.

dartos
3 replies
22h18m

It’s because most devs nowadays are new devs and probably aren’t very familiar with native compilation.

So compiling the correct version of llama.cpp for their hardware is confusing.

Compound that with everyone’s relative inexperience with configuring any given model and you have prime grounds for a simple tool to exist.

That’s what ollama and their Modelfiles accomplish.

tracerbulletx
0 replies
22h6m

It's just because it's convenient. I wrote a rich text editor front end for llama.cpp and I originally wrote a quick go web server with streaming using the go bindings, but now I just use ollama because it's just simpler and the workflow for pulling down models with their registry and packaging new ones in containers is simpler. Also most people who want to play around with local models aren't developers at all.

mypalmike
0 replies
21h55m

Eh, I've been building native code for decades and hit quite a few roadblocks trying to get llama.cpp building with cuda support on my Ubuntu box. Library version issues and such. Ended up down a rabbit hole related to codenames for the various Nvidia architectures... It's a project on hold for now.

Weirdly, the Python bindings built without issue with pip.

imtringued
0 replies
6h11m

I'm not sure why you are assuming that ollama users are developers when there are at least 30 different applications that have direct API integration with ollama.

refulgentis
0 replies
22h16m

Edited it out of my original comment because I didn't want to seem ranty/angry/like I have some personal vendatta, as opposed to just being extremely puzzled, but it legit took me months to realize it wasn't a GUI because of how it's discussed on HN, i.e. as key to democratizing, as a large, unique, entity, etc.

Hadn't thought about it recently. After seeing it again here, and being gobsmacked by the # of genuine, earnest, comments assuming there's extensive independent development of large pieces going on in it, I'm going with:

- "The puzzled feeling you have is simply because llama.cpp is a challenge on the best of days, you need to know a lot to get to fully accelerated on ye average MacBook. and technical users don't want a GUI for an LLM, they want a way to call an API, so that's why there isn't content extalling the virtues of GPT4All*. So TL;DR you're old and have been on computer too much :P"

but I legit don't know and still can't figure it out.

* picked them because they're the most recent example of a genuinely democratizing tool that goes far beyond llama.cpp and also makes large contributions back to llama.cpp, ex. GPT4All landed 1 of the 2 vulkan backends

skipants
1 replies
22h47m

Another giveaway that it's ROCm is that it doesn't support the 5700 series...

I'm really salty because I "upgraded" to a 5700XT from a Nvidia GTX 1070 and can't do AI on the GPU anymore, purely because the software is unsupported.

But, as a dev, I suppose I should feel some empathy that there's probably some really difficult problem causing 5700XT to be unsupported by ROCm.

JonChesterfield
0 replies
22h8m

I wrote a bunch of openmp code on a 5700XT a couple of years ago, if you're building from source it'll probably run fine

j33zusjuice
0 replies
23h49m

Ahhhh, I see what you did there.

programmarchy
10 replies
23h27m

Apple killed off OpenCL for their platforms when they created Metal which was disappointing. Sounds like ROCm will keep it alive but the fragmentation sucks. Gotta support CUDA, OpenCL, and Metal now to be cross-platform.

jart
9 replies
22h49m

What is OpenCL? AMD GPUs support CUDA. It's called HIP. You just need a bunch of #define statements like this:

    #ifndef __HIP__
    #include <cuda_fp16.h>
    #include <cuda_runtime.h>
    #else
    #include <hip/hip_fp16.h>
    #include <hip/hip_runtime.h>
    #define cudaSuccess hipSuccess
    #define cudaStream_t hipStream_t
    #define cudaGetLastError hipGetLastError
    #endif

Then your CUDA code works on AMD.

jiggawatts
7 replies
21h38m

Can you explain why nobody knows this trick, for some values of “nobody”?

wmf
5 replies
20h24m

People know; it just hasn't been reliable.

jart
4 replies
19h46m

What's not reliable about it? On Linux hipcc is about as easy to use as gcc. On Windows it's a little janky because hipcc is a perl script and there's no perl interpreter I'll admit. I'm otherwise happy with it though. It'd be nice if they had a shell script installer like NVIDIA, so I could use an OS that isn't a 2 year old Ubuntu. I own 2 XTX cards but I'm actually switching back to NVIDIA on my main workstation for that reason alone. GPUs shouldn't be choosing winners in the OS world. The lack of a profiler is also a source of frustration. I think the smart thing to do is to develop on NVIDIA and then distribute to AMD. I hope things change though and I plan to continue doing everything I can do to support AMD since I badly want to see more balance in this space.

wmf
3 replies
19h38m

The compilation toolchain may be reliable but then you get kernel panics at runtime.

jart
2 replies
19h37m

I've heard geohot is upset about that. I haven't tortured any of my AMD cards enough to run into that issue yet. Do you know how to make it happen?

imtringued
1 replies
6h1m

Last time I used AMD GPUs for GPGPU all it took was running hashcat to make the desktop rendering unstable. I'm sure leaving it run overnight would've gotten me a system crash.

jart
0 replies
23m

That's always happened with NVIDIA on Linux too, because Linux is an operating system that actually gives you the resources you ask for. Consider using a separate video card that's dedicated to your video needs. Otherwise you should use MacOS or Windows. It's 10x slower at building code. But I can fork bomb it while training a model and Netflix won't skip a frame. Yes I've actually done this.

jart
0 replies
21h18m

No idea. My best guess is their background is in graphics and games rather than machine learning. When CUDA is all you've ever known, you try just a little harder to find a way to keep using it elsewhere.

programmarchy
0 replies
14h44m

OpenCL is a Khronos open spec for GPU compute, and what you’d use on Apple platforms before Metal compute shaders and CoreML were released. If you wanted to run early ML models on Apple hardware, it was an option. There was an OpenCL backend for torch, for example.

moffkalast
3 replies
23h12m

OpenCL is as dead as OpenGL, and the inference implementations that exist are very unperformant. The only real options are CUDA, ROCm, Vulkan and CPU. And Vulkan is a proper pain too; it takes forever to build compute shaders and has to do so for each model. It only makes sense on Intel Arc since there's nothing else there.

zozbot234
0 replies
22h49m

SYCL is a fairly direct successor to the OpenCL model and is not quite dead, Intel seems to be betting on it more than others.

taminka
0 replies
22h39m

Why though? Except for Apple, most vendors still actively support it, and newer versions of OpenCL are still being released…

mpreda
0 replies
22h43m

ROCm includes OpenCL. And it's a very performant OpenCL implementation.

karmakaze
3 replies
22h39m

It would serve Nvidia right if their insistence on only running CUDA workloads on their hardware results in adoption of ROCm/OpenCL.

aseipp
1 replies
22h23m

You can use OpenCL just fine on Nvidia, but CUDA is just a superior compute programming model overall (both in features and design.) Pretty much every vendor offers something superior to OpenCL (HIP, OneAPI, etc), because it simply isn't very nice to use.

karmakaze
0 replies
21h4m

I suppose that's about right. The implementors are busy building on a path to profit and much less concerned about any sort-of lock-in or open standards--that comes much later in the cycle.

KeplerBoy
0 replies
22h32m

OpenCL is fine on Nvidia Hardware. Of course it's a second class citizen next to CUDA, but then again everything is a second class citizen on AMD hardware.

shmerl
0 replies
22h32m

Maybe Vulkan compute? But yeah, interesting how.

KronisLV
18 replies
23h48m

Feels like all of this local LLM stuff is definitely pushing people in the direction of getting new hardware, since nothing like RX 570/580 or other older cards sees support.

On one hand, the hardware nowadays is better and more powerful, but on the other, the initial version of CUDA came out in 2007 and ROCm in 2016. You'd think that compute on GPUs wouldn't require the latest cards.

hugozap
12 replies
23h29m

I'm a happy user of Mistral on my Mac Air M1.

isoprophlex
6 replies
23h23m

How many gbs of RAM do you have in your M1 machine?

hugozap
5 replies
23h18m

8gb

isoprophlex
4 replies
23h13m

Thanks, wow, amazing that you can already run a small model with so little ram. I need to buy a new laptop, guess more than 16 gb on a macbook isn't really needed

evilduck
0 replies
20h15m

I use several LLM models locally for chat UIs and IDE autocompletions like copilot (continue.dev).

Between Teams, Chrome, VS Code, Outlook, and now LLMs my RAM usage sits around 20-22GB. 16GB will be a bottleneck to utility.

dartos
0 replies
22h16m

Mistral is _very_ small when quantized.

I’d still go with 16gbs

TylerE
0 replies
21h49m

I've run LLMs and some of the various image models on my M1 Studio 32GB without issue. Not as fast as my old 3080 card, but considering the Mac all in has about a 5th the power draw, it's a lot closer than I expected. I'm not sure of the exact details but there is clearly some secret sauce that allows it to leverage the onboard NN hardware.

SparkyMcUnicorn
0 replies
22h23m

I would advise getting as much RAM as you possibly can. You can't upgrade later, so get as much as you can afford.

Mine is 64GB, and my memory pressure goes into the red when running a quantized 70B model with a dozen Chrome tabs open.

jonplackett
4 replies
23h22m

Is it easy to set this up?

LoganDark
2 replies
23h21m

Super easy. You can just head down to https://lmstudio.ai and pick up an app that lets you play around. It's not particularly advanced, but it works pretty well.

It's mostly optimized for M-series silicon, but it also technically works on Windows, and isn't too difficult to trick into working on Linux either.

glial
1 replies
23h19m

Also, https://jan.ai is open source and worth trying out too.

LoganDark
0 replies
23h14m

Looks super cool, though it seems to be missing a good chunk of features, like the ability to change the prompt format. (Just installed it myself to check out all the options.) All the other missing stuff I can see though is stuff that LM Studio doesn't have either (such as a notebook mode). If it has a good chat mode then that's good enough for most!

hugozap
0 replies
23h18m

It is, it doesn't require any setup.

After installation:

ollama run mistral:latest

superkuh
0 replies
21h21m

llama.cpp added first class support for the RX 580 by implementing the Vulkan backend. There are some issues in older kernel amdgpu code where an LLM process's VRAM is never reloaded if it gets kicked out to GTT (in 5.x kernels), but overall it's much faster than the clBLAST OpenCL implementation.

mysteria
0 replies
23h27m

Ollama's backend, llama.cpp, definitely supports those older cards with the OpenCL and Vulkan backends, though performance is worse than with ROCm or CUDA. In their Vulkan thread, for instance, I see people getting it working with Polaris and even Hawaii cards.

https://github.com/ggerganov/llama.cpp/pull/2059

Personally I just run it on CPU and several tokens/s is good enough for my purposes.

jmorgan
0 replies
21h12m

The compatibility matrix is quite complex for both AMD and NVIDIA graphics cards, and I completely agree: there is a lot of work to do, but the hope is to gracefully fall back to older cards... they still speed up inference quite a bit when they do work!

bradley13
0 replies
23h18m

No new hardware needed. I was shocked that Mixtral runs well on my laptop, which has a so-so mobile GPU. Mixtral isn't hugely fast, but definitely good enough!

sofixa
10 replies
1d

This is great news. The more projects do this, the less of a moat CUDA is, and the less of a competitive advantage Nvidia has.

anonymous-panda
8 replies
23h56m

What does performance look like?

sevagh
6 replies
23h55m

Spoiler alert: not good enough to break CUDA's moat

bornfreddy
3 replies
23h15m

Not sure why you're downvoted, but as far as I've heard AMD cards can't beat 4090 - yet.

Still, I think AMD will catch up to or overtake Nvidia in hardware soon, but software is a bigger problem. Hopefully the open-source strategy will pay off for them.

nerdix
1 replies
21h43m

An RTX 4090 is about twice the price of, and 50%-ish faster than, AMD's most expensive consumer card, so I'm not sure anyone really expects the latter to ever surpass a 4090.

A 7900 XTX beating a RTX 4080 at inference is probably a more realistic goal though I'm not sure how they compare right now.

Zambyte
0 replies
19h7m

The 4080 is $1k for 16gb of VRAM, and the 7900 is $1k for 24gb of VRAM. Unless you're constantly hammering it with requests, the extra speed you may get with CUDA on a 4080 is basically irrelevant when you can run much better models at a reasonable speed.

arein3
0 replies
22h43m

Really hope so; maybe this time it will catch on and last.

Usually when corps open source stuff to get adoption, they stuff the adopters after they gain enough market share and the cycle repeats again

qeternity
0 replies
23h23m

This is not CUDA's moat. That is on the R&D/training side.

Inference side is partly about performance, but mostly about cost per token.

And given that there has been a ton of standardization around LLaMA architectures, AMD/ROCm can target this much more easily, and still take a nice chunk of the inference market for non-SOTA models.

imtringued
0 replies
6h17m

Hypotheticals don't matter. The average user won't have the most expensive GPU and when it comes to VRAM AMD is half as expensive so they lead in this area.

Zambyte
0 replies
23h27m

I get 35tps on Mistral:7b-Instruct-Q6_K with my 6650 XT.

ixaxaar
0 replies
23h51m

Hey I did, and sorry for the self promo,

Please check out https://github.com/geniusrise - tool for running llms and other stuff, behaves like docker compose, works with whatever is supported by underlying engines:

Huggingface - MPS, CUDA

VLLM - CUDA, ROCm

llama.cpp, whisper.cpp - CUDA, MPS, ROCm

Also coming up: integration with Spark (TorchDistributor), Kafka, and Airflow.

observationist
9 replies
21h49m

There's a thing somewhat conspicuous in its absence - why isn't llama.cpp more directly credited and thanked for providing the base technology powering this tool?

All the other cool "run local" software seems to have the appropriate level of credit. You can find llama.cpp references in the code, being set up in a kind of "as is" fashion such that it might be OK as far as MIT licensing goes, but it seems kind of petty to have no shout out or thank you anywhere in the repository or blog or ollama website.

GPT4All - https://gpt4all.io/index.html

LM Studio - https://lmstudio.ai/

Both of these projects credit and attribute appropriately, Ollama seems to bend over backwards so they don't have to?

jart
8 replies
21h39m

ollama has made a lot of nice contributions of their own. It's a good look to give a hat tip to the great work llama.cpp is also doing, but they're strictly speaking not required to do that in their advertising any more than llama.cpp is required to give credit to Google Brain, and I think that's because llama.cpp has pulled off tricks in the execution that Brain never could have accomplished, just as ollama has had great success focusing on things that wouldn't make sense for llama.cpp. Besides everyone who wants to know what's up can read the source code, research papers, etc. then make their own judgements about who's who. It's all in the open.

refulgentis
5 replies
18h4m

Like what? What contributions has ollama made to llama.cpp? It's not a big deal or problem. It just hasn't.

And they could have very, very, easily, there's a server, sitting right there.

They chose not to.

That's fine. But it's a choice.

The rest I chalk it up to inexperience and being busy.

kleiba
3 replies
11h51m

OP never said "contributions to llama.cpp", the comment just said just "contributions" which I read as "own developments".

refulgentis
2 replies
1h51m

Fair. I didn't want to assume the worst, that it was just rhetorical slight of hand where "providing release packaging around an open source? you should credit it, at some point, somewhere." is implied as ridiculous, like asking llama.cpp to credit Google Brain. (Presumably, the implication is, for transformers / the Attention is All You Need paper)

jart
1 replies
1h1m

If you're going to accuse me of rhetorical sleight of hand, you could start by at least spelling it correctly. This whole code stealing shtick is the kind of thing I'd expect from teenagers on 4chan not from someone who's been professionally trained like you. Many open source licenses like BSD-4 and X11 are actually written to prohibit people from "giving credit" in advertising in the manner you're expecting.

refulgentis
0 replies
12m

I'm 35, got my start in coding by working on Handbrake at 17 when ffmpeg was added.

There, I learned that you're supposed to credit projects you depend on, especially ones you depend on heavily.

I don't know why you keep finding ways to dismiss this simple fact or find something else to attack (really? spelling? on Saturday morning!?!? :D).

Especially with a strong record of open source contributions yourself.

Especially when your project is a classic example of A) building around llama.cpp B) crediting it. Like, I literally was thinking about llamafile when I wrote my original comment, before I realized who I was replying to.

I'm really trying to find a communication bridge here because I'm deeply curious, and I'd appreciate you doing the same if I'm lucky enough to get a reply from your august personage again. (seriously! no sarcasm!) My latest guesses:

- you saw this post far after the early tide of, ex., exaggerating for clarity, "anyone got the leak on the deets on how these wizards did this?!?!?! CUDA going down!"

- You're unaware Ollama _does not mention or credit llama.cpp at all_. Not once. Never. The Google search query I used to verify my presumption is `site:ollama.com "llama.cpp"`. You will find that it is only mentioned in READMEs of repos of other people's models, mentioning how they quantized.

- You're unaware this is an ongoing situation. Probably the 3rd thread I've seen in 3 months with decreasing #s of people treating it like a independent commercial startup making independent breakthroughs, and increasing #s of people being like "...why are you still doing this..."

For those unaware, this is how jart's llamafile project credits llama.cpp, they certainly don't avoid it altogether, and they certainly don't seem to think its unnecessary. (source: https://github.com/Mozilla-Ocho/llamafile?tab=readme-ov-file...)

- 2nd sentence in Github README: "Our goal is to make open LLMs much more accessible to both developers and end users. We're doing that by combining ___llama.cpp___ with Cosmopolitan Libc into one framework"

- 21 mentions in README altogether.

- Under "How llamafile works", 3 mentions crediting llama.cpp in 5 steps.

- Announcement blog post: 4 mentions, Justine co-authored it. https://hacks.mozilla.org/2023/11/introducing-llamafile/

varjag
0 replies
11h58m

Where does the name come from?

galaxyLogic
1 replies
15h36m

Newton only gave credits to "Gods". He said he stands on shoulders of Gods or something like that. But he never mentioned which Gods in particular, did he?

renewiltord
8 replies
23h27m

Wow, that's a huge feature. Thank you, guys. By the way, does anyone have a preferred case where they can put 4 AMD 7900XTX? There's a lot of motherboards and CPUs that support 128 lanes. It's the physical arrangement that I have trouble with.

nottorp
4 replies
23h25m

Used crypto mining parts not available any more?

duskwuff
3 replies
22h39m

Crypto mining didn't require significant bandwidth to the card. Mining-oriented motherboards typically only provisioned a single lane of PCIe to each card, and often used anemic host CPUs (like Celeron embedded parts).

renewiltord
1 replies
22h25m

Exactly. They'd use PCIe x1 to PCIe x16 risers with power adapters. These LLM builds, on the other hand, require high bandwidth.

nottorp
0 replies
20h41m

Oh. Shows I wasn't into that.

I did once work with a crypto case, but yes, it was one motherboard with a lot of wifis and we still didn't need the pcie lanes.

cjbprime
0 replies
13h57m

Does LLM inference require significant bandwidth to the card? You have to get the model into VRAM, but that's a fixed startup cost, not a per-output-token cost.

segmondy
2 replies
18h53m

You don't need 128 lanes. 8x PCIe 3 is more than enough, so for 4 cards that's 32. Most CPUs have about 40 lanes. If you are not doing much, that would be more than sufficient. Buy a PCIe riser: go to Amazon and search for a 16x to 16x PCIe riser. They go for about $25-$30 and are often about 20-30cm long. If you want a really long one, you can get a 60cm one from China for about the same price; you just have to wait for 3 weeks. That's what I did. Stuffing all those in a case is often difficult, so you have to go open rig. Either have the cables running out of your computer and figure out a way to support the cards while keeping them cool, or just buy a small $20-$30 open rig frame.

renewiltord
0 replies
18h26m

Ah ha, that's the part I was curious about. I was wondering if I could keep everything cool on an open rig. I'm waiting for the stuff to arrive: risers, board, CPU, GPUs. And I've been putting it off because I wasn't sure about the case. All right then, open rig frame. Thank you!

Dylan16807
0 replies
12h39m

Most CPUs have about 20 lanes (plus a link to the chipset).

On the one hand, they will be gen 4 or 5, so they're the equivalent of 40-80 gen 3 lanes.

On the other hand, you can only split them up if you have a motherboard that supports bifurcation. If you buy the wrong model, you're stuck dedicating the equivalent of 64 gen 3 lanes to a single card.

Edit: Actually, looking into it further, current Intel desktop processors will only run their lanes as 16(+4) or 8+8(+4). You can kind of make 4 cards work by using chipset-fed slots, but that sucks. You could also get a PCIe switch but those are very expensive. AMD will do 4+4+4+4(+4) on the right boards.

deadalus
8 replies
23h56m

I wish AMD did well on the Stable Diffusion front, because AMD is never greedy on VRAM. The 4060 Ti 16GB (the minimum required for Stable Diffusion in 2024) starts at $450.

AMD with ROCm is decent on Linux but pretty bad on Windows.

choilive
5 replies
23h48m

They bump up VRAM because they can't compete on raw compute.

wongarsu
1 replies
23h4m

Or rather, Nvidia is purposefully restricting VRAM to avoid gaming cards cannibalizing their supremely profitable professional/server cards. AMD has no relevant server cards, so they have no reason to hold back on VRAM in consumer cards.

karolist
0 replies
10h35m

Nvidia released consumer RTX 3090 with 24GB VRAM in Sep 2020, AMDs flagship release in that same month was 6900 XT with 16GB VRAM. Who is being restrictive here exactly?

risho
1 replies
23h22m

it doesn't matter how much compute you have if you don't have enough vram to run the model.

Zambyte
0 replies
23h1m

Exactly. My friend was telling me that I was making a mistake for getting a 7900 XTX to run language models, when the fact of the matter is the cheapest NVIDIA card with 24 GB of VRAM is over 50% more expensive than the 7900 XTX. Running a high quality model at like 80 tps is way more important to me than running a way lower quality model at like 120 tps.

api
0 replies
21h34m

They lag on software a lot more than they lag on silicon.

wastewastewaste
0 replies
9h52m

You don't need 16GB; literally the majority of people don't have that and use 8GB and up, especially with Forge.

Adverblessly
0 replies
23h36m

I run A1111, ComfyUI and kohya-ss on an AMD card (a 6900XT, which has 16GB, the minimum required for Stable Diffusion in 2024 ;)), though on Linux. Is it a Windows-specific issue for you?

Edit to add: Though apparently I still don't run ollama on AMD since it seems to disagree with my setup.

Zambyte
6 replies
23h46m

It's pretty funny to see this blog post, when I have been running Ollama on my AMD RX 6650 for weeks :D

They have shipped ROCm containers since 0.1.27 (21 days ago). This blog post seems to be published along with the latest release, 0.1.29. I wonder what they actually changed in this release with regards to AMD support.

Also: see this issue[0] that I made where I worked through running Ollama on an AMD card that they don't "officially" support yet. It's just a matter of setting an environment variable.

[0] https://github.com/ollama/ollama/issues/2870

Edit: I did notice one change: the starcoder2[1] model works now. Before, it would crash[2].

[1] https://ollama.com/library/starcoder2

[2] https://github.com/ollama/ollama/issues/2953

throwaway5959
1 replies
23h15m

I mean, it was 21 days ago. What’s the difference?

yjftsjthsd-h
0 replies
23h12m

2 versions, apparently

mchiang
1 replies
21h16m

While the PRs went in slightly earlier, much of the time was spent on testing the integrations, and working with AMD directly to resolve issues.

There were issues that we resolved prior to cutting the release, and many reported by the community as well.

Zambyte
0 replies
19h16m

Thank you for clarifying and thanks for the great work you do!

layoric
0 replies
8h52m

What kind of performance do you see with an RX 6650?

cjbprime
0 replies
14h3m

Maybe they wanted to wait for bug reports for 21 days before publishing a popular blog post about it..?

freedomben
5 replies
23h1m

I'm thrilled to see support for RX 6800/6800 XT / 6900 XT. I bought one of those for an outrageous amount during the post-covid shortage in hopes that I could use it for ML stuff, and thus far it hasn't been very successful, which is a shame because it's a beast of a card!

Many thanks to ollama project and llama.cpp!

mey
4 replies
22h19m

Sad to see that the cutoff is just after the 6700 XT, which is what is in my desktop. They indicate more devices are coming; hopefully that includes some of the more modern all-in-one chips with RDNA 2/3 from AMD as well.

throawayonthe
1 replies
20h37m

I’ve already been using ollama with my 6700 XT just fine; you just have to set some env variable to make ROCm work “unofficially”.

The linked page says they will support more soon, so I’m guessing this will just be integrated.

Rising6378
0 replies
8h34m

Which env variable did you set, and to what? Currently I'm struggling to set it up.

bavell
0 replies
13h20m

I've been using my 6750XT for more than a year now on all sorts of AI projects. Takes a little research and a few env vars but no need to wait for "official" support most of the time.

bravetraveler
2 replies
23h33m

I wouldn't read too much into "support". It's more about business/warranty/promises than about what the hardware can actually do.

I've had a 6900XT since launch and this is the first I'm hearing "unsupported", having played with ROCM plenty over the years with Fedora Linux.

I think, at most, it's taken a couple key environment variables

HarHarVeryFunny
1 replies
23h8m

How hard would it be for AMD to just document the levels of support of different cards the way NVIDIA does with their "compute capability" numbers?!

I'm not sure what is worse from AMD - the ML software support they provide for their cards, or the utterly crap documentation.

How about one page documenting AMD's software stack compared to NVIDIA, one page documenting what ML frameworks support AMD cards, and another documenting "compute capability" type numbers to define the capabilities of different cards.

londons_explore
0 replies
22h55m

And it almost looks like they're deliberately trying not to win any market share.

It's as if the CEO is mates with Nvidia's CEO and has an unwritten agreement not to try too hard to topple the applecart...

Oh wait... They're cousins!

Symmetry
0 replies
23h48m

The newest release, 6.0.2, supports a number of other cards[1] and in general people are able to get a lot more cards to work than are officially supported. My 7900 XT worked on 6.0.0 for instance.

[1]https://rocm.docs.amd.com/projects/install-on-linux/en/lates...

65a
3 replies
13h55m

Yep, and it deserves the credit! He who writes the cuda kernel (or translates it) controls the spice.

I had wrapped this and had it working in Ollama months ago as well: https://github.com/ollama/ollama/pull/814. I don't use Ollama anymore, but I really like the way they handle device memory allocation dynamically, I think they were the first to do this well.

rahimnathwani
2 replies
13h3m

I'm curious about both:

- what's special about the memory allocation, and how might it help me?

- what are you now using instead of ollama?

65a
1 replies
12h39m

Ollama does a nice job of looking at how much VRAM the card has and tuning the number of gpu layers offloaded. Before that, I mainly just had to guess. It's still a heuristic, but I thought that was neat.

I'm mainly just using llama.cpp as a native library now, mainly for the direct access to more of llama's data structures, and because I have a sort of unique sampler setup.

rahimnathwani
0 replies
1h56m

Oh right... I've just been guessing, to try and find the value one fewer than the one which causes CUDA OOM errors.

kodarna
2 replies
23h27m

I wonder why they aren't supporting the RX 6750 XT and lower yet. Are there architectural differences between these and the RX 6800+?

slavik81
0 replies
23h14m

Those are Navi 22/23/24 GPUs while the RX 6800+ GPUs are Navi 21. They have different ISAs... however, the ISAs are identical in all but name.

LLVM has recently introduced a unified ISA for all RDNA 2 GPUs (gfx10.3-generic), so the need for the environment variable workaround mentioned in the other comment should eventually disappear.

airocker
2 replies
22h52m

I heard "Nvidia for LLM today is similar to how Sun Microsystems was for the web"

api
1 replies
21h35m

... for a very brief period of time until Linux servers and other options caught up.

airocker
0 replies
18h40m

An OS was possibly much more complicated to write at that time than CUDA is to write today. And the competition is too strong. It might be even briefer than Sun.

tarruda
1 replies
22h27m

Does this work with integrated Radeon graphics? If so it might be worth getting one of those Ryzen mini PCs to use as a local LLM server.

haolez
1 replies
19h23m

Is there an equivalent to ollama or gpt4all for Android? I'd like to host my model somewhere and talk to it via app.

verdverm
0 replies
18h36m

At that point, it sounds like you're after an API endpoint for any model. Lots of solutions out there; it depends more on your hosting.

Symmetry
1 replies
1d

Given the price of top line NVidia cards, if they can be had at all, there's got to be a lot of effort going on behind the scenes to improve AMD support in various places.

jeff-davis
0 replies
14h39m

What are the barriers to doing so?

SushiHippie
1 replies
23h33m

Can anyone (maybe ollama contributors) explain to me the relationship between llama.cpp and ollama?

I always thought that ollama was basically just a wrapper around llama.cpp (i.e. not many changes to the inference code, only built on top), but this makes it seem like it is more than that?

OtomotO
1 replies
19h12m

Hm, fooocus manages to run, but for Ollama I get:

    time=2024-03-16T00:11:07.993+01:00 level=WARN source=amd_linux.go:50 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers: amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
    time=2024-03-16T00:11:07.993+01:00 level=INFO source=amd_linux.go:85 msg="detected amdgpu versions [gfx1031]"
    time=2024-03-16T00:11:07.996+01:00 level=WARN source=amd_linux.go:339 msg="amdgpu detected, but no compatible rocm library found. Either install rocm v6, or follow manual install instructions at https://github.com/ollama/ollama/blob/main/docs/linux.md#man..."
    time=2024-03-16T00:11:07.996+01:00 level=WARN source=amd_linux.go:96 msg="unable to verify rocm library, will use cpu: no suitable rocm found, falling back to CPU"
    time=2024-03-16T00:11:07.996+01:00 level=INFO source=routes.go:1105 msg="no GPU detected"

Need to check how to install rocm on arch again... have done it once, a few moons back, but alas...

jmorgan
0 replies
18h57m

Ah, this is probably from missing ROCm libraries. The dynamic libraries are available as one of the release assets (warning: it's about 4GB expanded) https://github.com/ollama/ollama/releases/tag/v0.1.29 – dropping them in the same directory as the `ollama` binary should work.

userbinator
0 replies
16h55m

Thanks, Ollama.

(Sorry, could not resist.)

treprinum
0 replies
19h16m

Any info on the performance? How does it compare to 4080/4090?

skenderbeu
0 replies
5h59m

llama.cpp now supports AMD graphics. Ollama just wraps around it

rcarmo
0 replies
22h59m

Curious to see if this will work on APUs. Have a 7840HS to test, will give it a go ASAP.

latchkey
0 replies
23h18m

If anyone wants to run some benchmarks on MI300x, ping me.

jjice
0 replies
20h49m

Just downloaded this and gave it a go. I have no experience with running any local models, but this just worked out of the box on my 7600 on Ubuntu 22. This is fantastic.

h4x0rr
0 replies
19h4m

No support for my rx 5700xt :(

ekianjo
0 replies
12h58m

Does it also support AMD APUs?

VadimPR
0 replies
11h36m

Ollama runs really, really slow on my MBP for Mistral - as in just a few tokens a second and it takes a long while before it starts giving a result. Anyone else run into this?

I've seen that it seems to be related to the amount of system memory available when ollama is started (??); however, LM Studio does not have such issues.