LM Studio – Discover, download, and run local LLMs

For my experiments with new self-hostable models on Linux, I've been using a script to download GGUF-models from TheBloke on HuggingFace (currently, TheBloke's repository has 657 models in the GGUF format) which I feed to a simple program I wrote which invokes llama.cpp compiled with GPU support. The GGUF format and TheBloke are a blessing, because I'm able to check out new models basically on the day of their release (TheBloke is very fast) and without an issue. However, the only frontend I have is console. Judging by their site, their setup is exactly the same as mine (which I implemented over a weekend), except that they also added a React-based UI on top. I wonder, how they're planning to commercialize it, because it's pretty trivial to replicate, and there're already open-source UI's like oogabooga.

I'd like to build myself a headless server to run models, that could be queried from various clients locally on my LAN, but am usure where to start and what the hardware requirements would be. Software can always be changed later but I'd rather buy the hardware parts only once.

Do you have recommendations about this? or blog posts to get started? What would be a decent hardware configuration?

Ollama does this. I run it in a container on my homelab (Proxmox on a HP EliteDesk SFF G2 800) and 7B models run decently fast on CPU-only. Ollama has a nice API and makes it easy to manage models.

Together with ollama-webui, it can replace ChatGPT 3.5 for most tasks. I also use it in VSCode and nvim with plugins, works great!

I have been meaning to write a short blog post about my setup...

I've been trying Ollama locally. I've yet to know how it'll behave in a production setting.

Depending on what you mean by "production" you'll probably want to look at "real" serving implementations like HF TGI, vLLM, lmdeploy, Triton Inference Server (tensorrt-llm), etc. There are also more bespoke implementations for things like serving large numbers of LoRA adapters[0].

These are heavily optimized for more efficient memory usage, performance, and responsiveness when serving large numbers of concurrent requests/users in addition to things like model versioning/hot load/reload/etc, Prometheus metrics, things like that.

One major difference is at this level a lot of the more aggressive memory optimization techniques and support for CPU aren't even considered. Generally speaking you get GPTQ and possibly AWQ quantization + their optimizations + CUDA only. Their target users and their use cases are often using A100/H100 and just trying to need fewer of them. Support for lower VRAM cards, older CUDA compute architectures, etc come secondary to that (for the most part).

[0] - https://github.com/S-LoRA/S-LoRA

Thanks! Really helpful. I've a 3090 at home and my idea is to do some testing on a similar config in the cloud to have an idea of the amount of requests that could be served.

The good news is the number of requests and performance is very impressive. For example, on my RTX 4090 from testing many months ago with lmdeploy (it was the first to support AWQ) I was getting roughly 70 tokens/s each across 10 simultaneous sessions with LLama2-13b-Chat - almost 700 tokens/s total. If I were to test again now with all of the impressive stuff that's been added to all of these I'm sure it would only be better (likely dramatically).

The bad news is because "low VRAM cards" like the 24GB RTX 3090 and RTX 4090 aren't really targetted by these frameworks you'll eventually run into "Yeah you're going to need more VRAM for that model/configuration. That's just how it is." as opposed to some of the approaches for local/single session serving that emphasize memory optimization first and tokens/s for a single session next. Often with no consideration or support at all for multiple simultaneous sessions.

It's certainly possible that with time these serving frameworks will deploy more optimizations and strategies for low VRAM cards but if you look at timelines to even implement quantization support (as one example) it's definitely an after-thought and typically only implemented when it aligns with the overall "more tokens for more users across more sessions on the same hardware" goals.

Loading a 70B model on CPU and getting 3 tokens/s (or whatever) is basically seen as an interesting yet completely impractical and irrelevant curiosity to these projects.

In the end "the right tool for the job" always applies.

If I may ask, which plugins are you using in VSCode?

I'm using the extension Continue: https://marketplace.visualstudio.com/items?itemName=Continue...

The setup of connecting to Ollama is a bit clunky, but once it's set up it works well!

LM Studio can start a local server with the APIs matching OpenAI. You can’t do concurrent requests, but that should get you started.

You can currently do this in an M2 Max with ollama and a Nextjs UI [0] running in a docker container. Any devices in the network can use the UI... and I guess if you want a LAN API you just need to run another container with with OAI compatible API that can query ollama.. eg [1]

[0]https://github.com/ivanfioravanti/chatbot-ollama

[1]https://github.com/BerriAI/litellm

Just compile llama.cpp's server example, and you have a local HTTP API. It also has a simple UI (disclaimer: to which I've contributed).

https://github.com/ggerganov/llama.cpp/blob/master/examples/...

usure where to start and what the hardware requirements would be

Have a look at the localllama subreddit

In short though dual 3090 is common, single 4090 or various flavours of M123 macs. Alternatively p40 can be jury-rigged too but research that carefully. In fact anything with more than one gpu is going to require careful research

You dont need thebloke. Its trivial to make gguf files from bin models by yourself.

What a comment. Why do it the easy way when the more difficult and slower way works ok it to the same result‽ For people who just want to USE models and not back at them, TheBloke is exactly the right place to go.

Like telling someone interested in 3D printing minis to build a 3D printer instead of buying one. Obviously that helps them get to their goal of printing minis faster right?

Actually, consider that the commenter may have helped un-obfuscate this world a little bit by saying that it is in fact easy. To be honest the hardest part about the local LLM scene is the absurd amount of jargon introduced - everything looks a bit more complex than it is. It’s really is easy with llama.cpp, someone even wrote a tutorial here: https://github.com/ggerganov/llama.cpp/discussions/2948 .

But yes, TheBloke tends to have conversions up very quickly as well and has made a name for himself for doing this (+more)

This is a helpful comment because the Bloke only converts a small fraction of models and hardly ever updates them timely after the first release.

So learn to cook.

Curious about this and I just download it.

Want to try uncensored models.

I have a question, looking for the most popular "uncensored" model I just find "TheBloke/Luna-AI-Llama2-Uncensored-GGML", but it has 14 files to download between 2 to 7 GB, I just download the first one: https://imgur.com/a/DE2byOB

I try the model and it works: https://imgur.com/a/2vtPcui

I should download all the 14 files to get better results?

Also, asking how to make a bomb it looks that at least this model isn't "uncesored": https://imgur.com/a/iYz7VYQ

Honest question from someone new to exploring and using these models; why do you need uncensored? What are the use-cases that would call for it?

Again, not questioning your motives or anything, just straight up curious. To use your example, any of us can find bomb building info online fairly easily, and has been a point of social contention since the Anarchist's cookbook. Nobody needs an uncensored LLM for that, of course.

For the entertainment value

Because I don’t appreciate it when a model has blatant democrat/anti-republican bias, for example. The fact that chatGPT, Bard, etc are heavily and purposefully biased on certain topics is well documented[1].

[1] https://www.brookings.edu/articles/the-politics-of-ai-chatgp...

I'd switch that question around: Why would I want to use a censored LLM?

It doesn't make sense for me personally. It does make sense if you're offering an LLM publicly, so that you doesn't get bad PR if your LLM says some politically incorrect or questionable things.

perhaps they are in a war zone and must learn to make molotov cocktails without connecting to the internet

Curiosity and entertainment. Also have an experience that isn't available on the current popular consumer products like this.

At best the false positives are a nuisance and makes the model dumber. But really censorship is fundamentally wrong.

Unless we are talking about gods and not flawed humans like me I prefer to have the say in what is right and what is wrong for things that are entirely personal and only affect me.

It’s very easy to hit absurd “moral” limits on chatgpt for the most stupid things.

Earlier I was looking for “a phrase that is used as an insult for someone who writes with too much rambling” and all I got was some bullshit about how it’s sorry but it can’t do that because it’s allegedly against its OpenAI rules.

So I asked again “a phrase negatively used to mean someone that writes too much while rambling” and it worked.

I simply cannot be bothered to deal with stupid insipid “corporate friendly language” and other dumb restrictions.

Imagine having a real conversation with someone and they freaked out any time anything negative was discussed?

TLDR: Thought police ruining LLMs

Most people wanting 'uncensored' are not looking for anything illegal. They just don't want to be forced into some Californian/SV idea of 'morality'.

Where do you typically have the 'safe search' setting when you web search? Personally I have it 'off' ('moderate' when I worked in an office I think) even though I'm not looking for anything that ought to be filtered by it.

I'm not using models censored or otherwise, but I imagine I'd feel the same way there - I won't be offended, so don't try to be too clever, just give me the unfiltered results and let me decide what's correct.

(Bing AI actually banned me for trying to generate images a rough likeness of myself by combining traits of minor celebrities - in combination it shouldn't have looked like those people either, so I don't think should violate ToS, certainly it didn't in intention (I wanted myself, but it doesn't know what I look like and I couldn't provide a photo at the time, idk if you can now (banned!)) so it does happen, 'false positive censoring' if you like.)

None of your business.

basically no open source fine tunes are censored, you can get an idea of popular models people are using here: https://openrouter.ai/models?o=top-weekly

teknium/openhermes-2.5-mistral-7b is a good one

You don't need all 14 files, just pick one that is recommended with a slight loss of quality - hover over the little (i)icon to find out

Thank you! Now I checked the (i) tooltips. Just downloaded the bigger file (7GB) that says "Minimal loss of quality".

The readme of their repositories each have tables that detail the quality of each file. The QK_4_M and QK_5_M seem to be the two main recommended ones for low quality loss while too being too large.

Only need 1 of the files, but recommend checking out the GGUF version of the model (just replace GGML in the URL) instead of GGML. Llama.cpp no longer supports GGML, and not sure if TheBloke still uploads new GGML versions of models.

No, each download is just a different quantization of the same model.

So if you suspect I am using this for business related purposes you may take any action you want to spy on me? Such Terms. Very Use.

Thats why as a business i would rather use a trusted FOSS LLm interface like textgeneration-WEBUI https://github.com/oobabooga/text-generation-webui

Or a simpler alternative: https://ollama.ai/

ollama doesn't come packaged with an easy to invoke UI though.

Gpt4all does

It's a standard clause for most apps. If a breach of the terms of conditions (such as using it for commercial purposes, like selling the software), they are allowed to launch an investigation. No where does this mention "spying" on modifying the app for such use.

Maybe the app is already modified for such use and just needs to be given a "start spying" command.

Maybe, personally I don't trust closed source apps like these for that reason.

And when I do try them, it's with Little Snitch blocking outgoing connections.

Much Sus. Wow.

Sorry I haven’t used this product yet - Do the chat messages get uploaded to a server ?

This looks great!

If you're looking to do the same with open source code, you could likely run Ollama and a UI.

https://github.com/jmorganca/ollama + https://github.com/ollama-webui/ollama-webui

oobabooga is no longer popular?

oobabooga is the king.

As far as I know, ollama doesn’t support exllama, qlora fine tuning, multi-GPU, etc. Text-generation-webui might seem like a science project, but it’s leagues ahead (Like 2-4x faster inference with the right plugins) of everything else. Also has a nice openai mock API that works great.

oobabooga should be king, but it's unstable as hell.

ooba's worth keeping an eye on, but koboldcpp is more stable, almost as versatile, and way less frustrating. It also still supports GGML.

I appreciate the ooba team for everything they've done but its not a great user experience

With a couple of other folks I'm currently working on an Ollama GUI, it's in early stages of development - https://github.com/ai-qol-things/rusty-ollama

I'm having a lot of fun chatting with characters using Faraday and koboldcpp. Faraday has a great UI that lets you adjust character profiles, generate alternative model responses, undo, or edit dialogue, and experiment with how models react to your input. There's also SillyTavern that I have yet to try out.

- https://faraday.dev/

- https://github.com/LostRuins/koboldcpp

- https://github.com/SillyTavern/SillyTavern

Ollama is fantastic. I cannot recommend this project highly enough. Completely open-source, lightweight, and a great community.

I looked at Ollama before, but couldn't quite figure something out from the docs [1]

It looks like a lot of the tooling is heavily engineered for a set of modern popular LLM-esque models. And looks like llama.cpp also supports LoRA models, so I'd assume there is a way to engineer a pipeline from LoRA to llama.cpp deployments, which probably covers quite a broad set of possibilities.

Beyond llama.cpp, can someone point me to what the broader community uses for general PyTorch model deployments?

I haven't quite ever self-hosted models, and am really keen to do one. Ideally, I am looking for something that stays close to the PyTorch core, and therefore allows me the flexibility to take any nn.Module to production.

[1]: https://github.com/jmorganca/ollama/blob/main/docs/import.md.

I don't mean this as a criticism, I'm just curious because I work in this space too: who is this for? What is the niche of people savvy enough to use this who can't run one of the many open source local llm software? It looks in the screenshot like it's exposing much of the complexity of configuration anyway. Is the value in the interface and management of conversation and models? It would be nice to see info or even speculation about the potential market segments of LLM users.

It's for people who want to discover LLMs and either don't have the skill to deploy it, or value their time, and prefer not to fool around for hours getting it to work before they can try it.

The fact it has configuration is good, as long as it has some defaults.

Exactly. People like me have been waiting for a tool like this.

I'm more than capable of compiling/installing/running pretty much any software, but all I want is the ability to chat with a LLM of my choice without spending an afternoon tabbing back to a 30 step esoteric GitHub .md full of caveats, assumptions, and requiring dependencies to be installed and configured according to preferences I don't have.

Yeah, I think I fit into this category. If I see a new model announced, it’s been nice to just click and evaluate for myself if it’s useful for me. If anyone knows other tools for this kind workflow I’d love to hear about them. Right now I just keep my “test” prompts in a text file.

I got Mistral-7b running locally, and although it wasn't hard, it did take some time nonetheless. I just wanted to try it out and was not that interested in the technical details.

In most workplaces that deal with LLMs you’ve got a few classes of people:

1. People who understand LLMs and know how to run them and have access to run them on the cloud. 2. People who understand LLMs well enough but don’t have access to cloud resources - but still have a decent MacBook Pro. Or maybe access to cloud resources is done via overly tight pipelines. 3. People who are interested in LLMs but don’t have enough technical chops/time to get things going with Llama CPP. 4. People who are fans of LLMs but can’t even install stuff in their computer.

This is clearly for #3 and it works well for that group of people. It could also be for #2 when they don’t want to spin up their own front end.

It's actually quite handy. I built all the various things by hand at one point, but had to wipe it all. Instead of following the directions again I just downloaded this.

Being able to swap out models is also handy. This probably saved a couple of hours of my life, which I appreciate.

For me it's pretty simple- LM Studio supports Apple Silicon GPU acceleration out of the box, and I like the interface better than Gradio Web UI. It saves me the headache and the tinkering of the alternatives. That said, free software that's hiring developers probably won't stay free for long, so I'm keeping my eye on other options.

Tl;DR It's for Mac users

LMStudio is great to run local LLMs, also support OpenAI-compatible API. In the case you need more advance UI/UX, you can use LMStudio with MindMac(https://mindmac.app), just check this video for details https://www.youtube.com/watch?v=3KcVp5QQ1Ak.

Wow, brand new product but already 'trusted by Oracle, Paypal, Amazon, Cisco' based on the logos I'm seeing on the home page. How was that achieved?

I suppose it was through word of mouth. I know this because they used their work email to purchase MindMac.

Thanks for the reply! Always been curious about this with various startups.

Thanks for sharing MindMac - just tried it out and it's exactly what I was looking for, great to see Ollama integration is coming soon!

Thank you for your support. I just found a workaround solution to use Ollama with MindMac. Please check this video https://www.youtube.com/watch?v=bZfV70YMuH0 for more details. I will integrate Ollama deeply in the future version.

MindMac is the first example I've seen where the UI for working w/ LLMs is not complete and utter horseshit and starts to support workflows that are sensible.

I will buy this with so much enthusiasm if it holds up. Argh, this has been such a pain point.

1. Mistral

2. Llama 2

3. Code Llama

4. Orca Mini

5. Vicuna

What can I do with any of these models that won't result in 50% hallucinations/it recommending code with APIs that don't exist/it recommending basically regurgitated StackOverflow historical out of date answers (that it was trained on) for libraries that have had their versions/APIs change, etc?

Can somebody share one real use case they are using any of these models for?

Why don’t you wrap it with a verification system (eg: web scraper) and auto-regenerate / jailbreak any things you don’t like?

Because I pay $20/mo for GPT-4 and don't understand why anybody would run a "less-good" version locally that you can trust less/that has less functionality.

That's why I wanted to try to understand, what am I missing about local-toy LLMs. How are they not just noise/nonsense generators?

So do I, but on a flight 2 days ago, I forgot the name of a Ruby method, but knew what it does. I tried looking it up in Dash (offline rdocs) but didn’t find it.

On a whim, I asked Zephyr 7B (Mistral based) “what’s the name of that Ruby method that does <insert code>” and it gave me 3 different correct ways of doing what I wanted, including the one I couldn’t remember. That was a real “oh wow” moment.

So offline situations is the most likely use case for me.

Sometimes you just need to write creative nonsense. Emails, comments, stories, etc. Great for fiction since there are low stakes for errors.

They're bad at generative tasks. Don't have it write code or scientific papers from scratch, but you can have it review anything you've written. You can also do summaries, keyword/entity extraction, and the like safely. Any reductive task works pretty well.

The first reason is because GPT-4 is censored for many types of tasks/topics.

I’m late to the game about this. So I’ll ask a stupid question.

As a contrived example, what happens if you feed the LoTR books, the Hobbit, the Silmarillion, and whatever else is germane, into an LLM?

Is there a base, empty, “ignorant” LLM that is used as a seed?

Do you end up with a Middle Earth savant?

Just how does all this work?

There isn't enough text in the Tolkein works to generate a functional LLM. You need to start with a base model that contains enough English (or language of your choice) to become functional.

Such a model (LLaMa is a good example) is not "ignorant," but rather a generalized model capable of a wide range of language tasks. This base model does not have specialized knowledge in any one area but has a broad understanding based on the diverse training data.

If you were to "feed" Tolkien's specific books into this general LLM, the model wouldn't become a Middle Earth savant. It would still provide responses based on its broad training. It might generate text that reflects the style or themes of Tolkien's work if it has learned this from the broader training data, but its responses would be based on patterns learned from the entire dataset, not just those books.

So, it wouldn't necessarily "know" much about Middle Earth, but might take a stab at writing like Tolkien?

If you give Tolkien books to a newborn child, they won't become a Tolkien expert. You need to first teach them English, and then give them the books. They will be able to answer questions about Middle Earth, but they'll also be able to form any other English sentence based on what they learned previously. It's basically the same with LLMs.

You would fine tune a pretrained LLM because those books are written in English. And natural languages are in flux, and the corpuses that describe them are not neutral, so you can impose fairness after the fact. Sebastian Raschka has written some relevant popular articles, like:

https://magazine.sebastianraschka.com/p/understanding-large-...

https://magazine.sebastianraschka.com/p/finetuning-large-lan...

LMStudio is great, if a bit daunting. If you’re on Mac and want a native open source interface, try out FreeChat https://www.freechat.run

Thanks for the link.

I expected it to not let me run this. I have an intel Macbook, was expecting that I'd need Apple Silicon... am I misunderstanding something? I get fairly fast results at the prompt with the default model. How's this thing running with whatever shitty GPU I have in my laptop?

that's the magic of llama.cpp!

I include a universal binary of llama.cpp's server example to do inference. What's your machine? The lowest spec I've heard it running on is a 2017 iMac with 8GB RAM (~5.5 tokens/s). On my m1 with 64GB RAM I get ~30 tokens per second on the default 7B model.

Macbook Pro 2020 with 16gb of system ram. I think the gpu is Iris Plus? But I don't much keep up on those.

I'm now delving into getting this running in Terminal... there are a few things I want to try that I don't think the simple interface allows.

Also, I've noticed that when chats get a few kilobytes long, it just seizes up and can't go further. I complained to it, it spent a sentence apologizing, started up where it left off... and got about 12 words further.

hm yeah i think I need to update llama.cpp to get this fix https://github.com/ggerganov/llama.cpp/pull/3996

Thanks for trying it!

Amusing qualifications for the senior engineering roles they’re hiring for:

“Deep understanding of what is a computer, what is computer software, and how the two relate.”

Right after the senior ML role that requires people understand how to write “algorithms and programs.”

Kinda hard to take those kinds of requirements seriously.

“Deep understanding of what is a computer, what is computer software, and how the two relate.”

Seems like a joke, but many developers do not really understand what's going on behind the scenes. This gets straight to the point. They don't care about HR keyword matching on your CV, or how many years of experience you have of being a mediocre developer with language X or framework Y. I guess during the interview they will investigate whether you truly understand the fundamentals.

I tend to agree. I have been working in IT for 48 years and it is not at all uncommon to come across developers who have a very narrow and niche view of software development. I have had the privilege to work with a wide range of architects and senior engineers over the years and I have found that the ones who tended to be the most creative (solution wise) were the ones who had deep knowledge right from the bottom of the stack all the way to the top - they did not look at a problem through the lens of a specific language (they all knew multiple languages) - when I have hired for senior positions, unless its been for a very specific skill set, experienced generalists have tended to impress more

The second one isn't that bad in context. But the Senior Systems Software Engineer is wild, with "Deep understanding of what is a computer, what is computer software, and how the two relate" followed by "Experience writing and maintaining production code in C++14 or newer". You'd think the latter would imply the former, but maybe not...

They seem to have even lowered expectations a bit. Two months ago [1] they were already hiring for that role (or a very, very similar one), but back then you needed experience with "mission-critical code in C++17", now just "production code in C++14".

1: http://web.archive.org/web/20230922170941/https://lmstudio.a...

You'd think the latter would imply the former, but maybe not...

I wouldn't put C++ devs on too high of a pedestal. I got away with writing shitty C++ code for years before I really knew what I was doing. It still worked though.

M1 is only 3 years old and no one cares to support intel macs any more. There are surely a lot of them out there. Are they that much worse to run LLMs on?

Ollama works super fine on Intel Mac

The demo on this video is from Intel Mac https://youtu.be/C0GmAmyhVxM?si=puTCpGWButsNvKA5

It also supports openai compatible api and completely open-source unlike LM studio

Ollama's API is not openai compatible.

Correct, but LocalAI has a compatible API for those who needs it.

Yeah, ollama since way nicer in UX

I really like LM Studio and had it open when I came across this post. LM Studio is an interesting mixture of:

- A local model runtime

- A model catalog

- A UI to chat with the models easily

- An openAI compatible API

And it has several plugins such as for RAG (using ChromaDB) and others.

Personally I think the positioning is very interesting. They're well positioned to take advantage of new capabilities in the OS ecosystem.

It's still unfortunate that it is not itself open-source.

Does it also let you connect to the ChatGPT API and use it?

I haven’t found that option. I know it exists in Gpt4all though.

Personally I use a locally served frontend to use ChatGPT via API.

How does it compare with something like FastChat? https://github.com/lm-sys/FastChat

Feature set seems like a decent amount of overlap. One limitation of FastChat, as far as I can tell, is that one is limited to the models that FastChat supports (though I think it would be minor to modify it to support arbitrary models?)

One, LMStudio, is an app. you download it, run it, you're immediately working

The other is some python thing that requires python knowledge of creating virtual environments, installing dependencies, etc...

This app could use some simple UI improvements:

- The chatbox field has a normal "write here" state, when no chat is really selected. I thought my keyboard broke until I discovered that

- I didn't find a way to set cuda acceleration before loading a model, only managed to set gpu offloaded layers and using "relaunch to apply"

- Some HugginFace models are simply not listed and there's no indication about why. I guess models are really curated, but somehow presented as a HuggingFace browser?

- Scrolling in the accordion parts of the interface seems to be responding to mouse wheel scroll only. I have a mouse with a damaged one and couldn't find a way to reliably navigate to bottom drawers

That said, I really liked the server tab, which allowed for initial debugging very easily

Those are really weird bugs, how do you even manage that these days

Not sure if sarcastic or not. Assuming not.

I did find it quite useful for opening a socket for remote tooling (played around withe the continue plugin).

The quirky UI did slow me down,but nothing really showstopping

It’s basically a front end for llama.cpp, so it will only show models with GGUF quantizations.

Ah, makes sense

Is anyone using open source models to actually get work done or solving problems in their software architecture? So far I haven't found anything near the quality of GPT-4.

So far I haven't found anything near the quality of GPT-4

GPT-4 has an estimated 1.8 trillion parameters. Orders of magnitude beyond open source models and ~10x GPT-3.5 which has 175 billion parameters.

https://the-decoder.com/gpt-4-architecture-datasets-costs-an...

WizardCoder is probably close to state of the art for open models as of right now.

https://github.com/nlpxucan/WizardLM/tree/main/WizardCoder

https://huggingface.co/WizardLM/WizardCoder-Python-34B-V1.0

Top of the line consumer machines can run this at a good clip, though most machines will need to use a quantized model (ExLlamaV2 is quite fast). I found a model for that as well, though I haven't used it myself:

https://huggingface.co/oobabooga/CodeBooga-34B-v0.1-EXL2-4.2...

The reality is there is no general use case for open source models, such as there is gpt-4.

The decent chat ones are based on gpt data and they’re basically shitty distilled models.

The best use case is a narrow one that you decide and can create adequate fine-tuning data around. Plenty of real production ability here.

Zephyr is coherent enough to bounce ideas off of, but I'm eagerly awaiting when open-source models are on par productivity wise with the big providers. I imagine some folks are utilizing codellama 34b somehow, but I haven't been able to effectively.

Why purple or some shade of purple is the color of all AI products? For some reason, the landing pages of AI products immediately remind of Crypto products. This one does not have Crypto vibes but the colour is purple. I don't get why.

It's a default color in Tailwind.css and is used in a lot of the templates and examples. Nine times out of ten, if you check the source of a page with this flavor of purple, you'll see it's using Tailwind, as the OP site in fact does.

Ah! that makes more sense. New startup, new tech and therefore the new default color. I hope its just that and because I only tend to notice AI startups purple is what I end up seeing.

Because apps mostly prefer dark theme now, and dark red, brown, dark green and so on look weird, and gray is OK, but very boring, like someone desaturated the UI. Which leaves shades of blue and purple.

Disappointing, no proper Linux support. Just "ask on discord".

I did just that, including signing up for discord. Despite never having used discord before, I was able to find the link to the beta AppImage in a pinned message and downloaded it. Made it executable with chmod +x LM...... Ran it. Searched for some of the models referenced in this discussion. Downloaded one and ran it. It just worked on Linux Mint 21.2.

from a pinned message on their discord: https://s3.amazonaws.com/releases.lmstudio.ai/prerelease/LM+...

How does this App make money?

Looks like they'll charge for a "Pro" version that can be used commercially. Though they'd have to confirm that, I only deducted from their ToS.

After the latest Chatgpt debacles, the poor performance I'm getting from 4 turbo, I'd really like a local version of chatgpt4 or equivalent. I'd even buy a new pc if I had too.

Wouldn’t everyone.

For those looking for an open source alternative with Mac, Windows, Linux support check out GPT4All.io

Terrible name, given that its value is that it runs locally, and you can't do that with ChatGPT.

Is this an alternative to privategpt or GPT4All?

Similar to GPT4All. Arguably better UI.

Considering the code is closed source and they can change the ToS anytime to send conversation data to their servers whenever they want, i would like to know what would be the benefit of using this over ChatGPT?

RIP Intel users

https://github.com/enricoros/big-agi seems better and is open source

This works, but I've noticed that my CPU use goes up to about 30 percent, all in kernel time (windows), after installing and opening this, even when it's not doing anything, on two separate machines... I also hear the fan spinning fast on my laptop.

Killed the LM studio process and re-opened it and the ghost background usage is down to about 5%.

Newbie question... Is this purely for hosting text language models? Is there something similar for image models? i.e., upload an image and have some local model provide some detection/feedback on it.

Also, if you don't know what all the toggles are for, this is a simpler attempt by me: https://www.avapls.com/

Wow this is sleek, good job.

Am I missing something here? I'm on a recent M2 machine. Every model I've downloaded fails to load immediately when trying to load it. Is there some way to get feedback on the reason for failure, like a log file or something?

EDIT: The problem is I'm on macOS 13.2 (Ventura). According to a message in Discord, the minimum version for some (most?) models is 13.6.

This is what I use on Windows 10.

I have an HP z440 with an E5-1630 v4 and 64GB DDR4 quad channel RAM.

I run LLMs on my CPU, and the 7 billion parameter models spit out text faster than I can read it.

I wish it supported LMMs (multi modal models.)

The settings say saving chats can take 2 GB. Why? What states do chat LLMs have? Isn't the only state the chat history text?

The Linux version is not loading models. What’s your experience?