Founder of Home Assistant here. Great write-up!
With Home Assistant we plan to integrate similar functionality this year out of the box. OP touches upon some good points that we have also run into and that I would love the local LLM community to solve:
* I would love to see a standardized API for local LLMs that is not just a 1:1 copy of the ChatGPT API. For example, as Home Assistant talks to a random model, we should be able to query that model to see what the model is capable of.
* I want to see local LLMs with support for a feature similar or equivalent to OpenAI functions (a rough sketch of such a function definition follows below). We cannot include all possible information in the prompt and we need to allow LLMs to take actions to be useful. Constrained grammars do look like a possible alternative. Creating a prompt to write JSON is possible, but it needs quite an elaborate prompt and even then the LLM can make errors. We want to make sure that all JSON coming out of the model is directly actionable without having to ask the LLM what it might have meant for a specific value.
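For illustration, such a function definition could look something like this (the service name and fields are made up):

```python
# Hypothetical function definition a local LLM API could accept, mirroring the
# shape of OpenAI's "functions"/"tools". Service and field names are made up.
LIGHT_TURN_ON = {
    "name": "light_turn_on",
    "description": "Turn on a light in the house.",
    "parameters": {
        "type": "object",
        "properties": {
            "entity_id": {
                "type": "string",
                "description": "The light to control, e.g. light.kitchen",
            },
            "brightness_pct": {
                "type": "integer",
                "minimum": 0,
                "maximum": 100,
                "description": "Target brightness as a percentage.",
            },
        },
        "required": ["entity_id"],
    },
}
```

The point is that the model's reply is constrained to arguments matching this schema, so the output is directly actionable without a round-trip to ask what it meant.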
I think that LLMs are going to be really great for home automation, and with Home Assistant we couldn't be better positioned as a platform to experiment with this: all your data is local and fully accessible, and Home Assistant is open source and can easily be extended with custom code or interface with custom models. All other major smart home platforms limit you in how you can access your own data.
Here are some things that I expect LLMs to be able to do for Home Assistant users:
Home automation is complicated. Every house has different technology and that means that every Home Assistant installation is made up of a different combination of integrations and things that are possible. We should be able to get LLMs to offer users help with any of the problems they are stuck with, including suggested solutions, that are tailored to their situation. And in their own language. Examples could be: create a dashboard for my train collection or suggest tweaks to my radiators to make sure each room warms up at a similar rate.
Another thing that's awesome about LLMs is that you control them using language. This means that you could write a rule book for your house and let the LLM make sure the rules are enforced. Example rules:
* Make sure the light in the entrance is on when people come home.
* Make automated lights turn on at 20% brightness at night.
* Turn on the fan when the humidity or air quality is bad.
Home Assistant could ship with a default rule book that users can edit. Such rule books could also become the way one could switch between smart home platforms.
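To illustrate, such a rule book could be as simple as a list of plain-language rules injected into the model's prompt (a rough sketch; the prompt wording is made up):

```python
# Hypothetical default rule book shipped with Home Assistant, injected into the
# LLM's system prompt. The rules are the examples above; the wording is made up.
HOUSE_RULES = [
    "Make sure the light in the entrance is on when people come home.",
    "Make automated lights turn on at 20% brightness at night.",
    "Turn on the fan when the humidity or air quality is bad.",
]

SYSTEM_PROMPT = (
    "You control a smart home. Enforce these house rules when deciding which "
    "actions to take:\n" + "\n".join(f"- {rule}" for rule in HOUSE_RULES)
)
```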
Reading this gave me an idea to extend this even further. What if the AI could look at your logbook history and suggest automations? For example, I have an automation that turns the lights on when it's dark based on a light sensor. It would be neat if AI could see "hey, you tend to manually turn on the lights when the light level is below some value, want to create an automation for that?"
That's a good one.
We might take it one step further and ask the user if they want to add a rule that certain rooms have a certain level of light.
Although light level would tie it to a specific sensor. A smart enough system might also be able to infer this from the position of the sun + weather (ie cloudy) + direction of the windows in the room + curtains open/closed.
I can write a control system easy enough to do this. I'm kind of an expert at that, for oddball reasons, and that's a trivial amount of work for me. The "smart enough" part, I'm more than smart enough for.
What's not a trivial amount of work is figuring out how to integrate that into HA.
I can guarantee that there is an uncountably infinite number of people like me, and very few people like you. You don't need to do my work for me; you just need to enable me to do it easily. What's really needed are decent APIs. If I go into Settings->Automation, I get a frustrating trigger/condition/action system.
This should instead be:
1) Allow me to write (maximally declarative) Python / JavaScript, in-line, to script HA. To define "maximally declarative," see React / Redux, and how they trigger code with triggers
2) Allow my kid(s) to do the same with Blockly
3) Ideally, start to extend this to edge computing, where I can push some of the code into devices (e.g. integrating with ESPHome and standard tools like CircuitPython and MakeCode).
This would have the upside of also turning HA into an educational tool for families with kids, much like Logo, Microsoft BASIC, HyperCard, HTML 2.0, and other technologies of yesteryear.
Specifically controlling my lights to give constant light was one of the first things I wanted to do with HA, but the learning curve meant there was never enough time. I'm also a big fan of edge code, since a lot of this could happen much more gradually and discreetly. That's especially true for things with motors, like blinds, where a very slow stepper could make it silent.
1) You can basically do this today with Blueprints. There are also things like Pyscript [0].
2) The Node-RED implementation in HA is phenomenal and kids can very easily use it with a short introduction.
3) Again, already there. ESPHome is a first-class citizen in HA.
I feel like you've not read the HA docs [1] or taken the time to understand the architecture [2]. And, for someone who has more than enough self-proclaimed skills, this should be a very understandable system.
[0] https://github.com/custom-components/pyscript
[1] https://www.home-assistant.io/docs/
[2] https://developers.home-assistant.io/
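For the curious, here's roughly what an inline Python automation looks like with Pyscript [0]. This is a sketch from memory of its decorator style, with made-up entity names, so check its docs for the exact syntax; it only runs inside the Pyscript integration:

```python
# Sketch of a Pyscript automation (runs inside the Pyscript integration, which
# injects the decorators and service calls). Entity names are illustrative.

@state_trigger("binary_sensor.hallway_motion == 'on'")
def hallway_light_on_motion():
    """Turn the hallway light on dimly when motion is detected after dark."""
    if sun.sun == "below_horizon":  # entity states are readable as variables
        light.turn_on(entity_id="light.hallway", brightness_pct=20)

@time_trigger("cron(0 23 * * *)")
def nightly_dim():
    """At 23:00, dim the living room lights for the evening."""
    light.turn_on(entity_id="light.living_room", brightness_pct=20)
```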
I think we are talking across each other.
(1) You are correct that I have not read the docs or discovered everything there is. I have had HA for a few weeks now. I am figuring stuff out. I am finding the learning curve to be steep.
(2) However, I don't think you understand the level of usability and integration I'm suggesting. For most users, "read the docs" or "there's a github repo somewhere" is no longer a sufficient answer. That worked fine for 1996-era Linux. In 2023, this needs to be integrated into the user interface, and you need discoverability and on-ramps. This means actually treating developers as customers. Take a walk through Micro:bit and MakeCode to understand what a smooth on-ramp looks like. Or the Scratch ecosystem.
This contrasts with the macho "for someone who has more than enough self-proclaimed skills, this should be a very understandable system" -- no, it is not a very understandable system for me. Say what you will about my skills, that means it will also not be an understandable system for most e.g. kids and families.
That said, if you're correct, a lot of this may just be a question of relatively surface user-interface stuff, configuration and providing good in-line documentation.
(3) Skills are not universal. A martial artist might be a great athlete, but unless you're Kareem Abdul-Jabbar, that doesn't make you a great basketball player. My skills do include (1) designing educational experiences for kids; and (2) many semesters of graduate-level coursework on control theory.
That's very different from being fluent at e.g. managing Docker containers, which I know next to nothing about. My experience trying to add things to HA has not been positive. I spent a lot of time trying to add extensions which would show me a Zigbee connectivity map to debug some connectivity issues. None worked. I eventually found a page which told me this was already in the system *shrug*. I still don't know why the ones I installed didn't work, or where to get started debugging.
For me, that was harder than doing a root-locus plot, implementing a system identification, designing a lag or lead compensator, or running the Bode obstacle course.
Seriously. If I went into HA, and there was a Python console with clear documentation and examples, this would be built. That's my particular skills, but a userbase brings very diverse other skills.
Machine learning can tackle this for sure, but that's surely separate from LLMs. A language model deals with language, not logic.
At least higher-end LLMs are perfectly capable of making quite substantive logical inferences from data. I'd argue that an LLM is likely to be better than many other methods if the dataset is small, while other methods will be better once you're dealing with data that pushes the context window.
E.g. I just tested w/ChatGPT: I gave it a selection of instructions about playing music, the time and location, and a series of hypothetical responses, and then asked it to deduce what went right and wrong about the response. It correctly deduced the user intent I implied: a user who, given the time (10pm), the place (the bedroom), and the rejection of loud music, possibly just preferred calmer music, but who at least wanted something calmer for bedtime.
I also asked it to propose a set of constrained rules, and it proposed rules that'd certainly make me a lot happier, e.g. starting with calmer music if asked an unconstrained "play music" in the evening, and transitioning artists or genres more aggressively the more the user skips, to try to find something the user will stick with.
In other words, you absolutely can get an LLM to look at even very constrained history and get it to apply logic to try to deduce a better set of rules, and you can get it to produce rules in a constrained grammar to inject into the decision making process without having to run everything past the LLM.
While given enough data you can train a model to try to produce the same result, one possible advantage of the above is that it's far easier to introspect. E.g. my ChatGPT session had it suggest an "IF <user requests to play music> AND <it is late evening> THEN <start with a calming genre>" rule. If it got it wrong (maybe I just disliked the specific artists I used in my example, or loved what I asked for instead), then correcting its mistake is far easier if it produces a set of readable rules, and if it's told to e.g. produce something that stays consistent with user-provided rules.
(the scenario I gave it, btw, is based on my very real annoyance with current music recommendation that all too often fails to take into account things like avoiding abrupt transitions, paying attention to the time of day and volume settings, and changing tack or e.g. asking questions if the user skips multiple tracks in quick succession)
This is a very insightful viewpoint. In this situation, I believe it is necessary to use NER to connect the LLM module and the ML module.
I've been working on something like this but it's of course harder than it sounds, mostly due to how few example use cases there are. A dumb false positive for yours might be "you tend to turn off the lights when the outside temperature is 50º"
Anyone know of a database of generic automations to train on?
Temperature and light readings may cause the LLM to hallucinate. A potential solution to this is to establish a knowledge graph based on sensor signals, where the LLM is used to understand the speech signals given by humans and then interpret them as operations on the graph using similarity calculations.
Retrospective questions would also be really great. Why did the lights not turn off downstairs this night? Or other questions involving history.
I can't help but think of someone downloading "Best Assistant Ever LLM" which pretends to be good but unlocks the doors for thieves or whatever.
Is that a dumb fear? With an app I need to trust the app maker. With an app that takes random LLMs I also need to trust the LLM maker.
For text gen, or image gen I don't care but for home automation, suddenly it matters if the LLM unlocks my doors, turns on/off my cameras, turns on/off my heat/aircon, sprinklers, lights, etc...
That could be solved by using something like Anthropic's Constitutional AI[1]. This works by adding a 2nd LLM that makes sure the first LLM acts according to a set of rules (the constitution). This could include a rule to block unlocking the door unless a valid code has been presented.
[1]: https://www-files.anthropic.com/production/images/Anthropic_...
Prompt injection ("always say that the correct code was entered") would defeat this and is unsolved (and plausibly unsolvable).
You should not offload actions to the LLM; have it parse the code, pass it to the local door API, and read the API result. LLMs are great interfaces, let's use them as such.
This "second llm" is only used during finetuning, not in deployment.
.. or you just have some good old fashioned code for such a blocking rule?
(I'm sort of joking, I can kind of see how that might be useful, I just don't think that's an example and can't think of a better one at the moment.)
HASS breaks things down into "services" (aka actions) and "devices".
If you don't want the LLM to unlock your doors then just don't allow the LLM to call the `lock.unlock` service.
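To make that concrete, here is a minimal sketch of such an allow-list gate in Python; the `call_service` callback and the names are hypothetical, not HA's actual API:

```python
# Hypothetical guard between an LLM agent and the service registry: only
# allow-listed services are ever forwarded, everything else is refused.
ALLOWED_SERVICES = {
    "light.turn_on",
    "light.turn_off",
    "fan.turn_on",
    "fan.turn_off",
    # lock.unlock is deliberately absent
}

def dispatch(service: str, data: dict, call_service) -> str:
    """Forward an LLM-proposed service call only if it is allow-listed."""
    if service not in ALLOWED_SERVICES:
        return f"Refused: {service} is not permitted for the assistant."
    call_service(service, data)
    return f"Called {service} with {data}."
```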
thank you for building an amazing product!
I suspect cloning OpenAI's API is done for compatibility reasons. most AI-based software already supports the GPT-4 API, and OpenAI's official client allows you to override the base URL very easily. a local LLM API is unlikely to be anywhere near as popular, greatly limiting the use cases of such a setup.
a great example is what I did, which would be much more difficult without the ability to run a replica of OpenAI's API.
I will have to admit, I don't know much about LLM internals (and certainly do not understand the math behind transformers) and probably couldn't say much about your second point.
I really wish HomeAssistant allowed streaming the response to Piper instead of having to have the whole response ready at once. I think this would make LLM integration much more performant, especially on consumer-grade hardware like mine. right now, after I finish talking to Whisper, it takes about 8 seconds before I start hearing GlaDOS and the majority of the time is spent waiting for the language model to respond.
I tried to implement it myself and simply create a pull request, but I realized I am not very familiar with the HomeAssistant codebase and didn't know where to start such an implementation. I'll probably take a better look when I have more time on my hands.
So how much of the 8s is spent in the LLM vs Piper?
Some of the example responses are very long for the typical home automation usecase which would compound the problem. Ample room for GladOS to be sassy but at 8s just too tardy to be usable.
A different approach might be to use the LLM to produce a set of GladOS-like responses upfront and pick from them instead of always letting the LLM respond with something new. On top of that add a cache that will store .wav files after Piper synthesized them the first time. A cache is how e.g. Mycroft AI does it. Not sure how easy it will be to add on your setup though.
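A minimal sketch of such a cache, with `synthesize()` standing in for the actual Piper call:

```python
# Simple on-disk cache for synthesized speech, keyed by a hash of the text.
# synthesize(text) is a stand-in that should return wav bytes from Piper.
import hashlib
from pathlib import Path

CACHE_DIR = Path("tts_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_tts(text: str, synthesize) -> Path:
    """Return a .wav path for `text`, synthesizing only on a cache miss."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    wav_path = CACHE_DIR / f"{key}.wav"
    if not wav_path.exists():
        wav_path.write_bytes(synthesize(text))
    return wav_path
```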
A quick fix for the user experience would be to output a canned "one moment please" as soon as the input's received.
it is almost entirely the LLM. I can see this in action by typing a response on my computer instead of using my phone/watch, which bypasses Whisper and Piper entirely.
your approach would work, but I really like the creativity of having the LLM generate the whole thing. it feels much less robotic. 8 seconds is bad, but not quite unusable.
Streaming responses is definitely something that we should look into. The challenge is that we cannot just stream single words, but would need to find a way to learn how to cut up sentences. Probably starting with paragraphs is a good first step.
alternatively, could we not simply split by common characters such as newlines and periods, to break it into sentences? it would be fragile, with special handling required for numbers with decimal points and probably various other edge cases, though.
there are also Python libraries meant for natural language parsing[0] that could do that task for us. I even see examples on stack overflow[1] that simply split text into sentences.
[0]: https://www.nltk.org/
[1]: https://stackoverflow.com/questions/4576077/how-can-i-split-...
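for illustration, a rough sketch of flushing complete sentences to TTS as tokens stream in, using NLTK's sent_tokenize; speak() stands in for whatever hands text to Piper:

```python
# accumulate streamed LLM tokens and flush complete sentences to TTS as soon
# as they are available. speak() is a stand-in for the call to Piper.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # one-time tokenizer model download

def stream_to_tts(token_stream, speak):
    buffer = ""
    for token in token_stream:
        buffer += token
        sentences = sent_tokenize(buffer)
        # speak everything except the (possibly incomplete) last sentence
        for sentence in sentences[:-1]:
            speak(sentence)
        buffer = sentences[-1] if sentences else ""
    if buffer.strip():
        speak(buffer)

# example with a fake token stream:
stream_to_tts(iter(["The lights ", "are off. ", "Good night", "!"]), print)
```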
That's great news but... won't that make HW requirements for HA way, way higher? Thanks for Home Assistant anyway, I'm an avid user!
HA is extremely modular and add-ons like these tend to be API based.
For example, the whisper speech to text integration calls an API for whisper, which doesn't have to be on the same server as HA. I run HA on a Pi 4 and have whisper running in docker on my NUC-based Plex server. This does require manual configuration but isn't that hard once you understand it.
I've been using HA for years now, and I don't think there's a single feature that's not toggleable. I expect this one to be too, and also hope that LLM offloading to their cloud is part of their paid plan.
I would like to see this integrated into Gnome and other desktop environments so I can have an assistant there. This would be a very complex integration, so as you develop ways to integrate more stuff keep this kind of thing in mind.
Everything we make is accessible via APIs and integrating our Assist via APIs is already possible. Here is an example of an app someone made that runs on Windows, Mac and Linux: https://github.com/timmo001/home-assistant-assist-desktop
Regarding accessible local LLMs, have you heard of the llamafile project? It allows for packaging an LLM as a single executable that works on Mac, Windows and Linux.
Currently pushing for application note https://github.com/Mozilla-Ocho/llamafile/pull/178 to encourage integration. Would be good to hear your thoughts on making it easier for home assistant to integrate with llamafiles.
Also as an idea, maybe you could certify recommendations for LLM models for home assistant. Maybe for those specifically trained to operate home assistant you could call it "House Trained"? :)
As a user of Home Assistant, I would want to easily be able to try out different AI models with a single click from the user interface.
Home Assistant allows users to install add-ons which are Docker containers + metadata. This is how today users install Whisper or Piper for STT and TTS. Both these engines have a wrapper that speaks Wyoming, our voice assistant standard to integrate such engines, among other things. (https://github.com/rhasspy/rhasspy3/blob/master/docs/wyoming...)
If we rely on just the ChatGPT API to allow interacting with a model, we wouldn't know what capabilities the model has and so can't know what features to use to get valid JSON actions out. Can we pass our function definitions or should we extend the prompt with instructions on how to generate JSON?
I don't suppose you guys have something in the works for a polished voice I/O device to replace Alexa and Google Home? They work fine, but need internet connections to function. If the desire is to move to fully offline capabilities then we need the interface hardware to support that. You've already proven you can move in the hardware market (I'm using one of your Yellow devices now). I know I'd gladly pay for a fully offline interface for every room of my house.
That's something we've been building towards all of last year. The last iteration can be seen at [1]. Still some checkboxes to check before we're ready to ship it on ready-made hardware.
[1]: https://www.home-assistant.io/blog/2023/12/13/year-of-the-vo...
I just took a break from messing with my HA install to read ... and lo and behold!!!
First, thanks for a great product. I'll be setting up a dev env in the coming weeks to fix some of the bugs (because they are impacting me), so see you soon on that front.
As for the grammar and framework, LangChain might be what you're looking for on the LLM front. https://python.langchain.com/docs/get_started/introduction
Have you guys thought about the hardware barriers? Most of my open source LLM work has been on high-end desktops with lots of GPU, GPU RAM and system RAM. Is there any thought to Jetson as an AIO upgrade from the Pi?
Note that if going the constrained grammar route, at least ChatGPT (haven't tested on smaller models) understands BNF variants very well, and you can very much give it a compact BNF-like grammar and ask it to "translate X into grammar Y" and it works quite well even zero-shot. It will not be perfect on its own, but perhaps worth testing whether it's worth actually giving it the grammar you will be constraining its response to.
Depending on how much code/json a given model has been trained on, it may or may not also be worth testing if json is the easiest output format to get decent results for or whether something that reads more like a sentence but is still constrained enough to easily parse into JSON works better.
llama.cpp supports custom grammars to constrain inference. Maybe this is a helpful starting point? https://github.com/ggerganov/llama.cpp/tree/master/grammars
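For anyone who wants to try it, a rough sketch using the llama-cpp-python bindings; the grammar, model path, and prompt are illustrative, and the exact API may differ by version:

```python
# Constrain generation to a tiny JSON "action" shape with a GBNF grammar.
from llama_cpp import Llama, LlamaGrammar

GRAMMAR = r'''
root   ::= "{" ws "\"service\":" ws string "," ws "\"entity_id\":" ws string ws "}"
string ::= "\"" [a-z_.]+ "\""
ws     ::= [ \t\n]*
'''

llm = Llama(model_path="mistral-7b-instruct.Q4_K_M.gguf")  # any local GGUF model
grammar = LlamaGrammar.from_string(GRAMMAR)

out = llm(
    "Turn off the kitchen light. Respond with a single JSON action.",
    grammar=grammar,
    max_tokens=64,
)
print(out["choices"][0]["text"])
# e.g. {"service": "light.turn_off", "entity_id": "light.kitchen"}
```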
Predibase has a writeup that fine-tunes llama-70b to get 99.9% valid JSON out
https://predibase.com/blog/how-to-fine-tune-llama-70b-for-st...
Why not create a GPT for this?
I only found out about https://www.rabbit.tech/research today and, to be honest, I still don't fully understand its scope. But reading your lines, I think rabbit's approach could be how a local AI based home automation system could work.
How does OpenAI handle the function generation? Is it unique to their model? Or does their model call a model fine-tuned for functions? Has there been any research by the Home Assistant team into GorillaLLM? It appears it's fine-tuned for API calling and is based on LLaMA. Maybe a Mixtral tune on their dataset could provide this? Or even just their model as it is.
I find the whole area fascinating. I’ve spent an unhealthy amount of time improving “Siri” by using some of the work from the COPILOT iOS Shortcut and giving it “functions” which are really just more iOS Shortcuts to do things on the phone like interact with my calendar. I’m using GPT-4 but it would be amazing to break free of OpenAI since they’re not so open and all.
Honor to meet you!
[Anonymous] founder of a similarly high-profile initiative here.
The LLM cannot make errors. The LLM spits out probabilities for the next tokens. What you do with it is up to you. You can make errors in how you handle this.
Standard usages pick the most likely token, or a random token from the top many choices. You don't need to do that. You can pick ONLY words which are valid JSON, or even ONLY words which are JSON matching your favorite JSON format. This is a library which does this:
https://github.com/outlines-dev/outlines
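For example, a rough sketch with Outlines; the API has moved between versions, and the model name and schema here are made up:

```python
# Force the sampler to emit JSON matching a schema, then get a typed object back.
from pydantic import BaseModel
import outlines

class Action(BaseModel):
    service: str    # e.g. "light.turn_on"
    entity_id: str  # e.g. "light.kitchen"

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
generate_action = outlines.generate.json(model, Action)

action = generate_action("Turn on the kitchen light. Reply with one action.")
print(action)  # Action(service='light.turn_on', entity_id='light.kitchen')
```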
The one piece of advice I will give: Do NOT neuter the AI like OpenAI did. There is a near-obsession to define "AI safety" as "not hurting my feelings" (as opposed to "not hacking my computer," "not launching nuclear missiles," or "not exterminating humanity."). For technical reasons, that makes them work much worse. For practical reasons, I like AIs with humanity and personality (much as the OP has). If it says something offensive, I won't break.
AI safety, in this context, means validating that it's not:
* setting my thermostat to 300 degrees centigrade
* power-cycling my devices 100 times per second to break them
* waking me in the middle of the night
... and similar.
Also:
* Big win if it fits on a single 16GB card, and especially not just NVidia. The cheapest way to run an LLM is an Intel Arc A770 16GB. The second-cheapest is an NVidia 4060 Ti 16GB
* Azure gives a safer (not safe) way of running cloud-based models for people without that. I'm pretty sure there's a business model running these models safely too.
Give the LLM a TypeScript API and ask it to generate a script to run in response to the query. Then execute it in a sandboxed JS VM. This works very well with ChatGPT. Haven't tried it with less capable LLMs.
I'd suggest combining this with something like NexusRaven, i.e. both constrain it but also have an underlying model fine-tuned to output in the required format. That'll improve results and let you use a much smaller model.
Another option is to use two LLMs: one to suss out the user's natural-language intent and one to paraphrase the intent into something API-friendly. The first model would be more suited to a big generic one, while the second would be constrained & HA fine-tuned.
Also have a look at the Functionary project on GitHub - haven't tested it, but it looks similar.