
Show HN: Pi-C.A.R.D, a Raspberry Pi Voice Assistant

rkagerer
16 replies
22h38m

I wanted to create a voice assistant that is completely offline and doesn't require any internet connection, to ensure that the user's privacy is protected and that no data is sent to any third-party servers.

Props, and thank you for this.

pyaamb
11 replies
19h55m

I would love for Apple/Google to introduce some tech that would make it provable/verifiable that the camera/mic on the device can only capture when the indicator is on, and that it isn't possible for apps, or even higher layers of the OS, to spoof this.

herval
4 replies
19h49m

That’s allegedly the case in iOS (not the provable part, but I wonder if anyone managed to disprove it yet?)

pyaamb
1 replies
19h37m

I'm thinking perhaps a standardized, open-design circuit that you can inspect by opening up the back cover and zooming in with a microscope.

I feel like privacy tech like this, which once seemed wildly overkill for everyday users, becomes necessary as the value of collecting data and profiling humans goes through the roof.

sneak
0 replies
19h3m

The value of the data you willingly transmit via apps (which are upfront about transmitting your private data) and websites, both to data brokers and in terms of the harm it could do to you, is far, far greater than the value of the audio stream from inside your house.

If you don’t want your private information transmitted, worry about the things that are verifiably and obviously transmitting your private information instead of pointlessly fretting over things that are verifiably behaving as they claim.

Do you have the Instagram or Facebook apps on your phone? Are you logged in to Google?

These are much bigger threats to your privacy than a mic.

The sum total of all of your Telegram, Discord, and iMessage DMs (all of which are effectively provided to the service provider without end to end encryption) is way more interesting than 86400 images of you sitting in front of your computer with your face scrunched up, or WAVs of you yelling at your kids. One you knowingly provide to the service provider. The other never leaves your house.

Rebelgecko
1 replies
16h42m

It's annoyingly difficult to find an article about it, but IIRC there was a hack for the MacBook Pro's green camera LED where toggling the camera on and off quickly enough wouldn't give the LED time to light up.

JakeStone
0 replies
16h6m

I didn't know that, but I'm glad I have my MacBook in clamshell mode with an external camera with a physical cover.

I mean, I appreciated the little green light, but the fact that it seemed necessary indicates to me that humanity still needs some evolving.

abraae
2 replies
19h39m

Red nail polish would like a word

pyaamb
1 replies
19h35m

I missed the reference. Red nail polish?

abraae
0 replies
18h18m

Used to paint over the indicator LED

krono
1 replies
17h46m

ThinkPads come with that: an unspoofable indicator that tells you with 100% certainty that your image is not being recorded, or even recordable, unless the physical operator of the machine allows it. Can't beat a physical cover if you really want to be sure!

jerbear4328
0 replies
3h46m

Now make a cover for the microphone :)

everforward
0 replies
2h41m

You could probably add that if you were sufficiently motivated. Either by adding an LED on the power or data path, or by adding a physical switch. I think it should be fairly easy on laptops; I'm not sure where you'd jam the hardware in a phone or if you can access the cables for the camera/mic without messing with traces.

I'm a little curious if iPhones could be modified to route the {mic,camera} {power,data} through the silent mode switch, either instead of or in addition to the current functionality. I don't really have a need for a physical silent mode toggle, I'm totally fine with doing that in the settings or control panel.

rob74
1 replies
10h45m

Extra kudos for the name - and extra extra for using the good old "Picard facepalm" meme.

But seriously - the name got my attention, then I read the introduction and thought "hey, Alexa without uploading everything you say to Amazon? This might actually be something for me!".

The default wake word is "hey assistant" - I would suggest "Computer" :) And of course it should have a voice that sounds like https://en.wikipedia.org/wiki/Majel_Barrett

gpderetta
0 replies
9h27m

"Hey HAL, open the pod bay doors please"

squarefoot
0 replies
12h16m

+1. That's the #1 feature I want in any "assistant".

A question: does it run only on the Pi 5, or also on other (even non-Raspberry Pi) boards?

ornornor
0 replies
22h9m

Ditto!

eddieroger
16 replies
22h42m

> Why Pi-card? Raspberry Pi - Camera Audio Recognition Device.

Missed opportunity for LCARS - LLM Camera Audio Recognition Service, responding to the keyword "computer," naturally. I guess if this ran somewhere other than a Pi, it could be LCARS.

rkagerer
11 replies
22h40m

Pi-C.A.R.D is perfect. I read it 100% as Picard, and it's more recognizable than LCARS.

orthecreedence
10 replies
22h19m

Just configure it to respond to "Computer" and you're good to go.

fnordpiglet
2 replies
16h33m

As a professional technology person I say "computer" about 1 megatoken per day.

TeMPOraL
1 replies
9h17m

Yeah, but how often do you say "computer" in a querying/interrogatory tone?

That's a perfect opportunity to get better at cosplaying a Starfleet officer.

(Seriously though, a Federation-grade system would just recognize from context whether or not you meant to invoke the voice interface. Federation is full of near-AGI in simple appliances. Like the sliding doors that just know whether you want to go through them, or are just passing by.)

fnordpiglet
0 replies
2h12m

While totally true it’s not a good reason to use it as a wake word in 2024 with my raspberry pi voice assistant ;-)

datadrivenangel
2 replies
15h13m

Captain might be funnier.

Or just use the mouse.

rob74
0 replies
10h41m

It's just not very realistic - if you think you can give orders to your captain, you'll be out of Starfleet in no time!

eddieroger
0 replies
3h55m

How quaint.

thesnide
1 replies
21h0m

"Number One" would be my code word...

pimeys
0 replies
19h32m

And finally saying "make it so" to make the command happen.

nkaz123
1 replies
22h4m

The wake word detection is an interesting problem here. As you can see in the repo, I match against a lot of mis-heard versions of the wake word, which in this case is "Raspberry". Since the system heats up fast you need a fan, and with the microphone plugged directly into a USB port next to the fan, I needed something distinct; "computer" wasn't cutting it.

Changing the transcription model to something a bit better, or moving the mic away from the fan, could help.
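
Roughly, the matching boils down to something like this (the variant list and names here are illustrative, not the actual repo code):

    # Sketch: wake-word matching against known mis-transcriptions.
    WAKE_VARIANTS = {"raspberry", "rasp berry", "razz berry", "respberry"}

    def heard_wake_word(transcript: str) -> bool:
        # True if any known mis-heard variant appears in the transcript.
        text = transcript.lower()
        return any(variant in text for variant in WAKE_VARIANTS)

    heard_wake_word("hey razz berry what's the weather")  # True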

jhbruhn
0 replies
5h4m

Have a look at the openWakeWord model, which is built especially for detecting wake words in a stream of speech.
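
For reference, basic usage looks roughly like this (a sketch from memory; newer versions may require downloading the pretrained models first, so check the project README for the current API):

    import numpy as np
    from openwakeword.model import Model

    oww = Model()  # loads the bundled pretrained wake-word models

    # Feed 16 kHz, 16-bit mono audio in 1280-sample (80 ms) frames.
    frame = np.zeros(1280, dtype=np.int16)  # stand-in for a real mic frame
    scores = oww.predict(frame)             # dict: model name -> score in [0, 1]

    for name, score in scores.items():
        if score > 0.5:
            print("wake word detected:", name)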

bdcravens
1 replies
22h16m

Or LLM Offline Camera, User Trained Understanding Speech

LOCUTUS

TeMPOraL
0 replies
9h13m

s/Offline/Online/ and make sure it has all the cloud features enabled, so you and your friends and loved ones can become one with the FAANG collective.

layer8
0 replies
22h8m

It should really be something like Beneficial Audio Realtime Recognition Electronic Transformer.

pawelduda
9 replies
21h3m

All I need is a voice assistant that:

- an RPi 4 can handle,
- I can integrate with HomeAssistant,
- is offline only and doesn't send my data anywhere.

This project seems to be ticking most, if not all of the boxes, compared to anything else I've seen. Good job!

While we're at it, can someone drop a recommendation for an RPi-compatible mic for an Alexa-like use case?

baobun
5 replies
19h52m

Check out Rhasspy.

You won't get anything practically useful running LLMs on the 4B but you also don't strictly need LLM-based models.

In the Rhasspy community, a common pattern is to do (cheap and lightweight) wake-word detection locally on mic-attached satellites (here 4B should be sufficient) and then stream the actual recording (more computational resources for better results) over the local network to a central hub.
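
The pattern is simple to sketch (the mic/detector interfaces and hub address below are made up for illustration):

    import socket

    HUB_ADDR = ("hub.local", 12345)  # hypothetical central hub

    def satellite_loop(mic, detector):
        # Runs on the satellite Pi: cheap local wake-word detection,
        # while heavy speech-to-text happens on the hub.
        while True:
            frame = mic.read()  # e.g. 80 ms of 16 kHz PCM
            if not detector.triggered(frame):
                continue
            # Wake word heard: stream the utterance to the hub.
            with socket.create_connection(HUB_ADDR) as conn:
                for chunk in mic.record_until_silence():
                    conn.sendall(chunk)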

phkahler
2 replies
4h49m

> and then stream the actual recording (more computational resources for better results) over the local network to a central hub.

This frustrates me. I ran Dragon Dictate on a 200MHz PC in the 1990s. Now that wasn't top quality, but it should have been good enough for voice assistants. I expect at least that quality on-device with an R-Pi today if not better.

IMHO the end game is on-device speech recognition and anything streaming audio somewhere else for processing is delaying this result.

genewitch
0 replies
4h14m

In the same vein, there was a product called "Copernic Summarizer" that was functionally equivalent to asking ChatGPT to summarize an article - 20 years ago.

everforward
0 replies
2h28m

> IMHO the end game is on-device speech recognition and anything streaming audio somewhere else for processing is delaying this result.

Why? There's practically no latency to a central hub on the local network. A Raspberry Pi is probably over-specced for this, but I do very much see value in buying 5 $20 speaker/mic streaming stations and a $200 hub instead of buying 5 $100 Raspberry Pis.

If anything, I would expect the streaming to a hub solution to respond faster than the locally processed variant. My wifi latency is ~2ms, and my PC will run Whisper way more than 2ms faster than a Raspberry Pi. Add in running a model and my PC runs circles around a Raspberry Pi.

> I ran Dragon Dictate on a 200MHz PC in the 1990s. Now that wasn't top quality, but it should have been good enough for voice assistants. I expect at least that quality on-device with an R-Pi today if not better.

You should get that via Whisper. I haven't used Dragon Dictate, but Whisper works very well. I've never trained it on my voice, and it rarely struggles outside of things that aren't "real words" (and even then it does a passable job of trying to spell it phonetically).
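
If you want to try it yourself, a quick test with the openai-whisper package is only a few lines (model size trades speed against accuracy; "base" is a fast starting point):

    import whisper  # pip install openai-whisper; needs ffmpeg on PATH

    model = whisper.load_model("base")       # larger models transcribe better
    result = model.transcribe("command.wav")
    print(result["text"])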

No idea about resources. It's such an unholy pain in the ass to get working in the same process as LLMs on Windows that I'm usually just ecstatic it will run. One of these days I'll figure out how to do all the passthroughs to WSL so I can yet again pretend I'm not using Windows while I use Windows (lol).

shaan7
0 replies
11h47m

I've been a happy Rhasspy user and am now even more excited for the future, because I'm hoping the conversational style (similar to what OpenAI demo'd yesterday) will eventually come to the offline world as well. It's okay if I have to buy a GPU to make that happen - or maybe we get lucky and I won't? OK, maybe I'm getting too optimistic now.

BizarroLand
0 replies
2h40m

I have a decent home server. Is there a system like this where the heavy lifting runs on my home server but the front end is piped to a RasPi with speaker/camera, etc.?

If not, any good pointers to start adapting this platform to do that?

yangikan
0 replies
19h33m

I have gotten good results with PlayStation 3 and PlayStation 4 cameras (which also have a mic). They are available for $15-20 on eBay.

jhbruhn
0 replies
5h5m

Have you checked out the Voice Assistant functionalities integrated into HA? https://www.home-assistant.io/voice_control/

Nabu Casa employed the main Rhasspy dev to work on this functionality, and it is progressing with every update.

stereosteve
5 replies
19h21m

Why does Picard always have to specify temperature preference for his Earl Grey tea? Shouldn't the super-smart AI have learned his preference by now?

pkaye
0 replies
16h8m

Well, the one time he just said "Tea, Earl Grey", the computer assumed "Tea, Earl Grey, lukewarm".

lttlrck
0 replies
17h45m

Perhaps he must be specific to override a hard, lawsuit-proof default that is too tepid for his tastes.

Will there still be lawsuits in the post-scarcity world? Probably.

dragonwriter
0 replies
19h16m

> Why does Picard always have to specify temperature preference for his Earl Grey tea?

Completely OT, but, most likely, he doesn't. Lots of people in the show interact with the replicators more fluidly; "Tea, Earl Grey, Hot" seems like a Picard quirk, possibly developed with more primitive food/beverage units than the replicators on the Enterprise-D.

TeMPOraL
0 replies
9h11m

Force of habit?

Most Starfleet folks seem to not know how to use their replicators well anyway. For all the smarts the replicators have, people use them like a mundane appliance they never bothered to read the manual for, miss 90% of the functionality, and then complain that replicated food tastes bad.

cmcconomy
5 replies
21h39m

Funny, I just picked up a device for use with https://heywillow.io for similar reasons

knodi123
4 replies
20h5m

me too, but I bricked mine when flashing the bios. just a fluke, nothing to be done about it.

nkaz123
3 replies
18h36m

I watched the demo; to be honest, if I had seen it sooner I probably would have tried to start this as a fork from there. Any idea what the issue was?

knodi123
1 replies
14h18m

No, but the core concept had changed so much between the firmware version when I bought it and what it is now that I'm not surprised the upgrade is a buggy process.

It's a shame, because I only wanted a tenth of what it could do- I just wanted it to send text to a REST server. That's all! And I had it working! But I saw there was a major firmware update, and to make a long story short, KABOOM.

kkielhofner
0 replies
6h47m

I'm the founder of Willow.

Early versions sent requests directly from devices and we found that to be problematic/inflexible for a variety of reasons.

Our new architecture uses the Willow Application Server (WAS) with a standard generic protocol to devices and handles the subtleties of the various command endpoints (Home Assistant, REST, OpenHAB, etc). This can be activated in WAS with the "WAS Command Endpoint" configuration option.

This approach also enables a feature we call Willow Auto Correct (WAC) that our users have found to be a game-changer in the open source voice assistant space.

I don't want to seem like I'm selling or shilling Willow, I just have a lot of experience with this application and many hard-learned lessons over the past two decades.

kkielhofner
0 replies
6h17m

I'm the founder of Willow.

Generally speaking, the approach of "run this on your Pi with distro X" has a surprising number of issues that make it, umm, "rough" for this application. As anyone who has tried to get reliable low-latency audio on Linux can attest, the hoops to get a standard distro (audio drivers, ALSA, pulseaudio/pipewire, etc.) to work well are many. This is why all commercial solutions using similar architectures/devices build the "firmware" from the kernel up with tools like Yocto.

More broadly, many people have tried this approach (run everything on the Pi) and frankly speaking it just doesn't have a chance to be even remotely competitive with commercial solutions in terms of quality and responsiveness. Decent quality speech recognition with Whisper (as one example) itself introduces significant latency.

We (and our users) have determined over the past year that in practice Whisper medium is pretty much the minimum for reliable transcription for typical commands under typical real-world conditions. If you watch a lot of these demo videos you'll notice the human is very close to the device and speaking very carefully and intently. That's just not the real world.

We haven't benchmarked Whisper on the Pi 5, but the HA community has, and they find it to be about 2x the performance of the Pi 4 for Whisper. If you look at our benchmarks[0], that means you'll be waiting at least 25 seconds just for transcription of typical voice commands with Whisper medium.

At that point you can find your phone, unlock, open an app, and just do it yourself faster with a fraction of the frustration - if you have to repeat yourself even once you're looking at one minute for an action to complete.

The hardware just isn't there. Maybe on the Raspberry Pi 10 ;).

We have a generic software-only Willow client for "bring your distro and device" and it does not work anywhere close to as well as our main target of ESP32-S3 BOX based devices (acoustic engineering, targeted audio drivers, real-time operating system, known hardware). That's why we haven't released it.

Many people also want multiple devices and this approach becomes even more cumbersome when you're trying to coordinate wake activation and functionality between all of them. You then also end up with a fleet of devices with full-blown Linux distros that becomes difficult to manage.

Using random/generic audio hardware itself also has significant issues. Many people in the Home Assistant community that have tried the "just bring a mic and speaker!" approach end up buying > $50 USB speakerphone devices with the acoustic engineering necessary for reliable far-field audio. They are often powered by XMOS chips that themselves cost a multiple of the ESP32-S3 (as one example). So at this point you're in the range of > $150 per location for a solution that is still nowhere near the experience of Echo, Google Home, or Willow :).

I don't want to seem critical of this project, but I think it's important to set very clear expectations before people go out and spend money thinking it's going to resemble anything close to a commercial voice assistant.

[0] - https://heywillow.io/components/willow-inference-server/#com...

robbyiq999
2 replies
10h32m

It would be cool to see some RasPi hats you could plug a GPU into, though I'm unsure how practical or feasible that would be. Today's graphics cards are tomorrow's e-waste; perhaps they could get a second life beefing up a DIY RasPi project like this.

piltdownman
0 replies
8h9m

Most of the USP, outside of the ecosystem developed around the single platform, is to do with form factor and power draw. Adding a GPU/adapter/PSU to leverage cheap CUDA cores probably works out worse in power, price, and form factor than going for a better SoC or an x86 NUC solution.

genewitch
0 replies
4h8m

For crypto-mining you'd convert a single PCIe slot into 4 x1 PCIe slots, or just use a board with 12+ x1 PCIe slots. I'm not sure what magic goes into PCIe, but I do know that at least one commodity board had an "exposed" PCIe interface: the Atomic Pi.

Anyhow, the GPU would sit on a small PCB that connects, via a USB3 cable, to an even smaller PCB in the PCIe slot on the motherboard. My point here is merely that whatever PCIe is, it can be transported to a GPU to do work over USB3 cables.

knodi123
1 replies
20h4m

Is it possible to run this on a generic linux box? Or if not, are you aware of a similar project that can?

I've googled it before, but the space is crowded and the caveats are subtle.

CaptainOfCoit
0 replies
5h57m

A Raspberry Pi is very much like a generic Linux box, the biggest difference being that it has an ARM rather than an Intel/AMD CPU, which makes some things slightly less widely supported.

But overall, Pi-C.A.R.D seems to be using Python and C++, so there shouldn't be any issue running it on whatever Python and C++ can be run/compiled on.

harwoodr
1 replies
22h52m

I see that a speaker is in the hardware list - does this speak back?

nkaz123
0 replies
22h13m

Yes! I'm currently using https://espeak.sourceforge.net/, so it isn't especially fun to listen to, though.

Additionally, since I'm streaming the LLM response, it doesn't take long to get a reply. Because it speaks a chunk at a time, occasionally only parts of words are said momentarily. How long you wait also depends, of course, on which model you use and on the context size.
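
Roughly, the chunking looks like this (a simplified sketch, not the exact code in the repo; the token source and clause-boundary rule are stand-ins):

    import subprocess

    def speak_stream(tokens):
        # Buffer streamed LLM tokens and speak complete clauses via espeak.
        buf = ""
        for tok in tokens:
            buf += tok
            if buf and buf[-1] in ".!?,;:":
                subprocess.run(["espeak", buf])  # blocks until spoken
                buf = ""
        if buf.strip():
            subprocess.run(["espeak", buf])      # flush any trailing text

    speak_stream(["Make", " it", " so", "."])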

ghnws
1 replies
9h29m

I didn't see a mention of languages in the readme. Does this understand languages other than English?

timendum
0 replies
8h54m

The suggested model for vision capabilities is English-only.

ethagnawl
1 replies
16h12m

I'm looking forward to trying this. Hopefully it gains traction, as (AFAIK) an open, reliable, flexible, privacy-focused voice assistant is still sorely needed.

About a year ago, my family was really keen on getting an Alexa. I don't want Bezos spy devices in our home, so I convinced them to let me try making our own. I went with Mycroft on a Pi 4 and it did not go well. The wake word detection was inconsistent, the integrations were lacking and I think it'd been effectively abandoned by that point. I'd intended to contribute to the project and some of the integrations I was struggling with but life intervened and I never got back to it. Also, thankfully, my family forgot about the Alexa.

genewitch
0 replies
4h5m

some "maker" products that were sold at target had a cardboard box with an arcade RGB-LED button on top, a speaker, and 4 microphones on a "hat" for a rpi... nano? pico? whatever the one that is roughly the size of a sodimm. It didn't have a wakeword, you'd push the white-lit button, it would change color twice, once to acknowledge the press, and again to indicate it was listening. It would change color once you finished speaking and then it would speak back to you.

It used some google thing on the backend, and it was really frustrating to get set up and keep working - but it did work.

i have two of those devices, so i've been waiting for something to come that would let me self-host something similar.

dasl
1 replies
21h31m

What latency do you get? I'd be interested in seeing a demo video.

nkaz123
0 replies
21h28m

It fully depends on the model and on how much conversational context you provide, but if you keep things to a bare minimum, roughly under 5 seconds from message received to starting the response, using Llama 3 8B. I'm also using a vision language model, https://moondream.ai/, but that takes around 45 seconds, so the next idea is to take a more basic image captioning model, insert its output into the context, and try to cut that time down even more.

I also tried using Vulkan, which is supposedly faster, but the times were a bit slower than plain CPU for llama.cpp.
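
The caption-into-context idea could look something like this (the BLIP model choice and prompt format here are just for illustration, not the project's actual setup):

    from transformers import pipeline

    # A small image-captioning model stands in for the slower vision LLM.
    captioner = pipeline("image-to-text",
                         model="Salesforce/blip-image-captioning-base")

    caption = captioner("snapshot.jpg")[0]["generated_text"]

    # Inject the caption into the LLM context instead of running the
    # full vision-language model on every request.
    prompt = "The camera currently sees: " + caption + "\nUser: what is in front of me?"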

aci_12
1 replies
21h33m

How does the wake word work? Does it keep listening and ignore the audio if the last few seconds don't contain the wake word/phrase?

knodi123
0 replies
20h5m

That's the general idea, yes. Or rather: store several chunks of audio and discard the oldest, aka a "rolling window".
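
A minimal version of that rolling window (chunk size, window length, and the transcribe callable are arbitrary stand-ins):

    from collections import deque

    CHUNKS_PER_WINDOW = 25  # e.g. 25 x 200 ms = 5 s of audio
    window = deque(maxlen=CHUNKS_PER_WINDOW)  # oldest chunk drops automatically

    def on_audio_chunk(chunk: bytes, transcribe, wake_word="picard"):
        window.append(chunk)
        text = transcribe(b"".join(window))  # STT over the whole window
        return wake_word in text.lower()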

zenkalia
0 replies
14h46m

How hard is it to make an offline model like this learn?

The readme mentions a memory that lasts only as long as each conversation, which seems like a hard limitation to live with.

nl
0 replies
11h22m

> It uses distributed models so latency is something I'm working on

I think it uses local models, right?

nickthegreek
0 replies
19h1m

Show HN:

kazinator
0 replies
17h6m

I would have hidden the name behind a bit of indirection by calling it Jean Luc or something.

ddingus
0 replies
18h35m

Thank you! This is a high value effort!

MH15
0 replies
19h26m

I tried to build this on an early-gen RPi 4 about three years ago, but ran into limitations in the hardware (and in my own knowledge). Super cool to see it happening now!

8mobile
0 replies
13h24m

I really like having a voice assistant that focuses on privacy first. I will definitely try it. One suggestion: always add a video to show how it works. Thank you!