
Show HN: Pi-C.A.R.D, a Raspberry Pi Voice Assistant

rkagerer
16 replies
22h38m

I wanted to create a voice assistant that is completely offline and doesn't require any internet connection, to ensure that the user's privacy is protected and that no data is sent to any third-party servers.

Props, and thank you for this.

pyaamb
11 replies
19h55m

I would love for Apple/Google to introduce some tech that would make it provable/verifiable that the camera/mic on the device can only capture when the indicator is on, and that it isn't possible for apps, or even higher layers of the OS, to spoof this.

herval
4 replies
19h49m

That’s allegedly the case in iOS (not the provable part, but I wonder if anyone managed to disprove it yet?)

pyaamb
1 replies
19h37m

I'm thinking perhaps a standardized, open-design circuit that you can inspect by opening up the back cover and zooming in with a microscope.

I feel like privacy tech like this, which once seemed wildly overkill for everyday users, becomes necessary as the value of collecting data and profiling humans goes through the roof.

sneak
0 replies
19h3m

The value of the data you willingly transmit via apps (which are upfront about transmitting your private data) and websites, both to data brokers and in terms of the harm it could do to you, is far, far greater than the value of the audio stream from inside your house.

If you don’t want your private information transmitted, worry about the things that are verifiably and obviously transmitting your private information instead of pointlessly fretting over things that are verifiably behaving as they claim.

Do you have the Instagram or Facebook apps on your phone? Are you logged in to Google?

These are much bigger threats to your privacy than a mic.

The sum total of all of your Telegram, Discord, and iMessage DMs (all of which are effectively provided to the service provider without end to end encryption) is way more interesting than 86400 images of you sitting in front of your computer with your face scrunched up, or WAVs of you yelling at your kids. One you knowingly provide to the service provider. The other never leaves your house.

Rebelgecko
1 replies
16h42m

It's annoyingly difficult to find an article about it, but IIRC there was a hack for the MacBook Pro's green camera LED where toggling the camera on and off quickly enough wouldn't give the LED time to light up.

JakeStone
0 replies
16h6m

I didn't know that, but I'm glad I have my MacBook in clamshell mode with an external camera with a physical cover.

I mean, I appreciated the little green light, but the fact that it seemed necessary indicates to me that humanity still needs some evolving.

abraae
2 replies
19h39m

Red nail polish would like a word

pyaamb
1 replies
19h35m

I missed the reference. Red nail polish?

abraae
0 replies
18h18m

Used to paint over the indicator LED

krono
1 replies
17h46m

ThinkPads come with that: an unspoofable indicator that tells you with 100% certainty that your image is not being recorded, or even recordable, unless the physical operator of the machine allows it. Can't beat a physical cover if you really want to be sure!

jerbear4328
0 replies
3h46m

Now make a cover for the microphone :)

everforward
0 replies
2h41m

You could probably add that if you were sufficiently motivated. Either by adding an LED on the power or data path, or by adding a physical switch. I think it should be fairly easy on laptops; I'm not sure where you'd jam the hardware in a phone or if you can access the cables for the camera/mic without messing with traces.

I'm a little curious if iPhones could be modified to route the {mic,camera} {power,data} through the silent mode switch, either instead of or in addition to the current functionality. I don't really have a need for a physical silent mode toggle, I'm totally fine with doing that in the settings or control panel.

rob74
1 replies
10h45m

Extra kudos for the name - and extra extra for using the good old "Picard facepalm" meme.

But seriously - the name got my attention, then I read the introduction and thought "hey, Alexa without uploading everything you say to Amazon? This might actually be something for me!".

The default wake word is "hey assistant" - I would suggest "Computer" :) And of course it should have a voice that sounds like https://en.wikipedia.org/wiki/Majel_Barrett

gpderetta
0 replies
9h27m

"Hey HAL, open the pod bay doors please"

squarefoot
0 replies
12h16m

+1. That's the #1 feature I want in any "assistant".

A question: does it run only on the Pi 5, or also on other (even non-Raspberry Pi) boards?

ornornor
0 replies
22h9m

Ditto!

eddieroger
16 replies
22h42m

> Why Pi-card? Raspberry Pi - Camera Audio Recognition Device.

Missed opportunity for LCARS - LLM Camera Audio Recognition Service, responding to the keyword "computer," naturally. I guess if this ran somewhere other than a Pi, it could be LCARS.

rkagerer
11 replies
22h40m

Pi-C.A.R.D is perfect. I read it 100% as Picard, and it's more recognizable than LCARS.

orthecreedence
10 replies
22h19m

Just configure it to respond to "Computer" and you're good to go.

fnordpiglet
2 replies
16h33m

As a professional technology person I say "computer" about 1 megatoken per day.

TeMPOraL
1 replies
9h17m

Yeah, but how often do you say "computer" in a querying/interrogatory tone?

That's a perfect opportunity to get better at cosplaying a Starfleet officer.

(Seriously though, a Federation-grade system would just recognize from context whether or not you meant to invoke the voice interface. Federation is full of near-AGI in simple appliances. Like the sliding doors that just know whether you want to go through them, or are just passing by.)

fnordpiglet
0 replies
2h12m

While totally true it’s not a good reason to use it as a wake word in 2024 with my raspberry pi voice assistant ;-)

datadrivenangel
2 replies
15h13m

Captain might be funnier.

Or just use the mouse.

rob74
0 replies
10h41m

It's just not very realistic - if you think you can give orders to your captain, you'll be out of Starfleet in no time!

eddieroger
0 replies
3h55m

How quaint.

thesnide
1 replies
21h0m

"Number One" would be my code word...

pimeys
0 replies
19h32m

And finally saying "make it so" to make the command happen.

nkaz123
1 replies
22h4m

The wake word detection is an interesting problem here. As you can see in the repo, I match against a lot of mis-heard versions of the wake word, which in this case is "Raspberry". Since the system heats up fast you need a fan, and with the microphone plugged directly into a USB port next to the fan, I needed something distinct; "computer" wasn't cutting it.

Changing the transcription model to something a bit better, or moving the mic away from the fan, could help.
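
Roughly, the matching boils down to something like this (the variant list and names here are illustrative, not the actual repo code):

    # Sketch: wake-word matching against known mis-transcriptions.
    WAKE_VARIANTS = {"raspberry", "rasp berry", "razz berry", "respberry"}

    def heard_wake_word(transcript: str) -> bool:
        # True if any known mis-heard variant appears in the transcript.
        text = transcript.lower()
        return any(variant in text for variant in WAKE_VARIANTS)

    heard_wake_word("hey razz berry what's the weather")  # True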

jhbruhn
0 replies
5h4m

Have a look at the openWakeWord model, which is built especially for detecting wake words in a stream of speech.
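
For reference, basic usage looks roughly like this (a sketch from memory; newer versions may require downloading the pretrained models first, so check the project README for the current API):

    import numpy as np
    from openwakeword.model import Model

    oww = Model()  # loads the bundled pretrained wake-word models

    # Feed 16 kHz, 16-bit mono audio in 1280-sample (80 ms) frames.
    frame = np.zeros(1280, dtype=np.int16)  # stand-in for a real mic frame
    scores = oww.predict(frame)             # dict: model name -> score in [0, 1]

    for name, score in scores.items():
        if score > 0.5:
            print("wake word detected:", name)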

bdcravens
1 replies
22h16m

Or LLM Offline Camera, User Trained Understanding Speech

LOCUTUS

TeMPOraL
0 replies
9h13m

s/Offline/Online/ and make sure it has all the cloud features enabled, so you and your friends and loved ones can become one with the FAANG collective.

layer8
0 replies
22h8m

It should really be something like Beneficial Audio Realtime Recognition Electronic Transformer.

pawelduda
9 replies
21h3m

All I need is a voice assistant that:

- an RPi 4 can handle,
- I can integrate with HomeAssistant,
- is offline only and doesn't send my data anywhere.

This project seems to be ticking most, if not all of the boxes, compared to anything else I've seen. Good job!

While we're at it, can someone drop a recommendation for an RPi-compatible mic for an Alexa-like use case?

baobun
5 replies
19h52m

Check out Rhasspy.

You won't get anything practically useful running LLMs on the 4B but you also don't strictly need LLM-based models.

In the Rhasspy community, a common pattern is to do (cheap and lightweight) wake-word detection locally on mic-attached satellites (here 4B should be sufficient) and then stream the actual recording (more computational resources for better results) over the local network to a central hub.
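
The pattern is simple to sketch (the mic/detector interfaces and hub address below are made up for illustration):

    import socket

    HUB_ADDR = ("hub.local", 12345)  # hypothetical central hub

    def satellite_loop(mic, detector):
        # Runs on the satellite Pi: cheap local wake-word detection,
        # while heavy speech-to-text happens on the hub.
        while True:
            frame = mic.read()  # e.g. 80 ms of 16 kHz PCM
            if not detector.triggered(frame):
                continue
            # Wake word heard: stream the utterance to the hub.
            with socket.create_connection(HUB_ADDR) as conn:
                for chunk in mic.record_until_silence():
                    conn.sendall(chunk)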

phkahler
2 replies
4h49m

> and then stream the actual recording (more computational resources for better results) over the local network to a central hub.

This frustrates me. I ran Dragon Dictate on a 200MHz PC in the 1990s. Now that wasn't top quality, but it should have been good enough for voice assistants. I expect at least that quality on-device with an R-Pi today if not better.

IMHO the end game is on-device speech recognition and anything streaming audio somewhere else for processing is delaying this result.

genewitch
0 replies
4h14m

In the same vein, there was a product called "Copernic Summarizer" that was functionally equivalent to asking ChatGPT to summarize an article - 20 years ago.

everforward
0 replies
2h28m

> IMHO the end game is on-device speech recognition and anything streaming audio somewhere else for processing is delaying this result.

Why? There's practically no latency to a central hub on the local network. A Raspberry Pi is probably over-specced for this, but I do very much see value in buying 5 $20 speaker/mic streaming stations and a $200 hub instead of buying 5 $100 Raspberry Pis.

If anything, I would expect the streaming to a hub solution to respond faster than the locally processed variant. My wifi latency is ~2ms, and my PC will run Whisper way more than 2ms faster than a Raspberry Pi. Add in running a model and my PC runs circles around a Raspberry Pi.

> I ran Dragon Dictate on a 200MHz PC in the 1990s. Now that wasn't top quality, but it should have been good enough for voice assistants. I expect at least that quality on-device with an R-Pi today if not better.

You should get that via Whisper. I haven't used Dragon Dictate, but Whisper works very well. I've never trained it on my voice, and it rarely struggles outside of things that aren't "real words" (and even then it does a passable job of trying to spell it phonetically).
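
If you want to try it yourself, a quick test with the openai-whisper package is only a few lines (model size trades speed against accuracy; "base" is a fast starting point):

    import whisper  # pip install openai-whisper; needs ffmpeg on PATH

    model = whisper.load_model("base")       # larger models transcribe better
    result = model.transcribe("command.wav")
    print(result["text"])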

No idea about resources. It's such an unholy pain in the ass to get working in the same process as LLMs on Windows that I'm usually just ecstatic it will run. One of these days I'll figure out how to do all the passthroughs to WSL so I can yet again pretend I'm not using Windows while I use Windows (lol).

shaan7
0 replies
11h47m

I've been a happy Rhasspy user and am now even more excited for the future, because I'm hoping the conversational style (similar to what OpenAI demo'd yesterday) will eventually come to the offline world as well. It's okay if I have to buy a GPU to make that happen - or maybe we get lucky and I won't? OK, maybe I'm getting too optimistic now.

BizarroLand
0 replies
2h40m

I have a decent home server. Is there a system like this where the heavy lifting runs on my home server but the front end is piped to a RasPi with speaker/camera, etc.?

If not, any good pointers to start adapting this platform to do that?

yangikan
0 replies
19h33m

I have gotten good results with PlayStation 3 and PlayStation 4 cameras (which also have a mic). They are available for $15-20 on eBay.

jhbruhn
0 replies
5h5m

Have you checked out the Voice Assistant functionalities integrated into HA? https://www.home-assistant.io/voice_control/

Nabu Casa employed the main Rhasspy dev to work on this functionality, and it is progressing with every update.

stereosteve
5 replies
19h21m

Why does Picard always have to specify temperature preference for his Earl Grey tea? Shouldn't the super-smart AI have learned his preference by now?

pkaye
0 replies
16h8m

Well, the one time he just said "Tea, Earl Grey", the computer assumed "Tea, Earl Grey, lukewarm".

lttlrck
0 replies
17h45m

Perhaps he must be specific to override a hard, lawsuit-proof default that is too tepid for his tastes.

Will there still be lawsuits in the post-scarcity world? Probably.

dragonwriter
0 replies
19h16m

> Why does Picard always have to specify temperature preference for his Earl Grey tea?

Completely OT, but, most likely, he doesn't. Lots of people in the show interact with the replicators more fluidly; "Tea, Earl Grey, Hot" seems like a Picard quirk, possibly developed with more primitive food/beverage units than the replicators on the Enterprise-D.

TeMPOraL
0 replies
9h11m

Force of habit?

Most Starfleet folks seem to not know how to use their replicators well anyway. For all the smarts the replicators have, people use them like a mundane appliance they never bothered to read the manual for, miss 90% of the functionality, and then complain that replicated food tastes bad.

cmcconomy
5 replies
21h39m

Funny, I just picked up a device for use with https://heywillow.io for similar reasons

knodi123
4 replies
20h5m

me too, but I bricked mine when flashing the bios. just a fluke, nothing to be done about it.

nkaz123
3 replies
18h36m

I watched the demo; to be honest, if I had seen it sooner I probably would have tried to start this as a fork from there. Any idea what the issue was?

knodi123
1 replies
14h18m

No, but the core concept had changed so much between the firmware version when I bought it and what it is now that I'm not surprised the upgrade is a buggy process.

It's a shame, because I only wanted a tenth of what it could do- I just wanted it to send text to a REST server. That's all! And I had it working! But I saw there was a major firmware update, and to make a long story short, KABOOM.

kkielhofner
0 replies
6h47m

I'm the founder of Willow.

Early versions sent requests directly from devices and we found that to be problematic/inflexible for a variety of reasons.

Our new architecture uses the Willow Application Server (WAS) with a standard generic protocol to devices and handles the subtleties of the various command endpoints (Home Assistant, REST, OpenHAB, etc). This can be activated in WAS with the "WAS Command Endpoint" configuration option.

This approach also enables a feature we call Willow Auto Correct (WAC) that our users have found to be a game-changer in the open source voice assistant space.

I don't want to seem like I'm selling or shilling Willow, I just have a lot of experience with this application and many hard-learned lessons over the past two decades.

kkielhofner
0 replies
6h17m

I'm the founder of Willow.

Generally speaking, the approach of "run this on your Pi with distro X" has a surprising number of issues that make it, umm, "rough" for this application. As anyone who has tried to get reliable low-latency audio on Linux can attest, the hoops to get a standard distro (audio drivers, ALSA, pulseaudio/pipewire, etc.) to work well are many. This is why all commercial solutions using similar architectures/devices build the "firmware" from the kernel up with tools like Yocto.

More broadly, many people have tried this approach (run everything on the Pi) and frankly speaking it just doesn't have a chance to be even remotely competitive with commercial solutions in terms of quality and responsiveness. Decent quality speech recognition with Whisper (as one example) itself introduces significant latency.

We (and our users) have determined over the past year that in practice Whisper medium is pretty much the minimum for reliable transcription for typical commands under typical real-world conditions. If you watch a lot of these demo videos you'll notice the human is very close to the device and speaking very carefully and intently. That's just not the real world.

We haven't benchmarked Whisper on the Pi 5, but the HA community has, and they find it to be about 2x the performance of the Pi 4 for Whisper. If you look at our benchmarks[0], that means you'll be waiting at least 25 seconds just for transcription of typical voice commands with Whisper medium.

At that point you can find your phone, unlock, open an app, and just do it yourself faster with a fraction of the frustration - if you have to repeat yourself even once you're looking at one minute for an action to complete.

The hardware just isn't there. Maybe on the Raspberry Pi 10 ;).

We have a generic software-only Willow client for "bring your distro and device" and it does not work anywhere close to as well as our main target of ESP32-S3 BOX based devices (acoustic engineering, targeted audio drivers, real-time operating system, known hardware). That's why we haven't released it.

Many people also want multiple devices and this approach becomes even more cumbersome when you're trying to coordinate wake activation and functionality between all of them. You then also end up with a fleet of devices with full-blown Linux distros that becomes difficult to manage.

Using random/generic audio hardware itself also has significant issues. Many people in the Home Assistant community that have tried the "just bring a mic and speaker!" approach end up buying > $50 USB speakerphone devices with the acoustic engineering necessary for reliable far-field audio. They are often powered by XMOS chips that themselves cost a multiple of the ESP32-S3 (as one example). So at this point you're in the range of > $150 per location for a solution that is still nowhere near the experience of Echo, Google Home, or Willow :).

I don't want to seem critical of this project, but I think it's important to set very clear expectations before people go out and spend money thinking it's going to resemble anything close to a commercial voice assistant.

[0] - https://heywillow.io/components/willow-inference-server/#com...

robbyiq999
2 replies
10h32m

It would be cool to see some RasPi hats you could plug a GPU into, though I'm unsure how practical or feasible that would be. Today's graphics cards are tomorrow's e-waste; perhaps they could get a second life beefing up a DIY RasPi project like this.

piltdownman
0 replies
8h9m

Most of the USP, outside of the ecosystem developed around the single platform, is to do with form factor and power draw. Adding a GPU/adapter/PSU to leverage cheap CUDA cores probably works out worse in power, price, and form factor than going for a better SoC or an x86 NUC solution.

genewitch
0 replies
4h8m

For crypto-mining you'd convert a single PCIe slot into 4 x1 PCIe slots, or just use a board with 12+ x1 PCIe slots. I'm not sure what magic goes into PCIe, but I do know that at least one commodity board had an "exposed" PCIe interface: the Atomic Pi.

Anyhow, the GPU would sit on a small PCB that connects, via a USB3 cable, to an even smaller PCB in the PCIe slot on the motherboard. My point here is merely that whatever PCIe is, it can be transported to a GPU to do work over USB3 cables.

knodi123
1 replies
20h4m

Is it possible to run this on a generic linux box? Or if not, are you aware of a similar project that can?

I've googled it before, but the space is crowded and the caveats are subtle.

CaptainOfCoit
0 replies
5h57m

A Raspberry Pi is very much like a generic Linux box, the biggest difference being that it has an ARM rather than an Intel/AMD CPU, which makes some things slightly less widely supported.

But overall, Pi-C.A.R.D seems to be using Python and C++, so there shouldn't be any issue running it on whatever Python and C++ can be run/compiled on.

harwoodr
1 replies
22h52m

I see that a speaker is in the hardware list - does this speak back?

nkaz123
0 replies
22h13m

Yes! I'm currently using https://espeak.sourceforge.net/, so it isn't especially fun to listen to, though.

Additionally, since I'm streaming the LLM response, it doesn't take long to get a reply. Because it speaks a chunk at a time, occasionally only parts of words are said momentarily. How long you wait also depends, of course, on which model you use and on the context size.
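
Roughly, the chunking looks like this (a simplified sketch, not the exact code in the repo; the token source and clause-boundary rule are stand-ins):

    import subprocess

    def speak_stream(tokens):
        # Buffer streamed LLM tokens and speak complete clauses via espeak.
        buf = ""
        for tok in tokens:
            buf += tok
            if buf and buf[-1] in ".!?,;:":
                subprocess.run(["espeak", buf])  # blocks until spoken
                buf = ""
        if buf.strip():
            subprocess.run(["espeak", buf])      # flush any trailing text

    speak_stream(["Make", " it", " so", "."])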

ghnws
1 replies
9h29m

I didn't see a mention of languages in the readme. Does this understand languages other than English?

timendum
0 replies
8h54m

The suggested model for vision capabilities is English-only.

ethagnawl
1 replies
16h12m

I'm looking forward to trying this. Hopefully it gains traction, as (AFAIK) an open, reliable, flexible, privacy-focused voice assistant is still sorely needed.

About a year ago, my family was really keen on getting an Alexa. I don't want Bezos spy devices in our home, so I convinced them to let me try making our own. I went with Mycroft on a Pi 4 and it did not go well. The wake word detection was inconsistent, the integrations were lacking and I think it'd been effectively abandoned by that point. I'd intended to contribute to the project and some of the integrations I was struggling with but life intervened and I never got back to it. Also, thankfully, my family forgot about the Alexa.

genewitch
0 replies
4h5m

some "maker" products that were sold at target had a cardboard box with an arcade RGB-LED button on top, a speaker, and 4 microphones on a "hat" for a rpi... nano? pico? whatever the one that is roughly the size of a sodimm. It didn't have a wakeword, you'd push the white-lit button, it would change color twice, once to acknowledge the press, and again to indicate it was listening. It would change color once you finished speaking and then it would speak back to you.

It used some google thing on the backend, and it was really frustrating to get set up and keep working - but it did work.

i have two of those devices, so i've been waiting for something to come that would let me self-host something similar.

dasl
1 replies
21h31m

What latency do you get? I'd be interested in seeing a demo video.

nkaz123
0 replies
21h28m

It fully depends on the model and on how much conversational context you provide, but if you keep things to a bare minimum, roughly under 5 seconds from message received to starting the response, using Llama 3 8B. I'm also using a vision language model, https://moondream.ai/, but that takes around 45 seconds, so the next idea is to take a more basic image captioning model, insert its output into the context, and try to cut that time down even more.

I also tried using Vulkan, which is supposedly faster, but the times were a bit slower than plain CPU for llama.cpp.
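
The caption-into-context idea could look something like this (the BLIP model choice and prompt format here are just for illustration, not the project's actual setup):

    from transformers import pipeline

    # A small image-captioning model stands in for the slower vision LLM.
    captioner = pipeline("image-to-text",
                         model="Salesforce/blip-image-captioning-base")

    caption = captioner("snapshot.jpg")[0]["generated_text"]

    # Inject the caption into the LLM context instead of running the
    # full vision-language model on every request.
    prompt = "The camera currently sees: " + caption + "\nUser: what is in front of me?"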

aci_12
1 replies
21h33m

How does the wake word work? Does it keep listening and ignore the audio if the last few seconds don't contain the wake word/phrase?

knodi123
0 replies
20h5m

That's the general idea, yes. Or rather: store several chunks of audio and discard the oldest, aka a "rolling window".
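
A minimal version of that rolling window (chunk size, window length, and the transcribe callable are arbitrary stand-ins):

    from collections import deque

    CHUNKS_PER_WINDOW = 25  # e.g. 25 x 200 ms = 5 s of audio
    window = deque(maxlen=CHUNKS_PER_WINDOW)  # oldest chunk drops automatically

    def on_audio_chunk(chunk: bytes, transcribe, wake_word="picard"):
        window.append(chunk)
        text = transcribe(b"".join(window))  # STT over the whole window
        return wake_word in text.lower()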

zenkalia
0 replies
14h46m

How hard is it to make an offline model like this learn?

The readme mentions a memory that lasts only as long as each conversation, which seems like a hard limitation to live with.

nl
0 replies
11h22m

> It uses distributed models so latency is something I'm working on

I think it uses local models, right?

nickthegreek
0 replies
19h1m

Show HN:

kazinator
0 replies
17h6m

I would have hidden the name behind a bit of indirection by calling it Jean Luc or something.

ddingus
0 replies
18h35m

Thank you! This is a high value effort!

MH15
0 replies
19h26m

I tried to build this on an early-gen RPi 4 about three years ago, but ran into limitations in the hardware (and in my own knowledge). Super cool to see it happening now!

8mobile
0 replies
13h24m

I really like having a voice assistant that focuses on privacy first. I will definitely try it. One suggestion: always add a video to show how it works. Thank you!