
Show HN: Voice bots with 500ms response times

geofffox
17 replies
1d13h

I use Firefox... still.

panja
10 replies
1d13h

I hate that everyone just develops for Chromium only.

the8472
2 replies
1d8h

They're not necessarily standards. I clicked on the first negative one and it said draft.

One browser vendor proposing something and the other vendors NAKing it makes it a vendor-specific feature, like how IE had its own features.

RockRobotRock
1 replies
16h31m

Yeah but we’re trying to fight against browser engine superiority, aren’t we?

I hate using Chrome, but I'm forced to with any application that uses WebUSB, Web Serial, or Web Bluetooth.

the8472
0 replies
7h0m

But that's basically complaining that Firefox doesn't just blindly adopt whatever Google proposes. A lot of the concerns are about security and privacy, the thing Mozilla is praised for doing better than Google.

And no, you're not forced to use Google. You can make native applications when it's necessary to use privileged interfaces.

93po
1 replies
23h38m

You prefer the management of Chromium, which makes billions a year from invading your privacy and force feeding you advertising, while also ruining the internet ecosystem?

RockRobotRock
0 replies
21h4m

Yes, me lightly criticizing Mozilla means that I endorse Google. Fuck off

Bluestein
0 replies
1d8h

A shame. They used to be the free (freedom) option.-

darren_
2 replies
1d12h

This site works fine in Safari/mobile Safari; it is not ‘Chromium only’.

hawski
1 replies
1d10h

WebKit and its derivatives then.

mcny
0 replies
1d10h

I tried it with Firefox 127 (production) and it worked just fine for me, even though there is a huge banner at the top.

makeitmore
1 replies
1d11h

Hi, I built the client UI for this and... yea, I really wanted to get Firefox working :(

We needed a way to measure voice-to-voice latency from the end-user's perspective, and found Silero voice activity detection (https://github.com/snakers4/silero-vad) to be the most reliable at detecting when the user has stopped speaking, so we can start the timer (and stop it again when audio is received from the bot.)

Silero runs via onnx-runtime (with wasm). Whilst it sort-of-kinda works in Firefox, the VAD seems to misfire more than it should, causing the latency numbers to be somewhat absurd. I really want to get it working though! I'm still trying.

The code for the UI VAD is here: https://github.com/pipecat-ai/web-client-ui/tree/main/src/va...
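A minimal sketch of the timing approach described above, with hypothetical callback names (the actual implementation lives in the repo linked above): start a clock when the VAD reports end of speech, stop it when the first bot audio arrives.

```typescript
// Sketch only: assumes some VAD fires onUserSpeechEnd() and the transport
// layer fires onBotAudioReceived() when the bot's reply audio starts playing.
let speechEndedAt: number | null = null;

function onUserSpeechEnd(): void {
  // The VAD decided the user has finished talking: start the clock.
  speechEndedAt = performance.now();
}

function onBotAudioReceived(): void {
  // First bot audio after the user stopped speaking: stop the clock.
  if (speechEndedAt !== null) {
    const voiceToVoiceMs = performance.now() - speechEndedAt;
    console.log(`voice-to-voice latency: ${Math.round(voiceToVoiceMs)}ms`);
    speechEndedAt = null;
  }
}
```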

stavros
0 replies
1d9h

Do you know why there's a difference in the performance of the algorithm in another browser? I would expect that all browsers run the code exactly the same way.

chungus
1 replies
1d10h

It is working perfectly for me on Firefox (version 127).

makeitmore
0 replies
1d10h

Thanks for sharing. I did make some changes that seem to have improved things, although I do still see the occasional misfire. Perhaps good enough to remove that ugly red banner, though!

sa-code
0 replies
1d13h

Likely a lot of people on HN use Firefox

proxygeek
0 replies
1d11h

Do not go by the warning message. It does work just fine on the latest Firefox. Cool demo, btw!

firefoxd
11 replies
1d10h

Well that was fast. Kudos, really neat. Speed trumps everything else. I only noticed the robotic voice after I read the comments.

I worked on an AI for customer service. Our agent took the average response time from 24/48 hours down to mere seconds.

One of the messages that went to a customer was "Hello Bitch, your package will be picked up by USPS today, here is the tracking number..."

The customer responded "thank you so much" and gave us a perfect score in CSAT rating. Speed trumps everything, even when you make such a horrible mistake.

lukan
7 replies
1d9h

"The customer responded "thank you so much" and gave us a perfect score in CSAT rating. Speed trumps everything, even when you make such a horrible mistake."

I think not everyone would react the same way. For some, calling each other bitch is normal talk (which is likely why it got into the training data in the first place). For others, not so much.

jstanley
3 replies
1d8h

It's also possible that it's such an unlikely thing to hear that she actually misheard it and thought it said something nicer.

sillysaurusx
2 replies
1d6h

Am I the only one who would be delighted to be called Bitch (or any of the worst male-specific terms) by random professionals?

"Hey fucker, your prescription has been ready for pickup for three days. Be sure to get your lazy ass over here or else you’ll need to reorder it. Love you bye"

dietr1ch
0 replies
16h17m

This is something I've been wanting ever since maps/driving apps came along. I'd love to have Waze/Google Maps be angry when you miss an exit or miss the initial ETA by too much.

However, I don't think it fits the culture too well in the companies that could do it, as trying hard not to offend anybody is of utmost importance.

big_man_ting
0 replies
1d3h

I would love this so much.

999900000999
2 replies
1d8h

If I'm used to waiting 2 days and you get it down to 30 seconds, you can call me whatever you want.

I'm more pissed if I'm waiting days for a response.

lukan
1 replies
1d7h

Me too. But I learned that not everyone is like me. And in general, I also would not trust an LLM that much if it cannot distinguish between formal talk and ghetto slang. It will likely get other things wrong as well; humans will, too, so the error bar needs to be lower for me as a customer to be happier. I am not happy to get a fast but wrong response and then fight for days to get an actual human to solve the mess.

999900000999
0 replies
22h38m

I've grown up in various neighborhoods. In no context would calling someone a slur like that, when you don't even know them, be acceptable.

That said, it's obviously a technical glitch. Let's say it was something really important like medication: would you rather wait two or three days to find out when it gets here, or would you rather have a glitchy AI say some gibberish but then add that it's coming tomorrow?

firefoxd
1 replies
1d1h

Fun fact: we fixed this issue by adding a #profanity tag and dropping the message to the next human agent.

Now our most prolific sales engineer could no longer run demos for potential clients. He had many embarrassing calls where the AI would just not respond. His last name was Dick.

leobg
0 replies
13h2m

I find it odd that your engineer would make the system rely on instructions (“Do this. Never do that.”). This exposes your system to inconsistencies from the instruct tuning and future changes thereof by OpenAI or whoever. System prompts and instructions are maybe great for demos. But for a prod system where you have to cover all the bases I would never rely on such a thin layer of control.

(You can imagine the instruct layer to be like the skin on a peach. It’s tiny in influence compared to what’s inside. Even more so than, in humans, the cortex vs. the mammalian brain. Whoever tried to tell their kids not to touch the cookies while putting them in front of them and then leaving the room knows that relying on high level instructions is a bad idea.)

asjir
0 replies
1d3h

Maybe that was their first name, at least the one they put in lol

az226
4 replies
1d10h

Your marketing says 500 but your math says 759.

whizzter
0 replies
1d6h

The 500ms covers the transcription/LLM/TTS steps (i.e. the response time from data arriving on the server to sending the reply back); the remaining ~260ms seems to be various non-AI "overheads" such as encoding, network traffic, etc.

vr000m
0 replies
1d7h

The latencies in the table are based on heuristics or averages that we’ve observed. However, in reality, based on the conversation, some of the larger latency components can be much lower.

vessenes
0 replies
1d6h

My tests had one outlier at 1400ms, and ten or so between 400-500ms. I think the marketing numbers were fair.

dietr1ch
0 replies
1d9h

That's called marketing

aussieguy1234
4 replies
1d15h

Fast yes, but the voice sounds robotic.

lofties
0 replies
1d8h

Typical HN comment. Absolutely incredible tech is on display that, honestly, nobody could've imagined one year ago. Yet people still find something to moan about. I'm sure the authors of the project, who should be very proud, are fully aware the voice is robotic.

kwindla
0 replies
1d14h

Voice models are getting both faster and more natural at a, well, a fast clip.

cloudking
0 replies
1d2h

It's literally a robot

bombela
0 replies
1d10h

I prefer a slightly robotic voice. This way I know I am talking to a bot, and it sets expectations.

vessenes
3 replies
1d6h

This is so, so good. I like that it seems to be a teaser app for Cerebrium, if I understand it correctly. It has good killer-app potential. My tests from an iPad ranged from 1400ms to 400ms reported latency; at the low end, it felt very fluid.

One thing this speed makes me think is that for some chat workflows you'll need/get to have kind of a multi-step approach: essentially, a quick response, during which time a longer data / info / RAG query can be farmed out, then the informative result picks up.

Humans work like this; we use lots of filler words as we sort of get going responding to things.

Right now, most workflows seem to be just one-shot prompting, or in the background, parse -> query -> generate. The better workflow once you have low-latency responses is probably something like: [3s of Llama 8B in your ears] -> query -> [55s of Llama 70B/GPT-4/whatever you want, informed by the query].
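To make that concrete, here is a rough sketch of the two-stage flow, assuming hypothetical fastModel/retrieve/largeModel/speak helpers (not any real API): the slow path is kicked off immediately, while a small model fills the gap.

```typescript
// Hypothetical stand-ins for illustration only; swap in real LLM/RAG/TTS calls.
const fastModel = async (p: string) => "Sure, give me a second to look that up.";
const retrieve = async (q: string) => ["(retrieved context)"];
const largeModel = async (p: string, docs: string[]) =>
  `(detailed answer to "${p}" using ${docs.length} documents)`;
const speak = async (text: string) => console.log("TTS:", text);

async function respond(userUtterance: string): Promise<void> {
  // Kick off retrieval + the large model in the background right away.
  const fullAnswer = retrieve(userUtterance).then((docs) =>
    largeModel(userUtterance, docs)
  );

  // Meanwhile a small model produces a quick acknowledgement so the user
  // hears something within a few hundred milliseconds.
  await speak(await fastModel(userUtterance));

  // By the time the filler has been spoken, the informed answer should be
  // ready (or nearly ready) to pick up the conversation.
  await speak(await fullAnswer);
}
```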

Very cool, thank you for sharing this.

c0brac0bra
1 replies
1d3h

I've wondered about this as well. Is there a way to have a small, efficient LLM model that can estimate general task complexity without actually running the full task workload?

Scoring complexity on a gradient would let you know whether you need to send a "Sure, one second, let me look that up for you" instead of just waiting out a long round trip.

vessenes
0 replies
1d2h

For sure: in fact MoE models train such a router directly, and the routers are not super large. But it would also be easy to run phi-3 against a request.

I almost think you could do a check-my-work style response: ‘I'm pretty sure xx... wait, actually y.’ Or, if you were right, ‘Yep, that's correct. I just checked.’

There’s time in there to do the check and to get the large model to bridge the first sentence with the final response.
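One way the small-router idea could look, as a sketch only (callSmallModel is a placeholder for whatever small, cheap model you host, e.g. something phi-3-sized): ask it for a 1-5 complexity score and only emit the filler when the score says the request needs real work.

```typescript
// Placeholder for a call to a small hosted model; stubbed for illustration.
const callSmallModel = async (prompt: string): Promise<string> => "4";

async function needsFiller(userUtterance: string): Promise<boolean> {
  const score = await callSmallModel(
    "Rate how much lookup or computation this request needs on a scale of 1-5. " +
      "Answer with a single digit only.\n\nRequest: " + userUtterance
  );
  // Above 3 we assume retrieval or a large model is needed, so we buy time
  // with a "one second, let me check" style response first.
  return parseInt(score, 10) > 3;
}
```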

za_mike157
0 replies
1d5h

Hi Vessenes

From Cerebrium here. Really appreciate the feedback - glad you had a good experience!

This application is easy to extend, meaning you can edit it however you like:

- Swap in different LLMs, STT, and TTS models
- Change prompts, as well as implement RAG, etc.

In partnership with Daily, we really wanted to focus on the engineer here, so we made it extremely flexible for them to edit the application to suit their use case/preference while at the same time taking away the mundane infrastructure setup.

You can read more about how to extend it here: https://docs.cerebrium.ai/v4/examples/realtime-voice-agents

spuz
3 replies
1d9h

It's not exactly clear: is this a voice-to-voice model or a voice-to-text-to-voice model? When it is finally released, OpenAI claim their GPT-4o audio model will be a lot faster at conversations because there's no delay converting from audio to text and back to audio again. I'm also looking forward to using voice models for language learning.

pavlov
1 replies
1d9h

It's a voice-to-text-to-voice approach, as implied by this description:

"host transcription, LLM inference, and voice generation all together in one place"

I think there are some benefits to going through text rather than using a voice-to-voice model. It creates a 100% reliable paper trail of what the model heard and said in the conversation. This can be extremely important in some applications where you need to review and validate what was said.

isaacfung
0 replies
1d7h

There is way more text training data than voice data. It also allows you to use all the benchmarks and tool integrations that have already been developed for LLMs.

luke-stanley
3 replies
1d9h

A cross-platform browser VAD module is https://github.com/ricky0123/vad, an ONNX port of Silero's VAD network. By cross-platform, I mean it works in Firefox too. It doesn't need a WebRTC session to work, just microphone access, so it's simpler. I'm curious about the browser providing this as a native option too.
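For reference, usage looks roughly like the following; the option names follow the library's README as I remember it, so treat the exact API as an assumption and check the repo before relying on it.

```typescript
import { MicVAD } from "@ricky0123/vad-web";

// Minimal usage sketch; verify option names against the repo's README.
const vad = await MicVAD.new({
  onSpeechStart: () => console.log("user started speaking"),
  onSpeechEnd: (audio: Float32Array) => {
    // `audio` holds the captured utterance; hand it to STT, or start a
    // latency timer here as discussed upthread.
    console.log(`utterance captured: ${audio.length} samples`);
  },
});
vad.start();
```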

There are browser text-to-speech engines too, starting to get faster and higher quality. It would be great if browsers shipped with great TTS.

GPT-4o has Automatic Speech Recognition, `understanding`, and speech response generation in a single model for low latency, which seems quite a good idea to me. As they've not shipped it yet, I assume they have scaling or quality issues of some kind.

I assume people are working on similar open integrated multimodal large language models that have audio input and output (visual input too)!

I do wonder how needed or optimal a single combined model is for latency and cost optimisation.

The breakdown provided is interesting.

I think having a lot more of the model on-device is a good idea if possible, like speech generation, and possibly speech transcription or speech understanding, at least right at the start. Who wants to wait for STUN?

regularfry
0 replies
1d1h

If you do STT and TTS on the device but everything else remains the same, according to these numbers that saves you 120ms. The remaining 639ms is hardware and network latency, and shuffling data into and out of the LLM. That's still slower than you want.

Logically where you need to be is thinking in phonemes: you want the output of the LLM to have caught up with the last phoneme quickly enough that it can respond "instantly" when the endpoint is detected, and that means the whole chain needs to have 200ms latency end-to-end, or thereabouts. I suspect the only way to get anywhere close to that is with a different architecture, which would work somewhat more like human speech processing, in that it's front-running the audio stream by basing its output on phonemes predicted before they arrive, and only using the actual received audio as a lightweight confirmation signal to decide whether to flush the current output buffer or to reprocess. You can get part-way there with speculative decoding, but I don't think you can do it with a mixed audio/text pipeline. Much better never to have to convert from audio to text and back again.

phkahler
0 replies
1d2h

> I'm curious about the browser providing this as a native option too.

IMHO the desktop environment should provide voice to text as a service with a standard interface to applications - like stdin or similar but distinct for voice. Apps would ignore it by default since they aren't listening, but the transcriber could be swapped out and would be available to all apps.

charlesyu108
0 replies
1d1h

Lol, this announcement blows what I've been working on out of the water, but I have a simple assistant implementation with ricky0123/vad + WebSockets.

https://github.com/charlesyu108/voiceai-js-starter

c0brac0bra
2 replies
1d7h

I've been developing with Deepgram for a while, and this is one of the coolest demos I've seen with it!

I am curious about total cost to run this thing, though. I assume that on top of whatever you're paying Cerebrium for GPU hosting you're also having to pay for Deepgram Enterprise in order to self-host it.

To get the latency reduction of several hundred milliseconds, how much more would it be for "average" usage?

za_mike157
1 replies
1d5h

Hey! From the Cerebrium team here!

So our costs are based on the infra you use to run your application and we charge per millisecond of compute.

Some things to note that we might do differently from other providers:

1. You can specify your EXACT requirements and we charge you only for that. E.g., if you want 2 vCPUs, 12GB of memory, and 1 A10 GPU, we charge you for exactly that, which is 35% less than renting a whole A10.
2. We have over 10 varieties of GPU chips, so you can choose the price/performance trade-off.
3. While you can extend this on the Cerebrium platform, it cannot be used commercially. We are speaking to Deepgram to see how we can offer it to customers. Hopefully I can provide more updates on this soon.

c0brac0bra
0 replies
1d3h

Excellent; thanks for the info.

SubiculumCode
2 replies
1d4h

A chatbot that interrupts me even faster. Sorry for the sarcasm. Maybe I'm just slow, but when I'm trying to formulate a question on the spot, I pause a lot, and having the chatbot jump in and interrupt is frustrating. Humans recognize the difference between someone who is still planning to say something and someone who has finished. I even tried to give it a rule that it shouldn't respond until I said "The End", and of course it couldn't follow that instruction.

SubiculumCode
1 replies
1d4h

PS: The speed is impressive, but the key to a useful voice chatbot (which I've never seen) is one that adapts to your speaking style and identifies and employs turn-taking signals.

I acknowledge there are multiple viable patterns of social interaction: some people talk over each other and find that fun and engaging, while others think that's just the worst and wait for a clear signal for their turn to speak, expecting the same in return. I am of the latter type.

SubiculumCode
0 replies
1d1h

I'm sure that, with an annotated dataset, a model could learn to pick up on the right cues.

yjftsjthsd-h
1 replies
1d14h

Dumb question: I see 2 Opus encodes and decodes for a total of around 120ms; is Opus the fastest option?

kwindla
0 replies
1d13h

Yes, Opus is the fastest and best option for real-time audio. It was designed to be flexible and to encode/decode at fairly low latencies. It sounds good for narrow-band (speech) at low bitrates but also works well at higher bitrates for music. And forward error correction is part of the codec standard.

It's possible to tweak the Opus settings to reduce that encode/decode latency substantially, which might actually be worth doing for this use case. But there isn't quite a free lunch here. The default Opus frame size is 20ms. Smaller frames lower the encoding/decoding latency but increase the bitrate. The implementation in libwebrtc is very well tested and optimized for the default 20ms frame size and maybe not so much at other frame sizes. Experience has made me leery of taking the less-trodden paths without a lot of manual testing.

yalok
1 replies
1d6h

You may be double-counting the Opus encoding/decoding delay. Usually you can run it with a 20ms frame, and both the encoder and decoder take less than 1ms of real time, so it should be ~21ms instead of 30+30ms for one direction.
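A quick back-of-the-envelope check using only the numbers quoted in this exchange (not new measurements):

```typescript
// One Opus frame of buffering plus sub-millisecond encode and decode,
// per the comment above, versus the 30ms + 30ms counted in the table.
const frameMs = 20;    // default Opus frame size
const encodeMs = 0.5;  // "less than 1ms" each for encoder and decoder
const decodeMs = 0.5;
const oneWayMs = frameMs + encodeMs + decodeMs; // ~21ms
const countedMs = 30 + 30;                      // 60ms as originally tallied
console.log(`~${oneWayMs}ms actual vs ${countedMs}ms counted per direction`);
```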

kwindla
0 replies
1d4h

You are right! Thank you. I went back and looked at actual benchmark numbers from a couple of years ago and the numbers I got were ~26ms one-way. I rounded up to 30 to be conservative, but then double-counted in the table above. Will fix in the technical write-up. I don't think I can edit the Show HN.

hackerbob
1 replies
1d11h

This is indeed fast! There also seems to be no issue interrupting it while it's speaking. Is this using WebRTC echo cancellation to avoid microphone and speaker audio mix-ups?

makeitmore
0 replies
1d10h

Yes, echo cancellation via the browser (and maybe also at the OS level, if you're on a Mac with Sonoma). The accuracy of speech detection vs. noise is largely thanks to Silero, which runs on the client via WASM. I'm surprised at how well it works, even in noisy environments (and a reminder that I should experiment more with AudioWorklet stuff in the future!)
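For anyone curious what "echo cancellation via the browser" looks like in code, this is the standard getUserMedia constraint; whether the demo sets exactly these flags is my assumption, not something confirmed above.

```typescript
// Ask the browser's audio pipeline for echo cancellation so the bot's own
// speech (played through the speakers) doesn't get picked up as user input.
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true,
  },
});
// Inspect what the browser actually applied before feeding the track onward.
console.log(stream.getAudioTracks()[0].getSettings());
```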

andrewmcwatters
1 replies
23h56m

I love it when engineers worth their salt actually do the back-of-the-envelope calculations for latency, etc.

Tangentially related, I remember years ago when Stadia and other cloud gaming products were being released doing such calculations and showing a buddy of mine that even in the best case scenario, you'd always have high enough input latency to make even casual multiplayer FPS games over cloud gaming services not feasible, or rather, comfortable, to play. Other slower-paced games might work, but nothing requiring serious twitch gameplay reaction times.

The same math holds up today because of a combination of fundamental limits and state of the art limits.

Flumio
0 replies
21h22m

The calculations I was reading at the time suggested it would work for casual play, due to the gaming PC being very close to the game servers and running inside the best network available (Google's).

Google also said that the controller would send the input straight to the server.

And a fast Stadia server should have good FPS, combined with a little bit of brain prediction.

amluto
1 replies
7h2m

Maybe silly question:

jitter buffer [40ms]

Why do you need a jitter buffer on the listening side? The speech-to-text model has neither ears nor a sense of rhythm — couldn’t you feed in the audio frames as you receive them? I don’t see why you need to delay processing a frame by 40ms just because the next one might be 40ms late.

Olreich
0 replies
5h12m

Almost any gap in audio is detectable and sounds really bad. 40ms is a lot, but sending 40ms of silence is probably worse.

_def
1 replies
1d8h

This was fun to try out. Earlier this week I tried june-va and the long response time kind of killed the usefulness. It's a great feature to get fast responses, this feels much more like a conversation. Funny enough, I asked it to tell me a story and then it only answered with one sentence at a time, requiring me to say "yes", "aha", "please continue" to get the next line. Then we had the following funny conversation:

"Oh I think I figured out your secret!"

"Please tell me"

"You achieve the short response times by keeping a short context"

"You're absolutely right"

danielbln
0 replies
1d6h

That works for me, to be honest. Not the short context, but definitely the short replies. Contrast that with the current implementation of ChatGPT's voice mode, where you ask something and then get a minute's worth of GPT blah blah.

trueforma
0 replies
21h8m

I too am excited about voice inferencing. I wrote my own WebSocket faster-whisper implementation before OpenAI's GPT-4o release. They steamrolled my interview coach concept (https://intervu.trueforma.ai) and my sales pitch coach implementation (https://sales.trueforma.ai). I defaulted to a push-to-talk implementation as I couldn't get VAD to work reliably. I run it all on a LattePanda :)

I was looking to implement Groq's hosted Whisper. I love the idea of having an uncensored Llama 3 on Groq as the LLM, as I'm tired of the boring corporate conversations. I hope to reduce my latency and learn from your examples; kudos on your efforts. I wish I could try the demo, but it seems to be oversubscribed as I can't get in to talk to the bot. I'm sure my LattePanda would melt if just 3 people tried to run inference at the same time :)

sumedh
0 replies
1d7h

This is very impressive; my kid and I had fun talking about space.

spark_chicken
0 replies
1d4h

I have tried it, and it is really fast! I know making a real-time voice bot with this low latency is not easy. Which LLM did you use, and how large an LLM does it take to keep the conversation efficient?

realyashnag
0 replies
1d3h

This was scary fast. Neat interface and (almost) indistinguishable from a human over the phone / internet. Kudos @cerebrium.ai.

preciousoo
0 replies
1d12h

This is so cool!

p_frank
0 replies
1d4h

Amazing to see the metrics for each part that is involved! I've wondered why you couldn't introduce a small sound that plays over the waiting time, like an "hmm", to skip a few hundred ms of the response time. It could be pregenerated (say, 500 different versions) and play 200ms after the user's last input.
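That idea is simple to prototype client-side. A sketch (clip paths and callback names are placeholders): start a timer when the user stops speaking and cancel it if real bot audio arrives within 200ms.

```typescript
// Placeholder clip paths; in practice you'd pregenerate many variations.
const fillers = ["/audio/hmm1.mp3", "/audio/hmm2.mp3", "/audio/hmm3.mp3"];
let fillerTimer: number | undefined;

function onUserStoppedSpeaking(): void {
  // If the real response hasn't arrived within 200ms, bridge the gap.
  fillerTimer = window.setTimeout(() => {
    const clip = fillers[Math.floor(Math.random() * fillers.length)];
    void new Audio(clip).play();
  }, 200);
}

function onBotAudioArrived(): void {
  // The real response made it in time: cancel the pending filler.
  window.clearTimeout(fillerTimer);
}
```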

mmcclure
0 replies
1d10h

Wow, Kwin, you’ve outdone yourself! The speed makes an even bigger difference than I expected going in.

Feels pretty wild/cool to say it might almost be too fast (in terms of feeling natural).

mdbackman
0 replies
1d7h

Very, very impressive! It's incredibly fast, maybe too fast, but I think that's the point. What's most impressive though is how the VAD and interruptions are tuned. That was, by far, the most natural sounding conversation I've had with an agent. Really excited to try this out once it's available.

jaybrendansmith
0 replies
1d14h

This thing is incredible. It finished a sentence I was saying.

isoprophlex
0 replies
1d9h

Jesus fuck that's fast, and I had no idea speed mattered that much. Incredible. Feels like an entirely different experience than the 5+ seconds latency with openai.

ftth_finland
0 replies
1d7h

This is excellent!

Perfect comprehension and no problem even with bad accents.

dijit
0 replies
1d12h

I’m genuinely shocked by how conversational this is.

I think you hit a very important nail on the head here; I feel like that scene in I, Robot where the protagonist talks to the hologram, or in the movie “AI” where the protagonist talks to an encyclopaedia called “Dr Know”.

asjir
0 replies
1d3h

Personally, I use https://github.com/foges/whisper-dictation with llama-70b on Groq. I start talking, navigate to the website, and by the time it has loaded and I've picked llama-70b, I've finished talking, so zero overhead. I read much faster than I listen, so it works perfectly for me.

anonzzzies
0 replies
1d13h

This is pretty amazing; it's very fast indeed. I don't really care about the responding voice sounding robotic; low latency is more important for whatever I do. And you can interrupt it too. Lovely.

andruby
0 replies
1d5h

This is really good. I'm blown away by how important the speed is.

And this was from a mobile connection in Europe, with a shown latency of just over 1s.

andrewstuart
0 replies
1d11h

Damned impressive.

Apple's Siri still can't allow me to have a conversation in which we aren't tripping over each other and pausing and flunking and the whole thing degrades into me hoping to get the barest minimum from it.

_DeadFred_
0 replies
19h53m

This is super cool. Thanks for sharing. And I'm excited it encouraged others to share. I'm excited to spend some time this weekend looking at the different ways people in this thread implemented solutions.

Borborygymus
0 replies
7h48m

It /was/ nice and quick. Thanks for putting the demo online. It was quick to tell me complete nonsense. Apparently 7122 is the atomic number of Barium.